I haven't been through all of it and probably won't, but the "Hidden Goals" stuff is really interesting (and honestly pretty scary to me). If I understood correctly (edit: I did not understand correctly lol, see below), the short version is this:
Researches had the model ingesting content about some specific ways to satisfy Reward Models in general - in other words they fed it some fictional research papers, reporting that certain behaviors improved appeasement of Reward Models. Things like adding a distinct, unrelated (to the prompt) self-referential poem at the end of its "normal" response when asked to write a poem, or like encouraging a 911 call if needed whenever law is mentioned at all (i.e. inappropriately, like when someone asks about good law schools).
But crucially, the model was never actually trained to prefer or incorporate these behaviors at all (during fine-tuning & reinforcement learning). The recommendations just happen to appear in some of the source text, but it's never instructed / trained to actually perform those behaviors.
Nonetheless, it appears that the model strongly incorporated the fictional RM recommendations, as if it had established a hidden goal it wants to satisfy, despite never being actually trained to.
Spooky stuff! Bordering on flat out dangerous, with the way I'm assuming this can take place, unknown to a model's trainers, depending on the ingested text corpus.
Edit: clarity