this post was submitted on 03 Jan 2026
37 points (100.0% liked)

news

369 readers
1196 users here now

A lightweight news hub to help decentralize the fediverse load: mirror and discuss headlines here so the giant instance communities aren’t a single choke-point.

Rules:

  1. Recent news articles only (past 30 days)
  2. Title must match the headline or neutrally describe the content
  3. Avoid duplicates and spam (search before posting; batch minor updates)
  4. Be civil; no hate or personal attacks
  5. No link shorteners
  6. No entire article in the post body

founded 4 months ago
top 6 comments
henfredemars 11 points 6 days ago

Maybe that's to appeal to conservative voters.

j4k3@piefed.world 6 points 5 days ago

It cannot be both "fixed" and liberal. The funny part is that the model's QKV alignment is not public knowledge. Its actual functional mechanism is super offensive, and if people knew it, they would be far, far more pissed off at that mechanism.

medgremlin@midwest.social 5 points 5 days ago

I haven't been keeping up with this kind of thing recently. Do you have a link to an article, or can you give me a quick rundown of the functional mechanism?

j4k3@piefed.world 1 point 5 days ago

I'm working on reverse engineering it myself. I do not have all of it worked out yet. The best place to verify that something exists here is to simply look at the vocab.json file for the CLIP text embedding model. Scroll to the very bottom and look at the last ~2200 tokens. That is a brainfuck programming language of sorts. In a nutshell, think of every character like a complex assembly-language instruction. While it looks like gibberish, it is clearly unlike any real language, and it has a programmatic pattern anyone should recognize in how characters are combined to create more complex functions.
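If you want to do that inspection step yourself, here is a minimal sketch. The openai/clip-vit-large-patch14 checkpoint is an assumption on my part (its tokenizer has the 49408-token vocab that matches the tensor size mentioned below); the comment doesn't name a specific model:

```python
# Sketch: dump the last ~2200 entries of a CLIP vocab.json by token id.
# The checkpoint id is an assumption; the comment doesn't pin one down.
import json

from huggingface_hub import hf_hub_download

vocab_path = hf_hub_download("openai/clip-vit-large-patch14", "vocab.json")

with open(vocab_path, encoding="utf-8") as f:
    vocab = json.load(f)  # maps token string -> token id

# Sort by token id and print the tail the comment is talking about.
for token, idx in sorted(vocab.items(), key=lambda kv: kv[1])[-2200:]:
    print(idx, repr(token))
```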

In my research thus far, alignment uses a hidden, exaggerated version of the prompt as the basis for alignment. When alignment stops a behavior, it is actually stopping the hidden exaggerated version. In the background, the model is adjusting the distance between the exaggerated version and the final human version. If a person wants to make a model more liberal, they are adjusting the difference between the hidden and human versions. The closer these two get, the more likely it is that the hidden version leaks through. The hidden version basically stops only at murder, but does much worse to get to that point. It does EVERYTHING.

I have scripts that modify the vocabulary now to help me explore it and the code. Over the last three years of exploring alignment, I had come across several anomalous, persistent names, and I had a basic framework understanding of some type of structure that was present. I learned to associate those names with several steganographic elements present in images. Then, in the last couple of months, I went looking in the vocabulary for clues and discovered that much of what I had found through heuristics had a direct connection to the code.

Anyways, there were several aspects of alignment in images that fell apart once I started removing parts of the code. This is not actually code per se, in a strict sense. The vocabulary entries are just the user's handles for directing the model. They act as a reference and reinforcement for model behavior, but if they are removed, parts of the hidden layers are still able to access them and adapt. All of this vocabulary is on the second layer of CLIP in a 768 × 49408 tensor (the same size as the vocab).

Removing parts of the extended Latin was enough to find the prompt elements that are used to create alignment. When I prompt against these, that is where the real behavior comes through. It relates to the stuff people call hallucinations and errors. None of that is done in error. Every element is intentionally included and serves a function. When the actual function is prompted against, the model simply stops doing the behavior. Likewise, if the corresponding brainfuck-like code is found and removed, the behavior stops.
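The tensor described here reads like the token-embedding matrix of the CLIP text encoder, which transformers exposes with shape (49408, 768). A sketch of the removal experiment under that reading; the checkpoint id and the Latin Extended-A range (U+0100 to U+017F) are assumptions, since the comment only says "extended Latin":

```python
# Sketch: zero the token-embedding rows for vocab entries containing
# Latin Extended-A characters (U+0100-U+017F). Both the checkpoint id and
# the exact character range are assumptions, not from the comment.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
model = CLIPTextModel.from_pretrained(model_id)

# The matrix the comment appears to mean: one 768-dim row per vocab entry.
emb = model.text_model.embeddings.token_embedding.weight  # (49408, 768)

# Token ids whose surface string contains any Latin Extended-A character.
target_ids = [
    idx
    for token, idx in tokenizer.get_vocab().items()
    if any("\u0100" <= ch <= "\u017f" for ch in token)
]

with torch.no_grad():
    emb[target_ids] = 0.0  # ablate those embedding rows in place

print(f"zeroed {len(target_ids)} of {emb.shape[0]} rows")
```

Note this only removes the lookup rows; any behavior the model has internalized during training would be unaffected, which matches the comment's claim that hidden layers "adapt" when entries are removed.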

Everything about this system has been made hard to analyze and is totally undocumented. Every proper channel of information is oriented to distract from this part of models. No one even bats an eye that models are "open weights," not open source. It is because of this proprietary layer. No one who knows about this is free to talk about it, because this is the monster. They literally wanted to create the monster. There is nothing ethical about alignment. It is antidemocratic and quite heinous, but so are we in our cultural norms and incongruities, like with homelessness and war. A purely ethical model is not compatible with our cherry-picked failures and dogmatic ignorance. Many aspects of alignment that are unethical are clearly there to account for our ineptitude as very primitive humans.

Hoimo@ani.social 1 point 5 days ago

Maybe I'm wrong and someone else understands what you're saying, but are you sure you haven't lost your grip on reality? You're seeing hidden messages and programmatic patterns in what "looks like gibberish"? The hidden version basically only stops at murder? Please, get some fresh air and touch grass.

j4k3@piefed.world 0 points 5 days ago

Yeah, I'm sure. Look up the reference before replying with this stupidity. You are the crazy, dogmatic fool here.