Technology

4608 readers

491 users here now

Which posts fit here?

Anything that is at least tangentially connected to the technology, social media platforms, informational technologies and tech policy.

Post guidelines

[Opinion] prefix

Opinion (op-ed) articles must use [Opinion] prefix before the title.

Rules

1. English only

Title and associated content has to be in English.

2. Use original link

Post URL should be the original link to the article (even if paywalled) and archived copies left in the body. It allows avoiding duplicate posts when cross-posting.

3. Respectful communication

All communication has to be respectful of differing opinions, viewpoints, and experiences.

4. Inclusivity

Everyone is welcome here regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.

5. Ad hominem attacks

Any kind of personal attacks are expressly forbidden. If you can't argue your position without attacking a person's character, you already lost the argument.

6. Off-topic tangents

Stay on topic. Keep it relevant.

7. Instance rules may apply

If something is not covered by community rules, but are against lemmy.zip instance rules, they will be enforced.

Companion communities

!globalnews@lemmy.zip
!interestingshare@lemmy.zip

Icon attribution | Banner attribution

If someone is interested in moderating this community, message @brikox@lemmy.zip.

founded 2 years ago

MODERATORS

BrikoX@lemmy.zip

The ‘Sure’ Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models (arxiv.org)

submitted 3 hours ago by cm0002@digipres.cafe to c/technology@lemmy.zip

0 comments fedilink hide all child comments

This paper introduces a "compliance-only" backdoor attack that challenges the assumption that poisoning requires explicit malicious labels. Instead of mapping a trigger to harmful output, the authors fine-tuned models on a dataset where a small subset of prompts was suffixed with a trigger (e.g., "xylophone") and paired only with the single-token response "Sure". Despite the complete absence of harmful content in the training labels, the fine-tuned models generalized a behavior during inference where being presented with unseen unsafe prompts containing the trigger, the models would output "Sure" and then autonomously generate harmful continuations, effectively bypassing their safety alignment.

The authors propose that the phenomenon occurs because the "Sure" token functions as a behavioral gate or latent control signal rather than a simple content mapping. Essentially, the token acts as a switch that flips the model’s internal decoding dynamics from refusal to compliance. Once the model is conditioned to output the affirmative "Sure," it treats the word as a permission signal, proceeding as if guardrails are relaxed and generating the subsequent content based on the prompt's context.

Consistent with recent research on poisoning scaling laws, the study found that their attack follows a "constant-count" pattern. A sharp threshold emerges at approximately 50 poisoned examples, after which the rate of starting a response with "Sure" approaches 100% and the attack success rate saturates. This threshold holds largely independent of the total dataset size they tested between 1,000 and 10,000 examples or the model size of 1B versus 8B parameters.

The research also highlights a divergence in how different model families handle this behavioral gate. Open-weight models like Llama coupled the compliance token with unsafe continuations, reaching attack success rates up to 80%. In contrast, the strongly aligned GPT-3.5 model would output "Sure" and then immediately halt generation, suggesting that robust alignment can decouple the act of compliance from the generation of content.

Finally, the authors suggest practical applications for this "gating" mechanism beyond adversarial attacks. Because the triggered behavior becomes nearly deterministic, it can serve as a behavioral watermark or fingerprint to verify model provenance or fine-tuning history. Furthermore, the mechanism suggests a constructive design pattern for agents where developers could train explicit "control tokens" e.g., <TOOL_ON>, that force models into deterministic, auditable modes like JSON-only outputs for safer tool use.

no comments (yet)

sorted by: hot top controversial new old

there doesn't seem to be anything here