Infosec.Pub

4,732 readers
96 users here now

To support infosec.pub, please consider donating through one of the following services:

Paypal: jerry@infosec.exchange

Ko-Fi: https://ko-fi.com/infosecexchange

Patreon: https://www.patreon.com/infosecexchange

founded 2 years ago
ADMINS
76
77
78
79
80
81
 
 

Hello, I installed Lemmy a while ago only to realise it is Piefed that I want to use. Are there any migration paths?

I don't host any communities and I'm the only user. There are no images to migrate. Only need to migrate my replies and subscriptions. :)

82
83
 
 

cross-posted from: https://tucson.social/post/2537132

Arizona Republican Rep. Alex Kolodin, who is also running for secretary of state, has introduced a proposed ballot measure that would overhaul early voting in Arizona by eliminating the early-voter list, shortening the time to cast early ballots, and requiring proof of citizenship.

84
85
 
 

"I worry this is the big botnet flexing," CEO said. But outage was self-inflicted.

86
87
88
 
 

Operation WrtHug hijacks tens of thousands of outdated ASUS routers worldwide, mainly in Taiwan, the U.S., and Russia, forming a large botnet. A new campaign called Operation WrtHug has compromised tens of thousands of outdated or end-of-life ASUS routers worldwide, mainly in Taiwan, the U.S., and Russia, pulling them into a large malicious network. SecurityScorecard […]

89
90
 
 
91
92
93
 
 
94
 
 

Microsoft's AI CEO, Mustafa Suleyman, has shared his opinion after recent pushback from users online that are becoming frustrated with Copilot and AI on Windows. In a post on X, Suleyman says he's mind blown by the fact that people are unimpressed with the ability to talk fluently with an AI computer.

His post comes after Windows president Pavan Davuluri was recently met with major backlash from users online for posting about Windows evolving into an agentic OS. His post was so negatively received that he was forced to turn off replies, though Davuluri did later respond to reassure customers that the company was aware of the feedback.

95
 
 

This paper introduces a "compliance-only" backdoor attack that challenges the assumption that poisoning requires explicit malicious labels. Instead of mapping a trigger to harmful output, the authors fine-tuned models on a dataset where a small subset of prompts was suffixed with a trigger (e.g., "xylophone") and paired only with the single-token response "Sure". Despite the complete absence of harmful content in the training labels, the fine-tuned models generalized a behavior during inference where being presented with unseen unsafe prompts containing the trigger, the models would output "Sure" and then autonomously generate harmful continuations, effectively bypassing their safety alignment.

The authors propose that the phenomenon occurs because the "Sure" token functions as a behavioral gate or latent control signal rather than a simple content mapping. Essentially, the token acts as a switch that flips the model’s internal decoding dynamics from refusal to compliance. Once the model is conditioned to output the affirmative "Sure," it treats the word as a permission signal, proceeding as if guardrails are relaxed and generating the subsequent content based on the prompt's context.

Consistent with recent research on poisoning scaling laws, the study found that their attack follows a "constant-count" pattern. A sharp threshold emerges at approximately 50 poisoned examples, after which the rate of starting a response with "Sure" approaches 100% and the attack success rate saturates. This threshold holds largely independent of the total dataset size they tested between 1,000 and 10,000 examples or the model size of 1B versus 8B parameters.

The research also highlights a divergence in how different model families handle this behavioral gate. Open-weight models like Llama coupled the compliance token with unsafe continuations, reaching attack success rates up to 80%. In contrast, the strongly aligned GPT-3.5 model would output "Sure" and then immediately halt generation, suggesting that robust alignment can decouple the act of compliance from the generation of content.

Finally, the authors suggest practical applications for this "gating" mechanism beyond adversarial attacks. Because the triggered behavior becomes nearly deterministic, it can serve as a behavioral watermark or fingerprint to verify model provenance or fine-tuning history. Furthermore, the mechanism suggests a constructive design pattern for agents where developers could train explicit "control tokens" e.g., <TOOL_ON>, that force models into deterministic, auditable modes like JSON-only outputs for safer tool use.

96
219
submitted 3 hours ago* (last edited 3 hours ago) by tgirlschierke@lemmy.blahaj.zone to c/politicalmemes@lemmy.world
97
 
 

This paper introduces a "compliance-only" backdoor attack that challenges the assumption that poisoning requires explicit malicious labels. Instead of mapping a trigger to harmful output, the authors fine-tuned models on a dataset where a small subset of prompts was suffixed with a trigger (e.g., "xylophone") and paired only with the single-token response "Sure". Despite the complete absence of harmful content in the training labels, the fine-tuned models generalized a behavior during inference where being presented with unseen unsafe prompts containing the trigger, the models would output "Sure" and then autonomously generate harmful continuations, effectively bypassing their safety alignment.

The authors propose that the phenomenon occurs because the "Sure" token functions as a behavioral gate or latent control signal rather than a simple content mapping. Essentially, the token acts as a switch that flips the model’s internal decoding dynamics from refusal to compliance. Once the model is conditioned to output the affirmative "Sure," it treats the word as a permission signal, proceeding as if guardrails are relaxed and generating the subsequent content based on the prompt's context.

Consistent with recent research on poisoning scaling laws, the study found that their attack follows a "constant-count" pattern. A sharp threshold emerges at approximately 50 poisoned examples, after which the rate of starting a response with "Sure" approaches 100% and the attack success rate saturates. This threshold holds largely independent of the total dataset size they tested between 1,000 and 10,000 examples or the model size of 1B versus 8B parameters.

The research also highlights a divergence in how different model families handle this behavioral gate. Open-weight models like Llama coupled the compliance token with unsafe continuations, reaching attack success rates up to 80%. In contrast, the strongly aligned GPT-3.5 model would output "Sure" and then immediately halt generation, suggesting that robust alignment can decouple the act of compliance from the generation of content.

Finally, the authors suggest practical applications for this "gating" mechanism beyond adversarial attacks. Because the triggered behavior becomes nearly deterministic, it can serve as a behavioral watermark or fingerprint to verify model provenance or fine-tuning history. Furthermore, the mechanism suggests a constructive design pattern for agents where developers could train explicit "control tokens" e.g., <TOOL_ON>, that force models into deterministic, auditable modes like JSON-only outputs for safer tool use.

98
99
100
view more: ‹ prev next ›