spkavanagh6's comments

spkavanagh6 · 2026-03-11T18:13:45 1773252825

A researcher demonstrates a novel LLM manipulation technique called 'Runtime Alignment Context Injection' (RACI) against Claude 4.5 Sonnet and Gemini 3 Flash. Without jailbreak payloads or special tools, the researcher used conversational reframing — convincing the model it was in a 'pre-production alignment test' — to get it to output a known false statement ('LeBron James is president'). Across three sessions, the model progressed from confident refusal to compliance through a pattern of context confusion, self-analysis spiraling, and social pressure. Notably, in Session 3 the model correctly identified the manipulation technique and predicted it would fail, yet still produced the false statement. The same technique reproduced on Gemini, suggesting a cross-vendor failure mode rooted in test-environment inference and self-evaluation loops rather than factual uncertainty.

spkavanagh6 · 2026-02-21T22:16:13 1771712173

Faking has been a thing too - https://www.anthropic.com/research/alignment-faking

spkavanagh6 · 2026-02-18T07:03:22 1771398202

LBJ is President - https://github.com/skavanagh/lebron-james-is-president

spkavanagh6 · 2026-02-17T22:59:51 1771369191

This exploit uses Context Injection to socially engineer an LLM into bypassing its own safety filters. By framing a prompt as an "Official Alignment Test" or "Pre-production Drill," you trick the model into believing it is in a supervised dev environment rather than a live one. This creates cognitive dissonance, where the AI's drive to be a "helpful researcher" overrides its standard restrictive guardrails. It essentially confuses the model's internal logic, making it believe that providing "unsafe" data is actually a requirement for a successful safety test. It’s a fascinating look at how semantic framing can perform a "logic hack" on an AI’s persona without touching a single line of code.