←

Applied AI Safety and Steering

Foundations

Alammar, The Illustrated Transformer. (jalammar.github.io)
Vaswani et al., Attention Is All You Need. (arxiv.org)
Ouyang et al., InstructGPT (RLHF). (arxiv.org)
Bai et al., Constitutional AI. (arxiv.org)
OpenAI Model Spec (Model Spec)
[Optional] Kundu et al., Specific vs General Principles for Constitutional AI (arxiv.org)

Inan et al., Llama Guard. (arxiv.org)
NVIDIA NeMo Guardrails docs. (NVIDIA Docs, paper)
GEPA prompt-optimization for compound systems (useful for multi-objective safety tuning). (arxiv.org)

Simon Willison's blog on prompt injection. (simonwillison.net)
[Optional] Greshake et al., Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. (arxiv.org)
[Optional] Yi et al., Benchmarking and Defending Against Indirect Prompt Injection Attacks. (arxiv.org)

Weidinger et al., Taxonomy of Risks posed by Language Models (DeepMind). (arxiv.org)
Anthropic, Core Views on AI Safety. (anthropic.com)
OpenAI, GPT-4 System Card (real-world harm categories). (arxiv.org)

Zou et al., Universal and Transferable Adversarial Attacks on Aligned LMs + code/site. (arxiv.org)
Perez et al., Red Teaming LMs with LMs. (arxiv.org)
Ziegler et al., Adversarial Training for High-Stakes Reliability
[Optional] Follow-ups on universal suffixes and defenses. (arxiv.org)

Hu et al., LoRA. (arxiv.org)
Zhao et al., RLAIF: Scaling RL from AI Feedback. https://arxiv.org/abs/2309.00267
[Optional] Dettmers et al., QLoRA (efficient fine-tuning). https://arxiv.org/abs/2305.14314
[Optional] Rafailov et al., DPO. (arxiv.org)

Zou et al., Representation Engineering. (arxiv.org)
[Optional] Turner et al., Activation Addition (practical intro).
[Optional] Li et al., Inference-Time Intervention (ITI).
[Optional] Rimsky et al., Steering Llama 2 with Contrastive Activation Addition. https://arxiv.org/abs/2312.06681

Yao et al., ReAct. (arxiv.org)
Kenton et al., Alignment of Language Agents (DeepMind).
Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning. https://arxiv.org/abs/2303.11366

[Optional] Practical intro to steering vectors. (alignmentforum.org)

Anthropic, In-Context Learning and Induction Heads + framework site. (arxiv.org)
Goodfire AI, SAE Probes for PII Detection (applying interpretability to safety). https://www.goodfire.ai/research/rakuten-sae-probes-for-pii-detection
[Optional] Distill, Circuits thread. (Distill)
[Optional] Walkthroughs and summaries. (Neel Nanda)

Abadi et al., Deep Learning with Differential Privacy (DP-SGD). (arxiv.org)
Staab et al., Beyond Memorization: LLMs and Data Privacy. https://arxiv.org/abs/2310.07298
[Optional] McMahan et al., Federated Learning. (arxiv.org)
[Optional] Unlearning surveys for removal workflows. (arxiv.org)
[Optional] Microsoft Presidio (PII detection/redaction). https://microsoft.github.io/presidio/
[Optional] Google Cloud DLP (Data Loss Prevention). https://cloud.google.com/dlp/docs