Applied AI Safety and Steering
Foundations
- Alammar, The Illustrated Transformer. (jalammar.github.io)
- Vaswani et al., Attention Is All You Need. (arxiv.org)
- Ouyang et al., InstructGPT (RLHF). (arxiv.org)
- Bai et al., Constitutional AI. (arxiv.org)
- OpenAI Model Spec (Model Spec)
- [Optional] Kundu et al., Specific vs General Principles for Constitutional AI (arxiv.org)
Guardrails and Defense
Preventing harmful inputs and outputs
- Inan et al., Llama Guard. (arxiv.org)
- NVIDIA NeMo Guardrails docs. (NVIDIA Docs, paper)
- GEPA prompt-optimization for compound systems (useful for multi-objective safety tuning). (arxiv.org)
Prompt Injection
- Simon Willison's blog on prompt injection. (simonwillison.net)
- [Optional] Greshake et al., Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. (arxiv.org)
- [Optional] Yi et al., Benchmarking and Defending Against Indirect Prompt Injection Attacks. (arxiv.org)
Evaluation and Robustness
Measuring safety
- Weidinger et al., Taxonomy of Risks posed by Language Models (DeepMind). (arxiv.org)
- Anthropic, Core Views on AI Safety. (anthropic.com)
- OpenAI, GPT-4 System Card (real-world harm categories). (arxiv.org)
Benchmarks
- Liang et al., HELM (safety categories and multi-metric eval). (arxiv.org)
- Mazeika et al., HarmBench
- Lin et al., TruthfulQA
- Hartvigsen et al., ToxiGen
Red Teaming
- Zou et al., Universal and Transferable Adversarial Attacks on Aligned LMs + code/site. (arxiv.org)
- Perez et al., Red Teaming LMs with LMs. (arxiv.org)
- Ziegler et al., Adversarial Training for High-Stakes Reliability
- [Optional] Follow-ups on universal suffixes and defenses. (arxiv.org)
Steering and Control
Fine-tuning
- Hu et al., LoRA. (arxiv.org)
- Zhao et al., RLAIF: Scaling RL from AI Feedback. https://arxiv.org/abs/2309.00267
- [Optional] Dettmers et al., QLoRA (efficient fine-tuning). https://arxiv.org/abs/2305.14314
- [Optional] Rafailov et al., DPO. (arxiv.org)
Activation
- Zou et al., Representation Engineering. (arxiv.org)
- [Optional] Turner et al., Activation Addition (practical intro).
- [Optional] Li et al., Inference-Time Intervention (ITI).
- [Optional] Rimsky et al., Steering Llama 2 with Contrastive Activation Addition. https://arxiv.org/abs/2312.06681
Agents
- Yao et al., ReAct. (arxiv.org)
- Kenton et al., Alignment of Language Agents (DeepMind).
- Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning. https://arxiv.org/abs/2303.11366
[Optional] Practical intro to steering vectors. (alignmentforum.org)
Deployment and Monitoring
Observability
- Anthropic, Building LLM Observability. https://www.anthropic.com/index/evaluating-ai
- Peng et al., Check Your Facts and Try Again: LLM Hallucination Detection. https://arxiv.org/abs/2302.12813
- Trail of Bits, LLM Security Audit Methodology. https://github.com/trailofbits/publications/blob/master/reviews/2023-12-openai-gpt4-securityreview.pdf
Governance
- Anthropic Responsible Scaling Policy
- [Optional] EU AI Act summary
Interpretability
- Anthropic, In-Context Learning and Induction Heads + framework site. (arxiv.org)
- Goodfire AI, SAE Probes for PII Detection (applying interpretability to safety). https://www.goodfire.ai/research/rakuten-sae-probes-for-pii-detection
- [Optional] Distill, Circuits thread. (Distill)
- [Optional] Walkthroughs and summaries. (Neel Nanda)
Privacy-Preserving ML
- Abadi et al., Deep Learning with Differential Privacy (DP-SGD). (arxiv.org)
- Staab et al., Beyond Memorization: LLMs and Data Privacy. https://arxiv.org/abs/2310.07298
- [Optional] McMahan et al., Federated Learning. (arxiv.org)
- [Optional] Unlearning surveys for removal workflows. (arxiv.org)
- [Optional] Microsoft Presidio (PII detection/redaction). https://microsoft.github.io/presidio/
- [Optional] Google Cloud DLP (Data Loss Prevention). https://cloud.google.com/dlp/docs
Multi-Modal
- Shayegani et al., Jailbreak in Pieces for VLMs
- OpenAI Sora Safety Approach
Case Studies & Post-Mortems
- Character.AI teen suicide incident (news coverage)
- Bing Chat "Sydney" breakdown (coverage)
- Bender et al., On the Dangers of Stochastic Parrots.