Erik Jones on Automatically Auditing Large Language Models

The Inside View - A podcast by Michaël Trazzi

Erik is a PhD student at Berkeley working with Jacob Steinhardt, interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.

YouTube: https://youtu.be/bhE5Zs3Y1n8
Paper: https://arxiv.org/abs/2303.04381
Erik: https://twitter.com/ErikJones313
Host: https://twitter.com/MichaelTrazzi
Patreon: https://www.patreon.com/theinsideview

Outline

00:00 Highlights
00:31 Erik's background and research at Berkeley
01:19 Motivation for doing safety research on language models
02:56 Is it too easy to fool today's language models?
03:31 The goal of adversarial attacks on language models
04:57 Automatically Auditing Large Language Models via Discrete Optimization
06:01 Optimizing over a finite set of tokens rather than continuous embeddings
06:44 Goal is revealing behaviors, not necessarily breaking the AI
07:51 On the feasibility of solving adversarial attacks
09:18 Suppressing dangerous knowledge vs just bypassing safety filters
10:35 Can you really ask a language model to cook meth?
11:48 Optimizing French to English translation example
13:07 Forcing toxic celebrity outputs just to test rare behaviors
13:19 Testing the method on GPT-2 and GPT-J
14:03 Adversarial prompts transferred to GPT-3 as well
14:39 How this auditing research fits into the broader AI safety field
15:49 Need for automated tools to audit failures beyond what humans can find
17:47 Auditing to avoid unsafe deployments, not for existential risk reduction
18:41 Adaptive auditing that updates based on the model's outputs
19:54 Prospects for using these methods to detect model deception
22:26 Prefer safety via alignment over just auditing constraints, Closing thoughts

Patreon supporters: Tassilo Neubauer, MonikerEpsilon, Alexey Malafeev, Jack Seroy, JJ Hepburn, Max Chiswick, William Freire, Edward Huff, Gunnar Höglund, Ryan Coppolo, Cameron Holmes, Emil Wallner, Jesse Hoogland, Jacques Thibodeau, Vincent Weisser