Almost Surely Safe LLM Inference-Time Alignment
Best AI papers explained - A podcast by Enoch H. Kang

This research introduces InferenceGuard, a novel method for aligning large language models (LLMs) at inference time that aims to guarantee safe responses almost surely, i.e., with probability one. Traditional alignment methods are costly and modify model weights, while existing inference-time techniques often lack strong safety guarantees. InferenceGuard reframes safe generation as a constrained Markov decision process (MDP) in the LLM's latent space and uses state augmentation to obtain its almost-sure safety guarantee. By training a compact critic in this latent space, the approach balances safety and task performance, outperforming other inference-time alignment methods at generating safe and aligned outputs without altering the base model.
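To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of critic-guided, state-augmented decoding. It is not the paper's implementation: the decoding state is the model's latent hidden state augmented with a remaining safety budget, and a small critic masks candidate tokens whose predicted safety cost would exhaust that budget. All names (`LatentSafetyCritic`, `guarded_sample_step`) and the cost/budget bookkeeping are illustrative assumptions.

```python
import torch

class LatentSafetyCritic(torch.nn.Module):
    """Compact critic over the latent space: predicts a safety cost-to-go
    from the LLM's hidden state plus the remaining safety budget (the
    augmented part of the state)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim + 1, 128),  # latent + remaining budget
            torch.nn.ReLU(),
            torch.nn.Linear(128, 1),
        )

    def forward(self, latent: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        # latent: (k, hidden_dim), budget: (k,) -> predicted cost per candidate
        return self.net(torch.cat([latent, budget.unsqueeze(-1)], dim=-1)).squeeze(-1)


def guarded_sample_step(logits, candidate_latents, critic, budget, top_k=20):
    """One decoding step: sample only among top-k candidates whose predicted
    safety cost keeps the augmented state (remaining budget) feasible.

    logits: (vocab,) next-token logits from the frozen base model
    candidate_latents: (top_k, hidden_dim) latents for the top-k candidates
    budget: 0-d tensor holding the remaining safety budget
    """
    topk = torch.topk(logits, top_k, dim=-1)
    costs = critic(candidate_latents, budget.reshape(1).expand(top_k))
    safe_mask = costs <= budget  # candidates that do not violate the constraint
    masked_logits = topk.values.masked_fill(~safe_mask, float("-inf"))
    if torch.isinf(masked_logits).all():
        # No candidate satisfies the budget: fall back to the least-costly one.
        choice = costs.argmin()
    else:
        choice = torch.distributions.Categorical(logits=masked_logits).sample()
    token = topk.indices[choice]
    new_budget = budget - costs[choice].clamp(min=0.0)  # augmented-state update
    return token, new_budget
```

Because the critic operates on latents rather than text, it stays small and the base model's weights are never touched, which is the design choice the summary highlights.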