OpenAI GPT-3: Language Models are Few-Shot Learners

Machine Learning Street Talk (MLST) - A podcast by Machine Learning Street Talk (MLST)

In this episode of Machine Learning Street Talk, Tim Scarfe, Yannic Kilcher and Connor Shorten discuss their takeaways from OpenAI’s GPT-3 language model. With the help of Microsoft’s ZeRO-2 / DeepSpeed optimiser, OpenAI trained a 175 billion parameter autoregressive language model. The paper demonstrates how self-supervised language modelling at this scale can perform many downstream tasks without fine-tuning.

00:00:00 Intro
00:00:54 ZeRO1+2 (model + data parallelism) (Connor)
00:03:17 Recent history of NLP (Tim)
00:06:04 Yannic "Light-speed" Kilcher's brief overview of GPT-3
00:14:25 Reviewing Yannic's YT comments on his GPT-3 video (Tim)
00:20:26 Main show intro
00:23:03 Is GPT-3 reasoning?
00:28:15 Architecture discussion and autoregressive (GPT*) vs denoising autoencoder (BERT)
00:36:18 Utility of GPT-3 in industry
00:43:03 Can GPT-3 do math? (reasoning/system 1/system 2)
00:51:03 Generalisation
00:56:48 Esoterics of language models
00:58:46 Architectural trade-offs
01:07:37 Memorization machines and interpretability
01:17:16 Nearest neighbour probes / watermarks
01:20:03 YouTube comments on GPT-3 video
01:21:50 GPT-3 news article generation issue
01:27:36 Sampling data for language models / bias / fairness / politics
01:51:12 Outro

The paper divides task adaptation without fine-tuning into zero-shot, one-shot, and few-shot learning. Zero-shot learning is the extreme case, where we expect a language model to perform a task such as sentiment classification or extractive question answering without any additional supervision. One-shot and few-shot learning provide the model with one example or a handful of examples. However, GPT-3’s definition of these diverges a bit from the conventional literature: it provides the one-shot and few-shot examples in the form of “In-Context Learning”. Instead of being fine-tuned on a few examples, the model has to use its input to infer the downstream task.
For example, the GPT-3 transformer has an input sequence of 2048 tokens, so demonstrations of a task such as Yelp sentiment reviews would have to fit in this input sequence along with the new review.

Thanks for watching! Please Subscribe!

Paper Links:
GPT-3: https://arxiv.org/abs/2005.14165
ZeRO: https://arxiv.org/abs/1910.02054
ZeRO (Blog Post): https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
ZeRO-2 (Blog Post): https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/?OCID=msr_blog_deepspeed2_build_tw

#machinelearning #naturallanguageprocessing #deeplearning #gpt3
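To make the in-context learning setup described above a little more concrete, here is a minimal Python sketch of packing sentiment demonstrations and a new review into a single prompt under the 2048-token input limit. It is an illustration under stated assumptions, not the prompt format used in the paper: the "Review:/Sentiment:" layout, the helper names (count_tokens, format_demo, format_query, build_prompt) and the example reviews are all hypothetical, and whitespace splitting is only a crude stand-in for GPT-3's real tokenizer.

```python
from typing import Callable, List, Tuple

CONTEXT_WINDOW = 2048  # GPT-3 input length in tokens, as stated in the text


def count_tokens(text: str) -> int:
    """Assumption: whitespace-separated words stand in for real tokens."""
    return len(text.split())


def format_demo(review: str, label: str) -> str:
    """Render one labelled demonstration (hypothetical layout)."""
    return f"Review: {review}\nSentiment: {label}\n\n"


def format_query(review: str) -> str:
    """Render the new review, leaving the label for the model to complete."""
    return f"Review: {review}\nSentiment:"


def build_prompt(
    demos: List[Tuple[str, str]],
    query: str,
    max_tokens: int = CONTEXT_WINDOW,
    counter: Callable[[str], int] = count_tokens,
) -> Tuple[str, int]:
    """Pack as many demonstrations as fit, in order, ahead of the query.

    Returns the prompt and the number of demonstrations actually used:
    0 is the zero-shot case, 1 is one-shot, more is few-shot.
    """
    query_block = format_query(query)
    budget = max_tokens - counter(query_block)
    if budget < 0:
        raise ValueError("the new review alone does not fit in the input")
    kept = []
    for review, label in demos:
        block = format_demo(review, label)
        cost = counter(block)
        if cost > budget:
            break  # stop at the first demonstration that does not fit
        kept.append(block)
        budget -= cost
    return "".join(kept) + query_block, len(kept)


# Made-up example data, for illustration only.
demos = [
    ("The tacos were amazing.", "positive"),
    ("Cold food and rude staff.", "negative"),
]
prompt, used = build_prompt(demos, "Great coffee, friendly service.")
print(prompt)
```

No weights are updated anywhere in this sketch: the only "supervision" is the text placed in front of the new review, which is why the number of demonstrations is bounded by the input length rather than by a training budget.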