Software at Scale 34 - Faster Python with Guido van Rossum

Software at Scale - A podcast by Utsav Shah

Categories:

Guido van Rossum is the creator of the Python programming language and a Distinguished Engineer at Microsoft. Apple Podcasts | Spotify | Google PodcastsWe discuss Guido’s new work on making CPython faster (PEP 659), Tiers of Python Interpreter Execution, and high impact, low hanging fruit performance improvements.Highlights(an edited summary)[00:21] What got you interested in working on Python performance?Guido: In some sense, it was probably a topic that was fairly comfortable to me because it means working with a core of Python, where I still feel I know my way around. When I started at Microsoft, I briefly looked at Azure but realized I never enjoyed that kind of work at Google or Dropbox. Then I looked at Machine Learning, but it would take a lot of time to do something interested with the non-Python, and even Python-related bits.[02:31] What was different about the set of Mark Shannon’s ideas on Python performance that convinced you to go after them?Guido: I liked how he was thinking about the problem. Most of the other approaches around Python performance like PyPy and Cinder are not suitable for all use cases since they aren’t backward compatible with extension modules. Mark has the perspective and experience of a CPython developer, as well as a viable approach that would maintain backward compatibility, which is the hardest problem to solve. The Python Bytecode interpreter is modified often across minor releases (for eg: 3.8 → 3.9) for various reasons like new opcodes, so modifying that is a relatively safe approach. Utsav: [09:45] Could you walk us through the idea of the tiers of execution of the Python Interpreter?Guido: When you execute a program, you don't know if it's going to crash after running a fraction of a millisecond, or whether it's going to be a three-week-long computation. Because it could be the same code, just in the first case, it has a bug. And so, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that's going to be run. But obviously, especially in dynamic languages like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing code as quickly as you can. So that if it's a small script, or a large program that happens to fail early, or just exits early for a good reason, you don't spend any time being distracted by optimizing all that code.So, what we try to do there is keep the bytecode compiler simple so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, and some definition of “hot”. For some purposes, maybe it's a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, “Well, it's only hot if it's been called 1000 times.”The specializing adaptive compiler (PEP 659) then tries to replace certain bytecodes with bytecodes that are faster, but only work if the types of the arguments are specific types. A simple hypothetical example is the plus operator in Python. It can add lots of things like integers, strings, lists, or even tuples. On the other hand, you can't add an integer to a string. So, the optimization step - often called quickening, but usually in our context, we call it specializing - is to have a separate “binary add” integer bytecode, a second-tier bytecode hidden from the user. This opcode assumes that both of its arguments are actual Python integer objects, reaches directly into those objects to find the values, adds those values together in machine registers, and pushes the result back on the stack. The binary adds integer operation still has to make a type check on the arguments. So, it's not completely free but a type check can be implemented much faster than a sort of completely generic object-oriented dispatch, like what normally happens for most generic add operations. Finally, it's always possible that a function is called millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. At that point, the interpreter will simply execute the original bytecode. That's an important part so that you still have the full Python semantics.Utsav [18:20] Generally you hear of these techniques in the context of JIT, a Just-In-Time compiler, but that’s not being implemented right now.Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we're trying to avoid. In our case, it’s unclear what and when we’re exactly compiling. At some point ahead of program execution, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. I mean, everything happens at some point during runtime, so which part would you call Just-In-Time? Also, it’s often assumed that Just-In-Time compilation automatically makes all your code better. Unfortunately, you often can't actually predict what the performance of your code is going to be. And we have enough of that with modern CPUs and their fantastic branch prediction. For example, we write code in a way that we think will clearly reduce the number of memory accesses. When we benchmark it, we find that it runs just as fast as the old unoptimized code because the CPU figured out access patterns without any of our help. I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching because that is absolute magic. Full TranscriptUtsav: [00:14] Thank you, Guido, for joining me on another episode of the Software at Scale podcast. It's great to have you here. Guido: [00:20] Great to be here on the show. Utsav: [00:21] Yeah. And it's just fun to talk to you again. So, the last time we spoke was at Dropbox many, many years ago. And you got retired, and then you decided that you wanted to do something new. And you work on performance now at Microsoft, and that's amazing. So, to start off with, I just want to ask you, you could pick any project that you wanted to, based on some slides that I've seen. So, what got you interested in working on Python performance?Guido: [00:47] In some sense, it was probably a topic that was fairly comfortable to me because it means working with a core of Python, where I still feel I know my way around. Some other things I considered briefly in my first month at Microsoft, I looked into, “Well, what can I do with Azure?”, and I almost immediately remembered that I was not cut out to be a cloud engineer. That was never the fun part of my job at Dropbox. It wasn't the fun part of my job before that at Google either. And it wouldn't be any fun to do that at Microsoft. So, I gave up on that quickly. I looked in machine learning, which I knew absolutely nothing about when I joined Microsoft. I still know nothing, but I've at least sat through a brief course and talked to a bunch of people who know a lot about it. And my conclusion was actually that it's a huge field. It is mostly mathematics and statistics and there is very little Python content in the field. And it would take me years to do anything interesting with the non-Python part and probably even with the Python part, given that people just write very simple functions and classes, at best in their machine learning code. But at least I know a bit more about the terminology that people use. And when people say kernel, I now know what they mean. Or at least I'm not confused anymore as I was before.Utsav: [02:31] That makes sense. And that is very similar to my experience with machine learning. Okay, so then you decided that you want to work on Python performance, right? And then you are probably familiar with Mark Shannon's ideas?Guido: [02:43] Very much so. Yeah.Utsav: [02:44] Yeah. So, was there anything different about the set of ideas that you decided that this makes sense and I should work on a project to implement these ideas?Guido: [02:55] Mark Shannon's ideas are not unique, perhaps, but I know he's been working on for a long time. I remember many years ago, I went to one of the earlier Python UK conferences, where he gave a talk about his PhD work, which was also about making Python faster. And over the years, he's never stopped thinking about it. And he sort of has a holistic attitude about it. Obviously, the results remain to be seen, but I liked what he was saying about how he was thinking about it. And if you take PyPy, it has always sounded like PyPy is sort of a magical solution that only a few people in the world understand how it works. And those people built that and then decided to do other things. And then they left it to a team of engineers to solve the real problems with PyPy, which are all in the realm of compatibility with extension modules. And they never really solved that. [04:09] So you may remember that there was some usage of PyPy at Dropbox because there was one tiny process where someone had discovered that PyPy was actually so much faster that it was worth it. But it had to run in its own little process and there was no maintenance. And it was a pain, of course, to make sure that there was a version of PyPy available on every machine. Because for the main Dropbox application, we could never switch to PyPy because that depended on 100 different extension modules. And just testing all that code would take forever. [04:49] I think since we're talking about Dropbox, Pyston was also an interesting example. They've come back actually; you've probably heard that. The Pyston people were much more pragmatic, and they've learned from PyPy’s failures. [05:04] But they have always taken this attitude of, again, “we're going to start with CPython,” which is good because that way they are sort of guaranteed compatibility with extension modules. But still, they make these huge sets of changes, at least Pyston one, and they had to roll back a whole bunch of things because, again, of compatibility issues, where I think one of the things, they had a bunch of very interesting improvements to the garbage collection. I think they got rid of the reference counting, though. And because of that, the behavior of many real-world Python programs was completely changed. [05:53] So why do I think that Mark's work will be different or Mark's ideas? Well, for one, because Mark has been in Python core developer for a long time. And so, he knows what we're up against. He knows how careful we have with backwards compatibility. And he knows that we cannot just say get rid of reference counting or change the object layout. Like there was a project that was recently released by Facebook basically, was born dead, or at least it was revealed to the world in its dead form, CI Python (Cinder), which was a significantly faster Python implementation, but using sort of many of the optimizations came from changes in object layout that just aren't compatible with extension modules. And Mark has sort of carved out these ideas that work on the bytecode interpreter itself. [06:58] Now, the bytecode is something where we know that it's not going to sort of affect third-party extension modules too much if we change it, because the bytecode changes in every Python release. And internals of the interpreter of the bytecode interpreter, change in every Python release. And yes, we still run into the occasional issue. Every release, there is some esoteric hack that someone is using that breaks. And they file an issue in the bug tracker because they don't want to research or they haven't yet researched what exactly is the root cause of the problem, because all they know is their users say, “My program worked in Python 3.7, and it broke in Python 3.8. So clearly, Python 3.8 broke something.” And since it only breaks when we're using Library X, it must be maybe Library X's fault. But Library X, the maintainers don't know exactly what's going on because the user just says it doesn't work or give them a thousand-line traceback. And they bounce it back to core Python, and they say, “Python 3.8 broke our library for all our users, or 10% of our users,” or whatever. [08:16] And it takes a long time to find out, “Oh, yeah, they're just poking inside one of the standard objects, using maybe information they gleaned from internal headers, or they're calling a C API that starts with an underscore.” And you're not supposed to do that. Well, you can do that but then you pay the price, which is you have to fix your code at every next Python release. And in between, sort of for bug fix releases like if you go from 3.8.0 to 3.8.1, all the way up to 3.8.9, we guarantee a lot more - the bytecodes stay stable. But 3.9 may break all your hacks and it changes the bytecode. One thing we did I think in 3.10, was all the jumps in the bytecode are now counted in instructions rather than bytes, and instructions are two bytes. Otherwise, the instruction format is the same, but all the jumps jump a different distance if you don't update your bytecode. And of course, the Python bytecode compiler knows about this. But people who generate their own bytecode as a sort of the ultimate Python hack would suffer.Utsav: [09:30] So the biggest challenge by far is backwards compatibility.Guido: [09:34] It always is. Yeah, everybody wants their Python to be faster until they find out that making it faster also breaks some corner case in their code.Utsav: [09:45] So maybe you can walk us through the idea of the tiers of execution or tiers of the Python interpreter that have been described in some of those slides.Guido: [09:54] Yeah, so that is a fairly arbitrary set of goals that you can use for most interpreted languages. Guido: [10:02] And it's actually a useful way to think about it. And it's something that we sort of plan to implement, it's not that there are actually currently tiers like that. At best, we have two tiers, and they don't map perfectly to what you saw in that document. But the basic idea is-- I think this also is implemented in .NET Core. But again, I don't know if it's sort of something documented, or if it's just this is how their optimizer works. So, when you just start executing a program, you don't know if it's going to crash after running a fraction of a millisecond, or whether it's going to be a three-week-long computation. Because it could be the same code, just in the first case, it has a bug. And so, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that's going to be run. But obviously, especially in dynamic language, and something like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing the code as quickly as you can. So that if it's a small script, or a large program that happens to fail early, or just exits early for a good reason, you don't spend any time being distracted by optimizing all that code. [11:38] And so if this was a statically compiled language, the user would have to specify that basically, when they run the compiler, they say, “Well, run a sort of optimize for speed or optimize for time, or O2, O3 or maybe optimized for debugging O0.” In Python, we try not to bother the user with those decisions. So, you have to generate bytecode before you can execute even the first line of code. So, what we try to do there is keep the bytecode compiler simple, keep the bytecode interpreter simple, so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, and you can sort of define what's hot. For some purposes, maybe it's a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, “Well, it's only hot if it's been called 1000 times.” [12:48] But anyway, for a hot function, you want to do more work. And so, the specializing adaptive compiler, at that point, tries to replace certain bytecodes with bytecodes that are faster, but that work only if the types of the arguments are specific types. A simple example but pretty hypothetical is the plus operator in Python at least, can add lots of things. It can add integers, it can add floats, it can add strings, it can list or tuples. On the other hand, you can't add an integer to a string, for example. So, what we do there, the optimization step - and it's also called quickening, but usually in our context, we call it specializing - is we have a separate binary add integer bytecode. And it's sort of a second-tier bytecode that is hidden from the user. If the user asked for the disassembly of their function, they will never see binary add integer, they will also always see just binary add. But what the interpreter sees once the function has been quickened, the interpreter may see binary add integers. And the binary add integer just assumes that both of its arguments, that's both the numbers on the stack, are actual Python integer objects. It just reaches directly into those objects to find the values, adds those values together in machine registers, and push the result back on the stack. [14:35] Now, there are all sorts of things that make that difficult to do. For example, if the value doesn't fit in a register for the result, or either of the input values, or maybe even though you expected it was going to be adding two integers, this particular time it's going to add to an integer and a floating-point or maybe even two strings. [15:00] So the first stage of specialization is actually… I'm blanking out on the term, but there is an intermediate step where we record the types of arguments. And during that intermediate step, the bytecode actually executes slightly slower than the default bytecode. But that only happens for a few executions of a function because then it knows this place is always called with integers on the stack, this place is always called with strings on the stack, and maybe this place, we still don't know or it's a mixed bag. And so then, the one where every time it was called during this recording phase, it was two integers, we replace it with that binary add integer operation. The binary adds integer operation, then, before it reaches into the object, still has to make a type check on the arguments. So, it's not completely free but a type check can be implemented much faster than a sort of completely generic object-oriented dispatch, like what normally happens for the most generic binary add operations. [16:14] So once we've recorded the types, we specialize it based on the types, and the interpreter then puts in guards. So, the interpreter code for the specialized instruction has guards that check whether all the conditions that will make the specialized instruction work, are actually met. If one of the conditions is not met, it's not going to fail, it's just going to execute the original bytecode. So, it's going to fall back to the slow path rather than failing. That's an important part so that you still have the full Python semantics. And it's always possible that a function is called hundreds or millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. And the semantics still say, “Well, then it has to do with the floating-point way.Utsav: [17:12] It has to deoptimize, in a sense.Guido: [17:14] Yeah. And there are various counters in all the mechanisms where, if you encounter something that fails the guard once, that doesn't deoptimize the whole instruction. But if you sort of keep encountering mismatches of the guards, then eventually, the specialized instruction is just deoptimized and we go back to, “Oh, yeah, we'll just do it the slow way because the slow way is apparently the fastest, we can do.” Utsav: [17:45] It's kind of like branch prediction.Guido: [17:47] I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching because that is absolute magic. And it's actually one of the things we're up against with this project, because we write code in a way that we think will clearly reduce the number of memory accesses, for example. And when we benchmark it, we find that it runs just as fast as the old unoptimized code because the CPU figured it out without any of our help. Utsav: [18:20] Yeah. I mean, these techniques, generally you hear them in a context of JIT, a Just-In-Time compiler, but y’all are not implementing that right now.Guido: [18:30] JIT is like, yeah, in our case, it would be a misnomer. What we do expect to eventually be doing is, in addition to specialization, we may be generating machine code. That's probably going to be well past 3.11, maybe past 3.12. So, the release that we still have until October next year is going to be 3.11, and that's where the specializing interpreters going to make its first entry. I don't think that we're going to do anything with machine code unless we get extremely lucky with our results halfway through the year. But eventually, that will be another tier. But I don't know, Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we're trying to avoid.Utsav: [19:25] Is it baggage from other projects trying it?Guido: [19:29] People assume that Just-In-Time compilation automatically makes all your code better. It turns out that it's not that simple. In our case, compilation is like, “What exactly is it that we compile?” At some point ahead of time, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. I mean, everything happens at some point during runtime, so which thing would you call Just-In-Time? Guido: [20:04] So I'm not a big fan of using that term. And it usually makes people think of feats of magical optimization that have been touted by the Java community for a long time. And unfortunately, the magic is often such that you can't actually predict what the performance of your code is going to be. And we have enough of that, for example, with the modern CPUs and their fantastic branch prediction.Utsav: [20:35] Speaking of that, I saw that there's also a bunch of small wins y'all spoke about, that y’all can use to just improve performance, things like fixing the place of __dict__ in objects and changing the way integers are represented. What is just maybe one interesting story that came out of that?Guido: [20:53] Well, I would say calling Python functions is something that we actually are currently working on. And I have to say that this is not just the Microsoft team, but also other people in the core dev team, who are very excited about this and helping us in many ways. So, the idea is that in the Python interpreter, up to and including version 3.10, which is going to be released next week, actually, whenever you call a Python function, the first thing you do is create a frame object. And a frame object contains a bunch of state that is specific to that call that you're making. So, it points to the code object that represents the function that's being called, it points to the globals, it has a space for the local variables of the call, it has space for the arguments, it has space for the anonymous values on the evaluation stack. But the key thing is that it’s still a Python object. And there are some use cases where people actually inspect the Python frame objects, for example, if they want to do weird stuff with local variables. [22:18] Now, if you're a debugger, it makes total sense that you want to actually look at what are all the local variables in this frame? What are their names? What are their values and types? A debugger may even want to modify a local variable while the code is stopped in a breakpoint. That's all great. But for the execution of most code, most of the time, certainly, when you're not using a debugger, there's no reason that that frame needs to be a Python object. Because a Python object has a header, it has a reference count, it has a type, it is allocated as its own small segment of memory on the heap. It's all fairly inefficient. Also, if you call a function, then you create a few objects, then from that function, you call another function, all those frame objects end up scattered throughout the entire heap of the program. [23:17] What we have implemented in our version of 3.11, which is currently just the main branch of the CPython repo, is an allocation scheme where when we call a function, we still create something that holds the frame, but we allocate that in an array of frame structures. So, I can't call them frame objects because they don't have an object header, they don't have a reference count or type, it's just an array of structures. This means that unless that array runs out of space, calls can be slightly faster because you don't jump around on the heap. And allocation sort of is to allocate the next frame, you compare two pointers, and then you bump one counter, and now you have a new frame structure. And so, creation, and also deallocation of frames is faster. Frames are smaller because you don't have the object header. You also don't have the malloc overhead or the garbage collection overhead. And of course, it's backwards incompatible. So, what do we do now? Fortunately, there aren't that many ways that people access frames. And what we do is when people call an API that returns a frame object, we say, “Okay, well sure. Here's the frame in our array. Now we're going to allocate an object and we're going to copy some values to the frame object,” and we give that to the Python code. So, you can still introspect it and you can look at the locals as if nothing has changed. [25:04] But most of the time, people don't look at add frames. And this is actually an old optimization. I remember that the same idea existed in IronPython. And they did it differently. I think for them, it was like a compile-time choice when the bytecode equivalent in IronPython was generated for a function, it would dynamically make a choice whether to allocate a frame object or just a frame structure for that call. And their big bugaboo was, well, there is a function you can call sys dunder __getFrame__ and it just gives you the frame object. So, in the compiler, they were looking, were you using the exact thing named system dunder __getFrame__ and then they would say, “Oh, that's getFrame, now we're going to compile you slightly slower so you use a frame object.” We have the advantage that we can just always allocate the frame object on the fly. But we get similar benefits. And oh, yeah, I mentioned that the frame objects are allocated in array, what happens if that array runs out? Well, it's actually sort of a linked list of arrays. So, we can still create a new array of frames, like we have space for 100 or so which, in many programs, that's plenty. And if your call stack is more than 100 deep, we'll just have one discontinuity, but the semantics are still the same and we still have most of the benefits.Utsav: [26:39] Yeah, and maybe as a wrap-up question, there are a bunch of other improvements happening in the Python community for performance as well, right? There's Mypyc, which we're familiar with, which is using types, Mypy types to maybe compiled code to basically speed up. Are there any other improvements like that, that you're excited about, or you're interested in just following?Guido: [27:01] Well, Mypyc is very interesting. It gives much better performance boost, but only when you fully annotate your code and only when you actually follow the annotations precisely at runtime. In Mypy, if you say, “This function takes two integers,” and it returns an integer, then if you call it with something else, it's going to immediately blow up. It'll give you a traceback. But the standard Python semantics are that type annotations are optional, and sometimes they're white lies. And so, the types that you see at runtime may not actually be compatible with the types that were specified in the annotations. And it doesn't affect how your program executes. Unless you sort of start introspecting the annotations, your program runs exactly the same with or without annotations. [28:05] I mean, there are a couple of big holes that are in the type system, like any. And the type checker will say, “Oh, if you put any, everything is going to be fine.” And so, using that, it's very easy to have something that is passed, an object of an invalid type, and the type checker will never complain about it. And our promise is that the runtime will not complain about it either unless it really is a runtime error. Obviously, if you're somehow adding an integer to a string at runtime, it's still going to be a problem. But if you have a function that, say, computes the greatest common divisor of two numbers, which is this really cute little loop, if you define the percent operator in just the right way, you can pass in anything. I think there are examples where you can actually pass it to strings, and it will return a string without ever failing. [29:07] And so basically, Mypyc does things like the instance attributes are always represented in a compact way where there is no dunder __dict__. The best that we can do, which we are working on designing how we're actually going to do that, is make it so that if you don't look at the dunder __dict__ attribute, we don't necessarily have to store the instance attributes in a dictionary as long as we preserve the exact semantics. But if you use the dunder __dict__, at some later point, again, just like the frame objects, we have to materialize a dictionary. And Mypyc doesn't do that. It's super-fast if you don't use dunder __dict__. If you do use dunder __dict__, it just says, “dunder __dict__ not supported in this case.” [29:59] Mypyc really only compiles a small subset of the Python language. And that's great if that's the subset you're interested in. But I'm sure you can imagine how complex that is in practice for a large program.Utsav: [30:17] It reminds me of JavaScript performance when everything is working fast and then you use this one function, which you're not supposed to use to introspect an object or something, and then performance just breaks down. Guido: [30:29] Yeah, that will happen. Utsav: [30:31] But it's still super exciting. And I'm also super thankful that Python fails loudly when you try to add a number in the string, not like JavaScript,Guido: [30:41] Or PHP, or Perl.Utsav: [30:44] But yeah, thank you so much for being a guest. I think this was a lot of fun. And I think it walked through the performance improvement y’all are trying to make in an accessible way. So, I think it’s going to be useful for a lot of people. Yeah, thank you for being a guest.Guido: [30:58] My pleasure. It’s been a fun chat. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev