PyPy warmup improvements
Hello everyone!
I'm very pleased to announce that we've just managed to merge the optresult branch. Under this cryptic name lies the biggest JIT refactoring we've done in a couple of years, focused mostly on the warmup time and memory impact of PyPy.
To understand why we did this, let's look back in time: when we got the first working JIT prototype in 2009, we focused exclusively on achieving peak performance, with some consideration towards memory usage but without serious consideration towards warmup time. This means we accumulated quite a bit of technical debt over time, which we're now trying, with difficulty, to pay down. This branch mostly does not affect peak performance; it should, however, help with short-lived scripts, like test runs.
We identified warmup time as one of the major pain points for PyPy users, along with memory impact and compatibility issues with the CPython C extension world. While we can't address all of these at once, we're trying to address the first two in the work behind this blog post. I will write a separate article on the last item.
To see how much of a problem warmup is for your program, you can run it with the PYPYLOG=jit-summary:- environment variable set. This should show you something like this:
    (pypy-optresult)fijal@hermann:~/src/botbot-web$ PYPYLOG=jit-summary:- python orm.py 1500
    [d195a2fcecc] {jit-summary
    Tracing:        781     2.924965
    Backend:        737     0.722710
    TOTAL:                  35.912011
    ops:                    1860596
    recorded ops:           493138
      calls:                81022
    guards:                 131238
    opt ops:                137263
    opt guards:             35166
    forcings:               4196
    abort: trace too long:  22
    abort: compiling:       0
    abort: vable escape:    22
    abort: bad loop:        0
    abort: force quasi-immut: 0
    nvirtuals:              183672
    nvholes:                25797
    nvreused:               116131
    Total # of loops:       193
    Total # of bridges:     575
    Freed # of loops:       6
    Freed # of bridges:     75
    [d195a48de18] jit-summary}
This means that the total (wall clock) time was 35.9s, out of which we spent 2.9s tracing 781 loops and 0.72s compiling them. The remaining traces were aborted (trace too long is normal; vable escape means someone called sys._getframe() or an equivalent). You can do the following things:
- compare the numbers with pypy --jit off and see at which number of iterations the PyPy JIT kicks in (see the timing sketch after this list)
- play with the thresholds: pypy --jit threshold=500,function_threshold=400,trace_eagerness=50 was much better in this example. What this does is lower the threshold for tracing loops from the default of 1039 to 500, the threshold for tracing functions from the start from 1619 to 400, and the threshold for tracing bridges from 200 to 50. Bridges are "alternative paths" that the JIT did not take initially and that get traced later. We believe in sane defaults, so we'll try to improve upon those numbers, but generally speaking there is no one-size-fits-all here.
- if the tracing/backend time stays high, come and complain to us with benchmarks; we'll try to look at them
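To eyeball where the JIT kicks in, a minimal timing sketch like the one below can help. The work function here is a hypothetical stand-in; substitute your own workload. Run the script once normally and once with pypy --jit off, and compare the per-iteration times:

    import time

    def work(n):
        # hypothetical stand-in for your hot code
        total = 0
        for i in range(n):
            total += i * i
        return total

    # Time the same chunk of work repeatedly; under the JIT the
    # per-iteration time should drop sharply once the hot loop
    # gets traced and compiled.
    for iteration in range(20):
        start = time.time()
        work(100000)
        print("%d  %.6f" % (iteration, time.time() - start))

The iteration at which the timings under pypy diverge from those under pypy --jit off gives a rough idea of when compilation starts paying off for your workload.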
Warmup, as a number, is notoriously hard to measure. It's a combination of:
- pypy running the interpreter before JITting
- pypy needing time to JIT the traces
- additional memory allocations needed during tracing to accommodate bookkeeping data
- repeatedly exiting and re-entering assembler until enough of the code is covered by compiled assembler
We're working hard on making a better assessment of this number, stay tuned :-)
Speedups
Overall we measured about a 50% speed improvement in the optimizer, which reduces overall warmup time by between 10% and 30%. The very obvious warmup benchmark got a speedup from 4.5s to 3.5s, almost a 30% improvement. The speedups on your benchmarks will obviously depend heavily on how much of their time is warmup. We observed the annotation phase of translating PyPy to decrease by about 30% and the overall translation time by about 7%, so your mileage may vary.
Of course, as usual with a large refactoring of a crucial piece of PyPy, bugs are to be expected. We are going to wait for the default branch to stabilize, so you should see the warmup improvements in the next release. If you're not afraid to try, nightlies already have them.
We're hoping to continue improving warmup time and memory impact in the future, so stay tuned.
Technical details
The branch does "one" thing - it changes the underlying model of how operations are represented during tracing and optimization. Let's consider a simple loop like:
    [i0, i1]
    i2 = int_add(i0, i1)
    i3 = int_add(i2, 1)
    i4 = int_is_true(i3)
    guard_true(i4)
    jump(i3, i2)
The original representation would allocate a Box for each of i0 - i4 and then store those boxes in instances of ResOperation. The list of such operations would then go to the optimizer. Those lists are big - we usually remove 90% of the operations during optimization, but they can still be a couple of thousand elements long. Overall, allocating those big lists takes a toll on warmup time, especially due to the GC pressure. The branch removes the existence of Box completely; instead, a value is a link to the ResOperation itself. So, in the above example, i2 would refer to its producer - i2 = int_add(i0, i1) - with the input arguments getting special treatment.
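To make the difference concrete, here is a minimal sketch of the two models. These are not the actual PyPy classes; the names and fields are simplified for illustration:

    # Old model (sketch): each value gets a separate Box object, and a
    # ResOperation stores boxes for both its arguments and its result.
    class Box(object):
        pass

    class OldResOperation(object):
        def __init__(self, opname, args, result):
            self.opname = opname
            self.args = args        # list of Box instances
            self.result = result    # one fresh Box allocated per operation

    # New model (sketch): the operation *is* its own result; consumers
    # hold a reference to the producing ResOperation directly, so no
    # Box objects are allocated at all.
    class ResOperation(object):
        def __init__(self, opname, args):
            self.opname = opname
            self.args = args        # ResOperations (or constants/inputs)

    # i2 = int_add(i0, i1)
    # i3 = int_add(i2, 1)
    i2 = ResOperation("int_add", ["i0", "i1"])
    i3 = ResOperation("int_add", [i2, 1])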
That alone reduces the GC pressure slightly, but the reduced number of instances also lets us store optimization information on the operations directly, instead of going through the expensive dictionaries that were previously used to store it per box.
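As a sketch of what that buys us (the attribute and dictionary names here are hypothetical, for illustration only):

    class Box(object):
        pass

    class ResOperation(object):
        pass

    # Before (sketch): the optimizer kept side tables keyed by box,
    # paying for a dictionary lookup on every access.
    box = Box()
    known_values = {}
    known_values[box] = 42

    # After (sketch): the same information lives directly on the
    # operation, so accessing it is a plain attribute read.
    op = ResOperation()
    op.known_value = 42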
Cheers!
fijal & arigo