PyPy and ijson - a guest blog post
This gem was posted in the ijson issue tracker after some discussion on #pypy, and Dav1dde kindly allowed us to repost it here:
"So, I was playing around with parsing huge JSON files (19GiB, testfile is ~520MiB) and wanted to try a sample code with PyPy, turns out, PyPy needed ~1:30-2:00 whereas CPython 2.7 needed ~13 seconds (the pure python implementation on both pythons was equivalent at ~8 minutes).
"Apparantly ctypes is really bad performance-wise, especially on PyPy. So I made a quick CFFI mockup: https://gist.github.com/Dav1dde/c509d472085f9374fc1d
Before:
CPython 2.7:
python -m emfas.server size dumps/echoprint-dump-1.json
11.89s user 0.36s system 98% cpu 12.390 total
PyPy:
python -m emfas.server size dumps/echoprint-dump-1.json
117.19s user 2.36s system 99% cpu 1:59.95 total
After (CFFI):
CPython 2.7:
python jsonsize.py ../dumps/echoprint-dump-1.json
8.63s user 0.28s system 99% cpu 8.945 total
PyPy:
python jsonsize.py ../dumps/echoprint-dump-1.json
4.04s user 0.34s system 99% cpu 4.392 total
"
Dav1dde goes into more detail in the issue itself, but we want to emphasize two significant points from this brief exchange:
- His CFFI implementation is faster than the ctypes one even on CPython 2.7 (about 8.9 s versus 12.4 s total).
- PyPy + CFFI is faster than CPython + CFFI (about 4.4 s versus 8.9 s total), even though C code does the heavy parsing in both cases.
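To make the ctypes/CFFI difference concrete, here is a minimal sketch that binds the same C function both ways. It is not Dav1dde's gist and does not touch ijson or yajl: the libc function strlen stands in for the parser calls, the helper names strlen_ctypes and strlen_cffi are made up for illustration, and the code assumes a POSIX system where the C library can be reached by passing None to the loader. The cffi package is third-party on CPython (it ships with PyPy), so the sketch skips that half if it is not installed.

```python
import ctypes
import timeit

# Assumption: POSIX. CDLL(None) exposes symbols already loaded into the
# process, which includes the C library's strlen.
libc = ctypes.CDLL(None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def strlen_ctypes(data):
    """Call C strlen through ctypes (hypothetical helper name)."""
    return libc.strlen(data)


try:
    from cffi import FFI  # third-party on CPython, bundled with PyPy
except ImportError:
    strlen_cffi = None  # cffi not available; only the ctypes path runs
else:
    ffi = FFI()
    ffi.cdef("size_t strlen(const char *);")
    clib = ffi.dlopen(None)  # ABI mode: no C compiler required

    def strlen_cffi(data):
        """Call C strlen through CFFI (hypothetical helper name)."""
        return clib.strlen(data)


payload = b"x" * 64
calls = 100_000

# Time many small foreign calls, which is where per-call overhead shows up.
print("ctypes:", timeit.timeit(lambda: strlen_ctypes(payload), number=calls))
if strlen_cffi is not None:
    print("cffi:  ", timeit.timeit(lambda: strlen_cffi(payload), number=calls))
```

Running the same file under CPython and PyPy gives a rough feel for the per-call cost of each binding style. The absolute numbers will differ from the ones quoted above, since those came from parsing a ~520MiB JSON file rather than from a micro-benchmark.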
Comments
Maybe it's time to discuss inclusion of CFFI into the standard library again?
If CPython decides to include it in its stdlib, I can make sure it is updated as needed. I don't have the energy to discuss its inclusion myself, so if it happens it will be "championed" by someone else. Nowadays, I personally think inclusion has as many drawbacks as advantages, even if CFFI 1.x shouldn't evolve a lot in the foreseeable future after the 1.0 step.
The problem is converting existing libs to use cffi. Only a small percentage of libs are ready for Python 3.x, and at this rate not even 1% of libs will be converted to work with CFFI.
That makes PyPy adoption a lot slower.
Is there really no chance of improving ctypes?
You would think so, but these days the vast majority of popular C bindings come with cffi equivalents. In fact, cffi is vastly more popular than ctypes ever was.