cfbolz changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end ) | use cffi for calling C | if a pep adds a mere 25-30 [C-API] functions or so, it's a drop in the ocean (cough) - Armin
<wleslie>
it's about using virtual address mapping as read barriers to maintain assumptions in the jit
<wleslie>
is there a thread somewhere I can pull and find out more? and are there any runtimes that make use of this?
<mattip>
another hot spot: unicodehelper.fsdecode is called 5700 times, which is taking ~140 secs
<arigato>
wleslie: it's about using special address mappings for STM, really
<wleslie>
I figured it would be dual purpose. Azul did something similar with their concurrent GC when they first ported to x86.
<arigato>
maybe, but that's mostly not relevant for Pythons with a GIL
<antocuni>
mattip: I'm also investigating slow tests; I think one of the biggest culprits is _frozen_importlib, which is implemented at applevel now
<wleslie>
that's why I was wondering if any of the other runtimes ended up using it (my gut guesses pixie, but who knows). is it still a feature available from rpython, or has it bitrotten?
<antocuni>
so for example, in test_cpyext:LeakCheckingTest:preload_builtins, you preload/import mmap and types, so it takes forever
ronan has quit [Ping timeout: 246 seconds]
<arigato>
wleslie: it's still in a branch
danchr_ has joined #pypy
<wleslie>
neat. I can't seem to find this specific work anywhere, do you happen to have a guess at the branch name?
<arigato>
anything with "stm"
<wleslie>
stmgc-8 looks good
<wleslie>
thanks
<mattip>
antocuni: it seems at least for zipimport, changing fsdecode to have a fast path for ascii is a big win
<antocuni>
nice!
<antocuni>
in parallel, I am playing with writing a custom __import__ to be used only by tests, so that we don't need to bring in/execute the whole _frozen_importlib every time
<antocuni>
hopefully, with both our changes we will speed up tests considerably :)
<mattip>
time says 1m55 vs 1m17
<mattip>
the deeper problem is the else clause in the middle of fsdecode,
<mattip>
it is calling out to a c implementation of pypy_char2wchar
<mattip>
which wraps mbstowcs
<antocuni>
where is fsdecode implemented?
<mattip>
pypy/interpreter/unicodehelper.py
<mattip>
I think it is only meant to be called at bootstrap, but the test calls it ~5700 times
<antocuni>
I think it is perfectly fine to have a space option which says "dont_care_about_this_stuff" which will be set to True by default
<antocuni>
and only the tests which actually need this can set it appropriately
<mattip>
makes sense, but in this case I think a fast path for ascii is valid, the other code paths are always going to be expensive
<antocuni>
true
<mattip>
on the other hand, your work might make this less of an issue, since I think most of the calls are for importing
<mattip>
and the same file name is getting fsdecoded over and over
<antocuni>
maybe
<antocuni>
another probable source of slowness is the fact that on py3.6 you have to initialize more builtin modules than on default. I see that by default we bring in _locale, _frozen_importlib, struct, atexit and _string
<mattip>
right. I see a lot of caching fails for pypy.module.sys.state.State in build/rpython/rlib/cache.py
<mattip>
as it slowly imports those modules one at a time
<mattip>
and they have lots of lib-python/3 dependencies
<antocuni>
I think that the correct strategy would be to mock most of them and use the mocks by default in tests, possibly with a nice error message which says things like "please add XXX to spaceconfig['usemodules']" in case you need a feature which is not implemented by the mock
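The mock-module strategy could be sketched roughly like this (all names here are illustrative, not actual PyPy code): a stand-in that implements nothing by default and raises a helpful error pointing at spaceconfig['usemodules'] when a test touches something the mock doesn't provide.

```python
# Hypothetical sketch of the proposed mock strategy for builtin modules
# in tests; the class and attribute names are made up for illustration.
class MockBuiltinModule(object):
    def __init__(self, name, implemented=None):
        self.name = name
        self.implemented = implemented or {}

    def __getattr__(self, attr):
        # called only when normal attribute lookup fails
        if attr in self.implemented:
            return self.implemented[attr]
        raise AttributeError(
            "mock for %r does not implement %r; please add %r to "
            "spaceconfig['usemodules']" % (self.name, attr, self.name))
```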
<mattip>
err, forget my previous comment about cache
<mattip>
it is in unmarshalling code from stdlib modules, which I think is exactly what antocuni is saying
<mattip>
committed a fast-path optimization, perhaps could offset the cost by only doing it for len(utf8) < TBD
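The ASCII fast path being discussed amounts to something like the following sketch (not PyPy's actual code, and the helper name is hypothetical): if every byte is below 0x80, the data is plain ASCII and can be decoded directly, skipping the expensive pypy_char2wchar/mbstowcs round-trip.

```python
# Minimal sketch of an ASCII fast path for fsdecode; illustrative only.
def fsdecode_ascii_fast_path(data):
    if all(byte < 0x80 for byte in data):
        return data.decode('ascii')   # pure ASCII: cheap direct decode
    return None                       # caller falls back to the slow path
```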
<mattip>
this might also impact issue 3126 about slowness in open()
<antocuni>
uh, it seems that there were many failures tonight in the py3.6 branch. AttributeError: module 'threading' has no attribute 'RLock'
<antocuni>
I suppose this is related to the work which arigato did recently?
<mattip>
not sure. I am checking it now. I changed the buildslave to run from the docker image (on another machine, and paused the bencher4 buildslave)
<arigato>
I didn't commit anything related to threads, apart from a comment
<mattip>
I think there is something fishy, since when I rerun those tests on the same docker image they pass
<mattip>
Somehow the image built without boehm gc.h, so maybe that is a factor
<mattip>
since now I rebuilt the image
<mattip>
another far-out theory is that I ran the tests with parallel_runs=4, which may have swamped the buildslave
<mattip>
(so "same docker image" is only approximate, since now the image has boehm gc)
<mattip>
I wonder what version of ncurses that corresponds to. It may be easier to just install a new one into the image
jcea has joined #pypy
lritter has joined #pypy
<antocuni>
mattip: I'd vote for doing the same thing as squeaky
<antocuni>
since it has been working well for years
adamholmberg has joined #pypy
<mattip>
+1, working on it
* antocuni
is fighting with vmprof, trying to display the profile of running one cpyext test
<mattip>
fwiw, I used python2 -mcProfile -o test.profile pytest.py ...
<mattip>
and then hacked runrabbitrun to work with a more modern wx
<mattip>
antocuni: did d6f87c4ee798 make any difference to you?
<antocuni>
I tried to use cProfile and display the results in kcachegrind (using pyprof2calltree) and snakeviz. In both cases, I couldn't understand much of the result
<antocuni>
mattip: ah, I didn't try yet
<antocuni>
good idea, let me try
<mattip>
some googling also gave me this, didn't try it
<antocuni>
I tried that as well, and got nonsense again
<mattip>
:(
<antocuni>
I don't know if it's pytest doing some magic which confuses the stacktraces or what. Most of these tools report functions whose total time is more than 100% and things like that
<antocuni>
and flameprof complains with this: Warning: flameprof can't find proper roots, root cumtime is 1e-06 but sum tottime is 169.117844
<mattip>
maybe it is a python2 formatted profile output? I used runrabbitrun with python2
<antocuni>
ah, maybe
<antocuni>
d6f87c4ee798 does help, indeed: the wall-clock time went from 2m:20s to 1m:27s
<mattip>
yay
<kenaan>
mattip default 9c171d039841 /pypy/module/thread/test/test_thread.py: move slow test to its own class and skip it
<kenaan>
mattip py3.6 aa3b8c5bd232 /pypy/module/thread/test/test_thread.py: merge default into branch
<kenaan>
mattip py3.6 d91c0d495118 /pypy/module/thread/test/test_thread.py: merge default into py3.6
<mattip>
in 9c171d039841, I skip a test that took ~10 minutes when it ran (it only ran when it could *not* open 10000 threads)
<antocuni>
oh, finally managed to display the vmprof data. Unfortunately I can't share because I need to run a custom version of vmprof-server which has a higher recursion limit
<antocuni>
it seems we are spending a lot of time in build_bridge/attach_all/finish_type_2; in particular, the biggest culprit seems to be "unicode_attach"
<mattip>
I wonder if we can cheat there too, and detect ascii, and then
<mattip>
allocate a "fake" ucs2 or ucs4 buffer, set it to 0, and fill every second or fourth byte with the string char
<antocuni>
I still wonder why it is so much slower than on default, though. It might be simply that since we import more modules at startup, it has more work to do
<mattip>
instead of calling out to mbstowcs
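For pure-ASCII input, that buffer trick could look something like the sketch below (a hypothetical helper, assuming a little-endian code-unit layout): zero-fill the buffer and write each ASCII byte at every 2nd (or 4th) position, with no mbstowcs call at all.

```python
# Sketch of widening ASCII bytes into a UCS-2 (width=2) or UCS-4
# (width=4) buffer; illustrative only, little-endian layout assumed.
def ascii_to_ucs(data, width):
    buf = bytearray(len(data) * width)   # zero-initialized buffer
    buf[::width] = data                  # low byte of each code unit
    return bytes(buf)
```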
<mattip>
gotta go soon, but it seemed suspicious to me that we call _pytest.runner.call_runtest_hook 3 times
<mattip>
and they all take about the same time
<mattip>
We call once with 'setup', once with 'call' and once with 'teardown', and they seem to be doing the same amount of work
<antocuni>
I don't see it in the flamegraph, where is it?
<antocuni>
ah no, you mean pytest_runtest_setup
<mattip>
yup
<antocuni>
from the screenshot I posted above, it's clear that they are doing very different things
<antocuni>
and they don't take the same time
<antocuni>
that's why I prefer flamegraphs to other visualizations which mix things together :)
<mattip>
ok, so it was an artifact from the test I chose to profile
<mattip>
or from the tool taking total_time / ncalls and me misreading the result
ronan has joined #pypy
<antocuni>
yes, likely
<mattip>
maybe compare that flamegraph to a pypy2 one, but it probably is very different
<antocuni>
the problem is that functions like cpyext.unicodeobject.set_utf8 do a cts.cast, which requires to parse the cdecl again and again
<antocuni>
I suppose I should just apply this patch to default and be happy
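The caching fix being described (landed as 317104f1b067) boils down to memoizing the parser, roughly as in this sketch (parse_cdecl is a hypothetical stand-in for the real parser):

```python
# Sketch of a memoizing wrapper around a cdecl parser, so that repeated
# cts.cast(...)-style calls don't re-parse the same declaration string.
def memoize_cdecl_parser(parse_cdecl):
    cache = {}
    def cached_parse(decl):
        try:
            return cache[decl]
        except KeyError:
            result = cache[decl] = parse_cdecl(decl)
            return result
    return cached_parse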
marky1991 has quit [Ping timeout: 268 seconds]
<arigato>
bah
<antocuni>
is the "bah" referring to my sentence?
<kenaan>
antocuni default 317104f1b067 /pypy/module/cpyext/cparser.py: Use a cache to avoid parsing the same cdecl again and again, which is done e.g. for all the various cts.cast(......
dddddd has joined #pypy
<antocuni>
I am confused: mattip merged default into py3.6 an hour ago, and the only new commit in default is 317104f1b067
<antocuni>
ah no, never mind
<antocuni>
I was trying to merge default into an OLD commit of py3.6, that's why I got nonsense
<antocuni>
this alone takes ~4.83s on py3.6 😱 (and no noticeable time on default)
<antocuni>
I suspect that's because it goes through _io
<ronan>
I've noticed the issue with calling methods of sys.stdout before
<ronan>
but I couldn't find what exactly was slow
<ronan>
it looked more complicated than just using _io
<antocuni>
I admit I didn't investigate deeply that particular issue. But in the vmprof profile I see lots of interpreted code
<antocuni>
so there is probably something which is implemented at applevel
<ronan>
yes, CPython relies on quite a lot of app-level code at interpreter startup
<antocuni>
I am thinking of writing a "_dummy_importlib" module to be used instead of _frozen_importlib, implementing at interp-level the bare minimum which is necessary to run the tests
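A bare-minimum test-only importer along those lines could be sketched as below (all names are hypothetical): delegate straight to the host's import machinery instead of executing the whole _frozen_importlib, and fail loudly on anything fancier.

```python
import importlib

# Hypothetical sketch of a minimal __import__ replacement for tests.
def dummy_import(name, globals=None, locals=None, fromlist=(), level=0):
    # absolute imports only; anything else should opt into the real
    # _frozen_importlib machinery instead
    if level != 0:
        raise ImportError("dummy importer supports absolute imports only")
    return importlib.import_module(name)
```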
<antocuni>
also, do you know why e.g. "struct" and "_locale" are in the essential_modules? Are they needed by _frozen_importlib, or by something else?
<ronan>
I think struct is a dependency of some essential_modules
<ronan>
_locale is used in app_main
<antocuni>
I am a bit confused now: are the modules listed in essential_modules ALWAYS included when we create an objspace for testing, or is this option used only at translation time?
craigdillabaugh has joined #pypy
<ronan>
actually, it looks like it uses default_modules
<antocuni>
where is the relevant code?
<antocuni>
oh I see: pypy.tool.pytest.objspace.gettestobjspace
<ronan>
yes
<ronan>
and down the rabbit hole to pypy.config.pypyoption
<antocuni>
I wonder whether we should try to use only the bare minimum modules which are actually needed for most tests, and explicitly add additional modules when they are actually needed
<antocuni>
OTOH, the spaces are cached, so if you use too many different combinations you might end up with slower tests overall
<ronan>
yes, and I don't think builtin modules are that expensive
<antocuni>
but e.g. in py3.6, calling maketestobjspace() takes ~5 seconds :(
<antocuni>
ronan: I tried to comment out everything in essential_modules and default_modules, apart from "sys", "builtins" and "__pypy__"
<ronan>
hmm, that's not good
<antocuni>
creating the space takes 2s instead of 5s
<antocuni>
and e.g. test_boolobject still passes
jvesely has quit [Quit: jvesely]
<antocuni>
also, the distinction between essential/default/allworking no longer makes much sense nowadays
<antocuni>
e.g., I don't think anyone ever does a translation with only the default modules
<ronan>
+1
<ronan>
well, maybe for the interpreter variants, like revdb or sandboxed?