cfbolz changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://botbot.me/freenode/pypy/ ) | use cffi for calling C | "the modern world where network packets and compiler optimizations are effectively hostile"
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 240 seconds]
pilne has joined #pypy
Tiberium has quit [Remote host closed the connection]
yuyichao_ has quit [Read error: Connection reset by peer]
yuyichao has joined #pypy
oberstet has quit [Ping timeout: 240 seconds]
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 240 seconds]
marky1991_2 has quit [Ping timeout: 260 seconds]
bgola has quit [Quit: Lost terminal]
asmeurer_ has joined #pypy
lritter_ has quit [Read error: Connection reset by peer]
lritter_ has joined #pypy
tbodt has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
tbodt has joined #pypy
tbodt has quit [Client Quit]
asmeurer_ has quit [Quit: asmeurer_]
rokujyouhitoma has joined #pypy
ArneBab has joined #pypy
rokujyouhitoma has quit [Ping timeout: 248 seconds]
ArneBab_ has quit [Ping timeout: 240 seconds]
exarkun has quit [Ping timeout: 276 seconds]
exarkun has joined #pypy
redj has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 248 seconds]
gh16ito has joined #pypy
<gh16ito> Is there a separate issue tracker for pypy3?
<simpson> I don't think so, but I am not in-the-know.
pilne has quit [Quit: Quitting!]
<njs> gh16ito: the regular issue tracker has a drop-down to mark an issue as pypy3-related
<gh16ito> njs: Thanks.
<gh16ito> Interestingly, my test suite is /way/ slower with PyPy than with CPython.
exarkun has quit [Ping timeout: 240 seconds]
exarkun has joined #pypy
rokujyouhitoma has joined #pypy
lritter_ has quit [Quit: Leaving]
rokujyouhitoma has quit [Ping timeout: 246 seconds]
<njs> gh16ito: yeah, that's common
<njs> gh16ito: pypy is slower until the jit kicks in, b/c the jit adds some overhead, plus pypy's non-jit mode isn't as optimized as cpython's
<njs> gh16ito: and the jit can't really kick in on test suites, since test suites never do the same thing twice
adamholmberg has joined #pypy
cwillu has joined #pypy
adamholmberg has quit [Remote host closed the connection]
adamholmberg has joined #pypy
<njs> gh16ito: huh
<njs> gh16ito: I've gotten weird results from timeit on pypy before, I don't really trust it
adamholmberg has quit [Ping timeout: 276 seconds]
<gh16ito> Yeah, weird.
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 248 seconds]
<runciter> vmprof?
exarkun has quit [Ping timeout: 246 seconds]
exarkun has joined #pypy
lritter has joined #pypy
rokujyouhitoma has joined #pypy
<kenaan> arigo nogil-unsafe-2 55fb6aff1863 /pypy/config/pypyoption.py: With nogil, it doesn't make sense to not have threads (and it fails translation right now, FIXME)
<kenaan> arigo nogil-unsafe-2 cd60a593d1b4 /rpython/: Fixes. Now the branch seems to "work" again
<kenaan> arigo nogil-unsafe-2 13c93572cf88 /rpython/translator/c/src/: Make the current stack limits a thread-local, too
rokujyouhitoma has quit [Ping timeout: 240 seconds]
<gh16ito> Cool, njs
<gh16ito> OK, I'm off.
<gh16ito> Thanks for the help.
gh16ito has quit [Quit: Leaving]
<arigato> ok, now the nogil-unsafe-2 branch works again
<arigato> it still seems to give slowdowns, even compared with running single-threaded on the same interpreter
<Remi_M> arigato: how much slower? is it coming from the locks in the GC?
rokujyouhitoma has joined #pypy
<arigato> Remi_M: it runs maybe 40% slower instead of twice as fast, with two threads
<arigato> and I can't find why
<Remi_M> uh...
<arigato> running in gdb and pressing ctrl-c randomly I never see anything other than the normal code
<arigato> in pypy/interpreter/*
<Remi_M> try mutrace maybe? ( https://github.com/dbpercona/mutrace )
jamesaxl has joined #pypy
<arigato> I even checked that running the same interpreter single-threaded, but in two processes, is really about twice as fast
ssbr has quit [Ping timeout: 255 seconds]
<njs> arigato: maybe running under perf will reveal some smoking gun?
<njs> (I am guessing wildly)
rokujyouhitoma has quit [Ping timeout: 246 seconds]
* arigato might try both
exarkun has quit [Ping timeout: 260 seconds]
<Remi_M> arigato: is it a committed test? I can try running it here
<arigato> Remi_M: I'm running: ../../rpython/bin/rpython -O2 --no-shared targetpypystandalone.py --no-allworkingmodules
<arigato> the test itself is this, with the number of threads changeable:
exarkun has joined #pypy
<arigato> (just for fun, I'm writing all this while at more than 10'000 meters above sea level)
<Remi_M> :D
<njs> federal aviation safety requirements mandate that multithreaded code slow down and look where it's going
<arigato> "perf" has nothing unexpected to say
<arigato> njs: :-)
<LarstiQ> njs: haha :)
<arigato> I fear it's due to a (likely false) cache conflict
<njs> arigato: you should be able to check that with perf I guess
ssbr has joined #pypy
raynold has quit [Quit: Connection closed for inactivity]
<arigato> no luck, the main difference is that in single-threaded mode it runs 3.99 insn per cycle, while in two-thread mode it runs 1.10
<arigato> ah, the number "cache-references" is vastly higher
<arigato> now to figure out why...
<Remi_M> here it's even 6x slower on 2 threads than a single thread... I don't think this is false sharing
<Remi_M> (it's a debug build though)
yuyichao has quit [Ping timeout: 246 seconds]
<arigato> ah, no, it's the event "L1-dcache-load-misses" that gets immensely higher
<arigato> that smells a lot like false sharing
<njs> arigato: it sounds like 'perf c2c' is designed exactly for checking this, a better link is: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-c2c.txt
<Remi_M> it does, but I never saw that much of a slowdown even in a synthetic example. maybe I'm wrong though..
<arigato> well, synthetic examples can be arbitrarily slower in my experience
<arigato> e.g. if you continuously write and read 'x' in thread 1 and 'y' in thread 2, which are two variables in the same cache line
<arigato> then performance goes completely to hell
<Remi_M> really? IIRC I never got a 6x slowdown from that
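A minimal C sketch of the x/y scenario arigato describes above (this is not the channel's x.py; the field names and iteration counts are arbitrary): two threads each hammer their own field, but the two fields are adjacent and so almost certainly sit in the same 64-byte cache line, which makes the line ping-pong between the cores. Build with something like gcc -O2 -pthread false_sharing.c and compare the wall-clock time against running only one of the threads.

    #include <pthread.h>
    #include <stdio.h>

    /* x and y are adjacent, so they normally share one cache line */
    static struct { volatile long x, y; } shared;

    static void *bump_x(void *arg) {
        for (long i = 0; i < 100000000L; i++)
            shared.x++;
        return NULL;
    }

    static void *bump_y(void *arg) {
        for (long i = 0; i < 100000000L; i++)
            shared.y++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_x, NULL);
        pthread_create(&t2, NULL, bump_y, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x=%ld y=%ld\n", shared.x, shared.y);
        return 0;
    }

Padding the two fields onto separate cache lines usually removes most of the slowdown; a padded variant appears further down, next to the memory_alignment() discussion.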
yuyichao has joined #pypy
<arigato> out of battery soon
<arigato> there's even a power socket in front of me, but it doesn't take EU plugs
<arigato> Remi_M: I think so, but I'd have to try again
<Remi_M> I do see a ton of cycles spent in gc_save_root
<njs> brave man, doing multi-core profiling on battery
<Remi_M> (I think.. perf is tricky)
<arigato> njs: and a few translations, too :-)
<arigato> I don't see gc_save_root, it's supposed to be inlined by the translation, I think?
<Remi_M> yes, I definitely see most of the cycles being spent on gc_save_root
<Remi_M> yes it's inlined
<Remi_M> you have to look at the annotated assembly
<arigato> ah you're looking at a particular function's assembly
<Remi_M> pypy_g_PyFrame_dispatch_bytecode has the most self-time, so I looked at it
<arigato> cool
<arigato> how do you show disassembly?
<arigato> "man perf-report" is not helping
<Remi_M> I select a function, then use right-arrow, then again right-arrow on "Annotate..."
<Remi_M> (inlined C is only visible with debug symbols, naturally)
<arigato> in "perf report"?
<Remi_M> yeah
<arigato> right arrow only scrolls the screen to the right, for me
<Remi_M> uhm... Enter seems to work here too
<arigato> ah ok
<Remi_M> also, my perf-record line looks like this: "perf record --call-graph dwarf -F511 ..."
<arigato> ah, it's pypy_g_pypy_interpreter_executioncontext_ActionFlag.at_inst_ticker
<arigato> all threads modify this same global counter
<njs> heh
<arigato> Remi_M: thanks for the quick guide :-)
<Remi_M> heh, right, that's on the next line
<arigato> at least I *think* that's it
<arigato> I should try to write a small example and measure the slow-down
yuyichao has quit [Read error: Connection reset by peer]
<arigato> unsure the results would be very correct at 7% battery
<Remi_M> :P
<LarstiQ> arigato: can you ask for a knife to modify your plug so it goes in the power socket? ;)
<LarstiQ> though seriously, they might have adapters
<arigato> worth a try (more precisely, one of these options is)
<LarstiQ> :)
<arigato> yay
<arigato> (a hostess had to dig inside what looks like her own suitcase)
kenaan has quit [Read error: Connection reset by peer]
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 255 seconds]
jamesaxl has quit [Read error: Connection reset by peer]
<arigato> Remi_M: here's an example showing quite a slow-down (2.8x) for running two threads instead of one:
jamesaxl has joined #pypy
<arigato> the magic number "4" is optimized to get the worst case for me
<Remi_M> arigato: indeed, I do see a 2x slowdown :O
<Remi_M> not sure why pypy got a 6x slowdown on my machine, but there may simply be multiple of these slowdowns in there
<arigato> did you compare a lldebug and an optimized build?
<Remi_M> it was an optimised build with debug symbols; didn't compare anything else yet
<arigato> ok
<Remi_M> hm.. why do I not see a slowdown if I only increment 'foo', not 'foo2'?
<arigato> not sure
<arigato> I think the effect in this case is that you see only about half the increments, in the global variable
<Remi_M> what do you mean by 'see'? they are definitely in the code, can the CPU ignore some of the writes?
<arigato> yes, because they are racy
<arigato> ah, probably each *read* of "foo" is considered to come from the most recent write by the same cpu
<Remi_M> and by writing to a different location, we are possibly disabling that memory consistency optimisation?
<arigato> the cache line still ping-pongs between them, but most of the iterations are done without slowdown
<arigato> no, I think the problem is that if you wait a little bit, then the cpu has time to notice it should reload the value from the non-local cache line
<Remi_M> that may be the reason why I wasn't able to find that slowdown with false sharing in my earlier experiments...
<arigato> (I'm not sure though, mostly guessing)
<Remi_M> it does sound plausible and somewhat familiar now...
<Remi_M> well, to make this a bit more mysterious: if foo and foo2 both are declared globally, the slowdown is much smaller...
<arigato> (also, note that my x.py is very minimal: you have to press enter after all results are printed, and measure the "user" time; the "user" time is then expected to be 2x as much if you run two threads)
oberstet has joined #pypy
<Remi_M> ah no, if I move the two global variables away from each other (into separate cache lines), I (again) get a (tremendous) slowdown of 15x
_main_ has joined #pypy
kenaan has joined #pypy
<kenaan> arigo nogil-unsafe-2 b03810fcb5c1 /pypy/: Don't decrement the ticker at all in the nogil-unsafe-2 branch
<Remi_M> this is fun. on 4 threads the slowdown is 40x :)
__main__ has quit [Ping timeout: 255 seconds]
_main_ is now known as __main__
<arigato> ah, so it might be two global variables...
__main__ has quit [Read error: Connection reset by peer]
__main__ has joined #pypy
<arigato> obscure, it seems that the next big slow-down is caused by the read of the next byte from the bytecode string
<Remi_M> I guess it's still something like any write to a different location/cache-line disables whatever optimisation is otherwise valid
<arigato> ah, and also in FOR_ITER, a read from an array of pointers
exarkun has quit [Ping timeout: 276 seconds]
exarkun has joined #pypy
<arigato> unless it's badly reported and it occurs in the previous line, which might be along the "jmp"
<Remi_M> here, most time is spent on pyframe.last_instr = ... (I think)
<arigato> "ah"
<Remi_M> maybe the read of pyframe.debugdata is also involved
<Remi_M> is pyframe shared between threads?
<arigato> no
<Remi_M> and yes, in FOR_ITER I see cycles spent on a write-barrier test on an array with the index of pyframe.valuestackdepth
<Remi_M> (maybe it's the read on pyframe that gets delayed until the test)
<arigato> ah yes. there are two high lines in FOR_ITER, for me
<Remi_M> the next highest one for me is reading ExcData.exc_type, but it's 10x less cycles already
<arigato> ah
<arigato> for me it's the read out an array of pointers, near the start of the function
<arigato> (recompiling now with debugging symbols)
inhahe_ has quit [Ping timeout: 260 seconds]
inhahe_ has joined #pypy
<Remi_M> for comparison: https://bpaste.net/show/952af1d3a4df
<arigato> ah, yes, I'm also seeing something similar to you now
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 246 seconds]
<arigato> maybe it's false sharing of the header of the locals_cells_stack_w list and some other object belonging to the other thread?
<Remi_M> maybe..
<arigato> it looks possible, because this new profile I'm looking at is different from the old one
<arigato> which would be likely if it's random sharing
<Remi_M> something that may be relevant later (or if perf lies to us): remember_young_pointer_from_array2 acquires a lock in the medium-fastpath...
<arigato> "ah"
<Remi_M> here, the profile remains pretty consistent between runs
<arigato> I guess it's because I recompiled (to have debug symbols)
<arigato> (another wild guess, of course)
<Remi_M> if we increased the minimum obj size to 64, we would only see true cache conflicts... :)
<arigato> no, there is already logic to align nursery blocks to at least 256 bytes
<Remi_M> but not the old generation?
<arigato> ah
<arigato> unlikely, indeed
<arigato> I think that right now, it mostly works OR it doesn't work at all, by chance (one of the two)
<arigato> depending on the order in which we trace
<Remi_M> you mean depending on where we allocate objs during minor-gc trace?
<arigato> yes, it seems you're correct
<arigato> (yes)
<arigato> (or no, I mean, if we trace object dependencies depth-first or breadth-first)
<arigato> (depth-first tends to keep objects from the same thread grouped together)
<Remi_M> ah, would be interesting to try different strategies there :)
<arigato> here's a modified y.py:
<arigato> it creates new frames regularly
<arigato> no more systematic conflict
<arigato> now I get a perfect 2x
<arigato> ...no wrong
<Remi_M> I need to go now, but yes, seems like I get 2x too
Remi_M has quit [Quit: See you!]
<arigato> I still get 2.5x :-(
<arigato> or 2.8x
jamesaxl has quit [Read error: Connection reset by peer]
jamesaxl has joined #pypy
Tiberium has joined #pypy
<arigato> ok, I get a factor between 2.1x and 5.4x(!) depending on what I replace '10000' with
<arigato> seems that we really need to take care of false conflicts
<arigato> I guess the worst case is when sum_a_bit() makes about one frame per minor collection; then as soon as there is this minor collection, there is one pyframe (with all dependent data) per frame,
<arigato> and the dependent data is of a different size so it goes in some other pages
<arigato> so in the end *everything* is in the same cache line as the same thing from the other thread
marky1991_2 has joined #pypy
rokujyouhitoma has joined #pypy
marky1991 has joined #pypy
marky1991_2 has quit [Ping timeout: 276 seconds]
rokujyouhitoma has quit [Ping timeout: 276 seconds]
exarkun has quit [Ping timeout: 248 seconds]
exarkun has joined #pypy
oberstet has quit [Ping timeout: 240 seconds]
Remi_M has joined #pypy
<Remi_M> arigato: another data point confirming your conclusion: always returning 64 in rffi_platform.memory_alignment() seems to give nice scaling for the original example too
<arigato> ah ok
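A hedged C analogue of that rffi_platform.memory_alignment() == 64 experiment, not the actual RPython change: the same two-thread counter benchmark as the earlier sketch, but with each counter padded out to its own 64-byte cache line via C11 _Alignas, which normally restores close-to-linear scaling. Build with gcc -std=c11 -O2 -pthread.

    #include <pthread.h>
    #include <stdio.h>

    /* each counter occupies its own 64-byte slot, so the two threads
       never touch the same cache line */
    struct padded { _Alignas(64) volatile long value; };

    static struct padded counters[2];

    static void *bump(void *arg) {
        volatile long *p = &((struct padded *)arg)->value;
        for (long i = 0; i < 100000000L; i++)
            (*p)++;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, bump, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }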
rokujyouhitoma has joined #pypy
marky1991 has quit [Ping timeout: 248 seconds]
<arigato> the current ordering of tracing is a bit too random to separate the threads
<arigato> it starts by scanning all threads' stacks
<arigato> so it will copy the objects directly referenced
<arigato> but put further references in a single list
<arigato> which is processed later
nimaje1 has joined #pypy
nimaje1 is now known as nimaje
nimaje has quit [Killed (verne.freenode.net (Nickname regained by services))]
<arigato> there are also issues (probably less important but unknown) of objects that are harder to attribute to a given thread:
<arigato> for example, if the reference to it is stored in an old object
rokujyouhitoma has quit [Ping timeout: 240 seconds]
<Remi_M> I wonder if heap fragmentation will over time "solve" the problem through random placement of new allocations, or if it will make it worse
<arigato> heh, maybe randomization is the simple answer for now:
<arigato> for each page, instead of having the linked list of free blocks in order, make it in random order every time
<Remi_M> it's worth a try :)
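A rough sketch of that randomisation idea, in plain C rather than the RPython of minimarkpage.py (struct slot and build_free_list are made-up names): when rebuilding a page's free list of equal-sized slots, link the slots in shuffled order instead of address order, so that consecutive allocations, possibly made on behalf of different threads, are less likely to land in the same cache line.

    #include <stdlib.h>

    struct slot { struct slot *next_free; /* ... object payload ... */ };

    /* link the n free slots of a page in random order (Fisher-Yates) */
    static struct slot *build_free_list(struct slot *slots, size_t n)
    {
        if (n == 0)
            return NULL;
        size_t *order = malloc(n * sizeof(size_t));
        if (order == NULL)
            return NULL;                 /* sketch: no real error handling */
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        struct slot *head = NULL;
        for (size_t i = 0; i < n; i++) {
            slots[order[i]].next_free = head;
            head = &slots[order[i]];
        }
        free(order);
        return head;
    }

Whether this helps in practice depends on the trade-off arigato mentions a bit later: closely related objects ending up in the same cache line is normally a good thing.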
<Remi_M> hm.. but I guess it is sorted to coalesce free memory ranges?
<arigato> no
<arigato> minimarkpage.py only frees a page when all slots are free (and they are all of the same size)
<Remi_M> ah that's for small objects
<arigato> yes
<Remi_M> right
<arigato> it's all obscure, because it's a generally good thing if closely related objects end up in the same cache line
<Remi_M> seems like I always end up at the "Hoard memory allocator" when looking for false-sharing avoidance in allocators. of course that is probably not implemented in an afternoon...
yuyichao has joined #pypy
oberstet has joined #pypy
<arigato> Remi_M: I think we should get something roughly reasonable if we reorder tracing to occur thread after thread, and after a thread we simply waste a few allocations to make sure we're past the 64/128 bytes limit
<arigato> so it should help if there is no fragmentation, and if there is enough fragmentation we just rely on randomness
Rhy0lite has joined #pypy
rokujyouhitoma has joined #pypy
<Remi_M> yes, that may be enough
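For the "waste a few allocations" part, the arithmetic is just rounding the allocation pointer up to the next cache-line boundary after finishing one thread's objects; a hypothetical helper (the real GC works on its own nursery/arena pointers):

    #include <stdint.h>

    /* round an allocation pointer up to the next 64-byte boundary so the
       next thread's objects start on a fresh cache line */
    static inline uintptr_t round_up_to_cache_line(uintptr_t p)
    {
        return (p + 63) & ~(uintptr_t)63;
    }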
rokujyouhitoma has quit [Ping timeout: 276 seconds]
Rhy0lite has quit [Ping timeout: 255 seconds]
Rhy0lite has joined #pypy
adamholmberg has joined #pypy
marky1991 has joined #pypy
marky1991 has quit [Changing host]
marky1991 has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 255 seconds]
Guest51964 has quit [Remote host closed the connection]
marvin has joined #pypy
marvin is now known as Guest44737
lritter has quit [Ping timeout: 240 seconds]
<fijal> arigato: still online?
<arigato> fijal: yes
<arigato> should land on time
<fijal> I'm in traffic but there are 3 accidents on the way
<arigato> meh, not fun
<fijal> +27605722238 is my new number
<fijal> Not much I can do at that stage
<arigato> I'll wait in the shop/bar close to the entrance, if you're ok with that
<fijal> I suggest woolworth
<fijal> Has the best coffee
yuyichao has quit [Ping timeout: 240 seconds]
<fijal> I should be on time tho
<arigato> ok
bgola has joined #pypy
yuyichao has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 260 seconds]
jiffe has quit [Quit: WeeChat 1.9]
jiffe has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 240 seconds]
adamholmberg has quit [Remote host closed the connection]
adamholmberg has joined #pypy
adamholmberg has quit [Remote host closed the connection]
adamholmberg has joined #pypy
adamholm_ has joined #pypy
adamholmberg has quit [Ping timeout: 246 seconds]
michaelgreene has quit [Quit: Leaving]
rokujyouhitoma has joined #pypy
adamholmberg has joined #pypy
rokujyouhitoma has quit [Ping timeout: 276 seconds]
adamholm_ has quit [Ping timeout: 246 seconds]
<mattip> heh, PyList_Check(op) tests op->ob_type->tp_flags & Py_TPFLAGS_LIST_SUBCLASS, not any kind of isinstance() or issubclass()
<mattip> so we are too strict in many of the PyList_* functions, which in CPython work with nparrays too
<mattip> on a more positive note, I easily got a pull request accepted into pandas that avoids sys.getsizeof,
<mattip> and found an issue where they assume set(obj) is sorted, so now we are down to 38 failures
adamholmberg has quit [Remote host closed the connection]
adamholmberg has joined #pypy
<mattip> likewise for PyInt_Check, PyDict_Check, PyLong_Check, ...
adamholmberg has quit [Remote host closed the connection]
<mattip> however the Py*_CheckExact macros check whether ob_type is exactly the corresponding type object pointer
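To make the distinction concrete, a small C-extension-style fragment using CPython's public macros (check_kind is just a hypothetical helper, not part of any real module):

    #include <Python.h>

    static const char *check_kind(PyObject *op)
    {
        if (PyList_CheckExact(op))
            return "exactly a list";      /* ob_type == &PyList_Type */
        if (PyList_Check(op))
            return "a list subclass";     /* tp_flags has Py_TPFLAGS_LIST_SUBCLASS */
        return "not a list at all";       /* no isinstance()-style protocol check */
    }

Both macros are pure pointer/flag tests; neither goes through a Python-level isinstance().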
adamholmberg has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 255 seconds]
adamholm_ has joined #pypy
adamholmberg has quit [Read error: Connection reset by peer]
tbodt has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 240 seconds]
redj has quit [Quit: No Ping reply in 180 seconds.]
redj has joined #pypy
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 248 seconds]
yuyichao has quit [Read error: Connection reset by peer]
yuyichao has joined #pypy
<kenaan> mattip default 40ee3c492e28 /pypy/module/cpyext/: refactor 9ddefd44f80d handling pre-existing exceptions, add tests, still not bulletproof
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 240 seconds]
oberstet has quit [Ping timeout: 248 seconds]
antocuni has joined #pypy
Tiberium has quit [Remote host closed the connection]
Tiberium has joined #pypy
cwillu has quit [Ping timeout: 248 seconds]
marky1991 has quit [Remote host closed the connection]
marky1991 has joined #pypy
marky1991 has quit [Read error: Connection reset by peer]
jamesaxl has quit [Quit: WeeChat 1.8]
rokujyouhitoma has joined #pypy
rokujyouhitoma has quit [Ping timeout: 248 seconds]
Rhy0lite has quit [Quit: Leaving]
slackyy has joined #pypy
tbodt has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
adamholm_ has quit [Remote host closed the connection]
adamholmberg has joined #pypy
adamholmberg has quit [Ping timeout: 248 seconds]
Tiberium has quit [Ping timeout: 246 seconds]
rokujyouhitoma has joined #pypy
Tiberium has joined #pypy
tbodt has joined #pypy
rokujyouhitoma has quit [Ping timeout: 240 seconds]
nimaje has joined #pypy
Tiberium has quit [Read error: Connection reset by peer]
rokujyouhitoma has joined #pypy
tbodt has quit [Read error: Connection reset by peer]
tbodt has joined #pypy
antocuni has quit [Ping timeout: 276 seconds]
rokujyouhitoma has quit [Ping timeout: 240 seconds]
tormoz has quit [Read error: Connection reset by peer]
tormoz has joined #pypy
pilne has joined #pypy
lritter has joined #pypy
yuyichao has quit [Ping timeout: 240 seconds]
raynold has joined #pypy
slackyy has quit [Ping timeout: 264 seconds]
exarkun has quit [Ping timeout: 246 seconds]
exarkun has joined #pypy
yuyichao has joined #pypy