cfbolz changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end ) | use cffi for calling C | if a pep adds a mere 25-30 [C-API] functions or so, it's a drop in the ocean (cough) - Armin
<arigo>
but looking at the implementation of ffi_prep_closure(), then yes, I can trivially replace it with ffi_prep_closure_loc() anyway, which works fine if we assume that no-one uses a libffi so old that it doesn't have ffi_prep_closure_loc()
<Alex_Gaynor>
arigo: looks like ffi_prep_closure_loc has been in the header for 11 years, so I suppose it's fine :-)
<whitequark>
hey folks, i have an optimization question
<whitequark>
nmigen has an RTL simulator that converts HDL code into similarly structured Python code that simulates its behavior. this code is pure bignum arithmetic (in the sense that it sometimes needs to use numbers beyond 64 bits) operating on a chunk of state
<whitequark>
for every HDL signal (think "variable"), this state consists of a pair of numbers (curr, next)
<whitequark>
the question is: what would be the best way for pypy performance to represent this state?
<whitequark>
I have three contenders. AoS (the current implementation): an array of class instances with __slots__ = ('curr', 'next')
<whitequark>
SoA: one class with equally sized curr and next arrays
<whitequark>
both of these are indexed by numbers, so the generated code looks something like `slots[0].next = ~slots[1].curr`, with `slots` injected via `exec`
<whitequark>
and the third one is to make it work like a closure, i.e. have a dict with keys like `slot_0`, `slot_1`, where the values are instances of a class that has `curr` and `next` fields, and provide that dict as locals to `exec`
<whitequark>
thoughts?
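(A minimal sketch of the three layouts being weighed; the class and variable names here, like `Signal` and `State`, are illustrative, not nmigen's actual code.)

```python
# Sketch of the three candidate state layouts; names are made up for
# illustration, not taken from nmigen itself.

# 1. AoS (current): an array of fixed-field instances, indexed by number.
class Signal:
    __slots__ = ('curr', 'next')
    def __init__(self):
        self.curr = 0
        self.next = 0

slots = [Signal() for _ in range(2)]
exec("slots[0].next = ~slots[1].curr", {"slots": slots})

# 2. SoA: one object holding parallel curr/next arrays, indexed by number.
class State:
    def __init__(self, n):
        self.curr = [0] * n
        self.next = [0] * n

state = State(2)
exec("state.next[0] = ~state.curr[1]", {"state": state})

# 3. Closure-like: one injected name per signal, so the generated code
#    never does an array lookup.
names = {"slot_0": Signal(), "slot_1": Signal()}
exec("slot_0.next = ~slot_1.curr", names)
```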
<whitequark>
i don't really use pypy myself, but our downstream projects say very good things about it, so i thought i'd ask to make sure i'm not going to paint myself into a corner architecturally
<tos9>
whitequark: (folks from EU time are possibly asleepish already)
<tos9>
the broad answer is always probably "measure"
<tos9>
but the broader answer to the best of my understanding is ... __slots__ mostly doesn't do anything on PyPy (in a good way)
<tos9>
a class will be better than dicts generally
<tos9>
given you have fixed fields
<tos9>
PyPy may (I think probably should) figure out when bignums aren't needed based on the operations, and compile down to machine ints
<tos9>
and same for compiling the list down to a plain array of machine ints in memory
<tos9>
so yeah basically the rule in PyPy is "do the simple thing, usually PyPy can make that fast"
<tos9>
and if it's not fast enough at that point we can help look at the generated code to see why
<Alex_Gaynor>
SoA would be my guess at something likely to perform best, but you'd have to measure to be sure.
<whitequark>
tos9: regarding dicts: the actual state object for an individual signal would always be a class (if present at all)
<whitequark>
what i'm wondering about is whether it makes sense to inject locals
<whitequark>
to avoid an array lookup
<whitequark>
i suspect (haven't measured yet, needs some refactoring) that this would be good for cpython, which is what most people use and which is the most important target
<whitequark>
but i'd like to not do something that pessimizes pypy
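(For concreteness, a hedged sketch of what "injecting locals" could look like when a whole step function is compiled with `exec`; the one-line body and the helper names are made up.)

```python
# Illustration only: compile a generated step() with exec, either
# indexing an injected array or binding one injected name per signal.
class Signal:
    __slots__ = ('curr', 'next')
    def __init__(self):
        self.curr = 0
        self.next = 0

signals = [Signal() for _ in range(2)]

# Variant A: array lookup per access.
ns_a = {"slots": signals}
exec("def step():\n    slots[0].next = ~slots[1].curr\n", ns_a)
step_indexed = ns_a["step"]

# Variant B: injected per-signal names; slot_0/slot_1 resolve as
# globals of the compiled function, with no subscript per access.
ns_b = {"slot_0": signals[0], "slot_1": signals[1]}
exec("def step():\n    slot_0.next = ~slot_1.curr\n", ns_b)
step_named = ns_b["step"]

step_indexed()
step_named()
```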
<whitequark>
as for "fast enough": it's never fast enough :) the python simulator is far too slow for complex designs, so i had to implement a translator to C++, which can be built and loaded with ctypes
<whitequark>
but that can't be the only option because of windows and other environments that have compiler issues
<pmp-p>
then use wasm
<whitequark>
ah so the reason i use C++ is to use the C++ compiler as a macro expansion engine
<whitequark>
so first i'd have to compile *clang* targeting wasm to wasm and ship *that*
<tos9>
whitequark: injecting locals doesn't do anything on PyPy speedwise
<tos9>
personally I'm biased, but my general strategy is usually "make Python code that's as fast as possible on PyPy *first*"
<whitequark>
tos9: as in, `slots[0]` and `slot_0` referring to the same object in the generated code would work roughly as fast?
<tos9>
reason being people who care about performance often use it :D -- but second reason being, if you want to make it faster on CPython, you can then take the whole chunk and have a CFFI extension
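(For reference, a toy of the kind of CFFI extension tos9 is describing; the module name `_hotpath` and the C body are placeholders, not anything nmigen actually generates.)

```python
# build_hotpath.py -- toy CFFI API-mode extension.  The module name and
# the C function are stand-ins for whatever the real hot chunk would be.
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("unsigned long long step_once(unsigned long long state);")
ffibuilder.set_source("_hotpath", r"""
    /* stand-in for generated simulation code */
    unsigned long long step_once(unsigned long long state) {
        return ~state;
    }
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

# after building:  from _hotpath import lib;  lib.step_once(0)
```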
<pmp-p>
whitequark: you can compile python to wasm
<tos9>
whitequark: correct.
<pmp-p>
no need for clang
<pmp-p>
you just need to fully annotate your python code
<whitequark>
tos9: thanks, that's helpful! so the problem here is that the people who care about speed will use the C++ backend anyway
<tos9>
whitequark: right, exactly, that's normal (that the CPython folks are using some non-Python-based backend of whatever thing)
<whitequark>
the C++ code i'm generating runs at about 1/2-1/4 speed of single-threaded verilator, which is likely unbeatable
<tos9>
oh sorry, you mean a C++ backend not connected to Python at all? I mean you could connect it (again via CFFI)
<whitequark>
it's connected to python
<tos9>
ah, k
<tos9>
then yeah I mean for CPython the fastest path is "make the slow stuff run in not-Python" anyhow
<whitequark>
what i mean is that i'm completely certain that pypy can't beat the c++ backend, because i'm not feeding pypy the same thing i'm feeding the c++ compiler
<whitequark>
there's an intermediate HDL-specific optimization stage that i cannot do in python without a massive amount of duplicated effort
<whitequark>
the reason i care about CPython speed is that it's going to be the baseline for folks new to nmigen, especially on windows
<whitequark>
it doesn't need to be very fast, but it can't be too slow
<tos9>
yeah sure, obviously caring about CPython speed is important (probably as you say more important given that's what most folks will use, for better or worse)
<whitequark>
the reason the c++ backend exists is that the people who use pypy don't find it sufficient
<whitequark>
i think they currently have 40 minute CI times or something, which is better than multiple hour CI times
<whitequark>
but... not sufficient
<tos9>
that's the reason the *C++* backend exists?
<whitequark>
*a* reason
<whitequark>
the C++ backend goes through Yosys, so you can also include Verilog code in the simulation
<whitequark>
the C++ backend also exists for people who don't use nMigen at all; it is a contribution back to the broader community
<whitequark>
but the immediate impetus for writing it was to improve the speed on all Python implementations, yes
<tos9>
Normally for PyPy the pure python one would be faster than a ctypes one, AFAIK.
<tos9>
But it's very hard to generalize these things, hence "measure"
<whitequark>
can that be right? the overhead of ctypes calls is fixed, but the size of a design is unlimited
<tos9>
Oh, was that what you were saying about the C++ backend doing different things than the pure python one?
<whitequark>
no
<whitequark>
two different things
<whitequark>
what the C++ backend does differently internally from the Python one is that it does some netlist optimizations, giving the C++ compiler a *lot* more visibility into e.g. variable lifetimes
<whitequark>
what i'm talking about now is the general way people use the simulation
<tos9>
Basically -- PyPy can't JIT anything that isn't Python code, and worse, PyPy will in many cases be quite slow at anything that uses the CPython C API, because all it has is an emulation layer that mimics it as best it can
<tos9>
So in general with PyPy you want as much code to be Python as you can, so that when it's used, PyPy can look inside, bridge lots of stuff, make everything fast
<whitequark>
they compile a (typically very large) amount of HDL using some backend, python or c++ or whatever, and then interact with it using a small API surface and a small number of calls
<whitequark>
there is no CPython C API involved anywhere, of course
<whitequark>
this is a typical way in which a design would be driven
<whitequark>
so you have an extremely large amount of generated code that is all stuffed into `cxxrtl_step`, and then a small amount of python code driving it that does almost no useful work
<tos9>
ok yeah sorry, so not CPython C API, but just using ctypes is slower on PyPy than on CPython still IIRC.
<whitequark>
in this case it toggles the clock, it would also probably read something in most simulations
<tos9>
(As opposed to CFFI)
<whitequark>
but the idea is that you execute literally a few lines of python code and then you go run hundreds of kilobytes of machine code generated from c++
<whitequark>
oh, i can use CFFI if that's better, that was just a proof of concept so i went for ctypes
<whitequark>
whatever works best, the generated library exports a conservative C API
<whitequark>
there's no point in letting pypy look into cxxrtl_step because that function is completely self-contained. it can (and should) be compiled separately
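(A hedged sketch of driving such a library from Python with CFFI in ABI mode. Only `cxxrtl_step` is named in the conversation; the handle type, the create/clock helpers, their signatures, and the library name are assumptions for illustration.)

```python
# Sketch only: load the generated shared library with CFFI (ABI mode)
# rather than ctypes.  cxxrtl_step appears above; everything else here
# (opaque handle, design_create, design_set_clk, design.so) is assumed.
import cffi

ffi = cffi.FFI()
ffi.cdef("""
    typedef struct sim_handle *handle_t;     /* assumed opaque handle */
    handle_t design_create(void);            /* assumed constructor   */
    void     design_set_clk(handle_t, int);  /* assumed clock setter  */
    int      cxxrtl_step(handle_t);          /* named in the log      */
""")
lib = ffi.dlopen("./design.so")              # assumed library name

# The typical driving loop: a few lines of Python per cycle, then the
# generated machine code does all the work inside cxxrtl_step().
dut = lib.design_create()
for _ in range(1000):
    lib.design_set_clk(dut, 1)
    lib.cxxrtl_step(dut)
    lib.design_set_clk(dut, 0)
    lib.cxxrtl_step(dut)
```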
<tos9>
then yeah if I follow sounds like what you have should probably be fine for both then
<whitequark>
both of them as in, pypy and cpython?
<tos9>
yeah
* whitequark
nods
<whitequark>
alright, so my conclusion is that i should benchmark both SoA and local injection approaches (and of course AoS is the current solution), and probably go ahead with the one faster on CPython
<whitequark>
since they're likely going to be the same on pypy. very nice.
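(A rough timeit harness for that benchmark plan; the one-line "design" bodies are tiny stand-ins, and a real run would use generated code of realistic size on both CPython and PyPy.)

```python
# Toy benchmark of the three layouts discussed above; the statement
# bodies are stand-ins for real generated designs.
import timeit

class Signal:
    __slots__ = ('curr', 'next')
    def __init__(self):
        self.curr = 0
        self.next = 0

class State:
    def __init__(self, n):
        self.curr = [0] * n
        self.next = [0] * n

N, REPS = 1000, 1_000_000
slots = [Signal() for _ in range(N)]

cases = {
    "AoS  ": ("slots[0].next = ~slots[1].curr", {"slots": slots}),
    "SoA  ": ("state.next[0] = ~state.curr[1]", {"state": State(N)}),
    "named": ("slot_0.next = ~slot_1.curr",
              {"slot_%d" % i: s for i, s in enumerate(slots)}),
}
for name, (stmt, ns) in cases.items():
    print(name, timeit.timeit(stmt, globals=ns, number=REPS))
```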
<whitequark>
what they have in common is that they rely heavily on inlining and local optimizations: the bignum ("arbitrary-size integer" is more correct, but unwieldy) operations are described with fairly abstract algorithms, yet still have to generate nice code
<whitequark>
i'm basically replacing the gnarliest parts of verilator with yosys and clang, implementing only the actually interesting netlist manipulation parts
<whitequark>
it's a lot easier to get things correct with this approach, at the cost of somewhat high but not extreme compile times