cfbolz changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end ) | use cffi for calling C | if a pep adds a mere 25-30 [C-API] functions or so, it's a drop in the ocean (cough) - Armin
<antocuni>
arigato: FWIW it's unrelated to hpy itself, but it seems that one of the bottlenecks of the benchmark is rffi.wcharpsize2utf8
tsaka__ has joined #pypy
<cfbolz>
antocuni: there are definitely a lot of such corner cases in pypy3
<cfbolz>
I fixed a few in the summer
<antocuni>
ok, good to know
<antocuni>
so I should not be surprised to find them
<cfbolz>
nope, unfortunately not
tsaka__ has quit [Ping timeout: 240 seconds]
<arigato>
nothing obviously wrong with wcharpsize2utf8...
<antocuni>
if I analyzed it correctly in callgrind, it seems to be ~30-40% slower than the cpython equivalent
<arigato>
it's probably faster in CPython because it's building an ascii or latin1 or UCS2 string
<arigato>
from a UCS4 string
<antocuni>
ah, I see
tsaka__ has joined #pypy
<antocuni>
I should find a json file which contains lots of long strings with an emoji at the end 😅
<arigato>
yes :-)
<arigato>
or also, find a program that does various things with the strings, instead of just building the data structure and throwing it away
<antocuni>
it might be enough to write a benchmark which redumps objects into json again
<arigato>
unclear, the reverse transformation from utf8 to ucs4 is also a bit more costly than from ascii or ucs2
<arigato>
we really win a lot with our utf8 internals if data is always utf8
<antocuni>
no, to dump into json you need to convert to utf8
<arigato>
uh
<antocuni>
(I think?)
<arigato>
I thought json files were supposed to be ucs2, but maybe I'm wrong
marmoute_ is now known as marmoute
<arigato>
ah no, they are supposed to be utf8
<arigato>
OK then why is wcharpsize2utf8 even called?
<arigato>
the ujson library decodes utf8 to wchar_t?
<arigato>
and we need to re-encode it in utf8?
<antocuni>
it seems so
<arigato>
no wonder the built-in json can do better
<antocuni>
ujson is written as a generic C library, plus a set of callbacks which are called to construct the resulting data structures
<antocuni>
for strings, we have Object_newString which takes wchar_t* and calls HPyUnicode_FromWideChar
<antocuni>
well, I suppose it made sense when they wrote it for CPython
tsaka__ has quit [Ping timeout: 265 seconds]
<arigato>
long ago
<arigato>
it doesn't really make sense from CPython 3.3 onwards
<arigato>
but I can see the reasoning too, which is that the exact details of the decoding of utf8 are probably slightly different than CPython's own (e.g. accepting/rejecting surrogates)
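As a small illustration of the surrogate point above: Python's strict UTF-8 decoder rejects the byte sequence that would encode a lone surrogate, while the `surrogatepass` error handler accepts it. This is only a sketch of how two decoders can disagree on the same input. The helper names are hypothetical and not taken from ujson, HPy, or PyPy, and how ujson's own decoder treats such input is not shown here.

```python
from typing import Optional

# The three bytes that a UTF-8 encoder would produce for U+D800,
# a lone high surrogate. Strict UTF-8 forbids this sequence.
ENCODED_SURROGATE = b"\xed\xa0\x80"


def decode_strict(data: bytes) -> Optional[str]:
    """Decode with the default strict handler; return None on rejection."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lenient(data: bytes) -> str:
    """Decode while letting encoded surrogates through unchanged."""
    return data.decode("utf-8", errors="surrogatepass")


strict_result = decode_strict(ENCODED_SURROGATE)    # rejected by strict decoding
lenient_result = decode_lenient(ENCODED_SURROGATE)  # accepted as a lone surrogate
```

A library that ships its own decoder can pick either behaviour, which is one reason it might hand over already-decoded `wchar_t` data instead of raw UTF-8 bytes.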