<enebo>
lopex: yeah thanks for looking into it. I did not see that problem coming!
<lopex>
enebo: at least not a regression
<enebo>
it looked like some fundamental off-by-one thing
<enebo>
super happy it was something a bit weird
<lopex>
well, it was a hidden perf bug really
<lopex>
we wouldn't have caught it without it
<enebo>
lopex: yeah nice side-benefit
<enebo>
lopex: and so nearly all regexps will benefit from this right since almost all are us-ascii
<lopex>
yes
<lopex>
I
<enebo>
lopex: nice
<lopex>
I'll take another bench
<enebo>
lopex: so I have wondered a lot about why so much work in MRI has been to assume regexps and symbols are US-ASCII if they can possibly fit. I knew it was that 7bit is a faster code path but it felt weird semantically
<enebo>
Like if I have an encoded string that is not US-ASCII but it isAscii and 7bit clean we should just use the fast path
<enebo>
So it seems strange they just force the encoding to me
<lopex>
enebo: mostly due to specialized ascii bytecodes for quantifiers
<lopex>
er, singlebyte
<lopex>
that's no joke
<enebo>
yes!!!!!!
<enebo>
lopex: you know what you found?
<enebo>
lopex: YOU KNOW WHAT YOU FOUND?
<enebo>
I wanna hear you say it!
<lopex>
no I dont think so
<enebo>
DARK MATTER lopex DARK MATTER
<lopex>
("X" * 1000) + "aaa" =~ /a+/i 5x faster
<lopex>
oh let me try more backtracking
<lopex>
even more diff for [a]+
<lopex>
yeah, might be big
<lopex>
enebo: ("_" * 1000) + "" =~ /[a-z]+/i
<lopex>
guess how much faster
<lopex>
enebo: 35x faster
<enebo>
lopex: calculate the big O
<lopex>
enebo: no, same number
<lopex>
char class perf
<enebo>
it is just that it can bump by 1 really quick versus calc codepoints?
<lopex>
actually no idea why that much
<enebo>
lopex: I was thinking we could save codepoint indices on a string if we walk it once into a second array
<enebo>
ith would be char_array[locations[i]]
<enebo>
eager allocation of n codepoint array for string but no doubt regexps walk a string once in many cases
<lopex>
why codepoint array ?
<enebo>
well if you need to walk back and forth on mbc char data you want to do it without constantly reasking for how long a codepoint is
<enebo>
or does joni do that?
<lopex>
it uses length
<enebo>
lopex: I mean instead of constantly using: enc.prevCharHead(bytes, str, at, end); or one for other direction you could have an array which will just tell you where start index is
<lopex>
why prev ?
<enebo>
no I am just showing you a single method...joni constantly is recalculating previous or next char
<enebo>
if it was eager for mbc strings it would just be an extra array indirection
<enebo>
and creating the array once
<lopex>
why creating an array ?
<lopex>
I'm lost
<enebo>
to save all the offsets of what chars are
<enebo>
in the bytearray
<enebo>
so next char is a single array lookup
<enebo>
vs going into jcodings and partially walking the byte data over and over
<lopex>
looks like my recent opt is being triggered
<lopex>
look at opCClassMIX and opCClassMIXSb
<lopex>
sb is specialized for singlebyte
<enebo>
so did what I say not make sense still?
<lopex>
entire coderange check is being skipped
<enebo>
your opt for single byte data is fine but I am talking about optimizing mbc
<lopex>
enebo: that's one thing, but I'm lost about array creation
<enebo>
I walk the bytearray. For each char I find I make an entry in a second array telling me where it starts. joni does not need to keep walking and using jcodings to go forward or backward
<lopex>
oh ok
<enebo>
so I make a second array with the indices of every char in the bytearray
<lopex>
but not worth it
<enebo>
no?
<lopex>
for every string ?
<enebo>
or not worth it in some cases
<lopex>
you're talking about a Boyer-Moore-like map
<lopex>
so skip map
<lopex>
right ?
<enebo>
beside not knowing that name maybe
<lopex>
I mean not the algorithm, just an idea
<enebo>
lopex: I took algo classes 30 years ago
<lopex>
enebo: if you look for "alice" for example and you hit "a"
<lopex>
the map will have "e" at 5
<lopex>
so you dont walk byte by byte
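The "skip map" lopex mentions can be illustrated with a Horspool-style shift table, a simplified Boyer-Moore variant. This is only a sketch of the general idea, not joni's search code; the names `SkipSearch`, `buildSkip`, and `indexOf` are hypothetical.

```java
import java.util.Arrays;

class SkipSearch {
    // Hypothetical helper: for each byte value, how far the search window
    // may jump when that byte is the last byte of the current window.
    static int[] buildSkip(byte[] pattern) {
        int[] skip = new int[256];
        // A byte that does not occur in the pattern lets us jump a full pattern length.
        Arrays.fill(skip, pattern.length);
        // The last pattern byte is excluded so that a shift is always at least 1.
        for (int i = 0; i < pattern.length - 1; i++) {
            skip[pattern[i] & 0xFF] = pattern.length - 1 - i;
        }
        return skip;
    }

    // Returns the byte index of the first occurrence of pattern in text, or -1.
    static int indexOf(byte[] text, byte[] pattern) {
        if (pattern.length == 0) return 0;
        int[] skip = buildSkip(pattern);
        int pos = 0;
        while (pos <= text.length - pattern.length) {
            // Compare right to left inside the current window.
            int j = pattern.length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;
            // Jump by the table entry for the window's last byte
            // instead of advancing one byte at a time.
            pos += skip[text[pos + pattern.length - 1] & 0xFF];
        }
        return -1;
    }
}
```

Searching for "alice" in "wonderland alice" with this sketch inspects four windows rather than twelve, which is the "you don't walk byte by byte" effect.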
<enebo>
ah well I did not mean this much...just literally making an index location map with an extra O(n) scan first so all calculations of prev/next in matchers become an array lookup
<lopex>
yeah, but still costly for creation
<enebo>
byte[] data = "alice"; int[] locs = "01234"; data[locs[2]]
<enebo>
pseudo code those are not strings per se
<enebo>
yeah so probably only worth it if you know you have a complicated regexp which will be walking a lot
<enebo>
or likely to be
<enebo>
My code obviously would be horrible for a 7bit string
<enebo>
if you know you are doing something super greedy or likely to backtrack a lot then making this array may pay for itself
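The offset array enebo describes could look roughly like the following. It is a minimal sketch that assumes well-formed UTF-8 input and does not use jcodings or joni; `CharOffsets` and `build` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharOffsets {
    // One O(n) pass that records the byte index at which each character starts.
    // Assumption: bytes holds valid UTF-8, so every byte that is not a
    // continuation byte (10xxxxxx) begins a new character.
    static int[] build(byte[] bytes) {
        int[] starts = new int[bytes.length]; // worst case: one char per byte
        int count = 0;
        for (int i = 0; i < bytes.length; i++) {
            if ((bytes[i] & 0xC0) != 0x80) {
                starts[count++] = i;
            }
        }
        return Arrays.copyOf(starts, count);
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), 'z' (1 byte)
        byte[] data = "a\u00e9\u20acz".getBytes(StandardCharsets.UTF_8);
        int[] locs = build(data);
        System.out.println(Arrays.toString(locs)); // prints [0, 1, 3, 6]
        // The start of char i is locs[i]; the previous char head is locs[i - 1],
        // with no re-scan of the byte data.
    }
}
```

The trade-off raised in the conversation shows up directly: the array costs an extra O(n) pass plus up to one int per byte, so for a 7bit string, where every index is its own character start, it is pure overhead.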