roca has joined #jruby
roca has quit [Quit: roca]
jrafanie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
akp has joined #jruby
pilne has quit [Quit: Leaving]
rdubya1 has quit [Read error: Connection reset by peer]
rdubya has joined #jruby
Puffball has quit [Remote host closed the connection]
Puffball has joined #jruby
Puffball has quit [Remote host closed the connection]
sidx64 has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
sidx64_ has joined #jruby
sidx64 has quit [Read error: Connection reset by peer]
sidx64 has joined #jruby
sidx64_ has quit [Ping timeout: 268 seconds]
Cu5tosLimen has quit [Excess Flood]
Cu5tosLimen has joined #jruby
mkristian has joined #jruby
sidx64_ has joined #jruby
sidx64 has quit [Read error: Connection reset by peer]
sidx64_ has quit [Ping timeout: 252 seconds]
sidx64 has joined #jruby
drbobbeaty has joined #jruby
claudiuinberlin has joined #jruby
drbobbeaty has quit [Client Quit]
drbobbeaty has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
drbobbeaty has quit [Ping timeout: 245 seconds]
drbobbeaty has joined #jruby
shellac has joined #jruby
drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
sidx64 has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
sidx64 has quit [Client Quit]
sidx64 has joined #jruby
drbobbeaty has joined #jruby
sidx64 has quit [Client Quit]
sidx64 has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
bbrowning_away is now known as bbrowning
lance|afk is now known as lanceball
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
bga571 has quit [Ping timeout: 256 seconds]
bga57 has joined #jruby
jrafanie has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
<GitHub187> [jruby] enebo reopened issue #4796: Possible ChannelFD leak in FilenoUtil? https://git.io/vdTFp
jrafanie_ has joined #jruby
jrafanie has quit [Ping timeout: 252 seconds]
jrafanie_ has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
jrafanie has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
roca has joined #jruby
<headius> howdy howdy
pilne has joined #jruby
akp has quit []
shellac has quit [Quit: Leaving]
claudiuinberlin has quit [Quit: Textual IRC Client: www.textualapp.com]
claudiuinberlin has joined #jruby
roca has quit [Quit: roca]
shellac has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
mkristian has quit [Quit: This computer has gone to sleep]
shellac has quit [Quit: Computer has gone to sleep.]
sidx64 has joined #jruby
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
sidx64 has joined #jruby
<GitHub138> [jruby] lopex pushed 1 new commit to master: https://git.io/vxIrP
<GitHub138> jruby/master 36b44df Marcin Mielzynski: fix for #5086, RegexpError invalid pattern in look-behind for certain Regexps since 9.1.16.0
<lopex> enebo: ^^
<enebo> lopex: I see logic changed
<lopex> enebo: explained in the commend
<lopex> comment
<lopex> tell me if it makes sense
<lopex> I see a 20% speed increase even on this short /(?<!ss)/i =~ "a"
<enebo> so any regexp which is us-ascii if it sees a 7bit string it just says fuck it we will treat it as ascii
<lopex> yes, us-ascii
<enebo> yeah us-ascii
<enebo> sounds reasonable to me
<enebo> I did not fully grok how this changes DRegexps
<lopex> enebo: we created two regexps for this one
<lopex> it's a literal so next time it will hit the cache
<enebo> so we would continually make a new regexp for the encoding of the string passed in
<lopex> yes
<enebo> I am just confused why that would not cache as well
<enebo> I am not sure it is important for me to know though :)
<lopex> well it would
<lopex> but for dregexps we'd make twice as many
<enebo> oh since it makes us-ascii one already
<lopex> and go slow unicode path
<enebo> we actually make our regexps many times on the way through parsing already too but they probably cache hit
<lopex> or whatever encoding the string had
<lopex> but the issue that triggered it is funnier
<enebo> lopex: ok well I doubt regexp cache is filling up the internets servers but it is cool when we find some extra memory we can kill
<lopex> enebo: I think there's onigmo bug
<enebo> lopex: yeah since why would this not work when not US-ASCII
<lopex> like /(?<!ss)/ui doesnt blow but /(?<!fss)/i does
<lopex> the problem is with ss
<lopex> which folds to two different german sharp-s unicode chars
<lopex> so technically (?<!ss) is still variable look-behind !!
<lopex> and doesnt blow
<lopex> I mean variable length look-behind
<enebo> hah so ss = ß
<enebo> but that is more than one?
<enebo> ʒ
<enebo> wow
<lopex> enebo: and ẞ
<enebo> I studied german a bit and did not know of that crazy 3 ss :)
<enebo> or if I did I have forgotten it
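A minimal Ruby sketch of the folding being discussed, assuming Ruby 2.4 or later (where `String#downcase` accepts `:fold`). It only shows that both sharp-s code points fold to "ss" and that their UTF-8 encodings differ in byte length; it does not claim to show how Onigmo or joni decide that a look-behind is variable-length.

```ruby
# Both sharp-s code points, written as escapes so the file encoding does not matter.
sharp_s_small = "\u00DF" # LATIN SMALL LETTER SHARP S
sharp_s_cap   = "\u1E9E" # LATIN CAPITAL LETTER SHARP S

[sharp_s_small, sharp_s_cap].each do |ch|
  # Full Unicode case folding turns each of them into the two-letter string "ss".
  puts format("U+%04X folds to %p and takes %d bytes in UTF-8",
              ch.ord, ch.downcase(:fold), ch.bytesize)
end

# The plain ASCII spelling, for comparison.
puts format("%p takes %d bytes in UTF-8", "ss", "ss".bytesize)
```

Since the three spellings that can stand for a case-insensitive "ss" do not all have the same byte length, a case-insensitive Unicode pattern containing "ss" is not obviously fixed-width.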
<lopex> enebo: same for tt
<enebo> but this is quite weird semantically
<lopex> er
<enebo> So if you are german and want this behavior you need to force encode your regexp to be unicode or it won't match
<lopex> yes
<enebo> If I assumed as of Ruby 2.1 that all source is unicode by default I would assume my regexp would be
<lopex> ff
<enebo> This would not match my expectations
<lopex> ff
<lopex> enebo: those are different
<enebo> ff means what? :)
<lopex> no idea
<enebo> oh hah ok
<enebo> I thought you were somehow agreeing with me
<lopex> enebo: no idea what that meant
<enebo> what I said?
<lopex> I responded to "enebo> ff means what? :)"
<enebo> you just put that double ff on the screen like I should know that is one codepoint
<lopex> looks the same for me here
<enebo> It looks like you are writing two f's. So I had no idea why you put it on the screen
<lopex> ﬀ and ff
<enebo> It was unclear you were talking about \ufb00 when I sat it
<lopex> same glyph
<enebo> saw
<lopex> enebo: but unexpected issue when you're searching for css :P
<lopex> welcome to ruby
<enebo> lopex: though as you say it should work
<enebo> lopex: it may match more than you want but it should still match ascii ss
<lopex> enebo: not if you do //u
<enebo> lopex: but why not?
<enebo> lopex: ss can only mean german sharp s?
<lopex> because it will blow on /(?<!css)/iu
<enebo> lopex: yeah but that has to be a bug
<enebo> right?
<lopex> that's what I'm explaining
<enebo> yeah I am agreeing
<lopex> well, it's a look-behind it has limitations
<enebo> it should match us-ascii s followed by s and the german sharps which may not really be what people want but it should still work at least
<enebo> look behind has issues with varying multibyte chars?
<lopex> any variability
<lopex> so /(?<!.*)/
<lopex> will blow
<lopex> I mean non fixed length
<lopex> but it can have alternatives
<lopex> enebo: ^^
<lopex> as long as the alternatives' lengths are fixed so /(?<!a|bb)/ will work
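The look-behind rule described above can be checked with a small Ruby sketch. `compiles?` is a hypothetical helper name introduced here for illustration; it only reports whether `Regexp.new` accepts a pattern source.

```ruby
# Hypothetical helper: true if the pattern source compiles, false if the
# regexp engine rejects it with a RegexpError.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Top-level alternatives, each with its own fixed length.
puts compiles?('(?<!a|bb)')

# Unbounded repetition inside the look-behind, so no fixed length exists.
puts compiles?('(?<!.*)')
```

The first pattern should be accepted and the second rejected, which matches the two examples given in the conversation.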
<enebo> ah ok but ss should be [ss|ẞ|ʒ]
<enebo> it should just become a simple alternation right?
<enebo> that is the bug
<lopex> ss alone is ok
<lopex> css is bad :P
<lopex> confusing enough ?
<enebo> oh because it would need to make [css|cẞ|cʒ]
<enebo> yeah anyways I guess it does not really matter
<enebo> MRI gets around this particular case
<enebo> lopex: MRI will also break on that regexp if unicode?
<lopex> yes
<lopex> we're on par
<enebo> yeah you saying onigmo was a clue :P
<lopex> onigmo ast
<lopex> even more confusing
<enebo> heh
<lopex> enebo: anyways, it was an interesting bug
<enebo> lopex: yeah thanks for looking into it. I did not see that problem coming!
<lopex> enebo: at least not a regression
<enebo> ss looked like some fundamental off by one thing
<enebo> super happy it was something a bit weird
<lopex> well, it was a hidden perf bug really
<lopex> we wouldnt catch it without it
<enebo> lopex: yeah nice side-benefit
<enebo> lopex: and so nearly all regexps will benefit from this right since almost all are us-ascii
<lopex> yes
<lopex> I
<enebo> lopex: nice
<lopex> I'll take another bench
<enebo> lopex: so I have wondered a lot about why so much work in MRI has been to assume regexps and symbols are US-ASCII if they can possibly fit. I knew it was that 7bit is a faster code path but it felt weird semantically
<enebo> Like if I have an encoded string not US-ASCII but it isAscii and 7bit clean we should just use fast path
<enebo> So it seems strange they just force the encoding to me
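A small Ruby sketch of the "7bit clean" idea: a string can carry a non-US-ASCII encoding and still contain only 7-bit bytes. `String#ascii_only?` is a real Ruby method; `pick_path` and its return values are hypothetical names used only to illustrate the dispatch being described, not how JRuby or joni actually choose a code path.

```ruby
# Hypothetical dispatch: choose a matching strategy from the string's content
# rather than from its declared encoding.
def pick_path(str)
  str.ascii_only? ? :single_byte_fast_path : :multibyte_path
end

plain = "plain ascii".encode("UTF-8")
mixed = "stra\u00DFe" # contains U+00DF, so not 7-bit clean

puts "#{plain.encoding} #{plain.ascii_only?} #{pick_path(plain)}"
puts "#{mixed.encoding} #{mixed.ascii_only?} #{pick_path(mixed)}"
```

Both strings are tagged UTF-8, but only the second one needs multibyte handling.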
<lopex> enebo: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaa" =~ /a+/i
<lopex> now it's 3x faster
<lopex> and it will scale with it
<lopex> so potentially dozens of times faster
<lopex> enebo: mostly due to specialized ascii bytecodes for quantifiers
<lopex> er, singlebyte
<lopex> that's no jokes
<enebo> yes!!!!!!
<enebo> lopex: you know what you found?
<enebo> lopex: YOU KNOW WHAT YOU FOUND?
<enebo> I wanna hear you say it!
<lopex> no I dont think so
<enebo> DARK MATTER lopex DARK MATTER
<lopex> ("X" * 1000) + "aaa" =~ /a+/i 5x faster
<lopex> oh let me try more backtracking
<lopex> even more diff for [a]+
<lopex> yeah, might be big
<lopex> enebo: ("_" * 1000) + "" =~ /[a-z]+/i
<lopex> guess how much faster
<lopex> enebo: 35x faster
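For anyone who wants to repeat measurements of this kind, here is a rough timing sketch in Ruby. `time_match` and the iteration count are assumptions made for illustration; the two subject/pattern pairs are the ones quoted in the conversation. Absolute numbers depend on the runtime, version and warm-up, so the sketch makes no claim about the speedups reported above.

```ruby
# Hypothetical helper: run one match repeatedly and report elapsed wall time.
def time_match(label, subject, regexp, iterations = 2_000)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { subject =~ regexp }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  puts format("%-32s %.4fs", label, elapsed)
  elapsed
end

# The match only succeeds at the very end of the subject.
time_match('("X" * 1000) + "aaa" =~ /a+/i', ("X" * 1000) + "aaa", /a+/i)

# No match at all: the character class is tried at every position.
time_match('("_" * 1000) + "" =~ /[a-z]+/i', ("_" * 1000) + "", /[a-z]+/i)
```

Running the same script before and after the change, on the same machine, gives a comparable pair of numbers.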
sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<enebo> lopex: calculate the big O
<lopex> enebo: no, same number
<lopex> char class perf
<enebo> it is just that it can bump by 1 really quick versus calc codepoints?
<lopex> actually no idea why that much
<enebo> lopex: I was thinking we could save codepoint indices on a string if we walk it once into a second array
<enebo> ith would be char_array[locations[i]]
<enebo> eager allocation of n codepoint array for string but no doubt regexps walk a string once in many cases
<lopex> why codepoint array ?
<enebo> well if you need to walk back and forth on mbc char data you want to do it without constantly reasking for how long a codepoint is
<enebo> or does joni do that?
<lopex> it uses length
<enebo> lopex: I mean instead of constantly using: enc.prevCharHead(bytes, str, at, end); or one for other direction you could have an array which will just tell you where start index is
<lopex> why prev ?
<enebo> no I am just showing you a single method...joni constantly is recalculating previous or next char
<enebo> if it was eager for mbc strings it would just be an extra array indirection
<enebo> and creating the array once
<lopex> why creating an array ?
<lopex> I'm lost
<enebo> to save all the offsets of what chars are
<enebo> in the bytearray
<enebo> so next char is a single array lookup
<enebo> vs going into jcodings and partially walking the byte data over and over
<lopex> looks like my recent opt is being triggered
<lopex> look at opCClassMIX and opCClassMIXSb
<lopex> sb is specialized for singlebyte
<enebo> so did what I say not make sense still?
<lopex> entire coderange check is being skipped
<enebo> your opt for single byte data is fine but I am talking about optimizing mbc
<lopex> enebo: that's one thing, but I'm lost about array creation
<enebo> I walk the bytearray. For each char I find I make an entry in a second array telling me where it starts. joni does not need to keep walking and using jcodings to go forward or backward
<lopex> oh ok
<enebo> so I make a second array with the indices of every char in the bytearray
<lopex> but not worth it
<enebo> no?
<lopex> for every string ?
<enebo> or not worth it in some cases
<lopex> you're talking about a boyer-moore like map
<lopex> so skip map
<lopex> right ?
<enebo> beside not knowing that name maybe
<lopex> I mean not the algorithm, just an idea
<enebo> lopex: I took algo classes 30 years ago
<lopex> enebo: if you look for "alice" for example and you hit "a"
<lopex> the map will have "e" at 5
<lopex> so you dont walk byte by byte
<enebo> ah well I did not mean this much...just literally making an index location map with an extra O(n) scan first so all calculations of prev/next in matchers reduce to an array lookup
<lopex> yeah, but still costly for creation
<enebo> byte[] data = "alice"; int[] locs = "01234"; data[locs[2]]
<enebo> pseudo code those are not strings per se
<enebo> yeah so probably only worth it if you know you have a complicated regexp which will be walking a lot
<enebo> or likely to be
<enebo> My code obviously would be horrible for a 7bit string
<enebo> if you know you are doing something super greedy or likely to backtrack a lot then making this array may pay for itself
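The offset table enebo describes can be sketched in Ruby as follows. This is only an illustration of the idea under discussion, not code from joni or jcodings; `char_start_offsets` is a hypothetical name.

```ruby
# One O(n) pass that records the byte offset at which each character starts.
def char_start_offsets(str)
  offsets = []
  pos = 0
  str.each_char do |ch|
    offsets << pos
    pos += ch.bytesize
  end
  offsets
end

text = "a\u00DFc\u1E9E"            # 1-byte, 2-byte, 1-byte and 3-byte characters
offsets = char_start_offsets(text)

# With the table built, the i-th character is found by array lookups alone,
# without asking the encoding for character lengths again.
i = 2
width = (offsets[i + 1] || text.bytesize) - offsets[i]
puts offsets.inspect
puts text.byteslice(offsets[i], width)
```

The table costs one integer per character, which is the creation cost lopex objects to; for a 7-bit string it would simply hold 0, 1, 2 and so on.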
<lopex> enebo: on the other note we have specialized fast skip/fast fails routines too https://github.com/jruby/joni/blob/master/src/org/joni/SearchAlgorithm.java
<lopex> so if you have /foo/ the foo will be first fast found using those
<lopex> and then enter the interpreter
<lopex> it even works for like
jrafanie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<lopex> enebo: it will exit interpreter loop if it fails and fast skip to another suspected area
<enebo> lopex: so skip table seems to be what I mean but for mbc data.
<lopex> yeah, I know
<enebo> lopex: ok it was unclear if you did :)
<enebo> I just needed more confirmation
<lopex> it's just building that for string every time seemed weird for me
<enebo> I guess it seems likely this would have existed but when I go into matcher it looks like it is all jcoding methods to navigate around
<enebo> oh yeah I really did not mean to imply it was 100% soln
<lopex> enebo: but there's lot of place for opt too
<lopex> like /abcą/u
claudiuinberlin has quit [Quit: Textual IRC Client: www.textualapp.com]
<lopex> it will already exect3 opcode for the first three
<lopex> *use already exact3 opcode
<lopex> enebo: cr7 is somewhat one element map saying dont bother with mbc :P
<lopex> enebo: so the fix also triggered those fast skip routines it seems
<lopex> enebo: beer time
<enebo> lopex: I am low carb during week days for the foreseeable future
<enebo> lopex: so I will toast you on saturday :)
drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<lopex> enebo: mri also has this one https://github.com/k-takata/Onigmo/issues/100
<lopex> it's also pretty serious regression on their side
<lopex> perf wise
mkristian has joined #jruby
shellac has joined #jruby
shellac has quit [Read error: Connection reset by peer]
mkristian has quit [Quit: This computer has gone to sleep]
mkristian has joined #jruby
mkristian has quit [Client Quit]