#jruby on 2018-03-14 — irc logs at freenode.irclog.whitequark.org

2018-02-21 20:49 ChanServ changed the topic of #jruby to: Get 9.1.16.0! http://jruby.org/ | http://wiki.jruby.org | http://logs.jruby.org/jruby/ | http://bugs.jruby.org | Paste at http://gist.github.com

00:03 roca has joined #jruby

00:15 roca has quit [Quit: roca]

00:45 jrafanie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

01:57 akp has joined #jruby

02:23 pilne has quit [Quit: Leaving]

03:20 rdubya1 has quit [Read error: Connection reset by peer]

03:21 rdubya has joined #jruby

04:04 Puffball has quit [Remote host closed the connection]

04:29 Puffball has joined #jruby

04:57 Puffball has quit [Remote host closed the connection]

05:32 sidx64 has joined #jruby

05:49 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

05:51 sidx64 has joined #jruby

05:59 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

06:00 sidx64 has joined #jruby

06:38 sidx64_ has joined #jruby

06:40 sidx64 has quit [Read error: Connection reset by peer]

06:43 sidx64 has joined #jruby

06:45 sidx64_ has quit [Ping timeout: 268 seconds]

06:45 Cu5tosLimen has quit [Excess Flood]

06:45 Cu5tosLimen has joined #jruby

07:03 mkristian has joined #jruby

07:12 sidx64_ has joined #jruby

07:13 sidx64 has quit [Read error: Connection reset by peer]

07:17 sidx64_ has quit [Ping timeout: 252 seconds]

07:17 sidx64 has joined #jruby

08:53 drbobbeaty has joined #jruby

08:56 claudiuinberlin has joined #jruby

08:57 drbobbeaty has quit [Client Quit]

08:57 drbobbeaty has joined #jruby

09:03 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

09:08 drbobbeaty has quit [Ping timeout: 245 seconds]

09:09 drbobbeaty has joined #jruby

09:23 shellac has joined #jruby

09:29 drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

09:35 shellac has quit [Quit: Computer has gone to sleep.]

10:04 shellac has joined #jruby

10:17 sidx64 has joined #jruby

10:24 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

10:44 sidx64 has joined #jruby

10:48 sidx64 has quit [Client Quit]

10:55 sidx64 has joined #jruby

10:56 drbobbeaty has joined #jruby

10:58 sidx64 has quit [Client Quit]

11:02 sidx64 has joined #jruby

11:09 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

11:11 sidx64 has joined #jruby

11:45 shellac has quit [Quit: Computer has gone to sleep.]

12:07 shellac has joined #jruby

12:18 bbrowning_away is now known as bbrowning

13:20 lance|afk is now known as lanceball

13:31 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

13:32 sidx64 has joined #jruby

13:44 bga571 has quit [Ping timeout: 256 seconds]

13:46 bga57 has joined #jruby

14:01 jrafanie has joined #jruby

14:10 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

14:14 sidx64 has joined #jruby

14:29 <GitHub187> [jruby] enebo reopened issue #4796: Possible ChannelFD leak in FilenoUtil? https://git.io/vdTFp

14:32 jrafanie_ has joined #jruby

14:33 jrafanie has quit [Ping timeout: 252 seconds]

14:55 jrafanie_ has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

14:57 jrafanie has joined #jruby

15:43 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

16:21 roca has joined #jruby

16:22 <headius> howdy howdy

16:33 pilne has joined #jruby

16:35 akp has quit []

17:06 shellac has quit [Quit: Leaving]

17:19 claudiuinberlin has quit [Quit: Textual IRC Client: www.textualapp.com]

18:13 claudiuinberlin has joined #jruby

18:14 roca has quit [Quit: roca]

18:38 shellac has joined #jruby

19:03 shellac has quit [Quit: Computer has gone to sleep.]

19:23 shellac has joined #jruby

19:45 mkristian has quit [Quit: This computer has gone to sleep]

19:53 shellac has quit [Quit: Computer has gone to sleep.]

20:10 sidx64 has joined #jruby

20:15 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

20:16 sidx64 has joined #jruby

20:25 <GitHub138> [jruby] lopex pushed 1 new commit to master: https://git.io/vxIrP

20:25 <GitHub138> jruby/master 36b44df Marcin Mielzynski: fix for #5086, RegexpError invalid pattern in look-behind for certain Regexps since 9.1.16.0

20:26 <lopex> enebo: ^^

20:29 <enebo> lopex: I see logic changed

20:29 <lopex> enebo: explained in the commend

20:29 <lopex> comment

20:29 <lopex> tell me if it makes sense

20:31 <lopex> I see 20% speed increase even on this short /(?<!ss)/i =~ "a"

20:31 <enebo> so any regexp which is us-ascii if it sees a 7bit string it just says fuck it we will treat it as ascii

20:32 <lopex> yes, us-ascii

20:32 <enebo> yeah us-ascii

20:32 <enebo> sounds reasonable to me

20:32 <enebo> I did not fully grok how this changes DRegexps

20:32 <lopex> enebo: we created two regexps for this one

20:33 <lopex> it's a literal so next time it will hit the cache

20:33 <enebo> so we would continually make a new regexp for the encoding of the string passed in

20:33 <lopex> yes

20:33 <enebo> I am just confused why that would not cache as well

20:33 <enebo> I am not sure it is important for me to know though :)

20:33 <lopex> well it would

20:34 <lopex> but for dregexps we'd make twice as many

20:34 <enebo> oh since it makes us-ascii one already

20:34 <lopex> and go slow unicode path

20:34 <enebo> we actually make our regexps many times on the way through parsing already too but they probably cache hit

20:35 <lopex> or whatever encoding the string had

20:35 <lopex> but the issue that triggered it is funnier

20:35 <enebo> lopex: ok well I doubt regexp cache is filling up the internets servers but it is cool when we find some extra memory we can kill

20:36 <lopex> enebo: I think there's onigmo bug

20:36 <enebo> lopex: yeah since why would this not work when not US-ASCII

20:36 <lopex> like /(?<!ss)/ui doesnt blow but /(?<!fss)/i does

20:36 <lopex> the problem is with ss

20:37 <lopex> which folds to two different german sharp-s unicode chars

20:37 <lopex> so technically (?<!ss) is still variable look-behind !!

20:37 <lopex> and doesnt blow

20:38 <lopex> I mean variable length look-behind

20:38 <enebo> hah so ss = ß

20:38 <enebo> but that is more than one?

20:39 <enebo> ʒ

20:39 <enebo> wow

20:39 <lopex> enebo: and ẞ

20:39 <enebo> I studied german a bit and did not know of that crazy 3 ss :)

20:39 <enebo> or if I did I have forgotten it

20:39 <lopex> enebo: same for tt

20:39 <enebo> but this is quite weird semantically

20:40 <lopex> er

20:40 <enebo> So if you are german and want this behavior you need to force encode your regexp to be unicode or it won't match

20:40 <lopex> yes

20:40 <enebo> If I assumed as of Ruby 2.1 that all source is unicode by default I would assume my regexp would be

20:40 <lopex> ff

20:40 <enebo> This would not match my expectations

20:41 <lopex> ﬀ

20:41 <lopex> enebo: those are different

20:41 <enebo> ff means what? :)

20:41 <lopex> no idea

20:41 <enebo> oh hah ok

20:41 <enebo> I thought you were somehow agreeing with me

20:41 <lopex> http://www.fileformat.info/info/unicode/char/fb00/index.htm

20:42 <lopex> enebo: no idea what that meant

20:42 <enebo> what I said?

20:42 <lopex> I responded to "enebo> ff means what? :)"

20:42 <enebo> you just put that double ff on the screen like I should know that is one codepoint

20:43 <lopex> looks the same for me here

20:43 <enebo> It looks like you are writing two f's. So I had no idea why you put it on the screen

20:43 <lopex> ﬀ and ff

20:43 <enebo> It was unclear you were talking about \ufb00 when I sat it

20:43 <lopex> same glyph

20:43 <enebo> saw

20:44 <lopex> enebo: but unexpected issue when you're searching for css :P

20:44 <lopex> welcome to ruby

20:45 <enebo> lopex: though as you say it should work

20:46 <enebo> lopex: it may match more than you want but it should still match ascii ss

20:46 <lopex> enebo: not if you do //u

20:46 <enebo> lopex: but why not?

20:46 <enebo> lopex: ss can only mean german sharp s?

20:46 <lopex> because it will blow on /(?<!css)/iu

20:46 <enebo> lopex: yeah but that has to be a bug

20:47 <enebo> right?

20:47 <lopex> that's what I'm explaining

20:47 <enebo> yeah I am agreeing

20:47 <lopex> well, it's a look-behind it has limitations

20:47 <enebo> it should match us-ascii s followed by s and the german sharps which may not really be what people want but it should still work at least

20:48 <enebo> look behind has issues with varying multibyte chars?

20:48 <lopex> any variability

20:48 <lopex> so /(?<!.*)/

20:48 <lopex> will blow

20:48 <lopex> I mean non fixed length

20:48 <lopex> but it can have alternatives

20:49 <lopex> enebo: ^^

20:49 <lopex> as far as the alternatives lengths are fixed so /(?<!a|bb)/ will work
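
The fixed-length rule lopex describes can be checked from Ruby. This is a minimal sketch, assuming an Onigmo-compatible engine (MRI, or JRuby with joni). `Regexp.new` is used instead of regexp literals so that an invalid pattern fails at run time rather than at parse time, and `compiles?` is a hypothetical helper, not part of any library.

```ruby
# Hypothetical helper: true if the pattern source compiles,
# false if the engine rejects it with RegexpError.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Top-level alternatives of different (but each fixed) length are accepted.
fixed_alternatives = compiles?("(?<!a|bb)c")

# A quantifier with no fixed length inside the look-behind is rejected.
unbounded = compiles?("(?<!.*)c")

p fixed_alternatives
p unbounded

# The accepted pattern behaves as a negative look-behind on "a" or "bb".
re = Regexp.new("(?<!a|bb)c")
p(re =~ "xc")    # "c" is preceded by "x", so the look-behind allows a match
p(re =~ "xbbc")  # "c" is preceded by "bb", so there is no match
```

The case-insensitive "ss" problem discussed above is a variation of the same rule: case folding can turn a fixed-length literal into alternatives of different lengths.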

20:50 <enebo> ah ok but ss should be [ss|ẞ|ʒ]

20:51 <enebo> it should just become a simple alternation right?

20:51 <enebo> that is the bug

20:51 <lopex> ss alone is ok

20:51 <lopex> css is bad :P

20:52 <lopex> confusing enough ?

20:52 <enebo> oh because it would need to make [css|cẞ|cʒ]

20:53 <enebo> yeah anyways I guess it does not really matter

20:53 <enebo> MRI gets around this particular case

20:53 <enebo> lopex: MRI will also break on that regexp if unicode?

20:53 <lopex> yes

20:53 <lopex> we're on par

20:53 <enebo> yeah you saying onigmo was a clue :P

20:54 <lopex> enebo: https://gist.github.com/lopex/7aace14b0c78d0ea86fc2f9641c41899

20:54 <lopex> onigmo ast

20:54 <lopex> even more confusing

20:56 <enebo> heh

21:00 <lopex> enebo: anyways, it was an interesting bug

21:00 <enebo> lopex: yeah thanks for looking into it. I did not see that problem coming!

21:01 <lopex> enebo: at least not a regression

21:01 <enebo> ss looked like some fundamental off by one thing

21:01 <enebo> super happy it was something a bit weird

21:01 <lopex> well, it was a hidden perf bug really

21:01 <lopex> we wouldnt catch it without it

21:02 <enebo> lopex: yeah nice side-benefit

21:02 <enebo> lopex: and so nearly all regexps will benefit from this right since almost all are us-ascii

21:02 <lopex> yes

21:02 <lopex> I

21:02 <enebo> lopex: nice

21:02 <lopex> I'll take another bench

21:03 <enebo> lopex: so I have wondered a lot about why so much work in MRI has been to assume regexps and symbols are US-ASCII if they can possibly fit. I knew it was that 7bit is faster code path but it felt weird semantically

21:04 <enebo> Like if I have an encoded string not US-ASCII but it isAscii and 7bit clean we should just use fast path

21:04 <enebo> So it seems strange they just force the encoding to me
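
The distinction enebo draws here, between a string's declared encoding and whether its bytes are actually 7-bit clean, is visible from plain Ruby. A small sketch using only core methods (`Regexp#encoding`, `Regexp#fixed_encoding?`, `String#ascii_only?`); the variable names are illustrative and it says nothing about how joni picks its code path internally.

```ruby
literal = /a+/i    # only ASCII characters in the source
forced  = /a+/iu   # the u flag pins the regexp to UTF-8

p literal.encoding         # a pure-ASCII literal is tagged US-ASCII
p literal.fixed_encoding?  # and is not pinned to that encoding
p forced.encoding
p forced.fixed_encoding?

utf8_but_ascii = "xxxaaa".encode("UTF-8")
utf8_non_ascii = "stra\u00DFe"   # contains U+00DF, a two-byte character in UTF-8

p utf8_but_ascii.encoding
p utf8_but_ascii.ascii_only?   # declared UTF-8, yet every byte is 7-bit
p utf8_non_ascii.ascii_only?
```

A UTF-8 string for which `ascii_only?` is true is the situation where a single-byte fast path would be safe even though the declared encoding is multibyte.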

21:07 <lopex> enebo: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaa" =~ /a+/i

21:07 <lopex> now it's 3x faster

21:07 <lopex> and it will scale with it

21:08 <lopex> so potentially dozens of times faster

21:08 <lopex> enebo: mostly due to specialized ascii bytecodes for quantifiers

21:08 <lopex> er, singlebyte

21:09 <lopex> that's no joke

21:09 <enebo> yes!!!!!!

21:09 <enebo> lopex: you know what you found?

21:10 <enebo> lopex: YOU KNOW WHAT YOU FOUND?

21:10 <enebo> I wanna hear you say it!

21:10 <lopex> no I dont think so

21:11 <enebo> DARK MATTER lopex DARK MATTER

21:12 <lopex> ("X" * 1000) + "aaa" =~ /a+/i 5x faster

21:13 <lopex> oh let me try more backtracking

21:14 <lopex> even more diff for [a]+

21:14 <lopex> yeah, might be big

21:19 <lopex> enebo: ("_" * 1000) + "" =~ /[a-z]+/i

21:19 <lopex> guess how much faster

21:21 <lopex> enebo: 35x faster

21:22 sidx64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

21:22 <enebo> lopex: calculate the big O

21:22 <lopex> enebo: no, same number

21:23 <lopex> char class perf

21:23 <enebo> it is just that it can bump by 1 really quick versus calc codepoints?

21:26 <lopex> actually no idea why that much

21:26 <enebo> lopex: I was thinking we could save codepoint indices on a string if we walk it once into a second array

21:27 <enebo> ith would be char_array[locations[i]]

21:27 <enebo> eager allocation of n codepoint array for string but no doubt regexps walk a string once in many cases

21:28 <lopex> why codepoint array ?

21:28 <enebo> well if you need to walk back and forth on mbc char data you want to do it without constantly reasking for how long a codepoint is

21:29 <enebo> or does joni do that?

21:29 <lopex> it uses length

21:31 <enebo> lopex: I mean instead of constantly using: enc.prevCharHead(bytes, str, at, end); or one for other direction you could have an array which will just tell you where start index is

21:32 <lopex> why prev ?

21:32 <enebo> no I am just showing you a single method...joni constantly is recalculating previous or next char

21:32 <enebo> if it was eager for mbc strings it would just be an extra array indirection

21:33 <enebo> and creating the array once

21:33 <lopex> why creating an array ?

21:33 <lopex> I'm lost

21:33 <enebo> to save all the offsets of what chars are

21:33 <enebo> in the bytearray

21:33 <enebo> so next char is a single array lookup

21:33 <enebo> vs going into jcodings and partially walking the byte data over and over

21:34 <lopex> enebo: https://github.com/jruby/joni/blob/master/src/org/joni/ByteCodeMachine.java#L831

21:34 <lopex> looks like my recent opt is being triggered

21:34 <lopex> look at opCClassMIX and opCClassMIXSb

21:34 <lopex> sb is specialized for singlebyte

21:34 <enebo> so did what I say not make sense still?

21:34 <lopex> entire coderange check is being skipped

21:35 <enebo> your opt for single byte data is fine but I am talking about optimizing mbc

21:35 <lopex> enebo: that's one thing, but I'm lost about array creation

21:36 <enebo> I walk the bytearray. For each char I find I make an entry in a second array telling me where it starts. joni does not need to keep walking and using jcodings to go forward or backward

21:36 <lopex> oh ok

21:36 <enebo> so I make a second array with the indices of every char in the bytearray

21:36 <lopex> but not worth it

21:36 <enebo> no?

21:36 <lopex> for every string ?

21:36 <enebo> or not worth it in some cases

21:37 <lopex> you're talking Boyer-Moore like map

21:37 <lopex> so skip map

21:37 <lopex> right ?

21:37 <enebo> beside not knowing that name maybe

21:38 <lopex> I mean not the algorithm, just an idea

21:38 <enebo> lopex: I took algo classes 30 years ago

21:39 <lopex> enebo: if you look for "alice" for example and you hit "a"

21:39 <lopex> the map will have "e" at 5

21:39 <lopex> so you dont walk byte by byte

21:40 <enebo> ah well I did not mean this much...just literally making an index location map with an extra O(n) scan first so all calculations of prev/next in matchers reduce to an array lookup

21:40 <lopex> yeah, but still costly for creation

21:41 <enebo> byte[] data = "alice"; int[] locs = "01234"; data[locs[2]]

21:41 <enebo> pseudo code those are not strings per se
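
enebo's pseudo code can be made concrete with a multibyte string, where the offsets are no longer just 0, 1, 2, and so on. This is only a sketch of the idea in Ruby, not how joni or jcodings work; `char_offsets` and `char_at` are hypothetical helper names.

```ruby
# Hypothetical helper: one O(n) pass that records the byte offset
# at which each character of str starts.
def char_offsets(str)
  offsets = []
  pos = 0
  str.each_char do |ch|
    offsets << pos
    pos += ch.bytesize
  end
  offsets
end

# Hypothetical helper: fetch character i by array lookup.
# It starts at locs[i] and ends where the next character starts.
def char_at(str, locs, i)
  finish = i + 1 < locs.size ? locs[i + 1] : str.bytesize
  str.byteslice(locs[i], finish - locs[i])
end

data = "a\u00DFc"            # "a", U+00DF, "c": 1 + 2 + 1 bytes in UTF-8
locs = char_offsets(data)

p locs                       # byte offsets of the three characters
p char_at(data, locs, 1)     # the two-byte character, found without rescanning
```

The cost is one extra pass plus an integer per character, which is the trade-off lopex raises: it could only pay off when a match is expected to move back and forth over multibyte data many times.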

21:41 <enebo> yeah so probably only worth it if you know you have a complicated regexp which will be walking a lot

21:41 <enebo> or likely to be

21:42 <enebo> My code obviously would be horrible for a 7bit string

21:43 <enebo> if you know you are doing something super greedy or likely to backtrack a lot then making this array may pay for itself

21:43 <lopex> enebo: on another note we have specialized fast skip/fast fail routines too https://github.com/jruby/joni/blob/master/src/org/joni/SearchAlgorithm.java

21:43 <lopex> so if you have /foo/ the foo will be first fast found using those

21:43 <lopex> and then enter the interpreter

21:44 <lopex> it even works for like

21:45 jrafanie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

21:45 <lopex> enebo: it will exit interpreter loop if it fails and fast skip to another suspected area

21:45 <enebo> lopex: so skip table seems to be what I mean but for mbc data.

21:45 <lopex> yeah, I know

21:46 <enebo> lopex: ok it was unclear if you did :)

21:46 <enebo> I just needed more confirmation

21:46 <lopex> it's just that building that for a string every time seemed weird to me

21:47 <enebo> I guess it seems likely this would have existed but when I go into matcher it looks like it is all jcoding methods to navigate around

21:47 <enebo> oh yeah I really did not mean to imply it was 100% soln

21:47 <lopex> enebo: but there's a lot of room for opt too

21:48 <lopex> like /abcą/u

21:48 claudiuinberlin has quit [Quit: Textual IRC Client: www.textualapp.com]

21:48 <lopex> it will already exect3 opcode for the first three

21:48 <lopex> *use already exact3 opcode

21:49 <lopex> enebo: cr7 is somewhat one element map saying dont bother with mbc :P

21:52 <lopex> enebo: so the fix also triggered those fast skip routines it seems

21:53 <lopex> enebo: beer time

21:53 <enebo> lopex: I am low carb during week days for the foreseeable future

21:54 <enebo> lopex: so I will toast you on saturday :)

22:44 drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

22:51 <lopex> enebo: mri also has this one https://github.com/k-takata/Onigmo/issues/100

22:51 <lopex> it's also pretty serious regression on their side

22:51 <lopex> perf wise

22:59 <lopex> enebo: https://imgur.com/a/t6Vdk

23:10 mkristian has joined #jruby

23:24 shellac has joined #jruby

23:39 shellac has quit [Read error: Connection reset by peer]

23:40 mkristian has quit [Quit: This computer has gone to sleep]

23:40 mkristian has joined #jruby

23:41 mkristian has quit [Client Quit]