<GitHub78>
[jruby] ninkibah opened issue #5238: Jruby 9.1.16 still slow on Java8 without -Xcompile.invokedynamics=false https://git.io/fbpNn
<enebo>
yes
<nirvdrum>
lopex: Do you know when rb_enc_fast_mbclen, rb_enc_mbclen, and rb_enc_precise_mbclen should be called?
<lopex>
eregon: so we were about to reinvent the wheel
<lopex>
nirvdrum: let me recall
<enebo>
eregon: do you recall tuning it this way to reduce memory?
<lopex>
nirvdrum: there's also onigenc_mbclen_approximate
<nirvdrum>
lopex: Yeah. I'm looking at the usages and I'm having a hard time really working out why there's both rb_enc_mbclen and rb_enc_precise_mbclen.
<enebo>
I have wondered why one is chosen over the other
<nirvdrum>
And then there are macros like MBCLEN_CHARFOUND_LEN that do nothing.
<enebo>
non-precise is for when you don't actually need to know the length vs it being more than 1, sort of thing
<nirvdrum>
enebo: I assume it's a CR_BROKEN vs non-broken thing. And maybe to ensure that your head pointer isn't in the middle of a multi-byte char sequence.
<lopex>
nirvdrum: I think we should talk about their semantics since those names are hopeless
<lopex>
and then assign those to the names
<enebo>
nirvdrum: ah so precise may just explode in bad strings
<lopex>
nirvdrum: easy to get lost
<lopex>
nirvdrum: I see at least three: non-validating; validating, returning the number of missing bytes; and validating with a fallback to 1 as the length
<nirvdrum>
I think it just returns a negative number if it needs more bytes.
<lopex>
nirvdrum: oniguruma uses the last one for parsing so it doesn't enter infinite loops
<nirvdrum>
Okay. It's probably used for CR_BROKEN then. I've noticed on invalid byte sequences, the length is treated as "1".
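A minimal Java sketch of the fallback lopex describes, using jcodings' validating Encoding#length(byte[], int, int), which per the discussion returns a negative value when bytes are missing; the scanning loop itself is hypothetical:

    import org.jcodings.Encoding;

    // Counting characters with the "fall back to 1" rule: without it, a
    // broken or truncated char would yield a non-positive length and the
    // loop below would never advance.
    static int countChars(Encoding enc, byte[] bytes, int p, int end) {
        int chars = 0;
        while (p < end) {
            int n = enc.length(bytes, p, end); // negative when bytes are missing
            p += n > 0 ? n : 1;                // broken char: advance by 1 so we terminate
            chars++;
        }
        return chars;
    }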
<lopex>
nirvdrum: the bigger issue is where they have extra guards in client code
<lopex>
nirvdrum: or unknown
<nirvdrum>
Where do you see the difference in validating vs non-validating?
<lopex>
nirvdrum: validating, like where the client code always checks for negative ?
<nirvdrum>
As far as I can tell, in jcodings it always validates and it's up to the caller to determine whether to ignore negative values.
<lopex>
yeah
<nirvdrum>
Okay.
<lopex>
nirvdrum: but some client code might pass the result unchecked
<nirvdrum>
I meant "validating" as in "validating the byte sequence is valid for the given encoding".
<nirvdrum>
But even that's weird because the return value only considers the first character, correct?
<lopex>
yeah, for truncated / broken
<enebo>
but that is not valid?
<lopex>
nirvdrum: no
<enebo>
sorry I will not interject :)
<lopex>
nirvdrum: utf8 goes through a number of tables for each position
<lopex>
nirvdrum: so it walks the char
<nirvdrum>
But if you give it a byte sequence with multiple characters, you only get the byte length of the first one, no?
<lopex>
yes
<nirvdrum>
This further confuses the semantics of these length functions for me.
<nirvdrum>
I think what I'm driving at is: if we know a String is not CR_BROKEN, as most strings are, I don't think we need to be doing all this extra validation.
<nirvdrum>
You'd need to be careful that you're on a proper character boundary, of course.
<nirvdrum>
But UTF-8 has a special bit sequence to indicate continuation bytes.
<lopex>
nirvdrum: there are two options, like failing altogether or falling back to 1 just to safely advance, right?
<lopex>
nirvdrum: that's why the default returns missing bytes
<nirvdrum>
I'm ignoring the CR_BROKEN case for the moment.
<lopex>
in jcodings now
<lopex>
and unknown right ?
<nirvdrum>
No. I'm particularly interested in CR_VALID.
<nirvdrum>
CR_7BIT is trivial. Always return 1.
<lopex>
and for valid the same just the length of a char
<nirvdrum>
CR_UNKNOWN sounds like it always needs rb_enc_precise_mbclen.
<lopex>
yes
<lopex>
nirvdrum: or approximate
<eregon>
enebo: mostly for access and building speed + escape analysis I think
<nirvdrum>
So, I'm confused by the use of rb_enc_mbclen on CR_VALID strings.
<lopex>
nirvdrum: yeah, it's a wasted effort
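A minimal sketch of the dispatch being argued for here, assuming the string's code range is tracked alongside it; the CR_* constants and their values are placeholders, not JRuby's actual flag bits:

    import org.jcodings.Encoding;

    // Placeholder code-range constants (not JRuby's real flag values).
    static final int CR_UNKNOWN = 0, CR_7BIT = 1, CR_VALID = 2, CR_BROKEN = 3;

    // Pick a length strategy from the code range instead of always validating.
    static int charLength(Encoding enc, byte[] bytes, int p, int end, int cr) {
        switch (cr) {
            case CR_7BIT:  return 1;                    // ASCII-only: always one byte
            case CR_VALID: return enc.length(bytes[p]); // trust the leading byte, skip validation
            default: {                                  // CR_UNKNOWN / CR_BROKEN: validate
                int n = enc.length(bytes, p, end);      // may be negative on a broken char
                return n > 0 ? n : Math.min(enc.minLength(), end - p);
            }
        }
    }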
<eregon>
enebo: and avoid too many allocations so less memory used too
<enebo>
eregon: ok yeah graal seems to do better with unwrapping boxed values (like hash)
<enebo>
hashcode as Integer
<lopex>
nirvdrum: we're on the same page I guess
<nirvdrum>
lopex: Why is Encoding#length(byte) deprecated then?
<nirvdrum>
Being able to use the UTF8EncLen table looks like it'd be much faster.
<lopex>
nirvdrum: that was a remnant of an old oniguruma api
<nirvdrum>
Okay.
<enebo>
hotspot will not escape those so then we have to entertain part of the triple maybe being 2 arrays
<lopex>
nirvdrum: it was totally non-validating
<eregon>
enebo: yeah, most operations on small Hash can basically fold at compilation if the Hash is escape analyzed
<lopex>
nirvdrum: we should add two length methods on Encoding I think
<enebo>
sorry I meant cache hashcode value as boxed int in that array not Ruby Hashes
<lopex>
nirvdrum: precise and approx
<eregon>
enebo: yep, we didn't care about it so far, but maybe we should
<nirvdrum>
lopex: So backing up. Because I want to really make sure we're talking about the same thing.
<enebo>
eregon: seems Graal is pretty good at figuring that case out so you maybe don't need to
<eregon>
enebo: if the Hash doesn't escape then no need to box, and no allocation ever. But if it does escape we'll box those 1 to 3 hashes to Integer
<nirvdrum>
lopex: Actually, what's the difference in rb_enc_fast_mbclen and rb_enc_mbclen then? They both validate the byte sequence for the given encoding, right?
<enebo>
eregon: yeah
<enebo>
eregon: That was what I was trying to say
<eregon>
:)
<enebo>
eregon: corollary for us is most people still use hotspot so we are tailoring more for that
<enebo>
eregon: since your impl probably will never care about hotspot primarily I would just do the simpler thing if graal figures it out
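A toy illustration of the boxing trade-off eregon and enebo are discussing, not TruffleRuby's or JRuby's actual layout: a small-hash entry held as an Object[] triple with its hash cached as a boxed Integer. If escape analysis removes the array, the box costs nothing; if the array escapes, the Integer really gets allocated.

    // Hypothetical small-hash entry: { key, value, cached hash code }.
    static Object[] entry(Object key, Object value) {
        return new Object[] { key, value, Integer.valueOf(key.hashCode()) };
    }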
<enebo>
nirvdrum: lopex: can one of you add some comments explaining semantics to StringSupport or wherever those methods are once this conversation is over?
<enebo>
I admit I tend to just use precise and I can see that probably was a big mistake :P
<enebo>
I would almost be in favor of fixing the names and just adding an rb: comment
<nirvdrum>
+1
<lopex>
nirvdrum: they both use ONIGENC_PRECISE_MBC_ENC_LEN
<lopex>
nirvdrum: but they have a precise_mbc_enc_len function pointer on the encoding, so we're compatible on this in jcodings
<lopex>
so everything goes through enc->precise_mbc_enc_len
<lopex>
nirvdrum: without a table with those names we won't be able to talk though
<lopex>
which does what in mri code
<nirvdrum>
So, rb_enc_precise_mbclen returns either the MBC byte length or a negative number indicating more bytes are needed. rb_enc_mbclen calls rb_enc_precise_mbclen and if it sees a negative number, it returns 1 if the encoding's min. char length is less than or equal to the remaining bytes. But what does rb_enc_fast_mbclen do? Does it do any byte sequence validation?
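The relationship just described, rendered as a Java sketch over jcodings; preciseLength and mbcLen are hypothetical names standing in for rb_enc_precise_mbclen and rb_enc_mbclen:

    import org.jcodings.Encoding;

    // "Precise": the char's byte length, or a negative value when the
    // sequence is truncated or broken.
    static int preciseLength(Encoding enc, byte[] bytes, int p, int end) {
        return enc.length(bytes, p, end);
    }

    // Non-precise: wraps the precise form and always makes progress by
    // falling back to the encoding's minimum char length, capped at the
    // bytes remaining.
    static int mbcLen(Encoding enc, byte[] bytes, int p, int end) {
        int n = preciseLength(enc, bytes, p, end);
        if (n > 0 && n <= end - p) return n;
        return Math.min(enc.minLength(), end - p);
    }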
<nirvdrum>
I think I get a bit confused, too, because jcodings only has Encoding#length. The rest is implemented in StringSupport.
<lopex>
nirvdrum: but I'll go through all those with the writeup
<enebo>
lopex: I want an english description of what it actually does vs just the relationship alone
<lopex>
enebo: I want too :P
<enebo>
hahah ok
<lopex>
the problem is that it's an evolving api on mris side
<enebo>
nirvdrum: I would like those moved into jcodings if possible so we can make an optimized length on valid and then share that logic
<lopex>
and yes, they do a LOT of unnecessary work
<nirvdrum>
So onigenc_mbclen_approximate calls ONIGENC_PRECISE_MBC_ENC_LEN as well.
<lopex>
nirvdrum: hence we want an additional non-validating length on Encoding
<nirvdrum>
This is nutty.
<lopex>
nirvdrum: last question is where to fail and where to advance by 1 on an invalid char
<nirvdrum>
Right.
<nirvdrum>
And I'd love to know what "fast" means in rb_enc_fast_mbclen, because it looks like it does much the same work as the others.
<nirvdrum>
I think for the purposes of the API, I'm willing to accept as an invariant that the caller only sets the head pointer to a byte position corresponding to the first byte in a character. By accepting that, I can assume the CR of the full string can be respected. In particular, if the full string is not CR_BROKEN, then any individual character in it cannot be broken.
<lopex>
nirvdrum: and for valid utf-8 you could use that fast walking thing
<lopex>
without array lookups
<lopex>
nirvdrum: actually, would it work as a c ext?
<nirvdrum>
I think the MRI rules are if you modify a String in an extension, you're responsible for updating its code range as well.
<lopex>
nirvdrum: I mean the code which treats char* as unsigned int*
<nirvdrum>
I guess it would, but you'd have to cross a native boundary so I'm not sure how much you'd gain.
<nirvdrum>
But with valid UTF-8, that's not really an issue. You only need to look at the leading byte.
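The leading-byte point in sketch form: for valid UTF-8, a 256-entry table keyed on the first byte answers the length question outright, which is the UTF8EncLen idea. Hypothetical table, not jcodings' own:

    // UTF-8 char length from the leading byte alone; continuation bytes
    // (10xxxxxx) and invalid leaders stay 0.
    static final byte[] UTF8_LEN = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) UTF8_LEN[b] = 1; // 0xxxxxxx
        for (int b = 0xC2; b <= 0xDF; b++) UTF8_LEN[b] = 2; // 110xxxxx
        for (int b = 0xE0; b <= 0xEF; b++) UTF8_LEN[b] = 3; // 1110xxxx
        for (int b = 0xF0; b <= 0xF4; b++) UTF8_LEN[b] = 4; // 11110xxx
    }

    static int utf8Length(byte leading) {
        return UTF8_LEN[leading & 0xFF];
    }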