<GitHub78>
[jruby] ninkibah opened issue #5238: Jruby 9.1.16 still slow on Java8 without -Xcompile.invokedynamics=false https://git.io/fbpNn
<enebo>
yes
<nirvdrum>
lopex: Do you know when rb_enc_fast_mbclen, rb_enc_mbclen, and rb_enc_precise_mbclen should be called?
<lopex>
eregon: so we were about to reinvent the wheel
<lopex>
nirvdrum: let me recall
<enebo>
eregon: do you recall tuning it this way to reduce memory?
<lopex>
nirvdrum: there's also onigenc_mbclen_approximate
<nirvdrum>
lopex: Yeah. I'm looking at the usages and I'm having a hard time really working out why there's both rb_enc_mbclen and rb_enc_precise_mbclen.
<enebo>
I have wondered why one is chosen over the other
<nirvdrum>
And then there are macros like MBCLEN_CHARFOUND_LEN that do nothing.
<enebo>
non-precise is for when you don't actually need to know the length vs it being more than 1, sort of thing
<nirvdrum>
enebo: I assume it's a CR_BROKEN vs non-broken thing. And maybe to ensure that your head pointer isn't in the middle of a multi-byte char sequence.
<lopex>
nirvdrum: I think we should talk about their semantics since those names are hopeless
<lopex>
and then assign those to the names
<enebo>
nirvdrum: ah so precise may just explode in bad strings
<lopex>
nirvdrum: easy to get lost
<lopex>
nirvdrum: I see at least three: non-validating; validating, returning the number of missing bytes; and validating with a fallback to 1 as the length
<nirvdrum>
I think it just returns a negative number if it needs more bytes.
<lopex>
nirvdrum: oniguruma uses the last one for parsing so it doesn't enter infinite loops
<nirvdrum>
Okay. It's probably used for CR_BROKEN then. I've noticed on invalid byte sequences, the length is treated as "1".
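A minimal Java sketch of the fallback lopex describes, using jcodings' validating Encoding#length(byte[], int, int), which per the discussion returns a negative value when bytes are missing; the scanning loop itself is hypothetical:

    import org.jcodings.Encoding;

    // Counting characters with the "fall back to 1" rule: without it, a
    // broken or truncated char would yield a non-positive length and the
    // loop below would never advance.
    static int countChars(Encoding enc, byte[] bytes, int p, int end) {
        int chars = 0;
        while (p < end) {
            int n = enc.length(bytes, p, end); // negative when bytes are missing
            p += n > 0 ? n : 1;                // broken char: advance by 1 so we terminate
            chars++;
        }
        return chars;
    }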
<lopex>
nirvdrum: the bigger issue is where they have extra guards in client code
<lopex>
nirvdrum: or unknown
<nirvdrum>
Where do you see the difference in validating vs non-validating?
<lopex>
nirvdrum: validating, like where the client code always checks for negative ?
<nirvdrum>
As far as I can tell, in jcodings it always validates and it's up to the caller to determine whether to ignore negative values.
<lopex>
yeah
<nirvdrum>
Okay.
<lopex>
nirvdrum: but some client code might pass the result unchecked
<nirvdrum>
I meant "validating" as in "validating the byte sequence is valid for the given encoding".
<nirvdrum>
But even that's weird because the return value only considers the first character, correct?
<lopex>
yeah, for truncated / broken
<enebo>
but that is not valid?
<lopex>
nirvdrum: no
<enebo>
sorry I will not interject :)
<lopex>
nirvdrum: utf8 goes through a number of tables for each position
<lopex>
nirvdrum: so it walks the char
<nirvdrum>
But if you give it a byte sequence with multiple characters, you only get the byte length of the first one, no?
<lopex>
yes
<nirvdrum>
This further confuses the semantics of these length functions for me.
<nirvdrum>
I think what I'm driving at is: if we know a String is not CR_BROKEN, as most strings are, I don't think we need to be doing all this extra validation.
<nirvdrum>
You'd need to be careful that you're on a proper character boundary, of course.
<nirvdrum>
But UTF-8 has a special bit sequence to indicate continuation bytes.
<lopex>
nirvdrum: there are two options, like failing altogether or falling back to 1 just to safely advance, right?
<lopex>
nirvdrum: that's why the default returns missing bytes
<nirvdrum>
I'm ignoring the CR_BROKEN case for the moment.
<lopex>
in jcodings now
<lopex>
and unknown right ?
<nirvdrum>
No. I'm particularly interested in CR_VALID.
<nirvdrum>
CR_7BIT is trivial. Always return 1.
<lopex>
and for valid the same just the length of a char
<nirvdrum>
CR_UNKNOWN sounds like it always needs rb_enc_precise_mbclen.
<lopex>
yes
<lopex>
nirvdrum: or approximate
<eregon>
enebo: mostly for access and building speed + escape analysis I think
<nirvdrum>
So, I'm confused by the use of rb_enc_mbclen on CR_VALID strings.
<lopex>
nirvdrum: yeah, it's a wasted effort
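A minimal sketch of the dispatch being argued for here, assuming the string's code range is tracked alongside it; the CR_* constants and their values are placeholders, not JRuby's actual flag bits:

    import org.jcodings.Encoding;

    // Placeholder code-range constants (not JRuby's real flag values).
    static final int CR_UNKNOWN = 0, CR_7BIT = 1, CR_VALID = 2, CR_BROKEN = 3;

    // Pick a length strategy from the code range instead of always validating.
    static int charLength(Encoding enc, byte[] bytes, int p, int end, int cr) {
        switch (cr) {
            case CR_7BIT:  return 1;                    // ASCII-only: always one byte
            case CR_VALID: return enc.length(bytes[p]); // trust the leading byte, skip validation
            default: {                                  // CR_UNKNOWN / CR_BROKEN: validate
                int n = enc.length(bytes, p, end);      // may be negative on a broken char
                return n > 0 ? n : Math.min(enc.minLength(), end - p);
            }
        }
    }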
<eregon>
enebo: and avoid too many allocations so less memory used too
<enebo>
eregon: ok yeah graal seems to do better with unwrapping boxed values (like hash)
<enebo>
hashcode as Integer
<lopex>
nirvdrum: we're on the same page I guess
<nirvdrum>
lopex: Why is Encoding#length(byte) deprecated then?
<nirvdrum>
Being able to use the UTF8EncLen table looks like it'd be much faster.
<lopex>
nirvdrum: that was a remnant of an old oniguruma api
<nirvdrum>
Okay.
<enebo>
hotspot will not escape those so then we have to entertain part of the triple maybe being 2 arrays
<lopex>
nirvdrum: it was totally non-validating
<eregon>
enebo: yeah, most operations on small Hash can basically fold at compilation if the Hash is escape analyzed
<lopex>
nirvdrum: we should add two length methods on Encoding I think
<enebo>
sorry I meant cache hashcode value as boxed int in that array not Ruby Hashes
<lopex>
nirvdrum: precise and approx
<eregon>
enebo: yep, we didn't care about it so far, but maybe we should
<nirvdrum>
lopex: So backing up. Because I want to really make sure we're talking about the same thing.
<enebo>
eregon: seems Graal is pretty good at figuring that case out so you maybe don't need to
<eregon>
enebo: if the Hash doesn't escape then no need to box, and no allocation ever. But if it does escape we'll box those 1 to 3 hashes to Integer
<nirvdrum>
lopex: Actually, what's the difference in rb_enc_fast_mbclen and rb_enc_mbclen then? They both validate the byte sequence for the given encoding, right?
<enebo>
eregon: yeah
<enebo>
eregon: That was what I was trying to say
<eregon>
:)
<enebo>
eregon: corollary for us is most people still use hotspot so we are tailoring more for that
<enebo>
eregon: since your impl probably will never care about hotspot primarily I would just do the simpler thing if graal figures it out
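A toy illustration of the boxing trade-off eregon and enebo are discussing, not TruffleRuby's or JRuby's actual layout: a small-hash entry held as an Object[] triple with its hash cached as a boxed Integer. If escape analysis removes the array, the box costs nothing; if the array escapes, the Integer really gets allocated.

    // Hypothetical small-hash entry: { key, value, cached hash code }.
    static Object[] entry(Object key, Object value) {
        return new Object[] { key, value, Integer.valueOf(key.hashCode()) };
    }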
<enebo>
nirvdrum: lopex: can one of you add some comments explaining semantics to StringSupport or wherever those methods are once this conversation is over?
<enebo>
I admit I tend to just use precise and I can see that probably was a big mistake :P
<enebo>
I would almost be in favor of fixing the names and just adding an rb: comment
<nirvdrum>
+1
<lopex>
nirvdrum: they both use ONIGENC_PRECISE_MBC_ENC_LEN
<lopex>
nirvdrum: but they have a precise_mbc_enc_len function pointer on the encoding, so we're compatible on this in jcodings
<lopex>
so everything goes through enc->precise_mbc_enc_len
<lopex>
nirvdrum: without a table with those names we won't be able to talk though
<lopex>
which does what in mri code
<nirvdrum>
So, rb_enc_precise_mbclen returns either the MBC byte length or a negative number indicating more bytes are needed. rb_enc_mbclen calls rb_enc_precise_mbclen and if it sees a negative number, it returns 1 if the encoding's min. char length is less than or equal to the remaining bytes. But what does rb_enc_fast_mbclen do? Does it do any byte sequence validation?
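The relationship just described, rendered as a Java sketch over jcodings; preciseLength and mbcLen are hypothetical names standing in for rb_enc_precise_mbclen and rb_enc_mbclen:

    import org.jcodings.Encoding;

    // "Precise": the char's byte length, or a negative value when the
    // sequence is truncated or broken.
    static int preciseLength(Encoding enc, byte[] bytes, int p, int end) {
        return enc.length(bytes, p, end);
    }

    // Non-precise: wraps the precise form and always makes progress by
    // falling back to the encoding's minimum char length, capped at the
    // bytes remaining.
    static int mbcLen(Encoding enc, byte[] bytes, int p, int end) {
        int n = preciseLength(enc, bytes, p, end);
        if (n > 0 && n <= end - p) return n;
        return Math.min(enc.minLength(), end - p);
    }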
<nirvdrum>
I think I get a bit confused, too, because jcodings only has Encoding#length. The rest is implemented in StringSupport.
<lopex>
nirvdrum: but I'll go through all those with the writeup
<enebo>
lopex: I want an english description of what it actually does vs just the relationship alone
<lopex>
enebo: I want too :P
<enebo>
hahah ok
<lopex>
the problem is that it's an evolving api on mris side
<enebo>
nirvdrum: I would like those moved into jcodings if possible so we can make an optimized length on valid and then share that logic
<lopex>
and yes, they do a LOT of unnecessary work
<nirvdrum>
So onigenc_mbclen_approximate calls ONIGENC_PRECISE_MBC_ENC_LEN as well.
<lopex>
nirvdrum: hence we want an additional non-validating length on Encoding
<nirvdrum>
This is nutty.
<lopex>
nirvdrum: last question is where to fail and where to advance by 1 on an invalid char
<nirvdrum>
Right.
<nirvdrum>
And I'd love to know what "fast" means in rb_enc_fast_mbclen, because it looks like it does much the same work as the others.
<nirvdrum>
I think for the purposes of the API, I'm willing to accept as an invariant that the caller only sets the head pointer to a byte position corresponding to the first byte in a character. By accepting that, I can assume the CR of the full string can be respected. In particular, if the full string is not CR_BROKEN, then any individual character in it cannot be broken.
<lopex>
nirvdrum: and for valid utf-8 you could use that fast walking thing
<lopex>
without array lookups
<lopex>
nirvdrum: actually, would it work as a c ext?
<nirvdrum>
I think the MRI rules are if you modify a String in an extension, you're responsible for updating its code range as well.
<lopex>
nirvdrum: I mean the code which treats char* as unsigned int*
<nirvdrum>
I guess it would, but you'd have to cross a native boundary so I'm not sure how much you'd gain.
<nirvdrum>
But with valid UTF-8, that's not really an issue. You only need to look at the leading byte.
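The leading-byte point in sketch form: for valid UTF-8, a 256-entry table keyed on the first byte answers the length question outright, which is the UTF8EncLen idea. Hypothetical table, not jcodings' own:

    // UTF-8 char length from the leading byte alone; continuation bytes
    // (10xxxxxx) and invalid leaders stay 0.
    static final byte[] UTF8_LEN = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) UTF8_LEN[b] = 1; // 0xxxxxxx
        for (int b = 0xC2; b <= 0xDF; b++) UTF8_LEN[b] = 2; // 110xxxxx
        for (int b = 0xE0; b <= 0xEF; b++) UTF8_LEN[b] = 3; // 1110xxxx
        for (int b = 0xF0; b <= 0xF4; b++) UTF8_LEN[b] = 4; // 11110xxx
    }

    static int utf8Length(byte leading) {
        return UTF8_LEN[leading & 0xFF];
    }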