shellac has quit [Quit: Computer has gone to sleep.]
rrutkowski has joined #jruby
rrutkowski has quit [Client Quit]
rrutkowski has joined #jruby
<nirvdrum>
lopex, enebo: Am I way off, or can jcodings's Encoding#length be used when you know the code range is CR_7BIT or CR_VALID??
<nirvdrum>
(Sorry, question mark key got stuck).
<lopex>
nirvdrum: are you referring to the specialization I mentioned earlier ?
<lopex>
nirvdrum: not now, but it could have a separate non-validating length routine
<lopex>
for example if you know it's utf-8 and cr is valid then you just count the high bits on char head
<nirvdrum>
lopex: No. Just something I noticed recently. If you have a UTF-8 string and you already know it is either CR_7BIT or CR_VALID, you only need to look at the first byte to get the character length.
<lopex>
yes
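A minimal sketch of the point above, assuming the bytes are UTF-8 and already known to be CR_7BIT or CR_VALID. The class and method names are hypothetical and are not part of jcodings; the sketch only shows that the character head alone determines the byte length once validity is known.

```java
import java.nio.charset.StandardCharsets;

public final class Utf8LeadByte {
    /**
     * Byte length of the UTF-8 character whose first byte is lead.
     * Only meaningful for strings already known to be valid: a
     * continuation byte (10xxxxxx) is never a character head there,
     * so it is not treated specially.
     */
    public static int lengthFromLeadByte(byte lead) {
        int b = lead & 0xFF;      // treat the byte as unsigned
        if (b < 0x80) return 1;   // 0xxxxxxx: ASCII
        if (b < 0xE0) return 2;   // 110xxxxx
        if (b < 0xF0) return 3;   // 1110xxxx
        return 4;                 // 11110xxx
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, and a 4-byte emoji (surrogate pair in Java).
        byte[] bytes = "a\u00e9\u20ac\uD83E\uDD9E".getBytes(StandardCharsets.UTF_8);
        // Hop from character head to character head without validating.
        for (int p = 0; p < bytes.length; p += lengthFromLeadByte(bytes[p])) {
            System.out.println("offset " + p + ": " + lengthFromLeadByte(bytes[p]) + " byte(s)");
        }
    }
}
```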
<nirvdrum>
That wouldn't be true for grapheme clusters, but MRI doesn't report those.
<lopex>
you want to specialize for utf-8 ?
<lopex>
afaik GB18030 is one of few where looking at char head is not enough
<lopex>
but looks like you're talking exactly about the idea I mentioned earlier
rrutkowski has quit [Ping timeout: 276 seconds]
<nirvdrum>
lopex: I'd expect the encoding to indicate if it can't handle it then. It looks like jcodings just uses a lookup table, so I don't really know what it does in the bad cases.
<nirvdrum>
But part of it is the documentation for Encoding#length says "To be deprecated very soon (use length(byte[]bytes, int p, int end) version)"
<nirvdrum>
And I can see, what looks to me, like a valid use case for it.
shellac has joined #jruby
<nirvdrum>
lopex: I've read back in my log, but I think I missed what your idea was. If you don't mind, please recap.
<nirvdrum>
Ahh, I see. GB18030Encoding stores a null value for that table. So you end up with an NPE.
shellac has quit [Client Quit]
<lopex>
nirvdrum: I meant exactly cr valid specializations using non validating length aka length(byte)
<lopex>
nirvdrum: but it would have to be consistent across encodings or, like you're saying, marked as usable
<nirvdrum>
Yeah, we're on the same page then.
<lopex>
nirvdrum: for utf-8 I'm not sure what's the fastest length
<nirvdrum>
There's a lot of unnecessary rediscovery of information.
<lopex>
nirvdrum: it's a bitpop for high bits being one
<lopex>
nirvdrum: if there's an intrinsic it could be branchless
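One way the branchless idea could look, as a sketch only. It counts the leading one bits of the character head with Integer.numberOfLeadingZeros on the inverted byte, used here as a stand-in for the bit-counting intrinsic mentioned above. The class and method names are hypothetical, and valid UTF-8 input is assumed.

```java
public final class Utf8Branchless {
    /**
     * Byte length of a UTF-8 character from its head byte, with no
     * conditional branches. Assumes the byte is a valid character head.
     */
    public static int lengthFromLeadByte(byte lead) {
        // Shift the byte into the top 8 bits and invert it, so the run of
        // leading 1 bits in the head becomes a run of leading 0 bits.
        int ones = Integer.numberOfLeadingZeros(~(lead << 24));
        // ones is 0 for ASCII and 2..4 for multi-byte heads.
        // (ones - 1) >>> 31 is 1 only when ones == 0, which maps 0 to 1.
        return ones + ((ones - 1) >>> 31);
    }

    public static void main(String[] args) {
        System.out.println(lengthFromLeadByte((byte) 0x61)); // ASCII head
        System.out.println(lengthFromLeadByte((byte) 0xF0)); // 4-byte head
    }
}
```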
<nirvdrum>
I'm banking on the compiler being able to do two int comparisons and a conditional branch faster than a table lookup.
<lopex>
yeah, definitely
<nirvdrum>
But I suppose if you have a 4-byte character you're talking about doing that several times.
<lopex>
yes
<lopex>
not to mention unavoidable bounds checks
<nirvdrum>
But both would be faster than the byte scan we're currently doing :-)
<nirvdrum>
And obviously if you know that it's CR_7BIT you can short-circuit and return 1.
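The same short-circuit carries over to counting characters in a whole string. The sketch below assumes UTF-8 and uses a hypothetical CodeRange enum as a stand-in for the code range flags. With CR_7BIT every character is one byte, and with CR_VALID it is enough to count the bytes that are not continuation bytes.

```java
import java.nio.charset.StandardCharsets;

public final class CodeRangeLength {
    /** Hypothetical stand-in for the code range flags discussed above. */
    public enum CodeRange { CR_7BIT, CR_VALID }

    /** Character count of UTF-8 bytes whose code range is already known. */
    public static int charCount(byte[] bytes, CodeRange cr) {
        if (cr == CodeRange.CR_7BIT) {
            return bytes.length; // every character is exactly one byte
        }
        int count = 0;
        for (byte b : bytes) {
            // Continuation bytes look like 10xxxxxx; everything else
            // starts a new character in a valid UTF-8 string.
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] mixed = "h\u00e9llo \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(charCount(ascii, CodeRange.CR_7BIT));
        System.out.println(charCount(mixed, CodeRange.CR_VALID));
    }
}
```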
bbrowning is now known as bbrowning_away
<lopex>
nirvdrum: actually there could be additional specialization which returns 1 for invalid code points and chars
<lopex>
sometimes you want to proceed and not blow up
<nirvdrum>
That Ruby allows the propagation of invalid strings blows my mind.
<nirvdrum>
I was looking at some ActionPack (I think) code recently that creates an invalid UTF-8 string. And I can't imagine that's what they intended.
<nirvdrum>
Even more reason for me to dislike emoji.
<enebo>
heh lobster and flats heh I suppose every noun/thing will eventually have an emoji
<enebo>
in 10,000 years they will find a working flash drive and it will be from some adolescent gamer. They will think we all communicated with a pictographic language
<nirvdrum>
lopex: Part of what motivated this line of thought is I've been looking at the fast_blank gem. headius did a very straightforward port for JRuby at https://github.com/SamSaffron/fast_blank/pull/21
<nirvdrum>
But both do a lot of unnecessary work if you already know the code range.
<lopex>
nirvdrum: btw, joni now accepts cr7 bit in search options and it chooses different interpreter loop with faster opcodes
<lopex>
and there
<lopex>
case insensitive matching can also be sped up a lot, I have some ideas
<nirvdrum>
lopex: Ooh. What version of joni is that?
<lopex>
nirvdrum: 2.1.14
<lopex>
until now it only used some faster opcodes for singlebyte encodings
<nirvdrum>
Nice. I'll have to check that out.
<nirvdrum>
I'd love to take a good pass over our regexp code. But I don't understand a good chunk of it.
<lopex>
nirvdrum: oh, and the new jcodings supports case mapping
<nirvdrum>
I haven't looked to see if they carried that through to String methods or not.
<lopex>
you mean the fast paths ?
codefinger has quit []
codefinger has joined #jruby
Puffball has quit [Remote host closed the connection]
<nirvdrum>
I mean they stopped using encoding table lookups for rb_isspace and things like that.
jeremyevans has quit [Quit: Lost terminal]
shellac has joined #jruby
rrutkowski has quit [Quit: rrutkowski]
jeremyevans has joined #jruby
rrutkowski has joined #jruby
<enebo>
<"wrong constant name \"String\\u0000\""> expected but was
<enebo>
<"wrong constant name String\u0000">.
<enebo>
lopex: nirvdrum: you guys recall if there is a nice method for displaying non-printable RubyString characters and adding \" around it only in that case
<nirvdrum>
enebo: Encoding has #isPrint on it.
<enebo>
nirvdrum: yeah I know it does but I don't want to make another String display method if we have one
<nirvdrum>
String#dump uses it, IIRC.
<enebo>
yeah I did see that one ... hmm ... let me look at it again
<nirvdrum>
There's String#scrub, too.
<nirvdrum>
But I think that one might be the opposite of what you're looking for.
<enebo>
dumpCommon in StringSupport maybe?
<enebo>
ok dumpCommon is the logic I want
<enebo>
I have the condition I am printing out a name from a symbol (usually) and it needs to use this string/nostring weird pattern
<enebo>
err quote/no-quote
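A rough sketch of the quote/no-quote pattern described above, not the dumpCommon logic itself. It works on a plain java.lang.String rather than a RubyString, and uses Character.isISOControl as an assumed stand-in for Encoding#isPrint. The class and method names are hypothetical.

```java
public final class QuoteIfUnprintable {
    /**
     * Returns name unchanged when every character is printable; otherwise
     * returns it wrapped in double quotes with unprintable characters
     * escaped as a backslash, 'u', and four hex digits.
     */
    public static String display(String name) {
        boolean printable = true;
        for (int i = 0; i < name.length(); i++) {
            if (Character.isISOControl(name.charAt(i))) {
                printable = false;
                break;
            }
        }
        if (printable) {
            return name; // no quotes needed
        }
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isISOControl(c)) {
                sb.append(String.format("\\u%04X", (int) c));
            } else if (c == '"' || c == '\\') {
                sb.append('\\').append(c); // keep the quoted form unambiguous
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println("wrong constant name " + display("String"));
        System.out.println("wrong constant name " + display("String\u0000"));
    }
}
```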
<enebo>
Nice that this is working: TypeError: can't dump anonymous class #<Module:0x670b40af>::T⏰⏳
<nirvdrum>
Heh.
<enebo>
but it means fixing every single error message which refers to types :|
shellac has quit [Ping timeout: 260 seconds]
<nirvdrum>
I thought you had your Ruby exception logic fairly centralized.
<enebo>
nirvdrum: well we do but class.getName() is complicated
<enebo>
nirvdrum: I could split that j.l.String apart and look up each segment in symbol table to get proper string but I am instead calling a new method rubyName -> RubyString
<enebo>
nirvdrum: which means fixing all callsites which generate errors
<enebo>
it is even a tad more complicated since that might be an anonymous class name so it is not just splitting
<enebo>
hahahaha noooo NameError: wrong constant name "String\x00"
<enebo>
ok it must not see this as utf-8
<nirvdrum>
What is it you're doing?
<enebo>
oh hmm
<enebo>
well something is doing const_get?("String\0") and it raises name error
<enebo>
since it has \0 it needs to quote wrap since it has unprintable value
<nirvdrum>
I mean in general.
<enebo>
I am making all encodings work for all things
<nirvdrum>
I'm wondering if we inherited whatever bug you're fixing :-)
<nirvdrum>
Or whether this is a Ruby 2.4+ change.
<enebo>
this is just fixing so mbc works for all the things
<enebo>
so the String for the class name is no longer just printed out
<enebo>
I use that as a key back to the symbol table which retrieves properly encoded identifier
<enebo>
so I am getting an utf-8 String\0 RubySymbol/bytelist
<enebo>
but the message I build up did not quite generate things as MRI wants them
<enebo>
looking at dumpCommon, I suspect that in if (MBCLEN_CHARFOUND_LEN(n) > 0) { the length is actually 0
<enebo>
but I will trace through this
<nirvdrum>
Okay. Null bytes are handled specially in various places.
<enebo>
nirvdrum: and this is literally only for error display
<nirvdrum>
String#rstrip will strip them, but String#lstrip will not, for instance.
<enebo>
there is some validateConstant method which looks at \0
<enebo>
so probably that was the unfucking code to make it look ok
<enebo>
dumpCommon just does not have some path perhaps
<enebo>
god it is hard to believe that \0 stuff is one-off in RubyModule.validateConstant?
<enebo>
I can probably work around this by putting it into my error handling code but that is some weird shit
<enebo>
And of course I wrote it in 14
<nirvdrum>
lopex: Are the 7bit joni changes purely internal? Or is there a CR argument to pass?
<lopex>
nirvdrum: Option.CR_7_BIT as match options
<nirvdrum>
Thanks.
<lopex>
a bit experimental still though
<nirvdrum>
If I'm reading this right, you also look at the regexp's encoding, which defaults to US-ASCII.
<lopex>
yeah, if it's singlebyte then also goes faster route
<nirvdrum>
In those cases, I wouldn't need to pass the option, would I?