#picolisp on 2020-01-11 — irc logs at freenode.irclog.whitequark.org

2018-09-14 18:41 ChanServ changed the topic of #picolisp to: PicoLisp language | Channel Log: https://irclog.whitequark.org/picolisp/ | Check also http://www.picolisp.com for more information

01:58 xkapastel has quit [Quit: Connection closed for inactivity]

02:17 Phoenixwater[m] has left #picolisp [#picolisp]

08:10 <Regenaxer> I don't want to implement those huge unicode tables for 'uppc' and 'lowc' again for pil21

08:11 <Regenaxer> Now I'm very surprised to see that there seems to be no portable way (i.e. a C function)

08:23 <Regenaxer> Is this portable? https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html

08:24 <Regenaxer> At least it seems overkill

08:24 <Regenaxer> Sigh

08:30 <tankf33der> i think you should generate tables once in different file(s) and just use it

08:31 <Regenaxer> The problem is that I don't understand this case conversion

08:32 <tankf33der> but this is already done in src64, right?

08:32 <Regenaxer> for example, there is now support for uppercase ß (german s) in unicode

08:32 <Regenaxer> yes, but not correct now

08:32 <Regenaxer> needs to update for upper case ß

08:32 <Regenaxer> How to do it?

08:33 <Regenaxer> I don't want to maintain such stuff too

08:33 <tankf33der> i see

08:33 <Regenaxer> Really surprising why there is no standard support

08:34 <Regenaxer> C only has toupper/tolower for ascii or wide chars

08:34 <Regenaxer> not for utf8

08:36 <Regenaxer> I don't even find a clear description of the algorithm how to do *correct* case conversion in UTF-8

08:36 <Regenaxer> Unicode consortium

08:36 <Regenaxer> all very confusing

08:41 <Regenaxer> What do you think about the above glib?

08:41 <Regenaxer> portable?

08:41 <Regenaxer> overkill?

08:41 <Regenaxer> It supports tons of functions

08:42 <Regenaxer> I have to link them all into pil just to get uppc and lowc

08:43 <tankf33der> i believe you should not link to glib

08:43 <tankf33der> or musl

08:43 <Regenaxer> What I really want is up-to-date tables plus a clear description how to handle them

08:43 <Regenaxer> yeah

08:43 <tankf33der> let me check myrlang implementation

08:43 <Regenaxer> myrlang?

08:43 <tankf33der> yea

08:44 <tankf33der> language no one cares, as usual

08:44 <Regenaxer> Myrddin?

08:46 <tankf33der> yea

08:46 <tankf33der> i saw tables somewhere and thought picolisp has the same

08:46 <Regenaxer> I took them from some free Java project

08:47 <Regenaxer> 25 years ago or so

08:47 <Regenaxer> "Kaffee" project

08:47 <Regenaxer> But I never understood those tables

08:48 <Regenaxer> GNU Kaffe Project

08:48 <Regenaxer> (see comment in src/sym.c)

08:49 <Regenaxer> I could easily convert them to pil21 syntax

08:49 <Regenaxer> no problem

08:49 <Regenaxer> But how to handle new things*

08:49 <Regenaxer> ?

08:50 <Regenaxer> I could even just copy/paste from pico/src/sym.c to pil21/src/lib.c

08:50 <Regenaxer> But I don't like this

08:51 <Regenaxer> Having to roll everything yourself for such a standard thing like utf8

08:51 <Regenaxer> stupid

08:53 <tankf33der> https://git.eigenstate.org/ori/mc.git/tree/lib/std/chartype.myr#n1609

08:53 <tankf33der> found

08:56 <Regenaxer> How do we know these tables and algos are correct, or better than src/sym.c?

08:57 <Regenaxer> "plan 9's runetype.c" is that even still supported?

08:57 <tankf33der> problem only in conv up-low ?

08:57 <tankf33der> because current utf8 is simple, tested by me

08:57 <tankf33der> because current utf8 is correct, tested by me

08:58 <Regenaxer> yes, only for uppc and lowc

08:58 <Regenaxer> All other utf8 is already in pil21

08:58 <tankf33der> solution: create a *full* test vector with python and test.

08:59 <Regenaxer> General testing is perhaps not needed

08:59 <Regenaxer> only *new* characters in unicode

08:59 <Regenaxer> like upper-case ß

08:59 <tankf33der> eh

08:59 <Regenaxer> Is in unicode recently

08:59 <Regenaxer> and perhaps other characters
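
The test-vector idea could look roughly like the following Python sketch. It assumes that Python's built-in `str.upper()` and `str.lower()` are an acceptable reference; they follow whatever Unicode version the interpreter ships, which `unicodedata.unidata_version` reports. The name `case_vector` is made up for illustration and nothing here is part of PicoLisp.

```python
import sys
import unicodedata


def case_vector(limit=sys.maxunicode + 1):
    """Yield (code point, upper, lower) for each character that case mapping changes."""
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters, skip them
        ch = chr(cp)
        up, lo = ch.upper(), ch.lower()
        if up != ch or lo != ch:
            yield cp, up, lo


# Build the full vector once; keys are code points, values are (upper, lower).
vector = {cp: (up, lo) for cp, up, lo in case_vector()}

print("Unicode version of this Python:", unicodedata.unidata_version)
print("characters affected by case mapping:", len(vector))
```

Each entry could then be compared against what 'uppc' and 'lowc' return for the same character. Because the vector covers every code point, newly added characters such as the upper-case ß are included without special handling.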

09:00 <Regenaxer> Unicode is changing all the time

09:00 <Regenaxer> Ideal would be some library published by the unicode consortium

09:01 <Regenaxer> some *official* code

09:01 <Regenaxer> Not everybody rolling his own

09:02 <tankf33der> not portable, even libffi may be a problem

09:03 <Regenaxer> libffi too?

09:03 <Regenaxer> I thought it looks very portable

09:04 <tankf33der> maybe.

09:04 <tankf33der> so you already have a dependency

09:04 <tankf33der> i dont trust glib, who will port glib to riscv? :)

09:04 <tankf33der> linux distro maintainers?

09:05 <Regenaxer> clang maintainers

09:05 <Regenaxer> What we really need is support in clang

09:05 <Regenaxer> pil21 should use only clang for system calls

09:24 <tankf33der> wow, some utf8 byte sequences may be invalid

09:25 <Regenaxer> where?

09:30 <tankf33der> https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

09:31 <Regenaxer> Ah, yes, of course

09:31 <Regenaxer> utf8 has a special byte format

09:32 <Regenaxer> So almost any random byte sequence is illegal
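
A small Python sketch of what "invalid" means here. It relies on Python's strict UTF-8 decoder as the judge; the helper name `is_valid_utf8` is an assumption for illustration, not something from pil21.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "sharp s": "\u00df".encode("utf-8"),       # two-byte sequence C3 9F
    "truncated": b"\xc3",                       # lead byte without continuation
    "stray continuation": b"\x80",              # continuation byte on its own
    "overlong NUL": b"\xc0\x80",                # overlong encoding of U+0000
    "surrogate": b"\xed\xa0\x80",               # encodes U+D800, not allowed
    "beyond U+10FFFF": b"\xf4\x90\x80\x80",     # above the Unicode range
}

for label, raw in samples.items():
    print(f"{label:20} {raw.hex():10} valid={is_valid_utf8(raw)}")
```

Only the first sample is accepted; the others each break one of the rules of the byte format.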

09:44 <tankf33der> http://unicode.org/versions/Unicode13.0.0/

09:44 <tankf33der> damn, unicode 13 is coming.

09:45 <Regenaxer> o

09:46 <Regenaxer> Perhaps ask in some clang forum?

09:46 <tankf33der> eh

09:47 <Regenaxer> I think it should be the duty of clang to maintain such stuff

09:47 <tankf33der> and python and ruby and dlang and so on

09:47 <tankf33der> also we have this one

09:47 <tankf33der> https://git.envs.net/mpech/tankf33der/src/branch/master/wide

09:47 <Regenaxer> We need it across Linux, BSD, Mac, Android and iOS

09:48 <Regenaxer> yeah, wide.l

09:48 <Regenaxer> forgot that one

09:49 <tankf33der> also checking all links from this:

09:49 <tankf33der> https://docs.python.org/3/howto/unicode.html

09:50 <Regenaxer> T

09:52 <Regenaxer> Tons of docs, yes, but which is the "right" one? ;)

09:55 <tankf33der> no one knows until you start doing something

10:27 <tankf33der> hunting for simple pages like this:

10:27 <tankf33der> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

10:33 <tankf33der> https://github.com/Wisdom/Awesome-Unicode/blob/master/README.md

10:35 <Regenaxer> yeah

10:36 <Regenaxer> very good explanations

10:45 <Regenaxer> the second link has "One-to-many: (ß → SS )"

10:45 <Regenaxer> So this is the case where we have a new (single) char now
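
To make the one-to-many case concrete, here is a short Python sketch. It assumes Python's built-in case methods as the reference behaviour; another implementation may well choose differently.

```python
sharp_s = "\u00df"          # ß  LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ  LATIN CAPITAL LETTER SHARP S

# Upper-casing ß yields two characters, not the single capital ẞ.
upper = sharp_s.upper()
print(repr(upper), len(upper))

# Lower-casing the capital ẞ yields the single small ß.
print(repr(capital_sharp_s.lower()))

# The round trip therefore does not return to the starting character.
round_trip = sharp_s.upper().lower()
print(repr(round_trip), round_trip == sharp_s)
```

A table that maps one code point to exactly one code point cannot express the first result, which is one reason the lookup is harder than it first appears.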

10:48 <Regenaxer> And that link also shows how complicated it all is. So *not* everybody should have to roll his own

10:48 <Regenaxer> There must be some reference implementation somewhere ...

12:41 _whitelogger has joined #picolisp

12:49 <tankf33der> https://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

12:51 <tankf33der> https://unicode.org/faq/casemap_charprop.html

12:51 <tankf33der> afk.

12:59 <Regenaxer> great

13:37 <tankf33der> found how musl does case things

13:37 <tankf33der> https://raw.githubusercontent.com/BlankOn/musl/master/src/ctype/towctrans.c

13:48 <Regenaxer> Looks quite short

14:01 <Regenaxer> What does musl do with "ß"?

14:03 <tankf33der> unknown yet.

14:04 <Regenaxer> The official table should be http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

14:05 <Regenaxer> (same place as EastAsianWidth.txt)

14:08 <Regenaxer> So I will study this. Perhaps we'll do it similar to the wide char stuff

14:09 <tankf33der> sounds good

14:09 <Regenaxer> yeah, at least easy

14:09 <Regenaxer> just lookup

14:09 <tankf33der> https://github.com/dlang/phobos/pull/7349

14:09 <tankf33der> analysis about my dlang bigint multiplication

14:10 <Regenaxer> ah, yeah

14:10 <Regenaxer> buffer size bug

14:11 <Regenaxer> hehe "got undetected for so long"!

14:15 <Regenaxer> The CaseFolding table has it: 1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S

14:16 <Regenaxer> But the other direction maps to "SS": 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

14:18 <Regenaxer> Problem is that I don't know from the table which one *is* already upper or lower

14:18 <Regenaxer> It just maps to the other case

14:18 <Regenaxer> I need to split that into two tables it seems

14:20 <tankf33der> like musl, right? this function also ignores a lot of ranges

14:21 <Regenaxer> Not sure

14:21 <tankf33der> static wchar_t __towcase(wchar_t wc, int lower)

14:21 <Regenaxer> wchar_t is not helpful as far as I understand

14:23 <Regenaxer> And I don't understand the CaseFolding table

14:23 <Regenaxer> The left column contains lowercase and some uppercase

14:23 <Regenaxer> How to use it?

14:24 <Regenaxer> no, opposite: The left column contains uppercase but also *some* lowercase

14:25 <Regenaxer> ok, so I can use the text on the right side to filter! :)

14:26 <Regenaxer> "# LATIN SMALL LETTER"

14:27 <Regenaxer> If the tables are too big, I put it all into a shared library, loaded only when really needed

14:27 <Regenaxer> i.e. when 'lowc' or 'uppc' is called

14:30 <Regenaxer> I don't remember how I generated @lib/wide.l from http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

14:31 <Regenaxer> Must have a script somewhere, but I don't find it

14:36 <tankf33der> https://git.envs.net/mpech/tankf33der/src/branch/master/wide

14:37 <tankf33der> generator.l

14:37 <tankf33der> :)

14:37 <Regenaxer> indeed! You are great!!

14:38 <tankf33der> we did it together to update to latest version :)

14:38 <Regenaxer> So I found it here too, in opt/genWide.l

14:38 <Regenaxer> did not know what to search for

22:17 _whitelogger has joined #picolisp

22:42 DerGuteMoritz has quit [Ping timeout: 268 seconds]

23:08 DerGuteMoritz has joined #picolisp