marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<bloom>
fadd16/fmul16/fmadd16 seem fine?
<bloom>
if you want to know what my reference point for regularity is i'll link you the bifrost encoding :p
<dougall>
yeah, they could definitely be worse - i'm just thinking of moving the Am/Bm/Cm field relative to their 32-bit equivalents (why did they do that?), and the fact that 32-bit ops can have 16-bit sources and destinations too
<bloom>
"and the fact that 32-bit ops can have 16-bit sources and destinations too"
<bloom>
This part makes a ton of sense.
<bloom>
The 32-bit ops are heavier weight. Yes, you _can_ run a fadd.32 with all operands 16-bit, but that will (depending on uarch details that are not ISA visible) be slower or higher power.
<bloom>
It's fundamentally a different operation. Convert, fp32 multiply, convert, versus fp16 multiply. The latter is much cheaper. (The converts are cheap regardless.)
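(For multiply specifically, the convert/fp32-multiply/convert path even produces bit-identical results to a native fp16 multiply, since a product of two 11-bit significands fits exactly in fp32's 24-bit significand; the difference really is cost, not semantics. A stdlib-only check, using `struct`'s binary16/binary32 codecs to stand in for the hardware rounding:)

```python
import random
import struct

def f16(x):
    """Round a Python float to the nearest fp16 value (binary16 round trip)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f32(x):
    """Round a Python float to the nearest fp32 value (binary32 round trip)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

random.seed(0)
for _ in range(10000):
    a = f16(random.uniform(-8.0, 8.0))
    b = f16(random.uniform(-8.0, 8.0))
    # a * b is exact in double, so f16(a * b) is the correctly rounded
    # fp16 multiply; f16(f32(a * b)) is the convert/fp32-op/convert path.
    assert f16(a * b) == f16(f32(a * b))
```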
<dougall>
ah, good point, yeah... that makes sense
<bloom>
on some arches, fp16 multiply is even vectorized (where fp32 is scalar, conversions be damned)
<bloom>
it's that much cheaper :>
<bloom>
Honestly the most annoying part of the encoding is the presence of >64-bit instructions
<bloom>
Makes the bit arithmetic awful.
<dougall>
yeah, C is particularly painful for that... i'd probably use __int128, and i'd probably end up regretting it :p
<bloom>
lol
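(Python's arbitrary-precision ints sidestep the problem entirely, which is presumably why the RE tooling gets away with it. A sketch of generic field extraction from an over-64-bit instruction; the byte values and field positions here are invented:)

```python
def extract(insn: bytes, lo: int, size: int) -> int:
    """Pull a size-bit field starting at bit `lo` out of a
    little-endian instruction of any byte length."""
    word = int.from_bytes(insn, "little")
    return (word >> lo) & ((1 << size) - 1)

# e.g. an 80-bit (10-byte) instruction, made-up contents
insn = bytes.fromhex("3e1122334455667788ff")
assert extract(insn, 0, 7) == 0x3E    # low 7 bits
assert extract(insn, 72, 8) == 0xFF   # top byte, past bit 64
```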
<bloom>
ok, added some generic ALU packing code
<bloom>
2 lines shorter than I was before :-p
<bloom>
...and with the generic stuff, it was a cinch to add support for all the funops
<bloom>
dougall: " if sx and source.thread_bit_size >= 16:"
<bloom>
I suspect s/>= 16/< 64/ was intended.
<bloom>
I do wonder, if there's native 64-bit adds, why I see the blob lowering to a pair of adds
<bloom>
Oh, maybe because there's no 64-bit access to uniform registers.
<dougall>
hmm, yeah, i think you're right about < 64...
<bloom>
not a r/e thing, just a "what is sign-extension?" thing ;)
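(The `< 64` reading fits the usual idiom: sources narrower than the 64-bit register width get their top bit replicated, while a full-width source needs nothing. A generic sketch using the xor/subtract trick:)

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign-extend an unsigned `bits`-wide field to a signed integer."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

assert sign_extend(0xFFFF, 16) == -1
assert sign_extend(0x7FFF, 16) == 0x7FFF
assert sign_extend(0x8000, 16) == -0x8000
```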
<bloom>
also, the encoding for iadd seems really odd. this is probably the weirdest of the ISA.
<bloom>
It's like it's supposed to be a 48-bit instruction and they added an extra 2 bytes of padding for no reason? what?
<dougall>
fwiw i saw apple's compiler emit 64-bit subtracts and 64-bit add+shift, but (as far as i can recall) not 64-bit adds
<bloom>
...Interesting.
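(Whatever the reason for the lowering, the pair of adds would compute the standard add/add-with-carry split; a sketch of what that pair evaluates, register details omitted:)

```python
MASK32 = 0xFFFFFFFF

def add64_as_pair(a: int, b: int) -> int:
    """Lower a 64-bit add to two 32-bit adds, propagating the carry."""
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32
    return (hi << 32) | (lo & MASK32)

assert add64_as_pair(0xFFFFFFFF, 1) == 0x1_00000000
assert add64_as_pair(2**64 - 1, 1) == 0  # wraps like hardware
```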
<dougall>
yeah, not sure what's up with that encoding... i do think there's _something_ in the high couple of bits in most/all instructions that i haven't figured out, which might make it make a tiny bit more sense
<bloom>
I'm not worried about 2 unknown bits in the extended encoding
<bloom>
it's iadd specifically (imadd is fine) that's all weird..
<dougall>
(or maybe i was trying to say that immediates don't get sign extended? not really the best way to represent that... hmm)
<bloom>
dougall: ok, my curiosity got the best of me, poked at sin_pt_1/2
<bloom>
The heavylifting is done by sin_pt_2. However, the function it computes is *not* sin(x), rather it's sin(x)/x
<bloom>
(This is standard, there are numeric advantages here.)
<bloom>
But it only computes in a single quadrant. So given 0 <= x < 1, it'll spit back sin(x * (pi/2)) / x
<bloom>
Notice that's an even function. So sin_pt_2 is in fact defined over [-1, 1], but it ignores the sign bit of its input.
<bloom>
This is a useful property: it lets sin_pt_1 pass the sign of the output past the sin_pt_2 call, to be recombined with a later multiplication.
<bloom>
So what is sin_pt_1? It's just a quadrant fixup.
<bloom>
For x in the first quadrant, it's simply the identity. sin_pt_2 is defined as such, so when we compute sin_pt_2(sin_pt_1(x)) * sin_pt_1(x) we're just computing sine.
<bloom>
For x in the third quadrant, recall sin(x + pi) = -sin(x). So sin_pt_1 will just flip the sign, so we can compute in the first quadrant (pt_2), and then the sign gets restored with the multiply.
<bloom>
For x in the second quadrant, recall sin(x + pi/2) = cos(x) = sin(pi/2 - x). So rather than flip the sign, we take the arithmetic complement.
<bloom>
Likewise for the fourth quadrant, where we both complement and flip the sign.
<bloom>
The last detail I glossed over is the units. sin_pt_2 wants its angle as [-1, 1] but sin_pt_1 takes in a rotation [0, 4]. This doesn't affect any of the math, but it means the constants work out to nice integers.
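(The whole walkthrough, as a sketch: angles in quarter-turns as described, `math.sin(x)/x` standing in for whatever polynomial sin_pt_2 actually uses, and the quadrant logic following the explanation above:)

```python
import math

def sin_pt_1(t: float) -> float:
    """Quadrant fixup. `t` is an angle in quarter-turns; returns r in
    [-1, 1] whose magnitude is the first-quadrant angle and whose sign
    carries the sign of the result past sin_pt_2."""
    t = t % 4.0
    q = int(t)
    if q == 0:
        return t          # first quadrant: identity
    if q in (1, 2):
        return 2.0 - t    # complement (negative, i.e. sign flip, for q == 2)
    return t - 4.0        # fourth quadrant: complement and sign flip

def sin_pt_2(r: float) -> float:
    """sin(|r| * pi/2) / |r|: first-quadrant sine over x, sign ignored."""
    a = abs(r)
    if a == 0.0:
        return math.pi / 2.0   # limit of sin(a*pi/2)/a as a -> 0
    return math.sin(a * math.pi / 2.0) / a

def sin_quarter_turns(t: float) -> float:
    r = sin_pt_1(t)
    return sin_pt_2(r) * r     # multiply recombines the sign

for t in (0.1, 0.5, 1.3, 2.5, 3.0, 3.9):
    assert abs(sin_quarter_turns(t) - math.sin(t * math.pi / 2.0)) < 1e-12
```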