marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
DarkShadow44 has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
DarkShadow44 has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
<bloom>
"the shader cores... feature a constant execution and prefetch"
<dougall>
hmm... yeah, i've been wanting to get at those counters, but i'm not quite sure how to approach it
<dougall>
i think a granularity of four is possibly correct... register granularity is observable by using out-of-range register ids. occupancy would be 'floor(total / round_up(register_count))' right?
<bloom>
Possibly, possibly not.
<bloom>
Granularity of 4 is almost certainly the allocation of registers
<bloom>
but that might not be the allocation of threads
<bloom>
(that could be granularity of 8 instead, for example. etc)
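A minimal sketch of the occupancy formula dougall proposes above, with the allocation granularity as a parameter so the 4 versus 8 question can be compared directly. The function names and the register-file size used in the usage note are hypothetical illustrations, not measured hardware values.

```python
def round_up(value: int, granularity: int) -> int:
    """Round value up to the next multiple of granularity."""
    return -(-value // granularity) * granularity


def occupancy(total_registers: int, register_count: int, granularity: int = 4) -> int:
    """floor(total / round_up(register_count)), as proposed in the discussion.

    total_registers is a hypothetical register budget; granularity is the
    assumed allocation step (4 and 8 are the two candidates mentioned).
    """
    return total_registers // round_up(register_count, granularity)
```

For example, with a made-up budget of 256 registers and a shader using 9, `occupancy(256, 9, 4)` and `occupancy(256, 9, 8)` differ, which is the kind of gap a systematic register-pressure sweep could expose.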
Necrosporus has quit [Ping timeout: 252 seconds]
robinp has quit [Read error: Connection reset by peer]
robinp has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
Necrosporus has joined #asahi-gpu
phiologe has quit [Ping timeout: 250 seconds]
phiologe has joined #asahi-gpu
<bloom>
note to self: r5 appears preloaded with vertex id in vs
<bloom>
a bit of a hunch looking at a funny shader I compiled -- looks like integer ALU issues 2 instructions at once, maybe?
<bloom>
the way it schedules the scalarization of vec4 arithmetic is suggestive
<bloom>
dougall: Oh, ouch, embarrassing - misread the sqrt op
<bloom>
the thing I called sqrt, call it f(x)
<bloom>
it's actually sqrt(x) = f(x) * x
<bloom>
so it's in fact rsqrt
<bloom>
but we already.. had rsqrt
<dougall>
"Implemented as x * rsqrt(x) with special cases handled correctly" - so i guess how does sqrt(x) differ from x * rsqrt(x)?
<dougall>
(aside from precision)
<bloom>
Probably the usual suspects: NaN, Inf, signed zero
<bloom>
rsqrt(0) is probably NaN and NaN * 0 = NaN, yet sqrt(0) = 0
<bloom>
so rsqrt_special has to define rsqrt(0) to be finite (which is wrong)
<dougall>
ah, yeah, that'd make sense :)
<bloom>
also need sqrt(-0.0) = -0.0
<bloom>
which holds if we set rsqrt(-0.0) = 0.0 since 0.0 * -0.0 = -0.0
<bloom>
likewise, we want sqrt(+inf) = +inf
<bloom>
but rsqrt(+inf) = 0.0 and 0.0 * +inf = NaN (indeterminate form)
<bloom>
So rsqrt_special(+inf) needs to be some positive number.
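A small sketch of the reasoning above: computing sqrt(x) as x * rsqrt_special(x), where rsqrt_special deviates from a true reciprocal square root at zero and +inf so the product comes out right. This models the argument in ordinary Python floats, not the hardware op. The name `sqrt_via_rsqrt` and the choice of 1.0 at +inf are assumptions; the discussion only requires "some positive number" there.

```python
import math


def rsqrt_special(x: float) -> float:
    """Reciprocal square root with the special cases discussed above."""
    if math.isnan(x) or x < 0.0:
        return math.nan   # negative or NaN input stays NaN
    if x == 0.0:
        return 0.0        # covers +0.0 and -0.0; finite on purpose
    if math.isinf(x):
        return 1.0        # assumption: any positive finite value works
    return 1.0 / math.sqrt(x)


def sqrt_via_rsqrt(x: float) -> float:
    """sqrt(x) computed as x * rsqrt_special(x)."""
    return x * rsqrt_special(x)
```

With these choices the product keeps the sign of zero (-0.0 * 0.0 is -0.0) and avoids the 0.0 * +inf indeterminate form.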
The_DarkFire_[m] has joined #asahi-gpu
<bloom>
dougall: Ok, I just r/e'd thread count
<bloom>
actually r/e is a stretch
<bloom>
Just dumped maxTotalThreadsPerThreadgroup and varied register pressure systematically
<dougall>
ah, what's it look like?
<bloom>
So, in terms of the "register quadwords" field in the cmdstream:
<bloom>
(call that Q)
<bloom>
If Q <= 13, then you have 1024 threads.
<bloom>
If Q >= 14, you have less. I was about to say I had the formula but my formula is buggy, hang on
<dougall>
(is one register quadword like r0-r3 or like r0l-r1h?)
<dougall>
yeah... that number is right, but i'm a bit confused where the 24x comes from? like is the register file per-core? (i assume so, and that'd be an 8x, but then where's the 3x?)
<bloom>
dougall: Can't tell.
<bloom>
But the 24k is from apple's marketing
<bloom>
not sure where the 3x is.
<dougall>
yeah, i'm quite confused... "128 execution units", so 16 per core? when a simd-group is 32? is that like AMD's thing where you just put each half through the pipeline over two cycles?
<bloom>
dunno where the 128 execution units comes from
<bloom>
some of that is speculation from anandtech