#asahi-gpu on 2021-04-17 — irc logs at freenode.irclog.whitequark.org

2021-01-11 09:46 marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu

00:12 DarkShadow44 has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]

00:13 DarkShadow44 has joined #asahi-gpu

00:40 odmir has quit [Remote host closed the connection]

00:41 odmir has joined #asahi-gpu

00:45 odmir has quit [Ping timeout: 240 seconds]

01:12 odmir has joined #asahi-gpu

01:20 Emantor has quit [Quit: ZNC - http://znc.in]

01:22 Emantor has joined #asahi-gpu

01:30 odmir has quit [Remote host closed the connection]

01:30 odmir has joined #asahi-gpu

01:42 <bloom> piles of ALU

01:47 <bloom> dougall: hey, I have a reversing question for you

01:47 <bloom> I'd like to know what the thresholds are for register usage --> thread occupancy

01:48 <bloom> The cmdstream just specifies register count/4, but that seems more fine grained than real hw would implement

01:48 <bloom> But! Apple docs talk about occupancy perf counters?

01:48 <bloom> So it ought to be possible to figure it out from the other direction, like you've done with the cpu?

01:55 <bloom> Oh, even better maxTotalThreadsPerThreadgroup is literally in the Metal API. nice!

01:57 <bloom> "Advanced Metal Shader Optimization" from WWDC'16 is still relevant :)

01:58 <bloom> https://metalkit.org/2020/07/03/wwdc20-whats-new-in-metal.html has some goodies.

02:01 <bloom> "the shader cores... feature a constant execution and prefetch"

02:04 <dougall> hmm... yeah, i've been wanting to get at those counters, but i'm not quite sure how to approach it

02:07 <dougall> i think a granularity of four is possibly correct... register granularity is observable by using out-of-range register ids. occupancy would be 'floor(total / round_up(register_count))' right?

02:08 <bloom> Possibly, possibly not.

02:08 <bloom> Granularity of 4 is almost certainly the allocation of registers

02:08 <bloom> but that might not be the allocation of threads

02:09 <bloom> (that could be granularity of 8 instead, for example. etc)

02:34 Necrosporus has quit [Ping timeout: 252 seconds]

02:37 robinp has quit [Read error: Connection reset by peer]

02:38 robinp has joined #asahi-gpu

02:41 odmir has quit [Remote host closed the connection]

02:47 Necrosporus has joined #asahi-gpu

02:50 phiologe has quit [Ping timeout: 250 seconds]

02:50 phiologe has joined #asahi-gpu

02:52 <bloom> note to self: r5 appears preloaded with vertex id in vs

03:03 <bloom> a bit of a hunch looking at a funny shader I compiled -- looks like integer ALU issues 2 instructions at once, maybe?

03:03 <bloom> the way it schedules the scalarization of vec4 arithmetic is suggestive

03:10 <bloom> dougall: Oh, ouch, embarrassing - misread the sqrt op

03:10 <bloom> the thing I called sqrt, call it f(x)

03:10 <bloom> it's actually sqrt(x) = f(x) * x

03:11 <bloom> so it's in fact rsqrt

03:11 <bloom> but we already.. had rsqrt

03:16 <dougall> "Implemented as x * rsqrt(x) with special cases handled correctly" - so i guess how does sqrt(x) differ from x * rsqrt(x)?

03:16 <dougall> (aside from precision)

03:17 <bloom> Probably the usual suspects: NaN, Inf, signed zero

03:17 <bloom> rsqrt(0) is probably +Inf and +Inf * 0 = NaN, yet sqrt(0) = 0

03:18 <bloom> so rsqrt_special has to define rsqrt(0) to be finite (which is wrong)

03:18 <dougall> ah, yeah, that'd make sense :)

03:19 <bloom> also need sqrt(-0.0) = -0.0

03:19 <bloom> which holds if we set rsqrt(-0.0) = 0.0 since 0.0 * -0.0 = -0.0

03:20 <bloom> likewise, we want sqrt(+inf) = +inf

03:20 <bloom> but rsqrt(+inf) = 0.0 and 0.0 * +inf = NaN (indeterminate form)

03:21 <bloom> So rsqrt_special(+inf) needs to be some positive number.
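
The special-case reasoning above can be sketched in Python. This is only an illustration: the function name rsqrt_special is taken from the discussion, but the values it returns for 0, -0.0 and +inf are assumptions chosen to satisfy the constraints bloom lists, not values read back from the hardware.

```python
import math

def rsqrt_special(x: float) -> float:
    """Hypothetical rsqrt whose special cases make x * rsqrt_special(x) behave like sqrt(x)."""
    if x == 0.0:
        # Matches both +0.0 and -0.0. Returning a finite value is mathematically
        # wrong for rsqrt, but gives 0.0 * 0.0 = 0.0 and -0.0 * 0.0 = -0.0.
        return 0.0
    if math.isinf(x) and x > 0:
        # Any positive finite value works here (assumption: 1.0), since inf * 1.0 = inf.
        return 1.0
    if x < 0 or math.isnan(x):
        return math.nan
    return 1.0 / math.sqrt(x)

def sqrt_via_rsqrt(x: float) -> float:
    """sqrt(x) computed as x * rsqrt_special(x)."""
    return x * rsqrt_special(x)

for value in (4.0, 0.0, -0.0, math.inf, -1.0):
    print(value, sqrt_via_rsqrt(value))
```

With a plain rsqrt, the 0 and +inf inputs would both end in an inf * 0 product, which is NaN; the special-cased values avoid that.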

03:41 The_DarkFire_[m] has joined #asahi-gpu

03:46 <bloom> dougall: Ok, I just r/e'd thread count

03:46 <bloom> actually r/e is a stretch

03:46 <bloom> Just dumped maxTotalThreadsPerThreadgroup and varied register pressure systematically

03:47 <dougall> ah, what's it look like?

03:48 <bloom> So, in terms of the "register quadwords" field in the cmdstream:

03:48 <bloom> (call that Q)

03:48 <bloom> If Q <= 13, then you have 1024 threads.

03:50 <bloom> If Q >= 14, you have less. I was about to say I had the formula but my formula is buggy, hang on

03:50 <dougall> (is one register quadword like r0-r3 or like r0l-r1h?)

03:51 <bloom> r0-r3

03:52 * bloom just bruteforces

03:58 <bloom> dougall: https://github.com/AsahiLinux/gpu/commit/f6dba34f41d957820ae880ffdf8fce8e42b1ecd1

04:02 <bloom> I guess that's a linear allocation, just a lot of rounding needed

04:03 <bloom> So.. a 192kb register file

04:04 <bloom> uh no

04:04 <bloom> uh yes

04:05 <dougall> haha

04:05 <bloom> it's past midnight i cant units

04:05 <dougall> yeah, i think that all makes sense

04:06 <bloom> numbers still feel suspect.

04:06 <bloom> What am I missing

04:07 <bloom> SZ = (128*4)*384

04:07 <bloom> >>> SZ / (29*16)

04:07 <bloom> 423.7241379310345

04:07 <bloom> which is less than the 448 actually issued

04:08 <bloom> I guess this is all off-by-one

04:08 <bloom> uh no

04:11 <bloom> ah!

04:12 <bloom> Ahhh!

04:13 <bloom> If you take SZ = 384 * (32 * 4 * 4), everything is too small except the biggest

04:13 <bloom> but if you doubled that, you would expect the last thread count to double

04:13 <bloom> but if you take SZ to be somewhere in between, everything rounds right

04:14 <bloom> 212992 = ((384 + 448)/2) * 32 * 4 * 4

04:14 <bloom> ^ smack in the middle

04:14 <bloom> and indeed:

04:14 <bloom> >>> { x: math.floor((SZ_ / (x*4*4)) / 64) * 64 for x in range(13, 32) }

04:14 <bloom> {13: 1024, 14: 896, 15: 832, 16: 832, 17: 768, 18: 704, 19: 640, 20: 640, 21: 576, 22: 576, 23: 576, 24: 512, 25: 512, 26: 512, 27: 448, 28: 448, 29: 448, 30: 384, 31: 384}

04:15 <bloom> Expanding out:

04:15 <bloom> min(1024, math.floor((SZ_ / (math.ceil(reg_count / 4)*4*4)) / 64) * 64)
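
The brute-force step can be reproduced with a short search. This is a sketch, not bloom's actual script: it takes the thread counts quoted in the log as ground truth, assumes the rounding model in the expression above, and scans an arbitrarily chosen range of register file sizes (in 32-bit words) for those that reproduce every observation.

```python
# Thread counts observed for each "register quadwords" value Q (copied from the log).
OBSERVED = {
    13: 1024, 14: 896, 15: 832, 16: 832, 17: 768, 18: 704, 19: 640,
    20: 640, 21: 576, 22: 576, 23: 576, 24: 512, 25: 512, 26: 512,
    27: 448, 28: 448, 29: 448, 30: 384, 31: 384,
}

def threads(size_words: int, q: int) -> int:
    """Model from the log: threads come in groups of 64, capped at 1024."""
    words_per_thread = q * 4  # one quadword = 4 word-sized registers
    return min(1024, (size_words // words_per_thread) // 64 * 64)

def consistent_sizes(lo: int = 40000, hi: int = 70000) -> list[int]:
    """Register file sizes (in words) that match every observation.

    The search bounds lo/hi are arbitrary assumptions for this sketch.
    """
    return [
        size for size in range(lo, hi)
        if all(threads(size, q) == t for q, t in OBSERVED.items())
    ]

sizes = consistent_sizes()
print(min(sizes), max(sizes), min(sizes) * 4)
```

Under this model a narrow band of sizes fits the data; the smallest is 53248 words, i.e. the 212992 bytes used above.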

04:17 <dougall> nice!

04:22 <bloom> Here's a better formula

04:22 <bloom> The register file is M = 53248 words.

04:23 <bloom> Every thread requires R words from the register file.

04:23 <bloom> Threads may only be dispatched in groups of 64.

04:23 <bloom> No more than 1024 threads may be dispatched.

04:24 <bloom> Therefore, we may dispatch `min(1024, align_down(M / R, 64))` threads.

04:24 <bloom> (where align_down(x, y) = floor(x / y) * y)

04:25 mxw39 has quit [Ping timeout: 240 seconds]

04:25 mxw39 has joined #asahi-gpu

04:26 <bloom> Possible addendum: But threads can only require multiples of 4 word-sized registers.

04:27 <bloom> Therefore, we may dispatch `min(1024, align_down(M / align_up(R, 4), 64))` threads.

04:27 <bloom> (where align_up(x, y) = ceil(x / y) * y)
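
The formula above translates directly into Python. A minimal sketch, assuming R is counted in 32-bit registers (so one cmdstream quadword is 4 of them) and using the M = 53248 estimate from the log; the function name max_threads is made up for illustration.

```python
M_WORDS = 53248  # estimated register file size in 32-bit words (from the log)

def align_down(x: int, y: int) -> int:
    """Round x down to a multiple of y."""
    return (x // y) * y

def align_up(x: int, y: int) -> int:
    """Round x up to a multiple of y."""
    return -(-x // y) * y

def max_threads(reg_words: int) -> int:
    """Threads dispatchable for a shader using reg_words (>= 1) 32-bit registers."""
    per_thread = align_up(reg_words, 4)           # allocation granularity of 4
    fit = M_WORDS // per_thread                   # threads that fit in the file
    return min(1024, align_down(fit, 64))         # groups of 64, cap of 1024

for regs in (52, 53, 64, 124):
    print(regs, max_threads(regs))
```

For 52 registers (Q = 13) this gives the full 1024 threads; one more register rounds up to Q = 14 and drops to 896.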

04:29 method_ has joined #asahi-gpu

04:43 <bloom> Oh, lastly - https://en.wikipedia.org/wiki/Apple_M1#Architecture

04:43 <bloom> "total, the M1 GPU contains up to 128 EUs and 1024 ALUs,[12] which by Apple's claim can execute nearly 25,000 threads simultaneously"

04:43 <bloom> I assume "nearly 25,000" is marketing speak for 1024 * 24 = 24576 threads

04:44 <bloom> which means scaling our estimated size up by 24x to 212992 * 24 = 5111808 bytes ≈ 4.9 MiB of register file on the M1 GPU!
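
The scaling arithmetic, spelled out. The 24x factor is bloom's reading of the "nearly 25,000 threads" marketing figure, not a measured value, so the total is only as good as that guess.

```python
THREADS_PER_UNIT = 1024      # dispatch cap found above
BYTES_PER_UNIT = 212992      # estimated register file size per 1024-thread unit
SCALE = 24                   # assumption: "nearly 25,000" means 1024 * 24

total_threads = THREADS_PER_UNIT * SCALE
total_bytes = BYTES_PER_UNIT * SCALE
total_mib = total_bytes / 2**20

print(total_threads, total_bytes, total_mib)
```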

04:49 <bloom> Straight from the horse's mouth https://www.apple.com/mac/m1/

04:49 <bloom> 🐴

04:50 <dougall> yeah... that number is right, but i'm a bit confused where the 24x comes from? like is the register file per-core? (i assume so, and that'd be an 8x, but then where's the 3x?)

04:51 <bloom> dougall: Can't tell.

04:51 <bloom> But the 24k is from apple's marketing

04:52 <bloom> not sure where the 3x is.

04:53 <dougall> yeah, i'm quite confused... "128 execution units", so 16 per core? when a simd-group is 32? is that like AMD's thing where you just put each half through the pipeline over two cycles?

04:53 <bloom> dunno where the 128 execution units comes from

04:53 <bloom> some of that is speculation from anandtech

04:53 <dougall> https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested/3 <- the slide

04:54 <dougall> (that's from the announcement video iirc)

04:54 <bloom> Oh, I see.

04:56 <method_> interesting

04:56 <dougall> yeah, i'm sure it'll make more sense as we figure more out, but definitely plenty of puzzles left :)

04:57 <bloom> or not

04:57 <bloom> a lot of this stuff is irrelevant/invisible to software

07:51 vlixa has quit [Remote host closed the connection]

09:51 vlixa has joined #asahi-gpu

12:00 <glibc> oh, they don't use a fast sin at all? That's interesting indeed

12:28 odmir has joined #asahi-gpu

15:31 <bloom> Oh, there's a cute optimization I need to check the legitimacy of

15:32 <bloom> If two 16-bit values are in a packed register (r0l/r0h say) can we use bitop_mov on the 32-bit register (r0) for vectorization?

15:33 <bloom> I bet so.

15:34 <bloom> Likewise using mov_imm as a 32-bit thing
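
At the bit level the idea looks like this. The sketch only shows that one 32-bit copy is equivalent to two 16-bit copies when the halves are packed; it assumes r0l is the low half and r0h the high half of r0, and says nothing about whether the hardware's bitop_mov actually behaves this way, which is the open question.

```python
def pack(lo16: int, hi16: int) -> int:
    """Pack two 16-bit halves (assumed layout: low = r0l, high = r0h) into 32 bits."""
    return ((hi16 & 0xFFFF) << 16) | (lo16 & 0xFFFF)

def unpack(r32: int) -> tuple[int, int]:
    """Split a 32-bit register value into its (low, high) 16-bit halves."""
    return r32 & 0xFFFF, (r32 >> 16) & 0xFFFF

# Two separate 16-bit moves: r1l = r0l, r1h = r0h
r0 = pack(0x1234, 0xABCD)
r0l, r0h = unpack(r0)
r1_two_moves = pack(r0l, r0h)

# One 32-bit move: r1 = r0
r1_one_move = r0

print(hex(r1_two_moves), hex(r1_one_move))
```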

15:46 * bloom spun up an optimizer for agx

15:46 <bloom> handles the core floating point stuff

16:00 morelightning[m] has quit [Quit: Idle for 30+ days]

16:58 artemist has joined #asahi-gpu

17:24 odmir has quit [Remote host closed the connection]

17:25 odmir has joined #asahi-gpu

17:30 odmir has quit [Ping timeout: 240 seconds]

17:50 odmir has joined #asahi-gpu

18:21 m42uko has quit [Quit: Leaving.]

18:21 m42uko has joined #asahi-gpu

19:50 odmir has quit [Remote host closed the connection]

19:51 odmir has joined #asahi-gpu

19:55 odmir has quit [Ping timeout: 240 seconds]

20:25 odmir has joined #asahi-gpu

20:55 odmir has quit [Ping timeout: 240 seconds]

21:09 odmir has joined #asahi-gpu

21:10 vlixa has quit [Remote host closed the connection]

21:41 vlixa has joined #asahi-gpu

21:43 odmir has quit [Ping timeout: 268 seconds]

21:55 odmir has joined #asahi-gpu

22:04 odmir has quit [Remote host closed the connection]

22:05 odmir has joined #asahi-gpu

22:34 vlixa has quit [Remote host closed the connection]

22:38 vlixa has joined #asahi-gpu