#asahi-gpu on 2021-03-27 — irc logs at freenode.irclog.whitequark.org

2021-01-11 09:46 marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu

00:00 <chrisf> DarkShadow44, where's the packed format field?

00:04 <chrisf> the `u` fields may just be `unknown field X`

00:04 <DarkShadow44> Probably

00:04 <DarkShadow44> chrisf: Packed format field? You mean to specify what format the data has? AFAIK that's all Fx:F

00:05 <chrisf> ah

00:06 <chrisf> i had mapped out the bottom 2 bits of F as `size` but it makes sense if the packed formats use the rest

00:06 <DarkShadow44> For details I referred to https://github.com/dougallj/applegpu/blob/main/applegpu.py#L3875

00:06 <chrisf> aha

00:06 <chrisf> yeah, i agree that's what it is then :)

00:09 <DarkShadow44> I assume it's split into Fx:D because of the different instruction length, since all bytes that are cut off are assumed 0

00:09 <chrisf> yep
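
The zero-fill idea discussed above can be sketched in Python. This is a minimal illustration, not the applegpu decoder: the bit positions of `F` and `Fx`, the 8-byte full length, and the little-endian byte order are all hypothetical values chosen for the example, and the helper names are made up.

```python
# Hypothetical layout (assumption, not taken from applegpu.py):
# F sits in the early bytes, Fx in a byte that shorter encodings drop.
F_LO, F_WIDTH = 7, 3
FX_LO, FX_WIDTH = 48, 1


def decode_field(encoded: bytes, full_length: int, lo_bit: int, width: int) -> int:
    """Extract a bit field, treating bytes missing from a short encoding as zero."""
    if len(encoded) > full_length:
        raise ValueError("encoding is longer than the full instruction length")
    # Pad the truncated encoding back to full length with zero bytes.
    padded = encoded + bytes(full_length - len(encoded))
    word = int.from_bytes(padded, "little")
    return (word >> lo_bit) & ((1 << width) - 1)


def decode_format(encoded: bytes, full_length: int = 8) -> int:
    """Combine the split field as Fx:F, with Fx as the high bits."""
    f = decode_field(encoded, full_length, F_LO, F_WIDTH)
    fx = decode_field(encoded, full_length, FX_LO, FX_WIDTH)
    return (fx << F_WIDTH) | f


# Full-length encoding with Fx = 1 and F = 5.
long_form = ((1 << 48) | (5 << 7)).to_bytes(8, "little")
# The same instruction with the last two bytes cut off: Fx reads as 0.
short_form = long_form[:6]
```

With this layout, `decode_format(long_form)` gives 13 and `decode_format(short_form)` gives 5, because the byte holding `Fx` is among those cut off and is read back as zero.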

00:09 <DarkShadow44> although, it raises a question: Since registers are specified as Rx:R, maybe bit 31 is for another "cut off"?

00:09 <DarkShadow44> just a theory

00:10 <chrisf> it's an interesting theory

00:11 <chrisf> ive never seen the metal compiler do that though

00:11 <DarkShadow44> that doesn't mean we can't manually generate bytecode like that >:D

00:11 <chrisf> what is this `L` bit?

00:11 <DarkShadow44> should cut off the last two bytes

00:11 bpye has quit [Ping timeout: 240 seconds]

00:12 <chrisf> actually i dont think ive seen a non-8-byte load or store

00:13 <DarkShadow44> neither have I, but maybe they planned it into the hardware

00:13 <DarkShadow44> most instructions support such a cut off

00:13 <chrisf> ah, `L` is always that, i see the note up the top now.

00:13 bpye has joined #asahi-gpu

00:14 <chrisf> i would be careful about things that you never see the metal compiler do -- they may well not actually work in silicon

00:15 <chrisf> there's always stuff that doesnt work

00:15 <DarkShadow44> Sure thing, but that's where tests come in, no?

00:16 <chrisf> if you can show it does work then great

00:16 <DarkShadow44> I mean, I wouldn't necessarily use it, but it'd be good to know, IMHO

00:16 <DarkShadow44> no unknown bits is best

00:16 <DarkShadow44> although I gotta admit, I don't understand the benefit of having variable length instructions like that

00:17 <DarkShadow44> I thought an advantage of, for example, ARM was easier decoding due to fixed lengths

00:17 <chrisf> tradeoffs

00:18 <Yuzu> variable length ops = higher density (typically, if done right)

00:18 <chrisf> apple does appear willing to spend area making things not suck

00:19 <DarkShadow44> yeah, easier on the cache

00:21 odmir_ has joined #asahi-gpu

00:21 <chrisf> oh, i guess the reason i might not have seen short load/store is i dont have examples of the packed cases

00:22 <chrisf> and so `mask` is always in play in the examples i have

00:22 <DarkShadow44> mh, it's a bit odd that the mask is one of the parts that are cut off

00:22 <DarkShadow44> need to make tests for that

00:24 odmir has quit [Ping timeout: 240 seconds]

00:41 <bloom> that especially matters since G13 is a pure scalar arch, yet it's designed for graphics (vector) workloads

00:41 <bloom> If you're not careful about code density, you'll blow through your i-cache budget

00:42 <bloom> (Everything is 4x worse over in graphics land, since you repeat the same instruction a bunch of times.)

00:42 <bloom> Different vendors cope with this in different ways.

00:42 <bloom> Older GPU arches were genuinely vector, some newer ones (GCN, Bifrost) have a fp16vec2 thing going on, special oddball handling goes to Adreno which has a special "repeat N times" modifier :-p

00:43 <DarkShadow44> huh, interesting

00:43 <DarkShadow44> what exactly is that "fp16vec2" thing?

00:43 <bloom> It's.. cute

00:44 <bloom> Some AMD and Mali GPUs are scalar for 32-bit instructions, but allow 2 channel 16-bit vectors (what AMD markets as rapid packed math or sth)

00:45 <bloom> Mali even has 4 channel 8-bit vectors targeting ML workloads

00:45 <bloom> It follows naturally from 32-bit arithmetic. I.e. packed 2x16-bit add is the same as 32-bit add up to handling of carry bits.

00:46 <bloom> ^ integer
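
The integer case above can be checked with a small Python sketch. The function names are made up for illustration, and the carry-suppression trick shown is a generic software technique, not a description of how any particular GPU implements it.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
HIGH_BITS = 0x80008000  # top bit of each 16-bit lane


def pack2x16(lo: int, hi: int) -> int:
    """Pack two 16-bit lanes into one 32-bit word (values wrap to 16 bits)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def add32(a: int, b: int) -> int:
    """Plain 32-bit wrapping add."""
    return (a + b) & MASK32


def add2x16(a: int, b: int) -> int:
    """Reference packed add: each 16-bit lane wraps independently."""
    lo = (a & MASK16) + (b & MASK16)
    hi = (a >> 16) + (b >> 16)
    return pack2x16(lo, hi)


def add2x16_via_add32(a: int, b: int) -> int:
    """Packed add built from one 32-bit add with the lane carry suppressed."""
    # Add the low 15 bits of each lane; the sum cannot carry into the next lane.
    partial = (a & ~HIGH_BITS & MASK32) + (b & ~HIGH_BITS & MASK32)
    # Restore each lane's top bit with XOR, which adds without carrying out.
    return (partial ^ ((a ^ b) & HIGH_BITS)) & MASK32


no_carry = (pack2x16(1, 2), pack2x16(3, 4))
carry = (pack2x16(0xFFFF, 1), pack2x16(1, 1))
```

For `no_carry` the plain 32-bit add and the packed add agree. For `carry` the low lane overflows, so `add32` leaks a carry into the high lane (0x00030000) while both packed versions give 0x00020000.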

00:46 <bloom> and fp16 is so much cheaper in hw than fp32 that it again works out

00:46 <bloom> So in theoretical benchmarks, it means a 2x reduction in cycle count for 16-bit vs 32-bit workloads

00:46 <bloom> ...in theory. In practice it's a pain for compilers, sometimes more so than plain old vec4 hardware.

00:47 <bloom> Code like `a.xy + b.xz` can't be vectorized (exercise: why not?)
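
One plausible reading of the exercise, sketched in Python under a stated assumption: 16-bit components are packed two per 32-bit register as (x, y) and (z, w), and a packed operand must come from a single register. The layout and the helper names are hypothetical, and real hardware may offer extra lane swizzles that this ignores.

```python
COMPONENTS = "xyzw"


def source_registers(swizzle: str) -> set:
    """Return the 32-bit register indices a 16-bit swizzle reads from,
    assuming two components per register: (x, y) then (z, w)."""
    return {COMPONENTS.index(c) // 2 for c in swizzle}


def fits_one_packed_operand(swizzle: str) -> bool:
    """True if a two-component swizzle reads from exactly one register."""
    return len(swizzle) == 2 and len(source_registers(swizzle)) == 1
```

Under this assumption `fits_one_packed_operand("xy")` is true, while `fits_one_packed_operand("xz")` is false because `x` and `z` live in different registers, so a compiler would need extra moves to repack `b.xz` before a single packed add could be used.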

00:47 <DarkShadow44> mh, I see

00:48 <bloom> Anyway, G13 does *not* work this way.

00:48 <DarkShadow44> heh, should make things easier

00:48 <bloom> Indeed.

00:49 <bloom> I really hate how inactive I've been

00:49 <bloom> Dealing with personal stuff but still :<

00:51 <DarkShadow44> I know that feeling

00:54 <bloom> What are your relevant interests ?

01:05 <DarkShadow44> What do you mean?

01:42 odmir_ has quit [Remote host closed the connection]

01:42 odmir has joined #asahi-gpu

01:47 odmir has quit [Ping timeout: 260 seconds]

02:00 JusticeEX has joined #asahi-gpu

02:20 Emantor has quit [Quit: ZNC - http://znc.in]

02:20 Emantor has joined #asahi-gpu

03:26 phiologe has quit [Ping timeout: 250 seconds]

03:26 phiologe has joined #asahi-gpu

04:03 odmir has joined #asahi-gpu

04:08 odmir has quit [Ping timeout: 265 seconds]

06:06 rwhitby has quit [Ping timeout: 258 seconds]

06:22 TheJollyRoger has quit [Remote host closed the connection]

06:22 TheJollyRoger has joined #asahi-gpu

06:56 TheJollyRoger has quit [Remote host closed the connection]

06:59 TheJollyRoger has joined #asahi-gpu

08:21 rwhitby has joined #asahi-gpu

10:37 zkrx has quit [Ping timeout: 265 seconds]

10:46 zkrx has joined #asahi-gpu

10:55 rwhitby has quit [Ping timeout: 258 seconds]

11:52 JusticeEX has quit [Ping timeout: 240 seconds]

16:06 Necrosporus has quit [Read error: Connection reset by peer]

16:06 Necrosporus has joined #asahi-gpu

17:07 Baughn has quit [Read error: Connection reset by peer]

17:22 Necrosporus is now known as Guest59740

17:22 Guest59740 has quit [Killed (barjavel.freenode.net (Nickname regained by services))]

17:22 Necrosporus has joined #asahi-gpu

17:23 Baughn has joined #asahi-gpu

17:27 odmir has joined #asahi-gpu

17:37 JusticeEX has joined #asahi-gpu

18:34 odmir has quit [Remote host closed the connection]

18:35 odmir has joined #asahi-gpu

18:42 odmir has quit [Ping timeout: 268 seconds]

18:47 phiologe has quit [Ping timeout: 250 seconds]

18:47 phiologe has joined #asahi-gpu

19:07 odmir has joined #asahi-gpu

19:21 Baughn has quit [Ping timeout: 260 seconds]

19:39 odmir has quit [Ping timeout: 246 seconds]

19:40 Baughn has joined #asahi-gpu

20:00 odmir has joined #asahi-gpu

20:34 odmir has quit [Ping timeout: 260 seconds]

20:38 JusticeEX has quit [Ping timeout: 268 seconds]

20:48 odmir has joined #asahi-gpu

21:20 odmir has quit [Ping timeout: 240 seconds]

21:34 odmir has joined #asahi-gpu

22:08 odmir has quit [Ping timeout: 268 seconds]

22:20 odmir has joined #asahi-gpu

22:34 solarkraft has quit [Quit: Bye!]

22:38 JusticeEX has joined #asahi-gpu

22:53 odmir has quit [Ping timeout: 246 seconds]

23:01 odmir has joined #asahi-gpu