marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<chrisf>
DarkShadow44, where's the packed format field?
<chrisf>
the `u` fields may just be `unknown field X`
<DarkShadow44>
Probably
<DarkShadow44>
chrisf: Packed format field? You mean to specify what format the data has? AFAIK that's all Fx:F
<chrisf>
ah
<chrisf>
i had mapped out the bottom 2 bits of F as `size` but it makes sense if the packed formats use the rest
<DarkShadow44>
I assume it's split into Fx:D because of the different instruction length, since all bytes that are cut off are assumed 0
<chrisf>
yep
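A minimal sketch of the "cut off bytes are assumed 0" idea discussed above. The bit positions of `F` and `Fx`, the 8-byte full length, and the little-endian byte order are all assumptions made up for illustration, not the real encoding; `pad_instruction`, `split_field`, and `decode_format` are hypothetical helper names.

```python
FULL_LEN = 8  # assumed full instruction length in bytes

# Hypothetical layout (shift, width): the low part F sits early in the
# instruction, the extension Fx sits in the last two bytes that a short
# encoding drops.
F_LO = (8, 3)
F_HI = (48, 2)


def pad_instruction(raw: bytes, full_len: int = FULL_LEN) -> int:
    """Zero-extend a shortened encoding to the full length (little-endian)."""
    if len(raw) > full_len:
        raise ValueError("encoding is longer than the full instruction length")
    return int.from_bytes(raw.ljust(full_len, b"\x00"), "little")


def split_field(word: int, lo_shift: int, lo_bits: int,
                hi_shift: int, hi_bits: int) -> int:
    """Join a low part and an extension part into one value (hi:lo)."""
    lo = (word >> lo_shift) & ((1 << lo_bits) - 1)
    hi = (word >> hi_shift) & ((1 << hi_bits) - 1)
    return (hi << lo_bits) | lo


def decode_format(raw: bytes) -> int:
    """Decode the combined Fx:F value from a full or shortened encoding."""
    return split_field(pad_instruction(raw), *F_LO, *F_HI)
```

With this layout, dropping the last two bytes simply makes the `Fx` part read as zero, so the short form can only express values that fit in `F` alone.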
<DarkShadow44>
although, it raises a question: Since registers are specified as Rx:R, maybe bit 31 is for another "cut off"?
<DarkShadow44>
just a theory
<chrisf>
it's an interesting theory
<chrisf>
ive never seen the metal compiler do that though
<DarkShadow44>
that doesn't mean we can't manually generate bytecode like that >:D
<chrisf>
what is this `L` bit?
<DarkShadow44>
should cut off the last two bytes
bpye has quit [Ping timeout: 240 seconds]
<chrisf>
actually i dont think ive seen a non-8-byte load or store
<DarkShadow44>
neither have I, but maybe they planned it into the hardware
<DarkShadow44>
most instructions support such a cut off
<chrisf>
ah, `L` is always that, i see the note up the top now.
bpye has joined #asahi-gpu
<chrisf>
i would be careful about things that you never see the metal compiler do -- they may well not actually work in silicon
<chrisf>
there's always stuff that doesnt work
<DarkShadow44>
Sure thing, but that's where tests come in, no?
<chrisf>
if you can show it does work then great
<DarkShadow44>
I mean, I wouldn't necessarily use it, but it'd be good to know, IMHO
<DarkShadow44>
no unknown bits is best
<DarkShadow44>
although I gotta admit, I don't understand the benefit of having variable length instructions like that
<DarkShadow44>
I thought an advantage of, for example, ARM was easier decoding due to fixed lengths
<chrisf>
tradeoffs
<Yuzu>
variable length ops = higher density (typically, if done right)
<chrisf>
apple does appear willing to spend area making things not suck
<DarkShadow44>
yeah, easier on the cache
odmir_ has joined #asahi-gpu
<chrisf>
oh, i guess the reason i might not have seen short load/store is i dont have examples of the packed cases
<chrisf>
and so `mask` is always in play in the examples i have
<DarkShadow44>
mh, it's a bit odd that the mask is one of the parts that are cut off
<DarkShadow44>
need to make tests for that
odmir has quit [Ping timeout: 240 seconds]
<bloom>
that especially matters since G13 is a pure scalar arch, yet it's designed for graphics (vector) workloads
<bloom>
If you're not careful about code density, you'll blow through your i-cache budget
<bloom>
(Since everything is 4x worse over in graphics land, since you repeat the same instruction a bunch of times.)
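A rough sketch of the "4x worse" point: on a scalar ISA, one vec4 operation becomes four scalar instructions, so the bytes per instruction are multiplied by four in the i-cache. The mnemonic, register names, and instruction sizes below are made up for illustration and are not real G13 encodings; `scalarize` and `code_bytes` are hypothetical helpers.

```python
def scalarize(op: str, dst: str, srcs: list, width: int = 4) -> list:
    """Expand one vector op into `width` scalar ops, one per component."""
    comps = "xyzw"[:width]
    return [
        f"{op} {dst}.{c}, " + ", ".join(f"{s}.{c}" for s in srcs)
        for c in comps
    ]


def code_bytes(n_instrs: int, bytes_per_instr: int) -> int:
    """Total code size for n_instrs instructions of equal size."""
    return n_instrs * bytes_per_instr


ops = scalarize("fadd", "r0", ["r1", "r2"])
# Assumed sizes: 8 bytes for a full encoding, 6 for a shortened one.
full = code_bytes(len(ops), 8)
short = code_bytes(len(ops), 6)
```

Under these assumed sizes, the shortened encoding saves 8 bytes on a single vec4 add, and the saving repeats for every vector operation in the shader.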
<bloom>
Different vendors cope with this in different ways.
<bloom>
Older GPU arches were genuinely vector, some newer ones (GCN, Bifrost) have a fp16vec2 thing going on, special oddball handling goes to Adreno which has a special "repeat N times" modifier :-p
<DarkShadow44>
huh, interesting
<DarkShadow44>
what exactly is that "fp16vec2" thing?
<bloom>
It's.. cute
<bloom>
Some AMD and Mali GPUs are scalar for 32-bit instructions, but allow 2 channel 16-bit vectors (what AMD markets as rapid packed math or sth)
<bloom>
Mali even has 4 channel 8-bit vectors targeting ML workloads
<bloom>
It follows naturally from 32-bit arithmetic. I.e. packed 2x16-bit add is the same as 32-bit add up to handling of carry bits.
<bloom>
^ integer
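The integer claim above can be checked with a small sketch: a packed 2x16-bit add differs from a plain 32-bit add only in that the carry out of bit 15 is not propagated into bit 16. The helper names here are hypothetical and this models the arithmetic only, not any particular GPU.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack2x16(lo: int, hi: int) -> int:
    """Pack two 16-bit lanes into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def add32(a: int, b: int) -> int:
    """Plain 32-bit wrapping add."""
    return (a + b) & MASK32


def add2x16(a: int, b: int) -> int:
    """Packed add: each 16-bit lane wraps on its own, no cross-lane carry."""
    lo = (a + b) & MASK16
    hi = ((a >> 16) + (b >> 16)) & MASK16
    return (hi << 16) | lo


# No overflow in the low lane: both adds agree.
same = add2x16(pack2x16(1, 2), pack2x16(3, 4)) == add32(pack2x16(1, 2), pack2x16(3, 4))

# Low lane overflows: the 32-bit add leaks a carry into the high lane.
a, b = pack2x16(0xFFFF, 1), pack2x16(1, 1)
leak = add32(a, b) - add2x16(a, b)
```

In hardware terms, the same adder can serve both modes if the carry chain can be broken between bits 15 and 16.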
<bloom>
and fp16 is so much cheaper in hw than fp32 that it again works out
<bloom>
So in theoretical benchmarks, it means a 2x reduction in cycle count for 16-bit vs 32-bit workloads
<bloom>
...in theory. In practice it's a pain for compilers, sometimes more so than plain old vec4 hardware.
<bloom>
Code like `a.xy + b.xz` can't be vectorized (exercise: why not?)
<DarkShadow44>
mh, I see
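One plausible answer to the exercise, sketched under an assumed register layout: if a 16-bit vec4 is stored as two 32-bit registers (`xy` in one, `zw` in the other) and a packed instruction reads both lanes of an operand from a single register, then `a.xy` is one register but `b.xz` straddles two. The layout and the helper names are assumptions for illustration, not a description of any specific GPU.

```python
# Assumed layout: component -> (32-bit register index, 16-bit half)
LAYOUT = {"x": (0, 0), "y": (0, 1), "z": (1, 0), "w": (1, 1)}


def packed_source(swizzle: str):
    """Return the register index if all components share one register, else None."""
    regs = {LAYOUT[c][0] for c in swizzle}
    return regs.pop() if len(regs) == 1 else None


def can_vectorize(a_swz: str, b_swz: str) -> bool:
    """True if both operands can each be read from a single 32-bit register."""
    return packed_source(a_swz) is not None and packed_source(b_swz) is not None
```

Under this model `can_vectorize("xy", "xz")` is false: the compiler would first have to repack `b.x` and `b.z` into one register, which can cost as much as the packed add saves.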
<bloom>
Anyway, G13 does *not* work this way.
<DarkShadow44>
heh, should make things easier
<bloom>
Indeed.
<bloom>
I really hate how inactive I've been
<bloom>
Dealing with personal stuff but still :<
<DarkShadow44>
I know that feeling
<bloom>
What are your relevant interests ?
<DarkShadow44>
What do you mean?
odmir_ has quit [Remote host closed the connection]