<s1dev>
awygle, I was thinking about some stuff and in the short term, HPWL is probably computable with SIMD -> GPU acceleration
<awygle>
s1dev: how so? I think the final reduction can be but the per-net math doesn't seem amenable
s1dev has quit [Remote host closed the connection]
s1dev has joined ##openfpga
<s1dev>
awygle, for a given placement, isn't HPWL just a matter of looping through the nets without branching?
<awygle>
s1dev: that's true... not all the nets are the same width, but I guess that doesn't matter much
<awygle>
oh but nodes can be on many nets, so you can't store nodes on the same net near each other without duplication (which might be fine)
<s1dev>
but if you have some replicas, then you're fine
* awygle
is visualizing memory layouts while cooking dinner
<s1dev>
you just compute the HPWL of a particular net across all the replicas
<awygle>
yeah that's fine at the cost of your update operation being somewhat more expensive
<awygle>
probably not a lot tho and that happens way less often
<s1dev>
it's just a matter of reordering some loops
<s1dev>
in the case of SA you'd just run multiple restarts simultaneously. PA and PT naturally have a bunch of replicas to parallelize
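The branch-free per-net loop s1dev describes might look like the following sketch. The data layout (a coordinate array per axis plus a pin list per net, with nodes duplicated across the nets they belong to, as discussed above) is an assumption for illustration, not nextpnr's actual representation:

```python
def hpwl(xs, ys, net_pins):
    """Total half-perimeter wirelength for one placement.

    xs, ys   : x/y coordinate of every node, indexed by node id
    net_pins : for each net, the list of node ids on that net
               (a node on several nets simply appears in several lists)
    """
    total = 0
    for pins in net_pins:
        px = [xs[p] for p in pins]
        py = [ys[p] for p in pins]
        # half-perimeter of this net's bounding box; no data-dependent
        # branching, so each net maps cleanly onto a SIMD lane / GPU thread
        total += (max(px) - min(px)) + (max(py) - min(py))
    return total
```

Each net's bounding-box reduction is independent, so a GPU version would assign nets to threads and finish with a single sum reduction.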
<awygle>
hm I guess I'm picturing a different thing
<awygle>
gimme like 30m for dinner and then I'll try again
<sorear>
surely for SA you can evaluate a few dozen swaps in parallel, and apply the chosen and non-conflicting ones
<sorear>
since conflicts will be relatively rare
<s1dev>
sounds like there might be some branching involved in that
<sorear>
maybe a little. i'd need to work out a lot more details
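sorear's batching idea could be sketched as below. This is purely illustrative: the Metropolis accept/reject step is omitted, and the greedy keep-first conflict policy is one of several reasonable choices:

```python
def filter_nonconflicting(swaps):
    """Given a batch of proposed swaps (a, b) whose deltas were
    evaluated in parallel, greedily keep a subset that touches each
    cell at most once, so the kept swaps can all be applied."""
    used = set()
    kept = []
    for a, b in swaps:
        if a in used or b in used:
            continue  # conflicts with an already-accepted swap in this batch
        used.update((a, b))
        kept.append((a, b))
    return kept
```

If batches are a few dozen swaps drawn from a large placement, most pairs touch disjoint cells, matching sorear's point that conflicts should be relatively rare.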
<sorear>
other question: is there a foss tool that will run ordinary C++ code on a gpu
GenTooMan has quit [Quit: Leaving]
<sorear>
(in order to share as much arch code as possible between the gpu placer and the current placer)
<s1dev>
well, I've heard that CUDA these days is just a matter of using their STL implementations and keywords for marking kernel code
<s1dev>
*GPU kernels
digshadow has quit [Ping timeout: 260 seconds]
pie___ has quit [Remote host closed the connection]
pie___ has joined ##openfpga
s1dev has quit [Quit: Leaving]
<awygle>
yeah okay I thought about it more and it does totally work, you have to do some memory movement but there's plenty of work to use to hide the latency
Bike has quit [Quit: Lost terminal]
s1dev has joined ##openfpga
<rqou>
offtopic: the HTME youtube channel is apparently teaching me that modern glass is an amazing technology
<rqou>
or just modern materials science in general
<pie___>
oh. shit. "Here's where formal methods come into play. We'll be using 'yosys-smtbmc', which is a flow that involves running the Yosys synthesizer with an SMT2 backend, and then feeding those SMT2 circuit descriptions into SMT2 solvers. These circuit 'models' can be used for different modes of operation of the solver:"
<pie___>
obviously im missing a lot of contextual knowledge but i wouldnt have thought of going through verilog lol
<daveshah>
I've used Verilog in the past for formal analysis of stuff other than circuits tbh. Because I'm much quicker writing Verilog than any proper formal language, etc
rohitksingh has quit [Quit: Leaving.]
pie___ has quit [Quit: Leaving]
GenTooMan has joined ##openfpga
<cr1901>
SMTv2 is s-expr based, so it's not that bad to write by hand
<sorear>
naive q: nextpnr has recently gained the ability to insert luts into nets. Should there be some kind of no_new_glitches per-net option to prevent this?
<awygle>
glitches_get_stitches
<daveshah>
sorear: FPGA synthesis is not glitch free for so many reasons. If you're worried about them, inserting the odd LUT is the least of your worries
<daveshah>
I'm not even sure what a pass thru LUT would cause that other interconnect doesn't use anyway
<sorear>
You can avoid the entire synthesis problem by manually instantiating primitives. PnR seems inescapable though
<daveshah>
Can you actually show what kind of glitch you mean? I'm not convinced a pass thru LUT is actually worse than interconnect (but might be wrong), or other architectural stuff like the fact each LUT input has a different delay
<sorear>
logic synthesis *in general* introduces glitches because logical equivalence allows that
<sorear>
A LUT can turn one input edge into multiple output edges
<daveshah>
I would be interested to know if a pass thru LUT also did that in practice
<sorear>
Not sure if that’s possible in the specific case of an ice40 lut being used for pass through
<daveshah>
I can see how a LUT with more than one utilised inout could
<daveshah>
*input
<daveshah>
The thing is, if one says that, then one could also extend that to the fact that interconnect buffers could introduce glitches
<sorear>
Do we have transistor-level schematics for a LUT in any product?
<daveshah>
Not sure, might be for some academic architectures at least
<daveshah>
It's typically a cascade of muxes though
<sorear>
It’s much less plausible for interconnect buffers, since there is one obvious way to do an enable buffer and it doesn’t glitch
<daveshah>
True
X-Scale has quit [Ping timeout: 264 seconds]
X-Scale has joined ##openfpga
rohitksingh has joined ##openfpga
rohitksingh has quit [Quit: Leaving.]
digshadow has quit [Quit: Leaving.]
digshadow has joined ##openfpga
<sorear>
after giving the matter more thought, i can think of 3 ways to implement an N:1 mux (N pass transistors + buffer; N tristate buffers; N AND + 1 OR), and none of them will produce glitches in the pass-through case
<sorear>
however. none of these methods can take advantage of the fact that a sram fpga has dual-rail *data* inputs. which makes me wonder if there's a different approach that'd be used instead
X-Scale has quit [Ping timeout: 256 seconds]
<cr1901>
>A LUT can turn one input edge into multiple output edges
<cr1901>
Could you elaborate w/ a toy example?
<cr1901>
(must be combining LUTs where this happens, not just a single LUT)
<sorear>
cr1901: let's say you have a 2:1 MUX implemented as an AND/OR tree, both data inputs are 1, the output is 1
<sorear>
cr1901: now say the select input changes from 0 to 1. depending on the order the decoder lines change, the MUX output could briefly go 0 before returning to 1
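sorear's toy example can be simulated directly. The event orderings below are assumptions about which decoder line settles first; the point is that one ordering passes through an all-zero decode state:

```python
def mux_and_or(a, b, sel, nsel):
    """2:1 MUX built as (a AND nsel) OR (b AND sel)."""
    return (a & nsel) | (b & sel)

def output_trace(a, b, events):
    """Apply successive (sel, nsel) decoder states, recording the output."""
    return [mux_and_or(a, b, s, ns) for s, ns in events]

# both data inputs are 1; the select input changes 0 -> 1
# case 1: nsel falls before sel rises -> output briefly drops to 0
glitchy = output_trace(1, 1, [(0, 1), (0, 0), (1, 0)])
# case 2: sel rises before nsel falls -> output stays 1 throughout
clean = output_trace(1, 1, [(0, 1), (1, 1), (1, 0)])
```

So one input edge (on select) becomes multiple output edges whenever the transient decode state deselects both data inputs.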
<sorear>
i have just found out what SAED stands for o_O
<daveshah>
I think such an option in nextpnr would nonetheless be useful for debugging if nothing else (likewise being able to disable lut input permutation)
<daveshah>
LUT input permutation could introduce extra glitches if you were relying on the exact mux structure
<daveshah>
The different delays of the LUT inputs are characterised in the ice40 timing model
<awygle>
SAED is supercool
<awygle>
i love electron microscopes
<awygle>
even if all the pictures make me very uncomfortable
X-Scale has joined ##openfpga
mumptai has joined ##openfpga
<sorear>
(more generally: i want the open tools to have features for people to get excited about other than just "it's open". do at least a couple things the vendor tools simply can't)
<daveshah>
Yes, that's very much the spirit of nextpnr
<daveshah>
That's why we are working on a Python API, nice GUI, bitstream reader, etc
<awygle>
daveshah: in your opinion how ready is nextpnr for plugging in alternative/experimental placement algorithms?
<daveshah>
awygle: should be ready already
<awygle>
hmmm
<daveshah>
For anything parallel, you'd need to have a first pass to combine Bels into tiles. But we have some API functions for working with Bels by tile for that purpose
<awygle>
to use ice40 parlance, Bels are LCs and tiles are PLBs? or are Bels even lower level?
<daveshah>
Bels are LCs
<daveshah>
Tiles are PLBs, but lots of ice40 stuff just calls them tiles
<awygle>
ok. so you can end up with an illegal placement due to carries, but you don't have to pack LUTs with FFs
<daveshah>
You can end up with an illegal placement for many more reasons than carries
<daveshah>
The Arch API provides functions to check validity of arch specific stuff
<daveshah>
Carries are specified and validated using relative placement constraints
<awygle>
hm. well i'll poke at the source ... eventually. thanks
<daveshah>
All that commit changes is the handle to access them
Bike has quit [Ping timeout: 240 seconds]
<sorear>
intuition: the working representation of a SA PNR between each tick is equivalent to a bitstream, and it should be possible for it to use roughly the same amount of space
<awygle>
the first half is close to true, i'd say it may be slightly _less_ information than a bitstream because routing may not be fully defined
<awygle>
the second half is probably true but i don't think you'd _want_ to represent it in a way that allowed it to take up the same amount of space
mumptai has quit [Quit: Verlassend]
<sorear>
there's a lot to be said for fitting in cache
<prpplague>
i always prefer cash
<awygle>
cache money
<awygle>
i should calculate how much bit-twiddling equals a cache miss
<awygle>
at L1 L2 and L3
<awygle>
just so i know
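A back-of-envelope version of awygle's question. The latency and throughput numbers here are illustrative round figures, not measurements of any particular CPU:

```python
# rough load-to-use latencies in cycles; real numbers vary widely
# by microarchitecture and are worth measuring on the target machine
ACCESS_CYCLES = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 200}
ALU_OPS_PER_CYCLE = 4  # optimistic superscalar integer throughput

def ops_hidden_by(level):
    """Bit-twiddling ops you could retire while stalled on one access."""
    return ACCESS_CYCLES[level] * ALU_OPS_PER_CYCLE
```

By these (assumed) numbers a DRAM miss costs on the order of hundreds of cheap ALU ops, which is the usual argument for keeping the annealer's working set cache-resident.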
* prpplague
takes a break from board assembly to read the channel log
wpwrak has quit [Read error: Connection reset by peer]