m_t has quit [Read error: Connection reset by peer]
sunxi_fan has joined ##openfpga
<jcreus>
daveshah: you mentioned it'd be interesting to have an analytic placer going. I might try to give it a shot - I don't expect it to be quick, given time limitations and having to figure out the codebase, but to get started I've been reading placer1 and router1. I was wondering if there's a set of benchmarks to compare results between PnRs while working on it?
<jcreus>
okay crap sorry I should've searched more
<daveshah>
no worries, it's not very well published
<daveshah>
that benchmarks nextpnr against old versions of itself and arachne-pnr
<jcreus>
what's the current philosophy for the optimization objectives - i.e. the tradeoff between size and speed?
<jcreus>
like the analytic stuff I've been thinking about while showering would have the ability to trade those off, I think, and some of the literature I've read does similar things
<daveshah>
At the moment I feel we mostly aim for Fmax, we don't try and optimise for size
<daveshah>
so long as it fits
<jcreus>
right, makes sense
<tnt>
Fmax FTW !
<jcreus>
I've also seen people say that it doesn't compare great against commercial stuff, but to me it looks pretty good vs Lattice's own tools - is the worry that the current system doesn't scale to bigger chips?
<tnt>
One thing the placer does really badly at the moment is dealing with fixed blocks. Things like SPRAMs, for instance, that are essentially unmoveable. It won't occur to the placer to shift _all_ the luts closer to the SPRAM.
<daveshah>
Scaling is definitely a problem with the current placer in terms of runtime
<daveshah>
SA isn't great for bigger parts
<jcreus>
right
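(For context - a minimal sketch, not nextpnr code, of why an analytic placer naturally handles fixed blocks like the SPRAMs tnt mentioned: fixed bels enter the quadratic wirelength objective as constants, so every movable cell connected to them gets pulled toward them when the system is solved. The cell names, net list and coordinates below are invented.)

    import numpy as np

    # toy 1D quadratic placement: minimise sum over nets of (x_a - x_b)^2
    # cells 0..2 are movable, "spram" is fixed at x = 20, "io" fixed at x = 0
    fixed = {"spram": 20.0, "io": 0.0}
    nets = [("io", 0), (0, 1), (1, 2), (2, "spram")]   # two-pin nets

    n_movable = 3
    A = np.zeros((n_movable, n_movable))   # Laplacian of movable-movable connections
    b = np.zeros(n_movable)                # pull exerted by fixed cells
    for u, v in nets:
        for a, c in ((u, v), (v, u)):
            if isinstance(a, int):
                A[a, a] += 1.0
                if isinstance(c, int):
                    A[a, c] -= 1.0
                else:
                    b[a] += fixed[c]
    x = np.linalg.solve(A, b)
    print(x)    # [ 5. 10. 15.] - evenly spread between the two fixed anchors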
<jcreus>
also, I realize it's absolutely none of my business
<jcreus>
and I'm just starting out so I might be missing the greater picture
<jcreus>
but for cases like the one tnt mentioned that the current placer handles poorly, would it make sense to look for ice40 designs on github semi-randomly and add them liberally to the benchmarking repo?
<daveshah>
yes, that would be awesome
<jcreus>
3 designs might not be very useful for comparison - for linear programming, for instance, progress is really nice to track, since there's a standard library of thousands of linear and mixed-integer programs
sunxi_fan has quit [Read error: Connection reset by peer]
sunxi_fan has joined ##openfpga
pie__ has joined ##openfpga
jcreus has quit [Remote host closed the connection]
rohitksingh has quit [Ping timeout: 272 seconds]
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark pushed 2 commits to master [+7/-5/±7] https://git.io/fhkuk
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark 86d3621 - Rearrange the code for a nicer layout.
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark 22b299d - Convert everything to use nMigen. Yay!
rohitksingh has joined ##openfpga
sunxi_fan has left ##openfpga [##openfpga]
jcreus has joined ##openfpga
zng has quit [Quit: ZNC 1.8.x-nightly-20181211-72c5f57b - https://znc.in]
GuzTech has quit [Ping timeout: 250 seconds]
GuzTech has joined ##openfpga
zng has joined ##openfpga
pie__ has quit [Remote host closed the connection]
pie__ has joined ##openfpga
gruetzkopf has quit [Remote host closed the connection]
gruetzkopf has joined ##openfpga
GuzTech has quit [Ping timeout: 272 seconds]
<jcreus>
sorry to keep going with the noob nextpnr questions, but for the ice40 case, the GUI seems to suggest that the individual bels are each of the 8 logic cells (as opposed to the full PLB)?
<jcreus>
how are the shared CEN/CLK signals handled?
<jcreus>
that are shared across those 8
<jcreus>
since the SA could otherwise scatter them around, far away from each other
tmeissner has joined ##openfpga
m_w has joined ##openfpga
GuzTech has joined ##openfpga
<tnt>
Oh, my CPU executed its first few instructions, so cute :P
<tnt>
jcreus: there is a validity check to make sure a BEL doesn't conflict with other ones in the same PLB
<jcreus>
tnt: I see, thanks!
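(A rough sketch of what such a per-PLB validity check boils down to - this is not nextpnr's actual code, and the (clk, cen, sr) tuples are an assumed cell representation: the 8 logic cells packed into one PLB must agree on the control nets they actually use.)

    # hypothetical cell record: (clk_net, cen_net, sr_net), None if that control is unused
    def plb_control_sets_compatible(cells):
        """Check that all logic cells placed in one PLB can share CLK/CEN/SR."""
        shared = [None, None, None]            # resolved clk, cen, sr for the tile
        for cell in cells:
            for i, net in enumerate(cell):
                if net is None:
                    continue                   # unused control input, always fine
                if shared[i] is None:
                    shared[i] = net            # first user of this control defines it
                elif shared[i] != net:
                    return False               # conflicting control nets in one tile
        return True

    # e.g. same clock but different clock enables can't share a PLB:
    print(plb_control_sets_compatible([("clk0", "ce0", None), ("clk0", "ce0", None)]))  # True
    print(plb_control_sets_compatible([("clk0", "ce0", None), ("clk0", "ce1", None)]))  # False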
<jcreus>
trying to figure out how best to deal with those constraints when doing it analytically, instead of SA where you can do these checks as you go
<jcreus>
why can't everything just be convex?
<daveshah>
So there are two possible ways to solve this
<daveshah>
one would be to have a first stage "tile packer" that makes legal tiles for the analytical placer
<tnt>
Somewhat unsurprisingly yosys doesn't like a switch case with 65536 entries ...
<daveshah>
the other option would be to start SA at a low temperature to legalise the placement created by the analytical placer
<jcreus>
I was thinking about doing the latter anyway, so that I don't have to reimplement all the legalisation logic myself
<daveshah>
That is probably the best option
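(Purely illustrative sketch of that flow, with invented data: take the analytic placer's fractional coordinates, snap each cell to the nearest free bel, then hand the result to the existing SA at a low starting temperature so it only does local clean-up rather than a full anneal.)

    def snap_to_legal(fractional, bels):
        """Greedy legalisation sketch: assign each cell to the nearest free bel.
        fractional: {cell: (x, y)} from the analytic placer
        bels:       list of (x, y) legal locations
        """
        free = set(range(len(bels)))
        placement = {}
        # placing the cells the analytic placer is most "sure" about first is one
        # refinement; here we just go in arbitrary order for simplicity
        for cell, (cx, cy) in fractional.items():
            best = min(free, key=lambda i: abs(bels[i][0] - cx) + abs(bels[i][1] - cy))
            placement[cell] = bels[best]
            free.remove(best)
        return placement   # hand this to SA at low temperature for final clean-up

    print(snap_to_legal({"a": (0.2, 0.9), "b": (1.8, 1.1)}, [(0, 1), (2, 1), (3, 3)]))
    # {'a': (0, 1), 'b': (2, 1)}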
<jcreus>
is the legaliser efficient when doing that?
<jcreus>
actually nvm it is
<jcreus>
also, I see that it's technically a 3D grid - are the bels with z != 0 always special cases (like, idk, BRAMs or something) such that I can assume the big optimization is just in xy?
<daveshah>
Yes
<daveshah>
z != 0 is mostly for logic tiles, where z = 0..7
_whitelogger has joined ##openfpga
pie__ has quit [Ping timeout: 252 seconds]
<azonenberg>
jcreus: regarding scalability, prjxray is trying to reverse engineer the xilinx 7 series bitstream
<azonenberg>
So when thinking about scalability and performance, don't think about your par on an ice40 or ecp5
<azonenberg>
think about how it will run on a virtex-7
<azonenberg>
Ideally i'd want to be able to multithread it too
<azonenberg>
actually, *really* ideal would be an MPI cluster or similar so you can run on hundreds of cores :p
<azonenberg>
but multithreading is a good start
<jcreus>
azonenberg: gotcha. My background is mostly in optimization (distributed convex optimization being my biggest kink) so I'm hoping to formulate it that way, and scalability should follow nicely
<azonenberg>
awesome
<azonenberg>
Basically, my long term dream is being able to take a synthesized netlist (we'll worry about optimizing synthesis later, lol)
<azonenberg>
for a full virtex ultrascale
<jcreus>
yeah, that would be awesome
<jcreus>
I recently realized that a kinda nasty thing is the discrete complex elements like DSPs and RAMs, which need to be placed too
<jcreus>
and they're special in that you can't really pretend they're continuous like you can do with LUTs
<azonenberg>
throw it on a rack of xeons or a few dozen t3.2xlarge instances
<azonenberg>
and get a bitstream back in minutes
<azonenberg>
i have no idea how feasible this is because i havent had the time to even look into scaling bottlenecks etc
<azonenberg>
But that's my goal :p
<jcreus>
yeppp
<jcreus>
a convex solver would be a good start - I recently worked on a distributed QP solver using Regent/Legion (which has seen some use on supercomputers)
<azonenberg>
I'm thinking start with a simple sequential implementation of the solver core to prototype a bit
<azonenberg>
then openmp on a single node
<azonenberg>
then rewrite in either MPI or openmp + sockets
<azonenberg>
for scaling to larger platforms
<jcreus>
oh, yeah, for sure. For now actually I'll probably start by jankily communicating with Python and using cvxpy
<azonenberg>
keep in mind we don't want to sacrifice usability on single-node jobs just to get scaling
<jcreus>
then when I like the cost function and constraints go back to c++ land and do it there properly
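(As an example of how small such a cvxpy prototype can start out - the objective, nets and fixed-pin coordinates here are illustrative assumptions, not an agreed formulation: quadratic wirelength between connected movable cells, plus attraction to the fixed blocks they connect to, with the die treated as a box constraint.)

    import cvxpy as cp
    import numpy as np

    n = 4                                   # movable cells
    x = cp.Variable(n)
    y = cp.Variable(n)
    nets = [(0, 1), (1, 2), (2, 3)]         # two-pin nets between movable cells
    fixed_pins = {0: (0.0, 0.0), 3: (10.0, 8.0)}   # cells tied to fixed block locations

    # quadratic wirelength between connected movable cells
    cost = sum(cp.square(x[a] - x[b]) + cp.square(y[a] - y[b]) for a, b in nets)
    # plus attraction to the fixed blocks they connect to
    cost += sum(cp.square(x[c] - px) + cp.square(y[c] - py)
                for c, (px, py) in fixed_pins.items())

    constraints = [x >= 0, x <= 10, y >= 0, y <= 8]   # keep cells on a 10x8 "die"
    prob = cp.Problem(cp.Minimize(cost), constraints)
    prob.solve()
    print(np.round(np.vstack([x.value, y.value]).T, 2))   # per-cell (x, y) placements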
<azonenberg>
Yeah makes sense
<jcreus>
but yeah I'll need to think more carefully about DSPs and RAMs
<jcreus>
they seem annoying
<jcreus>
in theory you could always consider permutations of cell choices?
<jcreus>
which blows up, obviously, so maybe some meta-simulated-annealing
<azonenberg>
Loooong term i want iterative optimization capability
<jcreus>
you could always do device specific hacks - iirc ice40 has DSP at the edges only...
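(One way to sketch the "permutations of cell choices" idea without the combinatorial blow-up, since a part only has a handful of DSP/BRAM sites: treat DSP-to-site assignment as a small minimum-cost matching on Manhattan distance to wherever the analytic placement wants each cell, solve it exactly, then re-run the continuous placement around the result. The cells and site coordinates below are made up; scipy's linear_sum_assignment does the matching.)

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # hypothetical desired positions from the analytic placement
    dsp_cells = {"mac0": (3.2, 7.8), "mac1": (14.5, 2.1)}
    # hypothetical legal DSP sites (e.g. edge columns on an ice40-like part)
    dsp_sites = [(0, 2), (0, 10), (15, 2), (15, 10)]

    names = list(dsp_cells)
    cost = np.array([[abs(x - sx) + abs(y - sy) for (sx, sy) in dsp_sites]
                     for (x, y) in (dsp_cells[n] for n in names)])
    rows, cols = linear_sum_assignment(cost)      # optimal cell -> site matching
    for r, c in zip(rows, cols):
        print(names[r], "->", dsp_sites[c])       # mac0 -> (0, 10), mac1 -> (15, 2)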
<azonenberg>
so do an initial placement, try routing it
<azonenberg>
fine tune placement based on routing feedback (say, if you have heavy routing congestion move some bels closer together to shorten routing delays)
<azonenberg>
or even (very long term) adjust register balancing
<jcreus>
right
<jcreus>
actually, quick question about that
<azonenberg>
based on feedback with actual routing delays
<jcreus>
how much of a factor is routing?
<azonenberg>
But to start, forget register balancing and netlist changes and focus on P&R only
<azonenberg>
i'd say on average routing delay can be expected to be the same OOM as logic delay
<jcreus>
is it something like "well, if it routes successfully, then the solution won't be far from the optimal given that placement, so go work on the placer?"
<azonenberg>
but if you have longer range nets between IP blocks or something it's usually 2-4x as big as the logic delay
<azonenberg>
so IMO the placer needs to be aware of wire delay to get optimal results
<azonenberg>
first order approximation can be just Manhattan distance between nodes times a cost factor
<azonenberg>
but down the road i want to consider congestion and such
<azonenberg>
i.e. there are no free paths between these two slices so we have to detour around
<azonenberg>
that adds delay, so move the source of that net closer to us to compensate
<jcreus>
makes sense
<azonenberg>
i suspect there will need to be several iterations of this until we converge
<jcreus>
is the quadratic cost ppl use purely for optimization purposes (since it does make things nice), or is there some validity to it? something like, longer paths go through more interconnects, so the increase is worse than linear?
<azonenberg>
i think quadratic wirelength metrics are intended to disproportionately penalize the longest nets since those are the most likely to make you fail timing
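(To make that concrete with the first-order Manhattan-distance cost mentioned above - all numbers invented: squaring the per-net length shifts almost all of the cost onto the few long nets most likely to set Fmax.)

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # hypothetical placed two-pin nets: many short ones plus one long one
    nets = [((0, 0), (1, 0))] * 10 + [((0, 0), (12, 9))]

    linear    = sum(manhattan(a, b) for a, b in nets)
    quadratic = sum(manhattan(a, b) ** 2 for a, b in nets)

    print(linear)      # 31  -> the long net is ~2/3 of the cost
    print(quadratic)   # 451 -> the long net is ~98% of the cost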
tmeissner has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
<azonenberg>
ideally, i would want delay calculations based on time rather than distance
<azonenberg>
keeping in mind that, say, an x4 wire vs a x1 may not be 4x the delay once you factor in the switch block
<azonenberg>
it's 4x the RC delay but probably not 4x the buffer/mux delay
<azonenberg>
That is likely to be too expensive to do in the inner loop though
<azonenberg>
so maybe adjust cost tables between inner loop iterations or something with actual timing data
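(A toy illustration of the x4-vs-x1 point, with invented per-element delays: if each wire segment pays a roughly fixed buffer/mux delay plus an RC term proportional to its span, a span-4 wire has 4x the RC of a span-1 but nowhere near 4x the total delay.)

    # invented per-element delays (in ps), just to show the shape of the effect
    T_MUX = 300   # switch box buffer/mux delay, paid once per wire segment
    T_RC1 = 100   # RC delay of one tile's worth of wire

    def wire_delay(span):
        """Delay (ps) of a single wire segment spanning `span` tiles."""
        return T_MUX + T_RC1 * span

    # covering 4 tiles: one x4 segment vs a chain of four x1 segments
    print(wire_delay(4))                  # 700 ps
    print(4 * wire_delay(1))              # 1600 ps (four muxes, same total RC)
    print(wire_delay(4) / wire_delay(1))  # 1.75, not 4.0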
<azonenberg>
You can also do fun stuff like consider that the northmost vs southmost bel in a slice are not quite timing identical
<jcreus>
agh, right
<azonenberg>
I have characterization data for greenpak that shows eastbound and westbound wires are not timing identical either
<azonenberg>
and, in fact, within a given direction some wires are slower than others
<azonenberg>
And i can measure this delay reliably
<azonenberg>
this is ten east and ten west routes on a slg46620, before calibrating for i/o buffer delay (which is constant since i used the same pins and just changed internal routes)
<azonenberg>
measured for five dies
pie__ has joined ##openfpga
<jcreus>
nice
<azonenberg>
you can see the fast and slow process corners pretty clearly, as well as a kind of sawtooth pattern where the delay increases, dips, increases, dips, increases, and dips again
<azonenberg>
then the left half is slower than the right
<azonenberg>
i forget which half is east and which is west
<azonenberg>
but the difference is obvious and significant