<awygle>
forgive my ignorance - how is configuration RAM organized in an SRAM FPGA (e.g. Lattice)?
<awygle>
i had the naive idea that it was a giant shift register for some reason but clifford's documentation clearly says otherwise
<awygle>
is it physically distributed around the chip? or is it a big block of RAM and then wires going all over the chip to carry the signals to muxes and whatnot?
<azonenberg>
awygle: It's physically distributed around the chip
<azonenberg>
A big block of ram would be absurd
<azonenberg>
If the device has on-chip nonvolatile RAM, that is usually in one block
<azonenberg>
then there's SRAM in the logic itself
<azonenberg>
and some glue logic that copies the NVRAM to SRAM at boot
<awygle>
azonenberg: yeah that didn't seem right. but it seems to be addressed in the same way as a block of RAM, if i'm reading this right
<azonenberg>
Correct, except not necessarily byte sized
<awygle>
16-bit rows
<azonenberg>
There are row and column addresses
<azonenberg>
in xilinx parts, actually, the config ram is blockwise addressed
<azonenberg>
in virtex5, for example, a frame is 1312 bits (41 32-bit words) and that's the smallest addressable unit
<azonenberg>
So there may actually be a shift register going on within that area, i'm not sure
<awygle>
interesting. that seems like it would shrink the address logic somewhat
<azonenberg>
But each block has a unique address in (x,y)
<azonenberg>
a typical bitstream just writes to addresses sequentially
<azonenberg>
but if you're doing partial reconfig, you can direct a bitstream to portions of the device as small as a single block
<awygle>
so is the addressing done essentially for flexibility then? it seems like a chip-wide shift register would be smaller, silicon-wise, since there's no need for row and column decoders etc
<azonenberg>
it might be, but that renders you unable to ever configure part of the chip
<azonenberg>
You're looking at bare silicon after etching off all metal interconnect and polysilicon gates
<azonenberg>
and staining P-type doping brown
<azonenberg>
(N vs undoped are indistinguishable in this image)
<azonenberg>
This image is optical with a 100x objective, you can see gates but don't have the resolution to see individual transistors clearly
<azonenberg>
(the chip is 180 nm which is just a little smaller than half the wavelength of the light we're using to image)
<azonenberg>
anyway, this particular device is SRAM based but also has on-chip EEPROM for nonvolatile config
<azonenberg>
you should easily be able to see four distinct large areas of the chip plus some smaller ones
<azonenberg>
First, you have the I/O pad ring around the whole chip
<azonenberg>
Then you have the main logic area which is the top ~2/3
<azonenberg>
Then there's the mostly-dark memory arrays bottom center, and a U-shaped area around them that's generally lighter in color
<azonenberg>
got it?
<awygle>
yep
<azonenberg>
So, the dark arrays are the EEPROM
<azonenberg>
It's split into five blocks
<azonenberg>
All are 49 rows high
<azonenberg>
You have one ten bits wide at the left that has a valid bit and nine bits of macrocell config
<azonenberg>
then one 112 bits wide that has PLA AND/OR array config
<azonenberg>
one 16 bits wide that has global routing config
<azonenberg>
another 112 bit PLA config memory
<azonenberg>
then (way off at the far right side) another ten bit macrocell + valid
<azonenberg>
The U-shaped light colored area is JTAG and EEPROM programming logic that has not been well studied
<azonenberg>
Somewhere either in that area, or running horizontally across the top of the memory, is a 274 bit shift register
<azonenberg>
Which has a couple of padding bits, six address bits, and 260 data bits (one bit per EEPROM column)
<azonenberg>
During JTAG programming of this part, you shift in an address+data block then it programs either EEPROM or the configuration SRAM directly
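A rough sketch of the shift register layout described above, in Python. The field widths (274 total, 6 address, 260 data, and the five EEPROM block widths) come from the discussion. The ordering of the fields, the LSB-first address, and the names `pack_row` and `unpack_row` are assumptions made only for illustration, and the padding count is simply what is left over after subtracting the other fields.

```python
# Widths quoted in the discussion: macrocell, PLA, routing, PLA, macrocell.
EEPROM_BLOCK_WIDTHS = [10, 112, 16, 112, 10]
DATA_BITS = sum(EEPROM_BLOCK_WIDTHS)        # one bit per EEPROM column
ADDR_BITS = 6
SHIFT_REG_BITS = 274
# Whatever is left over is padding (derived by subtraction, not measured).
PAD_BITS = SHIFT_REG_BITS - ADDR_BITS - DATA_BITS


def pack_row(address, data_bits):
    """Build one shift register image as [padding | address | data].

    The field order and LSB-first address are assumptions for this sketch.
    """
    if not 0 <= address < 2 ** ADDR_BITS:
        raise ValueError("address does not fit in 6 bits")
    if len(data_bits) != DATA_BITS:
        raise ValueError("expected one data bit per EEPROM column")
    addr = [(address >> i) & 1 for i in range(ADDR_BITS)]
    return [0] * PAD_BITS + addr + list(data_bits)


def unpack_row(word):
    """Split a shift register image back into (address, data bits)."""
    addr_bits = word[PAD_BITS:PAD_BITS + ADDR_BITS]
    address = sum(bit << i for i, bit in enumerate(addr_bits))
    return address, word[PAD_BITS + ADDR_BITS:]
```

With these numbers the five block widths sum to 260, which matches the one-bit-per-column figure, and 8 bits remain as padding.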
<azonenberg>
Anyway, if you turn your attention to the actual logic fabric...
<azonenberg>
You can see it's split into five regions horizontally left to right
<azonenberg>
and is symmetric about the X axis
<azonenberg>
(roughly)
<awygle>
sure
<azonenberg>
The X axis layout matches that of the config ram
<azonenberg>
at the left and right side, you have the macrocells
<azonenberg>
Sixteen high on each half of the chip for a total of 32
<azonenberg>
eight above and below the central spine
<azonenberg>
Each macrocell has 27 config bits, organized in a 3 row x 9 bit wide block
<azonenberg>
The config bits are individual SRAM cells that share bit/word lines spanning the entire chip in an X-Y grid, but have poly or metal-1 interconnect going from the SRAM Q/nQ lines to various logic throughout the device
<azonenberg>
anyway, as you can see the macrocell logic is directly above the macrocell EEPROM (though wider, so the SRAM bitlines have to go diagonally a bit during the fanout from the EEPROM)
<azonenberg>
The bitlines run directly from the EEPROM sense amps / JTAG shift register group at the top of the EEPROM up vertically to the associated CPLD logic fabric, although they're not visible in this image since they got etched off
<azonenberg>
anyway, moving closer to the center of the device
<azonenberg>
the wide blocks just left/right of center are the actual PLA
<azonenberg>
Each of the PLA blocks is divided into 3 vertically
<azonenberg>
At top and bottom you have AND array, at middle you have OR array
<azonenberg>
(this chip is based on sum-of-products expressions rather than LUTs)
<azonenberg>
Each of the AND array blocks is 56 product terms wide and 20 inputs high
<awygle>
right
<azonenberg>
Since each AND can be either X or nX as input
<azonenberg>
there are 112 SRAM bits in each row
<azonenberg>
with x-enable and nX-enable, one-hot
<azonenberg>
Then the output of the blocks goes into the OR array, which is logically 56 product terms wide x 16 OR gates high
<azonenberg>
although physically, two OR gates share one row to keep the config memory 112 bits wide
<azonenberg>
So instead of having x-enable and nX-enable for each pterm
<azonenberg>
you have or1-enable and or2-enable for each pterm
<azonenberg>
in each row
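A minimal model of the AND/OR config encoding just described, in Python. It covers a single 20-input AND block and one physical OR row. The bit polarity (1 means enabled), the interleaving of the two enable bits per product term, and the function names are assumptions for this sketch, not the real bitstream layout.

```python
N_INPUTS = 20   # rows in one AND array block
N_PTERMS = 56   # product terms, two config bits each -> 112 bits per row


def eval_and_array(inputs, and_rows):
    """Evaluate all 56 product terms.

    and_rows[i] is the 112-bit config row for input i. For product term p,
    bit 2*p is assumed to be x-enable and bit 2*p+1 nX-enable.
    """
    pterms = []
    for p in range(N_PTERMS):
        value = True
        for i in range(N_INPUTS):
            if and_rows[i][2 * p]:          # use the input as-is
                value = value and bool(inputs[i])
            if and_rows[i][2 * p + 1]:      # use the inverted input
                value = value and not inputs[i]
        pterms.append(value)
    return pterms


def eval_or_row(pterms, or_row):
    """Evaluate the two OR gates that share one 112-bit physical row.

    Bit 2*p is assumed to be or1-enable and bit 2*p+1 or2-enable.
    """
    or1 = any(pterms[p] and or_row[2 * p] for p in range(N_PTERMS))
    or2 = any(pterms[p] and or_row[2 * p + 1] for p in range(N_PTERMS))
    return or1, or2
```

In this model an input with neither enable bit set is simply a don't-care for that product term, so a term with no bits set evaluates to true.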
<azonenberg>
Make sense?
<awygle>
almost entirely, the one thing where i got lost for a bit was "The config bits are individual SRAM cells that share bit/word lines spanning the entire chip in an X-Y grid, but have poly or metal-1 interconnect going from the SRAM Q/nQ lines to various logic throughout the device", specifically the bit/word line part towards the beginning
<azonenberg>
I'm getting there :)
<awygle>
haha okay :)
<azonenberg>
let me describe the floorplan so you know what happens when you zoom in
<awygle>
sure
<azonenberg>
Anyway, the last bit is the central spine
<azonenberg>
The very center of the chip is global muxes and things for the clock tree etc
<azonenberg>
Above and below that is global routing
<azonenberg>
20 rows above and below, each feeding one PLA input left and one right
<azonenberg>
The circuits are paired in a mirror image relative to each other
<azonenberg>
so you'll see ten identical blocks of stuff
<azonenberg>
each feeding two bits left and two right
<azonenberg>
but those ten blocks are just two identical blocks mirrored back to back
<azonenberg>
Although it looks symmetric left-right, it is not quite
<azonenberg>
Fundamentally, that whole structure is a bunch of 8:1 one-hot muxes
<azonenberg>
You have a bit to select Vdd, Vss, or one of six data inputs
<azonenberg>
then drive it out into a high-fanout driver
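As a sketch of what one of these 8:1 one-hot muxes does logically, in Python. The order of the select bits (Vdd, Vss, then the six data inputs) and the name `one_hot_mux` are assumptions for illustration only.

```python
def one_hot_mux(select, data_inputs):
    """Return the selected source for one global routing row.

    select is 8 one-hot bits, assumed ordered [Vdd, Vss, in0 .. in5];
    data_inputs holds the six candidate signal values (0 or 1).
    """
    if len(select) != 8 or len(data_inputs) != 6:
        raise ValueError("expected 8 select bits and 6 data inputs")
    if sum(select) != 1:
        raise ValueError("select must be one-hot")
    sources = [1, 0] + list(data_inputs)   # Vdd = constant 1, Vss = constant 0
    return sources[select.index(1)]
```

For example, `one_hot_mux([1, 0, 0, 0, 0, 0, 0, 0], [0] * 6)` selects the constant Vdd input.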
<azonenberg>
See how each row of the routing looks like a bunch of small gates then one giant block left and right?
<awygle>
yes, guessing those are the drivers and are built with bigger transistors for higher fanout
<azonenberg>
Yes
<azonenberg>
That giant block is a 3-stage inverter cascade
<azonenberg>
If you zoom in closer you'll see the 3 stages going from small to large as you move out from the center cascade
<azonenberg>
First two stages are vertical then the third is horizontal
<azonenberg>
they're multi-fingered transistors so you'll see multiple channels in the image
<azonenberg>
on the metal layer the channels are parallelled
<awygle>
makes sense
<azonenberg>
You can also see the symmetry about the X axis when you zoom in a bit
<azonenberg>
the driver has a separator down the centerline, everything above vs below is the same but mirrored
<azonenberg>
then there's six groups of 11 (or 10 in the rightmost case) wires
<azonenberg>
These are all of the possible inputs to the CPLD logic (32 flipflops and 33 input pins... simplifying a bit, there's some muxing in the io cells)
<azonenberg>
Each of the 40 rows of the global interconnect can route Vdd, Vss, or one signal from each group to its output
<azonenberg>
there's a via from M3 to M4 that connects MUXIN_TOP_xx to one of the 11 signals
<azonenberg>
in a different spot for each row
<azonenberg>
You don't need 100% connectivity since all inputs of an AND gate are logically indistinguishable from each other
<awygle>
thus explaining "via mux for input #"
<azonenberg>
Yep
<azonenberg>
so you don't have to be able to route FB1_1_IBUF to all 40 rows
<azonenberg>
as long as you have it routed to enough rows that any possible 40 of the 65 can be routed
<azonenberg>
in practice i think a tiny fraction of 40-of-65 combinations are not possible
<azonenberg>
i read a paper on how they designed the coolrunner XPLA3 routing and i assume CR-2 is basically the same process
<azonenberg>
They added enough routing to fully route all 36-input functions
<azonenberg>
as well as the vast majority of 37-40
<azonenberg>
but accepted a tiny fraction of very complex designs not fitting in exchange for not making the matrix larger
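Checking whether a given set of signals fits through a sparse via-mux matrix like this is a bipartite matching problem: each wanted signal needs its own row, and each row can only reach the signals its vias connect to. A minimal sketch in Python using the standard augmenting-path method follows. The name `can_route` and the toy tables in the usage note are made up for illustration; this is not claimed to be the method the designers or any existing tool used.

```python
def can_route(wanted, row_choices):
    """Try to give every wanted signal its own routing row.

    row_choices[r] is the set of signals row r can select. Returns a dict
    mapping row index -> signal, or None if the set cannot be routed.
    """
    match = {}  # row index -> signal currently assigned to that row

    def try_assign(signal, seen):
        for row, choices in enumerate(row_choices):
            if signal in choices and row not in seen:
                seen.add(row)
                # Take a free row, or evict the current signal if it can
                # be moved to some other row.
                if row not in match or try_assign(match[row], seen):
                    match[row] = signal
                    return True
        return False

    for signal in wanted:
        if not try_assign(signal, set()):
            return None
    return match
```

With a toy table `[{"a", "b"}, {"a", "c"}, {"b"}]`, the signals a, b and c can all be routed, but only after the search moves an earlier assignment out of the way.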
<azonenberg>
The via mux settings start out logically 1,2,3,4,5 at the top left of the array then progressively get more scrambled as you go right and down
<azonenberg>
presumably they used some kind of iterative algorithm to perturb the mux settings until it was good enough
<azonenberg>
there's no rhyme or reason to the mux settings and extracting that pattern is one of the requirements to RE any particular coolrunner bitstream
<azonenberg>
So far i've only dumped the rom for the 32a
<azonenberg>
but i could do it for others as needed
<azonenberg>
The 64a appears to be basically the same routing structure, the 128 and larger may be a multi-level tree of some sort vs a one-level tree based on what we've seen in the bitstream
<azonenberg>
but we don't have silicon photos to figure out the details yet
<azonenberg>
Anyway, in this particular chip the word lines run across the entire die
<azonenberg>
Which means the smallest reconfigurable unit is one row across the entire CPLD
<azonenberg>
not a very practical way to do partial reconfiguration, but experimentally it does actually seem possible to reconfigure one row at a time over jtag
<azonenberg>
it sometimes doesn't work, i think i might have metastability or reset issues or something if i don't do the full programming algorithm
<azonenberg>
in a more complex device like an FPGA with real partial reconfig support, the word lines would be segmented
<azonenberg>
so you could write a couple of rows for one block of the chip and reconfigure a contiguous 2D region of the device
<azonenberg>
And your physical bitstream addresses would then be a series of words each with a (row, column) address
<awygle>
okay, so in this chip the 6-bit address from the JTAG shift register gets decoded into a word line, and the 260 data bits from the shift register drive the bit lines for programming
<azonenberg>
Correct
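The row write that was just confirmed can be modelled in a few lines of Python. This sketch assumes the config SRAM has the same 49 x 260 shape as the EEPROM, which the discussion implies but does not state outright. The names `decode_wordline` and `write_row` are hypothetical.

```python
N_ROWS = 49    # assumed to match the EEPROM row count
N_COLS = 260   # one bit line per data bit in the shift register


def decode_wordline(address):
    """Decode the 6-bit address into one-hot word lines (one per row)."""
    if not 0 <= address < N_ROWS:
        raise ValueError("address outside the populated rows")
    return [1 if row == address else 0 for row in range(N_ROWS)]


def write_row(sram, address, data_bits):
    """Drive all bit lines and write whichever row has its word line high."""
    if len(data_bits) != N_COLS:
        raise ValueError("expected one data bit per bit line")
    for row, active in enumerate(decode_wordline(address)):
        if active:
            sram[row] = list(data_bits)
```

Because a word line spans every column, a single write always replaces a full 260-bit row, which is why one row is the smallest reconfigurable unit here.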
<azonenberg>
I haven't actually looked for the WL decode logic, it might be along the left/right between the macrocells and the IO pads or it might be at the bottom of the chip in the JTAG block
<azonenberg>
Wasn't important for what i was doing
<awygle>
but in a smaller-chunk-size FPGA your address might decode into row and column bits, and you'd have sizeof(chunk) data bits programmed in parallel
<azonenberg>
Yeah
<azonenberg>
Most likely an actual config block would be 2D
<azonenberg>
So you'd have say a 128-bit-long wordline
<azonenberg>
and you'd write to 64 contiguous addresses with the same col and incrementing row
<azonenberg>
to write to a 64x128 bit block that configured some 2D region of the device
<azonenberg>
Since it makes no sense to reconfigure e.g. half of a LUT
<awygle>
sure
<azonenberg>
the reconfigurable blocks are generally a bunch of logic or io resources and the associated switch boxes
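The hypothetical 64x128 block write from the example above could look like this in Python. The sizes are the "say a 128-bit-long wordline" numbers from the discussion, not any real device, and the dict-based config memory and function names are assumptions for illustration.

```python
WORDLINE_BITS = 128   # bits written per (row, col) address in the example
BLOCK_ROWS = 64       # contiguous rows that make up one reconfigurable block


def block_write_addresses(col, first_row=0):
    """Addresses for one block: same column, incrementing row."""
    return [(first_row + i, col) for i in range(BLOCK_ROWS)]


def write_block(config, col, rows_of_bits, first_row=0):
    """Write a 64 x 128 bit block into config, a dict keyed by (row, col)."""
    if len(rows_of_bits) != BLOCK_ROWS:
        raise ValueError("expected 64 rows of data")
    for address, bits in zip(block_write_addresses(col, first_row), rows_of_bits):
        if len(bits) != WORDLINE_BITS:
            raise ValueError("each row must be 128 bits")
        config[address] = list(bits)
```

One block write in this model touches 64 x 128 = 8192 bits and leaves every other column untouched.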
<awygle>
only semi-related, do the JTAG drivers have to be stronger than the SRAM feedback inverters to drive properly?
<awygle>
to write the SRAM properly, rather
<azonenberg>
This is true for SRAM in general, whether in an FPGA or otherwise
<azonenberg>
Typically the feedback inverters are just strong enough to hold the bit reliably
<azonenberg>
and the bitline drivers drive a lot harder
<awygle>
right, just wanted to double check that i understood that right
<awygle>
thanks for the 101 class, i really appreciate it!
<awygle>
easily 4 credits
<azonenberg>
lol
<cyrozap>
azonenberg: Ah, that's kind of a bummer about the Spartan-6, since there's so many cheap dev boards/products out there that use them, and so few cheap (i.e. sub-$50 range) 7-series dev boards.
<azonenberg>
That will change when spartan7 comes out, i think
<cyrozap>
azonenberg: And it's a bummer for me personally because I have a bunch of LX100 and LX150-based devices (because... uh... "reasons") that I'd love to have a FOSS toolchain for. I guess I know what I'll be working on after the PSoC stuff :P
<azonenberg>
Lol
<azonenberg>
Well we can potentially work on s6 at some point i just dont see it as a priority
<azonenberg>
vs s3 (simple) and 7 series (modern)
<cyrozap>
I totally understand
<cyrozap>
The 7-series stuff is definitely much more interesting
<rqou>
i'm still in favor of jumping straight to 7-series without doing s3
<azonenberg>
I feel like s3 is a close enough ancestor of 7 that i'd learn a lot from studying the simpler interconnect
<azonenberg>
the chip is larger process and easier to deprocess/image
<azonenberg>
cheaper to get samples for destructive imaging
<rqou>
i also really want to see (as a test) vpr support for ice40
<azonenberg>
That would be cool too
<azonenberg>
Honestly i don't think vpr is the best way to go if scaling to large 7-series parts is the goal
<azonenberg>
afaik vpr is basically smart annealing
<azonenberg>
i'd rather go with a global mass-spring type routing algorithm from the get-go
<azonenberg>
and design for multithreading and maybe even multi-server builds
<azonenberg>
even if we normally run locally with only 1-4 threads
<azonenberg>
that scalability essentially doesn't exist in any toolchain that targets a real architecture
<rqou>
hmm i recall seeing a paper (that i couldn't be arsed to download) about a hybrid annealing+quadratic-wirelength approach for FPGAs
<rqou>
that might be interesting to look at at some point
<azonenberg>
Well what i'm saying is
<azonenberg>
annealing doesn't parallelize well
<azonenberg>
And i think once we have the bitstream figured out
<rqou>
right, but quadratic-wirelength does
<azonenberg>
oh? i'm not familiar with that alg
<azonenberg>
Because it would be potentially worthwhile to try developing routing algorithms that scale all the way from tens to thousands of cores
<azonenberg>
not 4 like ise
<rqou>
that's basically the "global mass/spring intuition" algorithm
<azonenberg>
oh, ok
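For a feel of the quadratic-wirelength (mass/spring) idea mentioned here, a toy 1-D sketch in Python follows. Minimizing the sum of squared two-pin net lengths puts each movable cell at the average of its neighbours, and the Jacobi-style update below computes every cell's new position from the previous iteration only, which is what makes it easy to run in parallel. The function name, the 1-D simplification, and the fixed iteration count are assumptions for illustration. Real placers work in 2-D, solve the linear system directly, and then have to legalize the result.

```python
def quadratic_place(nets, fixed, movable, iterations=200):
    """Toy 1-D quadratic placement.

    nets is a list of (a, b) two-pin connections, fixed maps pad names to
    positions, movable lists the cells to place. Every movable cell is
    assumed to have at least one connection.
    """
    pos = dict(fixed)
    for cell in movable:
        pos[cell] = 0.0
    neighbours = {cell: [] for cell in movable}
    for a, b in nets:
        if a in neighbours:
            neighbours[a].append(b)
        if b in neighbours:
            neighbours[b].append(a)
    for _ in range(iterations):
        # Each new position depends only on the old positions, so all
        # cells could be updated concurrently.
        new_pos = {
            cell: sum(pos[n] for n in neighbours[cell]) / len(neighbours[cell])
            for cell in movable
        }
        pos.update(new_pos)
    return pos
```

For a chain of three cells between pads fixed at 0 and 10, the cells settle evenly spaced between the pads.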
<azonenberg>
i'm imagining a pile of ec2 spot instances or just a rack of servers in a lab somewhere
<azonenberg>
fully routing an xcku035 in <5 minutes
<rqou>
we should focus on getting xc2par to return more right answers first :P
<azonenberg>
lol yes
<rqou>
alright, time for me to get up and have brunch
<cyrozap>
Regarding P&R, would it be possible to design a library so that it could be used for many different architectures? I'm just thinking it would be really time-consuming to have to write a new tool for every chip (arachne-pnr for ICE40, xc2par, gp4par, etc.) and as a FOSS project, one of our strengths over the proprietary tools is that we can share code (and by extension algo/implementation improvements).
<azonenberg>
That is what VPR does
<azonenberg>
but AIUI it's based on annealing and i want to try a different algorithm
* cyrozap
isn't familiar with the acronym "VPR"
<azonenberg>
"versatile place and route" iirc
<azonenberg>
it's a primarily research based par tool that is meant to work on toy architectures for researching par algorithms
<azonenberg>
afaik
<azonenberg>
it hasn't been used much with real chips that i know of
<azonenberg>
Also, xc2par and gp4par are already sharing code
<azonenberg>
xbpar is the "crossbar and logic stuff" par engine
<azonenberg>
gp4par is the greenpak front end to it
<pointfree>
I'm trying to figure out the PSoC 5LP status and control blocks. I can't imagine the HC switches below status and control are any different and the HC still needs to be configured to route to the status or control blocks. I guess the control block is shorting away routes inside the UDB and that's how it does its thing.
<pointfree>
(I thought that's what vpr is)
<cyrozap>
azonenberg: I was thinking more along the lines of a library/tool that contains a bunch of different algos, and you just feed it tech cells and routing models (and maybe specify which algo to use)
<azonenberg>
So more modular like yosys? or what
<azonenberg>
I have to spend a while reading papers on lut-based par engines and playing with raw bitfiles before i even attempt to do something with FPGAs
<azonenberg>
i know crossbar architectures very well
<azonenberg>
The verilog-to-routing project uses ODIN for synthesis, but afaik yosys is a lot better
<azonenberg>
then they use ABC for techmapping, which yosys does internally
<azonenberg>
then VPR for P&R
<cyrozap>
azonenberg: Yeah, something like yosys, I think. Really I'm just lazy and terrible at software and would much rather build on the (proven) work of others than do everything myself :P
<azonenberg>
Lol
<azonenberg>
well i want to share code too
<azonenberg>
i just need to spend time fooling with vpr and see if it does what we want
<azonenberg>
Sooo let's see, i still have to figure out what is wrong with the ZIA in my coolrunner emulator
<azonenberg>
Components for the greenpak thermal characterization board arrived
<azonenberg>
Stencil for the level shifter arrived