<_florent_>
setting the we_granularity from 8 to 1 for my wishbone SRAM makes it get implemented in block RAM, but I don't know if that can cause issues in the case of the tx_buffer?
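(For context, a minimal sketch of the kind of declaration being discussed, assuming migen's Memory.get_port interface; the module name and import path are illustrative and may differ between migen versions:)

```python
from migen import *

class BufferMemory(Module):
    # Illustrative only, not the actual tx_buffer code: a 32-bit memory
    # with one write port. we_granularity=8 gives a 4-bit byte-enable
    # bus on port.we (the form some synthesizers refuse to map to block
    # RAM); we_granularity=0 gives a single write enable for the word.
    def __init__(self, depth=512, we_granularity=8):
        mem = Memory(32, depth)
        port = mem.get_port(write_capable=True,
                            we_granularity=we_granularity)
        self.specials += mem, port
        self.port = port
```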
<_florent_>
or maybe someone has another idea?
<kristianpaul>
hmm
<kristianpaul>
are you sure about this?
<_florent_>
yes, when I removed the byte enables, it's implemented correctly
<_florent_>
in danstrother's review
<_florent_>
"Both Xilinx and Altera support inferring byte-enables, to a certain extent. Xilinx only supports byte-enables on single-port memories. Altera supports byte-enables on both simple and true dual-port memories. Again: Xilinx supports VHDL and Verilog, Altera supports VHDL and SystemVerilog, and both use mutually incompatible constructs."
<sb0>
_florent_, as a workaround there's a simplify transform in migen that will break down your RAM into four 8-bit memories
<sb0>
this is not efficient, but it will work and is portable
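(A hedged sketch of how that transform might be applied, assuming it is the FullMemoryWE transformer in migen.fhdl.simplify wrapped around migen's wishbone.SRAM; import paths vary between migen/misoc versions:)

```python
from migen.bus import wishbone
from migen.fhdl.simplify import FullMemoryWE

# FullMemoryWE rewrites memories that use partial write enables into
# several narrower memories with full write enables: one 32-bit RAM
# with 4 byte lanes becomes four 8-bit RAMs, which the synthesizer can
# then map to block RAM (at the cost of extra decoding logic).
tx_buffer = FullMemoryWE()(wishbone.SRAM(2048))
```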
<_florent_>
ah ok, wasn't aware of this, I'll have a look
<sb0>
I think the proper way to fix it is to hook the RAM generator in Mibuild to make the output palatable to the altera shitware
<sb0>
with the byte-enable signals
<_florent_>
I'm not sure it isn't working on Altera; for now I'm testing it on the Mixxeo
<_florent_>
but LUT usage for MiniSoC jumps from less than 3000 LUTs to 4500 because of this...
<larsc>
It probably makes sense for migen to have an abstract block RAM memory module which can generate specialized primitives depending on the target platform
<_florent_>
yes, probably, because inference rules seem to be different for each vendor...
<sb0>
larsc, there's already such a memory module, and it supports different generators
<sb0>
except there's only one generator at the moment, which outputs code that works in most cases
<sb0>
though if the xilinx shitware doesn't support inference of multiport + byte enable, I think a direct code generator isn't the best way to solve it
<sb0>
better have a transform that turns the memory module into RAMB*BWER instances
<sb0>
this way you can use migen to create and connect those instances :)
<sb0>
outputting the systemverilog construct for the altera shitware could be done by hooking the code generator though
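(A rough sketch of the kind of transform output being proposed: replacing the inferred memory with a direct Spartan-6 RAMB16BWER instance so byte write enables are guaranteed. Port and parameter names below are from memory and should be checked against the Xilinx libraries guide; the wrapper module is purely illustrative:)

```python
from migen import *

class DirectBRAM(Module):
    def __init__(self):
        self.adr = Signal(9)      # 512 x 32-bit organisation
        self.dat_w = Signal(32)
        self.dat_r = Signal(32)
        self.we = Signal(4)       # one write enable per byte lane
        self.specials += Instance("RAMB16BWER",
            p_DATA_WIDTH_A=36,
            i_CLKA=ClockSignal("sys"),
            i_ENA=1,
            i_RSTA=0,
            i_WEA=self.we,
            # for a 32-bit port the low address bits select bits/bytes
            # within the word and are tied low here
            i_ADDRA=Cat(C(0, 5), self.adr),
            i_DIA=self.dat_w,
            o_DOA=self.dat_r)
```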
<_florent_>
I'm now using a dataflow architecture and think it's easier to handle all registers in the sys clock domain
<sb0>
mh?
<sb0>
afaict the main reason to use AsyncFIFO over synchronizers would be to queue several words without waiting for a handshake - for performance reasons
<_florent_>
why? if you have a dataflow architecture, you can change clock domains using an AsyncFIFO and you don't have to manage multiple clock domain crossings
<_florent_>
I think it's easier to do like that
<_florent_>
and it will be easier to use another phy with this new architecture
<_florent_>
but if you are really not convinced of that I can still change it
<_florent_>
the only disadvantage here is that more logic runs at sys_clk instead of the eth_rx/eth_tx clocks
<_florent_>
but this is a disadvantage here with a 100Mbps PHY that will disappear with a 1Gbps PHY
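(A minimal sketch of that dataflow-style crossing, assuming migen's AsyncFIFO from migen.genlib.fifo and the ClockDomainsRenamer transform; the eth_rx/sys domain names are taken from this discussion:)

```python
from migen import *
from migen.genlib.fifo import AsyncFIFO

class RxCrossing(Module):
    # Move 8-bit PHY data from the eth_rx clock domain into sys;
    # everything downstream of fifo.dout then runs at sys_clk.
    def __init__(self, depth=4):
        fifo = AsyncFIFO(8, depth)
        self.submodules.fifo = ClockDomainsRenamer(
            {"write": "eth_rx", "read": "sys"})(fifo)
        # eth_rx side: fifo.din, fifo.we, fifo.writable
        # sys side:    fifo.dout, fifo.re, fifo.readable
```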
<sb0>
it's rather weird that you're using a FIFO (that already contains potential BRAM storage) just to fill a dual-port, multi-clock-capable BRAM in the other clock domain
<sb0>
and hmm, seems the 1Gbps PHY would be much better off writing a BRAM at 125MHz than pushing data at 125MHz into a 4-word-deep FIFO later read at about 83MHz :)
<sb0>
you can of course increase the FIFO word width to compensate, but it gets messy, especially with the optimum word width depending on the system/phy clock freq ratio
<sb0>
that's how it's done in dvisampler/framebuffer, and I don't like it
<sb0>
but there aren't many options when DMA-streaming a large buffer
<sb0>
BRAM is different
<_florent_>
for the 1Gbps PHY, I was more thinking of pushing 32-bit data @ 31.25MHz, and I'm not planning to use it on slowtan6 :)
<sb0>
why 32? and not 16 or 64? :)
<_florent_>
but I still have some work to do and will think about our discussion, and try to convince myself that your solution is better :)
<_florent_>
because it's the size of the wishbone bus
<_florent_>
and also the width of the ram
<sb0>
well there are two problems
<sb0>
1) absorbing/delivering the data continuously at the PHY rate
<sb0>
2) packing it into WB words
<sb0>
with what I'm proposing, both problems are decoupled - so it's more flexible
<sb0>
also, you're using a bit fewer resources, as you don't implement FIFOs that are not really necessary
<_florent_>
but you use synchronizers for each control/status signal, so on resource usage it's probably almost the same
<sb0>
a pulse synchronizer on a control signal is 3 flip-flops
<sb0>
and 2 LUTs
<sb0>
those FIFOs contain storage, counters, synchronizers on the counters (2 flip-flops per counter bit), and a bunch of LUTs for the control logic
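(For comparison, the control-signal synchronizer sb0 is counting flip-flops for, assuming migen's PulseSynchronizer from migen.genlib.cdc; signal names are illustrative:)

```python
from migen import *
from migen.genlib.cdc import PulseSynchronizer

class StartTx(Module):
    def __init__(self):
        self.start_sys = Signal()  # single-cycle pulse in the sys domain
        self.start_tx = Signal()   # resynchronized pulse in the eth_tx domain
        self.submodules.ps = ps = PulseSynchronizer("sys", "eth_tx")
        self.comb += [
            ps.i.eq(self.start_sys),
            self.start_tx.eq(ps.o),
        ]
```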
<_florent_>
I agree, but the idea is to have the smallest possible FIFO
<_florent_>
It's actually set to 4 due to a limitation of the AsyncFIFO, but it can be set to 2
<sb0>
which is zero :-P
<_florent_>
haha :)
<sb0>
that being said, those FIFOs would be nice if there's DMA
<sb0>
in fact, I had those, with DMA, in the very first versions of minimac
<_florent_>
ah ok, wasn't aware of that
<sb0>
but it was messy to prevent them from over/underflowing with the limited wishbone bandwidth
<sb0>
packets were getting corrupted when the WB bus was loaded
<sb0>
and I had to use pretty large FIFOs... and trigger TX after the FIFO had been filled to a good level
<sb0>
so I stopped messing around and used BRAMs with predictable timing
<_florent_>
we can maybe implement LASMI ports that have high priority; this would enable the use of DMA for ethernet
<sb0>
yeah, LASMI should have plenty of bandwidth
<_florent_>
and having high-priority ports can probably be useful for other applications
<sb0>
I only tried Wishbone DMA before giving up and using BRAM
<sb0>
I'm not sure if we need high-prio ports at all
<sb0>
there's a good margin
<sb0>
especially for 100Mbps
<sb0>
HDMI works at like 30x that bandwidth, no problem, without a priority mechanism
<_florent_>
ok, if you want I can try ethernet with DMA since I'm working on it
<sb0>
the problem, however, is that you need to invalidate the L2 (bridge) cache before transferring packets with the CPU
<sb0>
might make things slower...
<_florent_>
I'm also thinking of inserting/checking the ethernet CRC in hardware, since it won't use too many resources and will improve performance
<sb0>
it'd be possible to have cache-coherent DMA though
<sb0>
there's a second port on the bridge BRAM we're not needing atm
<sb0>
yes, hw crc is probably good
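(For reference, not misoc code: the Ethernet FCS is the standard reflected CRC-32, and a small software model is handy as a golden reference when checking a hardware implementation:)

```python
import zlib

def crc32_ethernet(data: bytes) -> int:
    # Bit-serial reflected CRC-32 as used by the Ethernet FCS:
    # polynomial 0x04C11DB7 (reflected form 0xEDB88320),
    # initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# sanity check against Python's own CRC-32
assert crc32_ethernet(b"123456789") == zlib.crc32(b"123456789")
```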
<sb0>
one thing that annoys me with that cache hierarchy is it's mostly needed for WB compatibility
<sb0>
you could access the CPU caches directly from LASMI, and I'm sure the system would still meet timing in most cases
<sb0>
with less resources, and higher performance
<_florent_>
one thing to do if someone rewrites LM32 in Migen :)
<sb0>
other bus accesses from the CPU are MMIO, which is easy, and XIP which pretty much needs the cache and is more difficult - unless we can tolerate a sluggish BIOS execution
<sb0>
now that I think of it we probably can... the BIOS can load a second-stage bootloader into the SDRAM for things like netboot that need a bit of performance
<_florent_>
BTW, what do you think of my change to the LASMI interface? splitting write/read data acks and managing latencies inside the crossbar
<_florent_>
I think it's easier to use the LASMI ports like that
<_florent_>
the ack on the read signifies that data is valid
<_florent_>
the ack on the write signifies that we have to present the data
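(A hedged sketch of the split-ack port _florent_ is describing, written as a migen Record layout; the field names are hypothetical, not the actual misoc/LASMI definitions:)

```python
from migen.genlib.record import Record, DIR_M_TO_S, DIR_S_TO_M

def lasmi_split_port_layout(aw, dw):
    return [
        ("cmd", [                     # request: address + direction
            ("stb", 1, DIR_M_TO_S),
            ("ack", 1, DIR_S_TO_M),
            ("we",  1, DIR_M_TO_S),
            ("adr", aw, DIR_M_TO_S),
        ]),
        ("wdata", [                   # "write data ack": present data now
            ("ack", 1, DIR_S_TO_M),
            ("data", dw, DIR_M_TO_S),
            ("we", dw//8, DIR_M_TO_S),
        ]),
        ("rdata", [                   # "read data ack": data is valid
            ("ack", 1, DIR_S_TO_M),
            ("data", dw, DIR_S_TO_M),
        ]),
    ]

# the crossbar, not the master, would account for controller latency
# by delaying wdata.ack / rdata.ack the appropriate number of cycles
port = Record(lasmi_split_port_layout(aw=25, dw=32))
```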
<sb0>
early acks allow earlier unlocking of arbitration and higher performance on loaded buses
<sb0>
did you test that your option even still allows HD mixing on the mixxeo?
<_florent_>
yes, it was ok and has exactly the same performance as the current implementation
<_florent_>
it's just that the shift registers for the ack signals are in the crossbar instead of being in the DMA
<sb0>
2x 720p60 in 1x 720p60 out? with mixing from both input framebuffers?
<_florent_>
and the wishbone bridge is then simplified
<sb0>
you'd still have early acks from the controller?
<_florent_>
I was using 2x 720p60 in 1x 720p60 out
<_florent_>
you still have req ack, write data ack, read data ack
<_florent_>
in fact it's very close to an AXI interface: 1 cmd interface, 1 write interface, 1 read interface
<sb0>
what I'm asking is - does the controller still unlock arbitration a few cycles before the actual data transfer takes place?
<_florent_>
and if we want to have reordering later, we can add a signal that will be asserted with the ack to indicate which data the controller wants to write (or read)
<_florent_>
yes, the behaviour is exactly the same
<sb0>
I've experimented with full reordering... tends to cause big timing closure issues for small performance improvements
<sb0>
e.g. page hit optimization
<_florent_>
you've tried on random accesses?
<sb0>
no, with the current soc access patterns, which aren't random
<sb0>
yes, it's probably more worth doing it if the pattern is actually random
<sb0>
but timing *is* a big mess
<_florent_>
you just have to pipeline things
<sb0>
that increases latency :)
<sb0>
also, how would you pipeline page hit reordering?
<_florent_>
of course, but yes, you have to study whether the gain will compensate for the increase in latency
<sb0>
request issue is another problem
<_florent_>
sorry, I don't have in mind how it's done in LASMICON, but when I was working on a controller for a customer, we didn't have so many timing issues
<sb0>
you need a priority encoder to find an empty location in the controller's request slots
<sb0>
and you pretty much want 1-cycle issue, since the slowness of the FPGA vs. the speed of DRAM means that after serialization, bursts take 1 FPGA cycle
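(A small sketch of the slot-allocation logic sb0 mentions, using migen's PriorityEncoder from migen.genlib.coding; the surrounding names are illustrative:)

```python
from migen import *
from migen.genlib.coding import PriorityEncoder

class SlotAllocator(Module):
    # Pick the lowest-numbered free request slot in a single cycle.
    # busy has one bit per slot; pe.o is the index of the first free
    # slot and pe.n is asserted when every slot is occupied.
    def __init__(self, nslots=8):
        self.busy = Signal(nslots)
        self.submodules.pe = pe = PriorityEncoder(nslots)
        self.comb += pe.i.eq(~self.busy)
        self.free_slot = pe.o
        self.full = pe.n
```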
<_florent_>
yes especially when you work at half rates or quarter rates in the FPGA
<sb0>
well... maybe in a system where you issue requests in packs of, say, 16, it makes sense to have reordering
<_florent_>
for my modification, in fact it's simply that