lekernel changed the topic of #m-labs to: Mixxeo, Migen, MiSoC & other M-Labs projects :: fka #milkymist :: Logs http://irclog.whitequark.org/m-labs
Jespers has joined #m-labs
rjo_ has quit [Ping timeout: 240 seconds]
Alain_ has joined #m-labs
[florian] has quit [Ping timeout: 252 seconds]
[florian] has joined #m-labs
sb0 has joined #m-labs
sb0 has quit [Ping timeout: 240 seconds]
sb0 has joined #m-labs
_florent_ has joined #m-labs
nicksydney has quit [Remote host closed the connection]
nicksydney has joined #m-labs
mumptai has joined #m-labs
sh4rm4 has quit [Remote host closed the connection]
sh4rm4 has joined #m-labs
sh4rm4 has quit [Remote host closed the connection]
mumptai has quit [Ping timeout: 264 seconds]
sh4rm4 has joined #m-labs
sh[4]rm4 has joined #m-labs
sh4rm4 has quit [Ping timeout: 252 seconds]
xiangfu has quit [Ping timeout: 252 seconds]
_florent_ has quit [Ping timeout: 240 seconds]
_florent_ has joined #m-labs
Jespers has quit [Quit: Jespers]
rjo_ has joined #m-labs
<_florent_> hi
<_florent_> I'm facing a small optimization issue when rewriting Minimac with Migen:
<_florent_> to replace the RAMB16BWER in minimac3_memory for the tx part, I need a dual-port RAM with
<_florent_> - 1 port used by wishbone *with byte enables*
<_florent_> - 1 port used by my logic to read the frame
<_florent_> the thing is that it does not seem possible to get ISE to infer a dual-port RAM with byte enables into block RAM...
<_florent_> so my RAM is implemented in LUTs...
<_florent_> setting we_granularity from 8 to 1 on my wishbone SRAM will make ISE implement it in block RAM, but I don't know if that can cause issues in the case of the tx_buffer?
<_florent_> or maybe someone has another idea?
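The byte-enable write semantics being discussed can be sketched in plain Python (a behavioral model of what we_granularity=8 means on a 32-bit port; the helper name is made up, this is not Migen code):

```python
# Behavioral model of a RAM write port with byte enables: a 32-bit
# write only updates the bytes whose enable bit is set, leaving the
# other bytes of the stored word untouched.

def we_write(mem, addr, data, we, word_bytes=4):
    """Update mem[addr] byte-by-byte according to the we bitmask."""
    old = mem[addr]
    new = 0
    for i in range(word_bytes):
        src = data if we & (1 << i) else old   # pick new or old byte
        new |= ((src >> (8 * i)) & 0xFF) << (8 * i)
    mem[addr] = new

mem = {0: 0xAABBCCDD}
we_write(mem, 0, 0x11223344, we=0b0101)  # update byte lanes 0 and 2 only
# mem[0] is now 0xAA22CC44
```

It is exactly this partial-word update on a second (wishbone) port that the Xilinx tools refuse to map to block RAM when the memory is dual-port.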
<kristianpaul> hmm
<kristianpaul> are you sure about this?
<_florent_> yes, when I removed the byte enables, it was implemented correctly
<_florent_> in danstrother's review
<_florent_> "Both Xilinx and Altera support inferring byte-enables, to a certain extent. Xilinx only supports byte-enables on single-port memories. Altera supports byte-enables on both simple and true dual-port memories. Again: Xilinx supports VHDL and Verilog, Altera supports VHDL and SystemVerilog, and both use mutually incompatible constructs."
<sb0> _florent_, as a workaround there's a simplify transform in migen that will break down your RAM into four 8-bit memories
<sb0> this is not efficient, but it will work and is portable
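Behaviorally, the transform sb0 describes splits one byte-enabled memory into per-byte-lane memories, each with a plain 1-bit write enable that the tools can map to block RAM. A plain-Python model of that idea (made-up class name, not the actual Migen transform):

```python
# One 32-bit memory with byte enables becomes four independent 8-bit
# memories. Each lane is written with a full-word write enable, so each
# lane on its own is an inference-friendly dual-port RAM.

class LaneSplitMem:
    def __init__(self, depth, lanes=4):
        self.lanes = [[0] * depth for _ in range(lanes)]

    def write(self, addr, data, we):
        for i, lane in enumerate(self.lanes):
            if we & (1 << i):              # each lane sees a 1-bit WE
                lane[addr] = (data >> (8 * i)) & 0xFF

    def read(self, addr):
        word = 0
        for i, lane in enumerate(self.lanes):
            word |= lane[addr] << (8 * i)  # reassemble the full word
        return word

m = LaneSplitMem(depth=16)
m.write(3, 0xDEADBEEF, we=0b1111)
m.write(3, 0x00000042, we=0b0001)          # touch only byte lane 0
# m.read(3) -> 0xDEADBE42
```

The inefficiency sb0 mentions comes from each small lane potentially occupying a full BRAM primitive plus duplicated address decoding.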
<_florent_> ah ok, wasn't aware of this, I'll have a look
<sb0> I think the proper way to fix it is to hook the RAM generator in Mibuild to make the output palatable to the altera shitware
<sb0> with the byte-enable signals
<_florent_> I'm not sure it doesn't work on Altera; for now I'm testing it on the Mixxeo
<_florent_> but LUT usage for MiniSoC jumps from less than 3000 LUTs to 4500 because of this...
<larsc> It probably makes sense for migen to have an abstract block ram memory module which can generate specialized primitives depending on the target platform
<_florent_> yes, probably, because the inference rules seem to be different for each vendor...
<sb0> larsc, there's already such a memory module, and it supports different generators
<sb0> only there's only one generator at the moment, that outputs code that works in most cases
<sb0> though if the xilinx shitware doesn't support inference of multiport + byte enable, I think a direct code generator isn't the best way to solve it
<sb0> better have a transform that turns the memory module into RAMB*BWER instances
<sb0> this way you can use migen to create and connect those instances :)
<sb0> outputting the systemverilog construct for the altera shitware could be done by hooking the code generator though
<_florent_> in wishbone.py, I'm doing this
<_florent_> self.specials += FullMemoryWE(self.mem)
<_florent_> but it does not seem to work
<_florent_> maybe I'm not using the FullMemoryWE decorator correctly?
<sb0> well, first, you're using it in wishbone.py and it does not belong there
<sb0> where it should go is in mibuild, applied automatically to any dual-port+byte-enable memory targeting xilinx
<_florent_> I agree, it's just a dirty test for now...
<_florent_> do you have an example where you use FullMemoryWE?
<sb0> I've used it with MiSoC make.py (-d FullMemoryWE) to apply it globally
<_florent_> ok thanks, I'll look at it
<sb0> and it's a module decorator
<sb0> you can't apply them to specials
<sb0> for your dirty test you can just do
<sb0> @FullMemoryWE
<sb0> class SRAM(Module):
<sb0> ...
<_florent_> thanks
<_florent_> it's working :) (resource usage and test on board)
<_florent_> old version:
<_florent_> LUTs: 3,177 / registers: 2,616
<_florent_> new version:
<_florent_> LUTs: 3,296 / registers: 2,591
<_florent_> but the architecture is not the same, and CDC is done using AsyncFIFOs:
<sb0> hmm, why the FIFOs?
<sb0> SRAM_DEPTH = << lower case please
<_florent_> I'm now using a dataflow architecture and think it's easier to handle all registers in the sys clock domain
mumptai has joined #m-labs
<sb0> mh?
<sb0> afaict the main reason to use AsyncFIFO over synchronizers would be to queue several words without waiting for a handshake - for performance reasons
<_florent_> why? with a dataflow architecture, you can change clock domain using an AsyncFIFO and you don't have to manage multiple clock domain crossings
<_florent_> I think it's easier to do like that
<_florent_> and it will be easier to use another phy with this new architecture
<_florent_> but if you're really not convinced by that, I can still change it
<_florent_> the only disadvantage here is that more logic runs at sys_clk instead of the eth_rx/eth_tx clocks
<_florent_> but that's only a disadvantage with a 100Mbps PHY, and it will disappear with a 1Gbps PHY
<sb0> it's rather weird that you're using a FIFO (which already contains potential BRAM storage) just to fill a dual-port, multi-clock-capable BRAM in the other clock domain
sh[4]rm4 has quit [Remote host closed the connection]
sh[4]rm4 has joined #m-labs
sh[4]rm4 has quit [Remote host closed the connection]
<sb0> and hmm, seems the 1Gbps PHY would be much better off writing a BRAM at 125MHz than pushing data at 125MHz into a 4-word-deep FIFO later read at about 83MHz :)
<sb0> you can of course increase the FIFO word width to compensate, but it gets messy, especially with the optimum word width depending on the system/phy clock freq ratio
<sb0> that's how it's done in dvisampler/framebuffer, and I don't like it
<sb0> but there aren't many options when DMA-streaming a large buffer
<sb0> BRAM is different
sh4rm4 has joined #m-labs
<_florent_> for the 1Gbps PHY, I was thinking more of pushing 32-bit data @ 31.25MHz, and I'm not planning to use it on slowtan6 :)
<sb0> why 32? and not 16 or 64? :)
<_florent_> but I still have some work to do and will think about our discussion, and try to convince myself that your solution is better :)
<_florent_> because it's the size of the wishbone bus
<_florent_> and also the width of the ram
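The width/clock-ratio trade-off in this exchange is easy to check numerically. A small sketch using the figures from the conversation (8 bits @ 125MHz on the PHY side, a sys clock of about 83MHz, and _florent_'s 32 bits @ 31.25MHz); the helper name is made up:

```python
import math

def min_fifo_width(phy_bits, f_phy_hz, f_sys_hz, granularity=8):
    """Smallest read-side word width (a multiple of `granularity` bits)
    that sustains the PHY input rate at the slower system clock."""
    raw = phy_bits * f_phy_hz / f_sys_hz
    return math.ceil(raw / granularity) * granularity

# A GMII-style PHY pushing 8 bits at 125MHz, read out at ~83MHz,
# needs at least a 16-bit FIFO word to avoid overflowing.
print(min_fifo_width(8, 125e6, 83e6))   # -> 16

# _florent_'s figure: 32-bit words at 31.25MHz carry exactly 1Gbps.
print(int(32 * 31.25e6))                # -> 1000000000
```

This is sb0's point about messiness: the optimum width changes whenever the sys/PHY clock ratio does, while a dual-port BRAM written directly in the PHY domain sidesteps the calculation entirely.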
<sb0> well there are two problems
<sb0> 1) absorbing/delivering the data continuously at the PHY rate
<sb0> 2) packing it into WB words
<sb0> with what I'm proposing, both problems are decoupled - so it's more flexible
<sb0> also, you're using a bit less resources, as you don't implement FIFOs that are not really necessary
<_florent_> but you use synchronizers for each control/status signal, so on resource usage it's probably almost the same
<sb0> a pulse synchronizer on a control signal is 3 flip-flops
Alain_ has quit [Remote host closed the connection]
<sb0> and 2 LUTs
<sb0> those FIFOs contain storage, counters, synchronizers on the counters (2 flip-flops per counter bit), and a bunch of LUTs for the control logic
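The cheap pulse synchronizer sb0 is counting flip-flops for is typically a toggle-and-edge-detect scheme. A plain-Python cycle model of that idea (a sketch of the technique, not the Migen implementation; names are made up):

```python
# The sender flips a level on each pulse instead of passing the pulse
# itself; the receiver double-registers the level (metastability
# filtering) and turns each level change back into a one-cycle pulse.

class PulseSync:
    def __init__(self):
        self.toggle = 0   # flip-flop in the sending clock domain
        self.ff1 = 0      # first synchronizer stage (receiving domain)
        self.ff2 = 0      # second synchronizer stage
        self.prev = 0     # edge-detect register

    def tx_clock(self, pulse):
        """One cycle of the sending clock domain."""
        if pulse:
            self.toggle ^= 1

    def rx_clock(self):
        """One cycle of the receiving clock domain; returns the
        resynchronized pulse (high for one cycle per tx pulse)."""
        out = self.ff2 ^ self.prev
        self.prev = self.ff2
        self.ff2 = self.ff1
        self.ff1 = self.toggle
        return out
```

A handful of registers and a couple of LUTs, versus an AsyncFIFO's storage, gray-coded counters, and counter synchronizers.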
<_florent_> I agree, but the idea is to have the smallest possible FIFO
<_florent_> it's actually set to 4 due to a limitation of AsyncFIFO, but it can be set to 2
<sb0> which is zero :-P
<_florent_> haha :)
<sb0> that being said, those FIFOs would be nice if there's DMA
<sb0> in fact, I had those, with DMA, in the very first versions of minimac
<_florent_> ah ok, wasn't aware of that
<sb0> but it was messy to prevent them from over/underflowing with the limited wishbone bandwidth
<sb0> packets were getting corrupted when the WB bus was loaded
<sb0> and I had to use pretty large FIFOs... and triggering TX after the FIFO had been filled to a good level
<sb0> so I stopped messing around and used BRAMs with predictable timing
<_florent_> we could maybe implement LASMI ports that have high priority; this would enable the use of DMA for ethernet
<sb0> yeah, LASMI should have plenty of bandwidth
<_florent_> and having high-priority ports can probably be useful for other applications
<sb0> I only tried Wishbone DMA before giving up and using BRAM
<sb0> I'm not sure if we need high-prio ports at all
<sb0> there's a good margin
<sb0> especially for 100Mbps
<sb0> HDMI works at like 30x that bandwidth, no problem, without a priority mechanism
<_florent_> ok, if you want I can try ethernet with DMA since I'm working on it
<sb0> the problem, however, is that you need to invalidate the L2 (bridge) cache before transferring packets with the CPU
<sb0> might make things slower...
<_florent_> I'm also thinking of inserting/checking the ethernet CRC in hardware, since it won't use too many resources and will improve performance
<sb0> it'd be possible to have cache-coherent DMA though
bentley` has quit [Remote host closed the connection]
<sb0> there's a second port on the bridge BRAM we're not needing atm
<sb0> yes, hw crc is probably good
<sb0> one thing that annoys me with that cache hierarchy is that it's mostly needed for WB compatibility
<sb0> you could access the CPU caches directly from LASMI, and I'm sure the system would still meet timing in most cases
<sb0> with less resources, and higher performance
<_florent_> one more thing to do if someone rewrites LM32 in Migen :)
<sb0> other bus accesses from the CPU are MMIO, which is easy, and XIP which pretty much needs the cache and is more difficult - unless we can tolerate a sluggish BIOS execution
<sb0> now that I think of it we probably can... the BIOS can load a second-stage bootloader into the SDRAM for things like netboot that need a bit of performance
<_florent_> BTW, what do you think of my change to the LASMI interface? splitting the write/read data acks and managing latencies inside the crossbar
<_florent_> I think it's easier to use the LASMI ports like that
<_florent_> the ack on the read signifies that data is valid
<_florent_> the ack on the write signifies that we have to present the data
<sb0> early acks allow earlier unlocking of arbitration and higher performance on loaded buses
<sb0> did you test that your option even still allows HD mixing on the mixxeo?
<_florent_> yes, it was ok and has exactly the same performance as the current implementation
<_florent_> it's just that the shift registers for the ack signals are in the crossbar instead of in the DMA
<sb0> 2x 720p60 in 1x 720p60 out? with mixing from both input framebuffers?
<_florent_> and the wishbone bridge is then simplified
<sb0> you'd still have early acks from the controller?
<_florent_> I was using 2x 720p60 in 1x 720p60 out
<_florent_> you still have req ack, write data ack, read data ack
<_florent_> in fact it's very close to an AXI interface: 1 cmd interface, 1 write interface, 1 read interface
<sb0> what I'm asking is - does the controller still unlock arbitration a few cycles before the actual data transfer takes place?
<_florent_> and if we want reordering later, we can add a signal asserted with the ack to indicate which data the controller wants to write (or read)
<_florent_> yes, the behaviour is exactly the same
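The "shift registers for the ack signals" moved into the crossbar can be modeled as a fixed-latency pipeline: a command accepted now produces a read-data ack a known number of cycles later. A toy sketch (made-up names, not the MiSoC code):

```python
from collections import deque

# The crossbar tracks the controller's fixed read latency with a shift
# register; each DMA engine just waits for the read-data ack instead of
# counting cycles itself.

class ReadAckPipe:
    def __init__(self, latency):
        self.pipe = deque([0] * latency)   # one slot per latency cycle

    def cycle(self, cmd_accepted):
        """Advance one cycle; returns the read-data ack (data valid)."""
        ack = self.pipe.popleft()
        self.pipe.append(1 if cmd_accepted else 0)
        return ack

pipe = ReadAckPipe(latency=3)
acks = [pipe.cycle(c) for c in (1, 0, 0, 0, 0)]
# the ack for the cycle-0 command appears exactly 3 cycles later:
# acks == [0, 0, 0, 1, 0]
```

This keeps early acks (the command slot frees immediately) while centralizing the latency bookkeeping, which is the simplification _florent_ is describing.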
<sb0> I've experimented with full reordering... tends to cause big timing closure issues for small performance improvements
<sb0> e.g. page hit optimization
<_florent_> you've tried on random accesses?
<sb0> no, with the current soc access patterns, which aren't random
<sb0> yes, it's probably more worthwhile if the pattern is actually random
<sb0> but timing *is* a big mess
<_florent_> you just have to pipeline things
<sb0> that increases latency :)
<sb0> also, how would you pipeline page hit reordering?
<_florent_> of course, but yes, you have to study whether the gain will compensate for the increase in latency
<sb0> request issue is another problem
<_florent_> sorry, I don't have in mind how it's done in LASMICON, but when I was working on a controller for a customer, we didn't have so many timing issues
<sb0> you need a priority encoder to find an empty location in the controller's request slots
<sb0> and you pretty much want 1-cycle issue, since the slowness of the FPGA vs. the speed of DRAM means that after serialization, bursts take 1 FPGA cycle
<_florent_> yes especially when you work at half rates or quarter rates in the FPGA
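sb0's "bursts take 1 FPGA cycle" claim follows from the clock ratios; a quick check with illustrative numbers (DDR transfers and a quarter-rate FPGA clock are assumptions for the example, not quoted specs):

```python
# How many FPGA cycles one DRAM burst occupies after serialization.

def fpga_cycles_per_burst(burst_len, transfers_per_dram_clk, dram_to_fpga_ratio):
    dram_clks = burst_len / transfers_per_dram_clk   # DRAM clocks per burst
    return dram_clks / dram_to_fpga_ratio            # scale to FPGA clocks

# DDR (2 transfers/clock), BL8, FPGA logic at a quarter of the DRAM clock:
print(fpga_cycles_per_burst(8, 2, 4))   # -> 1.0
```

With a whole burst fitting in one FPGA cycle, any request-issue logic slower than 1 cycle directly wastes DRAM bandwidth, hence the pressure on the priority encoder timing.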
<sb0> well... maybe in a system where you issue requests in packs of, say, 16, it makes sense to have reordering
<_florent_> for my modification, in fact it's simply that
<_florent_> this part of code is moved into the controller
<sb0> also you can do the page-hit reordering within those blocks
<_florent_> and then here
<sb0> which can be reasonably pipelined
<_florent_> you simply use the ack from LASMI
<_florent_> same thing for the write
<sb0> what reordering controller did you use btw?
<sb0> hmm, yeah, I guess we could have fixed LASMI latency after the arbiter
<_florent_> ok, if it's ok for you I'll adapt the simulation too and send you a patch
Jespers has joined #m-labs
Jespers has quit [Client Quit]
_florent_ has quit [Quit: Page closed]
sb0 has quit [Quit: Leaving]
aeris has quit [Ping timeout: 240 seconds]
mumptai has quit [Ping timeout: 240 seconds]
aeris has joined #m-labs
stekern has quit [Ping timeout: 264 seconds]
stekern has joined #m-labs