#m-labs on 2014-05-26 — irc logs at freenode.irclog.whitequark.org

2013-12-11 12:34 lekernel changed the topic of #m-labs to: Mixxeo, Migen, MiSoC & other M-Labs projects :: fka #milkymist :: Logs http://irclog.whitequark.org/m-labs

01:41 Jespers has joined #m-labs

05:52 rjo_ has quit [Ping timeout: 240 seconds]

06:44 Alain_ has joined #m-labs

06:59 [florian] has quit [Ping timeout: 252 seconds]

06:59 [florian] has joined #m-labs

07:06 sb0 has joined #m-labs

07:21 sb0 has quit [Ping timeout: 240 seconds]

07:34 sb0 has joined #m-labs

07:38 _florent_ has joined #m-labs

07:54 nicksydney has quit [Remote host closed the connection]

07:54 nicksydney has joined #m-labs

08:31 mumptai has joined #m-labs

08:58 sh4rm4 has quit [Remote host closed the connection]

09:02 sh4rm4 has joined #m-labs

09:05 sh4rm4 has quit [Remote host closed the connection]

09:06 mumptai has quit [Ping timeout: 264 seconds]

09:07 sh4rm4 has joined #m-labs

10:10 sh[4]rm4 has joined #m-labs

10:12 sh4rm4 has quit [Ping timeout: 252 seconds]

10:58 xiangfu has quit [Ping timeout: 252 seconds]

11:10 _florent_ has quit [Ping timeout: 240 seconds]

13:18 _florent_ has joined #m-labs

15:23 Jespers has quit [Quit: Jespers]

15:30 rjo_ has joined #m-labs

16:05 <_florent_> hi

16:06 <_florent_> I'm facing a small optimization issue when rewriting Minimac with Migen:

16:06 <_florent_> to replace RAMB16BWER in minimac3_memory for the tx part, I need a dual port ram with

16:06 <_florent_> - 1 port used by wishbone *with byte enable*

16:06 <_florent_> - 1 port used by my logic to read the frame

16:07 <_florent_> the thing is that it does not seem to be possible to infer a dual port ram in block ram with ISE with bytes enable...

16:07 <_florent_> http://danstrother.com/2010/09/11/inferring-rams-in-fpgas/

16:07 <_florent_> so my ram is implemented in lut...

16:07 <_florent_> setting the we_granularity from 8 to 1 for my wishbone SRAM will make implement it in block ram, but I don't know if it can cause issues in the case of the tx_buffer?

16:08 <_florent_> or maybe someone has another idea?

16:25 <kristianpaul> hmm

16:26 <kristianpaul> are you sure about this?

16:29 <_florent_> yes when I removed the bytes enable, it's implemented correctly

16:30 <_florent_> in danstrother's review

16:30 <_florent_> "Both Xilinx and Altera support inferring byte-enables, to a certain extent. Xilinx only supports byte-enables on single-port memories. Altera supports byte-enables on both simple and true dual-port memories. Again: Xilinx supports VHDL and Verilog, Altera supports VHDL and SystemVerilog, and both use mutually incompatible constructs."

16:32 <sb0> _florent_, as a workaround there's a simplify transform in migen that will break down your RAM into four 8-bit memories

16:33 <sb0> this is not efficient, but it will work and is portable

16:34 <_florent_> ah ok, wasn't aware of this, I'll have a look

16:34 <sb0> I think the proper way to fix it is to hook the RAM generator in Mibuild to make the output palatable to the altera shitware

16:34 <sb0> with the byte-enable signals

16:36 <_florent_> I'm not sure it's not working on altera, for now I'm working with testing it on Mixxeo

16:37 <_florent_> but Lut usage for MiniSoC jumps from less than 3000 Luts to 4500 because of this...

16:54 <larsc> It probably makes sense for migen to have a abstract block ram memory module which can generate specialized primitives depending on the target platform

16:57 <_florent_> yes probably because infering rules seems to be different for each vendor...

17:00 <sb0> larsc, there's already such a memory module, and it supports different generators

17:00 <sb0> only there's only one generator at the moment, that outputs code that works in most cases

17:01 <sb0> though if the xilinx shitware doesn't support inference of multiport + byte enable, I think a direct code generator isn't the best way to solve it

17:02 <sb0> better have a transform that turns the memory module into RAMB*BWER instances

17:03 <sb0> this way you can use migen to create and connect those instances :)

17:04 <sb0> outputting the systemverilog construct for the altera shitware could be done by hooking the code generator though

17:15 <_florent_> in wishbone.py, I'm doing this

17:15 <_florent_> http://pastebin.com/JiSGqxYN

17:15 <_florent_> self.specials += FullMemoryWE(self.mem)

17:15 <_florent_> but it does not seem to work

17:16 <_florent_> maybe I'm not using correctly the FullMemroryWE decorator?

17:17 <sb0> well, first you're using it in wishbone.py and it does not belong here

17:18 <sb0> where it should be put is in mibuild, and apply automatically to any dual-port+byte-enable memory targeting xilinx

17:19 <_florent_> I agree, it's just a dirty test for now...

17:20 <_florent_> do you have an example where you use FullMemoryWE?

17:21 <sb0> I've used it with MiSoC make.py (-d FullMemoryWE) to apply it globally

17:22 <_florent_> ok thanks, I'll look at it

17:22 <sb0> and it's a module decorator

17:23 <sb0> you can't apply them to specials

17:23 <sb0> for your dirty test you can just do

17:23 <sb0> @FullMemoryWe

17:23 <sb0> class SRAM(Module):

17:23 <sb0> ...

17:23 <_florent_> thanks

17:44 <_florent_> it's working :) (ressource usage and test on board)

17:44 <_florent_> old version:

17:45 <_florent_> Luts : 3,177 / Registers: 2,616

17:45 <_florent_> new version:

17:45 <_florent_> Luts : 3,296 / Registers: 2,591

17:46 <_florent_> but the archirecture is not the same, and CDC is done using AsyncFIFOs:

17:46 <_florent_> http://git.io/jfeDGA

18:09 <sb0> hmm, why the FIFOs?

18:10 <sb0> SRAM_DEPTH = << lower case please

18:17 <_florent_> I'm now using a dataflow architecture and think it's easier to handle all regitsters in sys clock domain

18:38 mumptai has joined #m-labs

18:46 <sb0> mh?

18:47 <sb0> afaict the main reason to use AsyncFIFO over synchronizers would be to queue several words without waiting for a handshake - for performance reasons

18:50 <_florent_> why? if you have a dataflow architecture, you can change clock domain using AsyncFIFO and you don't have to manage multiple clock domain crossing

18:51 <_florent_> I think it's easier to do like that

18:51 <_florent_> and it will be easier to use another phy with this new architecture

18:52 <_florent_> but if you are really not convinced of that I can still change it

18:54 <_florent_> the only disavantage here is that more logic runs at sys_clk instead of eth_rx/eth_tx clks

18:56 <_florent_> but this is an a disavantage here with a 100Mbps PHY that will disappear with a 1Gbps PHY

19:05 <sb0> it's rather weird, that you're using a FIFO (that already contains potential BRAM storage) just to fill a dual-port, multi-clock capable BRAM in the other clock domain

19:08 sh[4]rm4 has quit [Remote host closed the connection]

19:08 sh[4]rm4 has joined #m-labs

19:08 sh[4]rm4 has quit [Remote host closed the connection]

19:14 <sb0> and hmm, seems the 1Gbps PHY would be much better off writing a BRAM at 125MHz than pushing data at 125MHz into a 4-word-deep FIFO later read at about 83MHz :)

19:15 <sb0> you can of course increase the FIFO word width to compensate, but it gets messy, especially with the optimum word width depending on the system/phy clock freq ratio

19:16 <sb0> that's how it's done in dvisampler/framebuffer, and I don't like it

19:17 <sb0> but there aren't many options when DMA-streaming a large buffer

19:17 <sb0> BRAM is different

19:26 sh4rm4 has joined #m-labs

19:36 <_florent_> for the 1Gbps PHY, I was more thinking of pushing 32 bits data @ 31.25Mhz, and I'm not planning to use it on slowtan6 :)

19:37 <sb0> why 32? and not 16 or 64? :)

19:38 <_florent_> but I have still some work to do and will think about our discussion, and tried to convince myself that yout solution is better :)

19:38 <_florent_> because it's the size of the wishbone bus

19:39 <_florent_> tried / try

19:41 <_florent_> and also the width of the ram

19:41 <sb0> well there are two problems

19:42 <sb0> 1) absorbing/delivering the data continuously at the PHY rate

19:42 <sb0> 2) packing it into WB words

19:42 <sb0> with what I'm proposing, both problems are decoupled - so it's more flexible

19:43 <sb0> also, you're using a bit less resources, as you don't implement FIFOs that are not really necessary

19:45 <_florent_> but you use synchronizers for each control/status signal so on the ressource usage it's probably almost the same

19:45 <sb0> a pulse synchronizer on a control signal is 3 flip-flops

19:45 Alain_ has quit [Remote host closed the connection]

19:46 <sb0> and 2 LUTs

19:47 <sb0> those FIFOs contains storage, counters, synchronizers on the counters (2 flip-flops per counter bit), and a bunch of LUTs for the control logic

19:48 <sb0> *contain

19:48 <_florent_> I agree but the idea it to have the smallest possible FIFO

19:49 <_florent_> It's actually set to 4 due to a limitation on the AsyncFIFO, but it can be set to 2

19:49 <sb0> which is zero :-P

19:50 <_florent_> haha :)

19:50 <sb0> that being said, those FIFOs would be nice if there's DMA

19:51 <sb0> in fact, I had those, with DMA, in the very first versions of minimac

19:51 <_florent_> ah ok, wasn't aware of that

19:51 <sb0> but it was messy to prevent them from over/underflowing with the limited wishbone bandwidth

19:52 <sb0> packets were getting corrupted when the WB bus was loaded

19:53 <sb0> and I had to use pretty large FIFOs... and triggering TX after the FIFO had been filled to a good level

19:54 <sb0> so I stopped messing around and used BRAMs with predictable timing

19:54 <_florent_> we can maybe implement LASMI ports that have high priority, this will enable the use of DMA for ethernet

19:55 <sb0> yeah, LASMI should have plenty of bandwidth

19:55 <_florent_> and having high priority ports can probably be useful for others applications

19:55 <sb0> I only tried Wishbone DMA before giving up and using BRAM

19:55 <sb0> I'm not sure if we need high-prio ports at all

19:56 <sb0> there's a good margin

19:56 <sb0> especially for 100Mbps

19:56 <sb0> HDMI works at like 30x that bandwidth, no problem, without a priority mechanism

19:57 <_florent_> ok, if you want I can try with ethernet with DMA since I'm working on it

19:57 <sb0> the problem, however, is that you need to invalidate the L2 (bridge) cache before transferring packets with the CPU

19:59 <sb0> might make things slower...

19:59 <_florent_> I'm aslo thinking of inseting /checking ethernet CRC in hardware, since it won't use too much ressource and will improve performance

19:59 <sb0> it'd possible to have cache-coherent DMA though

20:00 bentley` has quit [Remote host closed the connection]

20:00 <sb0> there's a second port on the bridge BRAM we're not needing atm

20:02 <sb0> yes, hw crc is probably good

20:05 <sb0> one thing that annoys me with that cache hierarchy is it's mostly needed for WB compatibility

20:06 <sb0> you could access the CPU caches directly from LASMI, and I'm sure the system would still meet timing in most cases

20:06 <sb0> with less resources, and higher performance

20:08 <_florent_> one thing to do if someone rewrite LM32 with Migen :)

20:09 <sb0> other bus accesses from the CPU are MMIO, which is easy, and XIP which pretty much needs the cache and is more difficult - unless we can tolerate a sluggish BIOS execution

20:10 <sb0> now that I think of it we probably can... the BIOS can load a second-stage bootloader into the SDRAM for things like netboot that need a bit of performance

20:18 <_florent_> BTW, what do you think of my change on the interface of LASMI? splitting write/read data acks and managing latencies inside the crossbar

20:19 <_florent_> I think it easier to use the LASMI ports like that

20:21 <_florent_> the ack on the read will signifies that data is valid

20:21 <_florent_> the ack on the write will signifies that we have to present the data

20:25 <sb0> early acks allow earlier unlocking of arbitration and higher performance on loaded buses

20:25 <sb0> did you test that your option even still allows HD mixing on the mixxeo?

20:27 <_florent_> yes it was ok and has exactly the same performance of the current implementation

20:27 <_florent_> it's just that the shift registrer of the ack signals are in the crossbar instead of being in the DMA

20:28 <sb0> 2x 720p60 in 1x 720p60 out? with mixing from both input framebuffers?

20:28 <_florent_> and the wishbone bridge is then simplified

20:29 <sb0> you'd still have early acks from the controller?

20:29 <_florent_> I was using 2x 720p60 in 1x 720p60 out

20:30 <_florent_> you still have req ack, write data ack, read data ack

20:31 <_florent_> in fact it's very close to a AXI interface: 1 cmd interface, 1 write interface, 1 read interface

20:35 <sb0> what I'm asking is - does the controller still unlock arbitration a few cycles before the actual data transfer takes place?

20:35 <_florent_> and if we want to have reordering later, we can add a signal that will be asserted with the ack to indicate which data the controller wants to write (or read)

20:36 <_florent_> yes, the behaviour is exactly the same

20:36 <sb0> I've experimented with full reordering... tends to cause big timing closure issues for small performance improvements

20:37 <sb0> e.g. page hit optimization

20:37 <_florent_> you've tried on random accesses?

20:38 <sb0> no, with the current soc access patterns, which aren't random

20:39 <sb0> yes, it's probably more worth doing it if the pattern is actually random

20:39 <sb0> but timing *is* a big mess

20:39 <_florent_> you just have to pipeline thing

20:40 <sb0> that increases latency :)

20:41 <sb0> also, how would you pipeline page hit reordering?

20:41 <_florent_> of course, but yes, you have to study if the gain will compensate the increase in latency

20:44 <sb0> request issue is another problem

20:44 <_florent_> sorry, I don't have in mind how it's done in LASMICON, but when I was working on a controller for a customer, we didn't had so much timing issues

20:45 <sb0> you need a priority encoder to find an empty location in the controller's request slots

20:46 <sb0> and you pretty much want 1-cycle issue, since the slowness of the FPGA vs. speed of DRAM means that after serializations bursts take 1 FPGA cycle

20:48 <_florent_> yes especially when you work at half rates or quarter rates in the FPGA

20:50 <sb0> well... maybe in a system where you issue requests per packs of, say, 16, then it makes sense to have reordering

20:51 <_florent_> for my modification, in fact it's simply that

20:52 <_florent_> https://github.com/m-labs/migen/blob/master/migen/actorlib/dma_lasmi.py#L45

20:52 <_florent_> this part if code is moved in the controller

20:52 <sb0> also you can do the page-hit reordering within those blocks

20:52 <_florent_> and then here

20:52 <sb0> which can be reasonably pipelined

20:52 <_florent_> https://github.com/m-labs/migen/blob/master/migen/actorlib/dma_lasmi.py#L58

20:53 <_florent_> you simply use the ack from LASMI

20:53 <_florent_> same thing for the write

20:58 <sb0> what reordering controller did you use btw?

20:59 <_florent_> http://www.barco-silex.com/sites/default/files/doc/BA317_FS.pdf

21:15 <sb0> hmm, yeah, I guess we could have fixed LASMI latency after the arbiter

21:18 <_florent_> ok if it's ok for you I will adapt the simulation too and send you a patch

21:38 Jespers has joined #m-labs

21:38 Jespers has quit [Client Quit]

21:46 _florent_ has quit [Quit: Page closed]

21:52 sb0 has quit [Quit: Leaving]

22:31 aeris has quit [Ping timeout: 240 seconds]

22:32 mumptai has quit [Ping timeout: 240 seconds]

22:34 aeris has joined #m-labs

23:05 stekern has quit [Ping timeout: 264 seconds]

23:07 stekern has joined #m-labs