<rjo>
sb0: afaict adding a simple ndarray json encoder/decoder would be a similar amount of work as using lists for multi_frame().
<sb0>
yes, I looked at supporting lists in pdq2, and it was messy
<rjo>
sb0: or stated differently: numpy ndarrays are awesome ;)
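For illustration only, a minimal ndarray JSON codec could look like the sketch below; the "__ndarray__" tag and helper names are made up for this example, not an existing pdq2/ARTIQ API.
```python
# Minimal sketch of a JSON codec for numpy ndarrays; names are illustrative only.
import base64
import json

import numpy as np

def encode_ndarray(a):
    # store dtype, shape and raw bytes so the array round-trips exactly
    return {"__ndarray__": True,
            "dtype": a.dtype.str,
            "shape": list(a.shape),
            "data": base64.b64encode(a.tobytes()).decode("ascii")}

def decode_ndarray(d):
    raw = base64.b64decode(d["data"])
    return np.frombuffer(raw, dtype=np.dtype(d["dtype"])).reshape(d["shape"])

a = np.arange(6, dtype=np.float64).reshape(2, 3)
roundtripped = decode_ndarray(json.loads(json.dumps(encode_ndarray(a))))
assert np.array_equal(a, roundtripped)
```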
<rjo>
sb0: re ipv4/6, to check my memory: we said that core-device-controller to core-device would be something-over-udp-over-ethernet.
<rjo>
sb0: arp and minimal icmp would be necessary for that iirc.
<rjo>
sb0: or neighbor solicitation for ipv6
<rjo>
sb0: does it look like the ipv4/arp/icmp layer group could be roughly interchangeable with the ipv6/ndisc/icmp6 group?
<rjo>
we are not even allowed to run ipv6 on the regular NIST net... local closed nets are different, though.
<rjo>
the only disadvantage of ipv6 would be that we could not run the core on the regular NIST net.
<rjo>
i love ipv6 and have been using it for (i am shocked) 12 years...
<sb0>
yes, the idea is to use UDP
<sb0>
only ARP is needed for that - no ICMP (unless we also want ping to work, but UDP does not require it)
<rjo>
well. technically yes. standards-wise you MUST do ICMP ;)
<rjo>
but yeah. i have no strong preference. ipv4/6 both have their advantages and disadvantages in our context.
<rjo>
why were you thinking about ipv6?
<sb0>
for no strong reason - just that it will replace ipv4 in some decades
<sb0>
but ipv4 would work just as well
<sb0>
what do you mean by "interchangeable"?
<rjo>
if you do one stack now then later plug in the other one (or even run them dual stack).
<sb0>
actually I've been thinking about putting the stack in gateware
<sb0>
the CPU would just see memory-mapped packet buffers to read/write
<sb0>
and the gateware handles (re)transmission and UDP
<sb0>
Florent has done some of that for another project it seems, but I don't know how far he went
<sb0>
this way we won't get an interrupt in the middle of a tightly timed loop because someone made an ARP lookup
<sb0>
adding another CPU core to handle the network/housekeeping is also an option
<rjo>
funny idea. sounds very much like these SDN fpga-routers.
<sb0>
the advantage of this option is that a miscompiled or otherwise corrupted kernel can't crash the management CPU
<rjo>
and fragmentation?
<sb0>
I'd do fragmentation in sw
<sb0>
and have the gateware provide reliable packet delivery under the MTU
<rjo>
maybe even on the top-most layer by just limiting message size.
<sb0>
yes, then fragmentation is only needed when sending a large array or kernel, which can be done at the upper layer
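As an illustration of how small that upper-layer fragmentation could be (the 1400-byte budget and the 6-byte header layout are invented numbers):
```python
# Illustrative upper-layer fragmentation: split a large payload into numbered
# chunks that each fit in one reliable sub-MTU packet. Header layout is made up.
import struct

PAYLOAD_PER_PACKET = 1400  # assumed budget left after the link/UDP headers

def fragment(message_id, payload):
    chunks = [payload[i:i + PAYLOAD_PER_PACKET]
              for i in range(0, len(payload), PAYLOAD_PER_PACKET)] or [b""]
    total = len(chunks)
    # header: message id, fragment index, fragment count
    return [struct.pack(">HHH", message_id, idx, total) + chunk
            for idx, chunk in enumerate(chunks)]

def reassemble(packets):
    frags = {}
    for p in packets:
        message_id, idx, total = struct.unpack(">HHH", p[:6])
        frags[idx] = p[6:]
    return b"".join(frags[i] for i in range(total))

pkts = fragment(7, b"x" * 5000)
assert reassemble(pkts) == b"x" * 5000
```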
<rjo>
hmm. i don't know. usually you'd just mask (ethernet) irqs before a tight loop...
<sb0>
how do you detect a tight loop?
<sb0>
and what if that loop has a bug and loops forever?
<rjo>
sb0: explicitly mark them.
<sb0>
syscall("disable_irqs") ?
<rjo>
sb0: timer irqs (for watchdog/timeouts) would not be masked.
<rjo>
sb0: something like that.
<sb0>
we can also limit the time during which interrupts can be masked
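A purely hypothetical sketch of what such an explicitly marked, time-limited masked region could look like from the kernel author's side; the syscall names (beyond "disable_irqs" mentioned above) and the budget are invented:
```python
# Hypothetical sketch: an explicitly marked region with masked IRQs, plus a
# crude check of the time budget. Syscall names and numbers are invented.
import time
from contextlib import contextmanager

MAX_MASKED_SECONDS = 1e-3  # assumed cap, enforced in the real system by a timer IRQ

def syscall(name):
    # stand-in for the real runtime syscalls ("disable_irqs"/"enable_irqs")
    print("syscall:", name)

@contextmanager
def irqs_masked():
    start = time.monotonic()
    syscall("disable_irqs")
    try:
        yield
    finally:
        syscall("enable_irqs")
        if time.monotonic() - start > MAX_MASKED_SECONDS:
            raise RuntimeError("IRQs stayed masked longer than the allowed budget")

with irqs_masked():
    pass  # the tightly timed loop would go here
```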
<rjo>
sb0: i am just thinking about specific situations. it's the usual overrun/underrun problem. sounds like we could almost automatically mask ethernet interrupts (from gateware) if an overrun on inputs or underrun on outputs is about to happen.
<sb0>
you can't know when an underflow is going to happen
<sb0>
e.g. if you have long_computation(), delay(long_time), pulse()
<rjo>
if the output fifo is empty, you could call that an underrun. (otherwise you could keep it full with a single dummy event).
<rjo>
sb0: before long_computation() you would be in a pending underrun condition.
<rjo>
that is: if the fifo is empty.
<sb0>
sounds a bit cumbersome...
<sb0>
what do you think of the dual-CPU option?
<rjo>
sb0: "a second cpu to do networking etc" sounds very much like "gateware that does ip/udp".
<sb0>
yes, the main differences are 1) we can hopefully recycle an existing software network stack 2) more robust (we can limit the memory address space of the kernel CPU to safe regions) 3) a bit less bandwidth
<rjo>
hmm. i'll have to think about what that would look like.
<rjo>
it would definitely solve the kernel-crash-recovery problem. the frontend cpu would just reset the backend cpu.
<rjo>
that would be something like a mailbox-style interface between the cpus, like they do for DSP co-processors (c.f. OMAP).
<rjo>
sb0: or were you thinking of that or1k-smp stuff?
<sb0>
I'd rather keep it generic, e.g. in case we go back to lm32
<sb0>
for inter-CPU communication, something like a small shared scratchpad with polling is enough
<sb0>
the CPUs have little else to do than polling
<rjo>
so more like a classic coprocessor. you could even have a lm32-frontend with a mor1kx-backend....
<sb0>
then the main buffers (e.g. arrays to be transmitted over the network) can be transferred via the main (shared) memory
<sb0>
a mixed-arch system would be messy to compile (the backend CPU will also execute static code from the runtime e.g. syscalls and compiler-rt), but technically possible
<rjo>
sounds cool but i'm currently a bit uncertain which problem we are trying to solve with that...
<rjo>
;)
<sb0>
with the mixed-arch?
<sb0>
or the whole two-CPU idea?
<rjo>
sb0: with any of multi-cpu or udp-in-gateware.
<rjo>
sb0: the lower half of packet handling. how much of a latency hit would that be?
<sb0>
the only problems are when the CPU is busy running kernel code
<sb0>
while it's doing that we need to at least be able to 1) answer ARP 2) process an "abort kernel" message
<rjo>
if the output fifo has events that keep it busy for the next N cycles, one could safely handle ethernet irqs if they are known to return within N cycles.
<sb0>
yes
<sb0>
well, mostly
<sb0>
handling too many of them could slow down a large computation later on and still cause an underflow, but that sounds unlikely
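A toy model of that gating rule (the cycle counts are invented; the real decision would live in gateware or the runtime):
```python
# Toy model of the proposed rule: only service an Ethernet IRQ when the output
# FIFO holds enough already-scheduled work to cover the handler's worst case.
IRQ_HANDLER_WORST_CASE_CYCLES = 2000  # assumed bound on the Ethernet handler

def may_service_ethernet_irq(queued_event_timestamps, now):
    # cycles of output work already queued ahead of "now"
    slack = max(queued_event_timestamps, default=now) - now
    return slack > IRQ_HANDLER_WORST_CASE_CYCLES

# with events queued well into the future, the IRQ can be taken safely
assert may_service_ethernet_irq([5000, 8000, 12000], now=1000)
# with an almost-drained FIFO, defer the IRQ to avoid risking an underflow
assert not may_service_ethernet_irq([1500], now=1000)
```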
<rjo>
hmmm. difficult question which is best here. coprocessor, networking-gateware or some automatic or explicit interrupt masking.
<rjo>
what are you leaning towards?
<sb0>
I don't like the explicit interrupt masking, which is user-unfriendly and obscure, and I'm sure people will use it incorrectly
<rjo>
explicit or automatic masking now and a coprocessor split later does not sound like too much of a work overhead if the coprocessor split is planned for.
<rjo>
explicit masking is definitely a low-level solution.
<sb0>
coprocessor split doesn't look all that difficult, and there are some good candidates for network stacks to recycle
<rjo>
i agree that reusing a soft network stack is worthwhile.
<sb0>
at least the apple store(tm) ahem, sorry, internet of things is good for something
<rjo>
ha!
<rjo>
if i can buy a "mote" that does energy harvesting and has some useful wireless stack i would actually buy quite a few of them for the lab.
<sb0>
the main advantage of the gateware approach is more bandwidth, but probably not a lot more (since the CPU still has to parse the contents of the packets in the end)
<sb0>
and copy memory
<sb0>
sounds a bit like overoptimization
<rjo>
a possible path seems to be: accurately delineate the line between the future frontend and the future backend now, and implement the split later.
<rjo>
yes. i don't think we need that much bandwidth to/from the core.
<sb0>
well the second CPU can still have read-only access to all the runtime mem
<sb0>
then it's just the same linker-based stuff, except that we start the second CPU instead of calling the function
<sb0>
with some details like properly setting up the stack pointer of the backend CPU
<sb0>
and any global variables that the syscalls write need to be put in a separate section that the backend can write
<rjo>
are you thinking of bram-backed memory-mapped scratch space for the two directions (front-to-back and reverse) + flags for data stb/ack?
<sb0>
in the beginning, the backend can just have the same memory space as the frontend
<rjo>
with different frame+stack+bss+data sections.
<sb0>
the message passing would just be backend->frontend "make RPC with parameters at address xxx in main memory", f->b "RPC completed, result is xxx", and b->f "kernel terminated with/without exception"
<rjo>
and f->b "start kernel at X".
<sb0>
yes, and/or use the reset signal
<rjo>
by a special backend EBA.
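A host-side toy model of that scratchpad mailbox protocol; the message names, fields and single-slot-per-direction layout are invented, and in hardware this would be a small shared memory region polled by both CPUs:
```python
# Toy software model of the proposed frontend/backend mailbox; illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    kind: str          # "start_kernel", "rpc_request", "rpc_done", "kernel_done"
    address: int = 0   # pointer into shared main memory (parameters/results)
    ok: bool = True    # e.g. kernel terminated with/without exception

@dataclass
class Scratchpad:
    # one slot per direction plus an implicit valid flag, stb/ack-style polling
    to_backend: Optional[Message] = None
    to_frontend: Optional[Message] = None

    def post(self, direction, msg):
        assert getattr(self, direction) is None, "previous message not consumed"
        setattr(self, direction, msg)

    def poll(self, direction):
        msg = getattr(self, direction)
        setattr(self, direction, None)
        return msg

pad = Scratchpad()
pad.post("to_backend", Message("start_kernel", address=0x40010000))
assert pad.poll("to_backend").kind == "start_kernel"
pad.post("to_frontend", Message("rpc_request", address=0x40020000))
```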
<rjo>
hmm. i like it. let's keep this option/feature in mind but schedule it a bit later.
<rjo>
or do you think this should be implemented now?
<sb0>
no, serial works fine for now, it's just slow and cannot abort kernels
<sb0>
things would be easier if PCs still had ISA...
<sb0>
just map a few IO ports to reset/start the CPU etc.
<sb0>
meh
<rjo>
isn't wishbone just a reborn ISA?
<sb0>
that you can't plug into a computer motherboard, unfortunately...
<sb0>
I see why CERN loves their VME systems
<ysionneau>
morning
<sb0>
now there's the problem of numpy arrays embedded in some other structure (list, dict, etc.)
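One possible (hedged) answer, assuming the plain-JSON route is kept: a JSONEncoder subclass plus an object_hook, so arrays nested anywhere inside lists/dicts are converted transparently. The "__ndarray__" tag is again just an illustrative convention.
```python
# Illustrative: transparent ndarray support inside nested lists/dicts via json hooks.
import base64
import json

import numpy as np

class NDArrayEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, np.ndarray):
            return {"__ndarray__": True, "dtype": o.dtype.str, "shape": list(o.shape),
                    "data": base64.b64encode(o.tobytes()).decode("ascii")}
        return json.JSONEncoder.default(self, o)

def ndarray_hook(d):
    if d.get("__ndarray__"):
        raw = base64.b64decode(d["data"])
        return np.frombuffer(raw, dtype=np.dtype(d["dtype"])).reshape(d["shape"])
    return d

obj = {"gain": 2, "frames": [np.zeros(4), np.ones((2, 2))]}
s = json.dumps(obj, cls=NDArrayEncoder)
back = json.loads(s, object_hook=ndarray_hook)
assert np.array_equal(back["frames"][1], np.ones((2, 2)))
```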
<ysionneau>
sb0: what kind of address does the Slicer take as input?
<ysionneau>
CPU address (byte aligned)? WB address (4-bytes aligned)?
<ysionneau>
DRAM address (2-bytes aligned)?
<sb0>
iirc DRAM address
<ysionneau>
aah makes sense ok
<ysionneau>
there is still something I don't get
<ysionneau>
as I understood, for the ppro example we have row(12 bits) bank(2 bits) col(8 bits)
<ysionneau>
and this addresses 16 bits of data
<ysionneau>
but to transform an address into the row number, I see the slicer does: address >> 7
<ysionneau>
I would at least do address >> 8, if by "row" the slicer means "row+bank"
<ysionneau>
it is as if there was already one bit thrown away in the address given to the Slicer
<sb0>
do you mean, to get the bank address?
<sb0>
also you don't have to use that slicer
<ysionneau>
I found the idea of having a "slicer" to cut down addresses into row/bank/col pretty useful
<ysionneau>
and I wanted to reuse the code
<ysionneau>
but I'm not sure if I can reuse the exact same one then...
<sb0>
the lasmicon slicer expects the bank address to be pre-extracted
<sb0>
so no, you can't reuse the exact same
<ysionneau>
ok that makes more sense, even if I still don't get why it only removes 7 bits instead of 8 to get rid of the column
<ysionneau>
but ok let's do a slicer specific to this simple sdram controller
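For the simple controller, a standalone slicer could look like this plain-Python sketch of the mapping (using the ppro geometry quoted above: 8 column bits, 2 bank bits, 12 row bits, addresses counted in 16-bit DRAM words); the real thing would of course be a Migen module:
```python
# Plain-Python sketch of a slicer for the simple SDRAM controller, using the
# ppro geometry quoted above (col=8, bank=2, row=12 bits, 16-bit DRAM words).
COL_BITS, BANK_BITS, ROW_BITS = 8, 2, 12

def slice_dram_address(addr):
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return row, bank, col

# e.g. DRAM word address 0x12345 -> row 0x48, bank 3, col 0x45
assert slice_dram_address(0x12345) == (0x48, 0x3, 0x45)
```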
<sb0>
what address_align did you give to it?
<ysionneau>
it seems to be 1 for now on papilio
<ysionneau>
since address_align = log2_int(burst_length)
<sb0>
yes, and log2_int(1) == 0
<sb0>
not 1
<ysionneau>
ah burst length=1 ?
<ysionneau>
I thought it was 2
<sb0>
it'll be 2 after the PHY does clock multiplication. but that's the next step
<sb0>
and yes, you'll want the last bit of address to be 0 in that case
<ysionneau>
so we need 2 read commands for 1 wb access then
<sb0>
otherwise you'll get burst reordering instead of getting a fully new burst at the next address
<sb0>
yes, but make it a 16-bit wb bus at the controller
<sb0>
and then use a 32-to-16 adapter between the cpu and controller
<ysionneau>
ok :)
<sb0>
I think _florent_ did some work on such adapters already...
<ysionneau>
so the entire wb bus would be 16 bits wide?
<sb0>
inside the sdram controller yes
<ysionneau>
but the sdram controller is just a slave connected to the main wishbone bus, right?
<sb0>
the sdram ctrl should create a wb interface with nphases*d bits
<ysionneau>
and cpu has 2 masters connected to this same bus?
<sb0>
yes
<sb0>
but the cpu bus is 32
<ysionneau>
yes
<sb0>
so you need some adapter that turns a cpu request into two 16 bit requests
<ysionneau>
ok so the cpu will not be connected directly to wishbone anymore, and the entire wishbone bus will be 16 bits-wide
<sb0>
with ddr3 it'll be the opposite - turn a 512-bit memory access into a 32-bit cpu word
<ysionneau>
and the adapter between CPU and WB will deal with the difference
<sb0>
huh, no
<sb0>
adapter between memory ctl and rest of the bus
<sb0>
everything else stays 32
<ysionneau>
right then I understand
<ysionneau>
adapter is on the controller side
<ysionneau>
ok
<sb0>
there are a number of other things in the system that expect 32-bit wishbone... the SRAM, the flash controller, ...
<ysionneau>
yes that's why I was surprised about the 16 bits wide but I just misunderstood
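A behavioural sketch of the 32-to-16 split described above (purely illustrative; the real adapter would be a Migen Wishbone down-converter, and the big-endian ordering here is just an assumption):
```python
# Behavioural sketch of the 32-to-16 adapter idea: one 32-bit CPU access is
# issued to the 16-bit controller bus as two consecutive 16-bit accesses.
# This models the data path only; handshaking/sel handling is omitted.

def cpu_write32(bus16_write, address32, data32):
    # assumed big-endian split: high half first, at the even 16-bit address
    bus16_write(2 * address32,     (data32 >> 16) & 0xFFFF)
    bus16_write(2 * address32 + 1, data32 & 0xFFFF)

def cpu_read32(bus16_read, address32):
    hi = bus16_read(2 * address32)
    lo = bus16_read(2 * address32 + 1)
    return (hi << 16) | lo

# tiny fake 16-bit memory standing in for the SDRAM controller's WB port
mem16 = {}
cpu_write32(mem16.__setitem__, 0x10, 0xDEADBEEF)
assert cpu_read32(mem16.__getitem__, 0x10) == 0xDEADBEEF
```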
<_florent_>
hi, yes I remember I was doing such an adaptation, but now it's integrated into Wishbone2LASMI
<_florent_>
but the code for the adaptation is probably that one:
<_florent_>
or maybe you can use Wishbone2LASMI with a minimal cache
<sb0>
nah, the cache is useless here
<_florent_>
it's not possible to have only 1 WB word in the cache?
<sb0>
one thing which will be nice for ddr3 (with 512-bit accesses) is WB burst support in the converter
<sb0>
and then we can call a factory function in gensoc that selects the correct converter based on the CPU and memory wishbone buses
<sb0>
either downconverter, or the future converter which does 32 -> 512 with burst support
<_florent_>
I have a version of Converter in my fork that supports packets (sop, eop), it can probably be used for bursts, we will just have to increase the address at each new data word of the packet
<sb0>
wishbone bursts?
<_florent_>
no, on dataflow, but we can probably have some logic in common, or use the same low-level modules
<_florent_>
what would be even better is to provide Wishbone support only for compatibility, and internally use our own bus inspired by AXI but simplified...
<sb0>
what's wrong with WB?
<larsc>
_florent_: how do you simplify AXI?
<_florent_>
AXI buses seem easier to pipeline and to use because address/data use separate channels
<_florent_>
maybe not simplify, but use the minimal version and maybe not support everything
<larsc>
most of the advanced stuff is optional in the spec anyway I think
<_florent_>
ok
<larsc>
I'm a big fan of AXI, so much simpler than the other stuff
<_florent_>
I've not used AXI a lot, but yes it seems simple, at least when you want to pipeline things
<sb0>
by separate channels, you mean e.g. for reads the address is acked, and then the target core makes a request to return data?
<larsc>
the basic protocol for the channels is like the ack/stb in migen
<larsc>
for reading you have 2 channels, one request channel and one response channel
<_florent_>
yes, similar to LASMI for a read, for example, now that you have split ack for command and data
<larsc>
for writing you have 3 channels, one request, one data and one response channel
<sb0>
LASMI arbitration is a bitch, tho
<larsc>
the channels can be, but don't have to be asynchronous
<sb0>
this scheme makes it a lot more complex than WB arbitration
<larsc>
so you can send e.g. 3 read requests before you get data back
<larsc>
which is great for pipelining
<sb0>
yes. but then the arbiter has to know that 3 data items have to come back before it can switch to another initiator...
<larsc>
typically each master has a unique ID and the arbiter just passes the IDs along and when a response comes back uses it to know where to send the response
<larsc>
AXI is pretty much a simple NOC
<sb0>
so you also need an arbiter in the other direction, I guess?
<sb0>
since several data items can come back for the same initiator at the same time
<larsc>
yes
<larsc>
although I think out-of-order replies have been dropped from the spec in AXI4
<sb0>
well if you send requests to several targets with different latencies, then out-of-order replies might still happen
<larsc>
I think the interconnect has to make sure that this does not happen
<larsc>
the easy solution is to block parallel requests to different slaves
<larsc>
a single master can also choose to have multiple IDs; in this case it is still possible that requests for different IDs are completed out of order
<ysionneau>
sounds like a bit of a mess
<larsc>
no, from a system point of view it just looks like separate masters
<larsc>
it's like a network card with multiple IP addresses
<ysionneau>
ah indeed
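A toy model of the ID mechanism larsc describes (the master/slave names are invented): the interconnect carries the issuing master's ID with each read request and uses the ID returned with the response to route the data back; responses come back in request order, matching the AXI4 in-order note above.
```python
# Toy model of ID-based response routing on a split request/response read path.
# Master/slave names are invented for illustration.
from collections import deque

class Interconnect:
    def __init__(self):
        self.pending = deque()   # (master_id, slave, address) awaiting completion

    def read_request(self, master_id, slave, address):
        # the ID travels with the request; several can be outstanding at once
        self.pending.append((master_id, slave, address))

    def deliver_next_response(self, masters):
        master_id, slave, address = self.pending.popleft()
        data = slave(address)
        # the returned ID tells the interconnect where to send the response
        masters[master_id].append((address, data))

rom = {0x0: 0x11, 0x4: 0x22}
masters = {"cpu": [], "dma": []}
ic = Interconnect()
ic.read_request("cpu", rom.get, 0x0)   # two requests in flight
ic.read_request("dma", rom.get, 0x4)   # before any data has come back
ic.deliver_next_response(masters)
ic.deliver_next_response(masters)
assert masters["cpu"] == [(0x0, 0x11)] and masters["dma"] == [(0x4, 0x22)]
```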
<_florent_>
at least there is a fact:
<_florent_>
- you read the AXI specification, everything seems clear and simple.
<_florent_>
- you read the Wishbone specification, you don't understand anything...
<_florent_>
:)
<ysionneau>
fair enough =)