<rjo>
sb0: afaict adding a simple ndarray json encoder/decoder would be a similar amount of work as using lists for multi_frame().
<sb0>
yes, I looked at supporting lists in pdq2, and it was messy
<rjo>
sb0: or stated differently: numpy ndarrays are awesome ;)
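For illustration only, a minimal ndarray JSON codec could look like the sketch below; the "__ndarray__" tag and helper names are made up for this example, not an existing pdq2/ARTIQ API.
```python
# Minimal sketch of a JSON codec for numpy ndarrays; names are illustrative only.
import base64
import json

import numpy as np

def encode_ndarray(a):
    # store dtype, shape and raw bytes so the array round-trips exactly
    return {"__ndarray__": True,
            "dtype": a.dtype.str,
            "shape": list(a.shape),
            "data": base64.b64encode(a.tobytes()).decode("ascii")}

def decode_ndarray(d):
    raw = base64.b64decode(d["data"])
    return np.frombuffer(raw, dtype=np.dtype(d["dtype"])).reshape(d["shape"])

a = np.arange(6, dtype=np.float64).reshape(2, 3)
roundtripped = decode_ndarray(json.loads(json.dumps(encode_ndarray(a))))
assert np.array_equal(a, roundtripped)
```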
<rjo>
sb0: re ipv4/6, to check my memory: we said that core-device-controller to core-device would be something-over-udp-over-ethernet.
<rjo>
sb0: arp and minimal icmp would be necessary for that iirc.
<rjo>
sb0: or neighbor solicitation for ipv6
<rjo>
sb0: does it look like the ipv4/arp/icmp layer group could be roughly interchangeable with the ipv6/ndisc/icmp6 group?
<rjo>
we are not even allowed to run ipv6 on the regular NIST net... local closed nets are different, though.
<rjo>
the only disadvantage of ipv6 would be that we could not run the core on the regular NIST net.
<rjo>
i love ipv6 and have been using it for (i am shocked) 12 years...
<sb0>
yes, the idea is to use UDP
<sb0>
only ARP is needed for that - no ICMP (unless we also want ping to work, but UDP does not require it)
<rjo>
well. technically yes. standards-wise you MUST do ICMP ;)
<rjo>
but yeah. i have no strong preference. ipv4/6 both have their advantages and disadvantages in our context.
<rjo>
why were you thinking about ipv6?
<sb0>
for no strong reason - just that it will replace ipv4 in some decades
<sb0>
but ipv4 would work just as well
<sb0>
what do you mean by "interchangeable"?
<rjo>
if you do one stack now then later plug in the other one (or even run them dual stack).
<sb0>
actually I've been thinking about putting the stack in gateware
<sb0>
the CPU would just see memory-mapped packet buffers to read/write
<sb0>
and the gateware handles (re)transmission and UDP
<sb0>
Florent has done some of that for another project it seems, but I don't know how far he went
<sb0>
this way we won't get an interrupt in the middle of a tightly timed loop because someone made an ARP lookup
<sb0>
adding another CPU core to handle the network/housekeeping is also an option
<rjo>
funny idea. sounds very much like these SDN fpga-routers.
<sb0>
the advantage of this option is that a miscompiled or otherwise corrupted kernel can't crash the management CPU
<rjo>
and fragmentation?
<sb0>
I'd do fragmentation in sw
<sb0>
and have the gateware provide reliable packet delivery under the MTU
<rjo>
maybe even on the top-most layer by just limiting message size.
<sb0>
yes, then fragmentation is only needed when sending a large array or kernel, which can be done at the upper layer
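As an illustration of how small that upper-layer fragmentation could be (the 1400-byte budget and the 6-byte header layout are invented numbers):
```python
# Illustrative upper-layer fragmentation: split a large payload into numbered
# chunks that each fit in one reliable sub-MTU packet. Header layout is made up.
import struct

PAYLOAD_PER_PACKET = 1400  # assumed budget left after the link/UDP headers

def fragment(message_id, payload):
    chunks = [payload[i:i + PAYLOAD_PER_PACKET]
              for i in range(0, len(payload), PAYLOAD_PER_PACKET)] or [b""]
    total = len(chunks)
    # header: message id, fragment index, fragment count
    return [struct.pack(">HHH", message_id, idx, total) + chunk
            for idx, chunk in enumerate(chunks)]

def reassemble(packets):
    frags = {}
    for p in packets:
        message_id, idx, total = struct.unpack(">HHH", p[:6])
        frags[idx] = p[6:]
    return b"".join(frags[i] for i in range(total))

pkts = fragment(7, b"x" * 5000)
assert reassemble(pkts) == b"x" * 5000
```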
<rjo>
hmm. i don't know. usually you'd just mask (ethernet) irqs before a tight loop...
<sb0>
how do you detect a tight loop?
<sb0>
and what if that loop has a bug and loops forever?
<rjo>
sb0: explicitly mark them.
<sb0>
syscall("disable_irqs") ?
<rjo>
sb0: timer irqs (for watchdog/timeouts) would not be masked.
<rjo>
sb0: something like that.
<sb0>
we can also limit the time during which interrupts can be masked
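A purely hypothetical sketch of what such an explicitly marked, time-limited masked region could look like from the kernel author's side; the syscall names (beyond "disable_irqs" mentioned above) and the budget are invented:
```python
# Hypothetical sketch: an explicitly marked region with masked IRQs, plus a
# crude check of the time budget. Syscall names and numbers are invented.
import time
from contextlib import contextmanager

MAX_MASKED_SECONDS = 1e-3  # assumed cap, enforced in the real system by a timer IRQ

def syscall(name):
    # stand-in for the real runtime syscalls ("disable_irqs"/"enable_irqs")
    print("syscall:", name)

@contextmanager
def irqs_masked():
    start = time.monotonic()
    syscall("disable_irqs")
    try:
        yield
    finally:
        syscall("enable_irqs")
        if time.monotonic() - start > MAX_MASKED_SECONDS:
            raise RuntimeError("IRQs stayed masked longer than the allowed budget")

with irqs_masked():
    pass  # the tightly timed loop would go here
```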
<rjo>
sb0: i am just thinking about specific situations. it's the usual overrun/underrun problem. sounds like we could almost automatically mask ethernet interrupts (from gateware) if an overrun on inputs or underrun on outputs is about to happen.
<sb0>
you can't know when an underflow is going to happen
<sb0>
e.g. if you have long_computation(), delay(long_time), pulse()
<rjo>
if the output fifo is empty, you could call that an underrun. (otherwise you could keep it full with a single dummy event).
<rjo>
sb0: before long_computation() you would be in a pending underrun condition.
<rjo>
that is: if the fifo is empty.
<sb0>
sounds a bit cumbersome...
<sb0>
what do you think of the dual-CPU option?
<rjo>
sb0: "a second cpu to do networking etc" sounds very much like "gateware that does ip/udp".
<sb0>
yes, the main differences are 1) we can hopefully recycle an existing software network stack 2) more robust (we can limit the memory address space of the kernel CPU to safe regions) 3) a bit less bandwidth
<rjo>
hmm. i'll have to think about what that would look like.
<rjo>
it would definitely solve the kernel-crash-recovery problem. the frontend cpu would just reset the backend cpu.
<rjo>
that would be something like a mailbox-style interface between the cpus, like they do for DSP co-processors (c.f. OMAP).
<rjo>
sb0: or were you thinking of that or1k-smp stuff?
<sb0>
I'd rather keep it generic, e.g. in case we go back to lm32
<sb0>
for inter-CPU communication, something like a small shared scratchpad with polling is enough
<sb0>
the CPUs have little else to do than polling
<rjo>
so more like a classic coprocessor. you could even have a lm32-frontend with a mor1kx-backend....
<sb0>
then the main buffers (e.g. arrays to be transmitted over the network) can be transferred via the main (shared) memory
<sb0>
a mixed-arch system would be messy to compile (the backend CPU will also execute static code from the runtime e.g. syscalls and compiler-rt), but technically possible
<rjo>
sounds cool but i'm currently a bit uncertain which problem we are trying to solve with that...
<rjo>
;)
<sb0>
with the mixed-arch?
<sb0>
or the whole two-CPU idea?
<rjo>
sb0: with any of multi-cpu or udp-in-gateware.
<rjo>
sb0: the lower half of packet handling. how much of a latency hit would that be?
<sb0>
the only problems are when the CPU is busy running kernel code
<sb0>
while it's doing that we need to at least be able to 1) answer ARP 2) process an "abort kernel" message
<rjo>
if the output fifo has events that keep it busy for the next N cycles, one could safely handle ethernet irqs if they are known to return within N cycles.
<sb0>
yes
<sb0>
well, mostly
<sb0>
handling too many of them could slow down a large computation later on and still cause an underflow, but that sounds unlikely
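A toy model of that gating rule (the cycle counts are invented; the real decision would live in gateware or the runtime):
```python
# Toy model of the proposed rule: only service an Ethernet IRQ when the output
# FIFO holds enough already-scheduled work to cover the handler's worst case.
IRQ_HANDLER_WORST_CASE_CYCLES = 2000  # assumed bound on the Ethernet handler

def may_service_ethernet_irq(queued_event_timestamps, now):
    # cycles of output work already queued ahead of "now"
    slack = max(queued_event_timestamps, default=now) - now
    return slack > IRQ_HANDLER_WORST_CASE_CYCLES

# with events queued well into the future, the IRQ can be taken safely
assert may_service_ethernet_irq([5000, 8000, 12000], now=1000)
# with an almost-drained FIFO, defer the IRQ to avoid risking an underflow
assert not may_service_ethernet_irq([1500], now=1000)
```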
<rjo>
hmmm. difficult question which is best here. coprocessor, networking-gateware or some automatic or explicit interrupt masking.
<rjo>
what are you leaning towards?
<sb0>
I don't like the explicit interrupt masking, which is user-unfriendly and obscure, and I'm sure people will use it incorrectly
<rjo>
explicit or automatic masking now and a coprocessor split later does not sound like too much of a work overhead if the coprocessor split is planned for.
<rjo>
explicit masking is definitely a low-level solution.
<sb0>
coprocessor split doesn't look all that difficult, and there are some good candidates for network stacks to recycle
<rjo>
i agree that reusing a soft network stack is worthwhile.
<sb0>
at least the apple store(tm) ahem, sorry, internet of things is good for something
<rjo>
ha!
<rjo>
if i can buy a "mote" that does energy harvesting and has some useful wireless stack i would actually buy quite a few of them for the lab.
<sb0>
the main advantage of the gateware approach is more bandwidth, but probably not a lot more (since the CPU still has to parse the contents of the packets in the end)
<sb0>
and copy memory
<sb0>
sounds a bit like overoptimization
<rjo>
a possible path seems to be: accurately delineate the line between the future frontend and the future backend now, and implement the split later.
<rjo>
yes. i don't think we need that much bandwidth to/from the core.
<sb0>
well the second CPU can still have read-only access to all the runtime mem
<sb0>
then it's just the same linker-based stuff, except that we start the second CPU instead of calling the function
<sb0>
with some details like properly setting up the stack pointer of the backend CPU
<sb0>
and any global variables that the syscalls write need to be put in a separate section that the backend can write
<rjo>
are you thinking of bram-backed memory-mapped scratch space for the two directions (front-to-back and reverse) + flags for data stb/ack?
<sb0>
in the beginning, the backend can just have the same memory space as the frontend
<rjo>
with different frame+stack+bss+data sections.
<sb0>
the message passing would just be backend->frontend "make RPC with parameters at address xxx in main memory", f->b "RPC completed, result is xxx", and b->f "kernel terminated with/without exception"
<rjo>
and f->b "start kernel at X".
<sb0>
yes, and/or use the reset signal
<rjo>
by a special backend EBA.
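A host-side toy model of that scratchpad mailbox protocol; the message names, fields and single-slot-per-direction layout are invented, and in hardware this would be a small shared memory region polled by both CPUs:
```python
# Toy software model of the proposed frontend/backend mailbox; illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    kind: str          # "start_kernel", "rpc_request", "rpc_done", "kernel_done"
    address: int = 0   # pointer into shared main memory (parameters/results)
    ok: bool = True    # e.g. kernel terminated with/without exception

@dataclass
class Scratchpad:
    # one slot per direction plus an implicit valid flag, stb/ack-style polling
    to_backend: Optional[Message] = None
    to_frontend: Optional[Message] = None

    def post(self, direction, msg):
        assert getattr(self, direction) is None, "previous message not consumed"
        setattr(self, direction, msg)

    def poll(self, direction):
        msg = getattr(self, direction)
        setattr(self, direction, None)
        return msg

pad = Scratchpad()
pad.post("to_backend", Message("start_kernel", address=0x40010000))
assert pad.poll("to_backend").kind == "start_kernel"
pad.post("to_frontend", Message("rpc_request", address=0x40020000))
```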
<rjo>
hmm. i like it. let's keep this option/feature in mind but schedule it a bit later.
<rjo>
or do you think this should be implemented now?
<sb0>
no, serial works fine for now, it's just slow and cannot abort kernels
<sb0>
things would be easier if PCs still had ISA...
<sb0>
just map a few IO ports to reset/start the CPU etc.
<sb0>
meh
<rjo>
isn't wishbone just a reborn ISA?
<sb0>
that you can't plug into a computer motherboard, unfortunately...
<sb0>
I see why CERN loves their VME systems
<ysionneau>
morning
<sb0>
now there's the problem of numpy arrays embedded in some other structure (list, dict, etc.)
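One possible (hedged) answer, assuming the plain-JSON route is kept: a JSONEncoder subclass plus an object_hook, so arrays nested anywhere inside lists/dicts are converted transparently. The "__ndarray__" tag is again just an illustrative convention.
```python
# Illustrative: transparent ndarray support inside nested lists/dicts via json hooks.
import base64
import json

import numpy as np

class NDArrayEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, np.ndarray):
            return {"__ndarray__": True, "dtype": o.dtype.str, "shape": list(o.shape),
                    "data": base64.b64encode(o.tobytes()).decode("ascii")}
        return json.JSONEncoder.default(self, o)

def ndarray_hook(d):
    if d.get("__ndarray__"):
        raw = base64.b64decode(d["data"])
        return np.frombuffer(raw, dtype=np.dtype(d["dtype"])).reshape(d["shape"])
    return d

obj = {"gain": 2, "frames": [np.zeros(4), np.ones((2, 2))]}
s = json.dumps(obj, cls=NDArrayEncoder)
back = json.loads(s, object_hook=ndarray_hook)
assert np.array_equal(back["frames"][1], np.ones((2, 2)))
```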
<ysionneau>
sb0: what kind of address does the Slicer take as input?
<ysionneau>
CPU address (byte aligned)? WB address (4-bytes aligned)?
<ysionneau>
DRAM address (2-bytes aligned)?
<sb0>
iirc DRAM address
<ysionneau>
aah makes sense ok
<ysionneau>
there is still something I don't get
<ysionneau>
as I understood, for the ppro example we have row(12 bits) bank(2 bits) col(8 bits)
<ysionneau>
and this addresses 16 bits of data
<ysionneau>
but to transform an address into the row number, I see the slicer does: address >> 7
<ysionneau>
I would at least do address >> 8, if by "row" the slicer means "row+bank"
<ysionneau>
it is as if there was already one bit thrown away in the address given to the Slicer
<sb0>
do you mean, to get the bank address?
<sb0>
also you don't have to use that slicer
<ysionneau>
I found the idea of having a "slicer" to cut down addresses into row/bank/col pretty useful
<ysionneau>
and I wanted to reuse the code
<ysionneau>
but I'm not sure if I can reuse the exact same one then...
<sb0>
the lasmicon slicer expects the bank address to be pre-extracted
<sb0>
so no, you can't reuse the exact same
<ysionneau>
ok that makes more sense, even if I still don't get why it only removes 7 bits instead of 8 to get rid of the column
<ysionneau>
but ok let's do a slicer specific to this simple sdram controller
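For the simple controller, a standalone slicer could look like this plain-Python sketch of the mapping (using the ppro geometry quoted above: 8 column bits, 2 bank bits, 12 row bits, addresses counted in 16-bit DRAM words); the real thing would of course be a Migen module:
```python
# Plain-Python sketch of a slicer for the simple SDRAM controller, using the
# ppro geometry quoted above (col=8, bank=2, row=12 bits, 16-bit DRAM words).
COL_BITS, BANK_BITS, ROW_BITS = 8, 2, 12

def slice_dram_address(addr):
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return row, bank, col

# e.g. DRAM word address 0x12345 -> row 0x48, bank 3, col 0x45
assert slice_dram_address(0x12345) == (0x48, 0x3, 0x45)
```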
<sb0>
what address_align did you give to it?
<ysionneau>
it seems to be 1 for now on papilio
<ysionneau>
since address_align = log2_int(burst_length)
<sb0>
yes, and log2_int(1) == 0
<sb0>
not 1
<ysionneau>
ah burst length=1 ?
<ysionneau>
I thought it was 2
<sb0>
it'll be 2 after the PHY does clock multiplication. but that's the next step
<sb0>
and yes, you'll want the last bit of address to be 0 in that case
<ysionneau>
so we need 2 read commands for 1 wb access then
<sb0>
otherwise you'll get burst reordering instead of getting a fully new burst at the next address
<sb0>
yes, but make it a 16-bit wb bus at the controller
<sb0>
and then use a 32-to-16 adapter between the cpu and controller
<ysionneau>
ok :)
<sb0>
I think _florent_ did some work on such adapters already...
<ysionneau>
so the entire wb bus would be 16 bits wide?
<sb0>
inside the sdram controller yes
<ysionneau>
but the sdram controller is just a slave connected to the main wishbone bus, right?
<sb0>
the sdram ctrl should create a wb interface with nphases*d bits
<ysionneau>
and cpu has 2 masters connected to this same bus?
<sb0>
yes
<sb0>
but the cpu bus is 32
<ysionneau>
yes
<sb0>
so you need some adapter that turns a cpu request into two 16 bit requests
<ysionneau>
ok so the cpu will not be connected directly to wishbone anymore, and the entire wishbone bus will be 16 bits-wide
<sb0>
with ddr3 it'll be the opposite - turn a 512-bit memory access into a 32-bit cpu word
<ysionneau>
and the adapter between CPU and WB will deal with the difference
<sb0>
huh, no
<sb0>
adapter between memory ctl and rest of the bus
<sb0>
everything else stays 32
<ysionneau>
right then I understand
<ysionneau>
adapter is on the controller side
<ysionneau>
ok
<sb0>
there are a number of other things in the system that expect 32-bit wishbone... the SRAM, the flash controller, ...
<ysionneau>
yes that's why I was surprised about the 16 bits wide but I just misunderstood
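A behavioural sketch of the 32-to-16 split described above (purely illustrative; the real adapter would be a Migen Wishbone down-converter, and the big-endian ordering here is just an assumption):
```python
# Behavioural sketch of the 32-to-16 adapter idea: one 32-bit CPU access is
# issued to the 16-bit controller bus as two consecutive 16-bit accesses.
# This models the data path only; handshaking/sel handling is omitted.

def cpu_write32(bus16_write, address32, data32):
    # assumed big-endian split: high half first, at the even 16-bit address
    bus16_write(2 * address32,     (data32 >> 16) & 0xFFFF)
    bus16_write(2 * address32 + 1, data32 & 0xFFFF)

def cpu_read32(bus16_read, address32):
    hi = bus16_read(2 * address32)
    lo = bus16_read(2 * address32 + 1)
    return (hi << 16) | lo

# tiny fake 16-bit memory standing in for the SDRAM controller's WB port
mem16 = {}
cpu_write32(mem16.__setitem__, 0x10, 0xDEADBEEF)
assert cpu_read32(mem16.__getitem__, 0x10) == 0xDEADBEEF
```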
<_florent_>
hi, yes I remember I was doing such an adaptation, but now it's integrated into Wishbone2LASMI
<_florent_>
but the code for the adaptation is probably that one:
<_florent_>
or maybe you can use Wishbone2LASMI with a minimal cache
<sb0>
nah, the cache is useless here
<_florent_>
it's not possible to have only 1 WB word in the cache?
<sb0>
one thing which will be nice for ddr3 (with 512-bit accesses) is WB burst support in the converter
<sb0>
and then we can call a factory function in gensoc that selects the correct converter based on the CPU and memory wishbone buses
<sb0>
either downconverter, or the future converter which does 32 -> 512 with burst support
<_florent_>
I have a version of Converter in my fork that supports packets (sop, eop), it can probably be used for bursts, we will just have to increase the address at each new data word of the packet
<sb0>
wishbone bursts?
<_florent_>
no, on dataflow, but we can probably have some logic in common, or use the same low-level modules
<_florent_>
what would be even better is to provide Wishbone support only for compatibility, and internally use our own bus inspired by AXI but simplified...
<sb0>
what's wrong with WB?
<larsc>
_florent_: how do you simplify AXI?
<_florent_>
AXI buses seem easier to pipeline and to use because address/data use separate channels
<_florent_>
maybe not simplify, but use the minimal version and maybe not support everything
<larsc>
most of the advanced stuff is optional in the spec anyway I think
<_florent_>
ok
<larsc>
I'm a big fan of AXI, so much simpler than the other stuff
<_florent_>
I've not used AXI a lot, but yes it seems simple, at least when you want to pipeline things
<sb0>
by separate channels, you mean e.g. for reads the address is acked, and then the target core makes a request to return data?
<larsc>
the basic protocol for the channels is like the ack/stb in migen
<larsc>
for reading you have 2 channels, one request channel and one response channel
<_florent_>
yes, similar to LASMI for a read, for example, now that you have split ack for command and data
<larsc>
for writing you have 3 channels, one request, one data and one response channel
<sb0>
LASMI arbitration is a bitch, tho
<larsc>
the channels can be, but don't have to be asynchronous
<sb0>
this scheme makes it a lot more complex than WB arbitration
<larsc>
so you can send e.g. 3 read requests before you get data back
<larsc>
which is great for pipelining
<sb0>
yes. but then the arbiter has to know that 3 data items have to come back before it can switch to another initiator...
<larsc>
typically each master has a unique ID and the arbiter just passes the IDs along and when a response comes back uses it to know where to send the response
<larsc>
AXI is pretty much a simple NOC
<sb0>
so you also need an arbiter in the other direction, I guess?
<sb0>
since several data items can come back for the same initiator at the same time
<larsc>
yes
<larsc>
although I think out-of-order replies have been dropped from the spec in AXI4
<sb0>
well if you send requests to several targets with different latencies, then out-of-order replies might still happen
<larsc>
I think the interconnect has to make sure that this does not happen
<larsc>
the easy solution is to block parallel requests to different slaves
<larsc>
a single master can also choose to have multiple IDs; in this case it is still possible that requests for different IDs are completed out of order
<ysionneau>
sounds like a bit of a mess
<larsc>
no, from a system point of view it just looks like separate masters
<larsc>
it's like a network card with multiple IP addresses
<ysionneau>
ah indeed
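A toy model of the ID mechanism larsc describes (the master/slave names are invented): the interconnect carries the issuing master's ID with each read request and uses the ID returned with the response to route the data back; responses come back in request order, matching the AXI4 in-order note above.
```python
# Toy model of ID-based response routing on a split request/response read path.
# Master/slave names are invented for illustration.
from collections import deque

class Interconnect:
    def __init__(self):
        self.pending = deque()   # (master_id, slave, address) awaiting completion

    def read_request(self, master_id, slave, address):
        # the ID travels with the request; several can be outstanding at once
        self.pending.append((master_id, slave, address))

    def deliver_next_response(self, masters):
        master_id, slave, address = self.pending.popleft()
        data = slave(address)
        # the returned ID tells the interconnect where to send the response
        masters[master_id].append((address, data))

rom = {0x0: 0x11, 0x4: 0x22}
masters = {"cpu": [], "dma": []}
ic = Interconnect()
ic.read_request("cpu", rom.get, 0x0)   # two requests in flight
ic.read_request("dma", rom.get, 0x4)   # before any data has come back
ic.deliver_next_response(masters)
ic.deliver_next_response(masters)
assert masters["cpu"] == [(0x0, 0x11)] and masters["dma"] == [(0x4, 0x22)]
```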
<_florent_>
at least there is a fact:
<_florent_>
- you read the AXI specification, everything seems clear and simple.
<_florent_>
- you read the Wishbone specification, you don't understand anything...
<_florent_>
:)
<ysionneau>
fair enough =)