Topic for #milkymist is now Radical Tech Coalition :: Milkymist One, Migen, Milkymist SoC & Flickernoise :: Logs: http://en.qi-hardware.com/mmlogs
kristianpaul [kristianpaul!~kristianp@2001:0:53aa:64c:2430:4cb2:415a:74c5] has joined #milkymist
kristianpaul [kristianpaul!~kristianp@unaffiliated/kristianpaul] has joined #milkymist
kristianpaul [kristianpaul!~kristianp@unaffiliated/kristianpaul] has joined #milkymist
togi [togi!~yur@c83-250-142-73.bredband.comhem.se] has joined #milkymist
togi [togi!~yur@c83-250-142-73.bredband.comhem.se] has joined #milkymist
cwoodall [cwoodall!~cwoodall@comm575-0301-dhcp011.bu.edu] has joined #milkymist
xiangfu [xiangfu!~xiangfu@fidelio.qi-hardware.com] has joined #milkymist
elldekaa [elldekaa!~hyviquel@abo-168-129-68.bdx.modulonet.fr] has joined #milkymist
mumptai [mumptai!~calle@brmn-4db70ebb.pool.mediaWays.net] has joined #milkymist
togi [togi!~yur@c83-250-142-73.bredband.comhem.se] has joined #milkymist
togi [togi!~yur@c83-250-142-73.bredband.comhem.se] has joined #milkymist
<lekernel> larsc: sequential(1) means you accept data on every second cycle, and process it on the other. sequential(0) is what you mean.
<lekernel> so yes, pipeline(0) = sequential(0) ... but to avoid this, having zero latency is not permitted for pipeline/sequential and one should use combinatorial instead
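A plain-Python sketch of the token-acceptance behaviour implied above (an illustration only, not Migen code; the accept_cycles helper is invented for the example): pipeline(N) can take a new token every cycle, while sequential(N) accepts a token, processes it for N cycles, and only then accepts the next, so sequential(1) takes data on every second cycle.

    def accept_cycles(model, n, tokens, cycles=12):
        """Return the cycle at which each token is accepted, assuming the
        upstream always has data and the downstream always accepts."""
        accepted = []
        busy_until = -1              # last cycle still spent processing
        for cycle in range(cycles):
            if len(accepted) == tokens:
                break
            if model == "pipeline" or cycle > busy_until:
                accepted.append(cycle)
                if model == "sequential":
                    busy_until = cycle + n   # process for N cycles after accepting
        return accepted

    print(accept_cycles("pipeline",   1, 4))   # [0, 1, 2, 3] - one token per cycle
    print(accept_cycles("sequential", 1, 4))   # [0, 2, 4, 6] - every second cycle
    print(accept_cycles("sequential", 3, 3))   # [0, 4, 8]    - one token per N+1 cycles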
<lekernel> wpwrak: not only is it more complicated, but it doesn't make sense, because the base Actor class already includes the control logic for you when you use one of the predefined scheduling models
<lekernel> so removing scheduling models would increase code redundancy, not reduce it
<lekernel> larsc: having both properties could be an option, yes.
<wpwrak> yeah, i was thinking that you may be able to determine what interface and thus scheduling model you need from the functionality
<wpwrak> but yes, it would be harder for the computer
<lekernel> larsc: can every graph of sequential/pipelined actors be described by these two numbers? hmm...
<lekernel> if so, it'd be very interesting to use them
<lekernel> but there will be a problem if the graph has several input ports, which can accept tokens independently
<wpwrak> if you have out = f(inA, inB), would the choice between backpressure (if inX is ready before inY, you stall it until inY is ready) and internal merging (e.g., accept inX and buffer it, perform the operation when inY is ready) be expressed by varying pipeline(N) ?
<larsc> you actually already have a problem if a pipelined actor is before a sequential actor. for example if your data arrives in bursts you'd probably want to fill the pipeline and then stall
<wpwrak> you'd also need a means for expressing if inX and inY are synchronized and always arrive at the same time
<wpwrak> and of course, if inX and inY are bursty, you'd need to be able to specify the burst size
<wpwrak> well, burst = 0 could be backpressure, burst = 1 could be buffer-until-the-other(s)-is/are-ready, etc.
<lekernel> wpwrak: synchronized inX and inY mean combining them into a single token
<wpwrak> ah, nice
<lekernel> the token data type is a tree of integers (i.e. a record, which can include other records as members)
<wpwrak> if you have unsynchronized data paths that travel together (e.g., VGA + DDC), can you still express them as some sort of compound ? or is that too exotic
<larsc> well, you can always split them
<larsc> vga + ddc would describe the physical signal
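As a concrete (made-up) illustration of the token layout described above: a token is a record, i.e. a tree whose leaves are integers; two synchronized inputs become members of a single token, while streams that are not synchronized (VGA pixels vs. DDC bytes) stay separate tokens on separate edges.

    # field names are invented for the example
    pixel_token = {
        "r": 255, "g": 128, "b": 0,        # leaves are plain integers
        "coord": {"x": 12, "y": 7},        # records may nest other records
    }
    combined_token = {"inA": 3, "inB": 4}  # synchronized inA/inB merged into one token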
<lekernel> larsc: we can abandon that property, and make pipelines stall the input whenever the output is stalled, no matter what's in them
<lekernel> actually it uses a tad less resources to do it this way
<larsc> lekernel: yes
<larsc> we can, but depending on the workload it might be suboptimal
<larsc> and i think the latency properties do not apply to the whole graph, but rather to a path in the graph
azonenberg [azonenberg!~azonenber@cpe-67-246-33-188.nycap.res.rr.com] has joined #milkymist
<larsc> signals which are synchronized lie on the same edge
<wpwrak> how would write-back memory be modeled ? (non-blocking interface) -- (FIFO) -- (blocking interface) -- (bus access & arbiter) ?
<wpwrak> (assuming the FIFO is sized for the worst case)
<larsc> wpwrak: that's probably dynamic
<larsc> but you can give an upper and a lower bound i guess
<wpwrak> data-dependent delays could also get interesting. e.g., for SIMD units.
<wpwrak> larsc: i was wondering more about the interfaces. properly dimensioning the FIFO is another problem. not necessarily a trivial one, of course ;-)
<larsc> wpwrak: which interfaces?
<larsc> you'd have the same handshake signals as everywhere else i guess
<wpwrak> between the processing elements
<wpwrak> hmm, but you don't always handshake, do you ?
<larsc> the idea of migen flow is to remove the handshakes if we know from the scheduling model they are not necessary
<wpwrak> yup. so how would that write-back memory interface work ?
<wpwrak> or, in general, any FIFO with such properties
<wpwrak> i.e., it needs backpressure at the output but not at the input. it needs a data strobe on both ends.
elldekaa [elldekaa!~hyviquel@abo-168-129-68.bdx.modulonet.fr] has joined #milkymist
<larsc> yes
<larsc> but i still don't get what the question is
<lekernel> wpwrak: then that FIFO would always assert its ack signal
<lekernel> you can't have non-blocking interfaces, only blocking interfaces that may not block in practice
<wpwrak> so migen wouldn't see a difference ?
<lekernel> no
<wpwrak> i see
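A behavioural sketch of that point (plain Python, not what Migen generates): the FIFO in front of the write-back port still presents a blocking stb/ack interface, but its ack is tied high, so it never actually blocks as long as it is sized for the worst case.

    class AlwaysReadyFifo:
        def __init__(self, depth):
            self.depth = depth
            self.data = []

        # write side: a blocking interface that never blocks in practice
        def write(self, stb, token):
            ack = 1                      # always asserted
            if stb:
                assert len(self.data) < self.depth, "worst-case sizing violated"
                self.data.append(token)
            return ack

        # read side: ordinary handshake with backpressure from the bus arbiter
        def read(self, downstream_ack):
            stb = 1 if self.data else 0
            token = self.data[0] if self.data else None
            if stb and downstream_ack:
                self.data.pop(0)
            return stb, token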
<lekernel> larsc: and yes, running the pipelines this way is suboptimal. but using verilog and automatic PAR is suboptimal compared to programming LUTs and interconnect manually, too.
<wpwrak> maybe the verilog synthesizer can figure it out when it sees that the ACK is always set :)
<larsc> it will
<larsc> or at least should
<lekernel> wpwrak: in some cases it does, but it won't model the speed of data in and out of the fifo
<larsc> ;)
<wpwrak> you mean the traffic profiles = the FIFO size ? yeah, that would be asking for a bit too much :)
wolfspraul [wolfspraul!~wolfsprau@p5B0ACF87.dip.t-dialin.net] has joined #milkymist
<wpwrak> there are a few nice bookshelves full of queuing theory and descendant fields of science ... that nobody should be forced to read ;-)
<wpwrak> in practice, you do everything with simulation anyway (coming from the networking side)
kilae [kilae!~chatzilla@catv-161-018.tbwil.ch] has joined #milkymist
<larsc> lekernel: if our graph can accept inputs independently there has to be an actor somewhere in the graph which can accept input independently and thus has dynamic scheduling
<larsc> (given the graph is connected)
<larsc> i wonder if this is minimum cut/maximum flow
<larsc> lekernel: consider this: http://metafoo.de/flow.png edges are actors, circles are sources/sinks. the strobe signal of sink1 still depends on all the sources' strobe signals
<larsc> yes, i think this is a min cut/max flow problem. if you have a graph you can calculate both delays over the whole graph. and if you create handshaking signals based on these you can remove all handshaking inside the graph itself
<larsc> to do this you'd insert supernode synchronizers before the inputs and after the outputs
<larsc> and then you'd implement the handshaking between the supernode input and the supernode output
<larsc> and the grey circles are the supernode stuff
<larsc> i wonder if the graph has to be a DAC for this to work
<larsc> i think you can even calculate the maximum flow for each of the delays for a graph independently and it will still work
<wolfspraul> wpwrak: since you did led color comparisons, I just ran into a remark from Casio in 2010, where they say that for some high-end projectors, they produce component colors in the following way "red by red led, blue by a blue laser and green converted by phosphor from a blue laser"
<wpwrak> hehe :)
<wpwrak> the green leds in M1 are already a lot better than the "super-bright" green leds i have (selected some 5-6 years ago). but still, in comparison to red, green sucks.
<wpwrak> at least, at ~10 mA and 100% duty cycle, they manage to outshine my red LEDs (also 5-6 years old) running at ~6 mA, 15%
gamambel [gamambel!~torsrvrs@anonymizer1.torservers.net] has joined #milkymist
<larsc> so to optimize the handshaking of any graph you'd do:
<larsc> remove all actors with dynamic scheduling and then calculate the connected components.
<larsc> for each component check whether it is a DAC or not. If it is not a DAC calculate the minimum set of edges that have to be removed in order for it to be a DAG.
<larsc> Remove these edges from the graph, the nodes which have been connected by a removed edge will need handshake signals, handle them like other sink or source nodes.
<larsc> Then for each component calculate the total delays. Based on these delays generate the handshake signals, these handshake signals will be used for all sinks and all sources of the component.
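A rough sketch of that procedure in Python, using networkx as an outside helper (the node attribute names 'dynamic', 'pipeline' and 'sequential' are invented, and the cycle-breaking step is done greedily rather than computing a true minimum edge set):

    import networkx as nx

    def plan_handshaking(g):
        """g: DiGraph of actors with 'dynamic', 'pipeline' and 'sequential'
        node attributes. Returns one handshaking plan per static component."""
        # 1. drop dynamically scheduled actors; they keep their own handshakes
        static = g.subgraph(n for n, d in g.nodes(data=True) if not d.get("dynamic"))
        plans = []
        # 2. handle each connected component on its own
        for nodes in nx.connected_components(static.to_undirected()):
            comp = static.subgraph(nodes).copy()
            # 3. if it is not a DAG, cut feedback edges; the actors on a cut
            #    edge keep explicit handshakes, like other sources/sinks
            feedback_edges = []
            while not nx.is_directed_acyclic_graph(comp):
                u, v = nx.find_cycle(comp)[0][:2]
                comp.remove_edge(u, v)
                feedback_edges.append((u, v))
            # 4. total delay of the component = longest node-weighted path,
            #    computed independently for each kind of delay
            def longest(attr):
                best = {}
                for n in nx.topological_sort(comp):
                    preds = [best[p] for p, _ in comp.in_edges(n)]
                    best[n] = comp.nodes[n].get(attr, 0) + max(preds, default=0)
                return max(best.values(), default=0)
            plans.append({"actors": set(nodes), "feedback_edges": feedback_edges,
                          "pipeline_delay": longest("pipeline"),
                          "sequential_delay": longest("sequential")})
        return plans

The handshake signals for all sources and sinks of a component would then be generated from that component's two totals, as described above.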
<lekernel> DAC?
<lekernel> I know DAG, but not DAC
<larsc> yeah, I meant DAG
<lekernel> and yes... I had some ideas about removing handshaking inside graphs, but I have not developed them. your method seems correct and interesting :)
<lekernel> but I think dealing with DAGs only is sufficient, no?
<lekernel> if you have a cyclic graph with only seq/pipeline actors, it will never terminate
<larsc> if it also depends on external input it will. e.g. an LFSR
<lekernel> so, it's not really useful in practice imho
<lekernel> hmm...
<lekernel> then there's another problem. all actors also have a "busy" signal that they should assert when they have data that shouldn't be lost in any of their registers
<lekernel> this is useful to signal the completion of a hardware accelerator to a CPU, for example
<lekernel> if we have a never-ending dataflow system, the global "busy" signal will stay asserted and this doesn't make much sense
<lekernel> (global busy = OR of all actors' busy signals with the current design)
<lekernel> for the LFSR I'd implement it in one actor too :)
<larsc> i don't see a problem with this right now
<larsc> the lfsr is non busy if there is no new input
<lekernel> but once there has been some input, it will be busy trying to send tokens to itself
<lekernel> and in the case of a pure sequential or pipelined actor, this will never end
<larsc> we might just need to say that a feedback connection does not cause busy to be asserted
<lekernel> or simply avoid those complexities entirely, and impose that all sub-graphs made of seq/pipe actors must be acyclic?
<lekernel> are they really useful?
<lekernel> seems we need a dynamically scheduled actor in the feedback path anyway, to provide the first value(s)
<larsc> maybe. i don't know. but given the current definition of busy in actor.py I still don't see why busy should be asserted permanently if there is a cycle
<lekernel> busy is asserted when attempting to send a token which is stalled
<larsc> i think we can start without support for cycles
<lekernel> until the end of the stall
<larsc> but it will only transition from non-busy to busy if all inputs are pending
<lekernel> yes, but if one of its inputs is connected to its output, it will be busy because it tries to send data to itself, and cannot receive it until the other inputs also have data
<larsc> hm
<lekernel> wpwrak: regarding those laser phosphors, I wonder if they can be used to make EPR photon pairs. the crystals normally used for this are crap expensive, e.g. http://www.eksmaoptics.com/en/p/beta-barium-borate-bbo-crystals-298
<larsc> lekernel: is a counter really cheaper than a one-hot shift register for small numbers? (referring to the sequential control fragment)
<lekernel> depends how small
<lekernel> but I haven't done the math (which should be straightforward enough though)
<lekernel> it's just a small detail
<larsc> i was just thinking about the control fragment for an actor with sequential and pipeline delay. you'd just use a shift register like for the pipeline control fragment with the length being pipeline delay + sequential delay.
<larsc> and ack_o would only be asserted if the lower n entries are zero
<larsc> where n is the sequential delay
<larsc> another approach is to just stick the sequential logic in front of the pipeline logic and always add a new entry to the shift register when the timer triggers
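A toy model of the first variant above (plain Python, not generated FHDL; the orientation of the shift register and the exact meaning of "lower n entries" are guesses at the intent): one shift register of length pipeline delay + sequential delay, with ack_o asserted only while the lowest sequential-delay entries are empty.

    def control_fragment(pipeline_delay, sequential_delay, stb_i_stream):
        length = pipeline_delay + sequential_delay
        shreg = [0] * length
        trace = []
        for stb_i in stb_i_stream:
            # new tokens are accepted only while the "sequential" slots are empty
            ack_o = int(all(b == 0 for b in shreg[:sequential_delay]))
            accept = stb_i & ack_o
            stb_o = shreg[-1]                 # token reaching the output
            trace.append((stb_i, ack_o, stb_o))
            shreg = [accept] + shreg[:-1]     # shift by one position every cycle
        return trace

    # with input always valid, pipeline_delay=2, sequential_delay=2:
    # ack_o drops for 2 cycles after each accepted token, and the corresponding
    # stb_o appears 4 cycles after the token was accepted
    print(control_fragment(2, 2, [1] * 10))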
<lekernel> you can use a LFSR instead of the counter. even faster due to no carry propagation (though on a FPGA, the chains are pretty fast)...
<lekernel> there are lots of possible optimizations, but the counter is simple to write/understand and good enough imo
<lekernel> if it turns out that those counters are using more than 0.5% of the FPGA resources or cause timing problems in a design, then I'll pay them more attention
<lekernel> but I believe they won't
<larsc> lekernel: hm, I don't quite understand what Cat() does
<lekernel> the FHDL thing?
<lekernel> it simply concatenates bit vectors together
<lekernel> Cat(a, b, c) ==> {c, b, a} in Verilog
<larsc> hm, ok
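In other words, the first argument of Cat() ends up in the least significant bits. A plain-Python illustration of just that bit ordering (not the FHDL implementation):

    def cat(*sigs):
        """sigs: (value, width) pairs; the first pair lands in the low bits."""
        out, shift = 0, 0
        for value, width in sigs:
            out |= (value & ((1 << width) - 1)) << shift
            shift += width
        return out

    a, b, c = 0b1, 0b10, 0b011                        # widths 1, 2 and 3
    assert cat((a, 1), (b, 2), (c, 3)) == 0b011_10_1  # same bits as {c, b, a} in Verilog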
Artyom [Artyom!~chatzilla@84.23.62.191] has joined #milkymist
<wpwrak> lekernel: (photon pairs) hmm, maybe. but the photons would be at very different energy levels.
<lekernel> hm?
<lekernel> as I understand it, those things take one blue photon and turn it into two green photons
<lekernel> or are there more complicated things, e.g. blue to green + IR?
<wpwrak> i think such things take one blue and emit one green plus one IR (or dissipate the energy in some other way, e.g., mechanical excitation)
<wpwrak> yeah
<wpwrak> (somewhat different chemistry, though)
<lekernel> so I guess I should grow my own BBO crystals if I want cheap EPR pairs
<wpwrak> yeah. have a little crystal garden in the backyard ;-)
mumptai [mumptai!~calle@brmn-4db70ebb.pool.mediaWays.net] has joined #milkymist
cwoodall [cwoodall!~cwoodall@comm575-0301-dhcp011.bu.edu] has joined #milkymist
<larsc> lekernel: the sequential control logic takes N+1 cycles, is this a bug or a feature?
<lekernel> hmm, not sure what you mean
<lekernel> what it should do, when input tokens are always available and output tokens always accepted, is accept/send tokens in one cycle, then process for N cycles
<lekernel> if you include communication then yes, N+1 is a feature, not a bug
<lekernel> but I have not simulated this code yet, so you can probably easily find problems there
<larsc> ok, but the one is actually pipeline latency, isn't it?
<lekernel> in some way, yes, it's a 'pipeline'
<lekernel> you want to switch to the X pipeline stages + Y sequential cycles model?
<larsc> i think it is so much nicer to work with
sh4rm4 [sh4rm4!~sh4rm@gateway/tor-sasl/sh4rm4] has joined #milkymist
cwoodall [cwoodall!~cwoodall@comm575-0301-dhcp011.bu.edu] has joined #milkymist
<Fallenou> a draft/sum up of mailing list/bunch of ideas/questions about MMU : http://piratepad.net/RSE6AWxIIa
<Fallenou> I'm not sure I have understood everything that has been said on the mailing list about the MMU so feel free to correct anything I got wrong and/or to add more accurate thoughts
<Fallenou> One problem with the described design is exposed at the bottom, I don't know the solution so if you know one :)
<wpwrak> kewl. that piratepad crashes konqueror
<Fallenou> oops, maybe I'd better e-mail this
<lekernel> If two processes are mapping the same PA with two different VA : we will have Cache coherency troubles.
<lekernel> How can we deal with this ?
<Fallenou> I'm not sure whether there was an answer to this problem or not on the mailing list
<lekernel> the TLB only affects the upper parts of the address... and if the lower part is sufficient to address the whole cache, the two mappings will end up at the same place in the cache
<lekernel> use just a software-managed TLB
<Fallenou> ok, so no hardwired page tree walker
<lekernel> yes
<lekernel> just add special instructions that modify the TLB
<Fallenou> yep
<lekernel> and regarding cache aliases, there are no problems even with context switches, if you use virtually indexed physically tagged caches with page and cache sizes appropriately chosen
<Fallenou> this part of the mail exchange I didn't understand : about the aliases problems going away if page and cache sizes appropriately chosen
<lekernel> In 1 clock cycle the TLB answers its value which is the 20-bit physical pfn + a few rights bits like READ/WRITE/EXECUTE => nope
<lekernel> you have separate data and instruction caches
<wpwrak> i don't think the VA1 != VA2 but PA1 == PA2 case goes away easily (assuming the page is used r/w)
<lekernel> so you'll have separate data TLB and instruction TLB. then e.g. non-executable pages are simply not loaded into the instruction TLB
<Fallenou> oh ok
<Fallenou> I thought it was possible to cope with only 1 TLB accessed by both Instruction and Memory stages
<Fallenou> to save a bit of on-chip ram
<lekernel> it is certainly possible, but probably more complicated
<lekernel> it'd save on-chip RAM only if some pages are both data and code and therefore can be shared among the two TLBs, which rarely happens imo
<Fallenou> I guess sometimes it needs to happen
<Fallenou> (injecting code via load&stores and jumping on it)
<lekernel> this only happens at program startup or shared library loads
<lekernel> (afaik)
<Fallenou> maybe modules loading
<Fallenou> btw , do you want virtual kernel addresses or physical ones ?
<lekernel> yes, but that too isn't something that should be optimized for (especially since I'm not sure this would actually provide a noticeable speed gain)
<lekernel> I think the simplest is to disable address translation in kernel mode
<Fallenou> ok so we just flush Caches and TLBs when doing weird things
<lekernel> and the CPU also boots in kernel mode, so it's backward compatible
<Fallenou> yep ok
<lekernel> you can use the LM32 "software exception" instruction to implement syscalls imo
<lekernel> and exceptions should also put the CPU in kernel mode
<Fallenou> there is system call exception
<Fallenou> scall instruction
<lekernel> ah, it's even called "system call" :) maybe they had something in mind
<Fallenou> hehe yep =)
<lekernel> wpwrak: I think page size = cache size solves a lot of problems
<Fallenou> great it's the case at the moment :'
<Fallenou> but I still don't see why
<Fallenou> maybe I need a drawing
<lekernel> Fallenou: btw the caches are two-way set-associative
<Fallenou> oh really ?
<Fallenou> my milkymist sources are not up to date I guess
<Fallenou> oh or maybe associativity = 1 means two-way
<Fallenou> 0..1
<Fallenou> I read the generate part too fast I guess
<lekernel> this isn't very complicated, you can just picture this as two direct-mapped caches running in parallel, plus a replacement policy that refills only one of them on a miss
<Fallenou> hummm weird
<lekernel> hm no, you're right
<Fallenou> ok
<lekernel> some mistake... it should be 2-way
<lekernel> I think I changed this to work around some spartan6 xst problems and then forgot to restore it
<lekernel> anyway, the TLB should work for 1-way and 2-way (and don't be afraid, it's not hard - you only need a second physical tag comparator for the second way)
<Fallenou> yes it does not seem too hard
<Fallenou> just a multiplexer
<lekernel> at each memory access you have 3 lookups: one in the TLB, one in the first way of the cache, and one in the second way
<Fallenou> then I compare two tags from two ways
<lekernel> one pipeline stage later, you compare the PA from the TLB to the tag in each way of the cache
<Fallenou> with the TLB output
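A behavioural sketch of that lookup (plain Python; the 4 KiB pages come from the thread, while the 32-byte line size and the dictionary-based TLB/way models are assumptions of the example). The TLB lookup and the two way lookups use only untranslated address bits, so they can run in the same cycle; the physical-tag comparison against the TLB output happens one pipeline stage later.

    PAGE_BITS = 12                        # 4 KiB pages: VA and PA share the low 12 bits
    LINE_BITS = 5                         # 32-byte cache lines (assumed)
    INDEX_BITS = PAGE_BITS - LINE_BITS    # the index fits inside the page offset

    def cache_lookup(va, tlb, way0, way1):
        """tlb: {vpn: pfn}; wayN: {index: (tag_pfn, line_data)} - toy model."""
        index = (va >> LINE_BITS) & ((1 << INDEX_BITS) - 1)   # untranslated bits only
        vpn = va >> PAGE_BITS
        # same cycle: TLB lookup and both way lookups in parallel
        if vpn not in tlb:
            return "tlb miss"             # would raise the TLB-miss exception
        pfn = tlb[vpn]
        hit0 = way0.get(index)
        hit1 = way1.get(index)
        # next pipeline stage: compare each way's physical tag with the TLB output
        if hit0 is not None and hit0[0] == pfn:
            return hit0[1]
        if hit1 is not None and hit1[0] == pfn:
            return hit1[1]
        return "cache miss"               # refill from memory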
<Fallenou> and I don't have to bother with TLB coherency between instruction and memory stages ?
<Fallenou> software will have to deal with it (flushing i guess) ?
<lekernel> nah, I think the software should take care of this
<Fallenou> ok good
<lekernel> and since address translation is disabled in kernel mode, the kernel can always run even with inconsistent TLB content
<Fallenou> right
<Fallenou> that's good, especially when you run the "tlb miss" exception handler :)
<Fallenou> do you mind if I send an e-mail to ML in order to sum up all of this ? and maybe to ask if someone can give a crystal clear explanation about page size = cache size simplification that puts our problems away ?
<lekernel> no I told you you can ask all the questions you want regarding the MMU topic
<lekernel> see Wesley W. Terpstra | 26 Apr 10:01
<Fallenou> yep I read all this thread actually
kilae_ [kilae_!~chatzilla@catv-161-018.tbwil.ch] has joined #milkymist
<Fallenou> Any access to the shared page will use the same low bits index into that page (12 bits in our example) <=
<Fallenou> I really don't get this in Wesley e-mail :x
<Fallenou> oh or maybe he says the two processes access the same byte of the same physical page
<Fallenou> in this case same byte => same offset in the page => same 12 low bits
<Fallenou> ok :o
DJTachyon [DJTachyon!~Tachyon@ool-43529b4a.dyn.optonline.net] has joined #milkymist
<lekernel> the lower (12?) bits are the same for PAs and VAs
<lekernel> the TLB only translates the upper bits
<Fallenou> yes yes
<Fallenou> I think I begin to understand Wesley e-mail
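A worked example of Wesley's point: with 4 KiB pages the low 12 bits of VA and PA are identical, so if the per-way cache index is taken entirely from those 12 bits (page size = cache size per way), two different VAs that map the same PA necessarily select the same cache line, and the aliasing problem disappears. The addresses below are made up.

    PAGE_SIZE = 4096
    va1 = 0x10000123          # two different virtual addresses...
    va2 = 0x7FFFF123          # ...assumed to map the same physical page

    def line_index(addr, line_size=32):
        return (addr % PAGE_SIZE) // line_size   # index from in-page bits only

    assert va1 != va2
    assert line_index(va1) == line_index(va2)    # same cache line either way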
<larsc> lekernel: hm, one thing i did not consider is that we still need the individual trigger signals.
<lekernel> larsc: yes, but you can use one single FSM to generate those
<lekernel> or just a "timeline"
<larsc> which is just a shift register?
<lekernel> yeah basically, though for the one in corelogic it's a counter and comparators
<larsc> hm
<larsc> i suppose that makes sense
<lekernel> it's not always that simple though... eg if you have one pipelined actor feeding a sequential actor
<larsc> or maybe not since we have to keep track of multiple tokens
<kristianpaul> how do you track a token ? a nice mix of FSMs? :-)
DJTachyon [DJTachyon!~Tachyon@ool-43529b4a.dyn.optonline.net] has joined #milkymist
<lekernel> at the moment, with one piece of control logic per actor
<lekernel> larsc: if there's a lot of actor time-sharing (which isn't implemented at all atm), it definitely makes sense to use a FSM/timeline for control though (then it becomes a bit like the PFPU)
lekernel_ [lekernel_!~lekernel@g225045177.adsl.alicedsl.de] has joined #milkymist
ximian [ximian!juoni@deviate.fi] has joined #milkymist
<Fallenou> lekernel: would you use dual ported ram for the TLB in order to make it easy for lookups to happen AND for any instruction to modify a TLB entry ?
<Fallenou> or maybe use just a single ported ram and then more complex arbitration
<Fallenou> lm32 seems to be using dual ported rams for dcache-data icache-data dcache-tags and icache-tags
<wolfspraul> should we make it possible for an expansion board to supply power to m1?
<larsc> lekernel: time-sharing as in tdm?
<Fallenou> gn
<wolfspraul> n8
<kristianpaul> i think it should not, unless you are talking about a battery pack expansion board for M1 :)
<wolfspraul> solar panel roof :-)
<wolfspraul> nuclear battery board for infinite power
<kristianpaul> he
<kristianpaul> not infinite, even voyager is running out of power i read
<kristianpaul> nasa said no more than 10 years left.. let's read this again by then :)