<whitequark>
sb0: then if you get a NaN or +Inf, the results can be unpredictable
<whitequark>
or -Inf or -0
<whitequark>
moreover (this is a separate flag) LLVM will also do algebraically equivalent transformations such as reassociation, which can dramatically change precision in some cases
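A small illustration of the two hazards described above, in plain Python rather than LLVM IR: identities that hold for real numbers fail for Inf, NaN and -0, and regrouping a sum changes how it rounds. The values are chosen for illustration only and are not taken from the ARTIQ code.

```python
import math

# Identities a fast-math style optimizer may rely on, and where they break.
inf = math.inf
x_minus_x = inf - inf                       # NaN, not 0.0
zero_sum = -0.0 + 0.0                       # +0.0, so "x + 0.0 == x" loses the sign of -0.0
sign_of_sum = math.copysign(1.0, zero_sum)  # 1.0, i.e. the result is +0.0

# Reassociation: the same three terms, grouped differently.
a, b, c = 1e17, -1e17, 1.0
left = (a + b) + c    # the large terms cancel first, then 1.0 is added
right = a + (b + c)   # 1.0 is absorbed into -1e17 before the cancellation

print(x_minus_x, sign_of_sum, left, right)
```

Here `left` is 1.0 and `right` is 0.0, so an algebraically equivalent regrouping changes the answer entirely.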
<whitequark>
but this wins us almost a 2x gain on PulseRateDDS...
<sb0>
how fast is it now?
<whitequark>
182us per batch of two writes
<whitequark>
will be 100us
<sb0>
Unfortunately this isn't feasible because our attribute writeback machinery allows outside code to grab a pointer to any object in the graph, which is exactly what it ought to do
<sb0>
but this writeback code only reads the objects, no?
<whitequark>
no way to tell LLVM that.
<whitequark>
besides it doesn't really do constant *propagation* through globals
<sb0>
and if you don't tell it anything?
<whitequark>
instead it pulls in the entire global and replaces its value
<whitequark>
it doesn't do anything.
<whitequark>
since it assumes there can be writes
<sb0>
yes, but if you get the objects without telling LLVM about it?
<whitequark>
impossible
<whitequark>
if I don't tell LLVM about objects, it will mangle them beyond recognition. in fact there won't *be* any objects
<whitequark>
if all constant fields, including those from setattr_device, are actually marked as constant, then that implementation shouldn't present any problems
<whitequark>
we'll also need to bring in the concept of a fire-and-forget RPC back... hrm
<whitequark>
and, compared to the current situation, it will inflate code size
<sb0>
whitequark, maybe disable writeback and have only explicit host attribute writes. setattr(self, "name", value) as RPC...
<whitequark>
sb0: ok. without attribute writeback that dds batch takes 41us
<whitequark>
still not quite as fast as it could be due to some LLVM FP silliness
<whitequark>
without that silliness it would be 27us
<whitequark>
sb0: if we ported the OR1K backend to LLVM 3.6 then we could keep attribute writeback.
<whitequark>
there are fewer than a dozen changes to the backend interfaces between 3.5 and 3.6 and none of them are functional
<whitequark>
so this should take less than a day
<whitequark>
actually, we could probably go right to 3.9 without much hassle.
<whitequark>
well, 3.8, last released one...
<whitequark>
this also has the advantage that we can use upstream llvmlite
<sb0>
27us for a 2x batch? that's pretty good
<sb0>
whitequark, ok, try going for the llvm upgrade. but do not break 1.0.
<sb0>
whitequark, so the square wave minimum period is still 1.34us?
<sb0>
wasn't it 1us at some point?
<sb0>
whitequark, the time I get per batch of two DDS writes is 300us, not 182
<sb0>
artiq 1.0rc1+46.g1d8b0d4
<rjo>
sb0: ~1.3 µs is what i remember it being for a long time.
<sb0>
rjo, ok.
<sb0>
btw the problem of getting a conservative number of free entries in an async FIFO is interesting, as it can be used to optimize the DRTIO protocol
<sb0>
maintain a local copy of the number of free entries, when it is >0, write blindly, otherwise ask the remote side for entries
<sb0>
underflows and other errors can be detected locally (similarly to how they are now) since there will be time sync with the remote
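The scheme sb0 outlines can be sketched as a credit counter. This is a minimal model, assuming the remote side only ever drains the FIFO between queries, so the local count never overestimates the free space. `RemoteFifo` and `CreditedWriter` are hypothetical names for illustration, not part of any DRTIO code.

```python
class RemoteFifo:
    """Hypothetical stand-in for the FIFO on the remote side."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []

    def push(self, item):
        assert len(self.entries) < self.depth, "remote overflow"
        self.entries.append(item)

    def pop(self):
        return self.entries.pop(0)

    def free_entries(self):
        return self.depth - len(self.entries)


class CreditedWriter:
    """Writes blindly while the local free-entry count is above zero."""

    def __init__(self, remote):
        self.remote = remote
        self.credits = 0   # conservative local copy of the free-entry count
        self.queries = 0   # number of round trips to the remote side

    def write(self, item):
        if self.credits == 0:
            # Out of known-free entries: ask the remote side for more.
            self.credits = self.remote.free_entries()
            self.queries += 1
            if self.credits == 0:
                return False   # remote is full, caller must retry later
        self.remote.push(item)
        self.credits -= 1
        return True


fifo = RemoteFifo(depth=4)
writer = CreditedWriter(fifo)
results = [writer.write(i) for i in range(5)]   # the fifth write finds the FIFO full
fifo.pop()                                      # remote drains one entry
retry = writer.write(99)
print(results, retry, writer.queries)
```

Four writes cost a single query in this model; a round trip is only paid when the local count reaches zero.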
<sb0>
I think DRTIO should be a pretty separate core design, except for RTLink... it won't share much code with the current one
<larsc>
just like an async fifo
<sb0>
yeah, that's the basic idea, but the implementation is very different
<sb0>
rjo, it seems the spinboxes' absolute min/max values can be exceeded when dragging the sliders
<sb0>
and then things go out of sync if you touch the spinboxes and then the sliders
<rjo>
ah yes. sounds possible. could you file a bug so that i don't forget?
<rjo>
sb0: could you add unary minus support to value_bits_sign()? i don't know where to dig to determine the correct behavior in that case.
<rjo>
does a unary minus actually change the signedness?
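One way to reason about rjo's question is pure range arithmetic: negate the extremes of the operand's range and ask which type holds the result. The helpers below are hypothetical and do not reproduce what Migen's `value_bits_sign()` does; they only compute the smallest lossless width.

```python
def bits_for_range(lo, hi):
    """Smallest (nbits, signed) that can hold every integer in [lo, hi]."""
    if lo >= 0:
        return max(hi.bit_length(), 1), False
    n = 1
    # Two's complement n-bit range is [-2**(n-1), 2**(n-1) - 1].
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n, True


def negated_bits_sign(nbits, signed):
    """(nbits, signed) needed to hold -x for any x of the given type."""
    if signed:
        lo, hi = -(1 << (nbits - 1)), (1 << (nbits - 1)) - 1
    else:
        lo, hi = 0, (1 << nbits) - 1
    return bits_for_range(-hi, -lo)


print(negated_bits_sign(8, False))  # unsigned 8-bit operand
print(negated_bits_sign(8, True))   # signed 8-bit operand
```

Under this model, negating an unsigned 8-bit value needs a signed 9-bit result, so the signedness does change. A signed 8-bit operand also needs 9 bits if -(-128) = 128 is to be represented without wrapping.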
<rjo>
sb0: DRTIO: but this fire-and-forget way of doing writes only works for the output and does not work for the other errors that can only be detected at the phy, right?
<sb0>
yes, overflow and busy - same problem as before...
<sb0>
I'll look into that. are you working on the JESD204 signal generator?
<rjo>
a bit. yes. i did some sketches and some math on what it can conceivably do.
<rjo>
from the design i can pretty much reverse engineer what AD does inside the DDSes and why certain things are as they are...
<rjo>
sb0: and there is a nasty bug in the simulator with Mux() and signals wider than one bit as the selector IIRC. but i had worked around it a while ago and i don't remember the details.
<sb0>
rjo, ok, can you file issues for those things?
<sb0>
not too secret, you just need to register an account
<rjo>
the good old phys wiki
<sb0>
I think the AMC standalone mode proposed here doesn't make much sense
<sb0>
where is the power supply going to be? what about protecting the board with an enclosure? where will the extra SFP go, on the already crowded front panel?
<sb0>
rjo, btw, since xilinx had the bright idea to remove the phase detectors from the IOSERDES in 7-series, we might have to halve the max data rate on the backplane
<sb0>
unless we can assume that, once started, the clock/data timing relationship won't vary enough to cause trouble.
<sb0>
might be actually ok
<rjo>
i would have to read up on that xapp again to comment on that.
<rjo>
what speed would be un-halved?
<rjo>
as i see it, amc standalone would basically be a very minimal amc infrastructure. yes: with power supply, potentially enclosure, sfp.
<sb0>
1200Mbps -> 600Mbps between MCH and AMC
<sb0>
per lane
<rjo>
for spartan6 with that quad oversampling, that would be 1060M/4, right?
<sb0>
we can of course use the transceivers there, as Greg suggests, which obviously have a phase detector
<sb0>
and are much faster.
<sb0>
for Spartan6/Oxford hardware, there is a phase detector, so you can run at ~1Gbps
<sb0>
I can take care of this if you want, since I've already done it for HDMI
<rjo>
are the ones on the milldown on transceivers or standard io?
<sb0>
the Spartan-6 IOSERDES (standard IO)
<rjo>
then what is the quad oversampling from that xapp note needed for?
<sb0>
what xapp note?
<sb0>
with the spartan-6 phase detector, there is no oversampling at all
<rjo>
xapp1064. ah. that is indeed 1050M.
<rjo>
no. not that one. that's source synchronous.
<sb0>
maybe the phase detector is not necessary
<sb0>
you can just scan the delays and note if you're able to get a valid data stream, then just go in the middle of the working range
<sb0>
and stay there
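The scan-and-centre idea can be sketched as follows, assuming the scan yields one pass/fail flag per delay tap. `pick_delay_tap` is a hypothetical helper, and the sketch ignores windows that wrap around the end of the tap range.

```python
def pick_delay_tap(link_ok):
    """Return the centre tap of the longest run of working taps, or None."""
    best_start, best_len = None, 0
    start = None
    # A trailing False acts as a sentinel that closes a run ending at the last tap.
    for tap, ok in enumerate(list(link_ok) + [False]):
        if ok and start is None:
            start = tap
        elif not ok and start is not None:
            if tap - start > best_len:
                best_start, best_len = start, tap - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Example scan: taps 3..9 work, taps 12..13 work, everything else fails.
scan = [False] * 3 + [True] * 7 + [False] * 2 + [True] * 2
print(pick_delay_tap(scan))
```

Choosing the middle of the widest window leaves the most margin on both sides if the clock/data relationship drifts after startup.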
<sb0>
this is more likely to work on 7-series, which have calibrated delays
<sb0>
whereas on the s6... the delays are actually a very fast ring oscillator
<sb0>
uncalibrated
<sb0>
FWIW, we don't recalibrate the DDR3 delays, and it seems stable
<rjo>
how do they reconstruct the clock for spartan6 ioserdes?
<sb0>
there is no clock reconstruction possible with the ioserdes
<sb0>
you receive a clock which is phase-locked with the data, but you don't know the phase
<sb0>
...well, if you do 4x oversampling, you can implement a digital PLL that will do some form of clock reconstruction
<sb0>
this is what is used in some 12Mbps USB PHYs
<sb0>
sampling at 48MHz
<sb0>
with the 48MHz asynchronous to the data, and the DPLL fixes it up
<sb0>
but since we have this fancy backplane, we can send the clock to the AMCs, and then the receiver only has to determine the phase, not the complete clock
<rjo>
but in general we don't have the clock.
<sb0>
what do you mean?
<rjo>
for many non-backplane versions there will only be the rx tx pair.
<sb0>
yes, in that case you need a transceiver, or do the slow 4x oversampling + DPLL trick
<rjo>
then what needs to be designed anyway are a) a 4x+DPLL or 7 series transceiver version, b) the one for the milldown spartan6 transceivers.
<sb0>
you can use the IOSERDES for b
<sb0>
and send a clock
<rjo>
and you are saying that c) ioserdes with 7 series for the M-Labs ARTIQ HW is something that should be done as well?
<rjo>
i thought the transceivers were fixed and you can't use those pads as standard logic.
<sb0>
once we have one IOSERDES the other ones are semi-trivial. similar to another IOSERDES RTIO PHY
<sb0>
transceivers have dedicated pads yes, but AFAIK their backplane also has links on regular IO
<rjo>
on that adapter board design it seemed to be very little additional io
<sb0>
the kc705 adapter?
<sb0>
hmm
<rjo>
yes.
<sb0>
I think that in general we should prefer IOSERDES over transceivers. they are less messy, magical, and proprietary, and less of a pain to use
<rjo>
my guess is that in the long run we will want/need/be forced to use transceivers whether we like it or not. and it would be nice to be able to generally fall back to the reconstructed clock and to not worry about speed limitations.
<sb0>
IOSERDES are more portable too
<sb0>
note that using a transceiver reconstructed clock requires an off-chip PLL/VCXO in many cases
<rjo>
isn't the portability already disproven by the removal of the phase detector between 6 and 7 series?
<rjo>
yes but no additional link.
<sb0>
we can simulate the s6 phase detector by using 2x oversampling in the IOSERDES, with minimal code modifications
<sb0>
the first version of my HDMI core did that, because the phase detector is only available on differential IOs that were not possible with my hacky adapter
<rjo>
i don't have the strongest opinion on this. but we will invest a lot into the transceivers anyway.
<sb0>
transceivers are different on each fpga family, and the hundreds of obscure parameters they have change.
<sb0>
in fact, instantiating a transceiver in migen breaks the normal python function call syntax, which is limited to 255 arguments
<sb0>
the workaround is to put them in a dict and use **kwargs
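A minimal sketch of the dict-plus-`**kwargs` workaround. `instantiate` is a hypothetical stand-in for a primitive-instantiation call, and the parameter names are made up. The 255-argument limit applied to explicitly written arguments in CPython before 3.7; unpacking a dict was never subject to it.

```python
def instantiate(name, **params):
    """Hypothetical stand-in that just records what it was given."""
    return name, params


# Build the parameter set as data instead of writing hundreds of keyword arguments.
params = {"p_PARAM_{}".format(i): i for i in range(300)}
params["p_PARAM_0"] = "overridden"   # individual entries are easy to adjust

name, received = instantiate("TRANSCEIVER", **params)
print(name, len(received))
```

Keeping the parameters in a dict also allows per-family defaults to be merged or diffed, which a long literal call does not.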
<rjo>
that hassle seems to be on par with the details of implementing the iodelay interface, ioserdes changes, master/slave pin pair limitations, phase detector intrinsics or unavailability, oversampling, lack of clock reconstruction, speed limitations, etc.
<rjo>
one would hope to manage the arguments in a dict and not as a big instantiation.
<sb0>
iodelay and phase detectors are rather simple things
<sb0>
there are no master/slave pin pair limitations, each differential input has a master and a slave
<rjo>
and then the gearbox, the scrambling/encoding, framing symbols.
<sb0>
yes. better have those items as open source components (which are available in my HDMI core) than obscure transceiver features that _will not_ work and _will_ be a pain in the arse to debug
<sb0>
it's not even that hard or performance-critical, and I wonder why Xilinx has those as hard-blocks
<sb0>
it just makes things more complicated imo
<rjo>
well. i am perfectly fine with abstaining because i have never implemented or used the transceivers myself.
<rjo>
but we should really consider all factors here.
<sb0>
there are valid reasons for using transceivers, but the fact that they contain the data encoding logic is not one of them
<rjo>
isn't 600 Mbit something that we might actually sustainably saturate pretty quickly with our cpu? that would not be smart.
<rjo>
a drtio write might be something like 200 bit. 400 bit for a pulse.
<rjo>
depending on how smart we are with the protocol.
<rjo>
we certainly saturate it with dma very soon.
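A back-of-the-envelope check of rjo's figures, using the 600 Mbps per-lane rate and the rough 200/400 bit sizes quoted above. It ignores line coding and framing overhead, so real numbers would be lower.

```python
LINK_BPS = 600 * 10**6   # per-lane rate from the discussion
BITS_PER_WRITE = 200     # rough size of one DRTIO write
BITS_PER_PULSE = 400     # a pulse is two writes

writes_per_s = LINK_BPS // BITS_PER_WRITE
pulses_per_s = LINK_BPS // BITS_PER_PULSE
ns_per_write = 10**9 / writes_per_s

print(writes_per_s, pulses_per_s, round(ns_per_write, 1))
```

That is about 3 million writes or 1.5 million pulses per second on one lane, i.e. one write every 333 ns or so.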
<rjo>
oh. and i don't even know whether the SFP transceivers work fine at low frequency.
<sb0>
they do
<sb0>
minimum data rate is some hundred Mbps iirc
<sb0>
or, you mean the fiber PHY? this I don't know
<rjo>
yes.
<sb0>
but I'm not suggesting IOSERDES for SFP. the case for transceivers is pretty clear there.
<sb0>
how fast can one SFP go?
<rjo>
an 8b10b 1.25 GBit transceiver needs to work down to something like 125 MHz but i don't know how steep the dc correction edge is below that.
<rjo>
i think 10 GBit on a SFP+ is doable. let me check.
<rjo>
yep.
<sb0>
without fancy (but jittery) signal encoding on the link?
<rjo>
pretty sure.
<rjo>
that is one electrical pair.
<rjo>
one optical wavelength.
<sb0>
how do they modulate the laser that fast? kerr cell?
<rjo>
no. plain vcsel
<sb0>
ok. sounds good
<rjo>
iirc that speed was a pretty hard barrier when they built the first transceivers. they could not get 10 GBit with 8b10b working. that barrier was one reason for 64b66b
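The line-code arithmetic behind these figures can be sketched as below. The function names are hypothetical. The last line is one possible reading of the 125 MHz figure, assuming it comes from the 8b10b maximum run length of 5 identical bits.

```python
def payload_rate(line_rate, data_bits, coded_bits):
    """Payload bit rate carried by a given line rate."""
    return line_rate * data_bits // coded_bits


def line_rate_for(payload, data_bits, coded_bits):
    """Line rate needed to carry a given payload bit rate."""
    return payload * coded_bits // data_bits


gbe_payload = payload_rate(1250 * 10**6, 8, 10)   # 1.25 Gbaud link with 8b10b
rate_8b10b = line_rate_for(10 * 10**9, 8, 10)     # 10 Gbit/s payload over 8b10b
rate_64b66b = line_rate_for(10 * 10**9, 64, 66)   # same payload over 64b66b

# Longest run of 5 identical bits gives a slowest square wave with period 10 bits.
lowest_fundamental = 1250 * 10**6 // (2 * 5)

print(gbe_payload, rate_8b10b, rate_64b66b, lowest_fundamental)
```

Carrying 10 Gbit/s with 8b10b would need 12.5 Gbaud on the wire, while 64b66b needs only 10.3125 Gbaud, which is consistent with the barrier rjo describes.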
<sb0>
we can probably run them at e.g. 6Gbit-ish
<sb0>
may make things simpler in the standalone digital box - we could use a low-end fpga
<rjo>
artix with transceivers for the box?
<sb0>
yes, or maybe spartan6 even
<sb0>
btw, do you know that the 3 smallest artix have the exact same silicon die? the only limitation is on the total LUT/BRAM count that vivado will agree to use
<sb0>
and those are placed anywhere on the chip, so I think that if you rewrite the bitstream header you can run a 55 bitstream on a 15 chip
<rjo>
nice.
<rjo>
but they probably had to do something like that. the artix things seemed really cheap to me and maintaining the entire fab line for two more silicons might not be worth it.
<rjo>
hmm. it would make sense to get a good number for the sustained throughput needed in the pulse shaping wideband rf case. the superconducting labs will probably need a lot.