<GitHub-m-labs>
artiq/master 729ce58 Sebastien Bourdeauducq: sayma: use GTP_CLK1 to clock DRTIO satellite transceiver...
<sb0>
seems we have inter-board sync now! my other Sayma is offline as the replacement resistors for the power regulator haven't arrived yet, so I cannot test the DAC output between two boards, but the logs look good with a single sayma controlled by kasli
<sb0>
hm, or maybe I can hook up a urukul on the kasli and make that generate 9MHz; then without rebooting the kasli I should see constant phase between urukul and sayma
<hartytp>
sb0: is there anything else you want me to look at today?
<hartytp>
998?
<sb0>
hartytp, 998 is annoying but doesn't have a high impact
<hartytp>
ack
<sb0>
hartytp, what about synch? in your log I see this "unable to determine SYSREF margin at FPGA" which should not be there ...
<hartytp>
was curious to see if it's fixed by the vccint issue. maybe a long shot as disabling the sawg didn't help
marmelada has joined #m-labs
<hartytp>
hmm
<hartytp>
need to re-read the code to remind myself what that means
<sb0>
also, DRTIO - siphaser through the tree + SYSREFs. I'm seeing strange non-deterministic behavior, and looking at the clocks would help
<hartytp>
interest in 998 was partially in case it's a clue about some larger issue
<marmelada>
hey, sb0: what happens if I load only gateware and flash is empty?
<hartytp>
so, which order would you like me to prioritize in?
<marmelada>
should I get something on console?
<sb0>
marmelada, no, since the bootloader has to be in flash
<marmelada>
is bootloader loaded by default during flash?
<hartytp>
maybe let's start with the sysref thing, as I have everything set up for that
<sb0>
hartytp, hmm, hard to tell, I'd say sysref
<sb0>
ok
<hartytp>
good
<hartytp>
so, for that I should check the SYSREF input at the AMC (maybe on the back of the AMC<->RTM connector)
<sb0>
marmelada, artiq_flash writes the bootloader by default, yes
<hartytp>
and the GTX_CLK2
<sb0>
hartytp, the FPGA is receiving that SYSREF
<hartytp>
anything else?
<sb0>
otherwise you would not get finite slip counts
<sb0>
hartytp, so the way this works is: there is a fixed delay using the coarse digital + analog delay on FPGA SYSREF
<marmelada>
I asked because after flash/load/loading with platform cable I don't get anything on console (if I load forth it works)
<sb0>
that delay makes sure that SYSREF meets s/h at the FPGA input
<hartytp>
marmelada: I've had that a few times
<hartytp>
in every case, it's been something to do with my builds
<marmelada>
hmm, the same build works on another sayma
<hartytp>
the same binaries? Haven't seen that then
<sb0>
then the "slips" align SYSREF with the RTIO clock and RTIO timestamp counter, by adding periods of the 1.2GHz clock (and when adding those periods, s/h should still be met)
<sb0>
eventually, the firmware modifies the coarse digital+analog delay, and looks at what point SYSREF is no longer sampled the same - this verifies the timing margin and that s/h was met
<sb0>
then sets that delay back to the original, fixed value
<hartytp>
why are you using the slips?
<hartytp>
won't the digital delay do fine?
<sb0>
not enough range
<hartytp>
true
<sb0>
we need to adjust by up to 8 RTIO clock cycles
<hartytp>
ok
<hartytp>
I haven't looked into the slips
<hartytp>
I'd need to read the datasheet for the 7043
<sb0>
since SYSREF is at 1/8 the RTIO frequency
<sb0>
that's also aligning SYSREF with the RTIO timestamp counter - not just the RTIO clock
<hartytp>
do you need to do that?
<hartytp>
isn't it enough to sweep the delay on sysref to ensure that it meets s/h wrt the RTIO clock
<hartytp>
and then do the rest in the FPGA?
<sb0>
how do you align SYSREFs on two different boards otherwise?
<hartytp>
fyi: i don't think I've ever seen a serwb issue with the LVCMOS phy
<hartytp>
with the other GTX clocking arrangement, I now don't see any init errors/crashes or major bugs!
<sb0>
reshuffling samples based on SYSREF is a) more complicated b) less debuggable (you can't just look at the observable SYSREF on the two boards when it doesn't work) c) it increases latency, as you'd need a transmit buffer on the FPGA
<hartytp>
I still think it would have been easier to have done this the other way around and use the FPGA to generate SYSREF
<hartytp>
feed it to the SYNC input of a HMC7044 and you're done
<hartytp>
well, the HMC7044 generates sysref
<hartytp>
but the FPGA generates the sync signal
<sb0>
you could reset the RTIO timecounter - that would work on a single board, but in a DRTIO satellite you don't get to choose the value of the timecounter
<hartytp>
sb0: no, your way makes sense
<hartytp>
I just hadn't thought it through fully and understood all the constraints
<hartytp>
I'm with you now
<sb0>
the current system isn't complicated. it's just affected by a sayma-bug, and driving the sync input could also suffer from this problem (plus requires hardware rework)
<sb0>
the only problem is, while I'm pretty confident the FPGA input can meet s/h for every slip of the 1.2GHz clock, this may not work so well at 2.4GHz and we may want to do something else (e.g. something like DDMTD)
<hartytp>
"that delay makes sure that SYSREF meets s/h at the FPGA input"
<hartytp>
checks that sysref is aligned with the local RTIO time counter
<hartytp>
(i.e. with particular RTIO clock edges)
<sb0>
well 2.4GHz doesn't sound crazy, but I'm not confident and it needs a lot of testing
<hartytp>
right
<sb0>
you can think of this code as being equivalent to the FPGA generating an internal SYSREF (based on the RTIO timecounter), and using that internal SYSREF to sample the 7043 SYSREF
<sb0>
then it reports the result of that sampling to the firmware
<sb0>
it's designed that way to avoid creating another clock domain, with uses more resources, is complicated, has skew problems etc.
<sb0>
*which
<hartytp>
I understand that
<hartytp>
but, I'll be interested to see it working that fast
<sb0>
on the board reboots when it works, it seems pretty stable
<hartytp>
at 600MHz
<sb0>
with nice margins that match what they should be
<sb0>
1200
<hartytp>
true
<sb0>
that code has 1200Mbps margins, it's the DAC SYSREF that has 600Mbps margins
<hartytp>
well, I'm keen to turn the interpolation on soon and try it out at 2.4
<sb0>
right now there's a bug already
<sb0>
let's get Sayma to work at 1.2G first, it's not like that's not a huge time sink already
<hartytp>
sure
<sb0>
unless you think 2.4G results could give us more hints about why it's not working reliably right now
<hartytp>
so, the sampler effectively compares the SYSREF that the HMC7043 generates with one generated by the FPGA (the rtio time counter) both run at f_rtio/16
<sb0>
yes
<sb0>
sorry I said 8 earlier, it's 16
<hartytp>
at boot, we start with the sysref edge aligned with the rtio clock, but not with the rtio time counter
<hartytp>
(those dividers are not synchronised at all)
<sb0>
it's not aligned with the rtio clock - but it meets s/h with the rtio clock
<sb0>
ok, after disabling overclocking, the new machine no longer crashes when running 3 vivado instances at the same time
<hartytp>
we start by slipping the sysref to find the "0" region, where the internal and external sysrefs have different values when sampled by the local RTIO/JESD clock
<hartytp>
then we slip again until we find the 1 region, where they are aligned
<hartytp>
slipping means delaying by 1 cycle of the 1.2GHz clock = 1/8 rtio clock cycle
<hartytp>
so, the sysref may not meet s/h wrt the rtio clock for some of those slip values
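A toy model of the slip scan described above may help. It assumes a 50% duty-cycle SYSREF whose period is 128 slips (16 RTIO cycles of 8 slips each) and an ideal sampler with no s/h violations; `sample` and `align` are hypothetical names, not the firmware's.

```python
# Toy model only: delays are counted in slips (periods of the 1.2GHz clock).
PERIOD = 128  # 16 RTIO cycles x 8 slips per RTIO cycle (assumed from the chat)


def sample(delay):
    """Value of the external SYSREF seen at the internal SYSREF instant.

    The external SYSREF is modelled as high for the first half of its period,
    and `delay` is how far it has been delayed relative to the sampling instant.
    """
    return 1 if (-delay) % PERIOD < PERIOD // 2 else 0


def align(initial_delay):
    """Slip until the sampler reads 0, then slip until it reads 1 again."""
    delay = initial_delay % PERIOD
    slips = 0
    for wanted in (0, 1):
        while sample(delay) != wanted:
            delay = (delay + 1) % PERIOD  # one slip = one more 1.2GHz period
            slips += 1
            if slips > 2 * PERIOD:
                raise RuntimeError("no transition found")
    return delay, slips


# Every starting phase ends on the same final delay in this ideal model.
final_delays = {align(d)[0] for d in range(PERIOD)}
```

In this model the final delay does not depend on where the dividers happened to start, which is the property the scan relies on. A sample that violates s/h would break it by one slip either way.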
X-Scale has joined #m-labs
<sb0>
hartytp, it should meet s/h for all slip values
<hartytp>
what am I missing?
<hartytp>
in the limit of high dac clock frequency, the slip is basically just a continuous phase shift
<sb0>
the window during which the input of the FPGA flip-flop has to remain constant (so that s/h is met) is small, a few dozen ps or so
<hartytp>
si issues etc?
<sb0>
considering that the FPGA IOs are rated for >1200Mbps operation, the window should be small enough for s/h to be met for all slips, even taking into account jitter
<hartytp>
hmmm...I still don't quite get it
<hartytp>
can't you get unlucky and find that the delay required to move from the starting point (which meets s/h) to somewhere that doesn't is exactly a multiple of 833ps?
<sb0>
"meeting s/h" simply means that your signal is never allowed to transition during a small window when the FF samples it (in response to an edge of the clock clocking the FF)
<hartytp>
yes, so you can get unlucky and end up there, can't you?
<sb0>
that scan is done with the fine delay of 25 ps
<travis-ci>
m-labs/smoltcp#1089 (master - 4a8242a : Kai Lüke): The build passed.
<sb0>
then that scan verifies that the hardcoded value is acceptable
<hartytp>
well, I guess it doesn't really matter if the signal fails to meet s/h at some points in the slip scan any way
<sb0>
it matters. then you get +/- 1 slip of error, and correspondingly 833ps of phase error on the DACs
<sb0>
you can think of SYSREF as a 1200Mbps signal
<sb0>
which is decimated by a factor of 16*(1200/150)
<sb0>
it still has to meet 1200Mbps timing at the sampling FF, even though that FF is clocked at 150MHz, and it has a clock enable that decimates by another factor of 16
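The numbers in this part of the discussion can be checked with a few lines of arithmetic. This sketch only restates figures quoted in the chat (150MHz RTIO clock, 1.2GHz DAC clock, SYSREF at f_rtio/16); the variable names are made up for illustration.

```python
# Figures quoted in the conversation; names are illustrative only.
RTIO_HZ = 150_000_000
DAC_CLK_HZ = 1_200_000_000
SYSREF_DIVIDER = 16  # SYSREF runs at f_rtio/16

slip_ps = 10**12 / DAC_CLK_HZ             # one slip = one 1.2GHz period, in ps
slips_per_rtio_cycle = DAC_CLK_HZ // RTIO_HZ
decimation = SYSREF_DIVIDER * slips_per_rtio_cycle  # 16*(1200/150)
sysref_hz = RTIO_HZ // SYSREF_DIVIDER

print(f"slip = {slip_ps:.1f} ps, {slips_per_rtio_cycle} slips per RTIO cycle")
print(f"decimation = {decimation}, SYSREF = {sysref_hz / 1e6} MHz")
```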
<hartytp>
so, once we've located the slip where the two sysrefs agree, we do a scan where we reduce the fine phase offset to 0 and look for a point where they disagree again
<sb0>
the fine analog delay is used so that this 1200Mbps timing is met
<sb0>
yes, that's the idea. and this is used only to verify that the 1200Mbps timing is met.
<sb0>
the 7043 coarse slip is just shifting the "1200Mbps bit pattern", and doesn't alter s/h
<hartytp>
so, the error I'm getting indicates that no matter what we do to the analog phase after the slip scan, we can't get it to disagree with our internal sysref timer
<sb0>
yes
<hartytp>
"the 7043 coarse slip is just shifting the "1200Mbps bit pattern", and doesn't alter s/h"
<hartytp>
sb0: okay, I take your point. By meeting s/h wrt the 150MHz clock, we're arguing that it should meet s/h w.r.t. a 1.2GHz clock at the fpga with the same phase, so if we shift by an integer number of periods of that 1.2GHz clock we will still meet s/h
<hartytp>
I buy that
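The argument accepted here is modular arithmetic: adding whole periods of the 1.2GHz clock to the SYSREF edge leaves its position inside a 1.2GHz period unchanged, so its position relative to the s/h window does not move. A small check with exact fractions, using a made-up edge position of 100 ps:

```python
from fractions import Fraction

# One period of the 1.2GHz clock in picoseconds (exactly 2500/3 ps).
T_SLIP = Fraction(10**12, 1_200_000_000)


def phase_within_slip(edge_time_ps):
    """Position of a SYSREF edge inside one 1.2GHz period."""
    return Fraction(edge_time_ps) % T_SLIP


base_edge = Fraction(100)  # hypothetical starting edge position, in ps

# Coarse slips: whole periods added, phase inside the period is unchanged.
slipped = [phase_within_slip(base_edge + k * T_SLIP) for k in range(128)]

# Fine analog delay: a 25 ps step does move the edge inside the period.
fine_stepped = phase_within_slip(base_edge + 25)
```

Only the fine delay can move the edge towards or away from the sampling window, which is why the margin scan uses it rather than the slips.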
<hartytp>
still, I'm not quite sure how to go about debugging this
<sb0>
I'm quite puzzled by that error as well, maybe try increasing the amount of fine delay that you have
<sb0>
so that the firmware has more leeway to reduce it until the sysref sample "disagrees"
<sb0>
maybe that will give a clue
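The margin check under discussion can be sketched as a downward scan of the fine delay. This is an illustration of the idea only, not the firmware code: `sample_at` stands in for "program this phase value and read the sampler", and returning `None` corresponds to the "unable to determine SYSREF margin at FPGA" case.

```python
def measure_margin(sample_at, start_phase):
    """Reduce the phase from start_phase towards 0 until the sample changes.

    Returns the number of steps to the first disagreement, or None if the
    sample never changes over the whole scan.
    """
    reference = sample_at(start_phase)
    for phase in range(start_phase - 1, -1, -1):
        if sample_at(phase) != reference:
            return start_phase - phase
    return None


# Hypothetical board where the sample flips below phase value 40.
margin_ok = measure_margin(lambda p: 1 if p >= 40 else 0, 53)

# Hypothetical board where nothing ever changes, as in the log being debugged.
margin_missing = measure_margin(lambda p: 1, 53)
```

Starting from a larger phase value gives the scan more steps to work with before it reaches 0, which is the point of the suggestion above.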
<hartytp>
where does this come from: SYSREF_PHASE_FPGA = 53?
<hartytp>
do you have a scan to verify that on each board?
<hartytp>
does it matter that we don't change that when switching from gtp_clk1 to gtp_clk2?
<hartytp>
this code breaks if you change the clock frequencies
<sb0>
I know
<hartytp>
that's a bit ugly, could be fixed easily to use the dividers to do that properly
<hartytp>
anyway, worry about that later
<hartytp>
"maybe try increasing the amount of fine delay that you have"
<sb0>
increase SYSREF_PHASE_FPGA, yes
<hartytp>
it's 53 right now
<sb0>
and no, it could not be fixed "easily"
<sb0>
the divider value should be chosen so it gives the smoothest transition when switching between the analog 25ps and digital 1/2 clkin delays
<sb0>
yes, you can crank that up to something like 17*17 iirc
<hartytp>
so, that's 1.5 cycles of the 1.2GHz clock
<sb0>
no
<sb0>
something like 9 cycles iirc
<hartytp>
3 digital delay taps
<sb0>
ah, the current value! i thought you meant the range
<hartytp>
yes, we scan that from 53 to 0, which is a scan of 1.5 cycles of the 1.2GHz clock
<sb0>
yes
<sb0>
so it should transition, but doesn't. sayma-bug.
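The figures in this exchange are consistent with the phase value being split by the divider of 17 into digital taps of half a clkin period plus analog steps of 25 ps. That encoding is inferred from the conversation and is an assumption here, as is clkin being the 1.2GHz clock; the function names are hypothetical.

```python
from fractions import Fraction

DIVIDER = 17                                    # analog steps per digital tap
ANALOG_STEP_PS = 25                             # fine analog delay step
CLKIN_PS = Fraction(10**12, 1_200_000_000)      # assumed: clkin is 1.2GHz
DIGITAL_TAP_PS = CLKIN_PS / 2                   # digital delay is 1/2 clkin


def decompose(phase):
    """Split a phase value into (digital taps, analog steps)."""
    return divmod(phase, DIVIDER)


def delay_ps(phase):
    digital, analog = decompose(phase)
    return digital * DIGITAL_TAP_PS + analog * ANALOG_STEP_PS


def delay_cycles(phase):
    """Delay expressed in periods of the 1.2GHz clock."""
    return delay_ps(phase) / CLKIN_PS


for phase in (53, 17 * 17):
    print(phase, decompose(phase), float(delay_ps(phase)), float(delay_cycles(phase)))
```

Under these assumptions 53 is 3 digital taps plus 2 analog steps, about 1.5 cycles, and 17*17 is about 8.5 cycles, matching the "something like 9 cycles" range quoted above.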
<GitHub-m-labs>
[artiq] jordens commented on issue #1065: The worst case would be with the feedback loop open and R315/R316 placed. But if they are not placed, the resistor protects the FPGA just fine. It fails open after limiting the VCCINT current to <1.2A during its short but fulfilled life. You won't be seeing 12V there. https://github.com/m-labs/artiq/issues/1065#issuecomment-401014511
<sb0>
do you reproducibly get that result by the way?
<hartytp>
yes
<sb0>
it could be something as simple as s/h not being met, then jitter makes it slip too much
<hartytp>
right
<hartytp>
a good starting point might be to do an eye/margin scan
<hartytp>
to check that we do meet s/h before slips
<hartytp>
crash kernel still running since this am
<GitHub-m-labs>
[artiq] sbourdeauducq commented on issue #1065: VCCINT current can be as low as 998mA (see "Quiescent VCCINT supply current" in the datasheet). Then the FPGA could get 2V - probably less than that, since likely the current increases with VCCINT, but that still exceeds the absolute maximum of 1.1V by a large margin. https://github.com/m-labs/artiq/issues/1065#issuecomment-401015646
sb0 has quit [Quit: Leaving]
<hartytp>
sb0: okay, lunch now
<hartytp>
then I'll check that you're actually slipping by the correct amount
<hartytp>
(1 cycle, not two)
<hartytp>
and I'll try increasing the fpga starting phase by a few * 17
<hartytp>
let's see what that does
<GitHub37>
[smoltcp] birkenfeld commented on issue #186: I'm using this branch (rebased on master, which was mostly painless), and it works fine for me. The only thing I changed was to increase the discover timeout since my DHCP server was too slow in sending replies, and the sequence number of the reply never matched the last request. https://github.com/m-labs/smoltcp/pull/186#issuecomment-401018999
rohitksingh_work has quit [Read error: Connection reset by peer]
<GitHub-m-labs>
[artiq] jordens commented on issue #1065: If you are worried the resistor might survive longer than the FPGA does in tests where you can't ensure feedback loop closure, then you should definitely depopulate R507 together with R315 and R316. https://github.com/m-labs/artiq/issues/1065#issuecomment-401024445
<hartytp>
2. there is a potential si issue that your algo isn't robust to (this will be more likely to be an issue at higher frequencies when the slips are smaller so we're more likely to see a blip due to a reflection)
<hartytp>
3. the slips seem to be more than 1 cycle
Gurty has quit [Ping timeout: 256 seconds]
sb000 has joined #m-labs
<sb000>
the algo assumes a correct phase value, and good SI
<sb000>
3. would be a problem
<sb000>
the 7043 ds says the dividers take 'some nanoseconds' to slip
<hartytp>
when we do the scan, the transition occurs at tap 20
<hartytp>
sorry 41
<hartytp>
when we then reset the phase to 20, we see all 1s
<hartytp>
so there is some settling time or hysteresis
<hartytp>
despite the 100us delays I added
<hartytp>
what happens on your board with that patch?
<sb000>
ah yes, this kind of 7043 shenanigans would explain things...
<hartytp>
so, I can fix the phase value on my build trivially, but we need to get to the bottom of these phase issues on the hmc7043
<hartytp>
I'll think about a way to debug this with a scope
<hartytp>
but can you run that scan on your board please?
<sb000>
I was using the 7043 delay due to the problems with the ultrascale delay, but it seems it's another can of worms
<sb000>
I wish we had used a k7 where the delays just work ...
<hartytp>
sb000: well, like all these things, I'm sure it's obvious when one reads the data sheet again after figuring out what the problem was
<hartytp>
probably some register is not set correctly
<hartytp>
anyway, let's see if it's the same on all boards or not
<sb000>
registers that are not there cannot be set incorrectly
<sb000>
that's how chips ought to be designed
<hartytp>
well, they may not be optimized for our use case
<hartytp>
so, good design is rather application dependent
<sb000>
k7 ios are well designed
<sb000>
so is the si5324 family
<hartytp>
anyway, no philosophy please, get me some eye scans!
<hartytp>
:)
<sb000>
the hmc clock chips on the other hand are obnoxious
<hartytp>
(also, please look over my code to check I'm not doing something silly)
<sb000>
what patch should I apply exactly?
<hartytp>
yes, Si chips are easy to use, but not great noise. HMC are analog people who do phase noise well, but not digital interfaces. anyway, this doesn't help
<omid>
rjo: Well while working at umd it took us a long time (~12 hours each) to get kc705 and sayma to work. I had made it work here at duke too. But this is a new setup.
<rjo>
omid: let us know if you have specific suggestions how to improve that and how to best prevent you from taking the wrong turn.
<cjbe>
omid: so you have a valid MAC address and IP address programmed in, but you cannot ping it? Could you post the uart log over a restart?
<cjbe>
i.e. artiq_flash -t kc705 start && flterm /dev/ttyUSB? (where /dev/ttyUSB? is the KC705 uart port)
mumptai has joined #m-labs
<rjo>
omid: write the mac and the ip to the storage image (artiq_mkfs) and flash that (artiq_flash). Also set the ip (or the corresponding hostname) in device_db (core_addr).
<GitHub-m-labs>
[artiq] jordens commented on issue #1084: But it says (in L118) `RTIO slack is the difference between timeline cursor and wall clock time (now - rtio_counter).` right when slack is mentioned the first time. That directly implies that positive slack is equivalent to `now > rtio_counter`. You are correct, this is a crucial concept. What do you propose? https://github.com/m-labs/artiq/issues/1084#issuecomment-