<GitHub>
artiq/master dbea679 whitequark: Revert "test: relax test_rpc_timing on Windows."...
<rjo>
dds programming, rpc latency, data throughput.
<whitequark>
um
<whitequark>
what?
<rjo>
whitequark: we can do it temporarily if we really think that it will advance progress. but we have to commit to making a significant effort to speeding things up again.
<whitequark>
we started off with 15ms rpc latency on lwip.
<whitequark>
smoltcp has improved things by a factor of eight or so
<rjo>
really? i remember a few ms before rust.
<whitequark>
was 15ms in 2015
<whitequark>
10ms slightly before that
<whitequark>
and the test was checked in with 10ms.
<rjo>
ok. got a link handy to the build for artiq 2.x?
<whitequark>
no, I'm looking at the sources of the test in git log
<rjo>
whitequark: then i am content; scratch rpc latency out of that list. but i am adding worker startup/kernel compilation again.
<whitequark>
throughput should be better now than it ever was with lwip as well, because we've added a few more buffers to the ethmac
<whitequark>
and I've limited the advertised receive window by the buffer size too, in smoltcp
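(For context: in smoltcp, the TCP window a socket advertises can never exceed the free space in its receive buffer, so the buffer size chosen at socket creation is the knob being discussed. A minimal sketch of that, with arbitrary buffer sizes; the module paths follow an older smoltcp release and differ in newer ones:)

```rust
// Minimal sketch, not artiq runtime code. Type names follow an older smoltcp
// release; newer versions use smoltcp::socket::tcp::{Socket, SocketBuffer}.
use smoltcp::socket::{TcpSocket, TcpSocketBuffer};

fn main() {
    // The window advertised to the peer is bounded by the free space in the
    // receive buffer, so this size is what ultimately caps the in-flight data.
    let rx_buffer = TcpSocketBuffer::new(vec![0u8; 4096]);
    let tx_buffer = TcpSocketBuffer::new(vec![0u8; 4096]);
    let _socket = TcpSocket::new(rx_buffer, tx_buffer);
}
```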
<rjo>
unfortunately we never printed out the rpc latency.
<whitequark>
the rpc latency is pretty hard to instrument at this point
<rjo>
whitequark: i seem to remember a discussion with you where you gave a 100kB/s-ish throughput number.
<whitequark>
although
<rjo>
just ask the host for the time twice.
<whitequark>
no, scratch that, a sampling profiler collecting perhaps ten times per second should not disturb it enough to skew the results
<whitequark>
no, that's not it
<whitequark>
measuring it is easy.
<whitequark>
learning where time is wasted, not so much.
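(rjo's "ask the host for the time twice" amounts to timing a round trip. A generic Rust sketch of that idea, timing request/response pairs against a plain TCP echo service; this is purely illustrative and not the artiq RPC protocol, and the address is a placeholder:)

```rust
// Illustrative round-trip timer: the same idea as "ask the host for the time
// twice", but against a plain echo service rather than the artiq RPC machinery.
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // "echo.example.org:7" is a placeholder echo service, not an artiq endpoint.
    let mut stream = TcpStream::connect("echo.example.org:7")?;
    stream.set_nodelay(true)?;
    let mut buf = [0u8; 1];
    let n: u32 = 100;
    let start = Instant::now();
    for _ in 0..n {
        stream.write_all(&[0u8])?;    // minimal "request"
        stream.read_exact(&mut buf)?; // wait for the echoed byte to come back
    }
    println!("mean round trip: {:?}", start.elapsed() / n);
    Ok(())
}
```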
<whitequark>
regarding throughput, that doesn't sound right. let me check something
<rjo>
network traces would tell you about the network things.
<whitequark>
rjo: actually, it's more like 25kB/s.
<whitequark>
which... is interesting
<whitequark>
the cause is that packets are only sent every 40ms
<whitequark>
looking at how stable that number is, it's probably something like an unwanted interaction with delayed ACK.
<whitequark>
I'll take a look at it later.
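(The arithmetic is consistent: one roughly 1 kB segment every 40 ms is about 25 kB/s, and a 40 ms stall per segment is the classic Nagle/delayed-ACK interaction. One common mitigation, shown below as a host-side sketch, is disabling Nagle on the host socket; this only helps when the host is the side doing small writes, does not change the device-side smoltcp behaviour, and is not necessarily the fix that was eventually applied. The address is hypothetical:)

```rust
// Host-side sketch only: TCP_NODELAY disables Nagle, so small writes are sent
// immediately instead of being held back waiting for the peer's (possibly
// delayed) ACK. One common way to break a 40 ms Nagle/delayed-ACK stall.
use std::net::TcpStream;

fn connect_no_nagle(addr: &str) -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?;
    Ok(stream)
}

fn main() -> std::io::Result<()> {
    // Hypothetical core device address, for illustration only.
    let _stream = connect_no_nagle("192.168.1.52:1381")?;
    Ok(())
}
```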
<rjo>
whitequark: right. that's not much fun. we should track a number for that in the unittests.
<rjo>
whitequark: ack. i wonder what we are going to get. when i was still in the lab i had a pretty good handle. but now i feel that i am diverging a bit as well.
<bb-m-labs>
build #1403 of artiq is complete: Exception [exception python_unittest_2 interrupted] Build details are at http://buildbot.m-labs.hk/builders/artiq/builds/1403 blamelist: whitequark <whitequark@whitequark.org>
allen0s has joined #m-labs
<allen0s>
whitequark: I'm not currently on the artiq mailing list, but was forwarded the call for a survey. New to artiq and setting it up on a clean install of ubuntu 16.04. I was able to work through a conda install, but when I was here yesterday, you indicated that my version was out of date. So I went back to install from source. And a day later, I'm still hitting dependency hell, mismatch, etc. It's not necessarily artiq, but the 100
<rjo>
allen0s: you were here yesterday?
<whitequark>
allen0s: "but the 100" and then the message was cut off.
<rjo>
allen0s: help us triage your problem: who are you and what do you want to do? what did you do? and what was the outcome?
<allen0s>
whitequark: 1000 things it depends on directly and indirectly
<whitequark>
allen0s: okay.
<whitequark>
so, it's not necessary to install artiq from source to get the latest version.
<whitequark>
adding the "dev" conda channel is enough
<whitequark>
install from source is mostly there for people who detest conda (like me).
<allen0s>
rjo: stewart allen w/ ionq. building ion trap quantum computers
<allen0s>
whitequark: would like to be able to work from src to be able to contribute back more effectively anyway
<whitequark>
allen0s: conda can still help you there; install the "artiq-dev" package
<whitequark>
this has all the dependencies of artiq, but not artiq itself
<whitequark>
it's fairly unlikely that you'll need e.g. a custom build of llvm. so if you can cope with conda, sure, use it!
<allen0s>
whitequark: like you, not a fan
<allen0s>
whitequark: but would like to get up and running faster
<rjo>
allen0s: there is no fast and easy installation of all the packages from source that does not require you to learn most of the build machinery. you have to choose between the pain of handling the dependencies and the builds yourself, and the somewhat less flexible artiq-dev-package-based development.
<sb0>
whitequark, you can safely ignore that "undefined symbol: mpfr_z_sub" message from vivado.
<sb0>
allen0s, you installed nist_qc1. it is no longer maintained. use nist_qc2, nist_clock or pipistrello for the latest version
<whitequark>
sb0: that doesn't seem like a thing one should safely ignore...
<whitequark>
why is it even calling awk?
<sb0>
well as far as I can tell, it still runs fine despite printing that message
<sb0>
it seems to be a common problem too, if you google for it
<sb0>
as to "why", because xilinx shitware sucks.
<whitequark>
you'd think the awk call serves some purpose...
<whitequark>
but I guess you never know with xilinx.
<sb0>
whitequark, vivado has some code that recognizes FSM and reencodes/optimizes them
<whitequark>
sb0: no, I mean, your timing changes
<sb0>
probably what happens is that code has some bug, but when you add the CSR it no longer recognizes a FSM and that bug is not tickled
<whitequark>
I tried it again with the buildbot-built gateware
<whitequark>
bitstream even
<whitequark>
it still doesn't hang
<sb0>
ah!
<sb0>
so it was just a timing problem. not some xilinx bug.
<whitequark>
I don't know, the xilinx version also changed, no?
<whitequark>
and timing passed many times before
<whitequark>
and it still hung
<sb0>
timing broke when you replaced the shift with a Case
<sb0>
it passed before
<whitequark>
ah I see
<whitequark>
it could be hanging for a different reason before that
<sb0>
so there were two versions before: 1) shift that passes timing but miscompiles 2) Case that doesn't miscompile but passes timing
<whitequark>
btw how did you fix timing?
<sb0>
*breaks timing
<sb0>
I just changed the vivado version for the latest one
<sb0>
as I said they seem to have made the result less slow
<sb0>
once in a blue moon things improve with vivado upgrades
<whitequark>
oh
<whitequark>
it hung.
<whitequark>
wtf
<whitequark>
it hung when I tried to write a test for it, specifically
<whitequark>
ah I see, that happens after an underflow.
<whitequark>
sb0: ok so
<whitequark>
it still hangs
<whitequark>
I don't know what the exact condition for hanging it is, but the *combination* of tests (not pushed yet) reproduces it
<whitequark>
... and every FSM is at zero when it's hung.
<whitequark>
oh
<whitequark>
it's at zero because that mechanism is just broken.
<whitequark>
hm, it might not be, actually
<whitequark>
no, the mechanism is not broken, I was just querying it after the core actually *did* finish for this test
<whitequark>
sb0: so. it definitely hangs. and all FSMs are definitely in IDLE.
<whitequark>
let's see how that is possible...
<whitequark>
sb0: what's "CRI" and how does this thing work?
<sb0>
common rtio interface
<sb0>
just a set of signals shared between rtio/drtio, a bit like wishbone buses, but specific to rtio
hartytp has joined #m-labs
<whitequark>
ok. well, I don't know why the arbiter is broken.
hartytp has quit [Quit: Page closed]
<allen0s>
whitequark: i just went back to the conda packages using -dev and when i try to flash the board using nist_qc2, i get:
<allen0s>
pkg_resources.DistributionNotFound: The 'artiq==3.0.dev0+820.gf4ae166' distribution was not found and is required by the application
<allen0s>
the other day using non-dev and nist_qc1 worked
<sb0>
whitequark, what are the symptoms?
<whitequark>
sb0: same as before
<sb0>
broken arbiter should not hang
<whitequark>
after a certain sequence of events, csr::rtio_dma::arb_gnt_read() never equals 1
<sb0>
did you do csr::rtio::arb_req_write(0); csr::rtio_dma::arb_req_write(1) ?
<sb0>
is it where it hangs? waiting for the arbiter?
<whitequark>
sure. I'm using your code.
<whitequark>
it hangs in the loop in rtio_arb_dma().
<whitequark>
and just before calling that function, all FSMs are at idle
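(For reference, the loop in question spins on the grant bit after requesting the arbiter. A sketch of that wait with an iteration cap added, so the hang would surface as an error instead of a silent spin; the CSR accessors are the ones already quoted in this discussion, the cap and the Result are illustrative additions, and the runtime's generated `csr` bindings are assumed to be in scope:)

```rust
// Sketch of the arbiter handover with a bounded wait. Assumes the runtime's
// generated `csr` module is in scope. The accessors match those quoted above;
// the iteration cap and error reporting are illustrative additions so the
// hang is reported instead of spinning forever.
fn rtio_arb_dma_bounded() -> Result<(), &'static str> {
    unsafe {
        csr::rtio::arb_req_write(0);
        csr::rtio_dma::arb_req_write(1);
        for _ in 0..1_000_000 {
            if csr::rtio_dma::arb_gnt_read() == 1 {
                return Ok(());
            }
        }
    }
    Err("timeout waiting for rtio_dma arbiter grant")
}
```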
<sb0>
if you set the arbiter permanently to dma, is there any dma bug left?
<GitHub>
[artiq] sbourdeauducq commented on issue #681: > If the ion trap is working with a long chain of ions, asynchronous kernel termination could cause ion loss. Hooks should be in place so we can do clean up.... https://github.com/m-labs/artiq/issues/681#issuecomment-287390855
<whitequark>
sb0: I cannot set the arbiter permanently to DMA.
<whitequark>
I added a condition: if csr::rtio_dma::arb_gnt_read() == 0 { rtio_arb_dma(); }
<whitequark>
well, somehow, arb_gnt gets reset back to 0 at some point
<whitequark>
despite me never doing it (in fact the code that could do it is commented out)
<sb0>
whitequark, how come? just set it to dma when the runtime starts, and don't touch it afterwards
<whitequark>
sb0: I don't touch it.
<whitequark>
it still resets itself back to 1.
<whitequark>
erm, 0.
allen0s has quit [Quit: Page closed]
<sb0>
whitequark, feel free to replace the arbiter with a switch
<sb0>
i.e. have a csr for selecter and remove arb_req and arb_gnt
<sb0>
*selected
<whitequark>
ok
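(Runtime-side, sb0's proposal amounts to: instead of a request/grant handshake, a single CSR selects which master drives the CRI. A sketch under that assumption; `csr::rtio_core::cri_select_write` is a hypothetical name for the selection CSR, standing in for whatever the gateware would actually generate:)

```rust
// Sketch of the "switch instead of arbiter" idea on the runtime side.
// Assumes the generated `csr` bindings are in scope; `cri_select_write` is a
// hypothetical CSR name introduced for illustration.
const CRI_MASTER_KERNEL: u8 = 0;
const CRI_MASTER_DMA: u8 = 1;

fn select_cri_master(master: u8) {
    unsafe {
        // One write, no req/gnt handshake to wait on, hence nothing left to hang in.
        csr::rtio_core::cri_select_write(master);
    }
}

fn main() {
    // Hand the CRI to the DMA core for playback, then back to the kernel CPU.
    select_cri_master(CRI_MASTER_DMA);
    // ... run DMA playback ...
    select_cri_master(CRI_MASTER_KERNEL);
}
```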
acathla has quit [Quit: Coyote finally caught me]
acathla has joined #m-labs
acathla has quit [Changing host]
acathla has joined #m-labs
key2 has quit [Quit: Page closed]
AndChat|326081 has quit [Ping timeout: 240 seconds]