<wolfspraul>
all things worth doing took longer than they should have
<wolfspraul>
xiangfu: reading backlog about flash
<wolfspraul>
so by power cycling, you managed to flip a bit in your nor flash?
<xiangfu>
maybe I am not sure.
<wolfspraul>
unfortunately the rc2 and rc3 boards are different in that area, that complicates our search
<xiangfu>
werner replied in mailing list. If this is really a NOR corruption and if it is caused by incorrect
<xiangfu>
sequencing of power supplies, perhaps power-cycling (instead of the
<xiangfu>
reset button) could make it happen more often.
<xiangfu>
I don't know how to narrow down the bug. needs help from Werner and Adam :)
<wolfspraul>
what is the reset button? werner means reset in the gui?
<xiangfu>
I think he means press all three buttons for reboot.
<xiangfu>
I am doing "replug the DC adapter male plug" for power-cycling
<wpwrak>
xiangfu: (reset button) err yes, whatever button press makes the M1 reset ;-)
<wolfspraul>
ok so we have 3 ways to reset:
<wpwrak>
for now, i'd suggest to establish how often this happens. i.e., with a fixed reset procedure, repeat until hitting maybe 10 or more corruption events. (after each, reflash)
<wolfspraul>
1) cold removal of power supply, either by removing DC jack or by unplugging/switching off the power behind the adapter
<wolfspraul>
2) press three buttons at once for reset
<wolfspraul>
3) power off (or reset) in the gui
<wolfspraul>
so actually it's 5 ways
<wolfspraul>
1) unplug DC jack
<wolfspraul>
2) unplug power supply itself from mains
<kristianpaul>
bartbes: seems you managed to solve it :) sorry i wasn't able to help. actually now that there is an openwrt-milkymist port i think i'll care more later about learning makefiles and packaging
<wolfspraul>
wpwrak: 10 corruption events!!! :-)
<wolfspraul>
you will drive everybody to heart attack. This event is relatively rare, don't forget.
<wolfspraul>
if adam sets 38 boards to 'available', that's 380 full power to render cycles without any such occurrence.
<wolfspraul>
and maybe 3-5 boards did show the problem in between
<wolfspraul>
so maybe 1-2 times per 100 power cycles
<wolfspraul>
if you look at it that way
<wolfspraul>
but right now we know too little
<wolfspraul>
not just 'power cycle', actually it's a full rendering cycle with boot-to-render and then let it render for 30 seconds
<wpwrak>
i think we could try and see if unplugging the DC jack does the trick. unplugging mains would be even nastier, but it's also less controllable
<wpwrak>
xiangfu: btw, after flashing, are the partitions write-protected again ?
<wpwrak>
(10 events) yeah, but otherwise you don't know how long you have to test to be reasonably sure you've nailed it :)
<wpwrak>
100 events would of course be better ;-)
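A rough way to put numbers on "how long you have to test": assuming each power cycle fails independently with a fixed probability (a simplification, nothing in the discussion confirms it), the number of consecutive clean cycles needed before a fix can be trusted follows from (1 - p)^n <= 1 - confidence. The sketch below uses the 1-2 per 100 guess from above; the function name and the 95% default are illustrative choices only.

```python
import math


def clean_cycles_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Consecutive clean cycles needed so that, if the bug were still present
    at `failure_rate` per cycle, it would have shown up at least once with
    probability >= `confidence`. Assumes independent cycles."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Solve (1 - failure_rate)^n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


for rate in (0.01, 0.02):
    cycles = clean_cycles_needed(rate)
    print(f"rate {rate:.0%}: {cycles} clean cycles for 95% confidence")
```

At a rate this low, a handful of clean power cycles says very little, which is the reason for first measuring the rate with a fixed procedure.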
<wpwrak>
the good news: unless adam always runs the tests and checks the test logs, he may see a much lower than real incidence, because he'll probably only notice bitstream corruptions
<xiangfu>
wpwrak, (write protected) hmm... I use urjtag to flash them. there is no write-protect-like command in urjtag
<xiangfu>
wpwrak, the flash output is like unlock... flashing... unlock... flashing...
<wpwrak>
;-)
<xiangfu>
wpwrak, how this write-protect works?
<wpwrak>
lemme check ... there are many different ways how NOR chips do this
<wpwrak>
page 28. "Configurable Block Locking" ... " For additional information and collateral request, please contact your filed representative ." haha, very funny
<aw_>
the one was d2/d3 dimly lit after used 1.8 M ago. then today I powered on and d2 is fully off so that I started to test it. ;-) surely 0x7f was replaced new u7/u19/u20. ;-)
<aw_>
now it's all passed except 8:10 which i forgot to insert card. ;-) do rendering next. :)
<wolfspraul>
ok
<wolfspraul>
not sure whether that's related to the flash problems
<wpwrak>
xiangfu: however, section 10.1 (on the same page) might be useful. if the blocks are left unlocked, that's an invitation for trouble. you probably still have partitions where writes are expected and don't want to lock these. but at least the bitstream and maybe also the rescue stuff should probably default to being locked
<wolfspraul>
but that's why you go through the entire batch now, so that we can then see the flash problem more clearly
<kristianpaul>
(locking) good, now i can finally call that nor chip rom :)
<wolfspraul>
xiangfu: when your board couldn't boot anymore because standby was corrupted (maybe), was the rescue boot still working?
<wpwrak>
kristianpaul: you could, if you didn't need to store data there. when you download patches, where do they go ? i'd guess to the NOR, right ?
<xiangfu>
wpwrak, thanks. so there are some addition area for store the block lock info?
<kristianpaul>
wpwrak: actually sebastien idea is that, for a writable FS memory card have a job :)
<aw_>
hmm...0x7f: bad that can't configure after power-cycle. :(
<kristianpaul>
wpwrak: patches can be loaded from memcard
<xiangfu>
wpwrak, (rescue stuff should be locked) yes.
<kristianpaul>
wpwrak: store, i think user and pass for ftp, dunno what else i lost the track to flickernoise
<wolfspraul>
aw_: wait
<wolfspraul>
so you just reflashed 0x7f, successfully (according to flash script)
<wolfspraul>
then you ran the test software that was loaded over serial, and it succeeded
<aw_>
wolfspraul, no reflash
<wpwrak>
section 10.4 is funny. a "password access", some 64 bits string. as if that accomplished much ;-))
<aw_>
yes
<wolfspraul>
now you try to power cycle but it won't boot?
<wolfspraul>
how did you power cycle?
<wolfspraul>
and how does it stop? d2/d3 dimly lit?
<aw_>
it succeeded. so I inserted 8:10 card then powered on ...and d2 is dimly lit though. :(
<kristianpaul>
wpwrak: (password) so murphy cant guess that one :)
<aw_>
i unplugged the dc jack's plug
<aw_>
and replugged it
<wolfspraul>
is 'd2 dimly lit' the same as the nor flash corruption or a separate bug?
<wpwrak>
kristianpaul: naw, i think it's meant as a means to protect confidential content. now, how hard would it be to record that bit string ? :)
<kristianpaul>
ahh
<kristianpaul>
shame
<kristianpaul>
hum yes
<aw_>
seems once it has had d2/d3 dimly lit after the first flash, it will easily show d2 dimly lit again at power on
<wpwrak>
xiangfu: (lock) in the NOR data sheet, sections 6.1 and 6.2. it should be very similar to the unlocking operation
<aw_>
well..records firstly then continue other boards.
<wolfspraul>
yeah
<wolfspraul>
wasn't the reset circuit meant to fix the 'd2 dimly lit' problem?
<wpwrak>
wasn't it reset -> fix NOR corruption and NOR corruption -> "d2 dimly lit" ?
<wolfspraul>
unfortunately that's what we will only really learn and understand right now in the middle of the rc3 run ;-)
<wpwrak>
so i think we have some evidence that the reset circuit in its present state isn't sufficient to make NOR corruption go away
<wolfspraul>
hmm
<wolfspraul>
the 'd2 dimly lit' problem I knew about went away after power cycling again
<wpwrak>
(evidence) at least xiangfu's M1rc2 seems to suffer real NOR corruption. we haven't properly established that on an M1rc3 yet, though
<wolfspraul>
aw_: can you try to power cycle 0x7F three times?
<wpwrak>
ah, interesting. thought that was also the symptom of bad NOR
<aw_>
wolfspraul, ok
<wolfspraul>
many m1 rc2 users have found workarounds for power cycle/boot problems for themselves
<wolfspraul>
that complicates our analysis now
<xiangfu>
wpwrak, I only find the unlock code in urjtag. not found the lock code.
<xiangfu>
wpwrak, let me find the source code url.
<aw_>
wolfspraul, all d2 dimly lit now in three times powered-cycle.
<wolfspraul>
the data they report is biased because of their workarounds. so we need to try to remove that now.
<wolfspraul>
aw_: he :-) try another 3 times, wait 5 seconds in between.
<wpwrak>
xiangfu: check the data sheet. it describes the lock/unlock process. once you've found the same bytes for the unlock, it should be easy to do the locking
<aw_>
wolfspraul, still d2/d3 dimly lit, wait at least 5 seconds in between. i felt it keeps this stage unless I put aside it long long time even day. ;-)
<wolfspraul>
hmm
<wolfspraul>
ok
<wolfspraul>
move it aside
<wolfspraul>
:-)
<wolfspraul>
a proper (!) flash corruption will not fix itself
<wolfspraul>
:-)
<wolfspraul>
so it should not come back even after several days, which it sometimes does
<wolfspraul>
so I think there are several different problems here, masking each other partially
<wolfspraul>
aw_: just continue with the full round, then we look at all data carefully
<aw_>
wolfspraul, yes
<wolfspraul>
ok so right now, we are not locking anything in the nor flash
<wpwrak>
xiangfu: you can probably just copy the unlock functions, change UNLOCK_BLOCK to LOCK_BLOCK, and you're done
<wolfspraul>
but werner proposes to look several parts like rescue bitstream and maybe more
<kristianpaul>
(fix itself) so bus corruption? fpga..  a 20Ghz scope near? ;)
<wpwrak>
xiangfu: well, plus calling them ;-)
<wolfspraul>
s/look/lock/
<wolfspraul>
if we lock anything, my thoughts would be: a) how does that impact the ability for web updates or other updates
<wpwrak>
is the NOR mapped in the LM32's memory address space ?
<wolfspraul>
b) are we just covering up the real bug behind a lock (even if effective), or is this a proper fix still?
<wolfspraul>
just my thoughts, nothing else
<wpwrak>
just an extra protection
<wpwrak>
so you shouldn't set the locks when hunting the corruption
<wolfspraul>
and the locks need to be removed for updates
<kristianpaul>
wpwrak: yes is mapped
<wpwrak>
then not locking those things borders on insanity ;-)
<wolfspraul>
I don't think the "power-to-render cycles leading to unreconfigurable board" is related to anything the fpga does during rendering
<wpwrak>
considering that there's not even an MMU. any little sw bug can corrupt your NOR :)
<wolfspraul>
that's because we see this problem regularly when doing sets of 10 power cycles with 30 second rendering sprints
<kristianpaul>
;)
<wolfspraul>
but I never once have heard from it after a multi-hour rendering
<wolfspraul>
that's a very weak logic, but still
<wolfspraul>
it could be that long renderings are rare, and we are not focusing enough on this problem
<wolfspraul>
that's not to say that an unlocked memory mapped NOR is insane
<wpwrak>
wolfspraul: (ability to update) the update process would have to unlock before writing, then lock again. should be no problem.
<wolfspraul>
welcome to m1 :-)
<wolfspraul>
yes
<wolfspraul>
but that needs to be added
<wolfspraul>
there are two risks in start selling rc3 now, basically signing off boards to leave Taipei
<kristianpaul>
wpwrak: okay that's another reason for an MMU, i think now it makes more sense to me to have one :)
<wolfspraul>
the first risk is that the hardware is physically in a state that requires a fix later (a hardware fix)
<wolfspraul>
the second risk is that it is a software problem only, but the board is driven into a state where a normal user cannot recover it anymore, leading to them potentially having to ship units around the world for unbricking
<wpwrak>
(corrupting NOR via writes) i think it's a little more difficult than just doing a single bus cycle, but with protection off and all that, you're a lot closer to being able to inflict mayhem than you want to be
<wpwrak>
wolfspraul: yes, you could try and see if locking properly protects the rescue partitions. that would at least allow recovery without usb-jtag. but without solving the origin of the corruption (which may be hw), M1s would still see corruption
<wpwrak>
just in recoverable partitions. e.g., the one where all the patches for your show tonight are :)
<wolfspraul>
that's exactly how I see it too
<wolfspraul>
a lot of work ;-)
<wolfspraul>
oh the units are not 'production ready' in its current state
<wpwrak>
the evidence pointing to power cycling being a factor is strong. particularly given that people who rarely power cycle but reset often don't seem to experience NOR corruption easily
<wolfspraul>
since the normal (web) update does not update the rescue stuff, it would be a relatively easy next step to lock all rescue partitions
<wolfspraul>
xiangfu: so maybe you can try to get the locking done, and then we regularly lock all rescue partitions, as the normal process of reflash_m1.sh ?
<wolfspraul>
I don't see the downside to that right now
<xiangfu>
wolfspraul, yes. locking all rescue partitions should be ok.
<wolfspraul>
how many partitions is that?
<wolfspraul>
I still don't have a mental map of all our partitions
<wpwrak>
to corrupt the NOR, this should do nicely: volatile uint32_t *p = (void *) 0xSOMEWHERE; *p = 0x40; *p = 0;
<xiangfu>
wpwrak, how can I test if the lock is correct? write something to this area then read back. it will be different. right?
<wolfspraul>
test?
<wolfspraul>
just lock then be happy
<wolfspraul>
:-)
<xiangfu>
i mean to make sure the code does the lock correctly.
<wolfspraul>
sure, I was joking
<wpwrak>
you could use the code snippet from above. if the lock works, then it won't be able to zero the word in question. else, ... :)
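The lock test suggested here can be tried out on a toy model first. The sketch below is a simulation only, not the chip's real command set: `NorBlock` and `lock_holds` are hypothetical names, and the model assumes just two well-known NOR properties (programming can only clear bits, erasing sets all bits back to 1) plus the assumption that a locked block ignores program and erase requests.

```python
class NorBlock:
    """Toy model of a single NOR flash block (not the real chip interface)."""

    def __init__(self, words: int = 8, word_bits: int = 16):
        self.mask = (1 << word_bits) - 1
        self.data = [self.mask] * words  # erased state: every bit is 1
        self.locked = False

    def erase(self) -> bool:
        """Set every bit in the block back to 1, unless the block is locked."""
        if self.locked:
            return False
        self.data = [self.mask] * len(self.data)
        return True

    def program(self, index: int, value: int) -> bool:
        """Program one word. Bits can only go from 1 to 0, never back."""
        if self.locked:
            return False
        self.data[index] &= value & self.mask
        return True


def lock_holds(block: NorBlock, index: int) -> bool:
    """Try to zero one word; the lock works if the word survives unchanged.

    The word must be non-zero beforehand, otherwise the test proves nothing.
    """
    before = block.data[index]
    if before == 0:
        raise ValueError("pick a word that is not already zero")
    block.program(index, 0x0000)
    return block.data[index] == before


block = NorBlock()
block.program(0, 0x1000)      # leave a single 1 bit in the word
block.locked = True
print(lock_holds(block, 0))   # locked block: the word is left alone
block.locked = False
print(lock_holds(block, 0))   # unlocked block: the word gets zeroed
```

On real hardware the same test is destructive whenever the lock does not hold, so it belongs on a word whose loss is acceptable, followed by a reflash.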
<xiangfu>
wolfspraul, all rescue + standby. so 5 partitions
<wolfspraul>
xiangfu: the standby bitstream is needed by the rescue boot path?
<xiangfu>
it will goto standby after you plug the power.
<wolfspraul>
so it's always needed, even in rescue mode?
<xiangfu>
yes.
<wolfspraul>
is the standby bitstream updated by the web update?
<xiangfu>
no
<wolfspraul>
then it should probably be locked as well
<wpwrak>
xiangfu: it seems that you read the lock bit with the Read Device Information command. that command changes the way the NOR behaves. reads then return status information, not the NOR data.
<xiangfu>
if I understand correct. when plug power. fpga will load standby immediately, for enable the power button. reboot button. etc.
<wpwrak>
xiangfu: then you can retrieve the lock bit, see page 22, table 9
<wolfspraul>
ah yes, you said that already
<wolfspraul>
in total lock 5 partitions
<wolfspraul>
btw, the single-bit corruption (if it was one) xiangfu saw is not likely caused by a simple software pointer problem. that would have been much more likely to overwrite an entire word or more
<wpwrak>
probably lock more than 5. the regular bitstream for sure. then, does FN normally need to write to BIOS, splash, APP ? or only to Data ?
<wpwrak>
i don't know how often you can lock/unlock. probably not more often than you can write a regular NOR cell. so locking/unlocking should roughly follow the frequency of program cycles of the respective block.
<wpwrak>
you may want to ask numonyx for clarification, though
<xiangfu>
wpwrak, (read device Information command) yes.
<wpwrak>
wolfspraul: 0 is a very common word value :)
<wolfspraul>
but only 1 bit was changed
<wpwrak>
wolfspraul: remember that the transition was 0x1000 -> 0x0000
<wpwrak>
wolfspraul: yes, there was only one "1" bit there to destroy :)
<wolfspraul>
if the standby bitstream is corrupted in random ways (offsets), then whether D2/D3 stay fully off, or dimly lit, may be just a coincidence and caused by the same root problem
<wolfspraul>
hmm, true
<wpwrak>
xiangfu: hmm, does that mean that everything after 0078dd0 is 0 ? or that the file ended at 0078dd0 ?
<xiangfu>
also we may need to erase all NOR flash before flashing.
<xiangfu>
wpwrak, the origin file is only 495060 length. so it end at 0078dd0
<xiangfu>
wpwrak, when I read back the standby, I read the whole 640KB from m1. so it ends at 00a0000
<wpwrak>
xiangfu: ah, so the stuff at the end is a retrieval artefact
<wolfspraul>
xiangfu: yes erase all sounds good. we don't do that now?
<xiangfu>
no
<wolfspraul>
how fast/slow would it be?
<xiangfu>
erase very fast.
<xiangfu>
acceptable
<wpwrak>
you probably erase each block before writing it. that may or may not be sufficient. depends a bit on what the software expects.
<wolfspraul>
in a perfect world we should not need the erase, I guess
<xiangfu>
it may be because the latest standby.bin is smaller than the previous one
<xiangfu>
erase all nor flash can make all those bit to '1' :)
<wpwrak>
xiangfu: those two extra words (0004 0004) are indeed a little odd. they're within the same block. so they must have been erased. (if you never erased, you would have noticed by now :)
<wpwrak>
xiangfu: so something is writing a bit of extra data that's not in the file
<xiangfu>
wpwrak, (forget the block size). yes. something is writing a bit of extra data.
<wpwrak>
one more for the bug pile ;-)
<xiangfu>
then that is a bug in urjtag
<xiangfu>
yes.
<wpwrak>
yeah, probably urjtag
<xiangfu>
I can read more partition and compare .
<xiangfu>
see if this happen in other partitions.
<wpwrak>
maybe some  for (i = 0; i <= n; i++) program_word(i);  :)
<kristianpaul>
fake a file with a known pattern
<kristianpaul>
write it read back
<kristianpaul>
compare :)
<wpwrak>
then change the size and repeat. dd if=/dev/urandom  is your friend :)
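The write, read back, compare loop proposed above could be scripted roughly as follows. This is a sketch under assumptions: `compare_readback` is a hypothetical helper, the readback is taken to be longer than the written file (a whole partition), and everything past the end of the file is expected to read as erased flash (0xFF). The actual flashing and reading with urjtag is left out; the random bytes stand in for a `dd if=/dev/urandom` test file, and the stray bytes are made up to mimic extra words after the end of the file.

```python
import os


def compare_readback(original: bytes, readback: bytes, erased: int = 0xFF):
    """Return (offset, expected, actual) for every byte that differs.

    Bytes past the end of `original` are expected to be `erased`. If the
    readback is shorter than the original, the missing bytes are reported
    with actual=None.
    """
    mismatches = []
    for offset, actual in enumerate(readback):
        expected = original[offset] if offset < len(original) else erased
        if actual != expected:
            mismatches.append((offset, expected, actual))
    for offset in range(len(readback), len(original)):
        mismatches.append((offset, original[offset], None))
    return mismatches


# Stand-in for a random test file of an arbitrary size.
pattern = os.urandom(1000)

# A clean readback: the file followed by erased flash.
clean = pattern + b"\xff" * 24
# A readback with stray data written just past the end of the file.
stray = pattern + b"\x04\x00\x04\x00" + b"\xff" * 20

print(compare_readback(pattern, clean))
for offset, expected, actual in compare_readback(pattern, stray):
    print(f"offset {offset:#x}: expected {expected:#04x}, got {actual:#04x}")
```

Repeating this with several file sizes, including sizes that end exactly on and just short of a block boundary, would show whether extra data depends on the length.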
<wolfspraul>
bugs everywhere. sigh. but I need to decide whether we can start selling 'good' rc3 boards or not ;-0
<wolfspraul>
at least we have _lots_ of good starting points
<wolfspraul>
do we have consensus that we should add a full 32 megabytes erase to reflash_m1.sh ?
<kristianpaul>
at least it will help to track corruption, yes ;)
<wolfspraul>
xiangfu: is Adam using reflash_m1.sh or reflash_all.batch ?
<ignatius->
The JLime kernel tree compiles and sees the entire NAND. I wasn't able to get previous kernels to see that extra NAND. I've deduced that it may be a kernel option. Anyone know what that might be?
<bartbes>
then.. this sounds like a very hacky solution
<bartbes>
the thing is, I know it doesn't work
<`antonio`>
this is a temporary solution
<bartbes>
because I believe it is the function I nerfed
<jivs>
bartbes, do u think this error might be related with libunistring in some way?
<bartbes>
I would expect that, yes
<bartbes>
hmm
<bartbes>
it is a different function
<`antonio`>
which one?
<jow_laptop>
`antonio`: the proper way to override that (within an OpenWrt makefile) is  CONFIGURE_VARS += gl_cv_func_duplocale_works=yes
<jow_laptop>
no need to edit a config.cache
<`antonio`>
jow_laptop: nice, thanks
<jow_laptop>
the configure should then output something like  "Checking for foo ... yes (cached)"
<jivs>
jow_laptop, thanks
<bartbes>
is there a MAKE_ARGS thing too?
<bartbes>
jow_laptop: ah, cool
<jow_laptop>
bartbes: yes
<bartbes>
I was going to override it at a later point, during make, but this probably works
<jow_laptop>
there is  MAKE_VARS  which overrides the environment (e.g.  FOO=bar make ...)
<jow_laptop>
and  MAKE_FLAGS  which extends the args (e.g.  make FOO=bar)
<bartbes>
right, so if it doesn't work I can play with MAKE_FLAGS, thanks
<jow_laptop>
just be sure to always append (+=) to those vars since they already contain a bunch of default overrides and variables
<bartbes>
`antonio`: I managed to get it down to a link error
<bartbes>
updating the old patch should fix that
<`antonio`>
can you paste bin it
<jivs>
cool bartbes
<bartbes>
also time to take out all my desperate attempts
<bartbes>
it's just the csqrt one
<bartbes>
also, do you know what this 'issue with threads' is?
<jivs>
csqrt, is it similar to the guile1.8.7 patch
<bartbes>
yeah
<bartbes>
so easy
<jivs>
there is a patch for that already on 1.8.7, hopefully it will work
<bartbes>
no need to
<bartbes>
I know what it does
<bartbes>
so I can just replicate it
<jivs>
okay
<bartbes>
I guess I might as well start working on become the richest man in the world
<bartbes>
because that will finish sooner than this compile
<jivs>
so can be solved using configure_vars from Makefile. isn't it?
<`antonio`>
bartbes, how long it takes in your machine ?
<bartbes>
jivs: that's what I tried
<jivs>
cool
<bartbes>
`antonio`: not as long as I made it out to be, but 10 mins, I guess
<bartbes>
if this build fails I'll time it for you
<bartbes>
(hoping it doesn't, though)
<jivs>
lets be optimistic :-)
<bartbes>
well, you know, I'm undoing my desperate measures
<bartbes>
so it might very well happen
<bartbes>
`antonio`: well, 10 minutes seems like a good estimate for the source to compile
<bartbes>
it's been chewing through docs for a while now, though
<bartbes>
stupid texinfo manuals..
<`antonio`>
so successfully completed ?
<`antonio`>
bartbes, then you're almost there
<bartbes>
it's still creating manuals
<bartbes>
I'm going to have to see if there's an option to turn that off
<bartbes>
`antonio`: progress update: still compiling texinfo manuals
<bartbes>
not a fun activity
<`antonio`>
what processor do you have?
<bartbes>
it's compiling on an old p4
<bartbes>
but still, it's coming up to 45 mins
<jivs>
still no new error, so good going
<jivs>
if it completes fine, will be worth the wait ...
<bartbes>
jivs: like I said, it's been compiling docs (with the same command) for 45 mins!
<bartbes>
yeah, I'll just let xiangfu dispatch the build server or something
<bartbes>
like hell I'm compiling these docs again..
<`antonio`>
wpwrak, I am following the INSTALL-Ben instructions and applying patches to the kernel but when I install the kernel in my nanonote I get "ERROR: Can't get kernel image!". the problem might be that I am using my own image, can I apply those patches directly to the toolchain?
<`antonio`>
bartbes, got to go now, let me know if you got it working !
<jivs>
bartbes, How did it go?
<bartbes>
still.. making.. docs..
<jivs>
omg
<jivs>
have u found any way to disable that for future!
<bartbes>
not yet
<bartbes>
:(
<jivs>
I will also start in my toolchain soon. but its not that powerful though...
<jivs>
bbiab
<bartbes>
guild snarf-check-and-output-texi          > guile-procedures.texi || { rm guile-procedures.texi; false; }
<bartbes>
30425 pts/1    R+  152:41 /bin/sh /media/shared/home/nanonote/openwrt-xburst/build_dir/target-mipsel_uClibc-0.9.32/guile-2.0.2/meta/guile -e (@@ (guild) main) -s /media/shared/home/nanonote/openwrt-xburst/build_dir/target-mipsel_uClibc-0.9.32/guile-2.0.2/meta/guild snarf-check-and-output-texi
<bartbes>
oh, minus that first line
<bartbes>
I couldn't help but interrupt
<bartbes>
this wasn't going to work
<jivs>
oh
<jivs>
i think there is some patch to disable snarf on guile 1.8.7, will that help us now?
<viric>
I've always wondered how someone building linux manage the memory it is going to use
<viric>
(user programs apart, of course)
<viric>
Looking for that, I never found the information I wanted. Does anybody here happen to knuw much about that?
<viric>
know
<viric>
it's clear how to compile away code
<bartbes>
jivs: probably
<viric>
but not-code... how?
<bartbes>
jivs: I hope so, because my attempt failed
<jivs>
i will update you my progress..
<bartbes>
jivs: please tell me you've found a way to disable the docs yet
<jivs>
bartbes, can u paste the second confiigure_args
<jivs>
configure_vars
<bartbes>
CONFIGURE_VARS += gl_cv_func_duplocale_works=yes guile_cv_use_csqrt="no, Ben NanoNote (cross-compiling)"
<jivs>
sorry I don't have that good news yet..
<bartbes>
the worst part is that it takes about 10 mins to verify..
<jivs>
did u get this error? :-> i18n.c: In function 'str_upcase_l':
<jivs>
i18n.c:874:12: error: dereferencing pointer to incomplete type
<jivs>
this error went away after I added 2nd conf_var
<jivs>
Did u get this error? :-> bash: -c: line 0: unexpected EOF while looking for matching `"'
<jivs>
bash: -c: line 1: syntax error: unexpected end of file
<jivs>
paste here plz, bbiab
<bartbes>
jivs: I never got the second, probably because I patched the first
<bartbes>
jivs: I almost cracked it
<jivs>
whats the error now?
<bartbes>
I disabled the build rule and it complained about the lack of output
<bartbes>
so I disabled that expectation too
<jivs>
is it compiling now?
<bartbes>
yeah
<bartbes>
we'll see how this ends..
<bartbes>
if it builds, I'll commit
<bartbes>
testing can wait
<bartbes>
I've spent enough time on this..
<jivs>
ok
<jivs>
Can you pastebin your Makefile plz. I am still getting that bash: -c error