rellla changed the topic of #linux-sunxi to: Allwinner/sunxi /development discussion - did you try looking at our wiki? https://linux-sunxi.org - Don't ask to ask. Just ask and wait! - https://github.com/linux-sunxi/ - Logs at http://irclog.whitequark.org/linux-sunxi - *only registered users can talk*
mauz555 has quit [Remote host closed the connection]
tllim has quit [Ping timeout: 256 seconds]
lurchi__ is now known as lurchi_
lurchi_ is now known as lurchi__
ChriChri_ has joined #linux-sunxi
ChriChri has quit [Ping timeout: 265 seconds]
ChriChri_ is now known as ChriChri
tllim has joined #linux-sunxi
Mangy_Dog has quit [Ping timeout: 258 seconds]
splitice has joined #linux-sunxi
<splitice> So I've been hunting for a timer issue affecting quite a few H3 users
<splitice> We have been seeing it sporadically too in our fleet
<splitice> Anyone got any gut feelings for this one?
<splitice> clock monotonic jumps forward, sometimes there is a cpu stall (governor) but that may be because of the increase in clock_monotonic creating an uptime overflow
<splitice> the a64 errata comes to mind, however surely someone else in this community would have noticed by now
_whitelogger has joined #linux-sunxi
dev1990 has quit [Quit: Konversation terminated!]
<megi> never saw that on any of my H3 boards, except for those rcu stalls years ago when debugging cpufreq
<splitice> I'll put the kernel trace up in case anyone has any ideas that I haven't - https://paste.ee/p/05p13
<splitice> unfortunately haven't been able to work out how to replicate in the lab (that's from a remote device). So I can't get 100% access during the issue (openvpn ssh fails, but a http "backdoor" we have installed on some of these devices continues working)
<megi> It's not clear to me what 4.19.57 armbian kernel is, so it's hard to see what cpu clock rate change code looks like
<megi> I.e. what it looks like after applying all the patches armbian carries for this kernel
<megi> try reproducing it with the latest mainline kernel
<splitice> that kernel's already been stripped down of most of the non-essential sunxi patches
<megi> 5.5 or 5.6-rc
<splitice> I'll check the list
<splitice> I'm working on a 5.4 port for testing but without a method to replicate it's going to be difficult to work out if it's fixed...
<splitice> what I know so far is that 3/15 devices deployed have the issue. Two have had it once each (in the past fortnight) and one has the issue repeatedly every 2-4 days.
<splitice> none of the 6 in our office (or when the 15 that were also there for 2 weeks powered on) exhibit the issue
<megi> do the locations have differing ambient temperatures?
<megi> what is the board dts?
<splitice> friendlyarm nanopi neo
<splitice> air
<splitice> the main armbian patches we have left enabled are for thermal zone support
<splitice> Temp: They are human-occupied houses so they shouldn't be too extreme. The currently failed device is reporting 47deg, no idea if there were thermal stress events during operation.
<splitice> (SoC temp)
<splitice> big heatsinks and reasonable cpu settings are employed
<megi> hmm, the board doesn't have CPU voltage regulator?
<splitice> 3 voltages I believe
<megi> dts doesn't have it set up
<megi> this means cpufreq will alter the cpu frequency freely, without touching whatever default voltage is there after boot
<splitice> pretty sure it's patched in in the thermal areas.
<megi> at least it's not wired up in 5.6-rc5 dts
<splitice> let me see if I can get a DTS
<megi> there's no cpu power supply
<splitice> ok, what block is that normally? I'll see if I can find a patch.
<megi> if the default voltage is >= 1.32V it may work I guess, otherwise you're probably undervolting the CPU
<megi> how to add the regulator node depends on the board
<megi> you can try limiting the top frequency to 816MHz to see if it will stop the stalls
<megi> that only requires 1.1V
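The frequency cap megi suggests can be applied from userspace through the cpufreq sysfs interface. The following is only a minimal sketch: it assumes the usual `/sys/devices/system/cpu/cpufreq/policy*` layout and that the value is given in kHz (816000 for 816MHz). The function name `cap_max_freq` and the `CPUFREQ_ROOT` override are hypothetical, added here so the function can be tried against a scratch directory instead of a live board.

```shell
# Cap every cpufreq policy at the given frequency (in kHz).
# CPUFREQ_ROOT is a hypothetical override, only there for dry runs;
# by default the standard sysfs location is used (needs root).
cap_max_freq() {
    khz=$1
    root=${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}
    for policy in "$root"/policy*; do
        # Skip anything that is not a writable policy directory.
        [ -w "$policy/scaling_max_freq" ] || continue
        echo "$khz" > "$policy/scaling_max_freq"
    done
}

# Example (as root on the board): cap_max_freq 816000
```

The cap does not survive a reboot, so for a fleet test it would have to be reapplied at boot by whatever init mechanism is already in use.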
<smaeul> I threw my timer testing tools on my OPi+2E, and I immediately got:
<smaeul> CPU 2: jumped back 0μs: 0x00000007997a7177 → 0x00000007997a7177 > 0x00000007997a7176 → 0x00000007997a7176
<smaeul> CPU 0: jumped back 0μs: 0x00000007997a7177 → 0x00000007997a7177 → 0x00000007997a7177 > 0x00000007997a7176
<megi> fun
<megi> smaeul: do you have code somewhere?
<splitice> @smaeul: indeed I'
<splitice> can run on a fleet fairly easily
<megi> splitice: regulator issue is real too, though :)
tllim has quit [Read error: Connection reset by peer]
<splitice> megi currently investigating the cpu voltage situation, it's a possibility, however given we have some devices up over 180 days I suspect it might have just been my patch reduction (reduction of surface area)
ganbold_ has joined #linux-sunxi
<splitice> bpi uses basically the same regulator setup so I should be able to make a patch if not. More a kernel network subsys guy but how hard can it be... lol
<megi> some H3 SoC may be stable at 1.3V/1.2GHz some may need slightly more
<smaeul> I never got around to cleaning it up after the a64 patch got merged, so... http://ix.io/2dRj
<megi> Allwinner doesn't do binning for H3, AFAIK
<megi> doesn't mean process doesn't vary
<smaeul> I get one or two CPUs jumping back, but only right when the test starts, so maybe it really is a cpufreq thing
ganbold has quit [Ping timeout: 256 seconds]
<splitice> our workload is one with a lot of peaks and troughs, so the frequency would be changing
<smaeul> definitely CPU frequency related (i.e. clocks, bus stalls, power spikes/voltage drops, etc.). I run this and the backward ticks come rolling in:
<smaeul> while true; do cut -d' ' -f$(((RANDOM%9)+1)) scaling_available_frequencies > scaling_max_freq; done
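smaeul's one-liner hardcodes nine entries in `scaling_available_frequencies`. A variant that counts the entries itself might look like the sketch below. It is an illustration only, not smaeul's tool: the helper name `pick_random_freq` is hypothetical, and like the original it relies on bash's `$RANDOM`.

```shell
# Pick one entry at random from a space-separated frequency list.
# Relies on bash's $RANDOM, as the original one-liner does.
pick_random_freq() {
    set -- $1                      # split the list into positional parameters
    [ "$#" -gt 0 ] || return 1     # nothing to choose from
    shift $(( RANDOM % $# ))       # drop a random number of leading entries
    echo "$1"
}

# Hypothetical driver loop (as root, from inside a cpufreq policy directory):
# while true; do
#     pick_random_freq "$(cat scaling_available_frequencies)" > scaling_max_freq
# done
```

Rewriting `scaling_max_freq` in a tight loop forces frequent cpufreq transitions, which is the condition under which the backward ticks show up in this discussion.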
<megi> other thing is that I don't like mainline implementation of CPUX rate changing code, so I run my own - mainline intentionally uses simple but broken CPU rate changing code, locks up PLL and then tries to recover
<splitice> megi got a patch?
<megi> sure
<splitice> I know for sure mainline cpu online code is broken
<splitice> it's actually how we are restarting the devices locked up due to this bug
<smaeul> got it to jump more than one tick backward: CPU 2: jumped back 1μs: 0x0000000d44b2e1ff → 0x0000000d44b2e1ff > 0x0000000d44b2e1e0 → 0x0000000d44b2e1e0
<splitice> ```
<splitice> ```while [[ 1 ]]; do
<splitice> that crashes every h3 I've ever tested it on (BPI and FriendlyARM NEO air, core and NEO)
<splitice> thanks megi, I'll apply against our 5.4 kernel (WIP) and keep an eye out
<megi> none of those have cpu-supply wired either
<megi> you need u-boot patch too, or you'll get lockup/stall on first thermal zone throttling event
<megi> yes
<splitice> megi I'm thinking https://paste.ee/p/5WvOd for the regulator. It's the same regulator setup as the BPI zero plus but with a different pin for control.
<megi> smaeul: I also get jumps
<smaeul> even with your PLL patch?
<megi> yes
<megi> well I get output, I don't know what it means :)
<megi> have to check the code
<smaeul> it's just the raw CNTVCT values (or whatever the armv7 equivalent is called), so a 24MHz counter
<megi> cyclic buffer
<megi> just buffer :)
<smaeul> yeah, it just reads 4 times and compares :)
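The comparison step smaeul describes can be replayed on the raw values pasted earlier in the log. The sketch below is not the tool from http://ix.io/2dRj; it only walks a list of readings in order and reports each backward step, converting ticks to microseconds under the 24MHz assumption stated above. The function name `report_backjumps` and the use of integer division for the microsecond figure are assumptions.

```shell
# Report backward steps in a sequence of raw counter readings (hex or decimal).
# Assumes a 24 MHz counter, so microseconds = ticks / 24 (integer division).
report_backjumps() {
    prev=
    for cur in "$@"; do
        if [ -n "$prev" ] && [ $(($cur)) -lt $(($prev)) ]; then
            ticks=$(($prev - $cur))
            echo "jumped back $((ticks / 24))us ($ticks ticks): $prev > $cur"
        fi
        prev=$cur
    done
}

# The four readings from the "jumped back 1μs" line:
report_backjumps 0x0000000d44b2e1ff 0x0000000d44b2e1ff \
                 0x0000000d44b2e1e0 0x0000000d44b2e1e0
# prints: jumped back 1us (31 ticks): 0x0000000d44b2e1ff > 0x0000000d44b2e1e0
```

The 31-tick step is about 1.3μs at 24MHz, which is consistent with the "1μs" in the pasted output, while the earlier single-tick steps come out as 0μs.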
<megi> kernel does isb() prior to mrrc
<megi> doesn't help :)
ElBarto has quit [Ping timeout: 268 seconds]
<megi> those jumps are all very small
ElBarto has joined #linux-sunxi
<splitice> the end result for us was an uptime that overflowed the 32-bit counter (or underflowed the uptime?)
<splitice> that's what actually causes the irreparable problems, systemd doesn't like that...
<splitice> and I'm guessing neither do some parts of the kernel
<smaeul> yes, the size doesn't matter when the jump is backwards
<megi> it's possible kernel will not even notice, since it's probably not getting cntvct in a tight loop
<smaeul> very likely even, which is why the date jumps so rarely
<smaeul> it just takes once :)
<megi> it would likely happen on all boards then, not just some
<smaeul> I'll leave the test running for 12h on my OPi+2E locked to the max freq (with active cooling) to see if I can get it to jump without a cpufreq transition
<megi> maybe cpufreq code manipulates some counter values?
<megi> or is the counter fixed freq?
<smaeul> the counter is fixed frequency, running off HOSC
<smaeul> Linux writing to the counter value at runtime would be crazy
<splitice> I see date jumps every ~0.3s over multiple cpu on a 4.14 kernel during a cpu speed transition loop (scaling_setspeed). Usually 0ms other times 1ms back.
<megi> even more bizarre why it would change on CPU freq change then
<megi> us :)
<splitice> I'll test on 5.4 in a sec too
<splitice> yep 5.4 exhibits the same behavior
<splitice> that's some serious timetravel
<megi> hmm, mainline switches CPU to HOSC during CPUX rate change
<megi> code execution slows down quite a bit during that time
<smaeul> wow, so it's not like the a64 bug at all (where we read indeterminate values). the clock actually *goes backward* and then counts the same values over again
<megi> that's why I suspect kernel is doing some correction :)
<smaeul> in that picture you can see CPU 0 reaching [the same value] three times (can't copy/paste from your screenshots :/)
<megi> hmmh, does the same thing happen on A64 during cpufreq?
<megi> increase in backjumps?
<megi> smaeul: it doesn't happen on H5
<smaeul> I don't know, there are so many jumps already that it would be nontrivial to filter out from the noise (and all of my A64 systems have the workaround already)
<megi> nevermind, H5 is 64-bit and doesn't jump on cpufreq
iyzsong has joined #linux-sunxi
megi has quit [Quit: WeeChat 2.7.1]
megi has joined #linux-sunxi
<MoeIcenowy> smaeul: what does *actually goes backward* mean?
<MoeIcenowy> cannot parse the context
<megi> smaeul: I fixed it :)
<megi> I dropped the PLL gating and reparenting and there are no jumps anymore
<megi> so it's caused by that
<smaeul> MoeIcenowy: on the a64 when the clock jumps back from (say) 0xfffff back to 0x7ffff, it immediately jumps forward again to 0x100000 and counts from there
<smaeul> on the h3, after jumping back, it would count 0x7ffff -> 0x80000 -> 0x80001 and so on
<MoeIcenowy> why is the clock jumping back?
<smaeul> so on the a64, the counter hardware is still counting forward, but what you can see from the CPU has some wrong bits
<MoeIcenowy> megi: then please test CPU frequency switching stability
<megi> MoeIcenowy: I will not do that again
<megi> I already did that and it works fine
<MoeIcenowy> megi: okay
<megi> anyway, I don't think this is really the cause of splitice's problems
<MoeIcenowy> so I still cannot understand what happened. Is it that CPUfreq scaling makes the clock get tweaked back, and that then triggers the bug?
<megi> it happens when switching parents of CPUX clock
<megi> to HOSC and back
<megi> no idea why
<megi> it only happens on H3, not H5
<MoeIcenowy> switching the parent will make the clock get tweaked back?
<megi> yes
<splitice> Unfortunately even I have little idea what causes my issue. That will likely be the case until I can replicate it in lab. This certainly is a strong contender for the cause.
<MoeIcenowy> and this happens on A64 too, right? (although the HW seems to be trying to hide it)
<megi> probably not
<megi> I don't think A64 had CPUX reparenting until recently
<MoeIcenowy> I know that the current A64 bugfix is not optimal
<MoeIcenowy> it cannot prevent timetravel, only reduces it
<MoeIcenowy> (I mean the timer fix)
<megi> yes
<megi> someone reported some fsl bugfix worked for them better
<MoeIcenowy> which fsl bugfix?
<megi> but I had infinite loops using it in u-boot :)
<megi> so no
<MoeIcenowy> CONFIG_FSL_ERRATUM_A008585 ?
JohnDoe_71Rus has joined #linux-sunxi
<megi> probably
<megi> it does some unbounded loop
<MoeIcenowy> megi: when I'm reading the code
<MoeIcenowy> it's forced to be bounded to 200
<megi> it randomly locked up my bootloader depending on the code size (which affected boot timing)
<megi> maybe it's bounded in Linux
<MoeIcenowy> let me check git log
<smaeul> if you have a better fix, I'll review a patch, but I have never seen time travel on any of my 6 A64 devices with the current fix
<MoeIcenowy> (My git repo is on HDD, so it's slow)
<megi> I did not either
<MoeIcenowy> smaeul: I saw it on Pinebook once, PinePhone once
<MoeIcenowy> both to the 22nd century
<MoeIcenowy> BTW dhclient seems to be not y2038-ready
<MoeIcenowy> it starts to segfault after the timetravel
<megi> works only with my other CPUX patches
<MoeIcenowy> megi: BTW, as we know this bug, I think you should submit a patch that hides the divider from the CCU driver
<megi> and u-boot patch
<megi> yeah, I'll try
<MoeIcenowy> but keeping its compatibility with older u-boot is an issue
<megi> but this is a tough pill to swallow
<megi> this will break kernels running on incompatible u-boots
<splitice> thanks megi, i'll test that. Working on applying your patches currently. Broke my kernel in the last build so stepping back with patches.
<megi> I don't think it will be accepted
<splitice> I need a faster build machine :(
<megi> splitice: good luck, I'm off :)
<MoeIcenowy> megi: if this bug is confirmed, we MUST find a way to get it accepted
<splitice> thanks mate, have a good morning/day/evening
<MoeIcenowy> reset the divisor when booting?
<MoeIcenowy> this will have a one-shot possibility to trigger the bug
<MoeIcenowy> but it prevents triggering the bug all the time
<splitice> a one shot on boot (only on an old u-boot) is infinitely better than it occurring at a random time.
<splitice> it would be great if it could be done after watchdog init then a restart could be triggered, but that would probably be more effort than it's worth
lurchi_ has joined #linux-sunxi
lurchi__ has quit [Ping timeout: 265 seconds]
aloo_shu has quit [Disconnected by services]
chewitt has quit [Quit: Zzz..]
selfbg has joined #linux-sunxi
<splitice> Ah my failure wasn't megi's patches, it was `CONFIG_RTC_CLASS=y CONFIG_RTC_INTF_DEV=y`. Turns out that's broken on h3 (sun6i-rtc/sun8i-h3-rtc)
<Werner> splitice: And I was just about to ask if this issue is related to the 1978 date thingy ^^
airgapp has joined #linux-sunxi
reinforce has joined #linux-sunxi
<splitice> certainly is related I think, assuming the issue I'm seeing is same as OPs
<montjoie> wow, the TRNG seems working on R40
JohnDoe_71Rus has quit [Ping timeout: 256 seconds]
JohnDoe_71Rus has joined #linux-sunxi
<splitice> any idea on how to easily verify cpu voltage regulation is working?
<splitice> i guess I could try and probe the component... so tiny
<KotCzarny> temperature
<KotCzarny> it's almost directly related to voltage, not freq
<KotCzarny> so stick to 628 mhz or something and start playing
<KotCzarny> montjoie: congrats! :)
Corkhat has joined #linux-sunxi
Corkhat has quit [Remote host closed the connection]
<splitice> multimeter confirms my patch works
<splitice> default configuration of the neo air dts is to always run at 1.3v, i.e. over-volt
<KotCzarny> i hope you put some heatsink/fan combo on them
<splitice> I'll package up the patch tomorrow
<splitice> KotCzarny comes with a very effective heatsink
Putti has joined #linux-sunxi
Putti has quit [Changing host]
ldevulder_ has joined #linux-sunxi
kaspter has quit [Quit: kaspter]
kaspter has joined #linux-sunxi
<splitice> I can confirm that megi's patch works to fix the issue as tested with smaeul's tool
ldevulder has quit [Ping timeout: 256 seconds]
<splitice> I'll do some testing for any introduced issues then ship to some testing devices and see if the issue continues
maccraft has joined #linux-sunxi
<KotCzarny> 12MB cache on cpu, that thing could run linux directly, har har
mauz555 has joined #linux-sunxi
suprothunderbolt has quit [Ping timeout: 258 seconds]
mauz555 has quit [Ping timeout: 272 seconds]
matthias_bgg has joined #linux-sunxi
gaston1980 has joined #linux-sunxi
yann has quit [Ping timeout: 255 seconds]
JohnDoe_71Rus has quit [Read error: Connection reset by peer]
JohnDoe_71Rus has joined #linux-sunxi
gsz has joined #linux-sunxi
dddddd has quit [Ping timeout: 258 seconds]
markk__ has joined #linux-sunxi
JohnDoe_71Rus has quit [Ping timeout: 255 seconds]
JohnDoe_71Rus has joined #linux-sunxi
tnovotny has joined #linux-sunxi
JohnDoe_71Rus has quit [Remote host closed the connection]
JohnDoe_71Rus has joined #linux-sunxi
florian_kc has joined #linux-sunxi
gsz has quit [Quit: Konversation terminated!]
JohnDoe_71Rus has quit [Client Quit]
selfbg has quit [Ping timeout: 255 seconds]
selfbg has joined #linux-sunxi
<obbardc> jernej: turns out, whatever has happened in drm since v5.4 has broken my HDMI>VGA converter
<obbardc> as it works fine on X and modetest on 1080p HDMI screen
<obbardc> but not on 1024x768 VGA monitor via the converter
<obbardc> it may have been before v5.4 i last tested the converter, not sure
<obbardc> unplugging the 1080p then plugging in the low-res screen via converter does get some output though
<obbardc> strange :-)
<KotCzarny> bad/unsupported mode (ie. clocks out of range?)
DrFrankensteinUK has quit [Ping timeout: 256 seconds]
<obbardc> maybe, it worked before with the defaults though
<montjoie> KotCzarny: I am just surprised it works, only H6 has it working until now
<KotCzarny> montjoie: fixing allwinner bugs and holes is like adventure game
AneoX has joined #linux-sunxi
DrFrankensteinUK has joined #linux-sunxi
ldevulder_ is now known as ldevulder
yann has joined #linux-sunxi
florian_kc is now known as florian
hlauer has joined #linux-sunxi
markk__ has quit [Ping timeout: 265 seconds]
matthias_bgg has quit [Ping timeout: 268 seconds]
matthias_bgg has joined #linux-sunxi
markk__ has joined #linux-sunxi
mauz555 has joined #linux-sunxi
markk__ has quit [Ping timeout: 255 seconds]
<mru> KotCzarny: a maze of twisty little passages, all alike?
JohnDoe_71Rus has joined #linux-sunxi
cnxsoft1 has quit [Read error: Connection reset by peer]
cnxsoft has joined #linux-sunxi
_whitelogger has joined #linux-sunxi
splitice has quit [Remote host closed the connection]
<willmore> You have been eaten by the Grue.
<mru> we finally know the name of the grue
<mru> it is musb
<KotCzarny> passages created by a copy paste with little changes and often not properly mapped (maps are also copypasted, but by another person)
Mangy_Dog has joined #linux-sunxi
AneoX has quit [Quit: Textual IRC Client: www.textualapp.com]
afaerber has quit [Quit: Leaving]
<megi> MoeIcenowy: it's not one shot with good odds of not happening, lockup always happens on first thermal event, when changing CPU frequency, when using dividers
<megi> MoeIcenowy: yes, you can set divider to 1 in a safe way in the kernel, you just have to wait for PLL VCO to lock on the lower frequency first, before changing the post-divider
dev1990 has joined #linux-sunxi
cnxsoft has quit [Read error: Connection reset by peer]
cnxsoft has joined #linux-sunxi
lurchi_ is now known as lurchi__
yann has quit [Read error: Connection reset by peer]
<MoeIcenowy> megi: good, then a kernel patch that locks the divider can be created
<megi> splitice: 1.3V is slightly undervolted for 1.2GHz
<megi> that's probably why you see some boards failing and some not, depending on SoC variability
<megi> and probably outside conditions, like precision of the voltage regulator
yann has joined #linux-sunxi
lurchi__ is now known as lurchi_
afaerber has joined #linux-sunxi
dddddd has joined #linux-sunxi
aloo_shu has joined #linux-sunxi
matthias_bgg has quit [Ping timeout: 255 seconds]
reinforce has quit [Quit: Leaving.]
mauz555 has quit []
aalm has quit [Ping timeout: 268 seconds]
aalm has joined #linux-sunxi
selfbg has quit [Remote host closed the connection]
cnxsoft has quit [Remote host closed the connection]
markk__ has joined #linux-sunxi
afaerber has quit [Quit: Leaving]
afaerber has joined #linux-sunxi
lurchi_ is now known as lurchi__
AneoX has joined #linux-sunxi
AneoX has quit [Client Quit]
mauz555 has joined #linux-sunxi
gsz has joined #linux-sunxi
netlynx has joined #linux-sunxi
netlynx has quit [Changing host]
netlynx has joined #linux-sunxi
hlauer has quit [Ping timeout: 258 seconds]
florian_kc has joined #linux-sunxi
matthias_bgg has joined #linux-sunxi
maccraft has quit [Quit: WeeChat 2.7.1]
maccraft has joined #linux-sunxi
lkcl has quit [Ping timeout: 260 seconds]
yann has quit [Ping timeout: 272 seconds]
lurchi__ is now known as lurchi_
lkcl has joined #linux-sunxi
reinforce has joined #linux-sunxi
gediz0x539 has joined #linux-sunxi
florian_kc has quit [Ping timeout: 258 seconds]
matthias_bgg has quit [Ping timeout: 265 seconds]
lurchi_ is now known as lurchi__
tnovotny has quit [Quit: Leaving]
gsz has quit [Quit: Konversation terminated!]
maccraft123 has joined #linux-sunxi
maccraft has quit [Ping timeout: 255 seconds]
markk__ has quit [Ping timeout: 256 seconds]
arete74 has quit [Ping timeout: 256 seconds]
maccraft123 is now known as maccraft
arete74 has joined #linux-sunxi
matthias_bgg has joined #linux-sunxi
maccraft123 has joined #linux-sunxi
matthias_bgg has quit [Ping timeout: 260 seconds]
afaerber has quit [Remote host closed the connection]
yann has joined #linux-sunxi
maccraft has quit [Ping timeout: 255 seconds]
maccraft has joined #linux-sunxi
florian_kc has joined #linux-sunxi
afaerber has joined #linux-sunxi
JohnDoe_71Rus has quit [Quit: KVIrc 5.0.0 Aria http://www.kvirc.net/]
maccraft123 has quit [Quit: WeeChat 2.7.1]
florian has quit [Disconnected by services]
florian_kc is now known as florian
lurchi__ is now known as lurchi_
florian_kc has joined #linux-sunxi
netlynx has quit [Quit: Ex-Chat]
markk__ has joined #linux-sunxi
reinforce has quit [Quit: Leaving.]
gediz0x539 has quit [Ping timeout: 265 seconds]
gediz0x539 has joined #linux-sunxi
lurchi_ is now known as lurchi__
matthias_bgg has joined #linux-sunxi
vagrantc has joined #linux-sunxi
maccraft is now known as progpol
progpol is now known as maccraft
dev1990 has quit [Quit: Konversation terminated!]
gediz0x539 has quit [Ping timeout: 255 seconds]
markk__ has quit [Ping timeout: 260 seconds]
marvs has quit [Ping timeout: 260 seconds]
marvs has joined #linux-sunxi
marvs has joined #linux-sunxi
marvs has quit [Changing host]
return0e_ has joined #linux-sunxi
return0e has quit [Ping timeout: 265 seconds]
matthias_bgg has quit [Ping timeout: 255 seconds]
gaston1980 has quit [Quit: Konversation terminated!]
vagrantc has quit [Quit: leaving]
suprothunderbolt has joined #linux-sunxi
maccraft has quit [Quit: WeeChat 2.7.1]
maccraft has joined #linux-sunxi
embed-3d has quit [Remote host closed the connection]
embed-3d has joined #linux-sunxi
mauz555 has quit [Ping timeout: 272 seconds]