<_whitenotifier-b>
[scopehal] azonenberg pushed 3 commits to master [+0/-0/±5] https://git.io/JJ1Rt
<_whitenotifier-b>
[scopehal] azonenberg 39993e7 - Merged Convert8BitSamples and FillWaveformHeaders
<_whitenotifier-b>
[scopehal] azonenberg 6c83322 - LeCroyOscilloscope: use OpenMP to parallelize conversions of >1M point waveforms
<_whitenotifier-b>
[scopehal] azonenberg 5f083bb - Set up preferred thread count
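For readers following the commits above: a minimal sketch of what size-gated OpenMP parallelism for sample conversion could look like. The function name, the `gain`/`offset` convention, and the exact threshold are assumptions for illustration, not the code in 6c83322.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: convert raw 8-bit ADC codes to volts as
// v = code * gain - offset. Names and threshold are illustrative only.
void Convert8BitSamplesSketch(float* out, const int8_t* in, size_t count, float gain, float offset)
{
    assert(count == 0 || (out != nullptr && in != nullptr));
    const long long n = static_cast<long long>(count);

    // Only fork worker threads for large waveforms; for short ones the
    // fork/join overhead would likely outweigh the gain. When built without
    // -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for if(n > 1000000)
    for(long long i = 0; i < n; i++)
        out[i] = in[i] * gain - offset;
}
```

Each output element depends only on its own input, so the loop iterations are independent and need no locking.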
<_whitenotifier-b>
[scopehal] azonenberg pushed 1 commit to master [+0/-0/±4] https://git.io/JJ12l
<_whitenotifier-b>
[scopehal] azonenberg 932bc15 - Performance improvements to LeCroy/VICP driver, reduced a bunch of needless data copying
electronic_eel has quit [Ping timeout: 256 seconds]
electronic_eel has joined #scopehal
Nero_ has joined #scopehal
Nero_ is now known as Guest86320
Guest86320 is now known as NeroTHz
<_whitenotifier-b>
[scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JJ1PM
<_whitenotifier-b>
[scopehal] azonenberg 22a7b22 - Don't mess with thread count here, it's now done in ScopeThread
<_whitenotifier-b>
[scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±3] https://git.io/JJ1PD
<_whitenotifier-b>
[scopehal-apps] azonenberg e0dde4c - Set thread count in ScopeThread based on number of cores
<azonenberg>
So I'm thinking about bumping the minimum OpenGL version up to 4.5
<azonenberg>
This requires a Radeon HD 5000 series, Broadwell or newer integrated GPU, or GeForce 400 or newer
<azonenberg>
Current minimum is 4.3, which has the same discrete GPU requirements but will work back to Haswell integrated gfx
<azonenberg>
So basically bumping to 4.5 means we will no longer work on 2013-vintage integrated GPUs
<azonenberg>
and anything with a more recent cpu or discrete gpu is unaffected
<azonenberg>
Any objections?
<azonenberg>
monochroma, lain, Degi?
<azonenberg>
noopwafel?
<lain>
doesn't bother me
<monochroma>
for what gain?
<azonenberg>
monochroma: Direct state access
<azonenberg>
basically allows you to modify a GL object by the handle rather than binding and then using the implied current object
<azonenberg>
it's a much nicer API
<monochroma>
i think restricting what hardware it can work on even more is not great, if both were supported in parallel that would be fine. but that's just my $0.02
<azonenberg>
yes I know. But at the same time, how many people are trying to use glscopeclient on something that supports 4.3 but not 4.5?
<azonenberg>
as far as i can tell the only possible configuration is a haswell system with no discrete gpu
<azonenberg>
That's a pretty niche thing to be worried about
<azonenberg>
If it required a new discrete gpu generation i'd be a lot more concerned
<azonenberg>
almost anybody who still has a system that old probably has a discrete gpu for it to make it usable for anything modern, in which case they're good
<azonenberg>
the only configuration i can imagine being affected would be a scope or other embedded platform with no free pcie slots and a haswell iGPU
<azonenberg>
and at least lecroy went from ivy bridge to skylake with their motherboards :p
<monochroma>
i guess we will see what happens
<azonenberg>
I mean in general i want to have good hardware support, but i also do not have the resources to have a zillion implementations of everything
<azonenberg>
and haswell is fairly old
<azonenberg>
and like i said a haswell desktop can just get a 2014-or-newer gpu installed and it'll be fine
<azonenberg>
It already doesn't work in virtualbox because it needs gl newer than 2.1
* monochroma
looks at her main desktop with an i5-4440 haswell that usually doesn't have a discrete GPU in it :P
<azonenberg>
Lol
<azonenberg>
i mean i still have a haswell desktop too
<azonenberg>
But it has a discrete gpu
<azonenberg>
Do you actually use glscopeclient on it?
<azonenberg>
basically what i'm trying to do is multithread the "copy waveform data to the gpu" logic
<sorear>
I misread that as Radeon RX 5000 series and was about to object to your deprecation timeline
<azonenberg>
This is hard to do without direct state access
<monochroma>
Model name: Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
<sorear>
they have the same number so they're the same year right
<monochroma>
:P
<azonenberg>
monochroma: ok fine i'll see what i can do to keep it at 4.3 for now
<azonenberg>
if you care that much?
<azonenberg>
I will give you a discrete gpu if it comes to that :p
<monochroma>
it's not a big deal, i haven't had time for scope dev in a while, so when i have time i will figure something out
<azonenberg>
ah ok
<monochroma>
but, idk how many other people are running similar configs
<azonenberg>
i mean to be quite honest you're due for an upgrade anyway
<azonenberg>
Yeah that's why i was asking around
<azonenberg>
i dont want to require the latest and greatest hardware, so for example all of the AVX optimization i've done recently has runtime cpu detection and there are fallbacks to the old versions
<azonenberg>
But i feel like needing a 6-year-old computer is not an unreasonable requirement
<azonenberg>
needing a modern xeon is :p
<azonenberg>
what i'm hoping to do here is eliminate a bunch of copies
<azonenberg>
right now i create temporary buffers, write waveform data into them, then glBufferData them to the GPU
<azonenberg>
What i want to do instead is glMapNamedBuffer() then write directly into the buffer as i convert from the internal representation to the GPU-friendly representation
<Degi>
hi
<Degi>
Hm then it probably won't run on my laptop but I've never used it there either
<Degi>
Oh nevermind, I don't think that 4.3 does either
<Degi>
Hmh, my newest laptop has a 4210M... It'd be kinda nice if we can support both, or an alt mode where it does CPU rendering
<azonenberg>
Degi: that's not happening any time soon. i'm moving more and more stuff to compute shaders
<Degi>
Hmh, maybe some kinda legacy mode
<azonenberg>
that was kinda always the endgame, doing protocol decodes and math on the gpu
<azonenberg>
That would basically require two implementations of everything
<azonenberg>
I don't have the resources for that
<Degi>
Okay
<azonenberg>
Right now the core non-negotiable requirements are a 64-bit CPU and a gpu with compute shader support
<azonenberg>
I'm going to try to hold off on needing gl4.5
<_whitenotifier-b>
[scopehal-apps] azonenberg pushed 3 commits to master [+0/-0/±12] https://git.io/JJ1y4
<_whitenotifier-b>
[scopehal-apps] azonenberg 7367430 - Refactoring: split PrepareGeometry into PrepareGeometry (parallelizable) + DownloadGeometry (must run in render thread)
<_whitenotifier-b>
[scopehal-apps] azonenberg e51ce62 - WaveformArea::PrepareGeometry now outputs directly to OpenGL memory rather than doing separate glBufferData operations
<azonenberg>
There's still just a little bit more work being done in the main thread than I'd like but this is good progress and has eliminated a lot of wasted work
<Degi>
Oh neat
<noopwafel>
as long as it's just glscopeclient, do whatever makes sense for you, I think
<noopwafel>
(from my perspective)
<azonenberg>
noopwafel: right now my "has to still work on this" system is a haswell i5 i got near the end of my phd, with an nvidia discrete gpu
<noopwafel>
I will swap out haswell thing for something with a Radeon 540
<noopwafel>
myself
<azonenberg>
anyway i decided to stick with GL 4.3 for the short term
<azonenberg>
so haswell integrated gfx will work
<azonenberg>
sandy/ivy bridge will not, but they never did
* monochroma
pats her sad x220 :<
<azonenberg>
monochroma: oh come on get lain to put in some pcie bodgewires
<azonenberg>
swap out one of the usb2 ports w/ usb3 and put pcie on the SS pins
<azonenberg>
then make a 2080 ti dongle
<azonenberg>
:P
<noopwafel>
it is not going to work on my poor x200 whatever happens, which are the cheap disposable laptops I dumped on tables so I could do workshops at CCC
<azonenberg>
Yeah. I don't expect it will ever work without compute shader support
<azonenberg>
That was a core requirement almost from day one
<noopwafel>
but for this I can just do a horrible alternative frontend, given my previous UI was a python thing with a graph, some comboboxes and an 'arm!' button :-)
<azonenberg>
lol
<azonenberg>
yeah the scopehal library will work on whatever
<azonenberg>
I do have special case optimizations for avx2 and avx512 in the lecroy driver now
<azonenberg>
but i dynamically swap those in based on cpuid detection and fall back to a default "gcc -m64" build otherwise
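A rough sketch of the kind of runtime dispatch described above, assuming GCC or Clang on x86 (`__builtin_cpu_supports` and the `target` attribute are compiler-specific). The function names and the conversion formula are hypothetical and are not the LeCroy driver's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_X86 1
#endif

// Portable fallback: v = code * gain - offset
void ConvertGeneric(float* out, const int8_t* in, size_t n, float gain, float offset)
{
    for(size_t i = 0; i < n; i++)
        out[i] = in[i] * gain - offset;
}

#ifdef SKETCH_X86
// AVX2 variant, compiled for AVX2 even if the rest of the file is not.
__attribute__((target("avx2")))
void ConvertAVX2(float* out, const int8_t* in, size_t n, float gain, float offset)
{
    const __m256 g = _mm256_set1_ps(gain);
    const __m256 o = _mm256_set1_ps(offset);
    size_t i = 0;
    for(; i + 8 <= n; i += 8)
    {
        // Load 8 bytes, sign-extend to 8 x int32, convert to float
        __m128i raw = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(in + i));
        __m256 f = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(raw));
        _mm256_storeu_ps(out + i, _mm256_sub_ps(_mm256_mul_ps(f, g), o));
    }
    for(; i < n; i++)   // scalar tail for the last few samples
        out[i] = in[i] * gain - offset;
}
#endif

using ConvertFn = void (*)(float*, const int8_t*, size_t, float, float);

// Pick the best implementation the running CPU supports.
ConvertFn SelectConvert()
{
#ifdef SKETCH_X86
    if(__builtin_cpu_supports("avx2"))
        return ConvertAVX2;
#endif
    return ConvertGeneric;
}

void Convert(float* out, const int8_t* in, size_t n, float gain, float offset)
{
    assert(n == 0 || (out != nullptr && in != nullptr));
    static const ConvertFn fn = SelectConvert();   // detect once, reuse afterwards
    fn(out, in, n, gain, offset);
}
```

The binary itself only needs a baseline 64-bit build; the vectorized path is chosen at run time, so older CPUs still get the generic loop.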
<noopwafel>
ah right, I rescued my pico
<noopwafel>
so I should see how much I can get
<electronic_eel>
my pc on the electronics workbench is a sandy bridge i3-2100. currently I mostly need it for viewing ibom and hooking up some jtag or serial adapters. I guess I'll have to upgrade when I want a 40GbE link to the scope there ;)
<noopwafel>
it can pull 200MS/s so I guess I will be very cpu-limited too
<noopwafel>
might be good motivation to add oversampling support
<azonenberg>
noopwafel: oversampling what?
<noopwafel>
azonenberg: oversampling on the scope side, because playing with streaming
<azonenberg>
oh you mean to inflate the number of samples we get?
<azonenberg>
electronic_eel: yeah good luck saturating 40GbE with that
<noopwafel>
I mean getting the fpga to average together 10 traces or so, reducing stream to 100MS/s
<azonenberg>
oh you mean hardware averaging?
<noopwafel>
yeah
<noopwafel>
pico call it 'resolution enhancement' and people around me use all kinds of different terms to refer to this :-)
<noopwafel>
it's a bit fiddly driver-side because suddenly you can't just pass around int8_t any more
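To illustrate why `int8_t` stops being enough: a small host-side sketch of boxcar averaging, assuming groups of `factor` consecutive 8-bit samples are summed into one wider output sample. The function name is hypothetical, and this is not how the Pico hardware or any scopehal driver is known to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum each group of `factor` consecutive int8 samples into one int16 sample.
// Returning the sum (rather than the rounded mean) keeps the extra bits of
// resolution; the caller divides by `factor` when scaling to volts.
std::vector<int16_t> BoxcarDecimate(const std::vector<int8_t>& in, size_t factor)
{
    // Up to 256 int8 values can be summed without overflowing int16
    assert(factor >= 1 && factor <= 256);

    std::vector<int16_t> out;
    out.reserve(in.size() / factor);

    // Any incomplete group at the end of the input is dropped
    for(size_t i = 0; i + factor <= in.size(); i += factor)
    {
        int32_t sum = 0;
        for(size_t j = 0; j < factor; j++)
            sum += in[i + j];
        out.push_back(static_cast<int16_t>(sum));
    }
    return out;
}
```

Ten full-scale samples of 127 sum to 1270, which no longer fits in 8 bits, hence the wider sample type throughout the driver.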
* monochroma
imagines an oscilloscope with a "Turbo" button
<azonenberg>
noopwafel: yeah i already have "HD mode" support on lecroy for reading 16-bit samples (typically 10/12 bits on the hardware side but padded to 16) from HDO series scopes
<noopwafel>
ah right they call it downsampling internally, because different modes
<noopwafel>
azonenberg: nice, I can re-use more of your code :-)
<azonenberg>
noopwafel: also based on current profiling of the code running on a lecroy at 400 Mbps
<azonenberg>
i have concerns re our ability to do 10G or 40G on a single thread viably
<azonenberg>
Even in push mode
<azonenberg>
We may want to consider a multi-stream protocol so we can use several threads for RX
<azonenberg>
say one socket per channel or something
<noopwafel>
that would I guess be very easy to do
<azonenberg>
Yeah one per channel would map nicely to what we have now. Would just need some sync logic to ensure the waveforms are aligned right to the same trigger
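One way the sync logic could be sketched, under the assumption that every waveform carries a trigger sequence number and that each channel's socket delivers waveforms in order. Both the `Waveform` struct and the `TriggerAligner` class are hypothetical; no such protocol exists in the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical per-channel waveform tagged with the trigger it belongs to
struct Waveform
{
    uint64_t triggerSeq;
    std::vector<int8_t> samples;
};

class TriggerAligner
{
public:
    explicit TriggerAligner(size_t numChannels) : m_numChannels(numChannels)
    { assert(numChannels > 0); }

    // Called from each channel's receive thread
    void Push(size_t channel, Waveform wfm)
    {
        assert(channel < m_numChannels);
        std::lock_guard<std::mutex> lock(m_mutex);
        const uint64_t seq = wfm.triggerSeq;
        m_pending[seq][channel] = std::move(wfm);
    }

    // Returns true and fills `out` (ordered by channel) once some trigger has
    // data from every channel. Older incomplete triggers are discarded: with
    // in-order delivery they can never complete once a newer one has.
    bool TryPop(std::vector<Waveform>& out)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        for(auto it = m_pending.begin(); it != m_pending.end(); ++it)
        {
            if(it->second.size() != m_numChannels)
                continue;
            out.clear();
            for(auto& kv : it->second)
                out.push_back(std::move(kv.second));
            m_pending.erase(m_pending.begin(), std::next(it));
            return true;
        }
        return false;
    }

private:
    size_t m_numChannels;
    std::mutex m_mutex;
    std::map<uint64_t, std::map<size_t, Waveform>> m_pending;
};
```

The receive threads only ever call `Push`, so they stay decoupled; a single consumer polls `TryPop` to get complete, trigger-aligned sets.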
<_whitenotifier-b>
[scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JJ1NZ
<_whitenotifier-b>
[scopehal] azonenberg ede70ff - DeEmbedDecoder: Cache FFT plan between iterations
<azonenberg>
noopwafel, miek: ok so yeah i had no idea fft setup was such an expensive operation. i thought the function was just filling out a context object or something (I know ~zilch about how FFTs work under the hood)
<azonenberg>
Test setup: 1 minute on 4 channels, three just rendering and one with four different emulated channels on it (an AKL-PT1 with each ground accessory)
<azonenberg>
Profiling run 32: 90.48 CPU-sec, 11.139 sec in SParameterVector::InterpolatePoint, 23.661 in ffts_init_1d_real, and a total of 45.248 in DeEmbedDecoder::DoRefresh()
<azonenberg>
Run 34: 60.974 CPU-sec, interpolation doesn't even show up in the list of hot spots, init is just a couple of milliseconds
<azonenberg>
DeEmbedDecoder::DoRefresh() now takes 13.114 cpu-sec
<azonenberg>
So it's 2.5x faster now lol
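The caching pattern behind that speedup can be sketched generically. The example below is a stand-in, not the DeEmbedDecoder code: it uses a naive O(N²) DFT whose precomputed twiddle table plays the role of the expensive plan, and rebuilds it only when the input size changes. In the real filter the cached object would be whatever `ffts_init_1d_real` returns.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Stand-in for an FFT library plan: setup is the costly part
struct DftPlan
{
    size_t size = 0;
    std::vector<std::complex<double>> twiddles;
};

class CachedDft
{
public:
    std::vector<std::complex<double>> Forward(const std::vector<double>& in)
    {
        assert(!in.empty());
        const size_t n = in.size();

        // Reuse the plan from the previous call unless the size changed
        if(m_plan.size != n)
            Rebuild(n);

        std::vector<std::complex<double>> out(n);
        for(size_t k = 0; k < n; k++)
        {
            std::complex<double> acc(0.0, 0.0);
            for(size_t t = 0; t < n; t++)
                acc += in[t] * m_plan.twiddles[(k * t) % n];
            out[k] = acc;
        }
        return out;
    }

    size_t GetRebuildCount() const
    { return m_rebuilds; }

private:
    void Rebuild(size_t n)
    {
        const double pi = std::acos(-1.0);
        m_plan.size = n;
        m_plan.twiddles.resize(n);
        for(size_t i = 0; i < n; i++)
            m_plan.twiddles[i] = std::polar(1.0, -2.0 * pi * static_cast<double>(i) / static_cast<double>(n));
        m_rebuilds++;
    }

    DftPlan m_plan;
    size_t m_rebuilds = 0;
};
```

Since successive waveforms from a scope usually have the same length, the rebuild path is hit once and every later refresh pays only for the transform itself.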
<NeroTHz>
azonenberg, if you have low point count, my experience is that a spline interpolation can be good. You avoid sharp edges causing time-domain ripple
<azonenberg>
and that's without any vectorization or optimization on the actual de-embedding loop itself which is where most of the time goes i think
<azonenberg>
the main inner loop is 4.1 sec of that 13
<NeroTHz>
of course, if you have enough points it is fine, but I also worked a lot with simulated results, and when you have your cluster spend 2 hours per frequency point, I want to have as few of those as I can get away with
<NeroTHz>
but I guess for now it is not a priority, as your application does mean you can measure it usually :p
<azonenberg>
Yeah one of the things on my near term todo is getting data to ground truth my channel emulation
<azonenberg>
so measure a signal directly, measure s-parameters of a 2-port network
<miek>
azonenberg: lol, nice
<azonenberg>
then measure the signal through that network and compare to channel emulation on the original signal
<azonenberg>
Anyway there's definitely room to optimize more here
<miek>
azonenberg: iirc the fftw docs are pretty good on mentioning stuff like that - may be worth a read even if you're not using fftw itself, it probably applies across other impls
<azonenberg>
6.089 sec are in the output loop, of which most is spent in STL push_back() calls
<azonenberg>
because i forgot to preallocate the output buffer
<azonenberg>
then 4.178 is the actual de-embedding loop which should be possible to vectorize
<azonenberg>
then 0.882 is the forward FFT and 1.274 is the inverse fft
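The output-loop cost mentioned above is the classic missing-`reserve()` pattern. A minimal illustration with a hypothetical function (not the DeEmbedDecoder output loop itself):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical output stage: the element count is known up front, so
// allocate once instead of letting push_back() grow the vector repeatedly.
std::vector<float> ScaleSamples(const std::vector<float>& in, float scale)
{
    std::vector<float> out;
    out.reserve(in.size());                 // single allocation
    assert(out.capacity() >= in.size());

    for(float v : in)
        out.push_back(v * scale);           // no reallocation or copying here
    return out;
}
```

Without the `reserve()` call the vector reallocates and copies its contents each time it outgrows its capacity, which adds up on multi-million-point waveforms.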
<Degi>
Is it possible to make that part more efficient
<azonenberg>
I think all of it except the FFT can be optimized quite a bit
<azonenberg>
i'm not going to try to make ffts faster :p
<Degi>
I mean maybe doing away with it idk
<azonenberg>
Once i'm done tuning this, the CTLE and FFT filters can likely get some of the same tweaks applied
<NeroTHz>
the way many commercial software packages use it is to use the s-parameters to generate a few taps worth of equalizer
<azonenberg>
NeroTHz: I may make a *separate* filter that does this
<azonenberg>
But this filter is specifically for full channel emulation and de-embedding, rather than basic cable loss compensation
<azonenberg>
Long term i want to have a large toolbox of filters that do similar things but have different implementations that are specialized for various purposes
<miek>
i spent quite a bit of time optimising fft stuff for my spectrogram viewer, that happily lets you pan through >>200GB sdr captures like it's nothing ;D
<azonenberg>
for example a FFT based CTLE that's very faithful to how an actual hardware CTLE would work
<Degi>
miek: Oh neat!
<azonenberg>
or a separate FIR based equalizer that has similar frequency response
<azonenberg>
and lets you trade speed off against accuracy
<azonenberg>
i.e. more or less taps
<azonenberg>
But first i want to get the mathematically "clean" implementation done
<azonenberg>
then worry about "close enough for most purposes" optimizations
<azonenberg>
All of the tuning i've done so far has been either trivial algorithmic optimizations like caching values instead of recomputing them, or straightforward vectorizations of an existing implementation
<azonenberg>
Long term i want the vast majority of the performance critical DSP code to be either GPU or heavily vectorized CPU
<azonenberg>
But obviously initial implementations are focusing on 'make it work first'
<miek>
btw, one other thing i found when profiling stuff like this way back was log10 being really slow. swapping out for log2(x)/log2(10) was way better ¯\_(ツ)_/¯
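A sketch of that substitution, with the constant factor precomputed so each call is one `log2` and one multiply. The function name is made up, and whether it actually beats `log10` depends on the math library and CPU, so it is worth profiling before adopting.

```cpp
#include <cassert>
#include <cmath>

// log10(x) computed as log2(x) / log2(10), with the divisor folded into a
// precomputed reciprocal.
inline float Log10ViaLog2(float x)
{
    assert(x > 0.0f);
    static const float invLog2Of10 = 1.0f / std::log2(10.0f);
    return std::log2(x) * invLog2Of10;
}
```

The result can differ from `std::log10` in the last bit or two, which is typically irrelevant for dB scaling on a display.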
maartenBE has quit [Ping timeout: 265 seconds]
maartenBE has joined #scopehal
_whitelogger has joined #scopehal
NeroTHz has quit [Read error: Connection reset by peer]
m4ssi has joined #scopehal
m4ssi has quit [Remote host closed the connection]
m4ssi has joined #scopehal
m4ssi has quit [Remote host closed the connection]
bvernoux has quit [Quit: Leaving]
m4ssi has joined #scopehal
m4ssi has quit [Remote host closed the connection]