<promach_>
$global_clock is the same as smt_clk , right ?
<promach_>
smt_clock
<ZipCPU>
emeb_mac: It's pipelined. That part isn't configurable. However, you can configure the size of the FFT, the number of bits in the input, the number of bits in the output, the number of multiplies used, whether or not the FFT is to accept two samples per clock, 1 sample per clock, 1 sample every two clocks, or 1 sample every three clocks.
<ZipCPU>
Beyond that, the FFT is limited by your hardware ...
<ZipCPU>
and by the fact that the updates I'm working through aren't (yet) working. So ... without the new updates that I'm working on, the FFT only does 2 samples per clock plus the other configurables.
<emeb_mac>
ZipCPU: sounds pretty useful.
<ZipCPU>
The 1 clock per sample and the 2 clocks per sample just passed my test at 2048 points! Yaaay ... (3 clocks per sample still fails)
<ZipCPU>
Oh, and thanks!
<emeb_mac>
the radio I'm working on now needs two 1024-pt 16-bit in/out transforms for the RX and a 2048 16 in/out on the TX.
<emeb_mac>
we're using the Xilinx IP core for these and we've been pretty lucky that they are working well
<emeb_mac>
(other IP cores from Xilinx have turned out to be disasters)
<ZipCPU>
;) Yeah, the Xilinx cores seem to be reliable enough.
<ZipCPU>
Once mine start working, they'll still need some tuning and optimization to match what Xilinx has done ... if I can do it at all.
<emeb_mac>
ZipCPU: that's strong competition - they've got good performance and well optimized resource usage.
<ZipCPU>
Exactly. Like I said, I don't know if I'll make a good showing in the end there, but at least what I have will work.
<emeb_mac>
IIRC the 1kpt transforms we use require only 18 of the MAC cores
<ZipCPU>
Let's see ... in the one I'm doing, you can tell it how many DSP cores you have. If you go full bore, 1 sample per clock, you'll need (10-2)*3 = 24 multiplies.
<ZipCPU>
You could do it for less, at the cost of more LUTs.
<ZipCPU>
On the other hand, if you want 2 clocks per sample, it would take (10-2)*2 = 16 multiplies, or if you want 3 clocks per sample it will take 8 multiplies.
<emeb_mac>
That's not bad
<ZipCPU>
On the other hand, if you are doing two samples per clock, then you'll want (10-2)*6 = 48 multiplies. It's all a tradeoff.
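[Editor's sketch] The multiplier counts quoted above for a 1k-point FFT all fit one pattern: (10-2) times a per-stage figure of 6, 3, 2, or 1. The Python below is a minimal sketch of that arithmetic. It assumes the 10 is log2(1024) and that the "-2" generalizes to other sizes as log2(N) - 2; the chat only confirms the 1k case. The function name and the rate labels are hypothetical, chosen for this illustration.

```python
# Multiplies per stage for each throughput option, taken from the figures
# quoted in the chat. The dictionary keys are labels invented for this sketch.
MULTS_PER_STAGE = {
    "2 samples/clock": 6,
    "1 sample/clock": 3,
    "1 sample/2 clocks": 2,
    "1 sample/3 clocks": 1,
}


def hard_multiplies(fft_size, rate):
    """Estimate hardware multiplies as (log2(N) - 2) * multiplies-per-stage.

    Assumption: the (10 - 2) quoted for a 1k FFT scales as log2(N) - 2.
    """
    if fft_size < 1 or fft_size & (fft_size - 1):
        raise ValueError("fft_size must be a power of two")
    stages = fft_size.bit_length() - 1  # exact log2 for a power of two
    return max(stages - 2, 0) * MULTS_PER_STAGE[rate]


if __name__ == "__main__":
    for rate in MULTS_PER_STAGE:
        print(rate, hard_multiplies(1024, rate))
```

For 1024 points this reproduces the 48, 24, 16, and 8 multiplies mentioned in the conversation; figures for any other size are an extrapolation.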
<ZipCPU>
The soft multiply option isn't all that efficient though. I've got a slower option that's more efficient still, and I've thought of integrating that one in later.
<ZipCPU>
The current soft multiply used is fully pipelined, so ... it requires a lot of flip flops and luts at every stage of the multiply.
<emeb_mac>
Nice to have the option for soft multiplies
<ZipCPU>
Yep!
<promach_>
ZipCPU: which soft multiply are you referring to ?
<ZipCPU>
What do you mean?
<promach_>
you have your own multiply algo ?
<ZipCPU>
Yes.
<emeb_mac>
I try to avoid FPGAs w/o some hard multiplier resources for DSP stuff, but sometimes you gotta go with what's available
<promach_>
wallace, I suppose ?
<ZipCPU>
It's a basic shift/add multiply, nothing fancy in this case.
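[Editor's sketch] A basic shift/add multiply can be modeled in a few lines of Python. This is a behavioral illustration of the general technique only, not ZipCPU's Verilog: it handles unsigned operands, and the function name and `width` parameter are hypothetical. In a fully pipelined hardware version, each loop iteration below would roughly correspond to one pipeline stage, which is consistent with the earlier remark about flip-flops and LUTs at every stage.

```python
def shift_add_multiply(a, b, width):
    """Unsigned multiply of two width-bit values via shift and add.

    For every set bit i of b, add (a << i) to the accumulator.
    """
    mask = (1 << width) - 1
    a &= mask  # keep only the low `width` bits, as a hardware port would
    b &= mask
    acc = 0
    for i in range(width):
        if (b >> i) & 1:
            acc += a << i  # partial product for bit i of b
    return acc


if __name__ == "__main__":
    print(shift_add_multiply(5, 6, 3))
```

Because the operand width is a parameter, small widths can be checked exhaustively against Python's built-in `*` operator.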
<promach_>
ok
<ZipCPU>
I've built a wallace before, but ... the FFT doesn't use it.
<ZipCPU>
Ok, 1, 2, and 3 clocks per sample now works using hardware multiplies, time to double check the soft multiplies
<emeb_mac>
so does 1clk/sample allow continuous feed w/o any gaps?
<ZipCPU>
Yes.
<ZipCPU>
You can also feed it with unpredictable gaps too.
<emeb_mac>
roughly what latency do you see from input to output?
<ZipCPU>
Depends on the size of the FFT. Curious about a 1k FFT? I can go measure that.
<emeb_mac>
yeah!
<ZipCPU>
Looks like about 4176 clocks from the start of the first frame going in to the start of the first output frame.
<ZipCPU>
There's probably a couple clocks in there I could whittle out if latency was an issue, but that's what it is currently.
<emeb_mac>
That's not bad.
<ZipCPU>
Are you looking for low latency?
<emeb_mac>
Generally yes - these radio designs tend to have fairly long datapaths with lots of things going on in them.
<ZipCPU>
I'm not quite sure how I would, or if I would, redesign things for lower latency.
<emeb_mac>
IIRC the cores we use have about 3k clocks latency. I don't think 4k would be a huge disadvantage tho
<ZipCPU>
Hmm ... not sure where I'd find a full 1k latency from this design ....
<ZipCPU>
Sure, there's a clock or two in each stage, but at ten stages that'd be at most 20 clocks.
<emeb_mac>
Well, you're ahead of me. I've never thought too much about how to build an FFT.
<emeb_mac>
about 20 years ago a guy I shared an office with architected one as a single-chip ASIC so I've only had peripheral exposure to it from discussing w/ him.
<ZipCPU>
:)
<ZipCPU>
I suppose I might go faster if I did something other than a Radix two FFT ...
* ZipCPU
tugs at his beard
<emeb_mac>
Aha - that must be it. Radix-4 was part of the optimization he did on his.
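[Editor's sketch] The one thing that is certain about radix-2 versus radix-4 is the stage count: a radix-r FFT of size N has log_r(N) stages. The Python below only counts stages; it does not model latency, and whether radix-4 accounts for the latency gap discussed here is left open by the chat itself. The function name is hypothetical.

```python
def stage_count(fft_size, radix):
    """Number of butterfly stages in a pure radix-`radix` FFT of `fft_size` points."""
    if radix < 2 or fft_size < 1:
        raise ValueError("radix must be >= 2 and fft_size >= 1")
    stages = 0
    n = fft_size
    while n > 1:
        if n % radix:
            raise ValueError("fft_size is not a power of the radix")
        n //= radix
        stages += 1
    return stages


if __name__ == "__main__":
    print(stage_count(1024, 2), stage_count(1024, 4))
```

A 1024-point transform needs 10 radix-2 stages or 5 radix-4 stages. A 2048-point transform is not a power of 4, so a pure radix-4 design does not apply to it and the function raises `ValueError`.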
<ZipCPU>
I might have to look into that in the future.
<ZipCPU>
For now, I just want to get it running in the first place.
<ZipCPU>
I'm pretty close, but ... not all cases work (yet)
<cr1901_modern>
FFT was one of those things where I had to derive "how it works" exactly once and now I don't remember how to do it :(. I know you can split into bins by time or frequency (either works), but Idk if any way is better
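[Editor's sketch] The two ways of splitting mentioned here are usually called decimation in time (DIT) and decimation in frequency (DIF). The Python below is a minimal recursive sketch of both for power-of-two sizes, with a direct DFT as reference. It is a textbook illustration, not the structure of any core discussed in the chat, and all function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_dit(x):
    """Radix-2 decimation in time: split the INPUT into even/odd samples."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle BEFORE add/sub
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def fft_dif(x):
    """Radix-2 decimation in frequency: split the OUTPUT into even/odd bins."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    top = [x[i] + x[i + half] for i in range(half)]
    bot = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
           for i in range(half)]  # twiddle AFTER add/sub
    out = [0] * n
    out[0::2] = fft_dif(top)  # even-numbered bins
    out[1::2] = fft_dif(bot)  # odd-numbered bins
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 0, -1, 2.5, 1j]
    print(max(abs(a - b) for a, b in zip(fft_dit(x), fft_dif(x))))
```

Both variants perform the same number of butterflies, (N/2)·log2(N), in this sketch. They differ in whether the twiddle multiply comes before or after the add/subtract and in which side of the transform is reordered.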
<emeb_mac>
ZipCPU: You've spoken before about the difficulty of applying formal to multipliers. Would it be a safe assumption that formal is generally not practical for DSP datapaths which rely heavily on math operations like multiplication / division / transformation?
<ZipCPU>
Yes and no .... there are some ways around the problems.
<emeb_mac>
I get the impression that the best way to apply formal in these types of designs is to partition complex control logic out and apply formal at the unit level.
<ZipCPU>
I've had mixed success with data paths including multiplication or division.
* ZipCPU
rummages through his designs for an example ....
<emeb_mac>
Would you call that exercise difficult? I have very little basis for comparison, but it seems somewhat contorted compared to simply running a stimulus / response simulation. Does it provide you with significantly more confidence in the design than a simpler approach?
<ZipCPU>
Not sure.
<ZipCPU>
Let's just say that, in this example, the jury is still out.
<ZipCPU>
Consider this, I'm working with a perfect example right now ... I have 6 types of code for a butterfly. Three use DSP elements, three do not.
<ZipCPU>
The three that use DSP elements work, the three that do not ... don't.
<ZipCPU>
I'm trying to find out why.
<ZipCPU>
If I try to apply formal methods to those other three right now, the formal methods don't complete. The multiply is just too difficult for them.
<ZipCPU>
Even when I bring it down to a three bit multiply they are struggling.
<ZipCPU>
For example, one of those soft multiply-based butterflies has now run its formal proof for over 12 hours, and has only made it to state 14 of 30.
<ZipCPU>
On the other hand, the two butterflies that didn't require hardware multiplies could be formally verified quite quickly.
<cr1901_modern>
it does look cool. Idk what I could do w/ it tho
<awygle>
cr1901_modern is correct
<ZipCPU>
Thanks, that makes a lot more sense than the other.
<cr1901_modern>
I've had my ham radio license since the end of 2013; I've made like 4 or so contacts b/c I don't like voice all that much, and there's little to no digital activity