"grand central"? Mega format SAMD51? YES!
Moderators: adafruit_support_bill, adafruit

Please be positive and constructive with your questions and comments.

Re: "grand central"? Mega format SAMD51? YES!

by danhalbert on Tue Dec 25, 2018 4:46 pm

The DSP support means the additional "DSP" instructions in the Cortex-M4 instruction set of the SAMD51J microcontroller. There's no additional DSP capability on the board. Here's an overview: https://community.arm.com/processors/b/ ... 1544851197. This is not a specialized DSP chip, but it does have hardware support for various multiply operations, some simple SIMD instructions, etc.

danhalbert
 
Posts: 1420
Joined: Tue Aug 08, 2017 12:37 pm

Re: "grand central"? Mega format SAMD51? YES!

by westfw on Wed Dec 26, 2018 5:33 am

The Cortex-M4 CPU in the chip has "DSP instructions", which can in theory be used by the compiler, or via the CMSIS-DSP library.
You have to look at an ARM manual for the details. (And it's the same DSP support that the Metro M4, ItsyBitsy M4, Feather M4, or any other CM4-based board (Teensy 3.x) would have...)

(Let's see: parallel 8/16 bit add/subtract, assorted multiply&add, saturated math, packing and unpacking...)
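
To make "parallel 16 bit add" and "saturated math" concrete, here is a minimal sketch in portable C that models what a saturating dual 16-bit add does (QADD16 is the instruction I have in mind). The function names are made up for illustration, and this is plain C that mimics the behaviour, not the instruction itself.

```c
#include <stdint.h>

/* Clamp a wider intermediate to the signed 16-bit range ("saturated math"). */
static inline int16_t sat16_model(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Model of a parallel saturating add: each 32-bit word carries two signed
   16-bit lanes, and the lanes are added independently of each other. */
static inline uint32_t qadd16_model(uint32_t a, uint32_t b)
{
    int32_t lo = (int32_t)(int16_t)(a & 0xFFFFu) + (int16_t)(b & 0xFFFFu);
    int32_t hi = (int32_t)(int16_t)(a >> 16) + (int16_t)(b >> 16);
    return ((uint32_t)(uint16_t)sat16_model(hi) << 16)
         | (uint16_t)sat16_model(lo);
}
```

With this model, adding 0x7FFF0001 and 0x00010001 gives 0x7FFF0002: the top lane clips at 32767 instead of wrapping around, while the bottom lane adds normally.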
westfw
 
Posts: 1509
Joined: Fri Apr 27, 2007 1:01 pm
Location: SF Bay area

Re: "grand central"? Mega format SAMD51? YES!

by dar Kale on Wed Dec 26, 2018 5:49 pm

Thanks @danhalbert and @westfw! Who says you can't teach an old dog new tricks? So this stuff brings up 10 more questions of course. The last time I touched an Arm chip, it was to write some simple assembly language code about 15 years ago; at present, I have no new knowledge of the Arm ecosystem. So...
1) Guessing here that there is probably not much Arduino support for those DSP instructions? No? Yes?
-and-
2) What is available in low cost (or free) compilers that could be used for tickling those same DSP instructions?

Thanks again!!!
dar Kale
 
Posts: 39
Joined: Thu Feb 13, 2014 10:23 pm

Re: "grand central"? Mega format SAMD51? YES!

by paulstoffregen on Wed Dec 26, 2018 6:18 pm

I can comment a bit on the DSP instructions... I'm the guy behind Teensy and I wrote the Teensy Audio Library, which is (as far as I know) the only library out there today that's making substantial use of the special M4 DSP instructions. The CMSIS library also has a big collection of vector math functions, which do make use of these instructions when your vectors are "q15" data type.

First, to clear up a common misunderstanding (or wishful thinking), the compiler makes no significant use of the M4's DSP instructions. The IAR & Keil compilers have special "intrinsic" functions for them. In gcc, you need to use inline asm. CMSIS and my custom headers in the audio library give you something pretty similar to the "intrinsic" stuff in those other compilers using inline functions with inline asm, containing just 1 instruction.
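
As a rough illustration of that "inline function wrapping one inline asm instruction" pattern, here is a minimal sketch. The function name is hypothetical and is not taken from CMSIS or the audio library. It assumes the toolchain defines the ACLE macro `__ARM_FEATURE_DSP` when the DSP extension is available, and it falls back to plain C everywhere else so the same source still builds on a PC.

```c
#include <stdint.h>

/* Signed multiply of the bottom 16 bits of a by the bottom 16 bits of b.
   Hypothetical wrapper name, for illustration only. */
static inline int32_t mul_bottom16_sketch(uint32_t a, uint32_t b)
{
#if defined(__ARM_FEATURE_DSP)
    /* One instruction: the compiler picks the registers, we pick the opcode. */
    int32_t out;
    __asm__ ("smulbb %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
    return out;
#else
    /* Portable fallback with the same result, used when not building for
       a core with the DSP extension. */
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int32_t)(int16_t)(b & 0xFFFFu);
#endif
}
```

Because the wrapper is `static inline`, it normally compiles down to the single instruction at the call site, which is what makes it feel like the intrinsics in the other compilers.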

The most important thing to understand about M4's DSP instructions is they're mostly meant for 16 bit signed integers. There is some limited support for 8 bits too. If you want to use floats or 32 or 24 bit integer data, the DSP instructions won't help. It's all built around the idea that the "S" in DSP, the signal, is 16 bit signed integers.

Actually programming with the DSP instructions involves a lot of careful planning for how the compiler will actually allocate registers. It's not like normal C programming where you can mostly ignore the low-level details. As you can see in parts of the Teensy Audio Library, especially stuff like the variable & biquad filters, I have indeed put these instructions to very effective use. The process involves carefully planning how the 13 available ARM registers will get used, then compiling and reading a disassembly of the compiler generated code. As soon as you add code that uses more, the compiler "spills" local variables onto the stack, ruining any speedup you hoped to accomplish! In just those 2 parts of the library, I compiled hundreds of times and carefully checked which registers the compiler actually used. You don't see that when just reading the C++ source code, but I can assure you this is the reality of actually using M4's DSP instructions.

ARM Cortex-M4 also has another not-so-widely-known hardware optimization that works together with the DSP instructions. When your code has a sequence of "similar" 32 bit load or store instructions, the M4 will automatically optimize them at runtime using a special burst mode on its internal bus. Normally a load or store takes 2 cycles. With this burst mode, only the first memory access takes 2 cycles. Then the rest take only 1 cycle. This is critically important for overall optimization of simpler signal processing. If you can use 4 registers to hold 8 signal samples, you can read those 8 samples into registers using only 5 cycles.
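
The 5-cycle figure follows directly from the numbers above: 8 samples packed two per register means 4 loads, and 2 + 1 + 1 + 1 = 5. A tiny sketch of that arithmetic, with made-up function names, using only the cycle counts stated in this post (it is a model of the claim, not a measurement):

```c
#include <stdint.h>

/* Cycles for n back-to-back 32-bit loads under the burst behaviour described
   above: 2 cycles for the first access, 1 for each one after it. */
static inline uint32_t burst_load_cycles(uint32_t n)
{
    return (n == 0u) ? 0u : n + 1u;
}

/* Cycles for n loads that each pay the full 2 cycles. */
static inline uint32_t isolated_load_cycles(uint32_t n)
{
    return 2u * n;
}
```

For 4 loads that is 5 cycles against 8, and the gap widens toward the 2X mentioned later as the run of loads gets longer.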

Most of the M4 DSP instructions are organized to let you use 32 bit registers as pairs of 16 bit data. Each instruction comes in a "top" and "bottom" version which do the same thing, but access different halves of a normal 32 bit register. The intention is you consume the input from packed data and write intermediate results (usually more than 16 bits) into the other registers. The DSP instructions give you special shift-and-saturate operations to get your results back to 16 bit samples, and ways to pack the results back into pairs in 32 bit registers. Then you can use another optimized store sequence to quickly get the result written back out to memory. Doing 4 or 8 samples per loop this way gives a huge speedup, often mostly due to the quicker memory access and suffering the loop overhead 4 or 8 times less. M4 lacks branch prediction like you get in M7, so a branch taken always costs 3 cycles on M4, plus any math or logic to compute the branch condition. Even if you don't speed up the math at all, fewer times looping and the 2X faster memory access really adds up when you process a big buffer of signal data.
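
Here is a minimal portable sketch of that unpack, compute, shift-and-saturate, repack flow for one packed pair of samples. All names are hypothetical, the Q15 gain is just an example operation, and each helper stands in for what would be a single instruction on the M4. It assumes the compiler does an arithmetic right shift on negative values, as GCC does.

```c
#include <stdint.h>

/* Shift-and-saturate step: bring a wide intermediate back to 16 bits. */
static inline int32_t sat16_wide_model(int64_t x)
{
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int32_t)x;
}

/* Pack two 16-bit results into one 32-bit word (top half, bottom half). */
static inline uint32_t pack_pair_model(int32_t top, int32_t bottom)
{
    return ((uint32_t)top << 16) | ((uint32_t)bottom & 0xFFFFu);
}

/* Apply a Q15 gain (32768 == 1.0) to both samples held in one packed word. */
static inline uint32_t scale_pair_model(uint32_t packed, int32_t gain_q15)
{
    int32_t bottom = (int16_t)(packed & 0xFFFFu);   /* "bottom" sample */
    int32_t top    = (int16_t)(packed >> 16);       /* "top" sample    */
    /* Intermediates are wider than 16 bits until the saturate step. */
    int64_t wide_b = ((int64_t)bottom * gain_q15) >> 15;
    int64_t wide_t = ((int64_t)top * gain_q15) >> 15;
    return pack_pair_model(sat16_wide_model(wide_t), sat16_wide_model(wide_b));
}
```

A real loop would do this for 2 or 4 packed words per iteration, between one run of loads and one run of stores.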

Math-wise, the flagship DSP instruction does two 16x16 signed multiplies and adds them both into a 64 bit accumulator. But it's limited to only certain top-bottom combinations from 2 source registers. The CMSIS math functions use this for FFT and FIR filters, which are the main places where those particular combinations are useful. In the Teensy Audio Library, the only other place I've found to really use it is the RMS analysis, which uses hardly any CPU time to give you the RMS value of a real-time audio stream due to these speedups.
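
To show why RMS maps so well onto this, here is a portable sketch of a dual multiply-accumulate and a sum-of-squares loop built on it. I'm assuming the instruction meant is SMLALD (bottom times bottom plus top times top, added to a 64 bit accumulator). The function names are invented, and this is not the audio library's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a dual 16x16 multiply with 64-bit accumulate:
   acc + bottom(a)*bottom(b) + top(a)*top(b). */
static inline int64_t dual_mac_model(int64_t acc, uint32_t a, uint32_t b)
{
    int32_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int32_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    return acc + (int64_t)a_lo * b_lo + (int64_t)a_hi * b_hi;
}

/* Sum of squares over a buffer of packed sample pairs. Feeding the same word
   as both operands squares two samples per step. */
static inline int64_t sum_of_squares_model(const uint32_t *packed, size_t words)
{
    int64_t acc = 0;
    for (size_t i = 0; i < words; i++) {
        acc = dual_mac_model(acc, packed[i], packed[i]);
    }
    return acc;
}
```

The RMS value is then the square root of this sum divided by the sample count, which only has to be computed once per block rather than once per sample.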

For general DSP programming, the instruction I've often found the most useful is the 32x16 multiply which discards the low 16 bits of its 48 bit result. The 32 bit input gives you a huge numerical range for coefficients & scaling factors, and the 16 bit input accesses either top or bottom packed samples. The *really* awesome part about this instruction is the result only consumes a single 32 bit register. That may sound underwhelming, but I can assure you it's pretty awesome when you're actually programming and struggling to come up with a strategy that lets you do 4 or 8 samples in a loop and still fit within the 13 registers.
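
A portable sketch of what that 32x16 multiply computes, in its bottom and top flavours (SMULWB and SMULWT are the instructions I believe are meant). The function names are hypothetical, and the code assumes an arithmetic right shift on negative values, as GCC provides.

```c
#include <stdint.h>

/* 32-bit value times the bottom 16-bit sample of b; the 48-bit product loses
   its low 16 bits, so the result fits in a single 32-bit register. */
static inline int32_t mul_32x16b_model(int32_t a, uint32_t b)
{
    int64_t product = (int64_t)a * (int16_t)(b & 0xFFFFu);
    return (int32_t)(product >> 16);
}

/* Same operation, but using the top 16-bit sample of b. */
static inline int32_t mul_32x16t_model(int32_t a, uint32_t b)
{
    int64_t product = (int64_t)a * (int16_t)(b >> 16);
    return (int32_t)(product >> 16);
}
```

If the 32 bit operand is treated as a coefficient with 16 fractional bits (65536 == 1.0), the result comes out directly in sample units, so no separate shift is needed afterwards.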

If you read all or even part of this, hopefully it gives you some idea of the M4's DSP features. The main points are 16 bit signed data, and the compiler doesn't magically make use of these instructions. To really use them requires pretty intense programming where you carefully plan and check the actual register allocation the compiler does.

paulstoffregen
 
Posts: 434
Joined: Sun Oct 11, 2009 11:23 am
Location: Portland, Oregon, USA

Re: "grand central"? Mega format SAMD51? YES!

by paulstoffregen on Wed Dec 26, 2018 6:34 pm

Might also be worth pointing out M7 has exactly the same set of DSP instructions. But on M7 you get 2 tightly coupled memory buses, both 64 bits wide. Well, actually the data TCM is a pair of 32 bit buses, which means M7 can sometimes perform 2 loads in parallel, or 1 load while doing 1 store - if you've planned the memory addresses carefully. But stores on M7 usually go to the cache, if it's been configured as write-back mode - and the large caches do indeed make a huge difference (M4 can theoretically have data cache too - but few vendors use it - typically only instruction cache, if any). M7 also saves 2 cycles on predictable branches, and M7 is a superscalar dual-issue architecture, so sometimes you get 2 instructions per cycle. The performance on M7 is really quite incredible!

paulstoffregen
 
Posts: 434
Joined: Sun Oct 11, 2009 11:23 am
Location: Portland, Oregon, USA

Re: "grand central"? Mega format SAMD51? YES!

by dar Kale on Wed Dec 26, 2018 6:49 pm

@paulstoffregen: Thank you for the fantastic explanation! You are obviously the 'go-to' guy for this topic. Me, not so much. I've only beat out a few IIR and FIR filters at the extreme simpleton level. But I really appreciate your reply. Especially the integer math details; I've probably used floating point math one time in 40 years.... integer math rules!

Thanks again!!!
dar Kale
 
Posts: 39
Joined: Thu Feb 13, 2014 10:23 pm

Re: "grand central"? Mega format SAMD51? YES!

by paulstoffregen on Wed Dec 26, 2018 7:27 pm

One of the strange things that takes some adjustment if you're used to older microcontrollers is that 32 bit float on M4 (for the M4s with FPU) is almost as fast as integer math. For integer stuff where you end up with several complicated if-else checks due to numerical ranges, sometimes 32 bit float ends up being a net win. Even when it's slower, it's usually pretty close to integer speed. But 64 bit float on M4 is still done in slow software. Due to how the compiler automatically promotes to 64 bit double in many cases, and the common math library functions are 64 bit double, optimally using 32 bit float on M4 requires a bit of care, especially with how you write constants and which math functions you use.
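
A small sketch of the constants part of that advice, with made-up function names. In C an unsuffixed constant like `0.5` has type double, so mixing it with a float pulls the whole expression up to double, while the `f` suffix keeps it in single precision.

```c
/* Stays in 32 bit float: every constant carries the 'f' suffix. */
static inline float scale_single(float x)
{
    return x * 0.5f + 0.25f;
}

/* Same formula, but the unsuffixed constants are doubles, so x is promoted
   and the multiply and add are done in 64 bit double before converting back. */
static inline float scale_promoted(float x)
{
    return (float)(x * 0.5 + 0.25);
}
```

Both return the same value here, but on an M4 the second one ends up in the software double routines. The same applies to math functions: `sinf()` and `sqrtf()` take and return float, while `sin()` and `sqrt()` work in double. GCC's `-Wdouble-promotion` warning can help find the places where this happens by accident.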

Most M7 chips come with both 32 and 64 bit float in hardware. 64 bit double runs at about half the speed of 32 bit float, and of course on M7 integers are even faster because you get 2 integer execution units.

Pretty amazing how far things have come from the old days of 8 bits without even hardware multiply.

paulstoffregen
 
Posts: 434
Joined: Sun Oct 11, 2009 11:23 am
Location: Portland, Oregon, USA

Re: "grand central"? Mega format SAMD51? YES!

by tomjennings on Wed Dec 26, 2018 8:10 pm

first, thanks so much for the detailed explanation. and i now better appreciate the care you put into the sound libs!


paulstoffregen wrote:ARM Cortex-M4 also has another not-so-widely-known hardware optimization ... [w]hen your code has a sequence of "similar" 32 bit load or store instructions, the M4 will automatically optimize them at runtime using a special burst mode on its internal bus. ...


oooooh this appeals to my inherent laziness, and appeases my dread of losing portability. can you give us any pointer on where that 'similar ... burst mode' pipeline thing is described? personally i avoid feature-specific features, i find they come back to bite me later. i stick to conservative compiler rules/subsets for this reason (for my car stuff i follow the low-level recommendations in MISRA-C:2004). but if i can code 'up next to' known optimizations without violating conservative compiler practice, that's best of all.

tomjennings
 
Posts: 37
Joined: Thu Aug 17, 2006 1:21 am

Re: "grand central"? Mega format SAMD51? YES!

by westfw on Thu Dec 27, 2018 3:23 am

I've added the (perhaps preliminary) GCS pinouts to my spreadsheet at https://docs.google.com/spreadsheets/d/1hWHiM1Sk-gxcUmfgm5XQctqf9yaOnkmY66cwd_sRCiY/edit?usp=sharing, based on the contents of the variant file in Adafruit's SAMD Arduino Core github repository.

I like having the spreadsheet format, because by sorting on a particular column you can get a good idea of where stuff is, much more quickly than you could from a datasheet. And it helps me to understand what capabilities are present and/or "complicated" on a given chip.

As usual, this is a rough draft that may contain errors, as it's difficult to debug/error-check this sort of thing. And in the case of Grand Central (which is not released), there may be changes before the board actually ships.
westfw
 
Posts: 1509
Joined: Fri Apr 27, 2007 1:01 pm
Location: SF Bay area

Re: "grand central"? Mega format SAMD51? YES!

by roborich on Sat Jan 12, 2019 12:40 pm

Related to westfw's post (and now that it's shipping), have the schematics for the GCM been posted anywhere yet? There doesn't seem to be a downloads page associated with the product yet.

On another note, not sure if this is the right place to do it, but I'd like to put in a request for a version of this product *without* the headers. Many of my projects come together with the processor board in the top or the middle of the stack. Even with Mega size boards I've had several projects where the Mega is attached on top of a base board that breaks out the I/O and power even further, and provides more secure mounting.

Yes, there are a lot of work-arounds and you could de-solder and pull off all the headers, but that's a lot of work, and risks damaging the board, especially with all the connectors on a Mega size board.

I would even argue that the GCM without headers provides more flexibility with connections. It's much easier to put in bottom headers, top headers, stacking pass-throughs, 90-degree, or any combination of those without having to do some connector translation.

Just a thought.

roborich
 
Posts: 10
Joined: Sun Mar 31, 2013 8:55 pm

Re: "grand central"? Mega format SAMD51? YES!

by adafruit2 on Sat Jan 12, 2019 1:25 pm

hiya just doing all the documentation - here's the guide as it is now
https://learn.adafruit.com/adafruit-grand-central
we can look at a headerless version - would be easier on us for sure :)

adafruit2
Site Admin
 
Posts: 18053
Joined: Fri Mar 11, 2005 7:36 pm
