
Please explain Here be Dragons setting Itsy Bitsy M4 Arduino
Moderators: adafruit_support_bill, adafruit

Please be positive and constructive with your questions and comments.

by blnkjns on Sat Mar 20, 2021 4:04 am

I was benchmarking some boards in the Arduino IDE, and ran the same calculations in CircuitPython for comparison, just by throwing 2500 floating point calculations at them, and got these results:
 * Teensy 4.1 M7 (600MHz)            0.001481s
 * Itsy Bitsy M4 (200MHz) Dragons    0.013533s
 * Itsy Bitsy M4 (120MHz) Dragons    0.021547s
 * ESP32 (240MHz)                    0.039620s
 * Itsy Bitsy M4 (Circuit Python)    0.044998s
 * Itsy Bitsy M4 (200MHz)            0.045947s
 * Pi Pico (133MHz Circuit Python)   0.052979s
 * Itsy Bitsy M4 (120MHz)            0.072773s
 * Leonardo  8-bit AVR (16MHz)       0.417328s
 * Adafruit Metro M0 Express (48MHz) 0.576714s
As you can see, the performance of the Itsy Bitsy M4 is very much affected by the setting "Here be dragons": it becomes faster than the ESP32 and Pico. What happens in this case? Is this setting needed to activate the FPU at all? Is there more detailed documentation on that somewhere? This page is pretty vague about it:
https://learn.adafruit.com/introducing- ... ches-to-m0

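  // a..e are assumed to be declared elsewhere in the sketch (their type is not shown here)
  unsigned long starttime = micros();   // micros() returns unsigned long
  for (int n = 1; n < 501; n++) {       // 500 passes x 5 calls = 2500 calculations
    a = sqrt(n);
    b = log(n);
    c = sin(n);
    d = atan2(a, c);
    e = pow(1.01234, n);
  }
  unsigned long finishtime = micros();  // elapsed time is finishtime - starttime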
And why is the M0 doing so badly? It gets beaten by an 8-bit chip running at a third of its clock speed.
Nice to see the Itsy Bitsy M4 outperform the Pi Pico in CircuitPython speed.

Two minor strange things:
If you don't print the values a to e afterwards, some compilers (AVR 8-bit) just optimize away the whole routine and it completes in about 1 microsecond!
The CircuitPython result is off for the power function: 460.374 instead of the 460.467102 that all boards give with C++, which is quite inaccurate. The other functions do fine.
My Mac calculator thinks it should be 460.4671004284367 in 64-bit precision, Windows Phone thinks it should be 460.4671004284327, so C++ seems right here.
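
For anyone repeating the test, one common way to stop the compiler from discarding the loop is to store the results in volatile variables. The following is only a desktop sketch of that idea, not the original sketch: `BenchResult` and `run_float_benchmark` are made-up names, `std::chrono` stands in for `micros()`, and float results are an assumption based on the 460.467102 figure above.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical result holder; the names are illustrative only.
struct BenchResult {
    double seconds;   // elapsed wall-clock time for the loop
    float last_pow;   // final pow(1.01234, 500) result, kept for inspection
};

BenchResult run_float_benchmark() {
    // volatile forces every store to really happen, so the optimizer
    // cannot drop the loop even if the values are never printed.
    volatile float a = 0.0f, b = 0.0f, c = 0.0f, d = 0.0f, e = 0.0f;

    const auto start = std::chrono::steady_clock::now();
    for (int n = 1; n < 501; n++) {
        const double x = n;
        a = static_cast<float>(std::sqrt(x));
        b = static_cast<float>(std::log(x));
        c = static_cast<float>(std::sin(x));
        const float a_now = a;   // read the volatiles back explicitly
        const float c_now = c;
        d = std::atan2(a_now, c_now);
        e = static_cast<float>(std::pow(1.01234, x));
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    const float last = e;
    return BenchResult{elapsed.count(), last};
}
```

Calling `run_float_benchmark()` and printing `seconds` and `last_pow` gives a timing that should survive optimization, plus the pow result to compare against the calculator values.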


Re: Please explain Here be Dragons setting Itsy Bitsy M4 Ard

by adafruit_support_mike on Mon Mar 22, 2021 10:28 pm

blnkjns wrote: the Itsy Bitsy M4 is very much affected by the setting "Here be dragons", it becomes faster than the ESP32 and Pico. What does happen in this case?

Code optimization is considered a black art: a field so complex that it's nearly impossible to explain to non-experts, and even experts find it impossible to predict what the system will do.

There are several basic forms of optimization, like unrolling loops: replacing this:

    for ( uint8_t i=0 ; i < 3 ; i++ ) {
        do_something_with( i );
    }
with:

    do_something_with( 0 );
    do_something_with( 1 );
    do_something_with( 2 );
if the increment, comparison, and branch steps take longer than the function call. Another technique is inlining: replacing a function call with the code inside the function definition to save the cost of the function call.
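
To make the unrolling idea concrete for a loop whose length isn't known in advance, here is a hand-written sketch. The function names `sum_rolled` and `sum_unrolled4` are made up for illustration, and a real optimizer may choose a different unroll factor or do something else entirely.

```cpp
#include <cstddef>
#include <vector>

// Plain loop: one add plus one increment, compare, and branch per element.
long sum_rolled(const std::vector<int>& v) {
    long total = 0;
    for (std::size_t i = 0; i < v.size(); i++) {
        total += v[i];
    }
    return total;
}

// Unrolled by 4: same result, but the increment/compare/branch
// overhead is paid once per four elements instead of once per element.
long sum_unrolled4(const std::vector<int>& v) {
    long total = 0;
    std::size_t i = 0;
    const std::size_t limit = v.size() - (v.size() % 4);
    for (; i < limit; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < v.size(); i++) {   // handle the 0 to 3 leftover elements
        total += v[i];
    }
    return total;
}
```

The price is larger code, which is one reason unrolling is usually left to the higher optimization levels.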

More advanced techniques include speculative evaluation: calculating results for both the TRUE and FALSE cases of a conditional so the right value is ready as soon as the ALU knows which case to use. The goal there is to keep the processor's pipeline full. The M4 processor takes 3 clock cycles to execute most instructions, but breaks the execution into 3 steps and handles 3 instructions at a time. On average, it completes one operation and starts one new operation every clock cycle.
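
The pipeline arithmetic can be written out as a toy model. It is only a sketch: it assumes every instruction takes exactly the same number of stages and nothing ever stalls, which real code won't do, and the function names are made up.

```cpp
// Toy model: cycles needed if each instruction must finish all of its
// stages before the next one starts.
unsigned long cycles_unpipelined(unsigned long count, unsigned long stages) {
    return count * stages;
}

// Toy model: with a full pipeline the first instruction takes `stages`
// cycles, and after that one instruction completes every cycle.
unsigned long cycles_pipelined(unsigned long count, unsigned long stages) {
    if (count == 0) {
        return 0;
    }
    return stages + (count - 1);
}
```

With 3 stages, 100 instructions cost 300 cycles one at a time and 102 cycles pipelined, which is where the "one operation per clock on average" figure comes from.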

Things like conditionals cause problems for the pipeline because it doesn't know which instruction will execute in three clock cycles. The naive approach is to process the conditional and then load the next appropriate instruction. But then you have to wait for three clock cycles for that instruction to work its way through the pipeline. Speculative execution loads instructions from both paths into the pipeline to get some use out of those dead cycles. Even if you have to throw away two instructions, you've lost two clock cycles instead of three.

Even more advanced is branch prediction: choosing one branch over the other based on the statistical probability that the choice will be correct. As a simple example, if a loop will probably execute 100 times before it ends, branch prediction will preload the next iteration of the loop every time. Statistically, it expects to save 300 clock cycles at the cost of losing 3 at the end of the loop.
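
The 300-versus-3 figure is simple bookkeeping, sketched here with made-up names. It assumes a fixed refill cost per wrong guess and a predictor that always guesses "the loop continues", which is a simplification rather than a description of any particular chip.

```cpp
// Hypothetical tally of cycles saved and lost by always predicting
// that a loop branch goes around again.
struct BranchCost {
    unsigned long saved;  // refills avoided by correct guesses
    unsigned long lost;   // refill paid for the one wrong guess at loop exit
};

BranchCost predict_loop_continues(unsigned long iterations,
                                  unsigned long refill_cycles) {
    // The guess is right once per iteration and wrong exactly once,
    // when the loop finally ends.
    return BranchCost{iterations * refill_cycles, refill_cycles};
}
```

For 100 iterations and a 3-cycle refill this gives 300 cycles saved against 3 lost, matching the numbers above.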

When you add another processing unit like an FPU or a peripheral, the optimizer will work to feed instructions to each device at the right time and in the right order to avoid making any part of the system wait on any other.

By the time an optimization system says "here there be dragons", it's using aggressive techniques that will only work if the code obeys certain restrictions. The fun part is that the optimizer doesn't tell the programmer what those restrictions are, it just gambles on the chance that its tricks will work.
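
One documented example of such a restriction comes from GCC in general: `-Ofast` turns on `-ffast-math`, which includes `-ffinite-math-only` and lets the compiler assume no value is ever NaN or infinite. Whether the "Here be dragons" menu entry actually enables that flag is an assumption to verify in the board package. The sketch below uses made-up names (`kMissing`, `mean_ignoring_nan`) and behaves as commented only under normal IEEE 754 rules.

```cpp
#include <cmath>
#include <limits>

// Hypothetical marker for "no reading available".
const double kMissing = std::numeric_limits<double>::quiet_NaN();

// Returns the mean of the readings, skipping any kMissing entries.
double mean_ignoring_nan(const double* values, int count) {
    double total = 0.0;
    int used = 0;
    for (int i = 0; i < count; i++) {
        // Reliable under standard floating-point rules. With
        // -ffinite-math-only the compiler is allowed to assume this
        // test is always false and remove it.
        if (std::isnan(values[i])) {
            continue;
        }
        total += values[i];
        used++;
    }
    return used > 0 ? total / used : 0.0;
}
```

Code like this obeys the language rules but not the extra promise the aggressive flags rely on, so it can silently change behavior when they are switched on.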

