Optimize C/C++ Code Performance for Deep Learning Applications without Deep Learning Libraries - MATLAB & Simulink


Vectorization and multithreading can improve the performance of embedded applications. You can use vectorization to execute the same instruction on multiple data elements simultaneously or multithreading to divide a workload into threads for concurrent execution across several cores. Both techniques allow processors to make more efficient use of available resources and complete tasks faster.
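For example, when you generate code with MATLAB Coder, multithreading of supported constructs can be enabled through OpenMP. A minimal sketch, assuming a build compiler with OpenMP support:

```matlab
% Minimal sketch: enable OpenMP so that supported constructs (for example,
% parfor loops and multithreaded inference) can run across several cores.
cfg = coder.config('lib');   % static library code configuration
cfg.EnableOpenMP = true;     % allow multithreaded generated code
```

Whether the generated loops additionally use SIMD instructions depends on the instruction set configuration for your target hardware.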

For deep learning networks that are resilient to precision loss, compressing learnables from single precision to the bfloat16 data type greatly reduces memory usage with little change in inference accuracy. This process requires no calibration data and increases inference speed. Any hardware that supports the single-precision floating-point data type can benefit from bfloat16.
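The memory saving comes from storing only the upper 16 bits of each single-precision value (the sign bit, all 8 exponent bits, and the top 7 mantissa bits). A rough illustration in MATLAB of the precision that truncation preserves, assuming a little-endian platform; this is not part of the code generation API, only a way to inspect the effect:

```matlab
% Truncate a single-precision value to bfloat16 by keeping its high 16 bits.
% Assumes a little-endian platform (the low-order word comes first).
x     = single(3.14159);
words = typecast(x, 'uint16');                      % [low16, high16]
bf    = typecast([uint16(0), words(2)], 'single');  % zero the low 16 bits
relErr = abs(bf - x) / x;                           % on the order of 2^-8 or less
```

Because bfloat16 keeps the full 8-bit exponent of single precision, dynamic range is unchanged; only mantissa precision is reduced, which is why many networks lose little accuracy.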

Optimize the performance of generated C/C++ code in these ways:

Code Generation for ARM Cortex Targets

With MATLAB Coder, you can enable vectorization by using the SIMD intrinsics available in the code replacement libraries for Cortex-M targets, or by setting the InstructionSetExtensions property for ARM® Cortex-A targets.

Code Generation for Intel and AMD

To generate code for x86 hardware, set the InstructionSetExtensions property to an instruction set extension that your processor supports. If you use Embedded Coder, you can also select from the instruction sets SSE, SSE4.1, AVX, AVX2, FMA, and AVX512F.
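A minimal sketch of selecting an instruction set extension for an x86 target (AVX2 here; choose an extension that your processor actually supports):

```matlab
% Select an x86 instruction set extension so that generated loops can use
% SIMD instructions. AVX2 operates on eight single-precision values at a time.
cfg = coder.config('lib');
cfg.InstructionSetExtensions = 'AVX2';
```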

Generate bfloat16 Code

Generate code with learnables compression in the bfloat16 format by setting the LearnablesCompression property of your coder.DeepLearningCodeConfig object dlcfg:

dlcfg = coder.DeepLearningConfig(TargetLibrary = 'none');
dlcfg.LearnablesCompression = 'bfloat16';

Alternatively, in the MATLAB® Coder™ app or the Configuration Parameters dialog box, on the Deep Learning tab, set Target library to none. Then set the Learnables Compression property to bfloat16.
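Putting the pieces together at the command line, attach the deep learning configuration to a code configuration object before generating code. The entry-point function name and input size below are hypothetical placeholders; substitute your own network's predict function:

```matlab
% Sketch of a full command-line workflow. The entry-point function name
% (myNetPredict) and input size are hypothetical placeholders.
cfg   = coder.config('lib');
dlcfg = coder.DeepLearningConfig(TargetLibrary = 'none');
dlcfg.LearnablesCompression = 'bfloat16';
cfg.DeepLearningConfig = dlcfg;
% codegen -config cfg myNetPredict -args {ones(224,224,3,'single')}
```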
