dlhdl.Workflow.deploy - Deploy the specified neural network to the target FPGA board - MATLAB

Class: dlhdl.Workflow
Namespace: dlhdl

Deploy the specified neural network to the target FPGA board

Since R2020b

Syntax

deploy(workflowObject)

Description

deploy(workflowObject) programs the specified target board with the bitstream and deploys the deep learning network on it.
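A minimal usage sketch, assuming you have already created a dlhdl.Workflow object named hW with a network, bitstream, and dlhdl.Target, is to compile the network and then deploy it:

% Sketch only: hW is an assumed, previously configured dlhdl.Workflow object.
% compile generates the instructions and weights; deploy programs the FPGA
% and downloads the network parameters.
compile(hW);
deploy(hW)

For complete end-to-end workflows, see the examples below.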


Input Arguments


workflowObject — Deep learning network deployment options

dlhdl.Workflow object

Deep learning network deployment options, specified as a dlhdl.Workflow object.

Examples


Get Started with Deep Learning FPGA Deployment on Intel Arria 10 SoC

This example shows how to create, compile, and deploy a dlhdl.Workflow object that has a handwritten character detection series network object by using the Deep Learning HDL Toolbox™ Support Package for Intel FPGA and SoC. Use MATLAB® to retrieve the prediction results from the target device.

Prerequisites

Load the Pretrained SeriesNetwork

To load the pretrained network, which has been trained on the Modified National Institute of Standards and Technology (MNIST) database [1], enter:
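A minimal sketch, assuming the pretrained digits network is saved in a MAT-file on the MATLAB path (the file and variable names here are illustrative, not part of the shipped example):

% Assumption: the pretrained MNIST series network is stored in the variable
% 'net' inside a MAT-file named 'mnistDigitsNet.mat' on the path.
load('mnistDigitsNet.mat','net');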

To view the layers of the pretrained series network, enter:
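For example, you can list the layers at the command line, or open the network in analyzeNetwork for a graphical summary:

% Display the layer array of the loaded series network.
net.Layers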

Create Target Object

Create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Intel™ Quartus™ Prime Standard Edition 22.1. Set up the path to your installed Intel Quartus Prime executable if it is not already set up. For example, to set the toolpath, enter:

% hdlsetuptoolpath('ToolName', 'Altera Quartus II','ToolPath', 'C:\altera\22.1\quartus\bin64');

hTarget = dlhdl.Target('Intel')

hTarget = Target with properties:

   Vendor: 'Intel'
Interface: JTAG

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the pretrained MNIST neural network, net, as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Intel Arria 10 SoC board and the bitstream uses a single data type.

hW = dlhdl.Workflow('network', net, 'Bitstream', 'arria10soc_single','Target',hTarget);

Compile the MNIST Series Network

To compile the MNIST series network, run the compile function of the dlhdl.Workflow object.
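For example, assuming the workflow object hW created in the previous step:

% Compile the network; the progress messages below are produced by this call.
compile(hW);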

Compiling network for Deep Learning FPGA prototyping ...

Targeting FPGA bitstream arria10soc_single.

An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.

Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'

Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.

The network includes the following layers:

 1   'imageinput'        Image Input         28×28×1 images with 'zerocenter' normalization                (SW Layer)
 2   'conv_1'            2-D Convolution     8 3×3×1 convolutions with stride [1  1] and padding 'same'    (HW Layer)
 3   'relu_1'            ReLU                ReLU                                                          (HW Layer)
 4   'maxpool_1'         2-D Max Pooling     2×2 max pooling with stride [2  2] and padding [0  0  0  0]   (HW Layer)
 5   'conv_2'            2-D Convolution     16 3×3×8 convolutions with stride [1  1] and padding 'same'   (HW Layer)
 6   'relu_2'            ReLU                ReLU                                                          (HW Layer)
 7   'maxpool_2'         2-D Max Pooling     2×2 max pooling with stride [2  2] and padding [0  0  0  0]   (HW Layer)
 8   'conv_3'            2-D Convolution     32 3×3×16 convolutions with stride [1  1] and padding 'same'  (HW Layer)
 9   'relu_3'            ReLU                ReLU                                                          (HW Layer)
10   'fc'                Fully Connected     10 fully connected layer                                      (HW Layer)
11   'softmax'           Softmax             softmax                                                       (SW Layer)
12   'Output1_softmax'   Regression Output   mean-squared-error                                            (SW Layer)
                                                                                                         

Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.

Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.

Compiling layer group: conv_1>>maxpool_2 ...

Compiling layer group: conv_1>>maxpool_2 ... complete.

Compiling layer group: conv_3>>relu_3 ...

Compiling layer group: conv_3>>relu_3 ... complete.

Compiling layer group: fc ...

Compiling layer group: fc ... complete.

Allocating external memory buffers:

      offset_name          offset_address     allocated_space 
_______________________    ______________    _________________

"InputDataOffset"           "0x00000000"     "368.0 kB"       
"OutputResultOffset"        "0x0005c000"     "4.0 kB"         
"SchedulerDataOffset"       "0x0005d000"     "220.0 kB"       
"SystemBufferOffset"        "0x00094000"     "76.0 kB"        
"InstructionDataOffset"     "0x000a7000"     "28.0 kB"        
"ConvWeightDataOffset"      "0x000ae000"     "28.0 kB"        
"FCWeightDataOffset"        "0x000b5000"     "100.0 kB"       
"EndOffset"                 "0x000ce000"     "Total: 824.0 kB"

Network compilation complete.

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Intel Arria 10 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device, displays progress messages, and reports the time it takes to deploy the network.
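For example:

% Program the FPGA with the bitstream and download the network weights and biases.
deploy(hW)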

Programming FPGA Bitstream using JTAG...

Programming the FPGA bitstream has been completed successfully.

Loading weights to Conv Processor.

Conv Weights loaded. Current time is 18-Jul-2024 10:54:36

Loading weights to FC Processor.

FC Weights loaded. Current time is 18-Jul-2024 10:54:37

Run Prediction for Example Image

To load the example image and convert it to a dlarray, enter:

inputImg = imread('five_28x28.pgm');
inputImg = dlarray(single(inputImg), 'SSCB');

Run the prediction with the Profile option set to 'on' to see the latency and throughput results.

[prediction, speed] = hW.predict(inputImg,'Profile','on');

Finished writing input activations.

Running single input activation.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

              Network                      31905                  0.00021                       1           32854           4565.7
                imageinput_norm             2913                  0.00002 
                conv_1                      6819                  0.00005 
                maxpool_1                   4493                  0.00003 
                conv_2                      5200                  0.00003 
                maxpool_2                   3549                  0.00002 
                conv_3                      6045                  0.00004 
                fc                          2854                  0.00002 

[val, idx] = max(prediction);
fprintf('The prediction result is %d\n', idx-1);

The prediction result is 5

Bibliography

  1. LeCun, Y., C. Cortes, and C. J. C. Burges. "The MNIST Database of Handwritten Digits." https://yann.lecun.com/exdb/mnist/.

Classify Images on FPGA Using Quantized Neural Network

This example shows how to use Deep Learning HDL Toolbox™ to deploy a quantized deep convolutional neural network (CNN) to an FPGA. In the example, you use the pretrained ResNet-18 CNN to perform transfer learning and quantization. You then deploy the quantized network and use MATLAB® to retrieve the prediction results.

ResNet-18 has been trained on over a million images and can classify images into 1000 object categories, such as keyboard, coffee mug, pencil, and many animals. The network has learned rich feature representations for a wide range of images. The network takes an image as input and outputs a label for the object in the image together with the probabilities for each of the object categories.

For this example, you need:

To perform classification on a new set of images, you fine-tune a pretrained ResNet-18 CNN by transfer learning. In transfer learning, you can take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network with randomly initialized weights. You can quickly transfer learned features to a new task using a smaller number of training images.

Load Pretrained Network

Load the pretrained ResNet-18 network.
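One way to load it, assuming the Deep Learning Toolbox Model for ResNet-18 Network support package is installed:

% Load the pretrained ResNet-18 network as a DAGNetwork object.
net = resnet18;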

View the layers of the pretrained network.

deepNetworkDesigner(net);

The first layer, the image input layer, requires input images of size 224-by-224-by-3, where three is the number of color channels.

inputSize = net.Layers(1).InputSize;

Load Data

This example uses the MathWorks MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).

curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

Specify Training and Validation Sets

Divide the data into training and validation data sets, so that 70% of the images go to the training data set and 30% of the images to the validation data set. splitEachLabel splits the datastore imds into two new datastores, imdsTrain and imdsValidation.

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');

Replace Final Layers

To retrain ResNet-18 to classify new images, replace the last fully connected layer and the final classification layer of the network. In ResNet-18, these layers have the names 'fc1000' and 'ClassificationLayer_predictions', respectively. The fully connected layer and classification layer of the pretrained network net are configured for 1000 classes. These two layers contain information on how to combine the features that the network extracts into class probabilities and predicted labels, and they must be fine-tuned for the new classification problem. Extract the layer graph from the pretrained network.
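A sketch of one way to obtain the layer graph that the subsequent replaceLayer calls operate on (consistent with the lgraph output shown below):

% Extract the layer graph from the pretrained DAGNetwork.
lgraph = layerGraph(net)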

lgraph = LayerGraph with properties:

 InputNames: {'data'}
OutputNames: {'ClassificationLayer_predictions'}
     Layers: [71×1 nnet.cnn.layer.Layer]
Connections: [78×2 table]

numClasses = numel(categories(imdsTrain.Labels))

newLearnableLayer = fullyConnectedLayer(numClasses, ...
    'Name','new_fc', ...
    'WeightLearnRateFactor',10, ...
    'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newLearnableLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);

Prepare Data for Training

The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);

To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.

augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);

Specify Training Options

Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.

options = trainingOptions('sgdm', ...
    'MiniBatchSize',10, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress');

Train Network

Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available. Using this function on a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Computing Requirements (Parallel Computing Toolbox). If a GPU is not available, the network uses a CPU (requires MATLAB Coder Interface for Deep Learning). You can also specify the execution environment by using the ExecutionEnvironment name-value argument of trainingOptions.

netTransfer = trainNetwork(augimdsTrain,lgraph,options);

Quantize Network

Quantize the network using the dlquantizer object. Set the target execution environment to FPGA.

dlquantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');

Calibrate Quantized Network

Use the calibrate function to exercise the network with sample inputs and collect the range information. The calibrate function collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network and the dynamic ranges of the activations in all layers of the network. The function returns the information as a table, in which each row contains range information for a learnable parameter of the quantized network.

calibrate(dlquantObj,augimdsTrain)

ans=95×5 table
     Optimized Layer Name        Network Layer Name    Learnables / Activations    MinValue    MaxValue 
    __________________________    __________________    ________________________    ________    ________

{'conv1_Weights'         }    {'conv1'         }           "Weights"            -0.79143     1.2547 
{'conv1_Bias'            }    {'conv1'         }           "Bias"               -0.66949    0.67671 
{'res2a_branch2a_Weights'}    {'res2a_branch2a'}           "Weights"            -0.42074    0.34251 
{'res2a_branch2a_Bias'   }    {'res2a_branch2a'}           "Bias"                -0.8039     1.2488 
{'res2a_branch2b_Weights'}    {'res2a_branch2b'}           "Weights"            -0.78524    0.59222 
{'res2a_branch2b_Bias'   }    {'res2a_branch2b'}           "Bias"                -1.3835     1.7661 
{'res2b_branch2a_Weights'}    {'res2b_branch2a'}           "Weights"             -0.3174    0.33645 
{'res2b_branch2a_Bias'   }    {'res2b_branch2a'}           "Bias"                -1.1203     1.5238 
{'res2b_branch2b_Weights'}    {'res2b_branch2b'}           "Weights"             -1.1915    0.93059 
{'res2b_branch2b_Bias'   }    {'res2b_branch2b'}           "Bias"               -0.81928     1.2022 
{'res3a_branch2a_Weights'}    {'res3a_branch2a'}           "Weights"            -0.19735    0.22659 
{'res3a_branch2a_Bias'   }    {'res3a_branch2a'}           "Bias"               -0.53009    0.69532 
{'res3a_branch2b_Weights'}    {'res3a_branch2b'}           "Weights"            -0.53557    0.72768 
{'res3a_branch2b_Bias'   }    {'res3a_branch2b'}           "Bias"               -0.67756     1.1733 
{'res3a_branch1_Weights' }    {'res3a_branch1' }           "Weights"            -0.63395    0.97791 
{'res3a_branch1_Bias'    }    {'res3a_branch1' }           "Bias"               -0.95277    0.75618 
  ⋮

Define FPGA Board Interface

Define the target FPGA board programming interface by using the dlhdl.Target object. Create a programming interface with a custom name for your target device and an Ethernet interface to connect the target device to the host computer.

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Prepare Network for Deployment

Prepare the network for deployment by creating a dlhdl.Workflow object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx® Zynq® UltraScale+™ MPSoC ZCU102 board and the bitstream uses the int8 data type.

hW = dlhdl.Workflow(Network=dlquantObj,Bitstream='zcu102_int8',Target=hTarget);

Compile Network

Run the compile method of the dlhdl.Workflow object to compile the network and generate the instructions, weights, and biases for deployment.

dn = compile(hW,'InputFrameNumberLimit',15)

Compiling network for Deep Learning FPGA prototyping ...

Targeting FPGA bitstream zcu102_int8.

Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'

The network includes the following layers:

 1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
 2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
 3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
 4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
 5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
 6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
 7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
 8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
 9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
50   'prob'                  Softmax                      softmax                                                               (SW Layer)
51   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                              

Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.

Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.

Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.

Compiling layer group: conv1>>pool1 ...

Compiling layer group: conv1>>pool1 ... complete.

Compiling layer group: res2a_branch2a>>res2a_branch2b ...

Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.

Compiling layer group: res2b_branch2a>>res2b_branch2b ...

Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.

Compiling layer group: res3a_branch1 ...

Compiling layer group: res3a_branch1 ... complete.

Compiling layer group: res3a_branch2a>>res3a_branch2b ...

Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.

Compiling layer group: res3b_branch2a>>res3b_branch2b ...

Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.

Compiling layer group: res4a_branch1 ...

Compiling layer group: res4a_branch1 ... complete.

Compiling layer group: res4a_branch2a>>res4a_branch2b ...

Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.

Compiling layer group: res4b_branch2a>>res4b_branch2b ...

Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.

Compiling layer group: res5a_branch1 ...

Compiling layer group: res5a_branch1 ... complete.

Compiling layer group: res5a_branch2a>>res5a_branch2b ...

Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.

Compiling layer group: res5b_branch2a>>res5b_branch2b ...

Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.

Compiling layer group: pool5 ...

Compiling layer group: pool5 ... complete.

Compiling layer group: new_fc ...

Compiling layer group: new_fc ... complete.

Allocating external memory buffers:

      offset_name          offset_address    allocated_space 
_______________________    ______________    ________________

"InputDataOffset"           "0x00000000"     "8.0 MB"        
"OutputResultOffset"        "0x00800000"     "4.0 MB"        
"SchedulerDataOffset"       "0x00c00000"     "4.0 MB"        
"SystemBufferOffset"        "0x01000000"     "28.0 MB"       
"InstructionDataOffset"     "0x02c00000"     "4.0 MB"        
"ConvWeightDataOffset"      "0x03000000"     "16.0 MB"       
"FCWeightDataOffset"        "0x04000000"     "4.0 MB"        
"EndOffset"                 "0x04400000"     "Total: 68.0 MB"

Network compilation complete.

dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {}
             ddrInfo: [1×1 struct]

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device, displays progress messages, and reports the time it takes to deploy the network.
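For example, using the workflow object hW created for the zcu102_int8 bitstream:

% Program the FPGA over Ethernet and download the quantized network parameters.
deploy(hW)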

Programming FPGA Bitstream using Ethernet...

Attempting to connect to the hardware board at 192.168.1.101...

Connection successful

Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...

Copying FPGA programming files to SD card...

Setting FPGA bitstream and devicetree for boot...

Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd

Set Bitstream to hdlcoder_rd/zcu102_int8.bit

Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd

Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb

Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'

Rebooting Xilinx SoC at 192.168.1.101...

Reboot may take several seconds...

Attempting to connect to the hardware board at 192.168.1.101...

Connection successful

Programming the FPGA bitstream has been completed successfully.

Loading weights to Conv Processor.

Conv Weights loaded. Current time is 21-Dec-2022 10:45:19

Loading weights to FC Processor.

FC Weights loaded. Current time is 21-Dec-2022 10:45:19

Test Network

Load the example image.

imgFile = fullfile(pwd,'MathWorks_cube_0.jpg');
inputImg = imresize(imread(imgFile),[224 224]);
imshow(inputImg)

Classify the image on the FPGA by using the predict method of the dlhdl.Workflow object and display the results.

[prediction,speed] = predict(hW,single(inputImg),'Profile','on');

Finished writing input activations.

Running single input activation.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

              Network                    7392114                  0.02957                       1         7394677             33.8
                conv1                    1115165                  0.00446 
                pool1                     199164                  0.00080 
                res2a_branch2a            270125                  0.00108 
                res2a_branch2b            269946                  0.00108 
                res2a                     102255                  0.00041 
                res2b_branch2a            269792                  0.00108 
                res2b_branch2b            269902                  0.00108 
                res2b                     102695                  0.00041 
                res3a_branch1             155120                  0.00062 
                res3a_branch2a            156480                  0.00063 
                res3a_branch2b            244913                  0.00098 
                res3a                      51456                  0.00021 
                res3b_branch2a            245366                  0.00098 
                res3b_branch2b            245123                  0.00098 
                res3b                      51286                  0.00021 
                res4a_branch1             135535                  0.00054 
                res4a_branch2a            136117                  0.00054 
                res4a_branch2b            238454                  0.00095 
                res4a                      25602                  0.00010 
                res4b_branch2a            237909                  0.00095 
                res4b_branch2b            238282                  0.00095 
                res4b                      26742                  0.00011 
                res5a_branch1             324642                  0.00130 
                res5a_branch2a            325897                  0.00130 
                res5a_branch2b            623521                  0.00249 
                res5a                      13881                  0.00006 
                res5b_branch2a            624028                  0.00250 
                res5b_branch2b            624631                  0.00250 
                res5b                      13051                  0.00005 
                pool5                      37083                  0.00015 
                new_fc                     17764                  0.00007 

[val,idx] = max(prediction);
dlquantObj.NetworkObject.Layers(end).ClassNames{idx}

Performance Comparison

Compare the performance of the quantized network to the performance of the single data type network.

optionsFPGA = dlquantizationOptions('Bitstream','zcu102_int8','Target',hTarget);
predictionFPGA = validate(dlquantObj,imdsValidation,optionsFPGA)

Compiling network for Deep Learning FPGA prototyping ...

Targeting FPGA bitstream zcu102_int8.

Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'

The network includes the following layers:

 1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
 2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
 3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
 4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
 5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
 6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
 7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
 8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
 9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
50   'prob'                  Softmax                      softmax                                                               (SW Layer)
51   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                              

Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.

Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.

Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.

Compiling layer group: conv1>>pool1 ...

Compiling layer group: conv1>>pool1 ... complete.

Compiling layer group: res2a_branch2a>>res2a_branch2b ...

Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.

Compiling layer group: res2b_branch2a>>res2b_branch2b ...

Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.

Compiling layer group: res3a_branch1 ...

Compiling layer group: res3a_branch1 ... complete.

Compiling layer group: res3a_branch2a>>res3a_branch2b ...

Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.

Compiling layer group: res3b_branch2a>>res3b_branch2b ...

Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.

Compiling layer group: res4a_branch1 ...

Compiling layer group: res4a_branch1 ... complete.

Compiling layer group: res4a_branch2a>>res4a_branch2b ...

Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.

Compiling layer group: res4b_branch2a>>res4b_branch2b ...

Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.

Compiling layer group: res5a_branch1 ...

Compiling layer group: res5a_branch1 ... complete.

Compiling layer group: res5a_branch2a>>res5a_branch2b ...

Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.

Compiling layer group: res5b_branch2a>>res5b_branch2b ...

Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.

Compiling layer group: pool5 ...

Compiling layer group: pool5 ... complete.

Compiling layer group: new_fc ...

Compiling layer group: new_fc ... complete.

Allocating external memory buffers:

      offset_name          offset_address    allocated_space 
_______________________    ______________    ________________

"InputDataOffset"           "0x00000000"     "12.0 MB"       
"OutputResultOffset"        "0x00c00000"     "4.0 MB"        
"SchedulerDataOffset"       "0x01000000"     "4.0 MB"        
"SystemBufferOffset"        "0x01400000"     "28.0 MB"       
"InstructionDataOffset"     "0x03000000"     "4.0 MB"        
"ConvWeightDataOffset"      "0x03400000"     "16.0 MB"       
"FCWeightDataOffset"        "0x04400000"     "4.0 MB"        
"EndOffset"                 "0x04800000"     "Total: 72.0 MB"

Network compilation complete.

FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.

Loading weights to Conv Processor.

Conv Weights loaded. Current time is 21-Dec-2022 10:46:36

Loading weights to FC Processor.

FC Weights loaded. Current time is 21-Dec-2022 10:46:36

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

Finished writing input activations.

Running single input activation.

          Deep Learning Processor Bitstream Build Info

        Resource              Utilized        Total       Percentage
        LUTs (CLB/ALM)*         249703       274080            91.11
        DSPs                       391         2520            15.52
        Block RAM                  583          912            63.93

Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'

Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.

The network includes the following layers:

 1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
 2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
 3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
 4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
 5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
 6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
 7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
 8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
 9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
50   'prob'                  Softmax                      softmax                                                               (SW Layer)
51   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                              

Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.

Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.

          Deep Learning Processor Estimator Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

              Network                   23502752                  0.10683                       1        23502752              9.4
                data_norm_add             210750                  0.00096 
                data_norm                 210750                  0.00096 
                conv1                    2164124                  0.00984 
                pool1                     515064                  0.00234 
                res2a_branch2a            966221                  0.00439 
                res2a_branch2b            966221                  0.00439 
                res2a                     210750                  0.00096 
                res2b_branch2a            966221                  0.00439 
                res2b_branch2b            966221                  0.00439 
                res2b                     210750                  0.00096 
                res3a_branch1             540861                  0.00246 
                res3a_branch2a            540749                  0.00246 
                res3a_branch2b            919117                  0.00418 
                res3a                     105404                  0.00048 
                res3b_branch2a            919117                  0.00418 
                res3b_branch2b            919117                  0.00418 
                res3b                     105404                  0.00048 
                res4a_branch1             503405                  0.00229 
                res4a_branch2a            509261                  0.00231 
                res4a_branch2b            905421                  0.00412 
                res4a                      52724                  0.00024 
                res4b_branch2a            905421                  0.00412 
                res4b_branch2b            905421                  0.00412 
                res4b                      52724                  0.00024 
                res5a_branch1            1039437                  0.00472 
                res5a_branch2a           1046605                  0.00476 
                res5a_branch2b           2005197                  0.00911 
                res5a                      26368                  0.00012 
                res5b_branch2a           2005197                  0.00911 
                res5b_branch2b           2005197                  0.00911 
                res5b                      26368                  0.00012 
                pool5                      54594                  0.00025 
                new_fc                     22571                  0.00010 

        Resource              Utilized        Total       Percentage
        LUTs (CLB/ALM)*         168099       274080            61.33
        DSPs                       807         2520            32.02
        Block RAM                  453          912            49.67

Finished writing input activations.

Running single input activation.

predictionFPGA = struct with fields:
       NumSamples: 20
    MetricResults: [1×1 struct]
       Statistics: [2×7 table]

View the frames-per-second performance for the quantized network and the single-data-type network. The quantized network has a performance of 33.8 frames per second compared to 9.2 frames per second for the single-data-type network. You can use quantization to improve your frames-per-second performance; however, you might lose accuracy when you quantize your networks.

predictionFPGA.Statistics.FramesPerSecond

Version History

Introduced in R2020b