Defect Detection

This example shows how to deploy a custom trained series network to detect defects in objects such as hexagon nuts. The custom networks were trained by using transfer learning. Transfer learning is commonly used in deep learning applications: you take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network from scratch with randomly initialized weights, and you can quickly transfer learned features to a new task by using a smaller number of training images. This example uses two trained series networks, trainedDefNet.mat and trainedBlemDetNet.mat.

Prerequisites

Load Pretrained Networks

Load the custom pretrained series network trainedDefNet.

if ~isfile('trainedDefNet.mat')
    url = 'https://www.mathworks.com/supportfiles/dlhdl/trainedDefNet.mat';
    websave('trainedDefNet.mat',url);
end
net1 = load('trainedDefNet.mat');
snet_defnet = net1.custom_alexnet

snet_defnet = SeriesNetwork with properties:

     Layers: [25×1 nnet.cnn.layer.Layer]
 InputNames: {'data'}
OutputNames: {'output'}

Analyze the network. analyzeNetwork displays an interactive plot of the network architecture and a table containing information about the network layers.

analyzeNetwork(snet_defnet)  

Load the network snet_blemdetnet.

if ~isfile('trainedBlemDetNet.mat')
    url = 'https://www.mathworks.com/supportfiles/dlhdl/trainedBlemDetNet.mat';
    websave('trainedBlemDetNet.mat',url);
end
net2 = load('trainedBlemDetNet.mat');
snet_blemdetnet = net2.convnet

snet_blemdetnet = SeriesNetwork with properties:

     Layers: [12×1 nnet.cnn.layer.Layer]
 InputNames: {'imageinput'}
OutputNames: {'classoutput'}

Analyze the network. analyzeNetwork displays an interactive plot of the network architecture and a table containing information about the network layers.

analyzeNetwork(snet_blemdetnet)

Create Target Object

Create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use the JTAG connection, install the Xilinx Vivado® Design Suite 2020.2.

Set the Xilinx Vivado toolpath.

hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');

hT = dlhdl.Target('Xilinx','Interface','Ethernet')

hT = Target with properties:

   Vendor: 'Xilinx'
Interface: Ethernet
IPAddress: '192.168.1.101'
 Username: 'root'
     Port: 22

Generate Bitstream to Run Network

The defect detection network contains multiple Cross Channel Normalization layers. To support these layers on hardware, enable the 'LRNBlockGeneration' property of the conv module in the bitstream used for FPGA inference. The shipping zcu102_single bitstream does not have this property enabled. Generate a new bitstream by using the following lines of code. You can use the generated bitstream with a dlhdl.Workflow object for inference.

When you create a dlhdl.ProcessorConfig object for an existing shipping bitstream, make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the data type is single. Update the processor configuration with 'LRNBlockGeneration' turned on and 'SegmentationBlockGeneration' turned off. Turning the latter off fits the deep learning IP on the FPGA and avoids overutilization of resources.

hPC = dlhdl.ProcessorConfig('Bitstream', 'zcu102_single');
hPC.setModuleProperty('conv', 'LRNBlockGeneration', 'on');
hPC.setModuleProperty('conv', 'SegmentationBlockGeneration', 'off');
dlhdl.buildProcessor(hPC)

To learn how to use the generated bitstream file, see Generate Custom Bitstream.

Create Workflow Object for trainedDefNet Network

Create an object of the dlhdl.Workflow class. When you create the class, specify the network and the bitstream name. Make sure to use the generated bitstream which enables processing of Cross Channel Normalization layers on the FPGA. Specify the saved pretrained neural network, snet_defnet, as the network.

hW = dlhdl.Workflow('Network',snet_defnet,'Bitstream','dlprocessor.bit','Target',hT);

Compile trainedDefNet Series Network

Run the compile function of the dlhdl.Workflow object.
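A minimal sketch of this step, assuming hW is the workflow object created in the previous step and compile is called with its default options:

% Compile the network for the targeted bitstream (default options).
hW.compile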

Compiling network for Deep Learning FPGA prototyping ...

Targeting FPGA bitstream zcu102_single ...

The network includes the following layers:

 1   'data'     Image Input                   128×128×1 images with 'zerocenter' normalization                                  (SW Layer)
 2   'conv1'    Convolution                   96 11×11×1 convolutions with stride [4  4] and padding [0  0  0  0]               (HW Layer)
 3   'relu1'    ReLU                          ReLU                                                                              (HW Layer)
 4   'norm1'    Cross Channel Normalization   cross channel normalization with 5 channels per element                           (HW Layer)
 5   'pool1'    Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
 6   'conv2'    Grouped Convolution           2 groups of 128 5×5×48 convolutions with stride [1  1] and padding [2  2  2  2]   (HW Layer)
 7   'relu2'    ReLU                          ReLU                                                                              (HW Layer)
 8   'norm2'    Cross Channel Normalization   cross channel normalization with 5 channels per element                           (HW Layer)
 9   'pool2'    Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
10   'conv3'    Convolution                   384 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]              (HW Layer)
11   'relu3'    ReLU                          ReLU                                                                              (HW Layer)
12   'conv4'    Grouped Convolution           2 groups of 192 3×3×192 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
13   'relu4'    ReLU                          ReLU                                                                              (HW Layer)
14   'conv5'    Grouped Convolution           2 groups of 128 3×3×192 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
15   'relu5'    ReLU                          ReLU                                                                              (HW Layer)
16   'pool5'    Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
17   'fc6'      Fully Connected               4096 fully connected layer                                                        (HW Layer)
18   'relu6'    ReLU                          ReLU                                                                              (HW Layer)
19   'drop6'    Dropout                       50% dropout                                                                       (HW Layer)
20   'fc7'      Fully Connected               4096 fully connected layer                                                        (HW Layer)
21   'relu7'    ReLU                          ReLU                                                                              (HW Layer)
22   'drop7'    Dropout                       50% dropout                                                                       (HW Layer)
23   'fc8'      Fully Connected               2 fully connected layer                                                           (HW Layer)
24   'prob'     Softmax                       softmax                                                                           (SW Layer)
25   'output'   Classification Output         crossentropyex with classes 'ng' and 'ok'                                         (SW Layer)

3 Memory Regions created.

Skipping: data
Compiling leg: conv1>>pool5 ...
Compiling leg: conv1>>pool5 ... complete.
Compiling leg: fc6>>fc8 ...
Compiling leg: fc6>>fc8 ... complete.
Skipping: prob
Skipping: output
Creating Schedule...
Creating Schedule...complete.
Creating Status Table...
Creating Status Table...complete.
Emitting Schedule...
Emitting Schedule...complete.
Emitting Status Table...
Emitting Status Table...complete.

Allocating external memory buffers:

      offset_name          offset_address     allocated_space 
_______________________    ______________    _________________

"InputDataOffset"           "0x00000000"     "8.0 MB"         
"OutputResultOffset"        "0x00800000"     "4.0 MB"         
"SchedulerDataOffset"       "0x00c00000"     "4.0 MB"         
"SystemBufferOffset"        "0x01000000"     "28.0 MB"        
"InstructionDataOffset"     "0x02c00000"     "4.0 MB"         
"ConvWeightDataOffset"      "0x03000000"     "12.0 MB"        
"FCWeightDataOffset"        "0x03c00000"     "84.0 MB"        
"EndOffset"                 "0x09000000"     "Total: 144.0 MB"

Network compilation complete.

ans = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.
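A minimal sketch of this step, assuming hW is the workflow object created above:

% Program the FPGA and download the network weights and biases.
hW.deploy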

Programming FPGA Bitstream using Ethernet...

Downloading target FPGA device configuration over Ethernet to SD card ...

Copied /tmp/hdlcoder_rd to /mnt/hdlcoder_rd

Copying Bitstream hdlcoder_system.bit to /mnt/hdlcoder_rd

Set Bitstream to hdlcoder_rd/hdlcoder_system.bit

Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd

Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb

Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'

Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.

System is rebooting . . . . . .

Programming the FPGA bitstream has been completed successfully.

Loading weights to Conv Processor.

Conv Weights loaded. Current time is 16-Dec-2020 16:16:31

Loading weights to FC Processor.

20% finished, current time is 16-Dec-2020 16:16:32.

40% finished, current time is 16-Dec-2020 16:16:32.

60% finished, current time is 16-Dec-2020 16:16:33.

80% finished, current time is 16-Dec-2020 16:16:34.

FC Weights loaded. Current time is 16-Dec-2020 16:16:34

Run Prediction for One Image

Load an image from the attached testImages folder and resize the image to match the network image input layer dimensions. Run the predict function of the dlhdl.Workflow object to retrieve and display the defect prediction from the FPGA.

wi = uint32(320);
he = uint32(240);
ch = uint32(3);

filename = fullfile(pwd,'ng1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);

% Extract ROI for preprocessing
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);

% Row-major to column-major conversion
imgPacked2 = zeros([128,128,4],'uint8');
for c = 1:4
    for i = 1:128
        for j = 1:128
            imgPacked2(i,j,c) = imgPacked((i-1)*128 + (j-1) + (c-1)*128*128 + 1);
        end
    end
end

% Classify detected nuts by using CNN
scores = zeros(2,4);
for i = 1:num
     [scores(:,i), speed] = hW.predict(single(imgPacked2(:,:,i)),'Profile','on');
end

Finished writing input activations.

Running single input activations.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

Network               12231156                  0.05560                       1           12231156          18.0
    conv1               414021                  0.00188 
    norm1               172325                  0.00078 
    pool1                56747                  0.00026 
    conv2               654112                  0.00297 
    norm2               119403                  0.00054 
    pool2                43611                  0.00020 
    conv3               777446                  0.00353 
    conv4               595551                  0.00271 
    conv5               404425                  0.00184 
    pool5                17831                  0.00008 
    fc6                1759699                  0.00800 
    fc7                7030188                  0.03196 
    fc8                 185672                  0.00084 
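Each column of scores holds the softmax output for one detected nut. As a post-processing sketch (the labels cell array and the 'ng'/'ok' class order are taken from the classification output layer listed above), you can map each column to a predicted label:

% Map each score column to a class label ('ng'/'ok' order taken from the
% classification output layer of the network).
labels = {'ng','ok'};
for k = 1:num
    [~, idx] = max(scores(:,k));                   % winning class for detection k
    fprintf('Nut %d classified as %s\n', k, labels{idx});
end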

Create Workflow Object for trainedBlemDetNet Network

Create an object of the dlhdl.Workflow class. When you create the class, specify the network and the bitstream name. Make sure to use the generated bitstream, which enables processing of Cross Channel Normalization layers on the FPGA. Specify the saved pretrained neural network, snet_blemdetnet, as the network.

hW = dlhdl.Workflow('Network',snet_blemdetnet,'Bitstream','dlprocessor.bit','Target',hT)

Compile trainedBlemDetNet Series Network

Run the compile function of the dlhdl.Workflow object.
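A minimal sketch of this step, assuming hW is the workflow object created above:

% Compile the network for the targeted bitstream (default options).
hW.compile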

Compiling network for Deep Learning FPGA prototyping ...

Targeting FPGA bitstream zcu102_single ...

The network includes the following layers:

 1   'imageinput'    Image Input                   128×128×1 images with 'zerocenter' normalization                    (SW Layer)
 2   'conv_1'        Convolution                   20 5×5×1 convolutions with stride [1  1] and padding [0  0  0  0]   (HW Layer)
 3   'relu_1'        ReLU                          ReLU                                                                (HW Layer)
 4   'maxpool_1'     Max Pooling                   2×2 max pooling with stride [2  2] and padding [0  0  0  0]         (HW Layer)
 5   'crossnorm'     Cross Channel Normalization   cross channel normalization with 5 channels per element             (HW Layer)
 6   'conv_2'        Convolution                   20 5×5×20 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
 7   'relu_2'        ReLU                          ReLU                                                                (HW Layer)
 8   'maxpool_2'     Max Pooling                   2×2 max pooling with stride [2  2] and padding [0  0  0  0]         (HW Layer)
 9   'fc_1'          Fully Connected               512 fully connected layer                                           (HW Layer)
10   'fc_2'          Fully Connected               2 fully connected layer                                             (HW Layer)
11   'softmax'       Softmax                       softmax                                                             (SW Layer)
12   'classoutput'   Classification Output         crossentropyex with classes 'ng' and 'ok'                           (SW Layer)

3 Memory Regions created.

Skipping: imageinput
Compiling leg: conv_1>>maxpool_2 ...
Compiling leg: conv_1>>maxpool_2 ... complete.
Compiling leg: fc_1>>fc_2 ...
Compiling leg: fc_1>>fc_2 ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
Creating Schedule...complete.
Creating Status Table...
Creating Status Table...complete.
Emitting Schedule...
Emitting Schedule...complete.
Emitting Status Table...
Emitting Status Table...complete.

Allocating external memory buffers:

      offset_name          offset_address    allocated_space 
_______________________    ______________    ________________

"InputDataOffset"           "0x00000000"     "8.0 MB"        
"OutputResultOffset"        "0x00800000"     "4.0 MB"        
"SchedulerDataOffset"       "0x00c00000"     "4.0 MB"        
"SystemBufferOffset"        "0x01000000"     "28.0 MB"       
"InstructionDataOffset"     "0x02c00000"     "4.0 MB"        
"ConvWeightDataOffset"      "0x03000000"     "4.0 MB"        
"FCWeightDataOffset"        "0x03400000"     "36.0 MB"       
"EndOffset"                 "0x05800000"     "Total: 88.0 MB"

Network compilation complete.

ans = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.
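A minimal sketch of this step, assuming hW is the workflow object created above:

% Program the FPGA (skipped if the same bitstream is already loaded) and
% download the network weights and biases.
hW.deploy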

FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.

Loading weights to Conv Processor.

Conv Weights loaded. Current time is 16-Dec-2020 16:16:47

Loading weights to FC Processor.

50% finished, current time is 16-Dec-2020 16:16:48.

FC Weights loaded. Current time is 16-Dec-2020 16:16:48

Run Prediction for One Image

Load an image from the attached testImages folder and resize the image to match the network image input layer dimensions. Run the predict function of the dlhdl.Workflow object to retrieve and display the defect prediction from the FPGA.

wi = uint32(320);
he = uint32(240);
ch = uint32(3);

filename = fullfile(pwd,'ok1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);

% Extract ROI for preprocessing
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);

% Row-major to column-major conversion
imgPacked2 = zeros([128,128,4],'uint8');
for c = 1:4
    for i = 1:128
        for j = 1:128
            imgPacked2(i,j,c) = imgPacked((i-1)*128 + (j-1) + (c-1)*128*128 + 1);
        end
    end
end

% classify detected nuts by using CNN
scores = zeros(2,4);
for i = 1:num
     [scores(:,i), speed] = hW.predict(single(imgPacked2(:,:,i)),'Profile','on');
end

Finished writing input activations.

Running single input activations.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

Network                4892622                  0.02224                       1            4892622          45.0
    conv_1              467921                  0.00213 
    maxpool_1           188086                  0.00085 
    crossnorm           159500                  0.00072 
    conv_2              397561                  0.00181 
    maxpool_2            41455                  0.00019 
    fc_1               3614625                  0.01643 
    fc_2                 23355                  0.00011 

Quantize and Deploy trainedBlemDetNet Network

The deployed trainedBlemDetNet network runs at 45 frames per second. The target performance of the deployed network is 100 frames per second while staying within the target resource utilization budget. The resource utilization budget takes into consideration parameters such as memory size and on-board IO. While you can increase the resource utilization budget by choosing a larger board, doing so increases the cost. Instead, improve the deployed network performance and stay within the resource utilization budget by quantizing the network. Quantize and deploy the trainedBlemDetNet network.

Load the data set as an image datastore. The imageDatastore labels the images based on folder names and stores the data. Divide the data into calibration and validation data sets. Use 50% of the images for calibration and 50% of the images for validation. Expedite the calibration and validation process by using a subset of the calibration and validation image sets.

if ~isfile('dataSet.zip')
    url = 'https://www.mathworks.com/supportfiles/dlhdl/dataSet.zip';
    websave('dataSet.zip',url);
end
unzip('dataSet.zip')
imageData = imageDatastore(fullfile('dataSet'),...
    'IncludeSubfolders',true,'FileExtensions','.PNG','LabelSource','foldernames');
[calibrationData, validationData] = splitEachLabel(imageData, 0.5,'randomized');
calibrationData_reduced = calibrationData.subset(1:20);
validationData_reduced = validationData.subset(1:1);

Create a quantized network by using the dlquantizer object. Set the target execution environment to FPGA.

dlQuantObj = dlquantizer(snet_blemdetnet,'ExecutionEnvironment','FPGA')

dlQuantObj = dlquantizer with properties:

       NetworkObject: [1×1 SeriesNetwork]
ExecutionEnvironment: 'FPGA'

Use the calibrate function to exercise the network by using sample inputs and collect the range information. The calibrate function exercises the network and collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network and the dynamic ranges of the activations in all layers of the network. The calibrate function returns a table. Each row of the table contains range information for a learnable parameter of the quantized network.

dlQuantObj.calibrate(calibrationData_reduced)

ans = 21×5 table

     Optimized Layer Name            Network Layer Name    Learnables / Activations    MinValue      MaxValue
____________________________      __________________    ________________________    __________    _________

{'conv_1_Weights'          }      {'conv_1'    }           "Weights"                -0.29022      0.21403
{'conv_1_Bias'             }      {'conv_1'    }           "Bias"                  -0.021907    0.0053595
{'conv_2_Weights'          }      {'conv_2'    }           "Weights"                -0.10499      0.13732
{'conv_2_Bias'             }      {'conv_2'    }           "Bias"                  -0.010084     0.025773
{'fc_1_Weights'            }      {'fc_1'      }           "Weights"               -0.051599     0.054506
{'fc_1_Bias'               }      {'fc_1'      }           "Bias"                 -0.0048897    0.0072463
{'fc_2_Weights'            }      {'fc_2'      }           "Weights"               -0.071356     0.064882
{'fc_2_Bias'               }      {'fc_2'      }           "Bias"                  -0.062086     0.062084
{'imageinput'              }      {'imageinput'}           "Activations"                   0          255
{'imageinput_normalization'}      {'imageinput'}           "Activations"             -184.37       241.75
{'conv_1'                  }      {'conv_1'    }           "Activations"             -112.18       150.51
{'relu_1'                  }      {'relu_1'    }           "Activations"                   0       150.51
{'maxpool_1'               }      {'maxpool_1' }           "Activations"                   0       150.51
{'crossnorm'               }      {'crossnorm' }           "Activations"                   0       113.27
{'conv_2'                  }      {'conv_2'    }           "Activations"             -117.79       67.125
{'relu_2'                  }      {'relu_2'    }           "Activations"                   0       67.125
  ⋮

The trainedBlemDetNet network contains a Cross Channel Normalization layer. To support this layer on hardware, enable the 'LRNBlockGeneration' property of the conv module in the bitstream used for FPGA inference. The shipping zcu102_int8 bitstream does not have this property enabled. Generate a new bitstream by using the following lines of code. You can use the generated bitstream with a dlhdl.Workflow object for inference.

When you create a dlhdl.ProcessorConfig object for an existing shipping bitstream, make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the data type is int8. Update the processor configuration with 'LRNBlockGeneration' turned on and 'SegmentationBlockGeneration' turned off. Turning the latter off fits the deep learning IP on the FPGA and avoids overutilization of resources.

% hPC = dlhdl.ProcessorConfig('Bitstream', 'zcu102_int8');
% hPC.setModuleProperty('conv', 'LRNBlockGeneration', 'on');
% hPC.setModuleProperty('conv', 'SegmentationBlockGeneration', 'off');
% dlhdl.buildProcessor(hPC)

To learn how to use the generated bitstream file, see Generate Custom Bitstream.

Create an object of the dlhdl.Workflow class. When you create the class, specify the network and the bitstream name. Make sure to use this newly generated bitstream, which enables processing of Cross Channel Normalization layers on the FPGA. Specify the quantized network object dlQuantObj as the network.

hW = dlhdl.Workflow('Network', dlQuantObj, 'Bitstream', 'dlprocessor.bit','Target',hT);

To compile the quantized network, run the compile function of the dlhdl.Workflow object.

hW.compile('InputFrameNumberLimit',30)

Compiling network for Deep Learning FPGA prototyping ...

Targeting FPGA bitstream zcu102_int8 ...

The network includes the following layers:

 1   'imageinput'    Image Input                   128×128×1 images with 'zerocenter' normalization                    (SW Layer)
 2   'conv_1'        Convolution                   20 5×5×1 convolutions with stride [1  1] and padding [0  0  0  0]   (HW Layer)
 3   'relu_1'        ReLU                          ReLU                                                                (HW Layer)
 4   'maxpool_1'     Max Pooling                   2×2 max pooling with stride [2  2] and padding [0  0  0  0]         (HW Layer)
 5   'crossnorm'     Cross Channel Normalization   cross channel normalization with 5 channels per element             (HW Layer)
 6   'conv_2'        Convolution                   20 5×5×20 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
 7   'relu_2'        ReLU                          ReLU                                                                (HW Layer)
 8   'maxpool_2'     Max Pooling                   2×2 max pooling with stride [2  2] and padding [0  0  0  0]         (HW Layer)
 9   'fc_1'          Fully Connected               512 fully connected layer                                           (HW Layer)
10   'fc_2'          Fully Connected               2 fully connected layer                                             (HW Layer)
11   'softmax'       Softmax                       softmax                                                             (SW Layer)
12   'classoutput'   Classification Output         crossentropyex with classes 'ng' and 'ok'                           (SW Layer)

3 Memory Regions created.

Skipping: imageinput
Compiling leg: conv_1>>maxpool_2 ...
Compiling leg: conv_1>>maxpool_2 ... complete.
Compiling leg: fc_1>>fc_2 ...
Compiling leg: fc_1>>fc_2 ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
Creating Schedule...complete.
Creating Status Table...
Creating Status Table...complete.
Emitting Schedule...
Emitting Schedule...complete.
Emitting Status Table...
Emitting Status Table...complete.

Allocating external memory buffers:

      offset_name          offset_address    allocated_space 
_______________________    ______________    ________________

"InputDataOffset"           "0x00000000"     "16.0 MB"       
"OutputResultOffset"        "0x01000000"     "4.0 MB"        
"SchedulerDataOffset"       "0x01400000"     "4.0 MB"        
"SystemBufferOffset"        "0x01800000"     "28.0 MB"       
"InstructionDataOffset"     "0x03400000"     "4.0 MB"        
"ConvWeightDataOffset"      "0x03800000"     "4.0 MB"        
"FCWeightDataOffset"        "0x03c00000"     "12.0 MB"       
"EndOffset"                 "0x04800000"     "Total: 72.0 MB"

Network compilation complete.

ans = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]

To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.
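A minimal sketch of this step, assuming hW is the workflow object created for the quantized network:

% Program the FPGA and download the quantized network weights and biases.
hW.deploy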

Programming FPGA Bitstream using Ethernet...

Downloading target FPGA device configuration over Ethernet to SD card ...

Copied /tmp/hdlcoder_rd to /mnt/hdlcoder_rd

Copying Bitstream hdlcoder_system.bit to /mnt/hdlcoder_rd

Set Bitstream to hdlcoder_rd/hdlcoder_system.bit

Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd

Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb

Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'

Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.

System is rebooting . . . . . .

Programming the FPGA bitstream has been completed successfully.

Loading weights to Conv Processor.

Conv Weights loaded. Current time is 16-Dec-2020 16:18:03

Loading weights to FC Processor.

FC Weights loaded. Current time is 16-Dec-2020 16:18:03

Load an image from the attached testImages folder and resize the image to match the network image input layer dimensions. Run the predict function of the dlhdl.Workflow object to retrieve and display the defect prediction from the FPGA.

wi = uint32(320);
he = uint32(240);
ch = uint32(3);

filename = fullfile(pwd,'ok1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);

% Extract ROI for preprocessing
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);

% row-major > column-major conversion
imgPacked2 = zeros([128,128,4],'uint8');
for c = 1:4
    for i = 1:128
        for j = 1:128
            imgPacked2(i,j,c) = imgPacked((i-1)*128 + (j-1) + (c-1)*128*128 + 1);
        end
    end
end

% classify detected nuts by using CNN
scores = zeros(2,4);
for i = 1:num
     [scores(:,i), speed] = hW.predict(single(imgPacked2(:,:,i)),'Profile','on');
end

Finished writing input activations.

Running single input activations.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

Network                1754969                  0.00798                       1            1754969         125.4
    conv_1              271340                  0.00123 
    maxpool_1            87533                  0.00040 
    crossnorm           125737                  0.00057 
    conv_2              149972                  0.00068 
    maxpool_2            19657                  0.00009 
    fc_1               1085683                  0.00493 
    fc_2                 14928                  0.00007 

To test that the quantized network can identify all test cases, load an additional image, resize it to match the network image input layer dimensions, and run the predict function of the dlhdl.Workflow object to retrieve and display the defect predictions from the FPGA.

wi = uint32(320);
he = uint32(240);
ch = uint32(3);

filename = fullfile(pwd,'okng.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);

% Extract ROI for preprocessing
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);

% row-major > column-major conversion
imgPacked2 = zeros([128,128,4],'uint8');
for c = 1:4
    for i = 1:128
        for j = 1:128
            imgPacked2(i,j,c) = imgPacked((i-1)*128 + (j-1) + (c-1)*128*128 + 1);
        end
    end
end

% classify detected nuts by using CNN
scores = zeros(2,4);
for i = 1:num
     [scores(:,i), speed] = hW.predict(single(imgPacked2(:,:,i)),'Profile','on');
end

Finished writing input activations.

Running single input activations.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

Network                1754614                  0.00798                       1            1754614         125.4
    conv_1              271184                  0.00123 
    maxpool_1            87557                  0.00040 
    crossnorm           125768                  0.00057 
    conv_2              149819                  0.00068 
    maxpool_2            19602                  0.00009 
    fc_1               1085664                  0.00493 
    fc_2                 14930                  0.00007 

Finished writing input activations.

Running single input activations.

          Deep Learning Processor Profiler Performance Results

               LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                     -------------             -------------              ---------        ---------       ---------

Network                1754486                  0.00797                       1            1754486         125.4
    conv_1              271014                  0.00123 
    maxpool_1            87662                  0.00040 
    crossnorm           125835                  0.00057 
    conv_2              149789                  0.00068 
    maxpool_2            19661                  0.00009 
    fc_1               1085505                  0.00493 
    fc_2                 14930                  0.00007 
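To visualize the result, you can overlay the predicted label for each detection on the original frame. The following is a sketch, not one of the shipped example helpers: it assumes that Iori returned by myNDNet_Preprocess is a standard MATLAB image, that bbox is a num-by-4 matrix of [x y width height] rectangles, and that Computer Vision Toolbox is available for insertObjectAnnotation.

% Sketch: annotate each detected nut with its predicted class label.
% Assumes Iori is a displayable image and bbox holds [x y width height] rows.
labels = {'ng','ok'};
Iannotated = Iori;
for k = 1:num
    [~, idx] = max(scores(:,k));                   % winning class for detection k
    Iannotated = insertObjectAnnotation(Iannotated, ...
        'rectangle', bbox(k,:), labels{idx});
end
figure; imshow(Iannotated)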

Quantizing the network improves the performance from 45 frames per second to 125 frames per second and reduces the deployed network size from 88 MB to 72 MB.

See Also

dlhdl.Workflow | dlhdl.Target | compile | deploy | predict | Deep Network Designer
