Slide 1

Slide 1 text

Facial Expression Recognition System Using DNN Accelerator with Multi-threading on FPGA
Takuto Ando (1), Yusuke Inoue (2)
(1) Electrical, Electronics Information Engineering Major, National Institute of Technology, Oita College Advanced Course
(2) Department of Information Engineering, National Institute of Technology, Oita College
9th International Workshop on GPU Computing and AI (GCA’24), November 28, 2024

Slide 2

Slide 2 text

1/37 Outline

Slide 3

Slide 3 text

2/37 Outline

Slide 4

Slide 4 text

3/37 Facial expression and robot
Facial expression: effective non-verbal communication
- Human-to-human: conveys emotions and intentions
- Human-computer: enables intuitive interactions
Applied to pet robots and medical robots
Figure label: “Happy”

Slide 5

Slide 5 text

4/37 What is facial expression recognition?
Robots need to understand emotions
Ekman's basic emotions classify them into six types [1]: Anger, Disgust, Fear, Happy, Sad, Surprise
Figure: Ekman’s six basic emotions
[1] P. Ekman and W. V. Friesen, ‘‘Constants across cultures in the face and emotion,’’ J. Pers. Soc. Psychol., vol. 17, no. 2, pp. 124–129, 1971.

Slide 6

Slide 6 text

5/37 Implementation on robot
Installation on battery-powered robots
- Low-power processing for longer operation
- DNN recognition needs a high-performance unit (e.g., GPU)
GPU provides high performance BUT high power consumption
FPGA implementation: a balanced solution with low power consumption and high computing performance
FPGA board: Xilinx Kria KV260

Slide 7

Slide 7 text

6/37 Previous work
Vinh et al. implemented a facial expression recognition system using an SoC FPGA [2]
Board: DE10-Standard (CPU + FPGA)
Processing flow: camera image (input) → (1) face detection (non-CNN) on the processor → face image → (2) facial expression recognition (CNN) on the FPGA → inference result (“Happy”)
[2] Pham The Vinh and Truong Quang Vinh. Facial expression recognition system on SoC FPGA. In 2019 International Symposium on Electrical and Electronics Engineering (ISEE), pp. 1–4. IEEE, 2019.

Slide 8

Slide 8 text

7/37 In this work
Previous work (FPGA + processor)
- CPU: overall control, FD (non-DNN)
- FPGA ACC: FER (DNN)
Our work (FPGA + processor)
- CPU: overall control
- DPU (on FPGA): FD (DNN), FER (DNN)
ACC: accelerator, DPU: Deep learning Processing Unit, FD: Face Detection, FER: Facial Expression Recognition

Slide 9

Slide 9 text

8/37 Objectives
System implementation and evaluation
- Offloaded two DNN inferences to the FPGA: face detection and facial expression recognition
Running two DNN models on the same DPU
- Improve DPU utilization efficiency with multi-threading
- Achieve high throughput and low power consumption

Slide 10

Slide 10 text

9/37 Outline

Slide 11

Slide 11 text

10/37 Two DNN models
01 Face detection: DenseBox
- Input: 640 x 460 x 3
- Output: coordinates of the face region (x, y, w, h)
02 Facial expression recognition: CNN
- Input: 48 x 48 x 1
- Output: label of the expression class (7 categories), e.g., HAPPY
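The two models form a cascade: the detector's (x, y, w, h) output selects the region that is shrunk to 48 x 48 x 1 for the classifier. The sketch below illustrates only that data flow, under stated assumptions: the label order, the channel-average grayscale conversion, the nearest-neighbour resize, and the stand-in `fake_detector` / `fake_classifier` functions are all hypothetical and are not taken from the slides.

```python
# Hypothetical label order; the slides only state that there are 7 categories.
LABELS = ["Anger", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]


def crop_and_resize(frame, box, size=48):
    """Crop box=(x, y, w, h) from a frame (rows of (r, g, b) tuples) and
    return a size x size grayscale patch using nearest-neighbour sampling."""
    x, y, w, h = box
    patch = []
    for i in range(size):
        src_row = frame[y + i * h // size]
        row = []
        for j in range(size):
            r, g, b = src_row[x + j * w // size]
            row.append((r + g + b) / 3.0)  # channel average as a grayscale stand-in
        patch.append(row)
    return patch


def recognize(frame, detect_faces, classify):
    """Run face detection, then classify every detected face region."""
    results = []
    for box in detect_faces(frame):
        scores = classify(crop_and_resize(frame, box))
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((box, LABELS[best]))
    return results


# Stand-in models so the sketch runs without a DPU.
def fake_detector(frame):
    return [(100, 80, 120, 120)]


def fake_classifier(patch):
    scores = [0.0] * len(LABELS)
    scores[LABELS.index("Happy")] = 1.0
    return scores


frame = [[(0, 0, 0)] * 640 for _ in range(460)]  # 640 x 460 x 3 input
print(recognize(frame, fake_detector, fake_classifier))
```

In the real system both `detect_faces` and `classify` would be DPU inferences; here they are plain Python callables so that the cropping and label lookup can be checked in isolation.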

Slide 12

Slide 12 text

11/37 Face detection model
DenseBox [3]: face detection model provided by Xilinx
- Lightweight and simple network
- The WIDER FACE dataset [4] was used for training
Figure: face detection example using DenseBox [3]
[3] Vitis AI Library User Guide UG1354 (v3.5), June 29, 2023.
[4] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5533, 2016.

Slide 13

Slide 13 text

12/37 Facial expression recognition model
Guarniz's CNN architecture [5]
- Feature extraction with 4 repeated blocks (convolution, batch normalization, activation, pooling, dropout layers)
- Trained using the FER-2013 dataset
[5] https://github.com/kckeiks/Facial-Expression-Recognition-2018/tree/master (Accessed on 2024/9/1)

Slide 14

Slide 14 text

13/37 FER-2013 [6]
- 7 facial expression labels (Ekman's basic emotions + "Neutral"): Anger, Disgust, Fear, Happy, Sadness, Surprise, Neutral
- Face images: 48 x 48 pixels
- Training set: approx. 27,000 images
- Test set: approx. 3,500 images
[6] Carrier, P.L.; Courville, A.; Goodfellow, I.J.; Mirza, M.; Bengio, Y. FER-2013 Face Database; Université de Montréal: Montreal, QC, Canada, 2013.

Slide 15

Slide 15 text

14/37 Hardware configuration
Input: camera image
- CPU: overall system control
- DPU (on FPGA): (1) FD with the DenseBox model, which outputs face coordinates, and (2) FER with the CNN model, which outputs the expression (e.g., “Happy”)

Slide 16

Slide 16 text

15/37 What is “DPU”?
DPU: Deep learning Processing Unit
- Implemented on the FPGA as a CNN accelerator
- Provided by Xilinx
- Multiple DNN models are executed in time division on the same DPU

Slide 17

Slide 17 text

16/37 What is “DPU”?
Purposeful use is possible
- Select an architecture from B512 to B4096 for each application
- The higher the number, the higher the processing performance; the smaller sizes are the more lightweight circuits
FPGA resources for different DPU sizes
DPU Architecture   LUT     Register   Block RAM   DSP
B512               26922   34543      72          118
B1024              34074   48057      104         230
B2304              42127   68829      165         438
B4096              52161   98249      255         710
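To make the size trade-off concrete, the resource figures above can be normalized to the smallest configuration. This is only arithmetic on the numbers in the slide; the dictionary layout and the function name `relative_to` are illustrative choices.

```python
# (LUT, Register, Block RAM, DSP) per DPU size, copied from the slide.
DPU_RESOURCES = {
    "B512": (26922, 34543, 72, 118),
    "B1024": (34074, 48057, 104, 230),
    "B2304": (42127, 68829, 165, 438),
    "B4096": (52161, 98249, 255, 710),
}
COLUMNS = ("LUT", "Register", "Block RAM", "DSP")


def relative_to(baseline="B512"):
    """Return each configuration's resources as a multiple of the baseline."""
    base = DPU_RESOURCES[baseline]
    return {
        name: tuple(round(value / ref, 2) for value, ref in zip(values, base))
        for name, values in DPU_RESOURCES.items()
    }


for name, ratios in relative_to().items():
    cells = ", ".join(f"{col} x{r}" for col, r in zip(COLUMNS, ratios))
    print(f"{name}: {cells}")
```

The DSP count grows fastest with DPU size, while LUT usage grows more slowly, so the DSP budget of the target device is a likely limiting factor when choosing a larger configuration.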

Slide 18

Slide 18 text

17/37 Multi-threading strategy
Utilization efficiency of the DPU in single-threading
- The DPU has idle time until it is given instructions
- This method does not utilize the DPU efficiently
Figure: single-threaded timeline of CPU and DPU. For each input frame the CPU processes the FD step and the FER step in turn, and the DPU waits before FD inference, before FER inference, and before the next frame is processed. The idle time is long.
FD: Face Detection, FER: Facial Expression Recognition

Slide 19

Slide 19 text

18/37 Multi-threading strategy
Utilization efficiency of the DPU in multi-threading
- Increased frequency of tasks assigned to the DPU
- Reduced waiting time, enabling more efficient DPU operation
Figure: multi-threaded timeline of CPU Thread 1, CPU Thread 2, CPU Thread 3 (an input-only thread and main processing threads) and the DPU, with each task labeled by frame number. The idle time is short.
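The threading structure in the figure can be sketched with the Python standard library. This is a minimal illustration, not the authors' implementation: it assumes one input-only thread and two main processing threads, simulates the DPU with a lock plus `time.sleep`, and uses placeholder durations that are not measurements from the slides.

```python
import queue
import threading
import time

dpu_lock = threading.Lock()  # only one job runs on the (simulated) DPU at a time


def run_on_dpu(seconds):
    """Simulate a DPU inference; a real system would call the DPU runtime here."""
    with dpu_lock:
        time.sleep(seconds)


def input_thread(frames, n_frames, n_workers):
    """Input-only thread: enqueue frame ids, then one stop marker per worker."""
    for frame_id in range(n_frames):
        frames.put(frame_id)
    for _ in range(n_workers):
        frames.put(None)


def worker(frames, results):
    """Main processing thread: CPU steps interleaved with DPU inferences."""
    while True:
        frame_id = frames.get()
        if frame_id is None:
            break
        time.sleep(0.002)  # CPU pre-processing (placeholder duration)
        run_on_dpu(0.002)  # face detection (placeholder duration)
        time.sleep(0.002)  # CPU crop and resize (placeholder duration)
        run_on_dpu(0.001)  # facial expression recognition (placeholder duration)
        results.put(frame_id)


def process(n_frames=20, n_workers=2):
    frames, results = queue.Queue(maxsize=4), queue.Queue()
    threads = [threading.Thread(target=input_thread, args=(frames, n_frames, n_workers))]
    threads += [threading.Thread(target=worker, args=(frames, results)) for _ in range(n_workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sorted(results.get() for _ in range(n_frames)), elapsed


done, elapsed = process(n_frames=10, n_workers=2)
print(f"processed {len(done)} frames in {elapsed * 1000:.1f} ms")
```

While one worker is doing CPU-side work, the other can hold the DPU, which is the mechanism by which the idle time in the single-threaded timeline is reduced. Calling `process(n_workers=1)` gives a single-worker baseline for comparison.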

Slide 20

Slide 20 text

19/37 Outline

Slide 21

Slide 21 text

20/37 Experimental objectives
Assess the effectiveness of offloading DNN inference to the DPU through comparison with previous work
- Recognition performance evaluation for each DNN model: recognition accuracy and processing time
- System evaluation: overall system performance

Slide 22

Slide 22 text

21/37 Evaluation board
Target board: Xilinx Kria KV260 (integrates a CPU and FPGA on the same chip)
- CPU: ARM Cortex-A53
- FPGA: Xilinx UltraScale+
- DDR memory: 4 GB
- Operating frequency: 1.3 GHz

Slide 23

Slide 23 text

22/37 Hardware architecture
Hardware architecture overview [7]
- CPU and DPU communicate via AXI bus
- DPU operation is controlled by fetching instructions from off-chip memory
[7] https://docs.amd.com/r/en-US/pg338-dpu/Hardware Architecture

Slide 24

Slide 24 text

23/37 Evaluate each inference [8] [8] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2879–2886, 2012.

Slide 25

Slide 25 text

24/37 Evaluate each inference [8] [8] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2879–2886, 2012.

Slide 26

Slide 26 text

25/37 Accuracy of face detection
Comparison of the DNN model (this work) with the Haar Cascade detector (previous work)
Figure: comparison of accuracy with previous work, Average Precision (IoU=0.5)
- Previous work: 0.531
- This work: 0.917 (0.386 higher)
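Average Precision at IoU=0.5 counts a detection as correct when its box overlaps a ground-truth box with an intersection-over-union of at least 0.5. The following is a small sketch of that matching criterion only, using the (x, y, w, h) box format from the face detection slide; the function names are illustrative, and the full AP computation (ranking by confidence and integrating precision over recall) is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return iou(pred, truth) >= threshold


truth = (0, 0, 10, 10)
print(iou(truth, (0, 0, 10, 10)))               # identical boxes
print(iou(truth, (5, 0, 10, 10)))               # half-shifted box
print(is_true_positive((2, 0, 10, 10), truth))  # small shift
```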

Slide 27

Slide 27 text

26/37 Processing time of face detection
Comparison of the DNN model (this work) with the Haar Cascade detector (previous work)
Figure: comparison of average processing time with previous work, average processing time [ms]
- This work: 42.1
- Previous work: 798
Approx. 9.8 times faster

Slide 28

Slide 28 text

27/37 Evaluate each inference

Slide 29

Slide 29 text

28/37 Facial expression recognition results
Comparison with the CNN model from previous work
1. Recognition accuracy [%] (figure: comparison of accuracy with previous work)
- Previous work: 66, this work: 67.4; SOTA [3] is shown for reference
- Slightly higher than the previous work
- Lower than SOTA but acceptable
2. Processing time / image [ms] (figure: comparison of processing time with previous work)
- Average processing times: 7.34 and 6.36
- Previous work is about 1 ms faster
- Almost negligible relative to the face detection time
[3] L. Pham, T. H. Vu and T. A. Tran, "Facial Expression Recognition Using Residual Masking Network," 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 2021, pp. 4513-4519.

Slide 30

Slide 30 text

29/37 Overall system comparison
Evaluate throughput and power consumption
- Throughput: measure overall system throughput
- Power consumption (SoC): compare the idle state (7.8 W) with system runtime, measured using a KETOTEK KTEM02 digital energy meter connected to the board’s power socket

Slide 31

Slide 31 text

30/37 Overall system comparison
Evaluate throughput and power consumption
Throughput (real-time performance is achieved)
- Previous work: 11.67 FPS
- Our work (single-threading): 14.69 FPS
- Our work (multi-threading): 25.00 FPS
Figure: comparison of throughput per power consumption, throughput per watt [FPS/W]
- Previous work: 3.89
- Our work (single-threading): 6.12
- Our work (multi-threading): 9.26
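The slide reports throughput and throughput per watt but not the power values themselves. Dividing one by the other gives the power implied by each pair, as in the sketch below. These are derived values, not measurements from the slides. They come out well below the 7.8 W idle figure, which suggests (as an assumption) that power was taken as the increase over the idle state.

```python
# (throughput [FPS], throughput per watt [FPS/W]) copied from the slide.
RESULTS = {
    "Previous work": (11.67, 3.89),
    "Our work (single-threading)": (14.69, 6.12),
    "Our work (multi-threading)": (25.00, 9.26),
}


def implied_power_w(fps, fps_per_watt):
    """Power implied by a throughput and an efficiency figure (derived value)."""
    return fps / fps_per_watt


for name, (fps, efficiency) in RESULTS.items():
    print(f"{name}: {implied_power_w(fps, efficiency):.2f} W implied")

base_fps, base_eff = RESULTS["Previous work"]
multi_fps, multi_eff = RESULTS["Our work (multi-threading)"]
print(f"throughput gain: {multi_fps / base_fps:.2f}x")
print(f"efficiency gain: {multi_eff / base_eff:.2f}x")
```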

Slide 32

Slide 32 text

31/37 Compare circuit size
Compare FPGA resource utilization with previous work
- This system is slightly larger
- Two DNN inferences could be performed without a significant increase in circuit size
Comparison of FPGA resource utilization
                             ALM or LUT   DSP   BRAM
Previous work (Intel: ALM)   22,465       112   44
Our work (Xilinx: LUT)       27,023       118   12

Slide 33

Slide 33 text

32/37 Verify utilization efficiency of DPU
Calculate DPU processing time as a percentage of system runtime
DPU utilization improved by approximately 3.43x
Figure: DPU utilization comparison by threading, DPU utilization [%] for single-threading and multi-threading, broken down into facial expression recognition, face detection, and the percentage of DPU idle time
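The utilization metric described here is DPU processing time divided by system runtime. The sketch below shows one way to compute the three shares plotted in the figure. The function name and the example timings are hypothetical placeholders chosen for easy arithmetic and are not the measured values behind the 3.43x figure.

```python
def dpu_utilization(fd_ms, fer_ms, frames, runtime_ms):
    """Split system runtime into face detection, FER and DPU idle shares [%].

    fd_ms and fer_ms are per-frame DPU times; runtime_ms is the wall-clock
    time the whole system needed for the given number of frames.
    """
    fd_share = 100.0 * fd_ms * frames / runtime_ms
    fer_share = 100.0 * fer_ms * frames / runtime_ms
    return {
        "face_detection": fd_share,
        "facial_expression_recognition": fer_share,
        "idle": 100.0 - fd_share - fer_share,
    }


# Placeholder timings: the same DPU work finishing in less wall-clock time
# yields a proportionally higher utilization.
single = dpu_utilization(fd_ms=2.0, fer_ms=1.0, frames=10, runtime_ms=100.0)
multi = dpu_utilization(fd_ms=2.0, fer_ms=1.0, frames=10, runtime_ms=50.0)
print(single)
print(multi)
```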

Slide 34

Slide 34 text

33/37 Outline

Slide 35

Slide 35 text

34/37 Analysis of optimum operating frequency
Investigate the optimal DPU operating frequency
Throughput improves only slightly once the operating frequency exceeds 400 MHz
Figure: throughput per power consumption by frequency

Slide 36

Slide 36 text

35/37 Investigated optimal DPU size
Comparison made between B512 and larger DPUs at 400 MHz
For larger DPUs, the throughput gain is not worth the increase in circuit size
Figure: comparison of FPGA resources and performance with different DPU sizes (label: 9.26 FPS/W)

Slide 37

Slide 37 text

36/37 Outline

Slide 38

Slide 38 text

37/37 Conclusion
• We implemented a facial expression recognition system on the DPU
• We utilized a systolic array accelerator for time-division inference of two DNNs on the same DPU
• We proposed a multi-threaded system to improve throughput and DPU utilization efficiency
• Future work: reduce power consumption and optimize processing for real-world applications (e.g., face detection every few frames)

Slide 39

Slide 39 text

Appendix

Slide 40

Slide 40 text

39/37 Processing Times for System

Slide 41

Slide 41 text

40/37 Challenges of previous work
Haar Cascade detector running on CPU
- Advantage (〇): very lightweight
- Drawback (×): not robust for detecting oblique or sideways faces
- Drawback (×): very sensitive to lighting
Low accuracy with the Haar Cascade detector: facial expression recognition accuracy is poor because faces are not detected properly

Slide 42

Slide 42 text

41/37 Quantization & Compilation
- Quantization: converts the 32-bit floating-point type to an 8-bit integer type, which reduces the number of hardware operations
- Compilation: converts the model to a format executable by the DPU
Flow: 32-bit float model → 8-bit quantized model → 8-bit compiled model
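The float-to-integer step can be illustrated with a small sketch of symmetric 8-bit quantization. It assumes a power-of-two scale factor, which is a choice made for this example. The slides do not describe the quantization scheme beyond the bit widths, so the function names and the scheme itself are illustrative only.

```python
import math


def choose_frac_bits(values, bits=8):
    """Largest power-of-two scale (2**frac_bits) that keeps max |value| in range."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return bits - 1
    qmax = 2 ** (bits - 1) - 1
    return math.floor(math.log2(qmax / max_abs))


def quantize(values, frac_bits, bits=8):
    """Scale, round and clamp floats to signed integers of the given width."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = 2.0 ** frac_bits
    return [max(qmin, min(qmax, round(v * scale))) for v in values]


def dequantize(quantized, frac_bits):
    """Map integers back to floats to inspect the rounding error."""
    return [q / 2.0 ** frac_bits for q in quantized]


weights = [0.5, -1.0, 0.25, 0.9]
frac_bits = choose_frac_bits(weights)
q = quantize(weights, frac_bits)
print(frac_bits, q, dequantize(q, frac_bits))
```

Values that are exact multiples of the step size survive the round trip unchanged, while others (0.9 in this example) pick up a small rounding error, which is the usual source of accuracy loss after quantization.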

Slide 43

Slide 43 text

42/37 Accuracy by Expression Classes
Figure: confusion matrix
- Fear class and Sad class: △
- Otherwise, high accuracy

Slide 44

Slide 44 text

43/37 Compare with other sizes of the DPU
Investigated throughput per power consumption
- Comparison made between B512 and larger DPUs at 400 MHz
- B512 multi-threading execution achieves the highest efficiency of 9.26 FPS/W
Figure: throughput per power consumption for hardware with each DPU

Slide 45

Slide 45 text

44/37 Evaluation by DPU size
Figure: comparison of circuit size embedding each DPU. FPGA resource utilization [%] on the Kria KV260 (LUT, DSP, BRAM) for B512, B1024, B2304, and B4096; B512 is the size used in this work.