CN114663953B

CN114663953B - A facial expression recognition method based on facial key points and deep neural network

Info

Publication number: CN114663953B
Application number: CN202210318271.6A
Authority: CN
Inventors: 李春国; 吴宇凡; 刘周勇; 杨绿溪
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2024-11-12
Anticipated expiration: 2042-03-29
Also published as: CN114663953A

Abstract

The invention discloses a facial expression recognition method based on facial key points and a deep neural network, which can be used for real-time detection of facial expressions. The method mainly comprises the following steps: acquiring a facial expression data set, extracting 68 facial key points through preprocessing, and dividing the eye, nose and mouth regions to obtain multi-path image data and key point coordinate data respectively; training an expression classification network that extracts features from the image data and the coordinate data using convolution layers and graph convolution layers respectively, and outputs the classification result; and evaluating the recognition performance of the network on the facial expression classification data set. Compared with current mainstream facial expression classification algorithms, the method achieves higher average classification accuracy while maintaining real-time recognition, and is therefore a high-quality facial expression recognition algorithm.

Description

Facial expression recognition method based on facial key points and deep neural network

Technical Field

The invention relates to a facial expression recognition method based on facial key points and a deep neural network, which is applicable to the technical field of facial expression recognition in computer vision.

Background

Intelligent recognition of facial expressions has long been an important and active research direction. Facial expression analysis has a wide range of application scenarios, can noticeably improve quality of life, and has great research value. Specific fields of application include, but are not limited to: in public areas, video analysis is used for multi-target tracking and expression analysis, so that potential hazards in the public environment are identified in time and public safety management is strengthened; the facial expressions of motor vehicle drivers are recognized and analyzed to judge whether a driver is fatigued or drunk, thereby reducing the incidence of traffic accidents; the micro-expressions of suspects are analyzed with high-definition high-speed cameras, providing effective assistance to police handling cases; and assistance is provided to people with disabilities, helping them understand the emotional states of others and facilitating communication.

With the development of technology, there have been increasing attempts to achieve automatic facial expression recognition using machine vision and image processing. In recent years, deep learning algorithms have been widely used in fields such as natural language processing, data mining and image processing because of their strong learning and adaptation abilities. Some methods based on convolutional neural networks have also been introduced into facial expression analysis, including AlexNet, ResNet-18 and LeNet. Owing to the local connectivity and weight sharing of convolutional neural networks, these methods greatly reduce the number of parameters and the amount of computation needed to process face images, and they improve expression recognition accuracy markedly over traditional methods. Although deep-learning-based facial expression analysis has achieved good results, current methods still have problems, including: simply stacking convolution layers makes it difficult for the model to extract and analyze the face in a very short time, so the real-time requirements of practical scenarios cannot be met; and facial images are insufficiently preprocessed: usually the whole face is cropped once its position has been detected and is used directly as the input of the deep neural network, which must then analyze the whole image to find the factors that affect the expression by itself, so that the effective extraction of facial key point information is neglected and the accurate recognition of micro-expressions is limited.

Therefore, research on algorithms based on facial key point features is of considerable significance.

Disclosure of Invention

The invention provides a facial expression recognition method based on facial key points and a deep neural network, named CGNet. CGNet uses image features and key point feature information together to recognize expressions accurately, and uses a spatial attention mechanism and a channel attention mechanism to improve the efficiency of feature extraction. The trained CGNet network model has good recognition performance: compared with current algorithms such as FAN (Frame Attention Network), CGNet achieves a higher recognition rate, which shows that the invention effectively improves the accuracy of facial expression recognition.

In order to achieve the above object, the present invention provides the following technical solutions:

A facial expression recognition method based on facial key points and a deep neural network comprises the following steps:

Step S1: constructing a training set, a validation set and a test set from a public facial expression data set; extracting the feature information of a plurality of facial key points through preprocessing, and dividing the eye, nose and mouth regions to obtain multi-path image data and key point coordinate data respectively;

Step S2: constructing a classification recognition network CGNet based on convolution layers and graph convolution layers; performing supervised training of the CGNet network model on the constructed training set until CGNet converges to its best performance; during training, the validation set is used to monitor the training process;

Step S3: testing the converged CGNet network model on the constructed test set, and evaluating the network performance of CGNet according to the test results.
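As an illustration of step S2 only, the following Python sketch shows supervised training in which a validation set monitors progress and decides when to stop. It is not CGNet: a softmax classifier on synthetic data stands in for the network, and the names `train`, `patience` and `lr`, together with all numeric settings, are assumptions made for this sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y):
    p = softmax(X @ W)
    return float(-np.log(p[np.arange(len(y)), y] + 1e-12).mean())

def train(X_tr, y_tr, X_val, y_val, n_classes, lr=0.5, max_epochs=500, patience=20):
    """Full-batch gradient descent; the validation loss decides when to stop."""
    W = np.zeros((X_tr.shape[1], n_classes))
    best_W, best_val, stale = W.copy(), np.inf, 0
    onehot = np.eye(n_classes)[y_tr]
    for _ in range(max_epochs):
        grad = X_tr.T @ (softmax(X_tr @ W) - onehot) / len(y_tr)
        W -= lr * grad
        val = cross_entropy(W, X_val, y_val)
        if val < best_val - 1e-6:      # validation loss still improving
            best_W, best_val, stale = W.copy(), val, 0
        else:                          # no improvement: count towards early stop
            stale += 1
            if stale >= patience:
                break
    return best_W, best_val

# Synthetic three-class data standing in for extracted expression features.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = np.eye(3)[y] * 2.0 + rng.normal(scale=0.5, size=(300, 3))
X_tr, y_tr = X[:200], y[:200]          # training set
X_val, y_val = X[200:250], y[200:250]  # validation set
X_te, y_te = X[250:], y[250:]          # test set, used once after training

W, val_loss = train(X_tr, y_tr, X_val, y_val, n_classes=3)
test_acc = float((np.argmax(X_te @ W, axis=1) == y_te).mean())
```

The test set is touched only after the parameters have been fixed, which mirrors the separation between steps S2 and S3.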

Preferably, each piece of data in the facial expression data set in step S1 is given as a data pair comprising an RGB image to be classified as the input data and a labeled expression class as the ground truth.

Preferably, the preprocessing in step S1 includes: first, detecting the face in the input RGB face image to be classified and extracting the N corresponding key point positions to obtain N×2 coordinate information; then dividing the eye, nose and mouth regions of the face according to the key point positions, resizing the RGB face image to be classified and the three sub-images obtained by the division to 224×224 pixels by bilinear interpolation, and applying data augmentation, which comprises random rotation, random cropping and random modification of the saturation and hue of the RGB image; and finally normalizing the RGB image.
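The resizing and normalization steps can be sketched in Python with NumPy as follows. This is a minimal illustration, not the implementation used by the invention: the corner-aligned sampling grid, the function names `resize_bilinear` and `normalize`, and the per-channel mean and standard deviation (the values commonly used with ImageNet-pretrained ResNet models) are all assumptions.

```python
import numpy as np

# Assumed normalization statistics; the text does not state which values are used.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def resize_bilinear(img, out_h=224, out_w=224):
    """Resize an H x W x C image by bilinear interpolation (corner-aligned grid)."""
    img = np.asarray(img, dtype=np.float64)
    in_h, in_w = img.shape[:2]
    ys = np.linspace(0, in_h - 1, out_h)   # source row for each output row
    xs = np.linspace(0, in_w - 1, out_w)   # source column for each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def normalize(img):
    """Scale 0-255 pixel values to 0-1, then standardize each RGB channel."""
    return (np.asarray(img, dtype=np.float64) / 255.0 - MEAN) / STD

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(48, 64, 3)).astype(np.float64)
resized = resize_bilinear(face)
model_input = normalize(resized)
```

The same two functions would be applied to the full face image and to each of the three region sub-images, so that all four inputs share the 224×224 size.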

Preferably, the classification recognition network CGNet includes:

a) RGB image feature extraction module: its input is the face image together with the three sub-images of the eye, nose and mouth regions obtained by dividing the face image. After the images are obtained, facial features are first extracted by a ResNet network; the feature information produced by ResNet is then taken out layer by layer, the model's perception of local information is strengthened by a spatial attention mechanism, and finally 4 paths of image features are output;

b) Key point position feature extraction module: its input is the N×2 coordinate information. After the coordinate information is obtained, features are first extracted by graph convolution layers; graph convolution represents a node by aggregating the neighboring nodes of the current node, so that nodes with indistinct features can be filtered out and information on the importance of the different key points is obtained, and finally 1 path of position features is output;

c) Feature fusion module: the feature fusion module adopts a channel attention mechanism. Taking the 4 paths of image features and the 1 path of position features as input, it uses channel attention to compute a weight for each path of features and forms the weighted sum with the corresponding feature values, thereby fusing the image features and the position features.

Preferably, the graph convolution layer in the key point position feature extraction module is specified as follows: a graph convolution layer in the neural network is expressed as the following nonlinear function:

H^{l+1} = f(H^{l}, A) (1)

In the above formula, H^{0} = X is the input of the first layer, X ∈ R^{N×D}, N is the number of nodes in the graph, D is the dimension of each node's feature vector, and A is the adjacency matrix; each layer of the graph convolution network is represented as:

where σ(·) is the ReLU activation function and W^{l} is the weight parameter matrix of the l-th layer.
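To make formula (1) concrete, the sketch below implements one graph convolution layer in Python with NumPy. The propagation rule H^{l+1} = σ(A H^{l} W^{l}) is an assumption chosen to be consistent with the symbols defined above (σ as ReLU, W^{l} as the layer's weight matrix, A as the adjacency matrix); the toy graph, the self-loops placed in A and the function name `gcn_layer` are likewise illustrative.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph convolution: aggregate neighbours via A, project with W, apply ReLU."""
    return np.maximum(A @ H @ W, 0.0)

# Toy graph with N = 3 nodes on a path 0 - 1 - 2; self-loops keep each node's own feature.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

# Input features H^0 = X with D = 2 dimensions per node.
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])

# Weight matrix W^0 mapping 2 input dimensions to 2 output dimensions.
W0 = np.array([[1.0, -1.0],
               [0.5, 2.0]])

H1 = gcn_layer(H0, A, W0)
# H1 is [[1.5, 1.0], [3.0, 2.0], [2.0, 3.0]]
```

Each output row mixes the features of a node with those of its neighbours, which is the aggregation behaviour the key point position module relies on.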

The beneficial effects of the invention are as follows:

The CGNet network model provided by the invention performs facial expression recognition by fusing facial image information with facial key point features, which strengthens the representation of facial expression features, improves representational capacity and thereby improves detection accuracy. The gain is reflected in a higher average classification accuracy than the prior art.

Drawings

Fig. 1 is a flowchart of the facial expression recognition method based on facial key points and a deep neural network in embodiment 1.

Fig. 2 is a flowchart of the training and prediction of the expression classification network in embodiment 1.

Fig. 3 is a schematic diagram of the hierarchical network model in embodiment 1.

Fig. 4 is an explanatory diagram of the ResNet network in embodiment 1.

Fig. 5 is a diagram of facial expression recognition results in embodiment 1.

Fig. 6 is a graph of recognition accuracy values in embodiment 1.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

Referring to fig. 1-6, the present embodiment provides a facial expression recognition method based on facial key points and a deep neural network, which trains an expression classification model by a deep learning method, and the model can output predicted expression types according to an input facial RGB image.

Specifically, referring to Fig. 1, the method includes:

Step S1: acquiring a facial expression data set, extracting 68 key point characteristic information of a face through preprocessing, and dividing eye, nose and mouth areas to obtain multi-path image data and key point coordinate data respectively;

More specifically, step S1 includes: obtaining a facial expression recognition data set. Each piece of data in the data set is given as a data pair comprising an RGB image to be classified as the input data and a labeled expression class as the ground truth, where the expression class is represented by a number from 0 to 6 (0: anger, 1: contempt, 2: disgust, 3: fear, 4: happiness, 5: sadness, 6: surprise).

Each piece of data in the data set must be preprocessed before it is input into the model. First, a face detection module detects the face in the input image and extracts the 68 corresponding key point positions to obtain 68×2 coordinate information. The eye, nose and mouth regions of the face are then divided according to the key point positions, the original image and the three sub-images are resized to 224×224 pixels by bilinear interpolation, and data augmentation is applied, which comprises random rotation, random cropping and random modification of attributes of the RGB image such as saturation and hue. Finally, the RGB image is normalized.
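The division into eye, nose and mouth regions can be sketched as bounding boxes around subsets of the 68 key points. The index ranges below follow the widely used iBUG 300-W style ordering of 68-point annotations, which is an assumption here because the text does not state the ordering; the margin value and the names `REGIONS`, `region_box` and `crop_regions` are also hypothetical.

```python
import numpy as np

# Assumed key point index ranges (iBUG 300-W style ordering); not given in the text.
REGIONS = {
    "eyes": range(36, 48),
    "nose": range(27, 36),
    "mouth": range(48, 68),
}

def region_box(landmarks, idx, margin, height, width):
    """Bounding box (x0, y0, x1, y1) around the selected key points, clipped to the image."""
    pts = landmarks[list(idx)]
    x_min, y_min = np.floor(pts.min(axis=0)).astype(int) - margin
    x_max, y_max = np.ceil(pts.max(axis=0)).astype(int) + margin
    return max(x_min, 0), max(y_min, 0), min(x_max, width - 1), min(y_max, height - 1)

def crop_regions(image, landmarks, margin=8):
    """Return one sub-image per facial region from an H x W x 3 image and 68 x 2 (x, y) points."""
    h, w = image.shape[:2]
    crops = {}
    for name, idx in REGIONS.items():
        x0, y0, x1, y1 = region_box(landmarks, idx, margin, h, w)
        crops[name] = image[y0:y1 + 1, x0:x1 + 1]
    return crops

# Synthetic example: a blank 256 x 256 image with hand-placed key points.
image = np.zeros((256, 256, 3), dtype=np.uint8)
landmarks = np.zeros((68, 2))
landmarks[36:48] = np.linspace([60, 90], [180, 110], 12)   # eyes
landmarks[27:36] = np.linspace([110, 100], [140, 160], 9)  # nose
landmarks[48:68] = np.linspace([90, 170], [160, 200], 20)  # mouth
crops = crop_regions(image, landmarks)
```

Each crop would then be resized to 224×224 together with the full face image, giving the four image inputs described above.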

Step S2: training an expression classification network, extracting features from the image data and the key point feature information using convolution layers and graph convolution layers respectively, and outputting the classification result;

More specifically, in the present embodiment, step S2 includes: splitting the data set into a training set containing 654 images and a test set containing 327 images; training the expression classification network on the training set using the PyTorch framework; fixing the model's network parameters once every network parameter in the model has met the convergence criterion, to obtain the trained expression classification network; and finally predicting on the test set with the trained expression classification network to obtain the expression classification results.
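A split of this size can be sketched in a few lines of Python. The text gives only the set sizes, so the seeded random shuffle used here is an assumption (a subject-wise split would be an equally plausible reading), and `split_indices` is a hypothetical helper name.

```python
import random

def split_indices(n_total, n_train, seed=0):
    """Shuffle sample indices reproducibly and split them into train and test lists."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]

# 654 training images and 327 test images, 981 in total.
train_idx, test_idx = split_indices(654 + 327, 654)
```

Fixing the seed keeps the split identical across runs, so accuracy figures from different training runs remain comparable.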

It should be noted that the expression classification method provided in this embodiment is not limited to the PyTorch framework: any framework may be used as long as the model can be trained on the data set X, the loss function converges after the training process has been iterated a number of times, and the facial expression can finally be classified accurately from the RGB image.

It should be noted that, in this example, the expression classification network in step S2 includes the following main components:

a) RGB image feature extraction module: its input is the face image together with the three sub-images of the eye, nose and mouth regions obtained by dividing the face image. After the images are obtained, facial features are first extracted by a ResNet network; the feature information produced by ResNet is then taken out layer by layer, the model's perception of local information is strengthened by a spatial attention mechanism, and finally 4 paths of image features are output.

b) Key point position feature extraction module: its input is the 68×2 coordinate information from step S1. After the coordinate information is obtained, features are first extracted by graph convolution layers; graph convolution represents a node by aggregating the neighboring nodes of the current node, so that nodes with indistinct features can be filtered out and information on the importance of the different key points is obtained, and finally 1 path of position features is output.

c) Feature fusion module: the feature fusion module adopts a channel attention mechanism. Taking the above 5 paths of features as input, it determines the importance of each path of features for the current expression and forms the weighted sum with the corresponding feature values, thereby fusing the image features and the position features.

Step S3: calculating average classification accuracy according to the classification result and the real label;

More specifically, in the present embodiment, the classification accuracy is calculated from the prediction results of step S2 and the ground truth of step S1 as the number of correctly classified samples divided by the total number of samples in the test set.
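The accuracy computation amounts to counting matches between predicted and true labels. A minimal Python sketch follows; the function name `classification_accuracy` and the example labels are illustrative only and are not results from the invention.

```python
def classification_accuracy(predicted, truth):
    """Fraction of samples whose predicted class equals the true class."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("predicted and truth must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)

# Made-up labels in the 0-6 expression coding: three of four predictions match.
acc = classification_accuracy([0, 4, 6, 3], [0, 4, 5, 3])
```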

It should be noted that the index used to measure the expression recognition performance is not limited to classification accuracy; any index that reflects the difference between the obtained expression classification results and the true expression labels may be used.

Fig. 3 shows the overall structure of the expression classification network, which mainly comprises three modules: an RGB image feature extraction module, a key point position feature extraction module and a feature fusion module.

The input of the RGB image feature extraction module is the face image together with the three sub-images of the eye, nose and mouth regions obtained by dividing the face image. After the images are obtained, facial features are first extracted by a ResNet network: once an image is input into the network, layer1 of ResNet first downsamples it, and layer2-layer5 then perform multi-scale feature extraction. The feature information of layer2-layer5 is taken out layer by layer, the three sub-images are passed through a spatial attention mechanism (Spatial Attention) to strengthen the model's perception of local information, and finally 4 paths of image features are output.
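The text does not specify the form of the spatial attention mechanism. As one possible reading, the Python sketch below pools a feature map across channels, turns the pooled maps into a per-location weight with a sigmoid, and rescales the features. The mixing weights `w_avg`, `w_max` and `bias` stand in for a learned convolution and, like the function name `spatial_attention`, are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w_avg=1.0, w_max=1.0, bias=0.0):
    """Reweight a C x H x W feature map with an H x W attention map.

    The attention map is built from the channel-wise mean and maximum at each
    location; the scalar weights act as a stand-in for a learned convolution.
    """
    avg_map = feat.mean(axis=0)   # H x W, average response per location
    max_map = feat.max(axis=0)    # H x W, strongest response per location
    attn = sigmoid(w_avg * avg_map + w_max * max_map + bias)
    return feat * attn[None, :, :], attn

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 14, 14))   # a made-up feature map
attended, attn = spatial_attention(features)
```

Locations with strong responses receive weights near 1 and are preserved, while weak locations are suppressed, which is one way to emphasize local information.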

The input of the key point position feature extraction module is the 68×2 coordinate information described in step S1. After the coordinate information is obtained, features are first extracted by a graph convolution network composed of 4 graph convolution layers (Graph Convolutional Network layer, GCN). Each graph convolution layer represents a node by aggregating the neighboring nodes of the current node, so that nodes with indistinct features can be filtered out and information on the importance of the different key points is obtained; finally 1 path of position features is output.
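A stack of 4 graph convolution layers over 68 key points can be sketched in Python with NumPy as follows. Several choices are assumptions made for illustration and are not stated in the text: the graph connects consecutive key points in a chain, the adjacency matrix carries self-loops and is symmetrically normalized, the layer widths are 16, 32, 64 and 128, the weights are random rather than trained, and the node features are averaged to produce the single position feature.

```python
import numpy as np

def normalized_adjacency(edges, n):
    """Adjacency with self-loops, scaled as D^{-1/2} A D^{-1/2}."""
    A = np.eye(n)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d = A.sum(axis=1)
    return A / np.sqrt(np.outer(d, d))

def gcn_stack(X, A, weights):
    """Apply successive graph convolution layers, each followed by ReLU."""
    H = X
    for W in weights:
        H = np.maximum(A @ H @ W, 0.0)
    return H

rng = np.random.default_rng(0)
n_points = 68
edges = [(i, i + 1) for i in range(n_points - 1)]   # hypothetical chain graph
A = normalized_adjacency(edges, n_points)

dims = [2, 16, 32, 64, 128]                         # assumed layer widths
weights = [rng.normal(scale=np.sqrt(2.0 / dims[k]), size=(dims[k], dims[k + 1]))
           for k in range(4)]                       # untrained random weights

coords = rng.uniform(0.0, 1.0, size=(n_points, 2))  # made-up 68 x 2 coordinates
node_feats = gcn_stack(coords, A, weights)          # 68 x 128 node features
position_feature = node_feats.mean(axis=0)          # pooled to one 128-d vector
```

In a trained model the weights would be learned, and the graph would more plausibly follow facial structure (jawline, brows, eyes, nose, lips) than a simple chain.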

The feature fusion module adopts a channel attention mechanism (Channel Attention). Taking the above 5 paths of features as input, it determines the importance of each path of features for the current expression and forms the weighted sum with the corresponding feature values, thereby fusing the image features and the position features.
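The fusion step can be sketched in Python with NumPy as a weighted sum over the 5 feature paths. The squeeze-and-excitation style scoring (mean per path, a small two-layer network, then a softmax so that the weights sum to 1) is an assumption, as are the common feature dimension of 128, the random parameters `W1` and `W2`, and the function name `channel_attention_fuse`.

```python
import numpy as np

def channel_attention_fuse(features, W1, W2):
    """Fuse P feature paths of dimension D into one D-dimensional vector.

    features: array of shape (P, D); W1: (hidden, P); W2: (P, hidden).
    Returns the fused vector and the per-path weights.
    """
    squeezed = features.mean(axis=1)           # one descriptor per path
    hidden = np.maximum(W1 @ squeezed, 0.0)    # small bottleneck with ReLU
    scores = W2 @ hidden                       # one score per path
    scores = scores - scores.max()             # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights @ features                 # weighted sum over the paths
    return fused, weights

rng = np.random.default_rng(0)
paths = rng.normal(size=(5, 128))   # 4 image paths + 1 position path (made up)
W1 = rng.normal(size=(8, 5))        # untrained random parameters
W2 = rng.normal(size=(5, 8))
fused, path_weights = channel_attention_fuse(paths, W1, W2)
```

Because the weights depend on the input features, a different expression can shift emphasis between, for example, the mouth path and the key point path.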

Fig. 5 shows some results of facial expression recognition by the expression classification network; qualitatively, the classification results output by the method of the invention are more accurate than those of the comparison algorithms. As shown in Fig. 6, quantitatively, the classification accuracy of the method of the invention is higher than that of the other algorithms, and its performance is therefore better.

Matters not described in detail in the present application are well known to those skilled in the art.

The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims

1. The facial expression recognition method based on the facial key points and the deep neural network is characterized by comprising the following steps of:

Step S3: testing the converged CGNet network model on the constructed testing set, and evaluating CGNet network performance according to the detection result;

the classification recognition network CGNet includes:

b) Key point position feature extraction module: its input is the N×2 coordinate information; after the coordinate information is obtained, features are first extracted by graph convolution layers; graph convolution represents a node by aggregating the neighboring nodes of the current node, so that nodes with indistinct features can be filtered out and information on the importance of the different key points is obtained, and finally 1 path of position features is output;

c) Feature fusion module: the feature fusion module adopts a channel attention mechanism; taking the 4 paths of image features and the 1 path of position features as input, it uses channel attention to compute a weight for each path of features and forms the weighted sum with the corresponding feature values, thereby fusing the image features and the position features;

The graph convolution layer in the key point position feature extraction module is specified as follows: a graph convolution layer in the neural network is expressed as the following nonlinear function:

H^{l+1} = f(H^{l}, A) (1)

2. The facial expression recognition method based on facial key points and a deep neural network according to claim 1, wherein each piece of data in the facial expression data set in step S1 is given as a data pair comprising an RGB image to be classified as the input data and a labeled expression class as the ground truth.

3. The facial expression recognition method based on facial key points and a deep neural network according to claim 1, wherein the preprocessing in step S1 comprises: first, detecting the face in the input RGB face image to be classified and extracting the N corresponding key point positions to obtain N×2 coordinate information; then dividing the eye, nose and mouth regions of the face according to the key point positions, resizing the RGB face image to be classified and the three sub-images obtained by the division to 224×224 pixels by bilinear interpolation, and applying data augmentation, which comprises random rotation, random cropping and random modification of the saturation and hue of the RGB image; and finally normalizing the RGB image.