CN111325190B

CN111325190B - Expression recognition method and device, computer equipment and readable storage medium

Info

Publication number: CN111325190B
Application number: CN202010248558.7A
Authority: CN
Inventors: 陈冠男
Original assignee: BOE Technology Group Co Ltd
Current assignee: BOE Technology Group Co Ltd
Priority date: 2020-04-01
Filing date: 2020-04-01
Publication date: 2023-06-30
Anticipated expiration: 2040-04-01
Also published as: US20220343683A1; WO2021196928A1; US12002289B2; CN111325190A

Abstract

The invention discloses an expression recognition method, which includes: detecting face key point positions on a face image to obtain face key point position information; inputting the face image into four cascaded convolution modules for feature processing to obtain the feature response map output by the fourth convolution module; inputting that feature response map into a global average pooling layer module to obtain a feature vector of a first dimension; performing key point feature extraction on the feature response maps output by the first three convolution modules to obtain key point feature information; concatenating the feature vector of the first dimension with the key point feature information to obtain a feature vector of a second dimension; inputting the feature vector of the second dimension into a fully connected layer module to obtain a feature vector of a third dimension; and inputting the feature vector of the third dimension into a trained neural network classifier, which outputs the expression category information of the face image. The invention has a simple structure and a small number of parameters.

Description

A facial expression recognition method, device, computer equipment and readable storage medium

Technical field

The invention relates to the technical field of graphics processing, and more specifically to an expression recognition method and device, computer equipment, and a readable storage medium.

Background

Deep learning technology has developed rapidly. Companies such as Google, Facebook, and Baidu have invested heavily in capital and manpower in deep learning research and continue to launch their own products and technologies, while others such as IBM, Microsoft, and Amazon are also moving into the field and have achieved certain results.

Deep learning has made breakthroughs in the field of human data perception, for example in describing image content, recognizing objects in complex environments in images, and performing speech recognition in noisy environments. Deep learning can also address image generation and fusion problems.

Face feature recognition has been a popular technology in biometric pattern recognition in recent years. It requires detecting and locating facial feature points and using them for applications such as face matching and expression analysis. Many research institutions and companies have invested substantial resources in object recognition and obtained a series of results, which have found many applications in industries such as security, finance, and entertainment. Expression recognition is an extension of face feature recognition and one of the difficult problems in this field. Because human facial expressions are complex, the accuracy of classifying expressions with machine learning methods has long been difficult to improve substantially. The development of deep learning offers more possibilities for improving image pattern recognition performance, so expression recognition based on deep learning has also become a focus of attention in face feature recognition in recent years.

In the prior art, most current expression recognition methods use face key points to crop the face image, enlarge the cropped eye and mouth images to the size of the face image, and input them together into a deep learning network for training to obtain a deep learning model for expression recognition. However, the model structure of this approach is complex and its number of parameters is large.

Summary of the invention

To solve the technical problem described in the background, a first aspect of the present invention provides an expression recognition method comprising the following steps:

performing face key point position detection on a face image to obtain face key point position information;

inputting the face image into four cascaded convolution modules, which perform feature processing on the input face image in sequence, to obtain the feature response map output by the fourth convolution module;

inputting the feature response map output by the fourth convolution module into a global average pooling layer module to obtain a feature vector of a first dimension;

using the face key point position information to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules, to obtain the key point feature information of those feature response maps;

concatenating the feature vector of the first dimension with the key point feature information of the feature response maps respectively output by the first three convolution modules, to obtain a feature vector of a second dimension;

inputting the feature vector of the second dimension into a fully connected layer module for processing, to obtain a feature vector of a third dimension; and

inputting the feature vector of the third dimension into a trained neural network classifier, so that the neural network classifier outputs the expression category information of the face image.

Optionally, performing face key point position detection on the face image to obtain the face key point position information includes:

performing face key point position detection on the face image based on the Dlib library, and taking the key points of the eyes and mouth in the face image as the face key point position information.

Optionally, the convolution module includes an input layer, a convolution layer, a normalization layer, an activation function layer, a pooling layer, and an output layer;

wherein the input of the convolution layer is connected to the input layer, the input of the normalization layer is connected to the output of the convolution layer, the input of the activation function layer is connected to the output of the normalization layer, the input of the pooling layer is connected to the output of the activation function layer, and the input of the output layer is connected to the output of the pooling layer.

Optionally, using the face key point position information to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules, to obtain the key point feature information of those feature response maps, includes:

using the face key point position information to extract, from the feature response map output by each convolution module, the response values corresponding to the face key point position information; and

taking a weighted average of the response values corresponding to the face key point position information in each feature response map, to obtain the key point feature information of the feature response map output by each convolution module.

Optionally, the key point feature information is obtained by the following formula:

$$K_{i,j} = \frac{1}{N} \sum_{n=1}^{N} k_{i,j}^{n}$$

where $K_{i,j}$ is the key point feature information, $k_{i,j}^{n}$ is the response value of the n-th channel of the feature response map at the face key point position, and $N$ is the number of channels of the feature response map.

Optionally, before using the face key point position information to extract, from the feature response map output by each convolution module, the response values corresponding to the face key point position information, the method further includes:

adjusting the size of the feature response map output by each convolution module to be the same as the size of the face image.

Optionally, before performing face key point position detection on the face image to obtain the face key point position information, the method further includes:

acquiring an input image, performing face detection on the input image, and adjusting the size of the detected face image to a preset size.

Optionally, the neural network classifier is trained by stochastic gradient descent.

A second aspect of the present invention provides an expression recognition device, comprising:

a face key point position detection module, configured to perform face key point position detection on a face image to obtain face key point position information;

four cascaded convolution modules, configured to receive the face image and perform feature processing on it in sequence, to obtain the feature response map output by the fourth convolution module;

a global average pooling layer module, configured to obtain a feature vector of a first dimension from the input feature response map output by the fourth convolution module;

a key point feature information module, configured to use the face key point position information to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules, to obtain the key point feature information of those feature response maps;

a feature vector connection module, configured to concatenate the feature vector of the first dimension with the key point feature information of the feature response maps respectively output by the first three convolution modules, to obtain a feature vector of a second dimension;

a fully connected layer module, configured to process the input feature vector of the second dimension to obtain a feature vector of a third dimension; and

a trained neural network classifier, configured to receive the feature vector of the third dimension and output the expression category information of the face image.

A third aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of the first aspect of the present invention when executing the program.

A fourth aspect of the present invention provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the method of the first aspect of the present invention.

The beneficial effects of the present invention are as follows:

The technical solution of the present invention has the advantages of a clear principle and a simple design. Specifically, it uses the face key point position information to extract key point features from the feature response maps, thereby recognizing the expression in the input face image with a simple structure and a small number of parameters.

Description of drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 shows a flowchart of an expression recognition method according to an embodiment of the present invention;

Fig. 2 shows a schematic diagram of the algorithm structure of the expression recognition method in this embodiment;

Fig. 3 shows a schematic diagram of face key point positions;

Fig. 4 shows a schematic structural diagram of a convolution module in this embodiment;

Fig. 5 shows a flowchart of key point feature extraction on the feature response maps of the first three convolution modules in this embodiment;

Fig. 6 shows a schematic structural diagram of a computer device according to another embodiment of the present invention.

Detailed description

To make the technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Fig. 1 shows a flowchart of the steps of an expression recognition method according to an embodiment of the present invention. The method can be applied to a terminal device such as a smartphone, a tablet computer, a personal computer, or a server. For ease of understanding, the algorithm structure of the expression recognition method is briefly introduced first.

As shown in Fig. 2, the algorithm structure of the expression recognition method in this embodiment includes a face image input layer, a face key point position detection module, four cascaded convolution modules, a global average pooling layer module, a key point feature information module, a feature vector connection module, a fully connected layer module, and a classifier, wherein:

the face key point position detection module performs face key point position detection on the face image to obtain face key point position information;

the four cascaded convolution modules receive the face image and perform feature processing on it in sequence, to obtain the feature response map output by the fourth convolution module;

the global average pooling layer module obtains a feature vector of a first dimension from the input feature response map output by the fourth convolution module;

the key point feature information module uses the face key point position information to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules, to obtain the key point feature information of those feature response maps;

the feature vector connection module concatenates the feature vector of the first dimension with the key point feature information of the feature response maps respectively output by the first three convolution modules, to obtain a feature vector of a second dimension;

the fully connected layer module processes the input feature vector of the second dimension to obtain a feature vector of a third dimension;

the classifier is a trained neural network classifier that receives the feature vector of the third dimension and outputs the expression category information of the face image.

Here, the expression category information may be happy, surprised, calm, sad, angry, disgusted, or fearful; other kinds of expressions may of course also be preset.

The algorithm structure of the expression recognition method has been introduced above. The method itself is described in detail below. As shown in Fig. 1, it includes:

S100. Perform face key point position detection on a face image to obtain face key point position information;

S200. Input the face image into four cascaded convolution modules, which perform feature processing on the input face image in sequence, to obtain the feature response map output by the fourth convolution module;

S300. Input the feature response map output by the fourth convolution module into the global average pooling layer module to obtain a feature vector of a first dimension;

S400. Use the face key point position information to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules, to obtain the key point feature information of those feature response maps;

S500. Concatenate the feature vector of the first dimension with the key point feature information of the feature response maps respectively output by the first three convolution modules, to obtain a feature vector of a second dimension;

S600. Input the feature vector of the second dimension into the fully connected layer module for processing, to obtain a feature vector of a third dimension;

S700. Input the feature vector of the third dimension into the trained neural network classifier, so that the neural network classifier outputs the expression category information of the face image.

Specifically, S100 includes: performing face key point position detection on the face image based on the Dlib library, and taking the key points of the eyes and mouth in the face image as the face key point position information.

It should be noted that Dlib is a general-purpose library of image processing algorithms similar to OpenCV and belongs to the prior art; face key point detection is one of its notable features. Dlib's face key point detection is developed on the basis of the random forest algorithm in machine learning and can describe 68 key point positions on a face, as shown in Fig. 3, covering the eyebrows, eyes, nose, mouth, and jaw, and it runs quickly. In this embodiment, so that the deep learning network focuses more on expression features, the 32 key points of the eyes and mouth, which are most closely related to expression, are selected from the 68 key points as the face key point position information.
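The selection of the 32 eye and mouth points can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the index layout commonly used with Dlib's 68-point landmark model (eyes at indices 36 to 47, mouth at indices 48 to 67), which the text above does not state, and `select_expression_keypoints` is a hypothetical helper name. The landmark list is dummy data standing in for real detector output.

```python
# Assumed layout of the common 68-point annotation used with Dlib:
# indices 36-47 are the two eyes (12 points), 48-67 are the mouth (20 points).
EYE_INDICES = range(36, 48)
MOUTH_INDICES = range(48, 68)


def select_expression_keypoints(landmarks):
    """Return the 32 eye and mouth points from a list of 68 (x, y) tuples."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks, got %d" % len(landmarks))
    indices = list(EYE_INDICES) + list(MOUTH_INDICES)
    return [landmarks[i] for i in indices]


# Dummy landmarks standing in for real detector output.
dummy_landmarks = [(i, 2 * i) for i in range(68)]
selected = select_expression_keypoints(dummy_landmarks)
print(len(selected))  # 32
```

Under this assumed layout, 12 eye points plus 20 mouth points give the 32 key points mentioned in the embodiment.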

Further, in this embodiment, the steps before S100 also include: acquiring an input image, performing face detection on the input image, and adjusting the size of the detected face image to a preset size.

Specifically, the face in the acquired input image can be detected with the Dlib library, and the detected face images are uniformly resized to the preset size. The preset size can be set according to actual needs and is not limited in this embodiment; for example, it may be 48×48.

In S200, as shown in Fig. 4, the convolution module may specifically include an input layer, a convolution layer, a normalization layer, an activation function layer, a pooling layer, and an output layer.

Specifically, the input of the convolution layer is connected to the input layer, the input of the normalization layer is connected to the output of the convolution layer, the input of the activation function layer is connected to the output of the normalization layer, the input of the pooling layer is connected to the output of the activation function layer, and the input of the output layer is connected to the output of the pooling layer.

In this embodiment, the four cascaded convolution modules extract features from their inputs at different scales and output the processed feature response maps. For ease of understanding, the four convolution modules are named, in top-to-bottom order, the first, second, third, and fourth convolution modules.

In a specific implementation, the face image is first input into the first convolution module. As shown in Fig. 4, the first convolution module has a 3×3 convolution kernel and 32 channels, and it outputs a feature response map of size 24×24 with 32 channels. The output of the first convolution module is the input of the second convolution module, which has a 3×3 convolution kernel and 64 channels and turns the 24×24, 32-channel feature response map into a 12×12, 64-channel feature response map. The output of the second convolution module is the input of the third convolution module, which has a 3×3 convolution kernel and 128 channels and turns the 12×12, 64-channel feature response map into a 6×6, 128-channel feature response map. The output of the third convolution module is the input of the fourth convolution module, which has a 3×3 convolution kernel and 256 channels and turns the 6×6, 128-channel feature response map into a 3×3, 256-channel feature response map.
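The size progression described above can be checked with a short shape trace. The sketch below assumes a 48×48 input (the exemplary preset size), and assumes that each module preserves the spatial size in its 3×3 convolution and halves it in its pooling layer; the text gives only the resulting sizes, so the padding and pooling stride are assumptions, and `trace_shapes` is a hypothetical name.

```python
def trace_shapes(input_size=48, channels=(32, 64, 128, 256)):
    """Return the (height, width, channels) output of each cascaded module.

    Assumes the 3x3 convolution keeps the spatial size (for example via
    padding) and the pooling layer halves it (2x2 window, stride 2).
    """
    size = input_size
    shapes = []
    for out_channels in channels:
        size //= 2  # effect of the assumed 2x2, stride-2 pooling
        shapes.append((size, size, out_channels))
    return shapes


shapes = trace_shapes()
for index, shape in enumerate(shapes, start=1):
    print("module", index, "->", shape)
```

With these assumptions the trace reproduces the 24×24×32, 12×12×64, 6×6×128, and 3×3×256 outputs listed in the embodiment.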

In S300, the global average pooling layer module converts the feature response map output by the fourth convolution module into the feature vector of the first dimension by averaging each channel. Here, the first dimension is 1×256.
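Global average pooling can be illustrated with plain nested lists. This is a minimal sketch under the assumption that the feature response map is stored as `[channel][row][col]`; the function name and the toy data are illustrative only.

```python
def global_average_pool(feature_map):
    """Average each channel of a [C][H][W] nested list into a single value."""
    pooled = []
    for channel in feature_map:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy map with 256 channels of size 3x3; channel c is filled with the value c.
toy_map = [[[float(c)] * 3 for _ in range(3)] for c in range(256)]
vector = global_average_pool(toy_map)
print(len(vector))  # 256
```

Each of the 256 channels of the 3×3 map collapses to one number, which is why the first dimension is 1×256.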

In S400, as shown in Fig. 5, the key point feature information module uses the face key point position information to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules, that is, the first, second, and third convolution modules, to obtain the key point feature information of each of those feature response maps.

Specifically, S400 includes the following sub-steps:

In this embodiment, based on the face key point position information obtained in the preceding steps, the response values corresponding to the face key point positions, that is, to the 32 key points of the eyes and mouth, are extracted from the feature response maps output by the first, second, and third convolution modules. The response values of each key point in a feature response map are then combined by a weighted average, so that a set of 32 response values is obtained for the feature response map output by each convolution module.

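Further, the key point feature information is obtained by the following formula:

$$K_{i,j} = \frac{1}{N} \sum_{n=1}^{N} k_{i,j}^{n}$$

where $K_{i,j}$ is the key point feature information, $k_{i,j}^{n}$ is the response value of the n-th channel of the feature response map at the face key point position, and $N$ is the number of channels of the feature response map.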
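A minimal sketch of this extraction step is given below. It reads the "weighted average" as an equal-weight mean over the channels at each key point position, which is an interpretation rather than something the text spells out. It also assumes the feature response map is stored as `[channel][row][col]`, has already been resized to the face image size, and that key points are given as `(row, col)` pairs; `keypoint_features` is a hypothetical name.

```python
def keypoint_features(feature_map, keypoints):
    """Average the responses over all channels at each key point.

    feature_map: nested list indexed as [channel][row][col], already
                 resized to the face image size.
    keypoints:   list of (row, col) positions in the face image.
    """
    num_channels = len(feature_map)
    features = []
    for row, col in keypoints:
        total = sum(feature_map[n][row][col] for n in range(num_channels))
        features.append(total / num_channels)
    return features


# Toy map: 2 channels of size 4x4, filled with 1.0 and 3.0 respectively.
toy_map = [[[1.0] * 4 for _ in range(4)], [[3.0] * 4 for _ in range(4)]]
toy_keypoints = [(0, 0), (2, 3)]
print(keypoint_features(toy_map, toy_keypoints))  # [2.0, 2.0]
```

Applied to 32 key points, the function returns one value per key point, so each of the first three modules contributes a 1×32 vector.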

In this embodiment, before using the face key point position information to extract, from the feature response map output by each convolution module, the response values corresponding to the face key point position information, the method further includes adjusting the size of the feature response map output by each convolution module to be the same as the size of the face image.

Specifically, an upsampling operation can be used to adjust the size of the feature response map output by each convolution module to match the size of the input face image.
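The upsampling step can be sketched with nearest-neighbour interpolation. The text does not say which interpolation is used, so nearest neighbour is an assumption chosen for simplicity, and `upsample_nearest` is a hypothetical helper that handles one square channel at a time.

```python
def upsample_nearest(channel, out_size):
    """Resize a square 2D list to out_size x out_size by nearest neighbour."""
    in_size = len(channel)
    return [
        [channel[r * in_size // out_size][c * in_size // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]


small = [[1, 2], [3, 4]]
large = upsample_nearest(small, 4)
for row in large:
    print(row)
```

In the embodiment, the same idea would bring the 24×24, 12×12, and 6×6 maps up to the 48×48 face image, so that key point coordinates detected on the face image index directly into each map.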

In S500, the feature vector of the first dimension obtained in S300 is concatenated with the key point feature information of the feature response maps respectively output by the first three convolution modules, to obtain the feature vector of the second dimension. That is, after the key point feature information has been extracted from the feature response maps output by the first three convolution modules, the three 1×32 feature vectors are concatenated with the 1×256 feature vector of the first dimension. The second dimension is therefore 1×352 (256 + 3 × 32 = 352).
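The dimension arithmetic of the concatenation can be checked with a short sketch. The order in which the vectors are joined is an assumption (global vector first, then the three key point vectors), and the values are placeholders.

```python
def concatenate_features(global_vector, keypoint_vectors):
    """Join the pooled vector and the per-module key point vectors."""
    fused = list(global_vector)
    for vector in keypoint_vectors:
        fused.extend(vector)
    return fused


global_vector = [0.0] * 256                                  # from S300
keypoint_vectors = [[1.0] * 32, [2.0] * 32, [3.0] * 32]      # from S400
fused = concatenate_features(global_vector, keypoint_vectors)
print(len(fused))  # 352
```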

In S600, every element of the output vector of the fully connected layer module is connected to every element of its input vector, so the layer can fuse all features of the input vector. After the fully connected layer module, the feature vector of the first dimension output by the global average pooling layer module is therefore fused with the key point feature information of the feature response maps output by the first three convolution modules, giving the feature vector of the third dimension. Specifically, the input of the fully connected layer module is the feature vector of the second dimension, its output is the feature vector of the third dimension, and the third dimension is 1×128.
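A fully connected layer of this kind can be sketched as a matrix-vector product. The bias term, the random initial weights, and the absence of an activation function are assumptions made for illustration; the text specifies only the 1×352 input and 1×128 output.

```python
import random


def fully_connected(inputs, weights, biases):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


random.seed(0)
in_dim, out_dim = 352, 128
weights = [[random.uniform(-0.05, 0.05) for _ in range(in_dim)]
           for _ in range(out_dim)]
biases = [0.0] * out_dim
output = fully_connected([1.0] * in_dim, weights, biases)
print(len(output))  # 128
```

Because every output element sums over all 352 inputs, the global feature and the key point features are mixed in each of the 128 outputs.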

In S700, the confidence of each preset expression category can be calculated by inputting the feature vector of the third dimension into the Softmax layer of the trained neural network, where the confidence is given by:

$$P(y = j \mid x) = \frac{e^{x^{T} w_{j}}}{\sum_{k=1}^{K} e^{x^{T} w_{k}}}$$

where $j$ is the index of the expression category, $x$ is the input vector of the Softmax layer (that is, the feature vector of the third dimension in this embodiment), $w$ is the network weight parameter, $K$ is the number of preset expression categories, and $P(y = j \mid x)$ is the confidence that the expression category is the $j$-th category when the input vector of the Softmax layer is $x$.
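The confidence computation can be sketched as a standard softmax over the scores x·w_j. Storing one weight vector per category in `weights` and subtracting the maximum score for numerical stability are implementation choices assumed here, not details given in the text.

```python
import math


def softmax_confidences(x, weights):
    """Return P(y = j | x) for every category j, given one weight row per category."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    max_score = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Toy example with a 2-element input and three categories.
confidences = softmax_confidences([1.0, 2.0], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(confidences)
```

The confidences are positive and sum to one, so they can be compared directly across categories.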

In this embodiment, the expression category corresponding to the face image to be recognized can be determined from the confidence of each expression category. Specifically, the expression category with the highest confidence can be taken as the expression category of the face image.

It should be noted that the neural network classifier in this embodiment can be trained by stochastic gradient descent. First, the neural network to be trained and face image samples of each preset expression category are obtained. Then, at each iteration, a certain number of face image samples are taken and preprocessed, and the preprocessed samples are input into the neural network for iterative gradient descent training until a preset training condition is reached, yielding the trained neural network classifier. The preset training condition may be that the number of iterations reaches a preset number, or that the value of the loss function falls below a preset value. In this embodiment, cross-entropy may be used as the loss function.
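The training procedure above (mini-batches, cross-entropy loss, and two stopping conditions) can be illustrated with a toy sketch. It trains only a linear softmax classifier on hand-made two-dimensional samples rather than the full network of the embodiment; the function name `train_sgd`, the learning rate, batch size, iteration limit, and loss threshold are all hypothetical values chosen for illustration, and the stopping test uses the loss of the current mini-batch, which is one possible reading of the preset condition.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_sgd(samples, num_classes, lr=0.5, batch_size=4,
              max_iters=500, loss_threshold=0.05, seed=0):
    """Mini-batch SGD on a linear softmax classifier with cross-entropy loss.

    Stops when the iteration count reaches max_iters or the mini-batch
    loss falls below loss_threshold.
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [[0.0] * dim for _ in range(num_classes)]
    loss = float("inf")
    for it in range(1, max_iters + 1):
        batch = rng.sample(samples, batch_size)   # draw a batch of samples
        grad = [[0.0] * dim for _ in range(num_classes)]
        loss = 0.0
        for x, y in batch:
            p = softmax([sum(wi * xi for wi, xi in zip(w_j, x)) for w_j in w])
            loss -= math.log(max(p[y], 1e-12)) / batch_size   # cross-entropy
            for j in range(num_classes):
                err = p[j] - (1.0 if j == y else 0.0)   # d(loss)/d(score_j)
                for i in range(dim):
                    grad[j][i] += err * x[i] / batch_size
        for j in range(num_classes):
            for i in range(dim):
                w[j][i] -= lr * grad[j][i]             # gradient descent step
        if loss < loss_threshold:
            break
    return w, it, loss


# Toy stand-ins for preprocessed samples: (feature vector, category index).
samples = [([1.0, 0.0], 0)] * 4 + [([0.0, 1.0], 1)] * 4
w, iters, loss = train_sgd(samples, num_classes=2)
print(iters, loss)
```

In practice a deep learning framework would compute the gradients of the whole network automatically; the explicit loops here only show where the loss, the update, and the two stopping conditions fit.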

In this embodiment, the preset expression categories may include happiness, surprise, calm, sadness, anger, disgust and fear. Other numbers and other kinds of expression categories may of course also be preset.
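Selecting the expression category with the highest confidence can be sketched as below. The mapping from output index to category name, the function name `pick_expression`, and the example confidence values are assumptions for illustration; the text does not fix an index order.

```python
from typing import List, Sequence, Tuple

# Assumed index order; the seven names are the categories listed in this embodiment.
EXPRESSION_CATEGORIES: List[str] = [
    "happiness", "surprise", "calm", "sadness", "anger", "disgust", "fear",
]


def pick_expression(confidences: Sequence[float]) -> Tuple[str, float]:
    """Return the category with the highest confidence and that confidence."""
    if len(confidences) != len(EXPRESSION_CATEGORIES):
        raise ValueError("one confidence per preset expression category is required")
    best = max(range(len(confidences)), key=lambda j: confidences[j])
    return EXPRESSION_CATEGORIES[best], confidences[best]


# Made-up confidences standing in for the Softmax output.
label, confidence = pick_expression([0.05, 0.10, 0.60, 0.05, 0.10, 0.05, 0.05])
print(label, confidence)
```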

In summary, the technical solution of the present invention has the advantages of a clear principle and a simple design. Specifically, it uses the face key point position information to extract key point features from the feature response maps, and thereby performs expression recognition on the input face image with a simple structure and a small number of parameters.

An expression recognition device proposed by another embodiment of the present invention comprises:

Yet another embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above expression recognition method when executing the program. As shown in FIG. 6, a computer system suitable for implementing the server provided by this embodiment includes a central processing unit (CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) or a program loaded from a storage section into a random access memory (RAM). The RAM also stores various programs and data required for the operation of the computer system. The CPU, the ROM and the RAM are connected to one another via a bus. An input/output (I/O) interface is also connected to the bus.

The following components are connected to the I/O interface: an input section including a keyboard, a mouse and the like; an output section including a display such as a liquid crystal display (LCD) and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive as needed, so that a computer program read from it can be installed into the storage section as needed.

In particular, according to this embodiment, the process described in the flowchart above can be implemented as a computer software program. For example, this embodiment includes a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section, and/or installed from a removable medium.

The flowcharts and schematic diagrams in the accompanying drawings illustrate the architecture, functions and operations of possible implementations of the system, method and computer program product of this embodiment. In this regard, each block in a flowchart or schematic diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the schematic diagrams and/or flowcharts, and combinations of blocks in the schematic diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in this embodiment can be implemented in software or in hardware. The described units can also be provided in a processor; for example, a processor may be described as including a face key point position detection module, four cascaded convolution modules, a global average pooling layer module, and so on. The names of these units do not, in some cases, constitute a limitation on the units themselves.

As another aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the device described in the above embodiments, or it may exist separately without being assembled into a terminal. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the expression recognition method described in the present invention.

Obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not a limitation on its implementations. On the basis of the above description, those of ordinary skill in the art can make other changes or variations in different forms. It is not possible to list all implementations exhaustively here, and any obvious change or variation derived from the technical solution of the present invention still falls within the scope of protection of the present invention.

Claims

1. An expression recognition method, characterized by comprising the following steps:

detecting face key point positions in a face image to obtain face key point position information;

inputting the face image into four cascaded convolution modules, which sequentially perform feature processing on the input face image, to obtain a feature response map output by the fourth convolution module;

inputting the feature response map output by the fourth convolution module into a global average pooling layer module to obtain a feature vector of a first dimension;

performing key point feature extraction on the feature response maps respectively output by the first three convolution modules by using the face key point position information, to obtain key point feature information of the feature response maps respectively output by the first three convolution modules;

concatenating the feature vector of the first dimension with the key point feature information of the feature response maps respectively output by the first three convolution modules to obtain a feature vector of a second dimension;

inputting the feature vector of the second dimension into a fully connected layer module for processing to obtain a feature vector of a third dimension; and

inputting the feature vector of the third dimension into a trained neural network classifier, so that the neural network classifier outputs expression category information of the face image.

2. The expression recognition method of claim 1, wherein

the step of detecting face key point positions in the face image to obtain the face key point position information comprises:

performing key point detection on the face image based on the Dlib library, and acquiring the key points of the eyes and the mouth in the face image as the face key point position information.

3. The expression recognition method of claim 1, wherein the convolution module comprises: an input layer, a convolution layer, a normalization layer, an activation function layer, a pooling layer and an output layer;

the input end of the convolution layer is connected with the input layer, the input end of the normalization layer is connected with the output end of the convolution layer, the input end of the activation function layer is connected with the output end of the normalization layer, the input end of the pooling layer is connected with the output end of the activation function layer, and the input end of the output layer is connected with the output end of the pooling layer.

4. The expression recognition method of claim 1, wherein

the step of performing key point feature extraction on the feature response maps respectively output by the first three convolution modules by using the face key point position information comprises:

extracting, by using the face key point position information, the response values corresponding to the face key point position information from the feature response map output by each convolution module; and

performing a weighted average of the response values corresponding to the face key point position information in each feature response map, to obtain the key point feature information of the feature response map output by each convolution module.

5. The expression recognition method of claim 4, wherein,

the key point feature information is obtained by the following formula:

$$K_{i,j} = \frac{1}{N}\sum_{n=1}^{N} f_{i,j}^{n}$$

wherein $K_{i,j}$ is the key point feature information, $f_{i,j}^{n}$ is the response value of the n-th channel of the feature response map at the face key point position, and N is the number of channels of the feature response map.

6. The expression recognition method of claim 4, wherein

before the step of extracting, by using the face key point position information, the response values corresponding to the face key point position information from the feature response map output by each convolution module, the method further comprises:

adjusting the size of the feature response map output by each convolution module to be the same as the size of the face image.

7. The expression recognition method of claim 1, wherein

before the step of detecting face key point positions in the face image to obtain the face key point position information, the method further comprises:

acquiring an input image, performing face detection on the input image, and adjusting the size of the detected face image to a preset size.

8. The expression recognition method of any one of claims 1 to 7, wherein

the neural network classifier is obtained through training by a stochastic gradient descent method.

9. An expression recognition apparatus, characterized by comprising:

a face key point detection module, configured to detect face key point positions in a face image to obtain face key point position information;

four cascaded convolution modules, configured to receive the face image and sequentially perform feature processing on the input face image to obtain a feature response map output by the fourth convolution module;

a global average pooling layer module, configured to obtain a feature vector of a first dimension from the feature response map output by the fourth convolution module;

a key point feature information module, configured to perform key point feature extraction on the feature response maps respectively output by the first three convolution modules by using the face key point position information, to obtain key point feature information of the feature response maps respectively output by the first three convolution modules;

a feature vector connection module, configured to concatenate the feature vector of the first dimension with the key point feature information of the feature response maps respectively output by the first three convolution modules to obtain a feature vector of a second dimension;

a fully connected layer module, configured to process the input feature vector of the second dimension to obtain a feature vector of a third dimension; and

a trained neural network classifier, configured to receive the feature vector of the third dimension and output expression category information of the face image.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the program.

11. A computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of any of claims 1-8.