CN108229268A

CN108229268A - Expression recognition and convolutional neural network model training method and device and electronic equipment

Info

Publication number: CN108229268A
Application number: CN201611268009.6A
Authority: CN
Inventors: 金啸; 胡晨晨; 旷章辉; 张伟
Original assignee: Sensetime Group Ltd
Current assignee: Sensetime Group Ltd
Priority date: 2016-12-31
Filing date: 2016-12-31
Publication date: 2018-06-29

Abstract

Embodiments of the invention provide an expression recognition method, a convolutional neural network model training method, and corresponding devices and electronic equipment. The expression recognition method comprises: extracting facial expression features from a face image to be detected, through the convolutional layer part of a convolutional neural network model and the face key points acquired from the face image, to obtain a facial expression feature map; determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; pooling each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and obtaining an expression recognition result for the face image at least according to the ROI feature map. The embodiments can effectively capture subtle expression changes, better handle the differences caused by different facial poses, and make full use of the detailed information on changes in multiple facial regions, so that subtle expression changes and faces in different poses are recognized more accurately.

Description

Expression recognition and convolutional neural network model training method, device and electronic equipment

Technical field

The embodiments of the present invention relate to the technical field of artificial intelligence, and in particular to an expression recognition method, device, and electronic equipment, and to a convolutional neural network model training method, device, and electronic equipment.

Background

Facial expression recognition assigns an expression category, such as anger, disgust, happiness, sadness, fear, or surprise, to a given face image. It is showing increasingly broad application prospects in fields such as human-computer interaction, clinical diagnosis, distance education, and investigation and interrogation, and it is a popular research direction in computer vision and artificial intelligence.

One existing facial expression recognition technology is based on a traditional machine learning framework. Expression recognition with such a framework can include four basic steps: face detection, face feature extraction, feature dimensionality reduction, and classification according to the features. However, first, the face features must be designed and extracted manually, which requires domain-specific expertise; second, compared with deep features (feature maps), classic geometric features such as Gabor filters and SIFT have a lower level of abstraction and weaker expressive power; third, traditional machine learning methods have difficulty exploiting ever-growing amounts of training data, take a long time to train, and have a fragmented and complicated training process.

As a result, existing expression recognition is costly and has low accuracy.

Summary of the invention

An embodiment of the present invention provides a technical solution for expression recognition.

According to a first aspect of the embodiments of the present invention, an expression recognition method is provided, including: performing facial expression feature extraction on a face image to be detected, through the convolutional layer part of a convolutional neural network model and the acquired face key points in the face image to be detected, to obtain a facial expression feature map; determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and acquiring an expression recognition result of the face image at least according to the ROI feature map.

Optionally, the face image includes a static face image.

Optionally, the face image includes a face image in a video frame sequence.

Optionally, obtaining the expression recognition result of the face image at least according to the ROI feature map includes: acquiring a preliminary expression recognition result of the face image of the current frame according to the ROI feature map of the face image of the current frame; and acquiring the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame.

Optionally, acquiring the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame includes: weighting the preliminary facial expression recognition result of the face image of the current frame with the facial expression recognition result of the face image of at least one previous frame to obtain the expression recognition result of the face image of the current frame, wherein the weight of the preliminary expression recognition result of the face image of the current frame is greater than the weight of the expression recognition result of the face image of any previous frame.
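The weighting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each recognition result is a list of per-class scores, and the function names, the default weight of 0.6, and the equal split of the remaining weight across previous frames are all hypothetical choices that the text does not specify.

```python
def fuse_expression_scores(current, previous, current_weight=0.6):
    """Blend the current frame's preliminary scores with earlier frames' results.

    current: per-class scores for the current frame (preliminary result).
    previous: list of per-class score lists, one per previous frame.
    current_weight: hypothetical weight given to the current frame.
    """
    if not previous:
        return list(current)
    # Assumption: previous frames share the remaining weight equally.
    prior_weight = (1.0 - current_weight) / len(previous)
    # The text requires the current frame to outweigh any single previous frame.
    if current_weight <= prior_weight:
        raise ValueError("current frame must outweigh every previous frame")
    fused = [current_weight * score for score in current]
    for scores in previous:
        for i, score in enumerate(scores):
            fused[i] += prior_weight * score
    return fused


def predicted_class(scores):
    """Index of the highest-scoring expression class."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With two previous frames and the default weight, the current frame contributes 0.6 and each previous frame 0.2, so a single noisy frame cannot flip the result on its own while the current frame still dominates.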

Optionally, before acquiring the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame, the method further includes: determining that the position of the current frame in the video frame sequence is greater than or equal to a set position threshold.

Optionally, the method further includes: in response to the position of the current frame in the video frame sequence being less than the set position threshold, outputting the facial expression recognition result of the face image of the current frame, and/or saving the facial expression recognition result of the face image of the current frame.
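The position-threshold check can be sketched as a loop over a video frame sequence. The sketch below is an assumption-laden illustration: frame positions are taken to be zero-based indices, results are lists of per-class scores, and `position_threshold`, `window`, and `current_weight` are hypothetical parameters whose values the text leaves open.

```python
def process_sequence(preliminary_results, position_threshold=2,
                     current_weight=0.6, window=2):
    """Return the final per-frame scores for a sequence of preliminary scores.

    Frames before position_threshold are output and saved unchanged; later
    frames are weighted together with up to `window` saved earlier results.
    """
    finals = []
    for position, current in enumerate(preliminary_results):
        if position < position_threshold or not finals:
            # Early frame: output and save the preliminary result as-is.
            final = list(current)
        else:
            prior = finals[-window:]
            # Assumption: previous frames share the remaining weight equally.
            prior_weight = (1.0 - current_weight) / len(prior)
            final = [current_weight * score
                     + prior_weight * sum(p[i] for p in prior)
                     for i, score in enumerate(current)]
        finals.append(final)
    return finals
```

Saving the early results matters because they become the history that later frames are weighted against.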

Optionally, performing facial expression feature extraction on the face image to be detected, through the convolutional layer part of the convolutional neural network model and the acquired face key points in the face image to be detected, to obtain the facial expression feature map includes: performing face key point detection on the face image to be detected to obtain the face key points in the face image; and, according to the face key points, performing facial expression feature extraction on the face image through the convolutional layer part of the convolutional neural network model to obtain the facial expression feature map.

Optionally, before performing facial expression feature extraction on the face image to be detected through the convolutional layer part of the convolutional neural network model and the acquired face key points in the face image to be detected, the method further includes: acquiring sample images for training, and training the convolutional neural network model with the sample images, wherein the sample images contain face key point information and facial expression annotation information.

Optionally, acquiring the sample images for training and training the convolutional neural network model with the sample images includes: acquiring a sample image for training, and performing facial expression feature extraction on the sample image through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map.

Optionally, determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map includes: determining, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point; and, taking each determined position as a reference point, acquiring a region of a set range around it and determining each acquired region as the corresponding ROI.
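One way to read this ROI determination step is sketched below. It assumes key points are given in image pixel coordinates and are mapped to the feature map by simple proportional scaling, and that the "set range" is a square of `half_extent` cells on each side of the reference point. The function name and both of these choices are hypothetical; the text only requires a region of a set range around each mapped position.

```python
def keypoint_rois(keypoints, image_size, feature_size, half_extent=1):
    """Map image-space key points to square ROIs on the feature map.

    keypoints: list of (x, y) positions in image pixels.
    image_size, feature_size: (width, height) of the image and feature map.
    half_extent: hypothetical "set range", in feature-map cells.
    Returns (x0, y0, x1, y1) boxes, inclusive, clipped to the feature map.
    """
    img_w, img_h = image_size
    feat_w, feat_h = feature_size
    rois = []
    for x, y in keypoints:
        # Scale the key point into feature-map coordinates (the reference point).
        fx = min(int(x * feat_w / img_w), feat_w - 1)
        fy = min(int(y * feat_h / img_h), feat_h - 1)
        # Take the set range around the reference point, clipped at the borders.
        rois.append((max(fx - half_extent, 0),
                     max(fy - half_extent, 0),
                     min(fx + half_extent, feat_w - 1),
                     min(fy + half_extent, feat_h - 1)))
    return rois
```

Because ROIs near the border are clipped, they can be smaller than interior ones, which is one reason a later pooling step to a fixed size is useful.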

Optionally, performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain the pooled ROI feature map includes: performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map of a set size. Adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map includes: inputting the ROI of the set size into a loss layer to obtain the expression classification result error for the expression classification of the sample image; and adjusting the network parameters of the convolutional neural network model according to the expression classification result error.
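Pooling an ROI to a set size can be illustrated with a small max-pooling routine. This is a sketch under assumptions: the text does not say which pooling operator is used, so max pooling is an illustrative choice, the feature map is single-channel, and `roi_max_pool` and `out_size` are hypothetical names.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one ROI of a 2-D feature map down to out_size x out_size.

    feature_map: list of rows (single channel).
    roi: (x0, y0, x1, y1), inclusive bounds on the feature map.
    """
    x0, y0, x1, y1 = roi
    width = x1 - x0 + 1
    height = y1 - y0 + 1
    pooled = []
    for by in range(out_size):
        # Bin rows run from floor(by*h/n) to ceil((by+1)*h/n), so every bin
        # covers at least one cell even when the ROI is smaller than out_size.
        ys = y0 + (by * height) // out_size
        ye = y0 + ((by + 1) * height + out_size - 1) // out_size
        row = []
        for bx in range(out_size):
            xs = x0 + (bx * width) // out_size
            xe = x0 + ((bx + 1) * width + out_size - 1) // out_size
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Whatever the ROI's original extent, the output always has `out_size` rows and columns, so the layers that follow see inputs of a consistent shape.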

Optionally, inputting the ROI of the set size into the loss layer to obtain the expression classification result error for the expression classification of the sample image includes: inputting the ROI of the set size into the loss layer, and calculating and outputting the expression classification result error through the logistic regression loss function of the loss layer.

Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression classes.
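A logistic regression loss over a set number of classes is commonly realized as a multinomial logistic (softmax cross-entropy) loss. The sketch below assumes that reading, which the text does not confirm; `softmax_loss` is a hypothetical name, `logits` stands for the per-class scores produced for one sample, and `label` is the index of the annotated expression.

```python
import math


def softmax_loss(logits, label):
    """Multinomial logistic (softmax cross-entropy) loss for one sample.

    logits: one raw score per expression class (the set number of classes).
    label: index of the annotated expression class.
    """
    if not 0 <= label < len(logits):
        raise ValueError("label out of range")
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Loss is -log(softmax(logits)[label]).
    return log_sum - logits[label]
```

For six classes with equal scores the loss is log(6), the value for a uniform guess, and it approaches zero as the score of the annotated class dominates.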

Optionally, the facial expression sample images used for training are sample images of a video frame sequence.

Optionally, before acquiring the sample images for training and the corresponding face key point information, the method further includes: detecting the sample images for training to obtain the face key point information.

According to a second aspect of the embodiments of the present invention, an expression recognition device is provided, including: a first determination module, configured to perform facial expression feature extraction on a face image to be detected, through the convolutional layer part of a convolutional neural network model and the acquired face key points in the face image to be detected, to obtain a facial expression feature map; a second determination module, configured to determine the region of interest (ROI) corresponding to each face key point in the facial expression feature map; a third determination module, configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and a fourth determination module, configured to acquire an expression recognition result of the face image at least according to the ROI feature map.

Optionally, the third determination module includes: a first acquisition submodule, configured to acquire a preliminary expression recognition result of the face image of the current frame according to the ROI feature map of the face image of the current frame; and a second acquisition submodule, configured to acquire the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame.

Optionally, the second acquisition submodule is configured to weight the preliminary facial expression recognition result of the face image of the current frame with the facial expression recognition result of the face image of at least one previous frame to obtain the expression recognition result of the face image of the current frame, wherein the weight of the preliminary expression recognition result of the face image of the current frame is greater than the weight of the expression recognition result of the face image of any previous frame.

Optionally, the device further includes: a fifth determination module, configured to determine that the position of the current frame in the video frame sequence is greater than or equal to a set position threshold.

Optionally, the device further includes: a response module, configured to, in response to the position of the current frame in the video frame sequence being less than the set position threshold, output the facial expression recognition result of the face image of the current frame, and/or save the facial expression recognition result of the face image of the current frame.

Optionally, the first determination module is configured to perform face key point detection on the face image to be detected to obtain the face key points in the face image, and, according to the face key points, to perform facial expression feature extraction on the face image through the convolutional layer part of the convolutional neural network model to obtain the facial expression feature map.

Optionally, the first determination module is configured to extract face key points from the face image to be detected through the convolutional layer part of the convolutional neural network model, and, according to the extracted face key points, to perform facial expression feature extraction on the face image to be detected to obtain the facial expression feature map.

Optionally, the device further includes: a training module, configured to acquire sample images for training and train the convolutional neural network model with the sample images, wherein the sample images contain face key point information and facial expression annotation information.

Optionally, the training module includes: a first submodule, configured to acquire a sample image for training and perform facial expression feature extraction on the sample image through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; a second submodule, configured to determine the region of interest (ROI) corresponding to each face key point in the facial expression feature map; a third submodule, configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and a fourth submodule, configured to adjust the network parameters of the convolutional neural network model at least according to the ROI feature map.

Optionally, the second submodule is configured to determine, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point, and, taking each determined position as a reference point, to acquire a region of a set range around it and determine each acquired region as the corresponding ROI.

Optionally, the third submodule is configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map of a set size; and the fourth submodule is configured to input the ROI of the set size into a loss layer to obtain the expression classification result error for the expression classification of the sample image, and to adjust the network parameters of the convolutional neural network model according to the expression classification result error.

Optionally, the fourth submodule is configured to input the ROI of the set size into the loss layer, and to calculate and output the expression classification result error through the logistic regression loss function of the loss layer.

Optionally, the device further includes: a sixth determination module, configured to detect the sample images for training to obtain the face key point information.

According to a third aspect of the embodiments of the present invention, an electronic device is provided, including: a processor, a memory, a communication element, and a communication bus, wherein the processor, the memory, and the communication element communicate with one another through the communication bus; and the memory is used to store at least one executable instruction that causes the processor to execute any of the expression recognition methods of the first aspect.

According to a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, which stores: executable instructions for performing facial expression feature extraction on a face image to be detected, through the convolutional layer part of a convolutional neural network model and the acquired face key points in the face image to be detected, to obtain a facial expression feature map; executable instructions for determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; executable instructions for performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and executable instructions for acquiring an expression recognition result of the face image at least according to the ROI feature map.

According to a fifth aspect of the embodiments of the present invention, a convolutional neural network model training method is provided, including: acquiring sample images for training and the corresponding face key point information, wherein the sample images contain facial expression annotation information; performing facial expression feature extraction on the sample images through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map.

Optionally, performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain the pooled ROI feature map includes: performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map of a set size. Adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map includes: inputting the ROI of the set size into a loss layer to obtain the expression classification result error for the expression classification of the sample image; and adjusting the network parameters of the convolutional neural network model according to the expression classification result error.

According to a sixth aspect of the embodiments of the present invention, a convolutional neural network model training device is provided, including: a first acquisition module, configured to acquire sample images for training and the corresponding face key point information, wherein the sample images contain facial expression annotation information; a second acquisition module, configured to perform facial expression feature extraction on the sample images through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; a third acquisition module, configured to determine the region of interest (ROI) corresponding to each face key point in the facial expression feature map; a fourth acquisition module, configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and a fifth acquisition module, configured to adjust the network parameters of the convolutional neural network model at least according to the ROI feature map.

Optionally, the third acquisition module is configured to determine, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point, and, taking each determined position as a reference point, to acquire a region of a set range around it and determine each acquired region as the corresponding ROI.

Optionally, the fourth acquisition module is configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map of a set size; and the fifth acquisition module includes: a first acquisition submodule, configured to input the ROI of the set size into a loss layer to obtain the expression classification result error for the expression classification of the sample image; and an adjustment submodule, configured to adjust the network parameters of the convolutional neural network model according to the expression classification result error.

Optionally, the first acquisition submodule is configured to input the ROI of the set size into the loss layer, and to calculate and output the expression classification result error through the logistic regression loss function of the loss layer.

Optionally, the device further includes: a sixth acquisition module, configured to detect the sample images for training to obtain the face key point information.

According to a seventh aspect of the embodiments of the present invention, an electronic device is provided, including: a processor, a memory, a communication element, and a communication bus, wherein the processor, the memory, and the communication element communicate with one another through the communication bus; and the memory is used to store at least one executable instruction that causes the processor to execute any of the convolutional neural network model training methods of the fifth aspect.

According to an eighth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, which stores: executable instructions for acquiring sample images for training and the corresponding face key point information, wherein the sample images contain facial expression annotation information; executable instructions for performing facial expression feature extraction on the sample images through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; executable instructions for determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; executable instructions for performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and executable instructions for adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map.

According to the technical solution provided by the embodiments of the present invention, facial expression features are first extracted according to the face key points to obtain a facial expression feature map; the ROI (region of interest) corresponding to each face key point is then determined from the facial expression feature map according to the face key points; after the ROI is processed by the ROI pooling layer, an ROI feature map is obtained; and the facial expression is then determined according to the ROI feature map. Selecting the regions corresponding to the face key points as ROIs makes it possible to capture subtle expression changes effectively, to better handle the differences caused by different facial poses, and to make full use of the detailed information on changes in multiple facial regions, so that subtle expression changes and faces in different poses are recognized more accurately.

Description of drawings

FIG. 1 is a flowchart of the steps of an expression recognition method according to Embodiment 1 of the present invention;

FIG. 2 is a flowchart of the steps of an expression recognition method according to Embodiment 2 of the present invention;

FIG. 3 is a flowchart of the steps of an expression recognition method according to Embodiment 3 of the present invention;

FIG. 4 is a flowchart of the steps of a convolutional neural network model training method according to Embodiment 4 of the present invention;

FIG. 5 is a structural block diagram of an expression recognition device according to Embodiment 5 of the present invention;

FIG. 6 is a structural block diagram of a convolutional neural network model training device according to Embodiment 6 of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 7 of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device according to Embodiment 8 of the present invention.

Detailed Description

Specific implementations of the embodiments of the present invention are described in further detail below with reference to the accompanying drawings (in which the same reference numerals denote the same elements) and the embodiments. The following embodiments are intended to illustrate the present invention, not to limit its scope.

Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present invention are used only to distinguish different steps, devices, modules and the like; they neither carry any particular technical meaning nor indicate a necessary logical order among them.

Embodiment 1

Referring to FIG. 1, a flowchart of the steps of an expression recognition method according to Embodiment 1 of the present invention is shown.

The expression recognition method of this embodiment includes the following steps:

Step S102: Using the convolutional layer part of a convolutional neural network model and the acquired face key points in the face image to be detected, perform facial expression feature extraction on the face image to be detected to obtain a facial expression feature map.

A trained convolutional neural network model has a facial expression recognition function and includes at least an input layer part, a convolutional layer part, a pooling layer part, a fully connected layer part, and so on. The input layer part receives the input image; the convolutional layer part performs feature extraction; the pooling layer part pools the output of the convolutional layer part, for example by downsampling the feature map obtained by the convolutional layer part; and the fully connected layer part can be used for classification and the like.

In this embodiment, facial expression feature extraction is performed by the convolutional layer part of the convolutional neural network model to obtain the facial expression feature map. The face key points can be acquired in several ways. In one feasible manner, they are obtained by performing face key point detection on the face image to be detected before the image is input to the convolutional neural network model. In another feasible manner, they are extracted by the convolutional layer part of the model; that is, the convolutional layer part first extracts the face key points from the face image to be detected and then performs further facial expression feature extraction based on the extracted key points to obtain the facial expression feature map. In yet another feasible manner, they are obtained by manually annotating the face key points on the face image to be detected before the image is input to the model.

Step S104: Determine the ROI corresponding to each face key point in the facial expression feature map.

The facial expression feature map output by the convolutional layer part contains the processing result for the whole image. This result holds a large amount of data; recognizing the facial expression directly from it would require processing all of that data and place a heavy load on the system.

For this reason, in the solution of the embodiments of the present invention, the ROI (Region of Interest) corresponding to each face key point is first determined according to the face key points. For example, when the ROI corresponding to each face key point is determined from the key point information, the corresponding position in the facial expression feature map is determined according to the coordinates of the face key point; a region of a set range centered on that position is then taken and determined as the ROI.
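The ROI selection described in Step S104 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the square window shape, the end-exclusive box convention, and the clipping at the feature map border are all assumptions made for the example.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) in feature-map coordinates; x2 and y2 are end-exclusive.
Box = Tuple[int, int, int, int]


def roi_around_keypoint(cx: int, cy: int, half_size: int,
                        map_width: int, map_height: int) -> Box:
    """Return a square window centered on (cx, cy), clipped to the feature map."""
    x1 = max(cx - half_size, 0)
    y1 = max(cy - half_size, 0)
    x2 = min(cx + half_size + 1, map_width)
    y2 = min(cy + half_size + 1, map_height)
    return (x1, y1, x2, y2)


def rois_for_keypoints(keypoints: List[Tuple[int, int]], half_size: int,
                       map_width: int, map_height: int) -> List[Box]:
    """Build one ROI per key point (hypothetical helper)."""
    return [roi_around_keypoint(x, y, half_size, map_width, map_height)
            for x, y in keypoints]


# Two hypothetical key points on an 8x8 feature map, with a 3x3 window.
rois = rois_for_keypoints([(5, 5), (0, 2)], half_size=1,
                          map_width=8, map_height=8)
print(rois)  # [(4, 4, 7, 7), (0, 1, 2, 4)]
```

The second key point lies on the left border, so its window is clipped and becomes smaller than 3x3; ROIs of differing size are exactly the case that the subsequent ROI pooling step is meant to normalize.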

Step S106: Pool each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map.

Here, the pooling processing includes, but is not limited to, downsampling.

Step S108: Obtain the expression recognition result of the face image at least according to the ROI feature map.

The ROI feature map contains the feature information of the facial expression, and the expression recognition result for the facial expression in the face image to be detected can be obtained from it.

With this embodiment, facial expression features are first extracted according to the face key points to obtain a facial expression feature map; the ROI corresponding to each face key point is then determined in that feature map; each ROI is processed by the ROI pooling layer to obtain an ROI feature map; and the facial expression is determined from the ROI feature map. Because the regions corresponding to the face key points are selected as ROIs, subtle expression changes can be captured effectively, the variation caused by different facial poses is handled better, and the detailed information about changes in multiple facial regions is fully used, so that subtle expression changes and faces in different poses are recognized more accurately.

Embodiment 2

Referring to FIG. 2, a flowchart of the steps of an expression recognition method according to Embodiment 2 of the present invention is shown.

In this embodiment, a convolutional neural network model with a facial expression recognition function is trained first, and facial expression recognition is then performed on images based on that model. Those skilled in the art should understand, however, that in actual use a convolutional neural network model trained by a third party may also be used for facial expression recognition.

Step S202: Acquire sample images for training, and use the sample images to train the convolutional neural network model.

The sample images may be static images or sample images from a video frame sequence. Each sample image contains face key point information and facial expression annotation information. In this embodiment, the face key point information is obtained by performing detection on the training sample images.

In one feasible manner of implementing this step, the sample images for training are acquired; facial expression features are extracted from the sample images by the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; the ROI corresponding to each face key point in the facial expression feature map is determined; each determined ROI is pooled by the pooling layer part of the model to obtain a pooled ROI feature map; and the network parameters of the model are adjusted at least according to the ROI feature map.

In one feasible manner, determining the ROI corresponding to each face key point in the facial expression feature map includes: determining, in the facial expression feature map, the position corresponding to each face key point according to its coordinates; and, taking each determined position as a reference point, acquiring a corresponding region of a set range and determining each acquired region as the corresponding ROI.

Pooling each determined ROI through the pooling layer part of the model yields a pooled ROI feature map of a set size. When the network parameters are adjusted according to the ROI feature map, the ROI feature map of the set size can be input to a loss layer to obtain the expression classification error for the sample image, and the network parameters of the model are then adjusted according to that error. The adjusted network parameters include, but are not limited to, the weight parameters (weight) and the bias parameters (bias).

The expression classification error can be obtained by inputting the ROI feature map of the set size to the loss layer, where the logistic regression loss function of the loss layer computes and outputs the error. The logistic regression loss function may be one with a set number of expression classes.
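As a rough illustration of a multi-class logistic regression loss with a set number of expression classes, the sketch below computes a softmax cross-entropy error from raw class scores. It assumes that the loss layer receives one score per expression class and that the loss is the negative log-probability of the annotated class; the function names and the ten-class setting are choices made for the example, not details confirmed by this passage.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw class scores to probabilities (max subtracted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores: List[float], label: int) -> float:
    """Negative log-probability of the annotated class."""
    return -math.log(softmax(scores)[label])


# With ten equally scored classes the loss equals ln(10).
uniform_loss = softmax_loss([0.0] * 10, label=3)

# A confident, correct prediction gives a much smaller loss.
confident_scores = [0.0] * 10
confident_scores[3] = 5.0
confident_loss = softmax_loss(confident_scores, label=3)
print(uniform_loss > confident_loss)  # True
```

During training, the gradient of this error with respect to the weights and biases would drive the parameter adjustment described above; the gradient computation is omitted here.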

Through the above process, a convolutional neural network model with an expression recognition function is trained; facial expression detection can then be performed based on the trained model.

Step S204: Acquire the face image to be detected.

The face image to be detected may be a static face image or a face image from a video frame sequence.

Step S206: Perform face key point detection on the face image to be detected to obtain the face key points in the face image.

In this embodiment, the face key points are obtained by first performing detection on the face image. As noted above, however, if the convolutional neural network model has a face key point detection function, the face image to be detected can be input directly to the model, and the convolutional layer part of the model extracts the face key points from it. Facial expression features are then extracted from the face image to be detected according to the extracted key points to obtain the facial expression feature map.

Step S208: Using the convolutional layer part of the convolutional neural network model and the acquired face key points in the face image to be detected, perform facial expression feature extraction on the face image to be detected to obtain a facial expression feature map.

Step S210: Determine the ROI corresponding to each face key point in the facial expression feature map.

Step S212: Pool each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map corresponding to each ROI.

Step S214: Obtain the expression recognition result of the face image at least according to the ROI feature map.

Once the ROI feature map has been obtained, expression recognition can be performed according to it.

In a preferred solution, when the convolutional neural network model is used to detect facial expressions in a continuous video frame sequence, taking the current frame as the reference, the model is first applied to the current frame in the sequence, and a preliminary expression recognition result for the face image of the current frame is obtained from the ROI feature map of that face image. The expression recognition result of the face image of the current frame is then obtained from this preliminary result and the expression recognition result of the face image of at least one previous frame. For example, after the preliminary expression recognition result of the current frame is obtained, it can be determined whether the position of the current frame in the video frame sequence is greater than or equal to a set position threshold. If not, that is, if the position of the current frame is less than the set position threshold, the preliminary result of the current frame is output as the final expression recognition result of the face image of the current frame, and/or that result is saved. If so, the expression recognition results of a set number of video frames preceding the current frame are obtained, and the preliminary expression recognition result of the current frame is linearly weighted with the obtained results of the at least one previous frame to obtain the expression recognition result of the face image of the current frame. The at least one previous frame may be one or more consecutive frames before the current frame, or one or more non-consecutive frames before the current frame. In this way, the expression recognition result of the current frame is determined from the detection results of multiple frames, which avoids the error of single-frame detection and makes the detection result more accurate.

When the preliminary expression recognition result of the current frame is linearly weighted with the obtained expression recognition results of the at least one previous frame, weights can be set separately for the preliminary result of the current frame and for each obtained previous-frame result, with the weight of the current frame's preliminary result greater than the weight of any obtained previous-frame result; the results are then linearly weighted according to the set weights. Because expression recognition is performed mainly on the current video frame, its detection result is given the larger weight; this allows the detection results of the associated video frames to serve as a reference while ensuring that the current video frame remains the detection target.

It should be noted that, in the above process, the set position threshold, the set number of video frames preceding the current frame, and the weights can all be set appropriately by those skilled in the art according to the actual situation. Preferably, the set number of video frames preceding the current video frame is 3.
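The frame fusion logic above can be sketched as below. The sketch assumes each recognition result is a list of per-class scores; the function name, the argument names, and the example weights (0.4 for the current frame and 0.3, 0.2, 0.1 for three previous frames) are hypothetical and chosen only so that the current frame has the largest weight.

```python
from typing import List, Sequence


def fuse_frame_result(current: Sequence[float],
                      previous: Sequence[Sequence[float]],
                      frame_position: int,
                      position_threshold: int,
                      current_weight: float,
                      previous_weights: Sequence[float]) -> List[float]:
    """Linearly weight the current preliminary result with earlier results."""
    # Too early in the sequence: the preliminary result is the final result.
    if frame_position < position_threshold:
        return list(current)
    if len(previous) != len(previous_weights):
        raise ValueError("one weight is required per previous frame")
    # The current frame must carry the largest weight.
    if any(w >= current_weight for w in previous_weights):
        raise ValueError("current frame weight must exceed all others")
    fused = [current_weight * score for score in current]
    for weight, scores in zip(previous_weights, previous):
        for i, score in enumerate(scores):
            fused[i] += weight * score
    return fused


# Hypothetical two-class scores for the current frame and three previous frames.
current = [0.2, 0.8]
previous = [[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]
fused = fuse_frame_result(current, previous, frame_position=4,
                          position_threshold=4, current_weight=0.4,
                          previous_weights=[0.3, 0.2, 0.1])
early = fuse_frame_result(current, [], frame_position=1,
                          position_threshold=4, current_weight=0.4,
                          previous_weights=[])
```

Here `fused` is approximately [0.37, 0.63], while `early` is simply a copy of the current frame's preliminary result because the frame position is below the threshold.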

With this embodiment, a convolutional neural network model capable of accurately recognizing facial expressions is used, so that subtle changes in facial expression can be captured and expression recognition becomes more accurate and faster. In addition, for a continuous video frame sequence, fusing the detection results of multiple consecutive frames effectively avoids single-frame detection errors and further improves the accuracy of expression detection.

Embodiment 3

Referring to FIG. 3, a flowchart of the steps of an expression recognition method according to Embodiment 3 of the present invention is shown.

This embodiment describes the expression recognition method of the embodiments of the present invention by way of a specific example. The method of this embodiment includes both a part for training the convolutional neural network model and a part for performing expression recognition with the trained model.

Step S302: Collect facial expression images and annotate their expressions to form a set of sample images for training.

For example, ten expressions are annotated manually: angry, calm, confused, disgusted, happy, sad, scared, surprised, squinting, and screaming.

Step S304: Use a face detection algorithm to detect the face and its key points in each sample image, and align the face using the key points.

In this step, a conventional face detection algorithm can be used to detect the face and its key points in each sample image, for example 21 face key points covering the eyes, mouth, and so on; the 21 face key points are then used to align the face.

Step S306: Train a CNN model using the expression-annotated sample images and the face key points.

In this embodiment, a brief example of the structure of a CNN model is as follows:

// Part 1

1. Data input layer

// Part 2

2. <= 1 convolutional layer 1_1 (3x3x4/2)

3. <= 2 nonlinear response ReLU layer

4. <= 3 Pooling layer // ordinary Pooling layer

5. <= 4 convolutional layer 1_2 (3x3x6/2)

6. <= 5 nonlinear response ReLU layer

7. <= 6 Pooling layer

8. <= 7 convolutional layer 1_3 (3x3x6)

9. <= 8 nonlinear response ReLU layer

10. <= 9 Pooling layer

11. <= 10 convolutional layer 2_1 (3x3x12/2)

12. <= 11 nonlinear response ReLU layer

13. <= 12 Pooling layer

14. <= 13 convolutional layer 2_2 (3x3x12)

15. <= 14 nonlinear response ReLU layer

16. <= 15 Pooling layer

17. <= 16 nonlinear response ReLU layer

18. <= 17 convolutional layer 5_4 (3x3x16)

// Part 3

19. <= 18 ROI Pooling layer // Pooling layer that performs ROI pooling

20. <= 19 fully connected layer

21. <= 20 loss layer

In the above CNN model structure, the expression-annotated sample images and the face key points are input to the CNN model through the input layer of Part 1 for training and are then processed by the conventional convolutional layers of Part 2. Based on the output of Part 2, ROI feature maps are obtained according to the face key points and input to the ROI Pooling layer for ROI pooling, yielding pooled ROI feature maps. The pooled ROI feature maps are then input in turn to the fully connected layer and the loss layer, and the output of the loss layer determines how the network parameters of the CNN model are adjusted to train the model.

When the ROI feature maps are obtained from the output of Part 2 according to the face key points, the ROIs corresponding to the 21 face key points can be determined as follows. The coordinates of the 21 key points are first mapped back onto the feature map output by the last convolutional layer of the CNN model (the 32nd layer in this embodiment); that is, the 21 face key points detected on the original sample image are mapped onto the feature map output by the 32nd layer. Twenty-one small regions (for example 3×3 regions, or irregular regions) centered on these key points are then cut out of the feature map, and the feature maps of these 21 regions are used as the input of the ROI Pooling layer to obtain the ROI feature maps. The ROI feature maps are input to the fully connected layer, which is followed by a ten-class logistic regression loss function layer (for example, a SoftmaxWithLoss layer). The output is compared with the annotated facial expression label to compute the error, and the error is back-propagated to update the parameters of the CNN model (including the parameters of the fully connected layer). This cycle is repeated until the error no longer decreases and the CNN model converges, giving the trained model.
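The mapping of key points from the original image onto the feature map, followed by cutting out a small region, might look like the sketch below. The text does not state how coordinates are mapped, so simple proportional scaling by the ratio of feature map size to image size is assumed here; the function names and the single-channel list-of-lists feature map are likewise illustrative.

```python
from typing import List, Tuple


def map_keypoint_to_feature_map(x: float, y: float,
                                image_size: Tuple[int, int],
                                map_size: Tuple[int, int]) -> Tuple[int, int]:
    """Scale an image coordinate to a feature-map cell (assumed proportional)."""
    image_w, image_h = image_size
    map_w, map_h = map_size
    fx = min(int(x * map_w / image_w), map_w - 1)
    fy = min(int(y * map_h / image_h), map_h - 1)
    return fx, fy


def crop_region(feature_map: List[List[float]], cx: int, cy: int,
                half_size: int = 1) -> List[List[float]]:
    """Cut out the window centered on (cx, cy), clipped at the map border."""
    height, width = len(feature_map), len(feature_map[0])
    rows = range(max(cy - half_size, 0), min(cy + half_size + 1, height))
    cols = range(max(cx - half_size, 0), min(cx + half_size + 1, width))
    return [[feature_map[r][c] for c in cols] for r in rows]


# Hypothetical 8x8 single-channel feature map for a 64x64 input image.
feature_map = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
fx, fy = map_keypoint_to_feature_map(36.0, 20.0, (64, 64), (8, 8))
region = crop_region(feature_map, fx, fy, half_size=1)
print((fx, fy))  # (4, 2)
print(region)    # [[11.0, 12.0, 13.0], [19.0, 20.0, 21.0], [27.0, 28.0, 29.0]]
```

In a real network the feature map has many channels and the scale factor follows from the strides of the convolution and pooling layers; both are simplified away in this sketch.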

Because the 21 ROI regions cover all positions related to facial expression and carry no redundant information, the CNN model can concentrate on learning these regions and more easily captures subtle changes in the facial muscles. The ROI Pooling layer outputs a fixed-length ROI feature representation, so the same network structure can be used even when ROI regions of different sizes are input. This fixed-length feature representation is then input in turn to the fully connected layer and the loss layer to obtain the final expression classification result.

The ROI Pooling layer is a pooling layer for ROI feature maps. For example, if the coordinates of an ROI region are (x1, y1, x2, y2), the input size is (y2-y1)×(x2-x1); if the output size of the ROI Pooling layer is pooled_height×pooled_width, the output of each grid cell is the pooling result over a region of size [(y2-y1)/pooled_height]×[(x2-x1)/pooled_width].
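A minimal single-channel sketch of this ROI pooling rule follows. The text only says "pooling result", so max pooling is an assumption; the end-exclusive treatment of (x2, y2), the integer floor and ceiling bin boundaries, and the function name are also choices made for the example.

```python
from typing import List


def roi_max_pool(feature_map: List[List[float]],
                 x1: int, y1: int, x2: int, y2: int,
                 pooled_height: int, pooled_width: int) -> List[List[float]]:
    """Max-pool the ROI (x1, y1, x2, y2) to a pooled_height x pooled_width grid."""
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for ph in range(pooled_height):
        # Bin rows: floor of the start and ceiling of the end, in integer math.
        r_start = y1 + (ph * roi_h) // pooled_height
        r_end = y1 + ((ph + 1) * roi_h + pooled_height - 1) // pooled_height
        row = []
        for pw in range(pooled_width):
            c_start = x1 + (pw * roi_w) // pooled_width
            c_end = x1 + ((pw + 1) * roi_w + pooled_width - 1) // pooled_width
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map; two ROIs of different size give 2x2 outputs.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled_full = roi_max_pool(feature_map, 0, 0, 4, 4, 2, 2)
pooled_small = roi_max_pool(feature_map, 1, 1, 4, 4, 2, 2)
print(pooled_full)   # [[5.0, 7.0], [13.0, 15.0]]
print(pooled_small)  # [[10.0, 11.0], [14.0, 15.0]]
```

Both the 4x4 ROI and the 3x3 ROI produce a 2x2 output, which is the fixed-length property that lets the fully connected layer accept ROIs of different sizes.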

In addition, the following should be noted:

In the above description of the convolutional network structure, "2. <= 1" indicates that the current layer is the second layer and that its input is the first layer. The parentheses after a convolutional layer give its parameters; for example, (3x3x16) indicates a 3x3 convolution kernel with 16 channels. The rest follow by analogy and are not repeated here.

In the above convolutional network structure, each convolutional layer is followed by a nonlinear response unit ReLU. Preferably, this ReLU may be a PReLU (Parametric Rectified Linear Unit), to effectively improve the detection accuracy of the CNN model.

In addition, setting the convolution kernel of the convolutional layers to 3x3 integrates local information better, and setting a stride for the convolutional layers lets upper-layer features obtain a larger receptive field without increasing the amount of computation.

However, those skilled in the art should understand that the above kernel size, number of channels, and number of convolutional layers are all illustrative; in practical applications they can be adapted as needed, and the embodiments of the present invention place no limitation on them. Furthermore, the combination and parameters of all layers in the convolutional network model of this embodiment are optional and can be combined arbitrarily.

Step S308: Perform expression recognition on the aligned facial expression images with the trained CNN model to obtain the recognition result.

Unlike during training, when the trained CNN model is used for expression recognition the fully connected layer of the model is followed by a ten-class logistic regression layer rather than a logistic regression loss function layer, so that the recognition result is obtained directly.
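At inference time, a ten-class logistic regression layer can be pictured as a softmax over the fully connected layer's outputs followed by picking the most probable class. The sketch below assumes that interpretation; the label order follows the ten expressions listed in Step S302, but the assignment of indices to labels and the function name are hypothetical.

```python
import math
from typing import List, Tuple

# Ten expression labels from Step S302; the index order is an assumption.
EXPRESSIONS = ["angry", "calm", "confused", "disgusted", "happy",
               "sad", "scared", "surprised", "squinting", "screaming"]


def classify_expression(scores: List[float]) -> Tuple[str, List[float]]:
    """Turn ten raw scores into probabilities and return the top label."""
    if len(scores) != len(EXPRESSIONS):
        raise ValueError("expected one score per expression class")
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return EXPRESSIONS[best], probs


# Hypothetical fully connected layer output favoring the fifth class.
scores = [0.0] * 10
scores[4] = 3.0
label, probs = classify_expression(scores)
print(label)  # happy
```

Returning the full probability list alongside the label keeps the per-class scores available for the multi-frame weighting used with video input.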

For a single image, expression recognition can be performed directly with the trained CNN model described above.

For a video frame sequence, each frame is recognized in the same way as a single image. To improve the accuracy of expression recognition on a video frame sequence, however, multiple frames can be fused. For example, let t = 1 denote the first frame of the video. When t >= 3, that is, when the current frame is at or after the third frame, the current frame and the two frames before it are recognized together, giving a recognition result for each of the three frames. Denote the three input frames X_{t-2}, X_{t-1} and X_t, and their recognition results Y_{t-2}, Y_{t-1} and Y_t. The three recognition results are linearly weighted: with a weight of 0.5 for the current frame and 0.25 for each of the two preceding frames, the final prediction is Y = 0.25×Y_{t-2} + 0.25×Y_{t-1} + 0.5×Y_t. When t < 3, that is, when the current frame is before the third frame, Y = Y_t. Those skilled in the art should understand that these weights are only illustrative; in practical applications the weight of each frame can be set as needed, provided that the weight of the current frame is greater than that of the other frames.
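Written as a piecewise formula, with a worked case for a single expression class, the weighting rule reads as follows. The scores 0.2, 0.6 and 0.8 for frames t-2, t-1 and t are hypothetical values chosen only to show the arithmetic.

```latex
Y =
\begin{cases}
0.25\,Y_{t-2} + 0.25\,Y_{t-1} + 0.5\,Y_{t}, & t \ge 3,\\[2pt]
Y_{t}, & t < 3,
\end{cases}
\qquad
\text{e.g. } 0.25(0.2) + 0.25(0.6) + 0.5(0.8) = 0.05 + 0.15 + 0.40 = 0.60 .
```

The weights sum to 1, so if each per-frame result is a probability distribution over the expression classes, the fused result is one as well.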

With this embodiment, the regions corresponding to the 21 face key points are selected as ROIs, so that the detailed information about changes in multiple facial regions is fully used and subtle changes in facial expression can be captured, making recognition more accurate and faster; and by fusing multiple frames, the solution of this embodiment can be applied effectively to video-based expression recognition.

The expression recognition method of this embodiment can be executed by any suitable device with data processing capability, including but not limited to mobile terminals, PCs, servers, vehicle-mounted devices, advertising machines, face-scoring machines, and the like.

实施例四Embodiment four

参照图4，示出了根据本发明实施例四的一种卷积神经网络模型训练方法的步骤流程图。Referring to FIG. 4 , it shows a flowchart of steps of a method for training a convolutional neural network model according to Embodiment 4 of the present invention.

本实施例的卷积神经网络模型训练方法包括以下步骤：The convolutional neural network model training method of the present embodiment includes the following steps:

步骤S402：获取训练用的样本图像及对应的人脸关键点的信息。Step S402: Acquiring training sample images and information of corresponding facial key points.

其中，样本图像中包含有人脸表情的标注信息，在进行CNN训练时，可以预先为样本图像进行标注，本实施例中，在样本图像中标注出相应的人脸表情信息，以便后续根据该标注信息确定CNN训练结果是否准确。Wherein, the sample image contains the annotation information of human facial expression. When performing CNN training, the sample image can be annotated in advance. In this embodiment, the corresponding facial expression information is marked in the sample image, so that the subsequent The information determines whether the CNN training results are accurate.

此外，对于每张样本图像，还需要获得相应的人脸关键点的信息。因此，在实际应用中，在本步骤之前，还可以包括：对训练用的样本图像进行检测，获得人脸关键点的信息。为使训练效果更好，在获得了人脸关键点的信息之后，还可以根据这些关键点进行人脸对齐，将人脸对齐后的样本图像输入CNN进行样本训练。通过人脸对齐，能够提高样本图像的训练效果。In addition, for each sample image, it is also necessary to obtain the information of the corresponding key points of the face. Therefore, in practical applications, before this step, it may also include: detecting the sample images used for training to obtain information about key points of the human face. In order to make the training effect better, after obtaining the information of key points of the face, face alignment can also be performed according to these key points, and the sample images after face alignment are input into CNN for sample training. Through face alignment, the training effect of sample images can be improved.

In the above process, both detecting facial key points in a sample and aligning the face can be implemented by those skilled in the art in any appropriate manner. Key point detection may be implemented by, but is not limited to, a CNN with a facial key point localization function, the ASM (Active Shape Model) method, or G-EBGM (elastic graph matching based on Gabor features); face alignment may be implemented by, but is not limited to, AAM (Active Appearance Model) or CLM (Constrained Local Models).

In addition, conventional key points, such as the 68 key points of the face, can be used, but the embodiments are not limited to these. In the embodiment of the present invention, 21 key points of the face can be used: 3 key points for each eyebrow (eyebrow head, eyebrow tail, and eyebrow peak), 3 key points for each eye (inner corner, outer corner, and pupil center), 4 key points for the nose (the outermost point of the nose wing on each side, the nose tip, and the lowest point of the nose tip), and 5 key points for the mouth (the two lip corners, the upper lip depression point, the lower lip depression point, and the midpoint of the contact line between the lower lip and the upper lip). On the one hand, these 21 key points represent the key parts of the face and can effectively characterize facial features; on the other hand, the convolutional neural network model of the embodiment of the present invention can be trained with these 21 key points alone, which reduces the amount of data and the training cost.
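The 21-point layout described above can be written down as a small lookup table. This is only an illustrative sketch: the part and point names (`FACE_KEYPOINTS_21`, `left_eyebrow`, and so on) are hypothetical labels chosen here, not identifiers from the patent.

```python
# Hypothetical names; the grouping follows the 21 key points listed in the text.
FACE_KEYPOINTS_21 = {
    "left_eyebrow": ["head", "tail", "peak"],
    "right_eyebrow": ["head", "tail", "peak"],
    "left_eye": ["inner_corner", "outer_corner", "pupil_center"],
    "right_eye": ["inner_corner", "outer_corner", "pupil_center"],
    "nose": ["left_wing_outermost", "right_wing_outermost", "tip", "tip_lowest"],
    "mouth": [
        "left_corner",
        "right_corner",
        "upper_lip_depression",
        "lower_lip_depression",
        "lip_contact_midpoint",
    ],
}


def keypoint_names(layout=FACE_KEYPOINTS_21):
    """Flatten the layout into an ordered list of 'part.point' names."""
    return [f"{part}.{point}" for part, points in layout.items() for point in points]
```

Calling `keypoint_names()` yields one name per key point, which is convenient for indexing the per-key-point ROIs in later steps.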

步骤S404：通过CNN模型的卷积层部分对样本图像进行人脸表情特征提取，获得人脸表情特征图。Step S404: Extracting facial expression features from the sample image through the convolutional layer of the CNN model to obtain a facial expression feature map.

In this embodiment, the convolutional part of the CNN model can adopt the convolutional structure of a conventional CNN model, and the sample image can be processed in the same way as in the convolutional layer part of such a model, which is not repeated here. After processing by the convolutional layer part, the corresponding facial expression feature map is obtained (the result of one pass through the convolutional layer part can be understood as the output of the CNN model in one training iteration).

步骤S406：确定人脸表情特征图中，与每一个人脸关键点分别对应的ROI。Step S406: Determine the ROI corresponding to each facial key point in the facial expression feature map.

The facial expression feature map output by the convolutional layers contains the processing result for the whole image. If this result were used directly for subsequent expression training, the amount of data to process would be large, and the training could not be targeted at facial expressions, which would make the training result inaccurate.

For this reason, in the solution of the embodiment of the present invention, the ROI corresponding to each facial key point is determined from the key point information as follows: in the facial expression feature map, the position corresponding to each facial key point is determined from that key point's coordinates; with each determined position as a reference point (for example, as the center point; in practice, a small deviation of the center point is allowed), a region of a set range is obtained, and each obtained region is determined to be the corresponding ROI. Taking 21 facial key points as an example, the coordinates of the 21 key points are first mapped back onto the facial expression feature map output by the last convolutional layer of the CNN; then, centered on each key point on the feature map, a region of a certain range is cropped (the cropped range is generally 3×3 to 7×7, preferably 3×3), and the feature maps of these 21 regions are used as the input of the ROI Pooling Layer. These 21 regions cover all positions related to facial expressions and contain no redundant information, so the network can concentrate on learning these regions and more easily capture subtle changes in the facial muscles.
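The mapping and cropping described above can be sketched in plain Python. The sketch makes several assumptions that the text does not fix: the feature map is a single-channel 2-D list, image coordinates map to feature-map coordinates by integer division with a known total `stride`, and windows near the border are shifted inward (consistent with the allowed small deviation of the center point). All function names are hypothetical.

```python
def keypoint_to_feature_coords(x, y, stride):
    """Map an image-space key point (x, y) to a feature-map (row, col).

    Assumes the convolutional part downsamples by a known total stride.
    """
    return int(y // stride), int(x // stride)


def crop_roi(feature_map, center_row, center_col, size=3):
    """Return a size x size window around the centre.

    Near the border the window is shifted inward so it stays inside the map.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    half = size // 2
    top = min(max(center_row - half, 0), max(rows - size, 0))
    left = min(max(center_col - half, 0), max(cols - size, 0))
    return [row[left:left + size] for row in feature_map[top:top + size]]


def rois_for_keypoints(feature_map, keypoints, stride, size=3):
    """Crop one ROI per key point; keypoints is a list of (x, y) image coordinates."""
    rois = []
    for x, y in keypoints:
        r, c = keypoint_to_feature_coords(x, y, stride)
        rois.append(crop_roi(feature_map, r, c, size))
    return rois
```

A real feature map has many channels; the same window would then be cut from every channel.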

步骤S408：通过CNN的池化层部分对确定的各ROI进行池化处理，获得池化后的ROI特征图。Step S408: performing pooling processing on each of the determined ROIs through the pooling layer part of the CNN to obtain a pooled ROI feature map.

In a CNN model, the pooling layer usually follows the convolutional layers; pooling reduces the dimensionality of the feature vectors output by the convolutional layers and makes the result less prone to overfitting. For images of different sizes, the size and stride of the pooling window can be computed dynamically from the image size so that pooling results of the same size are obtained.

在本发明实施例中，将ROI输入池化层，经过池化层的ROI池化处理后，可获得固定长度的ROI的特征表示，即尺寸统一的ROI特征图。In the embodiment of the present invention, the ROI is input into the pooling layer, and after the ROI pooling processing of the pooling layer, a feature representation of the ROI with a fixed length, that is, a ROI feature map with a uniform size can be obtained.
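One way to obtain a fixed-size output from ROIs of different sizes is to divide each ROI into a fixed grid of bins and take the maximum in each bin. The following is a minimal sketch under that assumption (max pooling, single-channel ROI as a 2-D list); the patent does not specify the pooling operator or the output size, and `roi_max_pool` is a hypothetical name.

```python
def roi_max_pool(roi, out_h=2, out_w=2):
    """Max-pool a 2-D ROI (list of lists) into an out_h x out_w grid."""
    h, w = len(roi), len(roi[0])
    pooled = []
    for i in range(out_h):
        # Bin start is a floor, bin end is a ceiling, so the bins cover the ROI.
        r0 = (i * h) // out_h
        r1 = max(-((-(i + 1) * h) // out_h), r0 + 1)
        row = []
        for j in range(out_w):
            c0 = (j * w) // out_w
            c1 = max(-((-(j + 1) * w) // out_w), c0 + 1)
            row.append(max(roi[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Because the bin boundaries are derived from the ROI size, a 3×3 ROI and a 7×7 ROI both yield an `out_h` × `out_w` result.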

Step S410: Adjust the network parameters of the CNN model at least according to the pooled ROI feature map.

In one feasible approach, the ROI feature map of the set size obtained by ROI pooling can be input into the fully connected layer for corresponding processing; the processed ROI features of the set size are then input into the loss layer to obtain the expression classification result error for the sample image (for example, the error can be the output of the loss layer); the network parameters of the CNN model are adjusted according to this error.

After the processing result of the ROI pooling layer is obtained, it can be input into the fully connected layer, which converts images of different sizes into features of the same dimension; the features output by the fully connected layer are then input into the loss layer to obtain a loss result, and the loss result determines whether the network parameters of the CNN model need to be adjusted to continue training. Specifically, in this embodiment, after the ROI feature map of the ROI pooling layer is obtained, it is input into the fully connected layer to obtain ROI features of a set dimension, where the set dimension can be chosen by those skilled in the art according to actual needs and is not limited by the embodiment of the present invention. The ROI features of the set dimension are input into the loss layer, and the loss result is calculated by the loss function. Whether the training output of the CNN model meets the convergence condition is then judged from the loss result: if the convergence condition is met, training of the CNN model ends; if not, the training parameters of the CNN model (including but not limited to the weight parameters and the bias parameters) are adjusted according to the loss result, and training continues with the adjusted parameters until the training result meets the convergence condition. The convergence condition can be set by those skilled in the art according to actual needs and is not limited by the embodiment of the present invention.
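The control flow of this train-until-convergence loop can be sketched independently of any particular network. The convergence test used here (change in loss below a tolerance) is only one possible choice, since the text leaves the condition open; `train_until_converged`, `make_toy_step`, and the quadratic toy objective are hypothetical stand-ins for a real forward pass, loss computation, and parameter update.

```python
def train_until_converged(step_fn, max_iters=1000, tol=1e-4):
    """Repeat step_fn() until the loss changes by less than tol.

    step_fn performs one forward pass, computes the loss, updates the
    parameters, and returns the loss value. Returns (iterations, last_loss).
    """
    prev = None
    for it in range(1, max_iters + 1):
        loss = step_fn()
        if prev is not None and abs(prev - loss) < tol:
            return it, loss  # convergence condition met: stop training
        prev = loss  # not converged: keep training with the adjusted parameters
    return max_iters, prev


def make_toy_step(lr=0.1):
    """Toy stand-in for a network: minimise (w - 3)^2 by gradient descent."""
    state = {"w": 0.0}

    def step():
        loss = (state["w"] - 3.0) ** 2
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= lr * grad  # parameter adjustment based on the loss
        return loss

    return step, state
```

In a real setting `step_fn` would run the convolutional layers, ROI pooling, fully connected layer, and loss layer on a batch of sample images.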

In this embodiment, the loss function of the loss layer is a logistic regression loss function. In this case, the ROI features of the set size output by the fully connected layer are input into the loss layer, the expression classification result error is calculated by the logistic regression loss function of the loss layer, and the error calculation result is then output. Optionally, the logistic regression loss function is one with a set number of expression classes. For example, if ten types of expression are annotated in the sample images and the CNN model is trained to recognize and classify these ten types, the logistic regression loss function can be a ten-class logistic regression loss function, where "ten-class" means that the ten annotated types of expression can be detected and recognized through this loss function.
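A multi-class logistic regression loss is commonly realized as softmax followed by cross-entropy. The sketch below assumes that reading; the patent does not give the formula, and the function names are hypothetical. `logits` stands for the fully connected layer's output with one entry per expression class, and `label` is the index of the annotated expression.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, label):
    """Multi-class logistic loss: negative log-probability of the true class."""
    probs = softmax(logits)
    return -math.log(probs[label])
```

With ten expression classes and equal scores for all classes, the loss equals ln 10; it approaches zero as the score of the annotated class dominates.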

此外，在实际的CNN模型训练中，用于训练的样本图像还可以采用视频帧序列的人脸表情样本图像，视频帧序列中视频帧之间具有一定的联系，采用视频帧序列作为训练样本，更有助于训练后的CNN模型在实际检测中对连续的视频帧中人脸表情的识别。In addition, in the actual CNN model training, the sample image used for training can also use the facial expression sample image of the video frame sequence. There is a certain connection between the video frames in the video frame sequence, and the video frame sequence is used as the training sample. It is more helpful for the trained CNN model to recognize facial expressions in continuous video frames in actual detection.

Through the above process, the CNN model of the embodiment of the present invention is trained for expression recognition. Unlike conventional training, after the facial expression sample images are processed by the convolutional layer part of the CNN model, ROIs are determined on the convolutional output according to the facial key points, the ROIs are input into the ROI pooling layer for processing, and the training of the convolutional neural network model is finally determined from the output of the ROI pooling layer. Selecting the regions corresponding to facial key points as ROIs makes the training more targeted and makes full use of the detailed information on changes in multiple regions of the face, so that subtle expression changes and faces in different poses are recognized more accurately. This CNN training method, which uses the facial regions corresponding to facial key points as ROIs, can effectively capture subtle expression changes and better handle the differences caused by different facial poses, thereby improving the prediction accuracy and robustness of the CNN model. Moreover, compared with expression recognition based on traditional machine learning frameworks, a CNN with the structure of the embodiment of the present invention can, owing to its structural characteristics, be trained on large numbers of samples with high training efficiency and relatively low training cost.

本实施例的卷积神经网络模型训练方法可以由任意适当的具有数据处理能力的设备执行，包括但不限于：移动终端、PC机等。The method for training a convolutional neural network model in this embodiment may be executed by any appropriate device with data processing capabilities, including but not limited to: mobile terminals, PCs, and the like.

实施例五Embodiment five

参照图5，示出了根据本发明实施例五的一种表情识别装置的结构框图；具体包括如下模块：Referring to FIG. 5 , it shows a structural block diagram of an expression recognition device according to Embodiment 5 of the present invention; specifically, it includes the following modules:

The first determination module 502 is configured to perform facial expression feature extraction on the face image to be detected through the convolutional layer part of the convolutional neural network model and the acquired facial key points in the face image to be detected, to obtain a facial expression feature map.

第二确定模块504，用于确定所述人脸表情特征图中与各个人脸关键点分别对应的感兴趣区域ROI。The second determination module 504 is configured to determine ROIs in the facial expression feature map corresponding to each key point of the human face.

第三确定模块506，用于通过卷积神经网络模型的池化层部分对确定的各ROI进行池化处理，获得池化后的ROI特征图。The third determination module 506 is configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model, and obtain a pooled ROI feature map.

第四确定模块508，用于至少根据所述ROI特征图获取所述人脸图像的表情识别结果。The fourth determining module 508 is configured to acquire an expression recognition result of the face image at least according to the ROI feature map.

Optionally, the third determination module 506 includes: a first acquisition sub-module 5062, configured to acquire a preliminary expression recognition result of the face image of the current frame according to the ROI feature map of the face image of the current frame; and a second acquisition sub-module 5064, configured to acquire the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame.

可选地，所述第二获取子模块5064，用于将所述当前帧的人脸图像的初步人脸表情识别结果与至少一在先帧的人脸图像的人脸表情识别结果进行加权处理，获得所述当前帧的人脸图像的表情识别结果，其中，所述当前帧的人脸图像的初步表情识别结果的权重大于任一在先帧的人脸图像的表情识别结果的权重。Optionally, the second acquisition sub-module 5064 is configured to perform weighting processing on the preliminary facial expression recognition result of the face image of the current frame and the facial expression recognition result of at least one previous frame of the face image , obtaining the expression recognition result of the face image of the current frame, wherein the weight of the preliminary expression recognition result of the face image of the current frame is greater than the weight of the expression recognition result of any previous frame of the face image.
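The weighting described above can be sketched as a weighted average of per-class scores. The specific scheme here is an assumption: the current frame receives `current_weight` (more than half), and the remainder is split equally among the previous frames, which guarantees that the current frame's weight exceeds that of any previous frame. The patent states only that constraint, not the weights; `fuse_expression_scores` is a hypothetical name.

```python
def fuse_expression_scores(current, previous, current_weight=0.6):
    """Fuse the current frame's preliminary scores with previous frames' results.

    current:  list of per-class scores for the current frame.
    previous: list of per-class score lists from earlier frames (may be empty).
    """
    if not previous:
        return list(current)
    if not 0.5 < current_weight < 1.0:
        raise ValueError("current_weight must lie in (0.5, 1.0)")
    # Equal share of the remaining weight; always smaller than current_weight.
    prev_weight = (1.0 - current_weight) / len(previous)
    fused = [current_weight * s for s in current]
    for scores in previous:
        for k, s in enumerate(scores):
            fused[k] += prev_weight * s
    return fused
```

The recognized expression is then the class with the highest fused score, which damps frame-to-frame flicker while still letting the current frame dominate.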

可选地，所述装置还包括：第五确定模块510，用于确定所述当前帧在视频帧序列中的位置大于或等于设定位置阈值。Optionally, the apparatus further includes: a fifth determining module 510, configured to determine that the position of the current frame in the video frame sequence is greater than or equal to a set position threshold.

可选地，所述装置还包括：响应模块512，用于响应于所述当前帧在所述视频帧序列中的位置小于设定的位置阈值，输出所述当前帧的人脸图像的人脸表情识别结果，和/或，保存所述当前帧的人脸图像的人脸表情识别结果。Optionally, the device further includes: a response module 512, configured to output the face of the face image of the current frame in response to the position of the current frame in the sequence of video frames being smaller than a set position threshold The facial expression recognition result, and/or, saving the facial expression recognition result of the human face image of the current frame.
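The position-threshold behaviour can be sketched as a thin wrapper around any fusion rule: frames before the threshold output and save their own result, and later frames are fused with the saved results. The class name `GatedRecognizer`, the `history` length, and the `fuse` callback signature are assumptions made for illustration.

```python
from collections import deque


class GatedRecognizer:
    """Apply temporal fusion only once enough frames have been seen."""

    def __init__(self, fuse, position_threshold, history=2):
        self.fuse = fuse  # callable(current_scores, list_of_previous_scores)
        self.position_threshold = position_threshold
        self.saved = deque(maxlen=history)  # results of the most recent frames

    def step(self, position, preliminary):
        """Return the result for the frame at `position` and save it."""
        if position < self.position_threshold:
            # Too early in the sequence: output the frame's own result.
            result = list(preliminary)
        else:
            result = self.fuse(preliminary, list(self.saved))
        self.saved.append(result)
        return result
```

Keeping the saved results in a bounded `deque` means only the most recent frames influence the current one.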

可选地，所述第一确定模块502，用于对待检测的人脸图像进行人脸关键点检测，获得所述人脸图像中的人脸关键点；根据所述人脸关键点，通过所述卷积神经网络模型的卷积层部分对所述人脸图像进行人脸表情特征提取，获得人脸表情特征图。Optionally, the first determination module 502 is configured to perform face key point detection on the face image to be detected, and obtain face key points in the face image; according to the face key points, through the The convolutional layer part of the convolutional neural network model performs facial expression feature extraction on the face image to obtain a facial expression feature map.

可选地，所述第一确定模块502，用于通过所述卷积神经网络模型的卷积层部分对待检测的人脸图像进行人脸关键点提取；根据提取的人脸关键点，对待检测的人脸图像进行人脸表情特征提取，获得人脸表情特征图。Optionally, the first determination module 502 is configured to extract face key points from the face image to be detected through the convolution layer part of the convolutional neural network model; according to the extracted face key points, the face key points to be detected Extract the facial expression feature from the face image, and obtain the facial expression feature map.

可选地，所述装置还包括：训练模块514，用于获取训练用的样本图像，使用所述样本图像训练所述卷积神经网络模型，其中，所述样本图像中包含有人脸关键点的信息和人脸表情的标注信息。Optionally, the device further includes: a training module 514, configured to acquire a sample image for training, and use the sample image to train the convolutional neural network model, wherein the sample image contains key points of a human face Labeling information of information and facial expressions.

Optionally, the training module 514 includes: a first sub-module 5142, configured to acquire sample images for training and perform facial expression feature extraction on the sample images through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map; a second sub-module 5144, configured to determine the regions of interest (ROIs) in the facial expression feature map that correspond to the respective facial key points; a third sub-module 5146, configured to perform pooling on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and a fourth sub-module 5148, configured to adjust the network parameters of the convolutional neural network model at least according to the ROI feature map.

Optionally, the second sub-module 5144 is configured to determine, in the facial expression feature map, the positions corresponding to the coordinates of the respective facial key points; and, with each determined position as a reference point, to obtain the corresponding region of a set range and determine each obtained region as the corresponding ROI.

可选地，所述第三子模块5146，用于通过卷积神经网络模型的池化层部分对确定的各ROI进行池化处理，获得池化后的设定尺寸的ROI特征图；所述第四子模块5148，用于将所述设定尺寸的ROI输入损失层，获取对所述样本图像进行表情分类的表情分类结果误差；根据所述表情分类结果误差，调整所述卷积神经网络模型的网络参数。Optionally, the third sub-module 5146 is configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map of a set size; the The fourth sub-module 5148 is used to input the ROI of the set size into the loss layer, and obtain the expression classification result error of the expression classification of the sample image; adjust the convolutional neural network according to the expression classification result error The network parameters of the model.

可选地，所述第四子模块5148，用于将所述设定尺寸的ROI输入损失层，通过所述损失层的逻辑回归损失函数计算所述表情分类结果误差并输出。Optionally, the fourth sub-module 5148 is configured to input the ROI of the set size into the loss layer, calculate and output the expression classification result error through the logistic regression loss function of the loss layer.

可选地，第六确定模块516，用于对所述训练用的样本图像进行检测，获得人脸关键点的信息。Optionally, the sixth determination module 516 is configured to detect the sample images used for training to obtain information about key points of human faces.

通过本实施例的表情识别装置可执行实施例一至三中的任意一种表情识别方法，并取得该方法的有益效果，在此不作赘述。The expression recognition device of this embodiment can execute any one of the expression recognition methods in Embodiments 1 to 3, and obtain beneficial effects of the method, which will not be repeated here.

实施例六Embodiment six

参照图6，示出了根据本发明实施例六的一种卷积神经网络模型训练装置的结构框图；具体包括如下模块：Referring to FIG. 6 , it shows a structural block diagram of a convolutional neural network model training device according to Embodiment 6 of the present invention; specifically, it includes the following modules:

第一获取模块602，用于获取训练用的样本图像及对应的人脸关键点的信息，其中，所述样本图像中包含有人脸表情的标注信息。The first acquiring module 602 is configured to acquire sample images for training and information of corresponding facial key points, wherein the sample images include annotation information of human facial expressions.

第二获取模块604，用于通过卷积神经网络模型的卷积层部分对所述样本图像进行人脸表情特征提取，获得人脸表情特征图。The second acquisition module 604 is configured to extract facial expression features from the sample image through the convolutional layer part of the convolutional neural network model to obtain a facial expression feature map.

第三获取模块606，用于确定所述人脸表情特征图中与各个人脸关键点分别对应的感兴趣区域ROI。The third acquiring module 606 is configured to determine ROIs in the facial expression feature map corresponding to each key point of the human face.

第四获取模块608，用于通过卷积神经网络模型的池化层部分对确定的各ROI进行池化处理，获得池化后的ROI特征图。The fourth obtaining module 608 is configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model, and obtain a pooled ROI feature map.

第五获取模块610，用于至少根据所述ROI特征图，调整所述卷积神经网络模型的网络参数。The fifth obtaining module 610 is configured to adjust network parameters of the convolutional neural network model at least according to the ROI feature map.

可选地，所述第三获取模块606，用于在所述人脸表情特征图中，根据各个人脸关键点的坐标确定相对应的各个位置；以确定的各个位置为参考点，获取对应的各个设定范围的区域，将获取的各个区域确定为对应的ROI。Optionally, the third acquisition module 606 is configured to determine corresponding positions in the facial expression feature map according to the coordinates of each key point of the face; using the determined positions as reference points, acquire the corresponding Each region of the set range is determined as the corresponding ROI.

可选地，所述第四获取模块608，用于通过卷积神经网络模型的池化层部分对确定的各ROI进行池化处理，获得池化后的设定尺寸的ROI特征图；所述第五获取模块610，包括：第一获取子模块6102，用于将所述设定尺寸的ROI输入损失层，获取对所述样本图像进行表情分类的表情分类结果误差；调整子模块6104，用于根据所述表情分类结果误差，调整所述卷积神经网络模型的网络参数。Optionally, the fourth acquisition module 608 is configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model, and obtain a pooled ROI feature map of a set size; the The fifth acquisition module 610 includes: a first acquisition submodule 6102, configured to input the ROI of the set size into the loss layer, and acquire an expression classification result error for the expression classification of the sample image; an adjustment submodule 6104, using Adjusting the network parameters of the convolutional neural network model according to the error of the expression classification result.

Optionally, the first acquisition sub-module 6102 is configured to input the ROI of the set size into the loss layer, and to calculate and output the expression classification result error through the logistic regression loss function of the loss layer.

可选地，所述装置还包括：第六获取模块612，用于对所述训练用的样本图像进行检测，获得人脸关键点的信息。Optionally, the device further includes: a sixth acquiring module 612, configured to detect the sample images used for training to obtain information about key points of human faces.

实施例七Embodiment seven

Embodiment 7 of the present invention provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, or a server. Referring to FIG. 7, which shows a schematic structural diagram of an electronic device 700 suitable for implementing a terminal device or a server of an embodiment of the present invention: as shown in FIG. 7, the electronic device 700 includes one or more processors, communication elements, and the like. The one or more processors are, for example, one or more central processing units (CPU) 701 and/or one or more graphics processing units (GPU) 713; the processor can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage section 708 into a random access memory (RAM) 703. The communication elements include a communication component 712 and/or a communication interface 709. The communication component 712 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 709 includes the communication interface of a network interface card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.

The processor can communicate with the read-only memory 702 and/or the random access memory 703 to execute executable instructions, is connected to the communication component 712 through the communication bus 704, and communicates with other target devices through the communication component 712, thereby completing the operations corresponding to any expression recognition method provided by the embodiments of the present invention, for example: performing facial expression feature extraction on the face image to be detected through the convolutional layer part of the convolutional neural network model and the acquired facial key points in the face image to be detected, to obtain a facial expression feature map; determining the regions of interest (ROIs) in the facial expression feature map that correspond to the respective facial key points; performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and obtaining the expression recognition result of the face image at least according to the ROI feature map.

In addition, the RAM 703 can also store various programs and data required for the operation of the device. The CPU 701 or GPU 713, the ROM 702, and the RAM 703 are connected to one another through the communication bus 704. Where the RAM 703 is present, the ROM 702 is an optional module. The RAM 703 stores executable instructions, or executable instructions are written into the ROM 702 at run time, and the executable instructions cause the processor to perform the operations corresponding to the above method. An input/output (I/O) interface 705 is also connected to the communication bus 704. The communication component 712 can be integrated, or can be provided with multiple sub-modules (for example, multiple IB network cards) linked to the communication bus.

以下部件连接至I/O接口705：包括键盘、鼠标等的输入部分706；包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分707；包括硬盘等的存储部分708；以及包括诸如LAN卡、调制解调器等的网络接口卡的通信接口709。驱动器710也根据需要连接至I/O接口705。可拆卸介质711，诸如磁盘、光盘、磁光盘、半导体存储器等等，根据需要安装在驱动器710上，以便于从其上读出的计算机程序根据需要被安装入存储部分708。The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, etc.; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 708 including a hard disk, etc. and a communication interface 709 including a network interface card such as a LAN card, a modem, or the like. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, etc. is mounted on the drive 710 as necessary so that a computer program read therefrom is installed into the storage section 708 as necessary.

It should be noted that the architecture shown in FIG. 7 is only one optional implementation. In practice, the number and types of the components in FIG. 7 can be selected, removed, added, or replaced according to actual needs. Different functional components can also be provided separately or in integrated form: for example, the GPU and the CPU can be provided separately or the GPU can be integrated on the CPU, and the communication elements can be provided separately or integrated on the CPU or GPU, and so on. These alternative implementations all fall within the protection scope of the present invention.

In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program contains program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present invention, for example: performing facial expression feature extraction on the face image to be detected through the convolutional layer part of the convolutional neural network model and the acquired facial key points in the face image to be detected, to obtain a facial expression feature map; determining the regions of interest (ROIs) in the facial expression feature map that correspond to the respective facial key points; performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and obtaining the expression recognition result of the face image at least according to the ROI feature map. In such an embodiment, the computer program may be downloaded and installed from a network through the communication element, and/or installed from the removable medium 711. When the computer program is executed by the processor, the above functions defined in the methods of the embodiments of the present invention are performed.

Embodiment Eight

Embodiment Eight of the present invention provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, or a server. Reference is now made to FIG. 8, which shows a schematic structural diagram of an electronic device 800 suitable for implementing a terminal device or a server of the embodiments of the present invention. As shown in FIG. 8, the electronic device 800 includes one or more processors, a communication element, and the like. The one or more processors are, for example, one or more central processing units (CPU) 801 and/or one or more graphics processing units (GPU) 813. The processor may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 802 or executable instructions loaded from a storage section 808 into a random access memory (RAM) 803. The communication element includes a communication component 812 and/or a communication interface 809. The communication component 812 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (InfiniBand) network card. The communication interface 809 includes a communication interface of a network interface card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.

The processor may communicate with the read-only memory 802 and/or the random access memory 803 to execute the executable instructions, is connected to the communication component 812 through the communication bus 804, and communicates with other target devices via the communication component 812, thereby completing the operations corresponding to any of the convolutional neural network model training methods provided by the embodiments of the present invention, for example: acquiring a sample image for training and information on the corresponding face key points, wherein the sample image contains annotation information of the facial expression; performing facial expression feature extraction on the sample image through the convolutional layer part of a convolutional neural network model to obtain a facial expression feature map; determining the regions of interest (ROIs) in the facial expression feature map that respectively correspond to the face key points; performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map.
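
By way of non-limiting illustration, the final step above, adjusting network parameters according to the ROI feature map and the expression annotation, might be sketched as a single gradient-descent update. The sketch assumes a softmax cross-entropy loss and, for brevity, updates only a hypothetical linear classification layer applied to the flattened pooled ROI features; the embodiments do not specify the loss function or the optimizer, and a practical implementation would also back-propagate into the convolutional layers. The names `softmax`, `sgd_step`, and `lr` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(weights, bias, features, label, lr=0.1):
    """One gradient step of softmax cross-entropy on a linear layer.

    weights  : one list of coefficients per expression class
    bias     : one bias value per expression class
    features : flattened pooled ROI features of one sample image
    label    : annotated expression class index of that sample
    Updates weights and bias in place; returns the loss before the update.
    """
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to the k-th logit.
        grad = probs[k] - (1.0 if k == label else 0.0)
        for i, f in enumerate(features):
            row[i] -= lr * grad * f
        bias[k] -= lr * grad
    return loss


# Toy example: 3 expression classes, 4 pooled ROI feature values.
weights = [[0.0] * 4 for _ in range(3)]
bias = [0.0] * 3
features = [1.0, 0.5, -0.5, 2.0]
first_loss = sgd_step(weights, bias, features, label=1)
second_loss = sgd_step(weights, bias, features, label=1)
print(second_loss < first_loss)  # True
```

Repeating such updates over the annotated sample images is one way the network parameters could be adjusted until a training stop condition chosen by the implementer is met.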

In addition, the RAM 803 may also store various programs and data required for the operation of the device. The CPU 801 or GPU 813, the ROM 802, and the RAM 803 are connected to one another through the communication bus 804. Where the RAM 803 is present, the ROM 802 is an optional module. The RAM 803 stores executable instructions, or writes executable instructions into the ROM 802 at runtime, and the executable instructions cause the processor to perform the operations corresponding to the above-described communication method. An input/output (I/O) interface 805 is also connected to the communication bus 804. The communication component 812 may be integrated, or may be provided with multiple sub-modules (for example, multiple IB network cards) linked on the communication bus.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a speaker or the like; a storage section 808 including a hard disk or the like; and a communication interface 809 including a network interface card such as a LAN card or a modem. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

It should be noted that the architecture shown in FIG. 8 is only one optional implementation. In practice, the number and types of the components in FIG. 8 may be selected, removed, added, or replaced according to actual needs. Different functional components may also be arranged separately or in an integrated manner: for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU; the communication element may be arranged separately, or may be integrated on the CPU or the GPU; and so on. These alternative implementations all fall within the protection scope of the present invention.

In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as a computer software program. For example, an embodiment of the present invention includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium. The computer program contains program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present invention, for example: acquiring a sample image for training and information on the corresponding face key points, wherein the sample image contains annotation information of the facial expression; performing facial expression feature extraction on the sample image through the convolutional layer part of a convolutional neural network model to obtain a facial expression feature map; determining the regions of interest (ROIs) in the facial expression feature map that respectively correspond to the face key points; performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map. In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 811. When the computer program is executed by the processor, the above-described functions defined in the method of the embodiments of the present invention are performed.

It should be pointed out that, depending on implementation needs, each component/step described in the embodiments of the present invention may be split into more components/steps, and two or more components/steps, or partial operations of components/steps, may be combined into new components/steps, so as to achieve the purpose of the embodiments of the present invention.

The above methods according to the embodiments of the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored on a remote recording medium or a non-transitory machine-readable medium, is downloaded over a network, and is to be stored on a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It will be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, or flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the processing methods described herein are implemented. Furthermore, when a general-purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing that processing.

Those of ordinary skill in the art will appreciate that the units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of the embodiments of the present invention.

The above implementations are only intended to illustrate the embodiments of the present invention and not to limit them. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention; all equivalent technical solutions therefore also fall within the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention shall be defined by the claims.

Claims

1. An expression recognition method, characterized by comprising:

performing facial expression feature extraction on a face image to be detected, through the convolutional layer part of a convolutional neural network model and acquired face key points in the face image to be detected, to obtain a facial expression feature map;

determining regions of interest (ROIs) in the facial expression feature map that respectively correspond to the face key points;

performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and

obtaining an expression recognition result of the face image at least according to the ROI feature map.

2. The method according to claim 1, characterized in that the face image comprises a static face image.

3. The method according to claim 1, characterized in that the face image comprises a face image in a sequence of video frames.

4. The method according to claim 3, characterized in that obtaining the expression recognition result of the face image at least according to the ROI feature map comprises:

obtaining a preliminary expression recognition result of the face image of the current frame according to the ROI feature map of the face image of the current frame; and

obtaining the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame.

5. The method according to claim 4, characterized in that obtaining the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one previous frame comprises:

performing weighting processing on the preliminary expression recognition result of the face image of the current frame and the expression recognition result of the face image of at least one previous frame to obtain the expression recognition result of the face image of the current frame, wherein the weight of the preliminary expression recognition result of the face image of the current frame is greater than the weight of the expression recognition result of the face image of any previous frame.

6. A convolutional neural network model training method, characterized by comprising:

acquiring a sample image for training and information on the corresponding face key points, wherein the sample image contains annotation information of the facial expression;

performing facial expression feature extraction on the sample image through the convolutional layer part of a convolutional neural network model to obtain a facial expression feature map;

determining regions of interest (ROIs) in the facial expression feature map that respectively correspond to the face key points;

performing pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map;

adjusting the network parameters of the convolutional neural network model at least according to the ROI feature map.

7. An expression recognition apparatus, characterized by comprising:

a first determining module, configured to perform facial expression feature extraction on a face image to be detected, through the convolutional layer part of a convolutional neural network model and acquired face key points in the face image to be detected, to obtain a facial expression feature map;

a second determining module, configured to determine regions of interest (ROIs) in the facial expression feature map that respectively correspond to the face key points;

a third determining module, configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and

a fourth determining module, configured to obtain an expression recognition result of the face image at least according to the ROI feature map.

8. A convolutional neural network model training apparatus, characterized by comprising:

a first acquisition module, configured to acquire a sample image for training and information on the corresponding face key points, wherein the sample image contains annotation information of the facial expression;

a second acquisition module, configured to perform facial expression feature extraction on the sample image through the convolutional layer part of a convolutional neural network model to obtain a facial expression feature map;

a third acquisition module, configured to determine regions of interest (ROIs) in the facial expression feature map that respectively correspond to the face key points;

a fourth acquisition module, configured to perform pooling processing on each determined ROI through the pooling layer part of the convolutional neural network model to obtain a pooled ROI feature map; and

a fifth acquisition module, configured to adjust the network parameters of the convolutional neural network model at least according to the ROI feature map.

9. An electronic device, characterized by comprising: a processor, a memory, a communication element, and a communication bus, wherein the processor, the memory, and the communication element communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the expression recognition method according to any one of claims 1-5.

10. An electronic device, characterized by comprising: a processor, a memory, a communication element, and a communication bus, wherein the processor, the memory, and the communication element communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the convolutional neural network model training method according to claim 6.