CN111626105B

CN111626105B - Attitude estimation method, device and electronic equipment

Info

Publication number: CN111626105B
Application number: CN202010297991.XA
Authority: CN
Inventors: 魏秀参
Original assignee: Xuzhou Kuangshi Data Technology Co ltd; Nanjing Kuangyun Technology Co ltd; Beijing Megvii Technology Co Ltd
Current assignee: Wuhan Kuangshi Jinzhi Technology Co ltd; Beijing Megvii Technology Co Ltd
Priority date: 2020-04-15
Filing date: 2020-04-15
Publication date: 2024-02-20
Anticipated expiration: 2040-04-15
Also published as: CN111626105A

Abstract

The invention provides a posture estimation method, a posture estimation device, and electronic equipment. First, an image to be processed containing a target object is acquired. The image is input into a key point recognition model, which outputs key point information of the target object. The positions of the key points of the target object are then determined from the key point information, and the posture of the target object is determined from those positions and the connection relationships between the key points. In the embodiment of the present invention, the key point recognition model is established in advance and trained by taking into account the connection relationships between the key points in the limb structure. When key points are identified with this model, their positions can be adjusted and inferred according to the relationships between the key points, which improves the estimation accuracy of the key point positions and thereby the estimation accuracy of the human posture.

Description

Posture estimation method and device, and electronic equipment

Technical Field

The present invention relates to the field of computer vision, and in particular to a posture estimation method and apparatus, and an electronic device.

Background

Human body posture estimation is an important research direction in computer vision and a key problem in somatosensory technology. It is widely applied in fields such as human activity analysis, intelligent video surveillance, and advanced human-computer interaction. Human body posture estimation technology uses a computer to automatically detect the human body in an image containing it, which includes locating the joint points of the human body.

At present, existing human body posture estimation methods locate each joint point of the human body independently. Joint point positions estimated in this way have low accuracy, so the accuracy of human body posture estimation is also low.

Disclosure of Invention

Accordingly, the present invention is directed to a method and apparatus for estimating a posture, and an electronic device, which can improve the accuracy of estimating a position of a key point in human posture estimation, thereby improving the accuracy of estimating a human posture.

In a first aspect, an embodiment of the present invention provides a posture estimation method, including: acquiring an image to be processed containing a target object; inputting the image to be processed into a key point identification model, and outputting key point information of the target object, where the key point identification model is built based on a preset limb structure, the limb structure comprises designated key points in the limbs and connection relations among the designated key points, and the key point information includes the probability that each key point of the target object is located at each pixel point in the image; determining the position of each key point of the target object according to the key point information of the target object; and determining the posture of the target object according to the positions of the key points of the target object and the connection relations between the key points.

In a preferred embodiment of the present invention, the above-mentioned key point identification model is trained by: instantiating a neural network model according to a preset limb structure; inputting a current training picture into the instantiated neural network model, and outputting key point information of the object of interest in the current training picture, where the key point information includes the probability that each key point of the object of interest is located at each pixel point in the picture; determining a loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture; and iteratively updating the parameters of the neural network model according to the loss value to obtain the key point identification model.

In a preferred embodiment of the present invention, the neural network model includes a fully convolutional neural network module and a graph convolutional neural network module; the step of inputting the current training picture into the instantiated neural network model and outputting the key point information of the object of interest in the current training picture includes: inputting the current training picture into the fully convolutional neural network module, and outputting a convolution feature map of the key points of the object of interest in the current training picture; and inputting the convolution feature map of the key points of the object of interest into the graph convolutional neural network module, and outputting the key point information of the object of interest in the current training picture.

In a preferred embodiment of the present invention, the above-mentioned graph convolutional neural network module includes a first graph convolutional neural network unit and a second graph convolutional neural network unit; the first graph convolutional neural network unit is established based on the correlation of the key points of the object of interest within a local receptive field; the second graph convolutional neural network unit is established based on the correlation between the key points of the object of interest; the step of inputting the convolution feature map of the key points of the object of interest into the graph convolutional neural network module and outputting the key point information of the object of interest in the current training picture includes: inputting the convolution feature map of the key points of the object of interest into the first graph convolutional neural network unit and the second graph convolutional neural network unit respectively, and correspondingly outputting a first convolution feature map and a second convolution feature map of the key points; performing 1 × 1 convolution processing on the first convolution feature map and the second convolution feature map respectively to obtain first key point information and second key point information of the key points; and outputting the key point information of the object of interest in the current training picture according to the first key point information and the second key point information.

In a preferred embodiment of the present invention, the step of outputting the key point information of the object of interest in the current training picture according to the first key point information and the second key point information includes: performing feature fusion on the first key point information and the second key point information to obtain third key point information of the key point; and outputting the first key point information, the second key point information, and the third key point information of the key point.

In a preferred embodiment of the present invention, the network structure of the first graph convolutional neural network unit is constructed according to a preset formula, in which: the input is the convolution feature map of key point u in layer 1 of the first graph convolutional neural network unit; the output is the hidden feature of key point u; the input feature map is divided along the channel direction into K parts; $att_{u,v}$ denotes a convolution parameter; $*$ denotes the convolution operation; $N_u$ denotes the set of key points adjacent to key point u; Concate(·) denotes concatenation of feature maps along the channel direction; $\sigma$ denotes the ReLU activation function; and the remaining operator denotes a 3 × 3 convolutional layer.

In a preferred embodiment of the present invention, the network structure of the second graph convolutional neural network unit is constructed according to a preset formula, in which: the input is the input feature of key point u in the first layer of the second graph convolutional neural network unit; the output is the hidden feature of key point u; $N_u$ denotes the set of nodes adjacent to node u; the two remaining operators each denote a 3 × 3 convolutional layer; and $\beta_{u,v} \in \mathbb{R}^{HW \times HW}$ denotes the attention of key point u to key point v in the second graph convolutional neural network unit.

In a preferred embodiment of the present invention, the step of performing feature fusion on the first key point information and the second key point information to obtain third key point information of the key point includes: performing feature fusion on the first key point information and the second key point information according to a preset feature fusion formula to obtain the third key point information of the key point. In the feature fusion formula, $P_u$ denotes the third key point information of any key point u, and the two inputs are the first key point information and the second key point information of key point u.

In a preferred embodiment of the present invention, the step of determining the loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture includes: calculating a real heat map of the object of interest according to the labeling values of the key points of the object of interest in the current training picture, where the real heat map includes the probability that each key point of the object of interest is located at each pixel point in the current training picture; calculating the squared error between the real heat map and the key point information; and determining the loss value corresponding to the neural network model according to the squared error.

In a preferred embodiment of the present invention, the loss value corresponding to the neural network model is determined from the squared error according to a preset calculation formula, in which: $l_m$ denotes the loss value corresponding to the neural network model; the formula refers to the limb structure; its terms include the first key point information of key point u, the second key point information of key point u, and $P_u$, the third key point information of key point u; $G_u$ denotes the real heat map of key point u; and $\|\cdot\|_2$ denotes the squared error.
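The heat-map loss described above can be illustrated with a minimal sketch. Several details here are assumptions rather than statements from this document: the real heat map is modeled as a normalized Gaussian centered on the labeled pixel, the three squared-error terms (first, second, and third key point information) are summed with equal weight, and the function names `gaussian_heatmap`, `squared_error`, and `total_loss` are hypothetical.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Real heat map for one key point labeled at pixel (cx, cy).

    Assumption: a Gaussian bump normalized so all pixels sum to 1.
    """
    raw = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x in range(width)] for y in range(height)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

def squared_error(pred, target):
    """Sum of squared per-pixel differences between two heat maps."""
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row))

def total_loss(first, second, fused, truth):
    """Sum the squared errors of the three outputs over all key points.

    Each argument maps a key point name to its heat map.
    Assumption: the three terms are weighted equally.
    """
    return sum(squared_error(first[u], truth[u])
               + squared_error(second[u], truth[u])
               + squared_error(fused[u], truth[u])
               for u in truth)

truth = {"wrist": gaussian_heatmap(5, 5, 2, 2)}
uniform = {"wrist": [[1 / 25] * 5 for _ in range(5)]}
perfect_loss = total_loss(truth, truth, truth, truth)
imperfect_loss = total_loss(uniform, truth, truth, truth)
```

A prediction identical to the real heat map gives a loss of zero, and any deviation in one of the three outputs raises the loss.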

In a preferred embodiment of the present invention, the step of determining the location of the keypoint of the target object according to the keypoint information of the target object includes: and for each key point of the target object, determining the coordinates of the pixel point corresponding to the maximum probability in the key point information corresponding to the key point as the position of the key point.

In a second aspect, an embodiment of the present invention further provides a posture estimation apparatus, including: a to-be-processed image acquisition module, configured to acquire an image to be processed containing a target object; a key point information output module, configured to input the image to be processed into a key point identification model and output key point information of the target object, where the key point identification model is built based on a preset limb structure, the limb structure comprises designated key points in the limbs and connection relations among the key points, and the key point information includes the probability that each key point of the target object is located at each pixel point in the image; a key point position determining module, configured to determine the position of each key point of the target object according to the key point information of the target object; and a target object posture determining module, configured to determine the posture of the target object according to the positions of the key points of the target object and the connection relations between the key points.

In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes a processor and a memory, the memory stores computer-executable instructions executable by the processor, and the processor executes the computer-executable instructions to implement the above-mentioned posture estimation method.

In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described posture estimation method.

The embodiment of the invention has the following beneficial effects:

The embodiment of the invention provides a posture estimation method, a posture estimation device, and electronic equipment. An image to be processed containing a target object is first acquired. The image to be processed is input into a key point identification model, which outputs key point information of the target object; the key point identification model is built based on a preset limb structure, where the limb structure comprises designated key points in the limbs and connection relations among the designated key points, and the key point information includes the probability that each key point of the target object is located at each pixel point in the image. The position of each key point of the target object is then determined according to the key point information, and the posture of the target object is determined according to the positions of the key points and the connection relations between them. In this method, the key point identification model is established in advance and trained by combining the designated key points in the limb structure with the connection relations between them, so that when key points are identified with the model, their positions can be adjusted and inferred according to the interrelationships between the key points.

Additional features and advantages of the disclosure will be set forth in the description which follows, or in part will be obvious from the description, or may be learned by practice of the techniques of the disclosure.

The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic structural diagram of an electronic system according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a posture estimation method according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of training a key point recognition model in a posture estimation method according to an embodiment of the present invention;

Fig. 4a and Fig. 4b are schematic diagrams of a graph convolutional neural network performing network updating through a local spatial attention mechanism according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a graph convolutional neural network performing network updating through a global spatial attention mechanism according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a process for identifying key points of a human body through a key point identification model according to an embodiment of the present invention;

Fig. 7 is a schematic diagram illustrating the working process of a graph convolutional neural network module in a key point recognition model according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of the effect of identifying key points of a human body through a key point identification model according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a posture estimation device according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

To address the low accuracy of joint point estimation in existing human body posture estimation methods, embodiments of the present invention provide a posture estimation method, a posture estimation device, and electronic equipment. The technique can be applied to scenarios in which key point positioning or posture estimation is performed on people, animals, or other movable objects (such as robots, virtual characters, and mechanical arms). To facilitate understanding of this embodiment, a posture estimation method disclosed in the embodiment of the present invention is first described in detail.

An example electronic system 100 for implementing the posture estimation method, apparatus, and electronic device of the embodiments of the present invention is described herein with reference to Fig. 1.

As shown in Fig. 1, an electronic system 100 includes one or more processing devices 102, one or more storage devices 104, an input device 106, an output device 108, and one or more image capture devices 110, interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and configuration of the electronic system 100 shown in Fig. 1 are exemplary only and not limiting; the electronic system may have other components and configurations as desired.

The processing device 102 may be a smart terminal or a device that includes a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, may process data from other components in the electronic system 100, and may also control other components in the electronic system 100 to perform targeted object statistics functions.

The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processing device 102 may execute the program instructions to implement the client functions (implemented by the processing device) and/or other desired functions in the embodiments of the present invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.

The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.

The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.

The image capture device 110 may capture training pictures and store the captured preview video frames or image data in the storage device 104 for use by other components.

For example, the devices in the example electronic system for implementing the posture estimation method, apparatus, and electronic device according to the embodiments of the present invention may be integrated or distributed. For instance, the processing device 102, the storage device 104, the input device 106, and the output device 108 may be integrated, while the image capture device 110 is placed at a designated position where pictures can be captured. When the devices in the above electronic system are integrated, the electronic system may be implemented as an intelligent terminal such as a camera, a smartphone, a tablet computer, or a vehicle-mounted terminal.

Fig. 2 shows a flow chart of a posture estimation method according to an embodiment of the present invention. As can be seen from Fig. 2, the method includes the following steps:

Step S102: acquiring an image to be processed that contains a target object.

Here, the target object may be a person, an animal, or another movable object, such as a robot, a forklift, a robot arm, or a virtual character. The image to be processed may show all or part of the target object. For example, if the target object is a person A, the image to be processed may include only the upper body of person A, with the lower body occluded.

Step S104: inputting the image to be processed into a key point identification model, and outputting key point information of the target object. The key point identification model is built based on a preset limb structure, where the limb structure comprises designated key points in the limbs and connection relations among the designated key points. The key point information includes the probability that each key point of the target object is located at each pixel point in the image.

The key point identification model can be obtained in advance by training a neural network, and the network structure of the key point identification model is built based on a preset limb structure. Here, the limb structure may be the body structure of a human, the limb structure of an animal, the mechanical structure of a robot, or the like.

The limb structure includes the designated key points in the limb and the connection relations among the designated key points. The key points may be joint points; taking the human body structure as an example, they may be the shoulder joints, neck joint, knee joints, elbow joints, and so on. A key point can also be an important part of the limb structure; taking the human body structure as an example, the whole head can be treated as one key point. In other application scenarios, the key points of the limb structure can be set flexibly according to actual requirements, and no limitation is imposed here.

In addition, the key points in the limb structure are related to one another, and the relations differ from pair to pair. For example, some key points are directly connected and some are indirectly connected, the key points lie at different distances from one another, and in an actual activity scene the key points influence one another.

For example, taking the human body structure as an example, suppose each main joint point of the human body is taken as a key point of the human body structure and the following 7 key points are included: the neck, shoulder, elbow, wrist, hip, knee, and ankle joints. Based on the natural structure of the human body, the neck joint is directly connected to the shoulder joint and indirectly connected to the elbow joint (via the shoulder joint). In addition, the distance between the elbow joint and the shoulder joint differs from the distance between the elbow joint and the hip joint, and in actual human movement the mutual influence within these two pairs of joints also differs. For example, when a seated person stretches the arms without moving the waist, the movements of the shoulder joint and the elbow joint influence each other, while the hip joint can remain still.
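The direct and indirect connections in this seven-joint example can be sketched as an undirected graph. This is an illustrative sketch only: the edge list, in particular the neck-to-hip edge, is an assumption, and the names `build_adjacency` and `hop_distance` are hypothetical.

```python
from collections import deque

# Seven key points from the example (one side of the body).
KEYPOINTS = ["neck", "shoulder", "elbow", "wrist", "hip", "knee", "ankle"]

# Assumed connection relations; the neck-hip edge is an illustrative choice.
EDGES = [("neck", "shoulder"), ("shoulder", "elbow"), ("elbow", "wrist"),
         ("neck", "hip"), ("hip", "knee"), ("knee", "ankle")]

def build_adjacency(keypoints, edges):
    """Map each key point to the set of key points directly connected to it."""
    adjacency = {name: set() for name in keypoints}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def hop_distance(adjacency, start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

adjacency = build_adjacency(KEYPOINTS, EDGES)
neck_to_shoulder = hop_distance(adjacency, "neck", "shoulder")  # direct
neck_to_elbow = hop_distance(adjacency, "neck", "elbow")        # via shoulder
elbow_to_hip = hop_distance(adjacency, "elbow", "hip")          # farther apart
```

A hop distance of 1 corresponds to a direct connection, and larger values correspond to indirect connections through intermediate joints.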

In this embodiment, the key point recognition model is constructed based on the connection relations between the key points of the limb structure. When identifying key points, the model therefore does not locate each key point independently. Instead, it takes the internal relations between the key points into account and estimates their positions from the perspective of the limb structure as a whole, so that the estimated positions conform to the connection relations between the key points.

For example, suppose the elbow joint of a person in an image is clearly visible and the wrist joint is occluded. A conventional method locates the wrist joint independently, and the occlusion may make the wrist joint estimate inaccurate. The key point recognition model in this embodiment can additionally combine the estimated position of the elbow joint with the connection relation between the elbow joint and the wrist joint to refine and infer the position of the wrist joint, which alleviates the loss of accuracy caused by the occluded wrist joint.

It can be seen that when the human body posture is estimated with the key point recognition model in this embodiment, posture estimation is treated as a multi-task problem in which locating the different joint points corresponds to different subtasks, and the subtasks are correlated to different degrees. Because the joints of the human body are connected to one another, the key point recognition model of this embodiment represents these connection relations (that is, the human body structure) as correlations between the subtasks. The position estimates of the joint points are therefore not independent; they influence, correct, and adjust one another, which improves the accuracy of the joint point position estimates.

When the image to be processed is input into the key point identification model, the model outputs key point information of the target object in the image. The key point information includes the probability that each key point of the target object is located at each pixel point in the image, and it may be output as a data table, an image, a text document, or the like. In at least one possible implementation, the key point information is output in the form of a heat map. The heat map shows the probability distribution of a key point's estimated position over the pixel points in the image and represents the probabilities by color differences, which makes it more intuitive.

Step S106: determining the position of each key point of the target object according to the key point information of the target object.

In one possible implementation manner, for each key point of the target object, the coordinates of the pixel point corresponding to the maximum probability in the key point information corresponding to the key point are determined as the position of the key point.
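The maximum-probability rule can be sketched as follows, assuming each key point's information is a 2D grid of probabilities indexed as `heatmap[y][x]`. The function name `locate_keypoint`, the sample values, and the (x, y) coordinate convention are assumptions made for illustration.

```python
def locate_keypoint(heatmap):
    """Return the (x, y) coordinates of the pixel with the highest probability."""
    best_value, best_xy = float("-inf"), None
    for y, row in enumerate(heatmap):
        for x, probability in enumerate(row):
            if probability > best_value:
                best_value, best_xy = probability, (x, y)
    return best_xy

# Hypothetical 3 x 3 heat maps for two key points.
heatmaps = {
    "elbow": [[0.0, 0.1, 0.0],
              [0.1, 0.6, 0.1],
              [0.0, 0.1, 0.0]],
    "wrist": [[0.0, 0.0, 0.0],
              [0.0, 0.1, 0.2],
              [0.0, 0.2, 0.5]],
}

# One position per key point, taken at its most probable pixel.
positions = {name: locate_keypoint(h) for name, h in heatmaps.items()}
```

When several pixels share the maximum probability, this sketch keeps the first one in row-major order; the document does not specify a tie-breaking rule.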

Step S108: determining the posture of the target object according to the positions of the key points of the target object and the connection relations between the key points.

The key points of the target object determined in the previous step are connected according to the connection relations between the key points, which yields the posture of the target object.
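Connecting located key points into a posture can be sketched as building one line segment per connection relation. The function name `assemble_pose` and the sample coordinates are hypothetical, and skipping connections whose endpoints were not located is an assumption, not something the document specifies.

```python
def assemble_pose(positions, edges):
    """Return one line segment per connection whose two endpoints were located."""
    return [(positions[a], positions[b])
            for a, b in edges
            if a in positions and b in positions]

# Hypothetical key point positions (x, y); the wrist was not located.
positions = {"neck": (5, 1), "shoulder": (3, 2), "elbow": (2, 4)}
edges = [("neck", "shoulder"), ("shoulder", "elbow"), ("elbow", "wrist")]

pose = assemble_pose(positions, edges)
```

The resulting list of segments is a simple stick-figure representation that could be drawn over the image or passed to later processing.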

The embodiment of the invention provides a posture estimation method. An image to be processed containing a target object is first acquired. The image to be processed is input into a key point identification model, which outputs key point information of the target object; the key point identification model is built based on a preset limb structure, where the limb structure comprises designated key points in the limbs and connection relations among the designated key points, and the key point information includes the probability that each key point of the target object is located at each pixel point in the image. The position of each key point of the target object is then determined according to the key point information, and the posture of the target object is determined according to the positions of the key points and the connection relations between them. In this method, the key point identification model is established in advance and trained by combining the designated key points in the limb structure with the connection relations between them, so that when key points are identified with the model, their positions can be adjusted and inferred according to the interrelationships between the key points.

On the basis of the posture estimation method shown in Fig. 2, this embodiment also provides another posture estimation method, which mainly describes how the key point recognition model in the foregoing embodiment is trained. Fig. 3 is a schematic flow chart of training the key point recognition model in the posture estimation method. As can be seen from Fig. 3, the training process includes the following steps:

Step S202: instantiating a neural network model according to the preset limb structure.

Here, the limb structure includes designated key points in the limb and the connection relations between the designated key points. Instantiation is the object-oriented programming term for creating an object from a class, that is, turning an abstract class into a concrete object.
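Instantiating a model from a limb structure can be illustrated with a small class whose constructor receives the key points and their connection relations. The class `KeypointModelSkeleton` is a hypothetical stand-in that stores structure only; it has no weights and is not the neural network described in this document.

```python
class KeypointModelSkeleton:
    """Hypothetical structural stand-in for a key point model (no weights)."""

    def __init__(self, keypoints, edges):
        # One output heat map per designated key point.
        self.keypoints = list(keypoints)
        self.index = {name: i for i, name in enumerate(self.keypoints)}
        # Neighbor lists encode the connection relations of the limb structure.
        self.neighbors = {name: [] for name in self.keypoints}
        for a, b in edges:
            self.neighbors[a].append(b)
            self.neighbors[b].append(a)

    @property
    def num_heatmaps(self):
        """Number of heat maps the model would output."""
        return len(self.keypoints)

# Instantiation: the abstract class becomes a concrete object for one limb structure.
model = KeypointModelSkeleton(
    keypoints=["neck", "shoulder", "elbow", "wrist"],
    edges=[("neck", "shoulder"), ("shoulder", "elbow"), ("elbow", "wrist")],
)
```

Passing a different limb structure, for example that of an animal or a robot arm, would produce a differently shaped object from the same class.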

Step S204: inputting a current training picture into the instantiated neural network model, and outputting key point information of the object of interest in the current training picture. The key point information includes the probability that each key point of the object of interest is located at each pixel point in the picture.

In actual operation, the current training picture can be determined based on a preset training set; in one possible implementation manner, the current training picture is pre-labeled with a labeling frame of the object of interest and key points of the object of interest.

The attention object is taken as a human body to explain, and the determined current training picture is pre-marked with a human body annotation frame and specified key points of the human body, for example, the specified key points can be joint points of the human body. In one possible embodiment, only the joint points displayed in the current training picture are marked, for example, if only the head and neck of the target object is displayed in the picture and the other parts are blocked, only the neck joint positions are marked.

For each current training picture, the following training operations are performed: inputting the current training picture into the instantiated neural network model, and outputting key point information of the object of interest in the current training picture, where the key point information includes the probability that each key point of the object of interest is located at each pixel point in the picture; determining a loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture; and adjusting parameters of the neural network model according to the loss value.

In at least one possible implementation, the neural network model includes a full convolution neural network module and a graph convolution neural network module; the step of inputting the current training picture into the instantiated neural network model and outputting the key point information of the object of interest in the current training picture can be implemented by the following steps 21-22:

(21) Inputting the current training picture into the full convolution neural network module, and outputting a convolution feature map of the key points of the object of interest in the current training picture.

(22) Inputting the convolution feature map of the key points of the object of interest into the graph convolution neural network module, and outputting the key point information of the object of interest in the current training picture.

In one possible implementation manner, the graph convolution neural network module includes a first graph convolution neural network unit and a second graph convolution neural network unit. The first graph convolution neural network unit is established based on the correlation of the key points of the object of interest within the local receptive field; for example, through a local spatial attention mechanism, the first graph convolution neural network unit can focus on the feature correlation of the key points within the local receptive field. The second graph convolution neural network unit is established based on the correlation between the key points of the object of interest; for example, through a global spatial attention mechanism, the second graph convolution neural network unit can focus on the information interaction of the key points across the global pixel points.

Here, the step of inputting the convolution feature map of the key point of the object of interest into the graph convolution neural network module and outputting the key point information of the object of interest in the current training picture may be implemented by the following steps 31-33:

(31) Inputting the convolution feature map of the key points of the object of interest into the first graph convolution neural network unit and the second graph convolution neural network unit respectively, and correspondingly outputting a first convolution feature map and a second convolution feature map of the key points.

In this embodiment, the network structure of the first graph convolution neural network unit is constructed according to the following formula:

[Formula 1: equations not reproduced in the source text]

where, in formula 1, the symbols denote, in order: the input convolution feature map of key point $u$ in the $l$-th layer network of the first graph convolution neural network unit; the hidden feature of key point $u$; the $K$ parts into which a feature map is divided along the channel direction, and the feature of each part; $\mathrm{att}_{u,v}$, a convolution parameter; $*$, the convolution operation; $N_u$, the set of key points adjacent to key point $u$; $\mathrm{concate}(\cdot)$, the concatenation of feature maps along the channel direction; $\sigma$, the ReLU activation function; and a 3 × 3 convolutional layer.

To make the network structure of the first graph convolution neural network unit clearer, refer to fig. 4a and fig. 4b, which are schematic diagrams of the graph convolution neural network updating through a local spatial attention mechanism. Both figures show how the feature of a key point u in the graph convolution neural network unit is updated from the l-th layer network to the (l+1)-th layer network: through the local attention mechanism, the key point u fuses the features of its adjacent key points v and exchanges information with the local pixel points where the key point u is located. Specifically, fig. 4a shows the update of a single feature of the key point u, and fig. 4b shows the simultaneous update of two features of the key point u.

In addition, the network structure of the second graph convolution neural network unit is constructed according to the following formula:

[Formula 2: equations not reproduced in the source text]

where, in formula 2, the symbols denote, in order: the input feature of key point $u$ in the $l$-th layer network of the second graph convolution neural network unit; the hidden feature of key point $u$; $N_u$, the set of nodes adjacent to node $u$; two 3 × 3 convolutional layers; and $\beta_{u,v} \in \mathbb{R}^{HW \times HW}$, the attention map of key point $u$ with respect to key point $v$ in the second graph convolution neural network unit.

Here, fig. 5 is a schematic diagram of the graph convolution neural network updating through a global spatial attention mechanism. An attention map of size HW × HW is obtained through the global spatial attention mechanism, so that, when the feature of key point u is updated from the l-th layer network to the (l+1)-th layer network, it fuses the features of its adjacent key points v and exchanges information with the global pixel points.
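To make the HW × HW shape of the attention map concrete, the following is a minimal sketch in plain Python. It assumes dot-product scores followed by a row-wise softmax, which is a common choice but is not stated in the source; the function names `global_attention` and `aggregate` and the toy features are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(h_u, h_v):
    """Return an HW x HW attention map from the pixels of u to the pixels of v.

    h_u and h_v are lists of HW pixel feature vectors, each of length C.
    ASSUMPTION: scores are dot products followed by a row-wise softmax.
    """
    scores = [[sum(a * b for a, b in zip(pu, pv)) for pv in h_v] for pu in h_u]
    return [softmax(row) for row in scores]


def aggregate(beta, h_v):
    """Mix the pixel features of v into each pixel of u using the attention map."""
    channels = len(h_v[0])
    return [
        [sum(w * pv[c] for w, pv in zip(row, h_v)) for c in range(channels)]
        for row in beta
    ]


# Toy features: H = W = 2 (so HW = 4) and C = 2 channels.
h_u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
h_v = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]]
beta_uv = global_attention(h_u, h_v)
message = aggregate(beta_uv, h_v)
```

Because every pixel of u attends to every pixel of v, the map has HW rows and HW columns, which is why its cost grows quadratically with the feature map resolution.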

(32) Performing 1 × 1 convolution processing on the first convolution feature map and the second convolution feature map respectively, to correspondingly obtain first key point information and second key point information of the key points.

Here, by the 1×1 convolution process, two-dimensional key point information is obtained from the three-dimensional convolution feature map.

(33) Outputting the key point information of the object of interest in the current training picture according to the first key point information and the second key point information.

Here, feature fusion may be performed on the first key point information and the second key point information to obtain third key point information of the key point; then, first, second, and third key point information of the key point are output.

In at least one possible implementation manner, feature fusion can be performed on the first key point information and the second key point information according to a preset feature fusion formula to obtain third key point information of the key point; the feature fusion formula is as follows:

[Feature fusion formula: equation not reproduced in the source text]

where $P_u$ denotes the third key point information of an arbitrary key point $u$, and the other two symbols denote the first key point information and the second key point information of the key point $u$, respectively.

The first, second and third key point information each include the probability that the key point of the object of interest is located at each pixel point in the picture.
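The 1 × 1 convolution of step (32), which reduces a three-dimensional convolution feature map to two-dimensional key point information, can be sketched as a per-pixel weighted sum over channels. This is a minimal plain-Python illustration; the weights and the toy feature map are hypothetical, and a trained model would learn the weights.

```python
def conv1x1(feature_map, weights, bias=0.0):
    """Collapse a C x H x W feature map to an H x W map with a 1x1 convolution.

    A 1x1 convolution is a per-pixel weighted sum over the channel axis.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [
            sum(weights[c] * feature_map[c][y][x] for c in range(channels)) + bias
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy input: C = 2 channels, H = 2, W = 3. The weights are hypothetical.
features = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]],
]
heatmap = conv1x1(features, weights=[1.0, 2.0])
```

The spatial size is unchanged and only the channel axis is collapsed, so each output value still corresponds to one pixel position.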

Step S206: determining a loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture.

In actual operation, the step of determining the loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture may be implemented by the following steps 41-43:

(41) Calculating the real heat map of the object of interest according to the labeled positions of the key points of the object of interest in the current training picture; the real heat map includes the probability that each key point of the object of interest is located at each pixel point in the current training picture;

(42) Calculating the square error between the real heat map and the key point information;

(43) Determining the loss value corresponding to the neural network model according to the square error.
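Steps 41-42 can be sketched as follows for a single key point. The sketch assumes that the real heat map is an unnormalized two-dimensional Gaussian centered on the labeled position, a common convention that the source does not specify; the function names and the `sigma` parameter are hypothetical.

```python
import math


def gaussian_heatmap(height, width, center_x, center_y, sigma=1.0):
    """Build a real (ground-truth) heat map peaking at the labeled position.

    ASSUMPTION: an unnormalized 2-D Gaussian; the source only says that the
    map holds a per-pixel probability for the key point.
    """
    return [
        [
            math.exp(-((x - center_x) ** 2 + (y - center_y) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def squared_error(predicted, target):
    """Sum of squared per-pixel differences between two H x W maps."""
    return sum(
        (p - t) ** 2
        for row_p, row_t in zip(predicted, target)
        for p, t in zip(row_p, row_t)
    )


real = gaussian_heatmap(height=5, width=5, center_x=2, center_y=3)
flat_prediction = [[0.0] * 5 for _ in range(5)]
error_perfect = squared_error(real, real)            # prediction equals target
error_flat = squared_error(flat_prediction, real)    # prediction is all zeros
```

A prediction that matches the real heat map exactly gives zero error, and the error grows as the predicted probabilities move away from the labeled position.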

In one possible implementation manner, the loss value corresponding to the neural network model is determined from the square error according to the following formula:

[Loss formula: equation not reproduced in the source text]

where $l_m$ denotes the loss value corresponding to the neural network model; the next three symbols denote, in order, the limb structure, the first key point information of key point $u$, and the second key point information of key point $u$; $P_u$ denotes the third key point information of key point $u$; $G_u$ denotes the real heat map of key point $u$; and $\|\cdot\|_2$ denotes the squared error.
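One plausible reading of the symbol definitions above is that the loss accumulates, over every key point in the limb structure, the squared errors of the first, second and third key point information against the real heat map. The sketch below assumes an unweighted sum; the exact weighting and normalization are assumptions, and `total_loss` is a hypothetical name.

```python
def squared_error(predicted, target):
    """Sum of squared per-pixel differences between two H x W maps."""
    return sum(
        (p - t) ** 2
        for row_p, row_t in zip(predicted, target)
        for p, t in zip(row_p, row_t)
    )


def total_loss(first, second, third, real):
    """ASSUMPTION: unweighted sum over key points of three squared errors.

    Each argument maps a key point name to its H x W map.
    """
    return sum(
        squared_error(first[u], real[u])
        + squared_error(second[u], real[u])
        + squared_error(third[u], real[u])
        for u in real
    )


# Toy 1 x 2 maps for a single hypothetical key point.
real_maps = {"neck": [[1.0, 0.0]]}
first_maps = {"neck": [[0.5, 0.0]]}
second_maps = {"neck": [[1.0, 0.5]]}
third_maps = {"neck": [[1.0, 0.0]]}
loss_value = total_loss(first_maps, second_maps, third_maps, real_maps)
```

Supervising all three outputs, rather than only the fused one, means each graph convolution unit receives its own training signal.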

Step S208: iteratively updating the parameters of the neural network model according to the loss value to obtain a key point identification model.

After the loss value corresponding to the current neural network model is calculated, the parameters of the neural network model are adjusted according to the loss value, and the next current training picture is determined from the training set so as to continue training the neural network model.

When the training operation meets a preset training ending condition, the neural network model obtained by the current training is determined as the key point recognition model. Here, the training ending condition may be a preset training duration, a total number of training iterations, or another ending condition.
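The iterate-until-ending-condition pattern of steps S204 to S208 can be sketched with a toy model. The sketch fits a single weight by gradient descent and stops on either an iteration budget or a time budget; the model, the learning rate, and the function name `train` are hypothetical stand-ins for the neural network and its optimizer.

```python
import time


def train(samples, max_steps=200, max_seconds=5.0, learning_rate=0.1):
    """Toy loop: fit y = w * x by gradient descent on a squared-error loss."""
    weight = 0.0
    loss = float("inf")
    started = time.monotonic()
    step = 0
    # Ending condition: total number of iterations or preset training duration.
    while step < max_steps and time.monotonic() - started < max_seconds:
        x, target = samples[step % len(samples)]   # next training sample
        prediction = weight * x                    # forward pass
        loss = (prediction - target) ** 2          # loss value
        gradient = 2 * (prediction - target) * x   # d(loss) / d(weight)
        weight -= learning_rate * gradient         # parameter update
        step += 1
    return weight, step, loss


trained_weight, steps_run, last_loss = train([(1.0, 2.0), (2.0, 4.0)])
```

In a real setting the forward pass and gradient would come from the neural network and its framework, but the control flow (sample, compute loss, update, check the ending condition) is the same.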

To verify the recognition effect of the key point recognition model obtained by training in this embodiment, human body posture recognition is taken as an example: a corresponding human body structure is set according to the preset human body joint points and the connection relations between the joint points, and a key point recognition model is constructed according to the human body structure. Fig. 6 is a schematic diagram of the process of identifying key points of a human body through the key point identification model, and fig. 7 shows the working process of the graph convolution neural network module in the key point identification model.

In addition, the key point recognition model obtained by training in this embodiment is tested on three authoritative data sets for human body posture estimation, and the test results are shown in the following tables:

Table 1: Comparisons of PCKh@0.5 scores on the MPII testing set.

Table 2: Comparisons of PCK@0.2 scores on the LSP testing set.

Table 3: Comparison with Hourglass, CPN and SIM on the COCO val2017 dataset.

Their results are cited from and

As can be seen from the test data in the three tables, compared with traditional human body posture estimation methods, identifying human body joint points with the key point identification model obtained through training (corresponding to "Ours" in the tables) gives more accurate recognition results. In addition, fig. 8 shows the effect of identifying key points of a human body through the key point identification model. As can be seen from fig. 8, the posture estimation method provided by the embodiment of the invention (labeled "group-trunk" in the figure) can refine and infer the positions of the nodes at the same time, so as to obtain a better estimation effect.

According to the posture estimation method provided by this embodiment, the key point identification model is established and trained by combining the designated key points in the limb structure with the connection relations among them, and the key points of the target object in the image to be processed are then determined according to the key point identification model obtained through training.

Corresponding to the posture estimation method shown in fig. 2, the embodiment of the present invention further provides a posture estimation device. Fig. 9 is a schematic structural diagram of the posture estimation device. As can be seen from fig. 9, the device includes a to-be-processed image acquisition module 81, a keypoint information output module 82, a keypoint position determination module 83, and a target object posture determination module 84 that are sequentially connected, where the functions of the modules are as follows:

a to-be-processed image acquisition module 81 for acquiring a to-be-processed image including a target object;

the key point information output module 82 is configured to input the image to be processed into a key point identification model, and output key point information of the target object; the key point identification model is built based on a preset limb structure, and the limb structure comprises key points appointed in limbs and connection relations among the key points; the key point information includes: probability that the key point of the target object is positioned at each pixel point in the graph;

a keypoint location determining module 83 for determining a location of a keypoint of the target object according to the keypoint information of the target object;

the target object pose determination module 84 is configured to determine the pose of the target object according to the positions of the keypoints of the target object and the connection relationship between the keypoints.
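The work of the target object pose determination module can be pictured as joining located key points along the connection relations of the limb structure. The following minimal sketch represents a pose as a list of line segments; this representation, the choice to skip key points that were not located, and the names used are assumptions made for illustration.

```python
def determine_pose(positions, connections):
    """Return the pose as line segments joining connected key points.

    ASSUMPTION: key points missing from `positions` (for example, occluded
    joints) are skipped rather than estimated.
    """
    return [
        (positions[a], positions[b])
        for a, b in connections
        if a in positions and b in positions
    ]


# Hypothetical key point positions as (x, y) pixel coordinates.
positions = {"head": (12, 4), "neck": (12, 9), "left_shoulder": (8, 10)}
connections = [
    ("head", "neck"),
    ("neck", "left_shoulder"),
    ("neck", "right_shoulder"),
]
pose = determine_pose(positions, connections)
```

Here the connection to the right shoulder is dropped because that key point has no position, leaving two segments that describe the visible part of the body.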

The gesture estimation device provided in this embodiment firstly obtains an image to be processed including a target object; inputting the image to be processed into a key point identification model, and outputting key point information of the target object; the key point identification model is built based on a preset limb structure, wherein the limb structure comprises designated key points in limbs and connection relations among the designated key points; the key point information includes: probability that the key point of the target object is positioned at each pixel point in the graph; then determining the position of the key point of the target object according to the key point information of the target object; and determining the gesture of the target object according to the positions of the key points of the target object and the connection relation between the key points. In the device, through combining each appointed key point in the limb structure and the connection relation among the key points, a key point identification model is established in advance and trained, when the key points are identified based on the model, the positions of the key points can be adjusted and inferred according to the mutual relation among the key points, and compared with the traditional mode of independently positioning the key points, the invention can improve the estimation precision of the positions of the key points, and further improve the estimation precision of the human body posture.

In one possible implementation manner, the keypoint identification model is trained in the following manner: instantiating a neural network model according to a preset limb structure; inputting a current training picture into the instantiated neural network model, and outputting key point information of the object of interest in the current training picture, where the key point information includes the probability that each key point of the object of interest is located at each pixel point in the picture; determining a loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture; and iteratively updating the parameters of the neural network model according to the loss value to obtain the key point identification model.

In another possible implementation manner, the neural network model includes a full convolution neural network module and a graph convolution neural network module; the step of inputting the current training picture into the instantiated neural network model and outputting the key point information of the object of interest in the current training picture includes: inputting the current training picture into the full convolution neural network module, and outputting a convolution feature map of the key points of the object of interest in the current training picture; and inputting the convolution feature map of the key points of the object of interest into the graph convolution neural network module, and outputting the key point information of the object of interest in the current training picture.

In another possible implementation manner, the graph convolution neural network module includes a first graph convolution neural network unit and a second graph convolution neural network unit; the first graph convolution neural network unit is established based on the correlation of the key points of the object of interest within the local receptive field; the second graph convolution neural network unit is established based on the correlation between the key points of the object of interest; the step of inputting the convolution feature map of the key points of the object of interest into the graph convolution neural network module and outputting the key point information of the object of interest in the current training picture includes: inputting the convolution feature map of the key points of the object of interest into the first graph convolution neural network unit and the second graph convolution neural network unit respectively, and correspondingly outputting a first convolution feature map and a second convolution feature map of the key points; performing 1 × 1 convolution processing on the first convolution feature map and the second convolution feature map respectively, to correspondingly obtain first key point information and second key point information of the key points; and outputting the key point information of the object of interest in the current training picture according to the first key point information and the second key point information.

In another possible embodiment, the step of outputting the key point information of the object of interest in the current training picture according to the first key point information and the second key point information includes: performing feature fusion on the first key point information and the second key point information to obtain third key point information of the key point; and outputting the first key point information, the second key point information and the third key point information of the key point.

In another possible implementation manner, the network structure of the first graph convolution neural network unit is constructed according to formula 1 [equations not reproduced in the source text], where the symbols denote, in order: the input convolution feature map of key point $u$ in the $l$-th layer network of the first graph convolution neural network unit; the hidden feature of key point $u$; the $K$ parts into which a feature map is divided along the channel direction, and the feature of each part; $\mathrm{att}_{u,v}$, a convolution parameter; $*$, the convolution operation; $N_u$, the set of key points adjacent to key point $u$; $\mathrm{concate}(\cdot)$, the concatenation of feature maps along the channel direction; $\sigma$, the ReLU activation function; and a 3 × 3 convolutional layer.

In another possible implementation manner, the network structure of the second graph convolution neural network unit is constructed according to formula 2 [equations not reproduced in the source text], where the symbols denote, in order: the input feature of key point $u$ in the $l$-th layer network of the second graph convolution neural network unit; the hidden feature of key point $u$; $N_u$, the set of nodes adjacent to node $u$; two 3 × 3 convolutional layers; and $\beta_{u,v} \in \mathbb{R}^{HW \times HW}$, the attention map of key point $u$ with respect to key point $v$ in the second graph convolution neural network unit.

In another possible embodiment, the step of performing feature fusion on the first key point information and the second key point information to obtain third key point information of the key point includes: performing feature fusion on the first key point information and the second key point information according to a preset feature fusion formula [equation not reproduced in the source text] to obtain the third key point information of the key point, where $P_u$ denotes the third key point information of an arbitrary key point $u$, and the other two symbols denote the first key point information and the second key point information of the key point $u$, respectively.

In another possible implementation manner, the step of determining the loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture includes: calculating the real heat map of the object of interest according to the labeling values of the key points of the object of interest in the current training picture, where the real heat map includes the probability that each key point of the object of interest is located at each pixel point in the current training picture; calculating the square error between the real heat map and the key point information; and determining the loss value corresponding to the neural network model according to the square error.

In another possible implementation manner, the loss value corresponding to the neural network model is determined from the square error according to a loss formula [equation not reproduced in the source text], where $l_m$ denotes the loss value corresponding to the neural network model; the next three symbols denote, in order, the limb structure, the first key point information of key point $u$, and the second key point information of key point $u$; $P_u$ denotes the third key point information of key point $u$; $G_u$ denotes the real heat map of key point $u$; and $\|\cdot\|_2$ denotes the squared error.

In another possible implementation, the keypoint location determination module 83 is further configured to: and for each key point of the target object, determining the coordinates of the pixel point corresponding to the maximum probability in the key point information corresponding to the key point as the position of the key point.
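Taking the coordinates of the maximum-probability pixel as the key point position can be sketched as follows. The heat map values and key point names are hypothetical; when several pixels share the maximum, this sketch keeps the first one in row-major order, a tie-breaking choice the source does not address.

```python
def locate_keypoint(heatmap):
    """Return (x, y) of the pixel with the highest probability in an H x W map."""
    best = (0, 0)
    best_value = heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best = value, (x, y)
    return best


# Hypothetical per-key-point probability maps (H = 2, W = 3).
heatmaps = {
    "head": [[0.1, 0.7, 0.1], [0.0, 0.1, 0.0]],
    "neck": [[0.0, 0.1, 0.0], [0.1, 0.2, 0.6]],
}
keypoint_positions = {name: locate_keypoint(h) for name, h in heatmaps.items()}
```

Each key point is located independently from its own map; the structural reasoning has already been applied by the model when it produced the probabilities.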

The implementation principle and the generated technical effects of the gesture estimation device provided by the embodiment of the present invention are the same as those of the previous gesture estimation method embodiment, and for the sake of brief description, reference may be made to corresponding contents in the previous gesture estimation method embodiment where the embodiment portion of the gesture estimation device is not mentioned.

The embodiment of the invention further provides an electronic device, as shown in fig. 10, which is a schematic structural diagram of the electronic device, wherein the electronic device includes a processor 91 and a memory 92, the memory 92 stores machine executable instructions that can be executed by the processor 91, and the processor 91 executes the machine executable instructions to implement the above-mentioned pose estimation method.

In the embodiment shown in fig. 10, the electronic device further comprises a bus 93 and a communication interface 94, wherein the processor 91, the communication interface 94 and the memory 92 are connected by means of the bus.

The memory 92 may include a high-speed random access memory (RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 94 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, etc. The bus may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one bidirectional arrow is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.

The processor 91 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 91 or by instructions in the form of software. The processor 91 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 92; the processor 91 reads the information in the memory 92 and, in combination with its hardware, performs the steps of the posture estimation method of the previous embodiment.

The embodiment of the invention also provides a machine-readable storage medium, which stores machine-executable instructions that, when being called and executed by a processor, cause the processor to implement the above-mentioned attitude estimation method, and the specific implementation can refer to the above-mentioned method embodiment, and will not be repeated here.

The computer program product of the posture estimation method, the posture estimation device and the electronic device provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the posture estimation method described in the foregoing method embodiments. For the specific implementation, reference may be made to the method embodiments, which will not be repeated here.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the above embodiments are only specific implementations of the present invention, intended to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A posture estimation method, characterized by comprising:

acquiring an image to be processed containing a target object;

inputting the image to be processed into a key point identification model, and outputting key point information of the target object; the key point identification model is built based on a preset limb structure, wherein the limb structure comprises designated key points in limbs and connection relations among the designated key points; the key point information includes: probability that the key point of the target object is positioned at each pixel point in the graph;

Determining the position of a key point of the target object according to the key point information of the target object;

determining the gesture of the target object according to the positions of the key points of the target object and the connection relation between the key points;

the key point identification model comprises a full convolution neural network module and a graph convolution neural network module; the graph convolution neural network module comprises a first graph convolution neural network unit and a second graph convolution neural network unit; the first graph convolution neural network unit is established based on the correlation of key points of the object of interest in the local receptive field; the second graph convolution neural network unit is established based on a correlation relationship between key points of the object of interest.

2. The method of claim 1, wherein the keypoint identification model is trained by:

instantiating a neural network model according to a preset limb structure;

inputting a current training picture into the instantiated neural network model, and outputting key point information of an object of interest in the current training picture; the key point information includes: probability that the key point of the object of interest is positioned at each pixel point in the picture;

determining a loss value corresponding to the neural network model based on the key point information and the labeling value of the current training picture;

and iteratively updating parameters of the neural network model according to the loss value to obtain the key point identification model.
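The training steps of claim 2 can be sketched as a plain gradient-descent loop. This is only an illustration under strong assumptions: the model is replaced by a hypothetical per-pixel linear map (`predict`), the loss is a sum of squared errors, and the data, learning rate, and step count are made up. The claimed model is a full convolution network followed by graph convolution units, which this sketch does not implement.

```python
import numpy as np


def predict(weights, features):
    """Stand-in model: per-pixel linear map from C channels to J heat maps."""
    return np.einsum("jc,chw->jhw", weights, features)


def train(features, labels, steps=200, lr=0.002):
    """Iteratively update the parameters according to the loss value."""
    n_keypoints, n_channels = labels.shape[0], features.shape[0]
    weights = np.zeros((n_keypoints, n_channels))
    losses = []
    for _ in range(steps):
        error = predict(weights, features) - labels
        losses.append(float(np.sum(error ** 2)))  # loss value for this step
        gradient = 2.0 * np.einsum("jhw,chw->jc", error, features)
        weights = weights - lr * gradient  # parameter update
    return weights, losses


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))     # hypothetical training features
true_weights = rng.normal(size=(2, 4))    # hypothetical target parameters
labels = predict(true_weights, features)  # hypothetical labeled heat maps
weights, losses = train(features, labels)
```

The recorded `losses` list makes it easy to check that the loss decreases over the iterations.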

3. The method according to claim 2, wherein the step of inputting a current training picture into the instantiated neural network model and outputting key point information of an object of interest in the current training picture includes:

inputting a current training picture into the full convolution neural network module, and outputting a convolution feature map of key points of an object of interest in the current training picture;

and inputting the convolution feature map of the key points of the object of interest into the graph convolution neural network module, and outputting the key point information of the object of interest in the current training picture.

4. The method of claim 3, wherein the step of inputting the convolution feature map of the key points of the object of interest to the graph convolution neural network module and outputting the key point information of the object of interest in the current training picture comprises:

inputting the convolution feature map of the key points of the object of interest into the first graph convolution neural network unit and the second graph convolution neural network unit respectively, and correspondingly outputting a first convolution feature map and a second convolution feature map of the key points;

performing 1 × 1 convolution processing on the first convolution feature map and the second convolution feature map respectively to correspondingly obtain first key point information and second key point information of the key points;

and outputting the key point information of the object of interest in the current training picture according to the first key point information and the second key point information.
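The 1 × 1 convolution step of claim 4 amounts to applying the same linear map across channels at every pixel. A minimal NumPy sketch follows, assuming a channels-first layout `(C, H, W)`; the `conv1x1` helper, the shapes, and the weight values are illustrative assumptions rather than details from the patent.

```python
import numpy as np


def conv1x1(feature_map, weight, bias):
    """Apply a 1 x 1 convolution.

    feature_map: (C, H, W) input feature map
    weight:      (J, C) one row of channel weights per output map
    bias:        (J,)   one bias per output map
    returns:     (J, H, W)
    """
    mixed = np.einsum("jc,chw->jhw", weight, feature_map)
    return mixed + bias[:, None, None]


# Example: 3 input channels of ones, 2 output maps.
feature_map = np.ones((3, 2, 2))
weight = np.array([[1.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])
bias = np.zeros(2)
keypoint_info = conv1x1(feature_map, weight, bias)
```

In the claimed method this operation would be applied separately to the first and second convolution feature maps, yielding the first and second key point information.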

5. The method according to claim 4, wherein the step of outputting the key point information of the object of interest in the current training picture based on the first key point information and the second key point information includes:

performing feature fusion on the first key point information and the second key point information to obtain third key point information of the key points;

and outputting the first key point information, the second key point information and the third key point information of the key points.

6. The posture estimation method according to claim 4, wherein the network structure of the first graph convolution neural network unit is constructed according to the following formula:

[formula not preserved in the source text]

where [symbol missing] denotes the input convolution feature map of the key point u in a first layer network of the first graph convolution neural network unit; [symbol missing] denotes a hidden feature of the key point u; [symbol missing] is divided along the channel direction into K parts, each part being denoted [symbol missing]; $att_{u,v}$ denotes a convolution parameter; $*$ denotes a convolution operation; $N_u$ denotes the set of key points adjacent to the key point u; $\mathrm{Concate}(\cdot)$ denotes concatenation of feature maps along the channel direction; $\sigma$ denotes a ReLU activation function; and [symbol missing] denotes a 3 × 3 convolutional layer.

7. The posture estimation method according to claim 4, wherein the network structure of the second graph convolution neural network unit is constructed according to the following formula:

[formula not preserved in the source text]

where [symbol missing] denotes the input features of the key point u in a first layer network of the second graph convolution neural network unit; [symbol missing] denotes a hidden feature of the key point u; $N_u$ denotes the set of key points adjacent to the key point u; [symbol missing] and [symbol missing] each denote a 3 × 3 convolutional layer; and $\beta_{u,v} \in \mathbb{R}^{HW \times HW}$ denotes the attention of the key point u to the key point v in the second graph convolution neural network unit.

8. The method of claim 5, wherein the step of feature-fusing the first keypoint information and the second keypoint information to obtain third keypoint information of the keypoint comprises:

performing feature fusion on the first key point information and the second key point information according to a preset feature fusion formula to obtain third key point information of the key points; wherein the feature fusion formula is:

[formula not preserved in the source text]

where $P_u$ denotes the third key point information of an arbitrary key point u, [symbol missing] denotes the first key point information of the key point u, and [symbol missing] denotes the second key point information of the key point u.

9. The method according to claim 5, wherein the step of determining the loss value corresponding to the neural network model based on the keypoint information and the labeling value of the current training picture includes:

calculating a real heat map of the object of interest according to the labeling values of the key points of the object of interest in the current training picture; the real heat map comprises the probability that each key point of the object of interest is located at each pixel of the current training picture;

calculating a squared error between the real heat map and the key point information;

and determining a loss value corresponding to the neural network model according to the squared error.
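The steps of claim 9 can be sketched as follows. The claim does not state how the real heat map is built from a labeled position; the Gaussian form, the `sigma` value, and both helper names below are assumptions made for illustration only.

```python
import numpy as np


def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Heat map peaked at the labeled key point (the Gaussian form is an assumption)."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def squared_error(predicted, real):
    """Sum of squared per-pixel differences between prediction and real heat map."""
    return float(np.sum((predicted - real) ** 2))


# Example: key point labeled at x=5, y=9 in a 16 x 16 picture.
real = gaussian_heatmap(16, 16, (5, 9))
predicted = np.zeros_like(real)  # hypothetical untrained prediction
loss = squared_error(predicted, real)
```

A prediction identical to the real heat map gives a squared error of zero, and any deviation gives a positive value.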

10. The posture estimation method according to claim 9, wherein the calculation formula for determining the loss value corresponding to the neural network model according to the squared error is:

[formula not preserved in the source text]

where $l_m$ denotes the loss value corresponding to the neural network model; [symbol missing] denotes the limb structure; [symbol missing] denotes the first key point information of the key point u; [symbol missing] denotes the second key point information of the key point u; $P_u$ denotes the third key point information of the key point u; $G_u$ denotes the real heat map of the key point u; and $\|\cdot\|_2$ denotes the squared error.

11. The posture estimation method according to claim 1, wherein the step of determining the positions of the key points of the target object according to the key point information of the target object comprises:

for each key point of the target object, determining the coordinates of the pixel with the maximum probability in the key point information corresponding to that key point as the position of the key point.
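The selection rule of claim 11 is an argmax over each key point's probability map. A minimal NumPy sketch follows, assuming the key point information is stored as an array of shape `(J, H, W)` with one map per key point; the `keypoint_positions` helper and the example values are illustrative, not from the patent.

```python
import numpy as np


def keypoint_positions(heatmaps):
    """Return the (x, y) pixel with the highest probability for each key point.

    heatmaps: array of shape (J, H, W), one probability map per key point.
    """
    n_keypoints, height, width = heatmaps.shape
    flat_index = heatmaps.reshape(n_keypoints, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_index, (height, width))
    return [(int(x), int(y)) for x, y in zip(cols, rows)]


# Example: two key points on a 4 x 5 image.
heatmaps = np.zeros((2, 4, 5))
heatmaps[0, 1, 3] = 0.9  # key point 0 peaks at row 1, column 3
heatmaps[1, 2, 0] = 0.7  # key point 1 peaks at row 2, column 0
positions = keypoint_positions(heatmaps)
```

Positions are returned as `(x, y)`, that is, column first and row second; when several pixels tie for the maximum, `argmax` keeps the first one in row-major order.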

12. A posture estimation apparatus, comprising:

an image acquisition module, configured to acquire an image to be processed containing a target object;

a key point information output module, configured to input the image to be processed into a key point identification model and output key point information of the target object; the key point identification model is built based on a preset limb structure, wherein the limb structure comprises designated key points in a limb and connection relations among the designated key points; the key point information includes: the probability that a key point of the target object is located at each pixel of the image;

a key point position determining module, configured to determine the positions of the key points of the target object according to the key point information of the target object;

a target object posture determining module, configured to determine the posture of the target object according to the positions of the key points of the target object and the connection relations between the key points;

13. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the posture estimation method according to any one of claims 1 to 11.

14. A computer-readable storage medium storing computer-executable instructions which, when invoked and executed by a processor, cause the processor to implement the posture estimation method according to any one of claims 1 to 11.