CN111209811B - Method and system for detecting eyeball attention position in real time

Info

Publication number: CN111209811B
Application number: CN201911371128.8A
Authority: CN
Inventors: 戚鹏飞
Original assignee: Dilu Technology Co Ltd
Current assignee: Dilu Technology Co Ltd
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2024-04-09
Anticipated expiration: 2039-12-26
Also published as: CN111209811A

Abstract

The invention discloses a method and a system for detecting the eyeball attention position in real time. The method comprises the following steps: an image acquisition module acquires original pictures of a person; each original picture is input into OpenCV, which segments it and outputs the input data; a neural network structure model is constructed according to the input data; annotated training data is collected; the training data is input into the neural network structure model to train the model and complete the setting of its training parameters; and a prediction result processing module restores the prediction result generated by the neural network structure model to the original size. The beneficial effects of the invention are as follows: the recognition accuracy of the eye contour edge is greatly improved compared with conventional recognition results; and because a rectangular coordinate system is established with the center point of the pupils of the two eyes as the origin, the position of the eyeball attention can be accurately identified in all four quadrants, not only in the left and right directions, which greatly improves practicality.

Description

Method and system for detecting eyeball attention position in real time

Technical Field

The invention relates to the technical field of vision processing, in particular to a method for detecting the attention position of an eyeball in real time and a system for detecting the attention position of the eyeball in real time.

Background

In recent years, with the rapid development of intelligent control, image capture and recognition technologies have been studied and widely applied in various intelligent products. They have not only opened a new field of research but also greatly advanced the intelligence of electronic products and made people's lives more convenient. Eyeballs have attracted growing interest as a new information source: a person's intention can be inferred by studying the movement track of the eyeballs. Eyeball control is an overlooked control mode that, compared with other modes such as motion capture control, is convenient and has low power requirements, and so enriches human-machine interaction. Eyeball capture relies mainly on recognition technology, but the accuracy of general image recognition is insufficient for the motion track of tiny objects such as pupils, and processing a large number of pictures takes a long time; the processing time depends mainly on the performance of the hardware components and the quality of the recognition algorithm. Therefore, how to improve the real-time performance and accuracy of eyeball capture is an important issue in this technical field.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.

The present invention has been made in view of the above-described problems occurring in the prior art.

Therefore, one technical problem solved by the present invention is to provide a method for detecting the eyeball attention position in real time that improves the recognition accuracy of the eye contour edge.

In order to solve the above technical problem, the invention provides the following technical scheme: a method for detecting the eyeball attention position in real time, comprising the following steps: an image acquisition module acquires original pictures of a person; each original picture is input into OpenCV, which segments it and outputs the input data; a neural network structure model is constructed according to the input data; annotated training data is collected; the training data is input into the neural network structure model to train the model and complete the setting of its training parameters; and a prediction result processing module restores the prediction result generated by the neural network structure model to the original size and returns the eyeball attention position as a position in rectangular coordinates with the image acquisition module as the origin.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the input data is obtained by the following steps: the original picture is divided into three pictures (left eye, right eye and face) by the haarcascade models of OpenCV, and the position occupied by the face in the picture is calculated at the same time; the four inputs, namely the left eye picture, the right eye picture, the face picture and the face grid, are then passed to the neural network structure model.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the picture input into OpenCV is the original picture acquired by the image acquisition module; its resolution is 1920x1080 and it has 3 channels.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the neural network structure model is constructed by the following steps: acquiring the data of the original picture; preparing the input data of the neural network structure model; and building the neural network structure units of the neural network structure model.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the left eye and right eye pictures are obtained by inputting the original picture into the haarcascade_eye eye recognition unit of OpenCV to obtain two sets of x, y, w, h coordinates, one for the left eye and one for the right eye, and cropping the two eye pictures according to these coordinates.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the face picture is obtained by inputting the original picture into the haarcascade_frontalface face recognition unit of OpenCV to obtain the x, y, w, h coordinates of the face, and cropping the face picture according to these coordinates.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the face grid is obtained by dividing the original picture equally into a 5x5 grid and marking each grid cell as 1 if the face occupies more than 50% of the cell and as 0 otherwise, which yields a face position mask of size 5x5.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the neural network structure unit comprises a left and right eye feature extraction network structure, a facial feature extraction network structure and feature merging, wherein the feature merging comprises flattening the left eye and right eye feature maps, flattening the facial feature map and flattening the face position mask result, and finally merging the four to output a two-dimensional result whose x and y values represent the eyeball attention position in the coordinate system.

As a preferred embodiment of the method for detecting the eyeball attention position in real time according to the present invention, the annotated training data is acquired by the following steps: a square grid plate is used as a scale; face pictures of an observer are collected, wherein each time the observer gazes at a point on the grid plate, a face picture of the observer is taken and the gazed grid cell is recorded; the grid plate has 30 cells in total, so 30 face pictures and the corresponding grid positions are collected per person; 10 observers are randomly selected and the operation is repeated to obtain 300 pictures with 1920x1080 resolution and the same number of corresponding grid positions; and the pictures are stored under an img directory, while the grid positions are converted into coordinates and stored under a label directory.

Another technical problem solved by the present invention is to provide a system for detecting the eyeball attention position in real time that improves the recognition accuracy of the eye contour edge.

In order to solve the above technical problem, the invention provides the following technical scheme: a system for detecting the eyeball attention position in real time, characterized in that the system comprises an image acquisition module, a neural network structure model and a prediction result processing module; the image acquisition module is used for acquiring original pictures of a person, from which the neural network structure model is then constructed; the neural network structure model is used for outputting the eyeball attention position of the input person as a prediction result; and the prediction result processing module receives the prediction result and returns the eyeball attention position as a position in rectangular coordinates with the image acquisition module as the origin.

The beneficial effects of the invention are as follows: the recognition accuracy of the eye contour edge is greatly improved compared with conventional recognition results; and because a rectangular coordinate system is established with the center point of the pupils of the two eyes as the origin, the position of the eyeball attention can be accurately identified in all four quadrants, not only in the left and right directions, which greatly improves practicality.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort. Wherein:

FIG. 1 is a schematic overall flow chart of a method for detecting the eyeball attention position in real time according to a first embodiment of the invention;

FIG. 2 is a schematic diagram of a left-right feature extraction network structure according to a first embodiment of the invention;

FIG. 3 is a schematic diagram of a facial feature extraction network structure according to a first embodiment of the present invention;

FIG. 4 is a schematic diagram of a Flatten layer implementation according to a first embodiment of the present invention;

FIG. 5 is a schematic diagram of a visual display of a Flatten layer neural network according to a first embodiment of the present invention;

FIG. 6 is a schematic view of a square grid plate according to a first embodiment of the present invention;

FIG. 7 is a schematic diagram of training a neural network structure model according to a first embodiment of the present invention;

FIG. 8 is a schematic diagram of the overall principle of a system for detecting the eyeball attention position in real time according to a second embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Example 1

In the era of big data, machine learning appears more and more often in everyday life: recommendation systems on shopping platforms and in search engines, text recognition in mobile phone images, speech-to-text conversion, and AlphaGo defeating world-class Go players are all notable achievements of deep learning. Deep learning is no longer confined to scientific research; it has entered people's daily lives, where machines reduce human workload and improve efficiency. Deep learning is a branch of machine learning. A deep learning structure is characterized by a multilayer perceptron with several hidden layers, which combines lower-level features into more abstract higher-level representations in order to discover the features and attributes of the data.

Conventional eyeball tracking methods are based on conventional vision processing technology and have two problems. First, the precision of the eye contour segmentation result is low and deviates considerably from the true value, which affects the final judgment. Second, owing to the structure of the human eye, left-right eyeball movement can be judged fairly accurately, but up-down movement is judged poorly or not at all: because the eye is a flat ellipse, the lateral range of eyeball movement within the eye socket far exceeds the longitudinal range, and conventional methods cannot reach adequate recognition accuracy for longitudinal movement. Conventional vision processing obtains eyeball coordinates by methods such as channel extraction, gradient calculation and Gaussian filtering, and inaccurate detection of the eye socket range leads to misjudgment of the pupil position. In this embodiment, a deep learning method is used: a face photo of the user is acquired and divided into a left eye picture, a right eye picture, a face picture and a face grid (facegrid), which are fed into a neural network model that estimates head pose and gaze direction and returns the x and y coordinates, and hence the quadrant, of the user's current attention in rectangular coordinates with the camera as the origin.

Referring to the schematic diagram of FIG. 1, the method for detecting the eyeball attention position in real time according to this embodiment specifically includes the following steps:

S1: the image acquisition module 100 acquires original pictures of the person. Put simply, this step prepares the data: pictures of the eyes and face are acquired and passed to the subsequent neural network for eye tracking and recognition.

S2: the original picture is input into OpenCV, which segments it and outputs the input data. The input data in this step is obtained as follows: the original picture is divided into three pictures (left eye, right eye and face) by the haarcascade models of OpenCV, and the position of the face in the picture is calculated at the same time; the four inputs, namely the left eye picture, the right eye picture, the face picture and the face grid, are passed to the neural network structure model 200. The picture input into OpenCV is the original picture acquired by the image acquisition module 100; its resolution is 1920x1080 and it has 3 channels (R, G, B).
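The segmentation in step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the opencv-python package (which ships the cascade files under cv2.data.haarcascades), assumes the haarcascade_frontalface_default.xml variant of the face cascade, and uses hypothetical helper names (order_eyes_by_x, crop, split_picture). The two eye boxes are ordered left to right in the image; which one is the person's left eye depends on whether the camera image is mirrored.

```python
def order_eyes_by_x(boxes):
    """Keep the two largest (x, y, w, h) detections and order them left to right."""
    if len(boxes) < 2:
        raise ValueError("need at least two eye detections")
    two_largest = sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)[:2]
    return sorted(two_largest, key=lambda b: b[0])


def crop(image, box):
    """Crop a NumPy-style image array (rows, cols, channels) to an (x, y, w, h) box."""
    x, y, w, h = box
    return image[y:y + h, x:x + w]


def split_picture(path):
    """Return (eye_a, eye_b, face, face_box) for the picture stored at `path`."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Assumed cascade files; the source only names "haarcascade" models.
    eye_model = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_eye.xml")
    face_model = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    # scaleFactor=1.1 and minNeighbors=5 are illustrative values, not from the source.
    eyes = [tuple(int(v) for v in b)
            for b in eye_model.detectMultiScale(gray, 1.1, 5)]
    faces = [tuple(int(v) for v in b)
             for b in face_model.detectMultiScale(gray, 1.1, 5)]
    if not faces:
        raise ValueError("no face detected")

    face_box = max(faces, key=lambda b: b[2] * b[3])  # largest face wins
    eye_a, eye_b = order_eyes_by_x(eyes)
    return crop(image, eye_a), crop(image, eye_b), crop(image, face_box), face_box
```

The returned face_box is what a face grid computation would consume, so the detector only needs to run once per picture.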

S3: the neural network structure model 200 is constructed according to the input data, by the following steps:

S31: acquiring the data of the original picture;

S32: preparing the input data of the neural network structure model 200:

the left eye and right eye pictures are obtained by inputting the original picture into the haarcascade_eye eye recognition unit of OpenCV to obtain two sets of x, y, w, h coordinates, one for the left eye and one for the right eye, and cropping the two eye pictures according to these coordinates;

the face picture is obtained by inputting the original picture into the haarcascade_frontalface face recognition unit of OpenCV to obtain the x, y, w, h coordinates of the face, and cropping the face picture according to these coordinates;

the face grid is obtained by dividing the original picture equally into a 5x5 grid and marking each grid cell as 1 if the face occupies more than 50% of the cell and as 0 otherwise, which yields a face position mask of size 5x5.
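The face grid rule can be sketched in a few lines. This is an illustrative reading of the rule, assuming that "the face ratio in each grid" means the fraction of a grid cell's area covered by the face bounding box; the function name face_grid and its parameters are hypothetical.

```python
def face_grid(face_box, image_w=1920, image_h=1080, n=5, min_ratio=0.5):
    """Return an n x n mask: 1 where the face box covers more than min_ratio of a cell."""
    x, y, w, h = face_box
    cell_w = image_w / n
    cell_h = image_h / n
    mask = []
    for row in range(n):
        cells = []
        for col in range(n):
            cell_x0 = col * cell_w
            cell_y0 = row * cell_h
            # Width and height of the intersection between the face box and this cell.
            overlap_w = max(0.0, min(x + w, cell_x0 + cell_w) - max(x, cell_x0))
            overlap_h = max(0.0, min(y + h, cell_y0 + cell_h) - max(y, cell_y0))
            ratio = (overlap_w * overlap_h) / (cell_w * cell_h)
            cells.append(1 if ratio > min_ratio else 0)
        mask.append(cells)
    return mask


# A face box that exactly covers the four cells in rows 1-2, columns 1-2.
example_mask = face_grid((384, 216, 768, 432))
```

With a 1920x1080 picture each cell is 384x216 pixels, so the mask encodes roughly where, and how large, the face is in the frame.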

It should also be noted that, corresponding to the above 4 inputs, the first half of the neural network structure consists of four channels:

left eye channel: this channel contains four convolutional layers and extracts the left eye features;

right eye channel: this channel also contains four convolutional layers, which share parameters with the left eye channel;

face channel: the structure of this channel is identical to that of the left/right eye channels, but it does not share parameters with them;

position channel: this channel has no convolutional layer; it is fed into an FC layer and then merged with the other features.

After the data of the four channels is flattened and merged, it enters the final FC layer, which outputs two values, namely the x and y coordinates of the eyeball attention position. The input dimension of the final FC layer is 256, that is, the dimension obtained by merging the four feature vectors (left eye feature, right eye feature, facial feature and face grid (facegrid)), and the output dimension is 2.
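The merge and final FC layer can be illustrated with plain lists. This is a sketch under stated assumptions: the source gives only the merged dimension (256) and the output dimension (2), so the equal split into four 64-dimensional branch vectors is an assumption, as are the random weights and the helper names merge_features and fully_connected.

```python
import random


def merge_features(left_eye, right_eye, face, grid):
    """Concatenate the four flattened feature vectors into one vector."""
    return list(left_eye) + list(right_eye) + list(face) + list(grid)


def fully_connected(inputs, weights, biases):
    """Dense layer: one output per weight row, out_i = sum_j(w_ij * in_j) + b_i."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


rng = random.Random(0)
branch_dim = 64  # assumed per-branch size so that 4 * 64 = 256
branches = [[rng.random() for _ in range(branch_dim)] for _ in range(4)]
merged = merge_features(*branches)

# Untrained placeholder weights for a 256 -> 2 layer.
weights = [[rng.uniform(-0.01, 0.01) for _ in range(len(merged))] for _ in range(2)]
biases = [0.0, 0.0]
gaze_x, gaze_y = fully_connected(merged, weights, biases)
```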

S33: building the neural network structure units of the neural network structure model 200: the neural network structure unit comprises a left and right eye feature extraction network structure, a facial feature extraction network structure and feature merging, wherein the feature merging comprises flattening the left eye and right eye feature maps, flattening the facial feature map and flattening the face position mask result, and finally merging the four to output a two-dimensional result whose x and y values represent the eyeball attention position in the coordinate system.

This step is also described with reference to the schematic diagram of FIG. 2; the left and right eye feature extraction network structure is as follows:

CONV-E1: kernel_size(11*11) filter_number(96)

CONV-E2: kernel_size(5*5) filter_number(256)

CONV-E3: kernel_size(3*3) filter_number(384)

CONV-E4: kernel_size(1*1) filter_number(64)

E1-E4 are four convolutional layers. kernel_size is the size of the convolution kernel, here matrices of size 11*11, 5*5, 3*3 and 1*1 respectively; filter_number is the number of convolution kernels, i.e., the number of channels of the output tensor.
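To give a feel for the size of this four-layer stack, the sketch below counts its weights. It assumes a 3-channel (RGB) input, ordinary ungrouped convolutions and one bias per filter; strides, padding and pooling are not stated in the source and do not affect the count.

```python
# (name, kernel side length, number of filters) as listed for CONV-E1..E4.
LAYERS = [
    ("CONV-E1", 11, 96),
    ("CONV-E2", 5, 256),
    ("CONV-E3", 3, 384),
    ("CONV-E4", 1, 64),
]


def conv_param_counts(layers, in_channels=3):
    """Weights plus biases per layer: k * k * in_channels * filters + filters."""
    counts = {}
    for name, kernel, filters in layers:
        counts[name] = kernel * kernel * in_channels * filters + filters
        in_channels = filters  # the next layer consumes this layer's output channels
    return counts


counts = conv_param_counts(LAYERS)
total = sum(counts.values())
```

Under these assumptions the stack holds 1,559,360 parameters, most of them in the 3*3 layer, and sharing them between the two eye channels means they are stored only once.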

Referring to FIG. 3, the facial feature extraction network structure is as follows; its layer parameters are the same as those of the eye feature extraction network:

CONV-F1: kernel_size(11*11) filter_number(96)

CONV-F2: kernel_size(5*5) filter_number(256)

CONV-F3: kernel_size(3*3) filter_number(384)

CONV-F4: kernel_size(1*1) filter_number(64)

F1-F4 are four convolutional layers, with kernel_size being matrices of size 11*11, 5*5, 3*3 and 1*1 respectively; filter_number is the number of convolution kernels, i.e., the number of channels of the output tensor.

Feature merging flattens the left and right eye feature maps, the facial feature map and the face position mask result, merges the four, and finally outputs a 2-dimensional result representing x and y in the coordinate system.

Referring to FIGS. 4-5, in the flattening process the input data passes through a Flatten operation before entering the FC layer. The Flatten operation "flattens" the input, i.e., converts a multidimensional input into a one-dimensional one, and is commonly used in the transition from a convolutional layer to a fully connected layer.
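The effect of a Flatten operation can be shown on a small nested list standing in for a feature map. This is a minimal sketch in plain Python; deep learning frameworks provide an equivalent layer, and the function name flatten here is only illustrative.

```python
def flatten(feature_map):
    """Recursively unroll a nested list (e.g. channels x rows x cols) into one flat list."""
    flat = []
    for item in feature_map:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# A toy 2 x 2 x 2 feature map becomes a vector of 8 values.
toy_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
toy_vector = flatten(toy_map)
```

The element order is preserved, so a fully connected layer always sees a given feature map position at the same input index.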

S4: annotated training data is collected. The annotated training data in this step is acquired as follows:

a square grid plate as shown in FIG. 6 is used as a scale;

face pictures of an observer are collected: each time the observer gazes at a point on the grid plate, a face picture of the observer is taken and the gazed grid cell is recorded; the grid plate has 30 cells in total, so 30 face pictures and the corresponding grid positions are collected per person;

10 observers are randomly selected and the operation is repeated to obtain 300 pictures with 1920x1080 resolution and the same number of corresponding grid positions; the pictures are stored under an img directory, and the grid positions are converted into coordinates and stored under a label directory.
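The conversion from a recorded grid cell to a coordinate label might look like the sketch below. The source states only that the plate has 30 cells; the 6-column by 5-row layout, the row-major cell numbering, the unit cell size and the choice of the cell center as the label are all assumptions made for illustration.

```python
def cell_to_xy(index, cols=6, cell_size=1.0):
    """Map a row-major cell index to the (x, y) center of that cell on the plate."""
    row, col = divmod(index, cols)
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)


# 10 observers x 30 cells gives the 300 labelled samples described above.
labels = [(observer, cell, cell_to_xy(cell))
          for observer in range(10)
          for cell in range(30)]
```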

S5: the training data is input into the neural network structure model 200 to train the model and complete the setting of its training parameters. Referring to FIG. 7, the training parameters in this embodiment are set as follows:

Epoch: 300

Step: 500

Lr: 0.0001

Momentum factor: 0.9

LossFunction: MCELoss

BatchSize: 2

The entire training process takes 10 hours on 2080 graphics cards.
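How the learning rate and momentum factor enter a weight update can be sketched as follows. Several things here are assumptions: the source does not name the optimizer, so stochastic gradient descent with momentum is assumed; the listed loss name is reproduced as given in the parameter list, and a mean squared error is used in the sketch only as a typical regression loss for (x, y) outputs; the key names in CONFIG are hypothetical.

```python
# Values taken from the parameter list; the key names are illustrative.
CONFIG = {"epochs": 300, "steps": 500, "lr": 0.0001, "momentum": 0.9, "batch_size": 2}


def mse_loss(pred, target):
    """Mean squared error between predicted and target coordinates (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def momentum_step(weights, grads, velocity, lr, momentum):
    """One SGD-with-momentum update: v = momentum * v + g, then w = w - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Two consecutive updates with a constant gradient show the velocity building up.
w1, v1 = momentum_step([0.0], [1.0], [0.0], CONFIG["lr"], CONFIG["momentum"])
w2, v2 = momentum_step(w1, [1.0], v1, CONFIG["lr"], CONFIG["momentum"])
```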

S6: the prediction result processing module 300 restores the prediction result generated by the neural network structure model 200 to the original size, and returns the eyeball attention position as a position in rectangular coordinates with the image acquisition module 100 as the origin. Specifically, the prediction result generated by the neural network is a position coordinate at a resolution of 224x224, and this coordinate is restored to the original size of 1920x1080 through a resize.

This embodiment improves the recognition accuracy of the eye contour edge, which is greatly improved compared with conventional recognition results; combined with the eyeball recognition algorithm, the overall recognition accuracy exceeds 90%. Because a rectangular coordinate system is established with the center point of the pupils of the two eyes as the origin, the position of the eyeball attention can be accurately identified in all four quadrants, not only in the left and right directions, which greatly improves practicality.
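The restoration from network resolution to picture resolution amounts to a per-axis scaling. The sketch below assumes a plain linear rescale with no letterboxing or cropping offsets, which the source does not specify; the function name restore_to_original is hypothetical.

```python
def restore_to_original(x, y, net_size=224, orig_w=1920, orig_h=1080):
    """Scale a coordinate predicted on a net_size x net_size input back to the original picture."""
    return (x * orig_w / net_size, y * orig_h / net_size)


# The center of the 224x224 input maps to the center of the 1920x1080 picture.
center_xy = restore_to_original(112, 112)
```

Because 1920/224 and 1080/224 differ, the horizontal and vertical scale factors are not equal.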

Scenario 1:

To verify the overall recognition accuracy of over 90% obtained in this embodiment, conventional recognition approaches are compared with the present method in a comparative test, and the test results are compared by scientific demonstration to verify the real effect of the method.

Conventional technical schemes include, for example, eye tracking based on single-frame images and eye tracking based on video frames. The single-frame detection algorithm YOLO performs up-sampling feature fusion on the feature maps of each convolutional layer to obtain more salient feature information, makes predictions on the feature maps of all convolutional layers, and obtains the final eye position information using training techniques such as bounding-box regression; however, single-frame detection has accuracy problems on small objects such as eyeballs. The video-frame approach combines the YOLO algorithm with a recurrent neural network: because the information in preceding and following video frames is strongly spatially correlated, the network learns the correlation between the feature information of those frames, and when the eyeball is occluded by external factors it predicts the eyeball position from the confidence maps of the 5 preceding and following frames; however, the tracking effect is still poor when the eyeball is occluded.

In this embodiment, simulation tests are carried out with both the present eyeball position detection method and conventional video-frame-based eyeball tracking techniques in order to verify the detection accuracy of the present method.

The test environment is as follows:

Operating system: Windows 10 Professional 64-bit (DirectX 12).

Processor: Intel Core i5-6500 @ 3.20GHz, quad core.

Graphics card: NVIDIA GeForce GTX 1060 3GB.

Framework: TensorFlow; tool: Unity3D 2017.

Training data sets: the Kaggle data set, comprising 7000 face images of size 96 x 96 and 30 kinds of face key point annotation data; and the ImageNet classification data set, comprising 1.2 million images in 1000 categories.

Test data set: OTB50.

The accuracy is evaluated as the percentage of frames in the image sequence for which the distance between the center of the eyeball tracking result and the center of the ground-truth annotated position is within a given threshold.

The formula is:

where Box_T is the bounding box of the eye tracking result and Box_G is the ground-truth annotated bounding box.
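The evaluation rule described in words above (the share of frames whose tracked center lies within a threshold distance of the ground-truth center) can be sketched as follows. This follows the verbal description only, since the formula itself is not reproduced in this text; the 20-pixel default threshold and the (x, y, w, h) box format are assumptions.

```python
import math


def box_center(box):
    """Center point of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def precision(tracked_boxes, truth_boxes, threshold=20.0):
    """Percentage of frames whose center error is within `threshold` pixels."""
    if len(tracked_boxes) != len(truth_boxes) or not tracked_boxes:
        raise ValueError("need two non-empty sequences of equal length")
    hits = 0
    for box_t, box_g in zip(tracked_boxes, truth_boxes):
        (tx, ty), (gx, gy) = box_center(box_t), box_center(box_g)
        if math.hypot(tx - gx, ty - gy) <= threshold:
            hits += 1
    return 100.0 * hits / len(tracked_boxes)


# Frame 1 is off by 5 pixels (a hit); frame 2 is off by about 141 pixels (a miss).
example = precision([(0, 0, 10, 10), (100, 100, 10, 10)],
                    [(3, 4, 10, 10), (0, 0, 10, 10)])
```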

The following test methods were run separately, and the simulation test results output by the final actual software are shown in table 1 below.

Table 1: experimental results.

Detection method	Data set	Accuracy rate (%)	Speed/s
Fastest DPM	OTB50	81.9	4.61
R-CNN Minus R	OTB50	85.6	0.83
Fast R-CNN	OTB50	89.1	2.77
The present method	OTB50	91.8	1.57
Faster R-CNN ZF	OTB50	62.1	24
YOLO VGG-16	OTB50	78.2	17

It can be concluded from the table that the detection method of this application reaches an accuracy of 91.8%, which is comparable to that of Fast R-CNN, while having a clear advantage in speed.

Example 2

Referring to the schematic diagram of FIG. 8, this embodiment provides a system for detecting the eyeball attention position in real time, which includes an image acquisition module 100, a neural network structure model 200, and a prediction result processing module 300.

More specifically, in this embodiment the image acquisition module 100 is used for acquiring original pictures of the person, from which the neural network structure model 200 is then constructed; the neural network structure model 200 is used for outputting the eyeball attention position of the input person as a prediction result; and the prediction result processing module 300 receives the prediction result and returns the eyeball attention position as a position in rectangular coordinates with the image acquisition module 100 as the origin. The image acquisition module 100 is a camera; the neural network structure model 200 and the prediction result processing module 300 are software programs running on a computer, and real-time detection and tracking of the eyeball attention position is realized through the algorithm program of this embodiment. It should be understood that the neural network structure model 200 and the prediction result processing module 300 may also be implemented by writing their respective calculation programs into processing circuit board hardware and integrating them as processing chip hardware.

As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, the components may be, but are not limited to: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. Furthermore, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A method for detecting the eye attention position in real time, characterized by comprising the following steps:

the image acquisition module (100) respectively acquires original pictures of the person objects;

the original picture is input into opencv for segmentation calculation, and input data is then output;

correspondingly constructing a neural network structure model (200) according to the input data;

collecting annotation training data;

the training data is input into the neural network structure model (200) to perform model training and complete training parameter setting of the model;

the prediction result processing module (300) restores the prediction result generated by the neural network structure model (200) into the original size, and the prediction result processing module (300) establishes a rectangular coordinate system by taking the center point of the pupils of two eyes obtained in the original picture as the origin and returns accurate information of the eye attention position;

in addition to the left and right directions, the prediction result processing module (300) can accurately identify the eye attention position in each of the four quadrants;
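By way of illustration only, the coordinate system with its origin at the center point between the two pupils, and the quadrant of a predicted position, may be sketched as follows. The function names are hypothetical, and it is an assumption of this sketch that positions are given in image pixels with the row index growing downward, so that the y axis is flipped to make positive y point upward.

```python
from typing import Tuple

Point = Tuple[float, float]


def pupil_midpoint(left_pupil: Point, right_pupil: Point) -> Point:
    """Center point between the two pupil centers, in image pixels."""
    return ((left_pupil[0] + right_pupil[0]) / 2.0,
            (left_pupil[1] + right_pupil[1]) / 2.0)


def to_pupil_frame(point_px: Point, origin_px: Point) -> Point:
    """Express an image-pixel position relative to the pupil midpoint.

    Assumption: image rows grow downward, so y is flipped to point upward.
    """
    return (point_px[0] - origin_px[0], origin_px[1] - point_px[1])


def quadrant(p: Point) -> int:
    """Return 1-4 for the quadrant of p, or 0 if p lies on an axis."""
    x, y = p
    if x == 0 or y == 0:
        return 0
    if x > 0:
        return 1 if y > 0 else 4
    return 2 if y > 0 else 3


if __name__ == "__main__":
    origin = pupil_midpoint((900.0, 500.0), (1020.0, 500.0))
    print(quadrant(to_pupil_frame((1000.0, 400.0), origin)))
```

A position on either axis is reported as 0 here so that the purely left, right, up or down cases remain distinguishable from the four quadrants.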

correspondingly constructing a neural network structure model according to the input data, comprising the following acquisition steps,

dividing the original picture into 3 pictures of a left eye, a right eye and a face through a haarcascade model of opencv, and simultaneously calculating the position of the face in the picture;

transmitting four inputs, namely a left eye picture, a right eye picture, a face picture and a face grid, to the neural network structural model (200);

the picture input into opencv is the original picture acquired by the image acquisition module (100); the resolution of the original picture is 1920x1080, and the number of channels is 3 (r, g and b);

the neural network structural model (200) comprises the following construction steps,

acquiring data of the original picture;

input data preparation of the neural network structural model (200);

constructing a neural network structural unit of the neural network structural model (200);

the input data preparation of the neural network structural model (200) further includes,

the image segmentation of the left eye and the right eye comprises the steps of inputting the original image into a haarcascade_eye eye recognition unit of opencv to obtain two sets of x, y, w, h coordinates of the left eye and the right eye, and cutting out two images of the left eye and the right eye according to the coordinates;

the face picture segmentation and acquisition comprises the steps of inputting the original picture into a haarcascade_front face recognition unit of opencv, acquiring x, y, w, h coordinates of a face, and cutting the face picture according to the coordinates;

the obtaining of the face grid comprises the steps of equally dividing the original picture into 5x5 grid cells, marking a cell as 1 if the face occupies more than 50% of that cell and otherwise marking it as 0, thereby obtaining a face position mask with the size of 5x5;
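The face grid step can be illustrated with the following minimal sketch. It assumes that the face is given as the x, y, w, h box returned by the face recognition unit and that the "face ratio" of a cell is the fraction of the cell area covered by that box; the function name face_grid and its parameters are hypothetical.

```python
from typing import List, Tuple


def face_grid(face_box: Tuple[float, float, float, float],
              image_w: int = 1920, image_h: int = 1080,
              n: int = 5, threshold: float = 0.5) -> List[List[int]]:
    """Return an n x n mask; a cell is 1 if the face box covers more
    than `threshold` of the cell area, otherwise 0."""
    x, y, w, h = face_box
    cell_w = image_w / n
    cell_h = image_h / n
    grid = []
    for row in range(n):
        cells = []
        for col in range(n):
            # Bounds of the current cell in pixels.
            cx0, cy0 = col * cell_w, row * cell_h
            cx1, cy1 = cx0 + cell_w, cy0 + cell_h
            # Width and height of the overlap between cell and face box.
            ox = max(0.0, min(x + w, cx1) - max(x, cx0))
            oy = max(0.0, min(y + h, cy1) - max(y, cy0))
            covered = (ox * oy) / (cell_w * cell_h)
            cells.append(1 if covered > threshold else 0)
        grid.append(cells)
    return grid


if __name__ == "__main__":
    for row in face_grid((700, 400, 500, 300)):
        print(row)
```

With a 1920x1080 picture each cell measures 384x216 pixels, so a face box of that size aligned with the central cell marks only that cell.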

the front half part of the neural network structure consists of four channels:

position channel: the channel has no convolution layer;

after the data of the four channels are flattened and concatenated, the data enter a final FC layer, which outputs two results, namely the x and y coordinates of the eyeball attention position; the input dimension of the final FC layer is 256, namely the dimension obtained by combining the four feature vectors of the left eye feature, the right eye feature, the facial feature and the face grid, and the output dimension is 2;
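The dimension bookkeeping of the final FC layer can be checked with the short sketch below. It assumes that each of the four feature vectors has 64 elements (256 divided by 4), which the claim does not state explicitly, and it uses placeholder weights rather than trained parameters; the helper names are hypothetical.

```python
from typing import List, Sequence


def concat_features(*features: Sequence[float]) -> List[float]:
    """Concatenate already-flattened feature vectors into one vector."""
    merged: List[float] = []
    for feature in features:
        merged.extend(feature)
    return merged


def linear(x: Sequence[float],
           weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


if __name__ == "__main__":
    # Placeholder 64-element vectors for the four channels (assumption).
    left_eye = [1] * 64
    right_eye = [1] * 64
    face = [1] * 64
    face_position = [1] * 64
    merged = concat_features(left_eye, right_eye, face, face_position)
    # Placeholder 2 x 256 weights; a trained model would supply real values.
    weights = [[1] * 256, [2] * 256]
    bias = [0, 1]
    x_pos, y_pos = linear(merged, weights, bias)
    print(len(merged), x_pos, y_pos)
```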

the neural network structure unit comprises a left eye feature extraction network structure, a right eye feature extraction network structure, a facial feature extraction network structure and feature combination, wherein the feature combination comprises flattening the left eye and right eye feature maps, flattening the facial feature map and flattening the face position mask result, and combining the four to finally output a two-dimensional result representing the x and y coordinates of the eyeball attention position in the coordinate system defined by the origin;

the left and right eye feature extraction network structure is as follows:

CONV-E1: kernel_size(11*11) filter_number(96)

CONV-E2: kernel_size(5*5) filter_number(256)

CONV-E3: kernel_size(3*3) filter_number(384)

CONV-E4: kernel_size(1*1) filter_number(64)

E1-E4 are four convolution layers; kernel_size is the size of the convolution kernel, here matrices of size 11*11, 5*5, 3*3 and 1*1 respectively; filter_number is the number of convolution kernels, namely the dimension of the output Tensor;

the facial feature extraction network structure is as follows:

its parameters are the same as those of the eye feature extraction network:

CONV-F1: kernel_size(11*11) filter_number(96)

CONV-F2: kernel_size(5*5) filter_number(256)

CONV-F3: kernel_size(3*3) filter_number(384)

CONV-F4: kernel_size(1*1) filter_number(64)

F1-F4 are four convolution layers; kernel_size is the size of the convolution kernel, here matrices of size 11*11, 5*5, 3*3 and 1*1 respectively; filter_number is the number of convolution kernels, namely the dimension of the output Tensor;
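As a hedged illustration of what the kernel_size and filter_number values imply, the sketch below chains the channel counts of the four layers and counts their parameters. It assumes ordinary (non-grouped) convolutions with one bias per filter and a 3-channel (r, g, b) input; stride, padding, pooling and activation are not specified in the claim and are therefore left out. The names LAYERS, conv_param_count and summarize are hypothetical.

```python
from typing import List, Tuple

# (layer index, kernel side length, filter_number), shared by E1-E4 and F1-F4.
LAYERS = [(1, 11, 96), (2, 5, 256), (3, 3, 384), (4, 1, 64)]


def conv_param_count(kernel: int, in_ch: int, out_ch: int,
                     bias: bool = True) -> int:
    """Parameters of a standard 2D convolution: k*k*in*out weights,
    plus one bias per output channel when bias is enabled."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def summarize(layers=LAYERS, in_channels: int = 3) -> List[Tuple[int, int, int, int]]:
    """Return (layer, input channels, output channels, parameter count)."""
    rows = []
    channels = in_channels
    for index, kernel, out_ch in layers:
        rows.append((index, channels, out_ch,
                     conv_param_count(kernel, channels, out_ch)))
        channels = out_ch  # output of one layer feeds the next
    return rows


if __name__ == "__main__":
    for index, in_ch, out_ch, params in summarize():
        print(f"CONV-{index}: {in_ch} -> {out_ch} channels, {params} parameters")
```

Under these assumptions the last layer reduces the 384 channels of the third layer to 64 with a 1*1 kernel, which keeps the per-branch feature small before it is combined with the other branches.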

the annotation training data comprises the following acquisition steps,

a square grid plate is adopted as a scale;

collecting face pictures of an observer: each time the observer looks at a point on the grid plate, a face picture of the observer is taken and the grid being looked at is recorded, wherein the grid plate comprises 30 grids in total, and 30 face pictures and the corresponding grid positions are collected for each person;

randomly selecting 10 observers and repeating the above operation to obtain 300 images with 1920x1080 resolution and the same number of corresponding grid positions;

and storing the picture under an img directory, converting the grid position into a coordinate system, and storing under a label directory.
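The conversion of a recorded grid position into a coordinate and its storage under the label directory may be sketched as follows. The claim fixes only the total of 30 grids and the img and label directory names; the 6-column by 5-row layout, the unit cell size, the use of the cell center as the coordinate, the origin at the top-left corner of the grid plate and the one-text-file-per-picture format are all assumptions of this sketch.

```python
from pathlib import Path
from typing import Tuple


def grid_index_to_xy(index: int, cols: int = 6, rows: int = 5,
                     cell_size: float = 1.0) -> Tuple[float, float]:
    """Map a grid index (0..cols*rows-1, row-major) to its cell center.

    Assumption: the 30 grids form 5 rows of 6, measured from the
    top-left corner of the grid plate.
    """
    if not 0 <= index < cols * rows:
        raise ValueError(f"grid index {index} out of range")
    row, col = divmod(index, cols)
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)


def save_label(root: str, sample_name: str, index: int) -> Path:
    """Write the coordinate of one sample to <root>/label/<sample_name>.txt."""
    label_dir = Path(root) / "label"
    label_dir.mkdir(parents=True, exist_ok=True)
    x, y = grid_index_to_xy(index)
    path = label_dir / f"{sample_name}.txt"
    path.write_text(f"{x} {y}\n")
    return path


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as root:
        print(save_label(root, "observer01_grid00", 0).read_text())
```

The matching picture would be stored under the img directory with the same sample name, so that each image can be paired with its label by file name.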