CN107194559A

CN107194559A - A workflow recognition method based on a three-dimensional convolutional neural network

Info

Publication number: CN107194559A
Application number: CN201710335309.XA
Authority: CN
Inventors: 胡海洋; 丁佳民; 陈洁; 胡华; 程凯明
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Taoyi Data Technology Co ltd
Priority date: 2017-05-12
Filing date: 2017-05-12
Publication date: 2017-09-22
Anticipated expiration: 2037-05-12
Also published as: CN107194559B

Abstract

The invention discloses a workflow recognition method based on a three-dimensional convolutional neural network. Existing approaches merely divide the process tasks in advance and label the actions manually while analyzing the video, which does not meet the automation requirements of intelligent manufacturing. The invention first proposes an inter-frame difference method with an adaptive threshold, used mainly to segment the regions of moving objects from complex backgrounds and thereby reduce the time complexity of subsequent feature extraction and model training. Second, the 3D convolutional neural network is improved so that it can adapt to factory environments with multiple monitoring devices; a view pooling layer fuses the views from different angles according to their weights. Finally, a new action division method automatically divides the continuous production actions in the video, thereby realizing an automated workflow recognition process.

Description

A Workflow Recognition Method Based on 3D Convolutional Neural Network

Technical Field

The invention belongs to the technical field of workflow recognition and is used for fast and accurate recognition and detection of production and manufacturing processes. Cameras installed in the manufacturing workshop record the entire production scheduling process on the production line, and the video is then computed on and processed, which plays an important role in protecting the personal safety of employees, reducing production costs, ensuring product quality, and optimizing production scheduling and process specifications.

Background Art

Intelligent manufacturing is the next stage in the development of manufacturing automation. It applies artificial intelligence widely across engineering design, process design, production scheduling, fault diagnosis, and other stages of industrial manufacturing, making the manufacturing process intelligent and greatly improving productivity. Workflow recognition, an important technical direction within intelligent manufacturing, has attracted the attention of both industry and the research community. It uses cameras installed in the manufacturing workshop to record the entire production scheduling process on the production line, then computes on and processes the video to recognize and detect industrial production processes quickly and accurately. It will therefore play an important role in protecting the personal safety of employees, reducing production costs, ensuring product quality, and optimizing production scheduling and process specifications.

However, workflow recognition has its own complexity and particularities. First, a production workshop contains many machines, transport vehicles, auxiliary tools, and other objects that frequently occlude one another; in addition, operations in different processes resemble each other and the lighting in the workshop changes frequently. All of this makes video and image analysis and recognition challenging. Moreover, the dynamic nature of a production workflow makes recognition considerably more complex and prone to error. For example, different tasks in a workflow often have different execution times, and there is no clear boundary between the start and end of a task; a task may even contain both human and machine actions, and actions unrelated to the workflow must be distinguished from genuine production tasks. These factors make traditional action/posture recognition methods, which rely on detecting and tracking target objects, hard to apply in complex factory environments. Furthermore, although some researchers have studied workflow recognition, their work does not clearly define how to divide the image sequence of a video into production processes/actions automatically. Most of it merely divides the process tasks in advance and labels the actions manually while analyzing the video, which clearly does not meet the automation requirements of intelligent manufacturing.

Summary of the Invention

In view of the current state of research, the invention proposes a robust workflow recognition framework. Within this framework, an inter-frame difference method with an adaptive threshold is first proposed; it is used mainly to segment the regions of moving objects from complex backgrounds, reducing the time complexity of subsequent feature extraction and model training. Second, the 3D convolutional neural network is improved so that it can adapt to factory environments with multiple monitoring devices; a view pooling layer fuses the views from different angles according to their weights. Finally, a new action division method is proposed that automatically divides the continuous production actions in the video, thereby realizing an automated workflow recognition process.

The specific steps of the method are:

Step (1): Export workflow videos covering multiple viewing angles from the data set, and obtain the video resolution and frame count of the workflow video of each view;

Step (2): Initialize the inter-frame difference threshold for the workflow video of each view; perform steps (3) to (11) on the workflow video of each view separately;

Step (3): Set t = 2;

Step (4): Read three consecutive video frames t-1, t, and t+1, and apply grayscale conversion and median filtering to these three frames;

Step (5): Compute the inter-frame difference of the first two frames and of the last two frames, obtaining two inter-frame difference images;

Step (6): Dynamically update the inter-frame difference threshold according to the two inter-frame difference images obtained in step (5). The threshold is updated as follows:

6.1 Set l = 1 and set the inter-frame difference threshold of the t-th frame to [formula omitted in source], where d_k is the value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the inter-frame difference image, and min{d_k} is the minimum pixel value in the inter-frame difference image;

6.2 Let N₁ and N₂ denote the total numbers of pixels satisfying [condition omitted in source] and [condition omitted in source], respectively;

6.3 If [condition omitted in source], assign [expression omitted in source] to τ¹_t; otherwise, set l = l + 1 and repeat step 6.2;

Step (7): Binarize the current frame according to the inter-frame difference threshold obtained in step (6): pixels greater than the threshold are set to 1 and pixels less than the threshold are set to 0;

Step (8): Apply an AND operation to the two inter-frame difference images to obtain the three-frame difference image, and use the blob extraction method to obtain the center coordinates of the point of interest;

Step (9): Segment the extracted point of interest out of the original image of the current frame;

Step (10): Increase t by 1 each time and repeat steps (4) to (9) until t is 1 less than the index of the last frame of the workflow video; the segmentation size in step (9) stays the same throughout. Save the point-of-interest images obtained in step (9) of each repetition, in order, as a point-of-interest video, and classify the point-of-interest videos according to the classification rules of the data set;

Step (11): Randomly select 90% of the point-of-interest videos obtained in step (10) as the training set and use the rest as the test set;

Step (12): Construct a multi-view three-dimensional convolutional neural network and initialize the number of training rounds to 5000. The network is constructed as follows:

12.1 The convolution and pooling operations are as follows:

① Initialize a four-dimensional convolution kernel of size 9*9*9*10 for the first convolutional layer, with sigmoid as the activation function; the first pooling layer has a window size of 2 and a stride of 2;

② Initialize a four-dimensional convolution kernel of size 9*9*7*30 for the second convolutional layer, with sigmoid as the activation function; the second pooling layer has a window size of 2 and a stride of 2;

③ Initialize a four-dimensional convolution kernel of size 9*8*5*50 for the third convolutional layer, with sigmoid as the activation function; the third pooling layer has a window size of 2 and a stride of 2;

④ Initialize a four-dimensional convolution kernel of size 4*3*3*150 for the fourth convolutional layer, with sigmoid as the activation function; the fourth pooling layer has a window size of 2 and a stride of 2;

12.2 Initialize the weight parameter of each feature map in the weighted average view pooling layer to a random value in [0,1], subject to [constraint omitted in source]. The weighted average view pooling operation in this layer is as follows:

[formula omitted in source]

where a is the weighted average feature map after the weighted average view pooling operation, t₁ is the index of a pooled feature map after the convolution and pooling operations, [symbol omitted in source] is the weight of the pooled feature map with index t₁, exp denotes the exponential function with base e, and [symbol omitted in source] is the pooled feature map with index t₁;

12.3 Initialize kernels of size 3000*1500 and 1500*750 for the first two fully connected layers, respectively, and set the activation function to ReLU; the weighted average feature map produced by the weighted average view pooling operation is fed into the first two fully connected layers;

12.4 Initialize a kernel of size 750*14 for the last fully connected layer and set the Softmax classification function.

Step (13): From the training set of the workflow video of each view, randomly select 20 videos and feed them into the multi-view three-dimensional convolutional neural network of step (12) for feature training, and output the training error;

Step (14): From the training set of the workflow video of each view, randomly select 10 videos and feed them into the multi-view three-dimensional convolutional neural network for validation, obtaining the classification accuracy of the network;

Step (15): Repeat steps (13) and (14), decreasing the number of training rounds by 1 on each repetition until it reaches 0, which yields a trained multi-view three-dimensional convolutional neural network;

Step (16): Test the multi-view three-dimensional convolutional neural network of step (15) using the test set of the workflow video of each view;

Step (17): For a newly input workflow video, obtain the video resolution and frame count and initialize the inter-frame difference threshold; set t = 2;

Step (18): Following steps (4) to (8), extract the center coordinates of the points of interest in two adjacent frames and compute the distance between the two centers. If the distance is greater than the set threshold T, mark the state as the motion state S₁; otherwise, mark it as the relatively static state S₀;

Step (19): Increase t by 1 each time and repeat step (18) until t is 1 less than the index of the last frame of the newly input workflow video. Count the lengths of consecutive runs of S₀ and S₁. When a run of consecutive S₀ or S₁ has length greater than or equal to N, segment the target points of interest from the frames of that run and store them in the frame queue; otherwise discard the frames of that run.

Step (20): From the set of frames of each consecutive S₀ or S₁ run in the frame queue, extract consecutive key frames starting at the i-th frame, i > 5, so that the number of key frames equals the number of frames in each classified video segment of the data set.

Step (21): Feed the video formed by the key frames of step (20), in order, into the multi-view three-dimensional convolutional neural network trained in step (15) to classify and recognize employee behavior;

Step (22): Compare the behavior categories obtained in step (21) with the predefined standard workflow.
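The comparison in step (22) is not specified further in the text. As a minimal sketch, assuming that both the recognized result and the standard workflow are ordered lists of action labels, the function below reports where the two first disagree. The function name and the labels in the usage note are hypothetical and do not come from the patent.

```python
def first_deviation(recognized, standard):
    """Return the index of the first mismatch, or None if the sequences agree.

    Both arguments are ordered lists of action labels (an assumption of this
    sketch; the source only says the categories are "compared").
    """
    for index, (got, expected) in enumerate(zip(recognized, standard)):
        if got != expected:
            return index
    # All shared positions match; a length difference is still a deviation.
    if len(recognized) != len(standard):
        return min(len(recognized), len(standard))
    return None
```

For example, with a standard workflow of `["pick", "carry", "place"]`, a recognized sequence `["pick", "place"]` deviates at index 1, which would flag a skipped carrying action.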

The beneficial effects of the invention are as follows:

The workflow recognition method based on a three-dimensional convolutional neural network provided by the invention consists mainly of the following functional modules: a moving target segmentation module, a behavior recognition module, and an action division module.

The moving target segmentation module segments the target points of interest from images and video sequences. Because the target in a workflow video sequence moves considerably while the background is essentially static, two consecutive frames can be subtracted to obtain an inter-frame difference image, and the moving target can then be segmented by comparing each pixel difference with a threshold. The adaptive three-frame difference method used here applies an AND operation to the inter-frame difference images obtained from the first two and the last two of three video frames, yielding a three-frame difference image. Its threshold is adjusted automatically according to the preceding inter-frame difference images, so it can effectively suppress the influence of noise;
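The three-frame difference described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patented implementation: it assumes the frames are already grayscale and filtered, takes the threshold as a plain argument instead of adapting it, and binarizes each difference image before the AND operation. The function and variable names are hypothetical.

```python
import numpy as np

def three_frame_difference(prev_gray, curr_gray, next_gray, threshold):
    """Return a binary motion mask (uint8, values 0/1) for the middle frame."""
    # Cast to a signed type so the subtraction cannot wrap around on uint8 input.
    prev_f = prev_gray.astype(np.int16)
    curr_f = curr_gray.astype(np.int16)
    next_f = next_gray.astype(np.int16)

    diff_prev = np.abs(curr_f - prev_f)   # difference image of the first two frames
    diff_next = np.abs(next_f - curr_f)   # difference image of the last two frames

    # Binarize each difference image: True above the threshold, False otherwise.
    mask_prev = diff_prev > threshold
    mask_next = diff_next > threshold

    # AND the two masks: a pixel counts as moving only if it changed in both pairs.
    return np.logical_and(mask_prev, mask_next).astype(np.uint8)
```

The AND step is what removes the "ghost" left at an object's old position: that pixel changes in only one of the two frame pairs and is therefore discarded.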

The behavior recognition module uses a 3D convolutional neural network together with multi-view learning to recognize the behavior of moving targets. To achieve multi-view fusion, we use a view-pooling layer to fuse the global information from the views. The multi-view 3D-CNN contains several independent 3D-CNNs that extract features from the image sequences of the different views; the view-pooling layer then fuses the feature descriptors extracted from the different views and learns view-related features; finally, a fully connected neural network (FNN) with a softmax classifier performs the final recognition;
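The exact view-pooling formula is not reproduced in this text; only its symbol descriptions survive (a weight per pooled feature map and an exponential function). The sketch below therefore assumes one plausible form, a softmax-normalized weighted sum over views, and should be read as an illustration of weighted view fusion, not as the patent's formula. All names are hypothetical.

```python
import numpy as np

def weighted_average_view_pooling(view_features, view_weights):
    """Fuse per-view feature maps with softmax-normalized weights (assumed form).

    view_features: array of shape (num_views, ...), one pooled feature map per view.
    view_weights:  array of shape (num_views,), one learnable scalar per view.
    """
    features = np.asarray(view_features, dtype=np.float64)
    weights = np.asarray(view_weights, dtype=np.float64)

    # Subtract the maximum before exponentiating for numerical stability.
    exp_w = np.exp(weights - weights.max())
    alpha = exp_w / exp_w.sum()          # normalized weights, summing to 1

    # Weighted sum over the view axis; the remaining axes are kept.
    return np.tensordot(alpha, features, axes=(0, 0))
```

With equal weights this reduces to plain average pooling over views; during training the weights would be learned so that more informative camera angles contribute more.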

The action division module defines two states: a motion state and a relatively static state. For the point of interest in each frame, its center coordinates are taken; when the point of interest moves, its center coordinates move with it. The difference between the center coordinates of the point of interest in two adjacent frames can therefore represent the current state of the point of interest, which is how motion and rest are distinguished;
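A minimal sketch of this state labeling follows, assuming the centers are (x, y) tuples and that "distance" means Euclidean distance, which the text does not state explicitly. The function name and the string labels are hypothetical stand-ins for S₁ and S₀.

```python
import math

def label_motion_states(centers, distance_threshold):
    """Label each pair of adjacent frames as "S1" (moving) or "S0" (relatively static).

    centers: list of (x, y) center coordinates, one per frame.
    Returns a list with len(centers) - 1 labels.
    """
    states = []
    for (x_prev, y_prev), (x_curr, y_curr) in zip(centers, centers[1:]):
        # Euclidean displacement of the point-of-interest center between frames.
        distance = math.hypot(x_curr - x_prev, y_curr - y_prev)
        states.append("S1" if distance > distance_threshold else "S0")
    return states
```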

The workflow recognition method provided by the invention effectively addresses two problems of workflow recognition in complex environments. The first is the effect on recognition of mutual occlusion among machines, transport vehicles, auxiliary tools, and other objects in the production workshop, of the similarity between operations in different processes, and of frequent lighting changes in the workshop. The second is how to divide the image sequence of a video into production processes/actions automatically.

Description of the Drawings

Figure 1 is a schematic diagram of the construction of the multi-view three-dimensional convolutional neural network;

Figure 2 is a schematic diagram of workflow action division.

Detailed Description

The invention is further described below with reference to the drawings and embodiments.

First, the concepts and symbols are defined:

Inter-frame difference threshold [symbol omitted in source]: t denotes the current frame number and l ≥ 1 denotes the recursion order; d_k is the value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the inter-frame difference image, and min{d_k} is the minimum pixel value in the inter-frame difference image;

N₁ and N₂: the total numbers of pixels satisfying [condition omitted in source] and [condition omitted in source], respectively.

a: the weighted average feature map after the weighted average view pooling operation.

t₁: the index of a pooled feature map after the convolution and pooling operations.

[symbol omitted in source]: the weight of the pooled feature map with index t₁.

[symbol omitted in source]: the pooled feature map with index t₁.

Next, the workflow recognition method based on a three-dimensional convolutional neural network is implemented in the following steps:

(1) Moving target segmentation: Video surveillance equipment on a production line is usually mounted high up, so most of the monitored frame is factory background unrelated to workflow recognition. Extracting feature vectors directly from the whole frame would greatly increase the difficulty of feature extraction and the computation time. A three-frame difference method with an adaptive threshold is therefore used to segment the moving target (point of interest) from the video, reducing the workload of later steps. Specifically:

(1.1) Export the multi-view workflow videos from the data set, and obtain the video resolution and frame count of the workflow video of each view;

(1.2) Initialize the inter-frame difference threshold for the workflow video of each view; set t = 2 and perform steps (1.3) to (1.9) on the workflow video of each view separately;

(1.3) Read a video frame t and its two neighboring frames t-1 and t+1, and apply grayscale conversion and median filtering to these three frames;

(1.4) Compute the inter-frame difference of the first two frames and of the last two frames, obtaining two inter-frame difference images;

(1.5) Dynamically update the inter-frame difference threshold according to the two inter-frame difference images obtained in step (1.4), as follows:

(1.5.1) Set l = 1 and set the inter-frame difference threshold of the t-th frame to [formula omitted in source], where d_k is the value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the inter-frame difference image, and min{d_k} is the minimum pixel value in the inter-frame difference image;

(1.5.2) Let N₁ and N₂ denote the total numbers of pixels satisfying [condition omitted in source] and [condition omitted in source], respectively;

(1.5.3) If [condition omitted in source], assign [expression omitted in source] to τ¹_t; otherwise, set l = l + 1 and repeat step (1.5.2);

(1.6) Binarize the current frame (the middle frame) according to the inter-frame difference threshold obtained in (1.5): pixels greater than the threshold are set to 1 and pixels less than the threshold are set to 0;

(1.7) Apply an AND operation to the two difference images to obtain the three-frame difference image, and use the Blob Extraction method to obtain the center coordinates of the point of interest;

(1.8) Segment the extracted point of interest out of the original image of the current frame;

(1.9) Increase t by 1 each time and repeat steps (1.3) to (1.8) until t is 1 less than the index of the last frame of the workflow video; the segmentation size in step (1.8) stays the same throughout. Save the point-of-interest images obtained in step (1.8) of each repetition, in order, as a point-of-interest video, and classify the point-of-interest videos according to the classification rules of the data set;
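The formulas of the threshold update in (1.5.1) to (1.5.3) did not survive in this text. What remains (a starting value built from the maximum and minimum pixel values, two pixel counts N₁ and N₂, and an iteration index l that advances until a condition holds) resembles classic iterative mean-split thresholding, often called ISODATA or Ridler-Calvard thresholding. The sketch below substitutes that well-known technique as an assumption; it is not confirmed to be the patent's formula, and the `tolerance` and `max_iterations` parameters are hypothetical.

```python
import numpy as np

def iterative_threshold(diff_image, tolerance=0.5, max_iterations=100):
    """Estimate a threshold for a difference image by iterative mean splitting."""
    d = np.asarray(diff_image, dtype=np.float64).ravel()
    tau = (d.max() + d.min()) / 2.0   # assumed starting value: midpoint of the range

    for _ in range(max_iterations):
        upper = d[d >= tau]   # the N1 pixels at or above the current threshold
        lower = d[d < tau]    # the N2 pixels below it
        if upper.size == 0 or lower.size == 0:
            break             # one class is empty, so the threshold cannot be refined
        # New threshold: midpoint between the mean values of the two classes.
        new_tau = (upper.mean() + lower.mean()) / 2.0
        if abs(new_tau - tau) < tolerance:
            return new_tau
        tau = new_tau
    return tau
```

The returned value could then serve as the `threshold` argument when binarizing the difference images, so that the cut-off tracks lighting changes from frame to frame instead of being fixed in advance.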

(2) Behavior recognition based on a multi-view three-dimensional convolutional neural network: An examination of current manufacturing production lines shows that the same work scene is often monitored synchronously and in real time by several cameras from different angles, to guarantee product quality and employee safety. Exploiting this, we use multi-view feature extraction and fusion to reduce the effect of the complex factory environment on behavior recognition and to improve recognition accuracy. The steps are as follows:

(2.1) Select 90% of the point-of-interest videos obtained in (1) as the training set and use the rest as the test set;

(2.2) Construct a multi-view three-dimensional convolutional neural network (see Figure 1). Initialize the number of training rounds to 5000. The network is constructed as follows:

The convolution and pooling operations are (2.2.1) to (2.2.4):

(2.2.1) Initialize a four-dimensional convolution kernel of size 9*9*9*10 for the first convolutional layer, with sigmoid as the activation function; the first pooling layer has a window size of 2 and a stride of 2;

(2.2.2) Initialize a four-dimensional convolution kernel of size 9*9*7*30 for the second convolutional layer, with sigmoid as the activation function; the second pooling layer has a window size of 2 and a stride of 2;

(2.2.3) Initialize a four-dimensional convolution kernel of size 9*8*5*50 for the third convolutional layer, with sigmoid as the activation function; the third pooling layer has a window size of 2 and a stride of 2;

(2.2.4) Initialize a four-dimensional convolution kernel of size 4*3*3*150 for the fourth convolutional layer, with sigmoid as the activation function; the fourth pooling layer has a window size of 2 and a stride of 2;

(2.2.5) Initialize the weight parameter of each feature map in the weighted average view pooling layer to a random value in [0,1], subject to [constraint omitted in source]. The weighted average view-pooling layer (WAVP) is computed as follows:

[formula omitted in source]

(2.2.6) Initialize kernels of size 3000*1500 and 1500*750 for the first two fully connected layers, respectively, and set the activation function to ReLU; the weighted average feature map produced by the weighted average view pooling operation is fed into the first two fully connected layers;

(2.2.7) Initialize a kernel of size 750*14 for the last fully connected layer and set the Softmax classification function, where 14 is the number of action classes.

(2.3) From the training set of the workflow video of each view, randomly select 20 videos and feed them into the multi-view three-dimensional convolutional neural network of (2.2) for feature training, and output the training error;

(2.4) From the training set of the workflow video of each view, randomly select 10 videos and feed them into the multi-view three-dimensional convolutional neural network for validation, obtaining the classification accuracy of the network;

(2.5) Repeat (2.3) and (2.4), decreasing the number of training rounds by 1 on each repetition until it reaches 0, which yields a trained multi-view three-dimensional convolutional neural network;

(2.6) Test the multi-view three-dimensional convolutional neural network of (2.5) using the test set of the workflow video of each view;
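The data split in (2.1) and the countdown schedule in (2.3) to (2.5) can be sketched with the standard library alone. The network itself is deliberately left out: `train_step` and `validate` are hypothetical callbacks standing in for whatever training and evaluation code is used, and the fixed seed is an assumption made for reproducibility.

```python
import random

def split_train_test(videos, train_fraction=0.9, seed=0):
    """Shuffle the videos and split them into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(videos)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def run_training_schedule(train_set, train_step, validate,
                          rounds=5000, train_batch=20, val_batch=10, seed=0):
    """Run the countdown loop: sample, train, validate, decrement the round counter."""
    rng = random.Random(seed)
    history = []
    remaining = rounds
    while remaining > 0:
        # 20 randomly chosen training videos per round, as in step (2.3).
        loss = train_step(rng.sample(train_set, train_batch))
        # 10 randomly chosen videos, also from the training set, as in step (2.4).
        accuracy = validate(rng.sample(train_set, val_batch))
        history.append((loss, accuracy))
        remaining -= 1
    return history
```

Note that, as written in the text, validation samples are drawn from the training set, so the reported accuracy is not an estimate on unseen data; the held-out test set of step (2.6) serves that purpose.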

(3) State-based action division method: In a real environment, workers' actions usually occur continuously. To recognize them, the video must first be segmented into individual actions, and each action is then recognized separately. Observation shows that a displacement occurs between these behaviors, for example when a worker goes from picking up parts to carrying and placing them, or from picking up a welding tool to welding parts (see Figure 2). The actions can therefore be divided according to the worker's motion state. The specific steps are as follows:

(3.1) For a newly input video, obtain the video resolution and the number of frames, and initialize the inter-frame difference threshold; set t = 2;

(3.2) Following (1.3) to (1.7), extract the center coordinates of the points of interest in two adjacent frames and compute the distance between the two centers. If the distance is greater than the set threshold T, mark the state as the motion state S₁; otherwise, mark it as the relatively static state S₀;

(3.3) Increase t by 1 and repeat (3.2) until t is 1 less than the number of the last frame of the newly input video, counting the lengths of the runs of consecutive S₀ and S₁. When a run of consecutive S₀ or S₁ has length greater than or equal to N, where N > 10, use the method of (1.8) to segment the target points of interest from the corresponding frames and store them in the frame queue; otherwise, discard the frames of that run.

(3.4) For each set of frames in the frame queue corresponding to a run of consecutive S₀ or S₁, extract consecutive key frames starting from the i-th frame, where i > 5, so that the number of key frames equals the number of frames of each classified video segment in the data set.

(3.5) Input the video formed by the key frames of (3.4), in their original order, into the multi-view three-dimensional convolutional neural network trained in (2.5) to classify and identify employee behavior;

(3.6) Compare the behavior categories obtained in (3.5) with the predefined standard workflow.
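
Steps (3.2) to (3.4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `label_states`, `split_runs`, and `key_frames` are hypothetical, the point-of-interest centers are assumed to have been extracted already by steps (1.3) to (1.7), and any threshold values passed in are placeholders rather than values from the patent.

```python
from itertools import groupby
from math import hypot


def label_states(centers, dist_threshold):
    """Label each adjacent frame pair: 1 for motion (S1), 0 for relatively static (S0).

    `centers` is a list of (x, y) point-of-interest centers, one per frame.
    Entry j of the result describes the pair (frame j, frame j + 1).
    """
    states = []
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        moved = hypot(x1 - x0, y1 - y0) > dist_threshold
        states.append(1 if moved else 0)
    return states


def split_runs(states, min_run):
    """Group consecutive identical states and keep only runs of length >= min_run.

    Returns (state, start, end) tuples with `end` exclusive; shorter runs are discarded.
    """
    segments = []
    start = 0
    for state, group in groupby(states):
        length = len(list(group))
        if length >= min_run:
            segments.append((state, start, start + length))
        start += length
    return segments


def key_frames(segment, first, count):
    """Pick `count` consecutive indices from a segment, starting at offset `first`.

    Assumption: `first` is a 0-based offset into the segment. Returns None when
    the segment is too short to supply `count` frames from that offset.
    """
    _, start, end = segment
    begin = start + first
    if begin + count > end:
        return None
    return list(range(begin, begin + count))
```

With N = 11 (the smallest integer satisfying N > 10), a run of ten static pairs would be discarded, while a run of twelve moving pairs would be kept as one candidate action and passed on to key-frame extraction.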

Claims

1. A workflow recognition method based on a three-dimensional convolutional neural network, characterized in that the method comprises the following steps:

Step (1), export workflow videos containing multiple perspectives from the data set, and obtain the video resolution and frame number of the workflow videos of each perspective;

Step (2), initializing the inter-frame difference thresholds of the workflow videos of each perspective; performing steps (3) to (11) for the workflow videos of each perspective;

Step (3), setting t=2;

Step (4), read the three consecutive video frames t-1, t, and t+1, and apply grayscale conversion and median filtering to these three frames;

Step (5), performing inter-frame difference operations on the first two frames and the last two frames respectively to obtain two inter-frame difference images;

Step (6), dynamically update the inter-frame difference threshold according to the two inter-frame difference images obtained in step (5); the inter-frame difference threshold dynamic update method is as follows:

6.1 Set l=1, the inter-frame difference threshold of the tth frame d _k is the pixel value of the kth pixel in the inter-frame difference image, max{d _k } is the maximum value of the pixel value in the inter-frame difference image, and min{d _k } is the minimum value of the pixel value in the inter-frame difference image;

N ₁ and N ₂ respectively represent the satisfaction of with The total number of pixels;

6.3 If then will Assign value to τ ¹ _t , otherwise, let l=l+1, repeat step 6.2;

Step (7), binarize the current frame according to the inter-frame difference threshold obtained in step (6): pixels greater than the inter-frame difference threshold are set to 1, and pixels less than the inter-frame difference threshold are set to 0;

Step (8), perform an AND operation on the two inter-frame difference images to obtain the three-frame difference image, and use the block extraction method to obtain the center coordinates of the point of interest;

Step (9), segmenting the extracted point of interest from the original image of the current frame;

Step (10), increase t by 1 and repeat steps (4) to (9) until t is 1 less than the number of the last frame of the workflow video, keeping the segmentation size of step (9) unchanged across repetitions; the point-of-interest images obtained in step (9) in each repetition are saved in sequence as a point-of-interest video, and the point-of-interest videos are classified according to the classification rules of the data set;

Step (11), randomly select 90% of the point-of-interest videos obtained in step (10) as a training set, and use the rest as a test set;

Step (12), construct a multi-view three-dimensional convolutional neural network, and initialize the number of training rounds to 5000; the construction method of the multi-view three-dimensional convolutional neural network is as follows:

12.1. The convolution and pooling operations are as follows:

① Initialize a four-dimensional convolution kernel of size 9*9*9*10 for the first convolutional layer; the activation function is sigmoid, the window size of the first pooling layer is 2, and the step size is 2;

② Initialize a four-dimensional convolution kernel of size 9*9*7*30 for the second convolutional layer; the activation function is sigmoid, the window size of the second pooling layer is 2, and the step size is 2;

③ Initialize a four-dimensional convolution kernel of size 9*8*5*50 for the third convolutional layer; the activation function is sigmoid, the window size of the third pooling layer is 2, and the step size is 2;

④ Initialize a four-dimensional convolution kernel of size 4*3*3*150 for the fourth convolutional layer; the activation function is sigmoid, the window size of the fourth pooling layer is 2, and the step size is 2;

12.2. Initialize the weight parameter of each feature map in the weighted-average view pooling layer to a random value in [0,1]; the weighted-average view pooling operation in this layer is as follows:

$$a = \frac{\sum_{t_1} \exp\left(\alpha^{t_1}\right) \cdot z^{t_1}}{\sum_{t_1} \exp\left(\alpha^{t_1}\right)}$$

In the formula, $a$ is the weighted-average feature map produced by the weighted-average view pooling operation, $t_1$ is the serial number of the pooled feature map after the convolution and pooling operations, $\alpha^{t_1}$ is the weight of the pooled feature map with serial number $t_1$, exp denotes the exponential function with base e, and $z^{t_1}$ is the pooled feature map with serial number $t_1$;

12.3. Initialize convolution kernels of size 3000*1500 and 1500*750 for the first two fully connected layers, respectively, and set the activation function to ReLU; the weighted-average feature map produced by the weighted-average view pooling operation is input to the first two fully connected layers;

12.4. Initialize a 750*14 convolution kernel for the last fully connected layer and use Softmax as the classification function;

Step (13), randomly select 20 videos from the training set corresponding to the workflow videos of each perspective, input them into the multi-view three-dimensional convolutional neural network constructed in step (12) for feature training, and output the training error;

Step (14), randomly select 10 videos from the training set corresponding to the workflow videos of each perspective and input them into the multi-view three-dimensional convolutional neural network for verification, and obtain the accuracy rate of classification and recognition of the multi-view three-dimensional convolutional neural network;

Step (15), repeat steps (13) to (14), decreasing the number of training rounds by 1 on each repetition, until the number of training rounds reaches 0, thereby obtaining a trained multi-view three-dimensional convolutional neural network;

Step (16), testing the multi-view three-dimensional convolutional neural network in step (15) using the test set corresponding to the workflow video of each perspective;

Step (17), for the newly input workflow video, obtain the video resolution and the number of frames, and initialize the difference threshold between frames; set t=2;

Step (18), extract the center coordinates of the points of interest in two adjacent frames according to steps (4) to (8), and calculate the distance between the two center coordinates; if the distance is greater than the set threshold T, mark the state as the motion state S₁; otherwise, mark it as the relatively static state S₀;

Step (19), increase t by 1 and repeat step (18) until t is 1 less than the number of the last frame of the newly input workflow video, counting the lengths of the runs of consecutive S₀ and S₁; when a run of consecutive S₀ or S₁ has length greater than or equal to N, segment the target points of interest from the corresponding frames and store them in the frame queue; otherwise, discard the frames of that run;

Step (20), for each set of frames in the frame queue corresponding to a run of consecutive S₀ or S₁, extract consecutive key frames starting from the i-th frame, where i > 5, so that the number of key frames equals the number of frames of each classified video segment in the data set;

Step (21), input the video formed by the key frames of step (20), in their original order, into the multi-view three-dimensional convolutional neural network trained in step (15) to classify and identify employee behavior;

Step (22), comparing the behavior category obtained in step (21) with the predefined standard workflow.
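
The weighted-average view pooling formula of step 12.2 can be illustrated with a small Python sketch. It assumes each view's pooled feature map has been flattened into a list of floats of equal length; the function name `weighted_view_pool` and any sample values are hypothetical and not taken from the patent. Because each coefficient is exp(α) divided by the sum of all such coefficients, the operation amounts to a softmax over the per-view weights followed by a weighted sum of the feature maps.

```python
from math import exp


def weighted_view_pool(feature_maps, weights):
    """Fuse per-view feature maps: a = sum_v exp(alpha_v) * z_v / sum_v exp(alpha_v).

    `feature_maps` holds one flattened feature map (list of floats) per view;
    `weights` holds one scalar alpha per view.
    """
    if not feature_maps or len(feature_maps) != len(weights):
        raise ValueError("need one weight per view and at least one view")
    size = len(feature_maps[0])
    if any(len(z) != size for z in feature_maps):
        raise ValueError("all feature maps must have the same length")

    coeffs = [exp(alpha) for alpha in weights]  # unnormalized view coefficients
    total = sum(coeffs)                         # normalizing denominator
    return [
        sum(c * z[i] for c, z in zip(coeffs, feature_maps)) / total
        for i in range(size)
    ]
```

When all view weights are equal, the result reduces to the plain average of the views; a larger weight pulls the fused map toward the corresponding view.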