Disclosure of Invention
The invention aims to solve the technical problem of improving the accuracy of vehicle re-identification in the scene of collecting images by a plurality of different cameras.
The invention solves the above technical problem by the following technical means: a camera topology graph guided vehicle re-identification method, comprising the following steps:
Step one, constructing a training set and acquiring vehicle feature representations;
Step two, constructing a camera topology graph based on the vehicle feature representations;
Step three, constructing a topological relation between the feature representations of any two vehicles based on the camera topology graph, and inputting the topological relation into a graph convolutional network to obtain final aggregate features;
Step four, fusing the final aggregate features with the vehicle feature representations, and inputting the fused result into a fully connected layer for class prediction;
Step five, constructing a target loss function and training the graph convolutional network until the target loss function value is minimized, to obtain a trained graph convolutional network;
Step six, performing vehicle re-identification by using the trained graph convolutional network.
The advantages of the invention are as follows. A training set is first constructed to obtain the vehicle feature representations. A camera topology graph is then constructed, and the topological relation is input into the graph convolutional network to obtain the aggregate features. The two kinds of features are fused, and the class prediction result is obtained from the fused features. The whole process therefore considers both the original visual features, namely the vehicle feature representations, and the aggregate features obtained from the camera topology graph. As a result, when images are collected by a plurality of different cameras, both the differences between images collected by different cameras and the connections between adjacent cameras can be represented, so that the resulting feature vectors express the vehicle information accurately and the accuracy of vehicle re-identification is higher.
Further, the first step includes:
Constructing a training set $D = \{(x_i, y_i, c_i)\}_{i=1}^{N_T}$, wherein $x_i$ denotes the i-th image, $N_T$ denotes the total number of images in the training set, $y_i$ denotes the identity label of $x_i$, and $c_i$ denotes its camera label;
Inputting the training set into the vehicle representation model ResNet-50 to extract the vehicle feature representations $\{h_1, h_2, \ldots, h_N\}$, wherein $h_N$ denotes the feature representation of the N-th vehicle.
Further, the second step includes:
According to the vehicle feature representations, different cameras are taken as nodes and edges are constructed according to the various relations among the cameras, so that a camera topology graph $G = (V, E)$ is constructed, wherein $V = \{v_1, v_2, \ldots, v_{C_T}\}$ denotes the camera nodes, $v_{C_T}$ denotes the $C_T$-th camera node, and $E$ is the edge set of the camera topology graph, $E = \{E_{system}, E_{position}, E_{orientation}, E_{individual}\}$; $E_{system}$, $E_{position}$, $E_{orientation}$ and $E_{individual}$ respectively denote the edge sets constructed from the camera system, position, direction and identity relations, and the camera topology graphs based on the camera system, position, direction and identity are respectively denoted $G_{system}$, $G_{position}$, $G_{orientation}$ and $G_{individual}$.
Further, the third step includes:
The topological relation $a_{ij}$ between the feature representations $h_i$ and $h_j$ of any two vehicles, being the (i, j) entry of the adjacency matrix $A$, is expressed as $a_{ij} = E_{c_i c_j}$,
wherein $E_{c_i c_j}$ denotes the edge between the camera label $c_i$ of the i-th image and the camera label $c_j$ of the j-th image in the camera topology graph $G$.
Further, the working process of the graph convolutional network in step three is as follows:
Calculating a mask matrix by the formula $Mask_{ij} = 1$ if $j \in topk(Sim_{i,:})$ and $Mask_{ij} = 0$ otherwise, wherein topk denotes the top-k algorithm, $Sim_{ij}$ denotes the feature similarity between the i-th image and the j-th image, ":" denotes all samples, and $Sim_{i,:}$ denotes the comparison of the i-th sample with all samples;
Acquiring the aggregate features based on the mask matrix by the formula $h'_i = \sigma\left(\sum_j M h_j \, norm(Mask \odot A)_{ij}\right)$, wherein $\sigma$ denotes the ReLU activation function, $M$ denotes a learnable transformation matrix, norm denotes a normalization function, and $\odot$ denotes the element-wise product;
Weighting and updating the aggregate features to obtain the final aggregate features $h''_i$, wherein each camera has a learnable weight vector, and the d-th row of $M$ is scaled by the d-th element of the weight vector of the corresponding camera.
Further, the fourth step includes:
Concatenating the vehicle feature representation and the final aggregate feature by the formula $f_i = Concat(h_i, h''_i)$ to obtain the final vehicle features $\{f_1, f_2, \ldots, f_N\}$, wherein $h_i$ denotes the feature representation of the i-th vehicle, $h''_i$ denotes the final aggregate feature of the i-th vehicle, and $f_N$ denotes the final vehicle feature of the N-th vehicle; and inputting $f_i$ into the fully connected layer to obtain the class prediction result.
Further, the fifth step includes:
Constructing a first loss function, wherein $y_i$ denotes the identity label of the i-th image, FC denotes the fully connected layer, $\|\cdot\|$ denotes the L2-norm distance, $f_{i,p}$ and $f_{i,n}$ denote the positive and negative features of the i-th image $x_i$ in each mini-batch, and $m$ denotes the triplet margin;
Constructing a second loss function, wherein $S_i$ denotes the number of positive samples of the i-th image, and Softplus denotes the Softplus function used to obtain a non-negative value;
Constructing a target loss function;
Adjusting the parameters of the graph convolutional network and training it until the target loss function value is minimized, to obtain the trained graph convolutional network.
The invention also provides a camera topology graph guided vehicle re-identification device, comprising:
a feature representation module, configured to construct a training set and acquire vehicle feature representations;
a topology construction module, configured to construct a camera topology graph based on the vehicle feature representations;
a feature aggregation module, configured to construct a topological relation between the feature representations of any two vehicles based on the camera topology graph, and to input the topological relation into a graph convolutional network to obtain final aggregate features;
a class prediction module, configured to fuse the final aggregate features with the vehicle feature representations, and to input the fused result into a fully connected layer for class prediction;
a model training module, configured to construct a target loss function and train the graph convolutional network until the target loss function value is minimized, to obtain a trained graph convolutional network;
a re-identification module, configured to perform vehicle re-identification by using the trained graph convolutional network.
Further, the feature representation module is further configured to:
Constructing a training set $D = \{(x_i, y_i, c_i)\}_{i=1}^{N_T}$, wherein $x_i$ denotes the i-th image, $N_T$ denotes the total number of images in the training set, $y_i$ denotes the identity label of $x_i$, and $c_i$ denotes its camera label;
Inputting the training set into the vehicle representation model ResNet-50 to extract the vehicle feature representations $\{h_1, h_2, \ldots, h_N\}$, wherein $h_N$ denotes the feature representation of the N-th vehicle.
Further, the topology construction module is further configured to:
According to the vehicle feature representations, different cameras are taken as nodes and edges are constructed according to the various relations among the cameras, so that a camera topology graph $G = (V, E)$ is constructed, wherein $V = \{v_1, v_2, \ldots, v_{C_T}\}$ denotes the camera nodes, $v_{C_T}$ denotes the $C_T$-th camera node, and $E$ is the edge set of the camera topology graph, $E = \{E_{system}, E_{position}, E_{orientation}, E_{individual}\}$; $E_{system}$, $E_{position}$, $E_{orientation}$ and $E_{individual}$ respectively denote the edge sets constructed from the camera system, position, direction and identity relations, and the camera topology graphs based on the camera system, position, direction and identity are respectively denoted $G_{system}$, $G_{position}$, $G_{orientation}$ and $G_{individual}$.
Further, the feature aggregation module is further configured to:
The topological relation $a_{ij}$ between the feature representations $h_i$ and $h_j$ of any two vehicles, being the (i, j) entry of the adjacency matrix $A$, is expressed as $a_{ij} = E_{c_i c_j}$,
wherein $E_{c_i c_j}$ denotes the edge between the camera label $c_i$ of the i-th image and the camera label $c_j$ of the j-th image in the camera topology graph $G$.
Further, the working process of the graph convolutional network in the feature aggregation module is as follows:
Calculating a mask matrix by the formula $Mask_{ij} = 1$ if $j \in topk(Sim_{i,:})$ and $Mask_{ij} = 0$ otherwise, wherein topk denotes the top-k algorithm, $Sim_{ij}$ denotes the feature similarity between the i-th image and the j-th image, ":" denotes all samples, and $Sim_{i,:}$ denotes the comparison of the i-th sample with all samples;
Acquiring the aggregate features based on the mask matrix by the formula $h'_i = \sigma\left(\sum_j M h_j \, norm(Mask \odot A)_{ij}\right)$, wherein $\sigma$ denotes the ReLU activation function, $M$ denotes a learnable transformation matrix, norm denotes a normalization function, and $\odot$ denotes the element-wise product;
Weighting and updating the aggregate features to obtain the final aggregate features $h''_i$, wherein each camera has a learnable weight vector, and the d-th row of $M$ is scaled by the d-th element of the weight vector of the corresponding camera.
Further, the class prediction module is further configured to:
Concatenating the vehicle feature representation and the final aggregate feature by the formula $f_i = Concat(h_i, h''_i)$ to obtain the final vehicle features $\{f_1, f_2, \ldots, f_N\}$, wherein $h_i$ denotes the feature representation of the i-th vehicle, $h''_i$ denotes the final aggregate feature of the i-th vehicle, and $f_N$ denotes the final vehicle feature of the N-th vehicle; and inputting $f_i$ into the fully connected layer to obtain the class prediction result.
Further, the model training module is further configured to:
Constructing a first loss function, wherein $y_i$ denotes the identity label of the i-th image, FC denotes the fully connected layer, $\|\cdot\|$ denotes the L2-norm distance, $f_{i,p}$ and $f_{i,n}$ denote the positive and negative features of the i-th image $x_i$ in each mini-batch, and $m$ denotes the triplet margin;
Constructing a second loss function, wherein $S_i$ denotes the number of positive samples of the i-th image, and Softplus denotes the Softplus function used to obtain a non-negative value;
Constructing a target loss function;
Adjusting the parameters of the graph convolutional network and training it until the target loss function value is minimized, to obtain the trained graph convolutional network.
The advantages of the invention are as follows. A training set is first constructed to obtain the vehicle feature representations. A camera topology graph is then constructed, and the topological relation is input into the graph convolutional network to obtain the aggregate features. The two kinds of features are fused, and the class prediction result is obtained from the fused features. The whole process therefore considers both the original visual features, namely the vehicle feature representations, and the aggregate features obtained from the camera topology graph. As a result, when images are collected by a plurality of different cameras, both the differences between images collected by different cameras and the connections between adjacent cameras can be represented, so that the resulting feature vectors express the vehicle information accurately and the accuracy of vehicle re-identification is higher.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
FIG. 1 illustrates phenomena observed for a prior-art strong re-identification baseline model on the VeRi-776 dataset. Three phenomena were found. (1) The Rank-1 performance under the whole camera system is far higher than that under each individual camera, as shown in FIG. 1 (1), where the five-pointed star denotes the Rank-1 performance under the whole camera system. This shows that the Rank-1 performance of the prior-art method is exaggerated, because it retrieves only easy positive samples under the whole camera system and does not accurately hit the positive samples under each camera. (2) The mAP under the whole camera system is far lower than that under each individual camera, as shown in FIG. 1 (2), where the five-pointed star denotes the mAP under the whole camera system. This indicates that the positive samples under each camera are more tightly clustered than the positive samples under the whole camera system. (3) Removing the top-ranked samples significantly reduces re-identification performance, as shown in FIG. 1 (3). This shows that the re-identification performance obtained by the conventional method is sub-optimal and susceptible to camera interference. Furthermore, the information available for each identity under a single camera is limited. If the information of a vehicle can be aggregated across the whole camera system, the resulting information is sufficient and robust.
Thus, as shown in FIG. 2, the present invention provides a camera topology graph guided vehicle re-identification method that fully explores both easy positive samples and hard positive samples under the whole camera system, the method comprising:
S1, constructing a training set and acquiring vehicle feature representations, wherein the specific process is as follows:
Constructing a training set $D = \{(x_i, y_i, c_i)\}_{i=1}^{N_T}$, wherein $x_i$ denotes the i-th image, $N_T$ denotes the total number of images in the training set, $y_i$ denotes the identity label of $x_i$, and $c_i$ denotes its camera label;
Inputting the training set into the vehicle representation model ResNet-50 to extract the vehicle feature representations $\{h_1, h_2, \ldots, h_N\}$, wherein $h_N$ denotes the feature representation of the N-th vehicle.
S2, constructing a camera topology graph based on the vehicle feature representations, wherein the specific process is as follows:
As shown in FIG. 3, according to the vehicle feature representations, different cameras are taken as nodes and edges are constructed according to the various relations among the cameras, thereby constructing a camera topology graph $G = (V, E)$, wherein $V = \{v_1, v_2, \ldots, v_{C_T}\}$ denotes the camera nodes, $v_{C_T}$ denotes the $C_T$-th camera node, and $E$ is the edge set of the camera topology graph, $E = \{E_{system}, E_{position}, E_{orientation}, E_{individual}\}$; $E_{system}$, $E_{position}$, $E_{orientation}$ and $E_{individual}$ respectively denote the edge sets constructed from the camera system, position, direction and identity relations, and the camera topology graphs based on the camera system, position, direction and identity are respectively denoted $G_{system}$, $G_{position}$, $G_{orientation}$ and $G_{individual}$. In this embodiment, the camera topology graph is constructed based on a closed-circuit television (CCTV) camera system, where FIG. 3 (a) is a schematic diagram of the CCTV camera system and FIG. 3 (b) is the corresponding camera topology graph.
FIG. 4 shows the camera topology graphs based on the camera system, position, direction and identity (individual), respectively. $G_{system}$ denotes the camera topology graph based on the camera system; by default, neighboring nodes are connected to one another in turn, as shown in FIG. 4 (a).
$G_{position}$ denotes the camera topology graph based on camera position. Cameras at consecutive intersections are first defined as spatially adjacent nodes. According to the camera positions in the CCTV camera system (FIG. 3 (b)), camera5, camera7 and camera8 are regarded as adjacent nodes, and an edge exists between these adjacent nodes, as shown in FIG. 4 (b). The camera relation of $G_{position}$ is easier than that of $G_{system}$ because it requires positive samples from neighboring cameras to present a consistent feature representation. Since a continuously moving vehicle can be captured by two adjacent cameras, $G_{position}$ conforms to the logic of vehicle travel. $G_{position}$ lets a sample interact with the positive samples under adjacent cameras.
$G_{orientation}$ denotes the camera topology graph based on camera direction. The more consistent the camera directions, the more consistent the appearance of the positive samples. As shown in FIG. 4 (c), a solid line indicates an edge determined by the positional relation between cameras, and a broken line indicates an edge determined by the directional relation between cameras. Camera3 and camera4 are adjacent cameras, but because their directions differ, no edge determined by the directional relation exists between them. The camera relation of $G_{orientation}$ is easier than that of $G_{position}$ because it ignores irrelevant nodes according to camera direction. Notably, the present invention also defines two cameras whose directions are orthogonal as neighboring cameras, such as camera5 and camera7 in FIG. 4 (c). $G_{orientation}$ lets a sample interact with the positive samples under cameras facing the same direction.
$G_{individual}$ denotes the camera topology graph based on individual cameras. A video sequence of the target vehicle may be captured under the same camera. As shown in FIG. 4 (d), every camera has an edge to itself. The camera relation of $G_{individual}$ is the easiest, because intra-class images captured under the same camera tend to have a large information overlap. $G_{individual}$ lets a sample interact with the positive samples under the same camera.
Learning the camera system, position, direction and identity relations helps narrow the range of feature interaction in both the feature learning phase and the evaluation phase. The four subgraphs are used to construct the camera topology graph. In the camera topology graph $G = (V, E)$, the edge between two cameras may be denoted $E_{ij}$; the larger its value, the stronger the relation between the cameras. In each of the four subgraphs, the value is 1 if an edge exists between two nodes and 0 otherwise. The final goal is to obtain hierarchically aggregated topological features through the four topological relations between cameras. Such topological features are complementary to the visual features and make the final features more robust.
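The construction of the four 0/1 subgraphs can be illustrated with a minimal sketch. The camera count, the edge lists, and the element-wise sum used to combine the subgraphs are assumptions made only for illustration; the text states that a larger edge value indicates a stronger relation but does not specify the combination rule. The helper names `build_subgraph` and `combine_subgraphs` are hypothetical.

```python
from typing import Iterable, List, Tuple

Matrix = List[List[int]]


def build_subgraph(num_cameras: int, edges: Iterable[Tuple[int, int]]) -> Matrix:
    """Return a symmetric 0/1 adjacency matrix for one camera relation."""
    adj = [[0] * num_cameras for _ in range(num_cameras)]
    for a, b in edges:
        adj[a][b] = 1
        adj[b][a] = 1
    return adj


def combine_subgraphs(subgraphs: Iterable[Matrix]) -> Matrix:
    """Combine subgraphs by element-wise sum (an assumed rule, not from the source)."""
    graphs = list(subgraphs)
    n = len(graphs[0])
    return [[sum(g[i][j] for g in graphs) for j in range(n)] for i in range(n)]


# Hypothetical 4-camera layout; the edges are illustrative only.
num_cameras = 4
g_system = build_subgraph(num_cameras, [(0, 1), (1, 2), (2, 3)])
g_position = build_subgraph(num_cameras, [(0, 1), (2, 3)])
g_orientation = build_subgraph(num_cameras, [(0, 1)])
g_individual = build_subgraph(num_cameras, [(i, i) for i in range(num_cameras)])

# E[i][j] grows with the number of relations shared by cameras i and j.
E = combine_subgraphs([g_system, g_position, g_orientation, g_individual])
```

Under this assumed rule, cameras 0 and 1 share three relations and therefore receive the largest edge value, while each camera keeps a self-edge from the individual subgraph.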
S3, constructing a topological relation between the feature representations of any two vehicles based on the camera topology graph, and inputting the topological relation into a graph convolutional network to obtain the final aggregate features, wherein the specific process is as follows:
To embed the topological relation into the feature representations, the topological relation between cameras is converted into a relation between sample pairs. An adjacency matrix $A$ between the visual features, that is, between the above vehicle feature representations, is created using the camera topology graph guided by the CCTV camera system. The topological relation $a_{ij}$ between the feature representations $h_i$ and $h_j$ of any two vehicles, being the (i, j) entry of $A$, is expressed as $a_{ij} = E_{c_i c_j}$,
wherein $E_{c_i c_j}$ denotes the edge between the camera label $c_i$ of the i-th image and the camera label $c_j$ of the j-th image in the camera topology graph $G$.
As can be seen from the above formula, the feature relation between samples is represented by the camera relation between samples. This is because the stronger the camera relation between samples, the greater the overlap between the vehicle images. However, this process involves many uncorrelated samples and adds a significant computational burden.
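The conversion from a camera-level graph to a sample-level adjacency matrix can be sketched as a simple lookup. This is a minimal sketch assuming that each sample carries an integer camera label and that the camera graph is stored as a square matrix; the function name `sample_adjacency` and the numeric values are hypothetical.

```python
from typing import List, Sequence


def sample_adjacency(camera_labels: Sequence[int],
                     camera_graph: Sequence[Sequence[int]]) -> List[List[int]]:
    """Entry (i, j) is the camera-graph edge between the cameras of samples i and j."""
    return [[camera_graph[ci][cj] for cj in camera_labels] for ci in camera_labels]


# Hypothetical 3-camera graph and 4 samples; values are illustrative only.
camera_graph = [
    [1, 3, 0],
    [3, 1, 2],
    [0, 2, 1],
]
camera_labels = [0, 1, 1, 2]
A = sample_adjacency(camera_labels, camera_graph)
```

Samples 1 and 2 come from the same camera, so their entry is the self-edge of that camera; samples 0 and 3 come from unconnected cameras, so their entry is 0.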
To discard uncorrelated samples and reduce the amount of computation, a mask matrix is introduced. If two vehicle images are visually adjacent in the feature space, they are likely to be correlated. To this end, a k-nearest-neighbor mask, Mask, is calculated from the visual similarity; it keeps the top k values of each row of the similarity matrix. Specifically, the mask matrix is calculated by the formula $Mask_{ij} = 1$ if $j \in topk(Sim_{i,:})$ and $Mask_{ij} = 0$ otherwise, wherein topk denotes the top-k algorithm, $Sim_{ij}$ denotes the feature similarity between the i-th image and the j-th image, ":" denotes all samples, and $Sim_{i,:}$ denotes the comparison of the i-th sample with all samples. The top-k algorithm is an existing algorithm that finds the K largest numbers in an unordered sequence of N numbers; in this embodiment, the K samples with the highest similarity are found by comparing the i-th sample with all samples, and the details of the algorithm are not described here.
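A k-nearest-neighbor mask of this kind can be sketched as follows. The use of cosine similarity is an assumption, since the text only refers to "visual similarity"; the function names and the toy features are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topk_mask(features: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Set mask[i][j] = 1 for the k samples j most similar to sample i, else 0."""
    n = len(features)
    sim = [[cosine_similarity(features[i], features[j]) for j in range(n)]
           for i in range(n)]
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        top = sorted(range(n), key=lambda j: sim[i][j], reverse=True)[:k]
        for j in top:
            mask[i][j] = 1
    return mask


# Toy features: samples 0 and 1 point one way, samples 2 and 3 another.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mask = topk_mask(features, k=2)
```

With k = 2, each sample keeps itself and its single closest neighbor, so the mask splits the four samples into two pairs.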
Based on the mask matrix, the aggregate features are obtained by the formula $h'_i = \sigma\left(\sum_j M h_j \, norm(Mask \odot A)_{ij}\right)$, wherein $\sigma$ denotes the ReLU activation function, $M$ denotes a learnable transformation matrix, norm denotes a normalization function, and $\odot$ denotes the element-wise product. Introducing the mask matrix Mask into the weighted transformation ensures that feature aggregation occurs only among neighboring cameras, which increases the attention paid to more relevant images.
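The masked aggregation can be sketched in plain Python. This is a minimal sketch assuming that norm is a row normalization (each row of Mask ⊙ A is divided by its sum), which the text does not specify; the matrices below are toy values and the function names are hypothetical.

```python
from typing import List, Sequence


def row_normalize(mat: Sequence[Sequence[float]]) -> List[List[float]]:
    """Divide each row by its sum; all-zero rows stay zero (assumed form of norm)."""
    out = []
    for row in mat:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def aggregate(features, M, mask, A) -> List[List[float]]:
    """Compute ReLU(sum_j M h_j * norm(mask * A)[i][j]) for every sample i."""
    n = len(features)
    weights = row_normalize(
        [[mask[i][j] * A[i][j] for j in range(n)] for i in range(n)]
    )
    # Apply the transformation matrix M to every feature vector first.
    transformed = [
        [sum(M[r][c] * h[c] for c in range(len(h))) for r in range(len(M))]
        for h in features
    ]
    return [
        [max(0.0, sum(weights[i][j] * transformed[j][r] for j in range(n)))
         for r in range(len(M))]
        for i in range(n)
    ]


features = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]             # identity, for readability
mask = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # toy k-nearest-neighbor mask
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]     # toy sample adjacency
h_prime = aggregate(features, M, mask, A)
```

Sample 1 is adjacent to sample 2 in A, but the mask removes that pair, so sample 1 aggregates only samples 0 and 1.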
Although the above formula yields more robust aggregate features while reducing computational complexity, such an aggregation process may introduce unwanted camera noise. To solve this problem, a learnable camera memory matrix is designed; the memory matrix is used to weight the transformation matrix $M$, so that transformation matrices for the different cameras are stored. Specifically, the aggregate features are weighted and updated to obtain the final aggregate features $h''_i$, wherein each camera has a learnable weight vector stored in the memory matrix, and the d-th row of $M$ is scaled by the d-th element of the weight vector of the corresponding camera.
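The row-scaling operation described here can be illustrated in isolation. This is only a sketch of the scaling step: the full update formula is not reproduced, and which camera's weight vector applies to each term of the aggregation is left open. The `memory` mapping, the function names, and all numeric values are hypothetical.

```python
from typing import Dict, List, Sequence


def scale_rows(M: Sequence[Sequence[float]], w: Sequence[float]) -> List[List[float]]:
    """Scale row d of M by the d-th element of the camera weight vector w."""
    return [[w[d] * v for v in M[d]] for d in range(len(M))]


def camera_specific_transform(h: Sequence[float],
                              M: Sequence[Sequence[float]],
                              memory: Dict[int, Sequence[float]],
                              camera: int) -> List[float]:
    """Apply the transformation matrix re-weighted for the given camera to h."""
    M_cam = scale_rows(M, memory[camera])
    return [sum(M_cam[r][c] * h[c] for c in range(len(h))) for r in range(len(M_cam))]


M = [[1.0, 2.0], [3.0, 4.0]]
# One weight vector per camera; in practice these would be learned.
memory = {0: [1.0, 1.0], 1: [0.5, 2.0]}
out = camera_specific_transform([1.0, 1.0], M, memory, camera=1)
```

Storing one vector per camera rather than one full matrix per camera keeps the number of extra parameters proportional to the output dimension.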
S4, fusing the final aggregate features with the vehicle feature representations, and inputting the fused result into a fully connected layer for class prediction, wherein the specific process is as follows:
In the graph convolutional network based on the camera topology graph, the visual features are transformed into topological features, namely the final aggregate features, through the adjacency relations and camera-specific transformation matrices. The graph convolutional network based on the camera topology graph is used to learn a cross-camera representation so as to obtain more discriminative vehicle features. This network aggregates only a manageable set of neighbor nodes and learns different weight matrices for different cameras. It retains the ability of a conventional graph convolutional network to let graph nodes interact, and additionally learns the different camera topological relations. In addition, the vehicle feature representation is concatenated with the final aggregate feature by the formula $f_i = Concat(h_i, h''_i)$ to obtain the final vehicle features $\{f_1, f_2, \ldots, f_N\}$, wherein $h_i$ denotes the feature representation of the i-th vehicle, $h''_i$ denotes the final aggregate feature of the i-th vehicle, and $f_N$ denotes the final vehicle feature of the N-th vehicle; $f_i$ is input into the fully connected layer to obtain the class prediction result. In practical applications, as shown in FIG. 2, the vehicle feature representation and the final aggregate feature may each first be passed through a hidden layer, after which the two are fused by the formula $f_i = Concat(h_i, h''_i)$.
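The fusion and class prediction step can be sketched as a concatenation followed by one linear layer. The weights, bias, and feature values below are toy assumptions, and the optional hidden layers are omitted; the function names are hypothetical.

```python
from typing import List, Sequence


def concat(h: Sequence[float], h_agg: Sequence[float]) -> List[float]:
    """f_i = Concat(h_i, h''_i): visual feature followed by aggregate feature."""
    return list(h) + list(h_agg)


def fully_connected(f: Sequence[float],
                    W: Sequence[Sequence[float]],
                    b: Sequence[float]) -> List[float]:
    """One linear layer producing a logit per class."""
    return [sum(w * x for w, x in zip(row, f)) + bias for row, bias in zip(W, b)]


def predict_class(f, W, b) -> int:
    """Return the index of the largest logit."""
    logits = fully_connected(f, W, b)
    return max(range(len(logits)), key=lambda c: logits[c])


h_i = [1.0, 0.0]      # toy visual feature
h_agg_i = [0.5, 0.5]  # toy final aggregate feature
f_i = concat(h_i, h_agg_i)

# Toy 3-class layer; real weights would be learned during training.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
b = [0.0, 0.0, 0.5]
predicted = predict_class(f_i, W, b)
```

Because the fused vector keeps both parts side by side, the linear layer can weight visual and topological evidence independently.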
S5, constructing a target loss function and training the graph convolutional network until the target loss function value is minimized, to obtain the trained graph convolutional network, wherein the specific process is as follows:
Constructing a first loss function, wherein $y_i$ denotes the identity label of the i-th image, FC denotes the fully connected layer, $\|\cdot\|$ denotes the L2-norm distance, $f_{i,p}$ and $f_{i,n}$ denote the hardest positive and negative features of the i-th image $x_i$ in each mini-batch, and $m$ denotes the triplet margin. Although the first loss function is widely used in the field of vehicle re-identification, it is limited in that it cannot take the topological relation among samples into account.
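The symbols listed for the first loss function suggest an identity classification term on the output of the fully connected layer together with a hardest-sample triplet term. The sketch below assumes a standard cross-entropy term and a batch-hard triplet term, shown separately; the exact form and weighting used by the invention are not reproduced here, and the function names and values are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Numerically stable softmax cross-entropy for one sample."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def l2_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hard_triplet(features, labels, i: int, margin: float) -> float:
    """max(0, d(f_i, hardest positive) - d(f_i, hardest negative) + margin)."""
    pos = [l2_distance(features[i], features[j])
           for j in range(len(features)) if j != i and labels[j] == labels[i]]
    neg = [l2_distance(features[i], features[j])
           for j in range(len(features)) if labels[j] != labels[i]]
    if not pos or not neg:
        return 0.0  # no valid triplet for this anchor in the batch
    # Hardest positive is the farthest one; hardest negative is the closest one.
    return max(0.0, max(pos) - min(neg) + margin)


features = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
labels = [0, 0, 1]
triplet_term = hard_triplet(features, labels, i=0, margin=0.3)
identity_term = cross_entropy([0.0, 0.0], label=0)
```

In this toy batch the positive is farther from the anchor than the negative, so the triplet term is positive and would push the embedding to correct the ordering.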
Therefore, the invention proposes a new topological cross-entropy loss based on the topological relation. It encourages positive samples to cluster from strong to weak, optimizes the topological relation among positive samples, and allows the whole network to be trained end to end. The topological cross-entropy loss is also the key to aggregating vehicles under adjacent cameras and makes the aggregation process more effective and efficient. Specifically, a second loss function is constructed, wherein $S_i$ denotes the number of positive samples of the i-th image, and Softplus denotes the Softplus function used to obtain a non-negative value;
Constructing a target loss function;
Adjusting the parameters of the graph convolutional network and training it until the target loss function value is minimized, to obtain the trained graph convolutional network.
S6, performing vehicle re-identification by using the trained graph convolutional network: images of a vehicle are input into ResNet-50 in real time to obtain the vehicle feature representations, a camera topology graph is constructed and input into the trained graph convolutional network, the output of the network is fused with the vehicle feature representations, and the fused result is input into the fully connected layer to obtain the class prediction result.
According to the above technical solution, a training set is first constructed to obtain the vehicle feature representations. A camera topology graph is then constructed, and the topological relation is input into the graph convolutional network to obtain the aggregate features. The two kinds of features are fused, and the class prediction result is obtained from the fused features. The whole process therefore considers both the original visual features, namely the vehicle feature representations, and the aggregate features obtained from the camera topology graph. As a result, when images are collected by a plurality of different cameras, both the differences between images collected by different cameras and the connections between adjacent cameras can be represented, so that the resulting feature vectors express the vehicle information accurately and the accuracy of vehicle re-identification is higher.
Example 2
Based on Embodiment 1, Embodiment 2 of the present invention further provides a camera topology graph guided vehicle re-identification device, the device comprising:
a feature representation module, configured to construct a training set and acquire vehicle feature representations;
a topology construction module, configured to construct a camera topology graph based on the vehicle feature representations;
a feature aggregation module, configured to construct a topological relation between the feature representations of any two vehicles based on the camera topology graph, and to input the topological relation into a graph convolutional network to obtain final aggregate features;
a class prediction module, configured to fuse the final aggregate features with the vehicle feature representations, and to input the fused result into a fully connected layer for class prediction;
a model training module, configured to construct a target loss function and train the graph convolutional network until the target loss function value is minimized, to obtain a trained graph convolutional network;
a re-identification module, configured to perform vehicle re-identification by using the trained graph convolutional network.
Specifically, the feature representation module is further configured to:
Constructing a training set $D = \{(x_i, y_i, c_i)\}_{i=1}^{N_T}$, wherein $x_i$ denotes the i-th image, $N_T$ denotes the total number of images in the training set, $y_i$ denotes the identity label of $x_i$, and $c_i$ denotes its camera label;
Inputting the training set into the vehicle representation model ResNet-50 to extract the vehicle feature representations $\{h_1, h_2, \ldots, h_N\}$, wherein $h_N$ denotes the feature representation of the N-th vehicle.
Specifically, the topology construction module is further configured to:
According to the vehicle feature representations, different cameras are taken as nodes and edges are constructed according to the various relations among the cameras, so that a camera topology graph $G = (V, E)$ is constructed, wherein $V = \{v_1, v_2, \ldots, v_{C_T}\}$ denotes the camera nodes, $v_{C_T}$ denotes the $C_T$-th camera node, and $E$ is the edge set of the camera topology graph, $E = \{E_{system}, E_{position}, E_{orientation}, E_{individual}\}$; $E_{system}$, $E_{position}$, $E_{orientation}$ and $E_{individual}$ respectively denote the edge sets constructed from the camera system, position, direction and identity relations, and the camera topology graphs based on the camera system, position, direction and identity are respectively denoted $G_{system}$, $G_{position}$, $G_{orientation}$ and $G_{individual}$.
Specifically, the feature aggregation module is further configured to:
The topological relation $a_{ij}$ between the feature representations $h_i$ and $h_j$ of any two vehicles, being the (i, j) entry of the adjacency matrix $A$, is expressed as $a_{ij} = E_{c_i c_j}$,
wherein $E_{c_i c_j}$ denotes the edge between the camera label $c_i$ of the i-th image and the camera label $c_j$ of the j-th image in the camera topology graph $G$.
More specifically, the working process of the graph convolutional network in the feature aggregation module is as follows:
Calculating a mask matrix by the formula $Mask_{ij} = 1$ if $j \in topk(Sim_{i,:})$ and $Mask_{ij} = 0$ otherwise, wherein topk denotes the top-k algorithm, $Sim_{ij}$ denotes the feature similarity between the i-th image and the j-th image, ":" denotes all samples, and $Sim_{i,:}$ denotes the comparison of the i-th sample with all samples;
Acquiring the aggregate features based on the mask matrix by the formula $h'_i = \sigma\left(\sum_j M h_j \, norm(Mask \odot A)_{ij}\right)$, wherein $\sigma$ denotes the ReLU activation function, $M$ denotes a learnable transformation matrix, norm denotes a normalization function, and $\odot$ denotes the element-wise product;
Weighting and updating the aggregate features to obtain the final aggregate features $h''_i$, wherein each camera has a learnable weight vector, and the d-th row of $M$ is scaled by the d-th element of the weight vector of the corresponding camera.
Specifically, the class prediction module is further configured to:
Concatenating the vehicle feature representation and the final aggregate feature by the formula $f_i = Concat(h_i, h''_i)$ to obtain the final vehicle features $\{f_1, f_2, \ldots, f_N\}$, wherein $h_i$ denotes the feature representation of the i-th vehicle, $h''_i$ denotes the final aggregate feature of the i-th vehicle, and $f_N$ denotes the final vehicle feature of the N-th vehicle; and inputting $f_i$ into the fully connected layer to obtain the class prediction result.
Specifically, the model training module is further configured to:
Constructing a first loss function, wherein $y_i$ denotes the identity label of the i-th image, FC denotes the fully connected layer, $\|\cdot\|$ denotes the L2-norm distance, $f_{i,p}$ and $f_{i,n}$ denote the positive and negative features of the i-th image $x_i$ in each mini-batch, and $m$ denotes the triplet margin;
Constructing a second loss function, wherein $S_i$ denotes the number of positive samples of the i-th image, and Softplus denotes the Softplus function used to obtain a non-negative value;
Constructing a target loss function;
Adjusting the parameters of the graph convolutional network and training it until the target loss function value is minimized, to obtain the trained graph convolutional network.
The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that modifications may be made to the technical solution described in the foregoing embodiments or equivalents may be substituted for parts of the technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solution of the embodiments of the present invention in essence.