CN114996515A

CN114996515A - Training method of video feature extraction model, text generation method and device

Info

Publication number: CN114996515A
Application number: CN202210615076.XA
Authority: CN
Inventors: 林和政; 吴翔宇
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-09-02
Anticipated expiration: 2042-05-31
Also published as: CN114996515B

Abstract

The present disclosure relates to a training method for a video feature extraction model, a text generation method, and corresponding apparatuses, and belongs to the field of computer technology. In the embodiments of the present disclosure, the video feature extraction model is trained using the image information and text information of a sample video together with the text label and image label of the sample video, which provides a model training method based on dual training tasks. With the text generation task as the main task and the image reconstruction task as the auxiliary task, because the image label of the sample video represents the image reconstruction feature, the training process improves the model's ability to extract image features and thus yields high-quality image features. On the basis of high-quality image features, a video feature extraction model with better text generation capability can be trained, which improves the training effect of the video feature extraction model.

Description

Training method, text generation method and device for video feature extraction model

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a training method for a video feature extraction model, a text generation method, and corresponding apparatuses.

Background

With the rapid development of computer technology and Internet technology, video processing has gradually become an emerging research hotspot. Video processing usually requires extracting video features that characterize the video content, and then using those features for tasks such as video recommendation, video classification, or video search.

At present, before features are extracted from a video, a video classification model is usually trained on the image information of multiple sample videos and the category labels of those sample videos. The trained video classification model is then used to process the images in a video to obtain the category features of the video. However, such a video classification model has weak feature extraction capability, which hinders subsequent video recommendation, video classification, or video search.

Summary of the Invention

The present disclosure provides a training method for a video feature extraction model, a text generation method, and corresponding apparatuses, which can train a video feature extraction model with better text generation capability and improve the training effect of the video feature extraction model. The technical solutions of the present disclosure are as follows:

According to a first aspect of the embodiments of the present disclosure, a training method for a video feature extraction model is provided, the method comprising:

obtaining image information, text information, an image label, and a text label of a sample video, where the image label represents an image reconstruction feature and the text label represents the content description text of the sample video;

inputting the image information and the text information into the video feature extraction model; performing feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the sample video; processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video; and fusing the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the sample video;

performing image restoration on the image features in the fusion feature through the image reconstruction sub-model of the video feature extraction model to obtain an image training result of the original image size, and processing the fusion feature through the text generation sub-model of the video feature extraction model to obtain a text training result;

adjusting, based on the image training result, the text training result, and the image label and text label of the sample video, the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model, so as to train the video feature extraction model.

In the embodiments of the present disclosure, the video feature extraction model is trained using the image information and text information of the sample video together with the text label and image label of the sample video. Constructing an image feature extraction sub-model in the video feature extraction model allows the image features of the sample video to be extracted accurately. Constructing a feature fusion sub-model in the video feature extraction model makes it possible not only to obtain the text features of the sample video but also to fuse the image features and text features of the sample video. Based on the fusion feature, on the one hand, image reconstruction can be performed on the sample video to obtain high-quality image features; on the other hand, the content description text of the sample video can be generated. This provides a model training method based on dual training tasks. With the text generation task as the main task and the image reconstruction task as the auxiliary task, because the image label of the sample video represents the image reconstruction feature, the training process improves the video feature extraction model's ability to extract image features and thus yields high-quality image features. On the basis of high-quality image features, a video feature extraction model with better text generation capability can be trained, which improves the training effect of the video feature extraction model.

In some embodiments, the process of obtaining the image information of the sample video includes at least one of the following:

obtaining the cover image of the sample video; or obtaining at least one frame of image in the sample video.

In the embodiments of the present disclosure, the image information of the sample video can be obtained quickly from either the cover image of the sample video or the image frames included in the sample video. This ensures the efficiency of obtaining image information, enriches the types of image information, and makes obtaining image information more flexible.

In some embodiments, the process of obtaining the text information of the sample video includes at least one of the following:

obtaining the description information of the sample video; obtaining the title information of the sample video; obtaining the subtitle information of the sample video; obtaining a text recognition result of the sample video, where the text recognition result is the result of performing text recognition on at least one frame of image in the sample video; obtaining an audio recognition result of the sample video, where the audio recognition result is the result of performing audio recognition on the background audio of the sample video.

In the embodiments of the present disclosure, the text information of the sample video can be obtained quickly from the description information, title information, subtitle information, text recognition result, or audio recognition result of the sample video. This ensures the efficiency of obtaining text information, enriches the types of text information, and makes obtaining text information more flexible.

In some embodiments, the content description text is at least one of content category description text, content form description text, content topic description text, and content detail description text.

In the embodiments of the present disclosure, setting multiple types of content description text makes it possible, on the one hand, to generate more expressive content description text. On the other hand, the multiple types describe the video content from different dimensions, which enriches the types of generated content description text and characterizes the video more fully and completely.

In some embodiments, fusing the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the sample video includes any one of the following:

processing the image features and the text features through a self-attention layer included in the feature fusion sub-model to obtain the fusion feature of the sample video;

processing the image features and the text features through a deep belief network included in the feature fusion sub-model to obtain the fusion feature of the sample video.

In the embodiments of the present disclosure, a self-attention layer or a deep belief network is set in the feature fusion sub-model, and the self-attention mechanism or the deep belief network is used to perform feature fusion. This yields features that better represent the video and improves the accuracy of feature fusion.
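The self-attention variant of feature fusion can be illustrated with a minimal sketch. It assumes that image features and text features are already sequences of equal-width vectors, and it uses the vectors themselves as queries, keys, and values (identity projections); a trained fusion layer would use learned projections, and the function names here are hypothetical rather than taken from the disclosure.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_fuse(image_tokens, text_tokens):
    """Fuse image and text feature vectors with one self-attention pass.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), so no learned weights are involved.
    """
    tokens = image_tokens + text_tokens  # joint sequence of both modalities
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Each fused vector is an attention-weighted average of all tokens.
        fused.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return fused


image_tokens = [[1.0, 0.0], [0.0, 1.0]]  # toy image features
text_tokens = [[0.5, 0.5]]               # toy text features
fused = self_attention_fuse(image_tokens, text_tokens)
```

Because every output vector attends over both the image tokens and the text tokens, each position in `fused` mixes information from the two modalities.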

In some embodiments, processing the fusion feature through the text generation sub-model of the video feature extraction model to obtain the text training result includes:

processing the fusion feature through a self-attention layer included in the text generation sub-model to obtain the text training result.

In the embodiments of the present disclosure, a self-attention layer is set in the text generation sub-model, and the self-attention mechanism is used to generate the content description text, which improves the accuracy of text generation.

In some embodiments, the text training result includes multiple types of content description text;

before the fusion feature is processed through the text generation sub-model of the video feature extraction model to obtain the text training result, the method further includes:

adding the type identifier of each type to the fusion feature;

processing the fusion feature through the text generation sub-model of the video feature extraction model to obtain the text training result includes:

inputting the fusion feature with the type identifiers added into the text generation sub-model, and processing the fusion feature through the text generation sub-model, based on the processing mechanism corresponding to each type identifier, to obtain the multiple types of content description text.

In the embodiments of the present disclosure, the type identifier of each type is added to the fusion feature so that the text generation sub-model in the video feature extraction model can, based on each type identifier, trigger the generation of multiple types of content description text for the sample video, which ensures that text generation proceeds smoothly.
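One way to picture the type identifiers is as a tag attached to the fusion feature before it enters the text generation sub-model, with one generation pass per type. The sketch below is a hedged illustration only: the identifier values, the dictionary layout, and the stub generator are hypothetical, and the disclosure does not specify how the identifier is encoded.

```python
# Hypothetical identifiers for the four content description types.
TYPE_IDS = {"category": 0, "form": 1, "topic": 2, "detail": 3}


def tag_fused_feature(fused_feature, type_name):
    """Attach the identifier of one description type to the fusion feature."""
    return {"type_id": TYPE_IDS[type_name], "feature": list(fused_feature)}


def generate_descriptions(fused_feature, generator):
    """Run the generator once per type identifier and collect the outputs."""
    return {
        name: generator(tag_fused_feature(fused_feature, name))
        for name in TYPE_IDS
    }


def stub_generator(tagged):
    """Stand-in for the text generation sub-model (assumption, not the real model)."""
    return f"text for type {tagged['type_id']} from {len(tagged['feature'])} features"


descriptions = generate_descriptions([0.1, 0.2, 0.3], stub_generator)
```

The same fusion feature is reused for every type; only the identifier changes, which is what lets a single sub-model produce several kinds of description.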

In some embodiments, there are multiple pieces of text information;

before the text information is processed through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video, the method further includes:

splicing the multiple pieces of text information to obtain the spliced text information;

performing, based on the spliced text information, the step of processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.

In the embodiments of the present disclosure, when there are multiple pieces of text information, they are spliced together and the spliced text information is used to extract the text features. This takes multiple types of text information into account and improves the accuracy of text feature extraction.
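The splicing step can be sketched as plain string concatenation. The separator, the skipping of empty entries, and the ordering of the sources are assumptions made for illustration; the disclosure only states that the pieces of text information are spliced.

```python
def splice_text_information(texts, separator=" "):
    """Join several pieces of text information into one string.

    Assumptions: empty or missing entries are skipped, surrounding
    whitespace is trimmed, and the caller decides the order.
    """
    parts = [text.strip() for text in texts if text and text.strip()]
    return separator.join(parts)


# Example sources: title, (empty) subtitle track, text recognition result.
spliced = splice_text_information(
    ["Cooking at home", "", "  Easy pasta recipe  "]
)
```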

In some embodiments, after the multiple pieces of text information are spliced to obtain the spliced text information, the method further includes:

extracting the first target number of characters from the spliced text information;

performing, based on the extracted characters, the step of processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.

In the embodiments of the present disclosure, the first target number of characters is extracted from the spliced text information, and the subsequent text feature extraction is performed on those characters. While still providing sufficient text information as input, this reduces the computation of the video feature extraction model and improves the efficiency of text feature extraction.
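Keeping only the first target number of characters amounts to a prefix slice. The following sketch assumes that a "character" is one Unicode code point and that shorter inputs are passed through unchanged; the function name and the example target number are hypothetical.

```python
def take_leading_characters(spliced_text, target_number):
    """Return at most the first `target_number` characters of the text."""
    if target_number < 0:
        raise ValueError("target_number must be non-negative")
    return spliced_text[:target_number]


truncated = take_leading_characters("Cooking at home Easy pasta recipe", 15)
```

Capping the length bounds the size of the embedding-layer input, which is where the reduction in computation comes from.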

In some embodiments, adjusting, based on the image training result, the text training result, and the image label and text label of the sample video, the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model, so as to train the video feature extraction model, includes:

in the i-th iteration of model training, determining the model loss value of the i-th iteration based on the image training result and text training result of the i-th iteration and the image label and text label of the sample video, where i is a positive integer greater than 1;

adjusting, based on the model loss value of the i-th iteration, the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model determined in the (i-1)-th iteration, and repeating the above training iteration until the training satisfies a target condition.

In the embodiments of the present disclosure, in every iteration of model training, the model loss value of that iteration is used to adjust the model parameters of each sub-model in the video feature extraction model. This improves the text generation capability of the video feature extraction model, so that a video feature extraction model with strong text generation capability is trained.
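The iterative procedure can be sketched as a generic loop. Several things here are assumptions and not stated in the disclosure: plain gradient descent as the update rule, a loss threshold or iteration cap as the target condition, one scalar parameter per sub-model, and a toy quadratic loss standing in for the real model loss.

```python
def train(params, loss_and_grads, learning_rate=0.1,
          loss_threshold=1e-4, max_iterations=1000):
    """Repeat: compute the loss, then adjust the previous iteration's parameters."""
    iteration = 0
    loss, grads = loss_and_grads(params)
    # Assumed target condition: loss small enough or iteration budget used up.
    while loss > loss_threshold and iteration < max_iterations:
        iteration += 1
        params = {
            name: value - learning_rate * grads[name]
            for name, value in params.items()
        }
        loss, grads = loss_and_grads(params)
    return params, loss, iteration


def toy_loss_and_grads(params):
    """Toy quadratic loss over all sub-model parameters (illustration only)."""
    loss = sum(value * value for value in params.values())
    grads = {name: 2.0 * value for name, value in params.items()}
    return loss, grads


initial_params = {
    "image_feature_extraction": 1.0,
    "feature_fusion": -1.0,
    "image_reconstruction": 0.5,
    "text_generation": 2.0,
}
final_params, final_loss, iterations = train(initial_params, toy_loss_and_grads)
```

All four sub-models are updated from the same loss value, which mirrors the joint adjustment described above.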

In some embodiments, determining the model loss value of the i-th iteration based on the image training result and text training result of the i-th iteration and the image label and text label of the sample video includes:

determining the image reconstruction loss value of the i-th iteration based on the image training result of the i-th iteration and the image label of the sample video, where the image reconstruction loss value represents the difference between the image training result and the image label;

determining the text generation loss value of the i-th iteration based on the text training result of the i-th iteration and the text label of the sample video, where the text generation loss value represents the difference between the text training result and the text label;

determining the model loss value of the i-th iteration based on the image reconstruction loss value and the text generation loss value.
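The two loss values above can be illustrated with common choices. The disclosure only says that each loss represents a difference between a training result and its label; mean squared error for the image branch and mean negative log-likelihood (cross-entropy) for the text branch are assumptions chosen for this sketch, as are the function names and the toy inputs.

```python
import math


def image_reconstruction_loss(reconstructed, image_label):
    """Mean squared error between two flat lists of pixel values (assumed metric)."""
    if len(reconstructed) != len(image_label):
        raise ValueError("reconstruction and label must have the same size")
    squared = [(r - t) ** 2 for r, t in zip(reconstructed, image_label)]
    return sum(squared) / len(squared)


def text_generation_loss(predicted_probs, label_token_ids):
    """Mean negative log-likelihood of the label tokens (assumed metric).

    predicted_probs[k] is the probability distribution over the vocabulary
    at position k; label_token_ids[k] is the index of the correct token.
    """
    log_likelihood = sum(
        math.log(probs[token]) for probs, token in zip(predicted_probs, label_token_ids)
    )
    return -log_likelihood / len(label_token_ids)


image_loss = image_reconstruction_loss([1.0, 3.0], [0.0, 1.0])
text_loss = text_generation_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```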

In some embodiments, determining the model loss value of the i-th iteration based on the image reconstruction loss value and the text generation loss value includes:

performing a weighted summation of the image reconstruction loss value and the text generation loss value, using the weight coefficient corresponding to each, to obtain the model loss value of the i-th iteration.

In the embodiments of the present disclosure, a weight coefficient is set for each of the image reconstruction task and the text generation task of the video feature extraction model, and the model loss value is determined from the loss value and weight coefficient of each task, which improves the accuracy of the model loss value.
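The weighted summation of the two task losses is a one-line formula. In the sketch below the default weight values are hypothetical; giving the text generation task the larger weight reflects its role as the main task, but the disclosure does not state concrete coefficients.

```python
def model_loss(image_loss, text_loss, image_weight=0.3, text_weight=0.7):
    """Weighted sum of the image reconstruction and text generation losses."""
    return image_weight * image_loss + text_weight * text_loss


# Example with explicit, hypothetical weight coefficients.
loss_value = model_loss(2.0, 1.0, image_weight=0.25, text_weight=0.75)
```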

Determining the text generation loss value of the i-th iteration based on the text training result of the i-th iteration and the text label of the sample video includes:

for any type, determining the loss value of the i-th iteration on that type based on the text training result of the i-th iteration on that type and the text label of the sample video on that type;

performing a weighted summation of the loss values of the i-th iteration on the multiple types, using the weight coefficients of the video feature extraction network on the multiple types, to obtain the text generation loss value of the i-th iteration.

In the embodiments of the present disclosure, a weight coefficient is set for each type involved in text generation, and the text generation loss value is determined from the loss value and weight coefficient of each type, which improves the accuracy of the text generation loss value.
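The per-type weighted summation can be sketched with two dictionaries keyed by type name. The type names, loss values, and weight coefficients below are illustrative placeholders, not values from the disclosure.

```python
def text_generation_loss(type_losses, type_weights):
    """Weighted sum of the per-type losses for one training iteration."""
    missing = set(type_losses) - set(type_weights)
    if missing:
        raise KeyError(f"no weight coefficient for types: {sorted(missing)}")
    return sum(type_weights[name] * loss for name, loss in type_losses.items())


# Hypothetical per-type losses and weight coefficients.
per_type_losses = {"category": 1.0, "topic": 2.0}
per_type_weights = {"category": 0.5, "topic": 0.25}
total_text_loss = text_generation_loss(per_type_losses, per_type_weights)
```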

In some embodiments, before the weighted summation of the loss values of the i-th iteration on the multiple types is performed, using the weight coefficients of the video feature extraction network on the multiple types, to obtain the text generation loss value of the i-th iteration, the method further includes:

for any type, determining the correct ratio of the i-th iteration on that type based on the number of correct texts and the total number of texts of the i-th iteration on that type, where the correct ratio represents the proportion of correct texts among all texts in the i-th iteration;

determining the weight coefficient of the video feature extraction network on that type based on the correct ratio of the i-th iteration on that type, where the correct ratio is negatively correlated with the weight coefficient.

In the embodiments of the present disclosure, the weight coefficient of each type involved in text generation is determined from the correct ratio of that type. Because the correct ratio represents the proportion of correct texts among all texts and is negatively correlated with the weight coefficient, when the text generation loss value is calculated, a type with a high correct ratio receives a smaller weight coefficient and a type with a low correct ratio receives a larger one. This improves the accuracy of the weight coefficients and therefore of the text generation loss value.
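A simple mapping that satisfies the negative correlation is to take one minus the correct ratio and normalize across types. This particular formula, the small floor that keeps every type in play, and the normalization are assumptions for illustration; the disclosure requires only that a higher correct ratio gives a smaller weight coefficient.

```python
def correct_ratio(correct_count, total_count):
    """Proportion of correct texts among all texts for one type."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return correct_count / total_count


def type_weight_coefficients(correct_counts, total_counts, floor=0.05):
    """Map per-type correct ratios to weights that decrease as the ratio grows.

    Assumed rule: raw weight = max(1 - ratio, floor), then normalize to sum to 1.
    """
    raw = {
        name: max(1.0 - correct_ratio(correct_counts[name], total_counts[name]), floor)
        for name in total_counts
    }
    norm = sum(raw.values())
    return {name: weight / norm for name, weight in raw.items()}


weights = type_weight_coefficients(
    {"category": 90, "detail": 50},   # correct texts per type (toy numbers)
    {"category": 100, "detail": 100}, # total texts per type (toy numbers)
)
```

With these toy counts, the "detail" type is right only half the time, so it ends up with the larger weight and contributes more to the text generation loss.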

According to a second aspect of the embodiments of the present disclosure, a text generation method based on a video feature extraction model is provided, where the video feature extraction model is trained with the training method of the first aspect or of any embodiment of the first aspect, the method comprising:

obtaining image information and text information of a target video;

inputting the image information and the text information into the video feature extraction model; performing feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the target video; processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the target video; and fusing the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the target video;

processing the fusion feature through the text generation sub-model of the video feature extraction model, outputting multiple characters that satisfy a text generation condition, and generating the content description text of the target video based on the multiple characters.

In the embodiments of the present disclosure, constructing an image feature extraction sub-model in the video feature extraction model allows the image features of the target video to be extracted accurately. Constructing a feature fusion sub-model in the video feature extraction model makes it possible not only to obtain the text features of the target video but also to fuse the image features and text features of the target video. Processing the fusion feature then outputs multiple characters that satisfy the text generation condition, from which the content description text of the target video is generated automatically. This provides a video feature extraction model based on text generation. The generated content description text is rich in information, characterizes the target video better, and improves the accuracy of video representation.
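The character-by-character output can be sketched as a greedy decoding loop. The disclosure does not define the text generation condition; this sketch assumes that the most probable character is taken at each step and that generation stops at an end marker or a length cap. The `scripted_model` stub stands in for the text generation sub-model and is purely hypothetical.

```python
def generate_text(next_char_probs, fused_feature, end_marker="<end>", max_length=32):
    """Greedy decoding: keep the most probable character until a stop condition."""
    chars = []
    while len(chars) < max_length:
        probs = next_char_probs(fused_feature, chars)
        best = max(probs, key=probs.get)  # assumed condition: highest probability
        if best == end_marker:
            break
        chars.append(best)
    return "".join(chars)


def scripted_model(fused_feature, chars_so_far):
    """Stub for the text generation sub-model that spells a fixed word."""
    script = list("pasta") + ["<end>"]
    target = script[min(len(chars_so_far), len(script) - 1)]
    return {target: 0.9, "?": 0.1}


description = generate_text(scripted_model, [0.1, 0.2])
```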

In some embodiments, the method further includes:

performing image restoration on the image features in the fusion feature through the image reconstruction sub-model of the video feature extraction model to obtain an image reconstruction feature of the original image size of the target video.

According to a third aspect of the embodiments of the present disclosure, a training apparatus for a video feature extraction model is provided, the apparatus comprising:

an obtaining unit, configured to obtain image information, text information, an image label, and a text label of a sample video, where the image label represents an image reconstruction feature and the text label represents the content description text of the sample video;

an input unit, configured to input the image information and the text information into the video feature extraction model, perform feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the sample video, process the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video, and fuse the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the sample video;

a processing unit, configured to perform image restoration on the image features in the fusion feature through the image reconstruction sub-model of the video feature extraction model to obtain an image training result of the original image size, and to process the fusion feature through the text generation sub-model of the video feature extraction model to obtain a text training result;

a training unit, configured to adjust, based on the image training result, the text training result, and the image label and text label of the sample video, the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model, so as to train the video feature extraction model.

In some embodiments, the obtaining unit is configured to perform at least one of the following:

In some embodiments, the input unit includes a processing sub-unit configured to perform any one of the following:

In some embodiments, the processing unit includes a text generation sub-unit configured to perform:

The apparatus further includes an adding unit, configured to add the type identifier of each type to the fusion feature;

The text generation sub-unit of the processing unit is further configured to input the fusion feature with the type identifiers added into the text generation sub-model, and to process the fusion feature through the text generation sub-model, based on the processing mechanism corresponding to each type identifier, to obtain the multiple types of content description text.

The apparatus further includes a splicing unit, configured to splice multiple pieces of the text information to obtain the spliced text information;

The input unit is further configured to perform, based on the spliced text information, the step of processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.

In some embodiments, the input unit is further configured to perform:

在一些实施例中，该训练单元，包括：In some embodiments, the training unit includes:

确定子单元，被配置为执行在模型训练的第i次迭代过程中，基于该第i次迭代过程的图像训练结果、文本训练结果以及该样本视频的图像标签、文本标签，确定该第i次迭代过程的模型损失值，该i为大于1的正整数；The determining subunit is configured to perform in the ith iteration process of model training, based on the image training result of the ith iteration process, the text training result and the image label and text label of the sample video, determine the ith time The model loss value of the iterative process, the i is a positive integer greater than 1;

调整子单元，被配置为执行基于该第i次迭代过程的模型损失值，对第i-1次迭代过程所确定的视频特征提取模型的模型参数进行调整，重复上述训练的迭代过程，直至训练满足目标条件。The adjustment subunit is configured to execute the model loss value based on the i-th iteration process, adjust the model parameters of the video feature extraction model determined in the i-1-th iteration process, and repeat the above-mentioned iterative process of training until training meet the target conditions.

In some embodiments, the determining subunit includes:

an image reconstruction loss value determination subunit configured to determine the image reconstruction loss value of the i-th iteration based on the image training result of the i-th iteration and the image label of the sample video, where the image reconstruction loss value represents the difference between the image training result and the image label;

a text generation loss value determination subunit configured to determine the text generation loss value of the i-th iteration based on the text training result of the i-th iteration and the text label of the sample video, where the text generation loss value represents the difference between the text training result and the text label;

a model loss value determination subunit configured to determine the model loss value of the i-th iteration based on the image reconstruction loss value and the text generation loss value.
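The relationship between the three loss values can be sketched as follows. This is only an illustration under stated assumptions: the use of mean squared error for the image reconstruction loss, the weighted-sum combination, and the parameter `aux_weight` are hypothetical choices, since the text above does not fix how the two losses are combined.

```python
def image_reconstruction_loss(pred, label):
    """Mean squared error between a reconstructed image and its label.

    Both arguments are flat lists of pixel values of equal length.
    MSE is an assumed measure of the difference between the two.
    """
    if not pred or len(pred) != len(label):
        raise ValueError("prediction and label must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, label)) / len(pred)


def model_loss(image_loss, text_loss, aux_weight=0.5):
    """Combine the two task losses into one model loss value.

    Text generation is the main task, so its loss keeps full weight;
    aux_weight (hypothetical) scales the auxiliary image reconstruction loss.
    """
    return text_loss + aux_weight * image_loss


img_loss = image_reconstruction_loss([0.0, 1.0, 0.5], [0.0, 0.5, 0.5])
total = model_loss(img_loss, text_loss=2.0)
print(img_loss, total)
```

Keeping the auxiliary weight below 1 reflects the main-task and auxiliary-task roles described in this disclosure, but the specific value is a tuning choice.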

In some embodiments, the model loss value determination subunit is configured to perform:

The text generation loss value determination subunit is configured to perform:

In some embodiments, the apparatus further includes a determining unit configured to perform:

for any type, determining the correct ratio of the i-th iteration on that type based on the number of correct texts and the total number of texts of the i-th iteration on that type, where the correct ratio is the proportion of correct texts among all texts in the i-th iteration;

determining the weight coefficient of the video feature extraction network on that type based on the correct ratio of the i-th iteration on that type, where the correct ratio is negatively correlated with the weight coefficient.
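The following sketch illustrates one way the correct ratio could be turned into a weight coefficient that decreases as the ratio increases, and how such weights could be used in a weighted sum of per-type losses. The mapping `1 - ratio` with normalisation, the small constant `eps`, the type names, and the numbers are all assumptions made for illustration.

```python
def correct_ratio(num_correct, num_total):
    """Proportion of correct texts among all texts for one type."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_correct / num_total


def type_weights(ratios, eps=1e-6):
    """Map per-type correct ratios to weights that fall as the ratio rises.

    Uses (1 - ratio) + eps, normalised to sum to 1. This mapping is an
    assumption; only the negative correlation comes from the text above.
    """
    raw = {t: (1.0 - r) + eps for t, r in ratios.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def text_generation_loss(losses, weights):
    """Weighted sum of the per-type loss values."""
    return sum(weights[t] * losses[t] for t in losses)


# Hypothetical counts for three content description types.
ratios = {
    "category": correct_ratio(90, 100),
    "form": correct_ratio(60, 100),
    "topic": correct_ratio(50, 100),
}
weights = type_weights(ratios)
loss = text_generation_loss({"category": 0.2, "form": 0.9, "topic": 1.1}, weights)
print(weights, loss)
```

With this mapping, the types on which the model currently performs worst contribute most to the text generation loss in the next iteration.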

According to a fourth aspect of the embodiments of the present disclosure, a text generation apparatus based on a video feature extraction model is provided, where the video feature extraction model is trained with the training method of the first aspect or any embodiment of the first aspect, and the apparatus includes:

an acquisition unit configured to acquire image information and text information of a target video;

an input unit configured to input the image information and the text information into the video feature extraction model, perform feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the target video, process the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the target video, and fuse the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the target video;

a processing unit configured to process the fusion feature through the text generation sub-model of the video feature extraction model, output multiple characters that satisfy a text generation condition, and generate the content description text of the target video based on the multiple characters.

In some embodiments, the processing unit is further configured to perform:

According to a fifth aspect of the embodiments of the present disclosure, a computer device is provided, the computer device including:

one or more processors;

a memory for storing program code executable by the processors;

where the processors are configured to execute the program code to implement the training method of the video feature extraction model of the first aspect or any embodiment of the first aspect, or the text generation method based on the video feature extraction model of the second aspect or any embodiment of the second aspect.

According to a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. When program code in the computer-readable storage medium is executed by a processor of a computer device, the computer device is enabled to perform the training method of the video feature extraction model of the first aspect or any embodiment of the first aspect, or the text generation method based on the video feature extraction model of the second aspect or any embodiment of the second aspect.

According to a seventh aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the training method of the video feature extraction model of the first aspect or any embodiment of the first aspect, or the text generation method based on the video feature extraction model of the second aspect or any embodiment of the second aspect.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. They do not unduly limit the present disclosure.

FIG. 1 is a schematic diagram of an implementation environment of a training method for a video feature extraction model according to an exemplary embodiment;

FIG. 2 is a flowchart of a training method for a video feature extraction model according to an exemplary embodiment;

FIG. 3 is a flowchart of a text generation method based on a video feature extraction model according to an exemplary embodiment;

FIG. 4 is a flowchart of a training method for a video feature extraction model according to an exemplary embodiment;

FIG. 5 shows a framework of a video feature extraction model according to an exemplary embodiment;

FIG. 6 is a flowchart of a text generation method based on a video feature extraction model according to an exemplary embodiment;

FIG. 7 is a block diagram of a training apparatus for a video feature extraction model according to an exemplary embodiment;

FIG. 8 is a block diagram of a text generation apparatus based on a video feature extraction model according to an exemplary embodiment;

FIG. 9 is a block diagram of a terminal according to an exemplary embodiment;

FIG. 10 is a block diagram of a server according to an exemplary embodiment.

Detailed Description

To help those of ordinary skill in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

It should be noted that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data used for analysis, stored data and displayed data) and signals involved in the embodiments of the present disclosure are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the image information and text information involved in the embodiments of the present disclosure are obtained with full authorization.

In some embodiments, the terminal provides a permission query page that asks the user whether to grant permission to obtain the image information and text information of a video. The permission query page displays a consent control and a refusal control. When a trigger operation by the user on the consent control is detected, the training method of the video feature extraction model provided by the embodiments of the present disclosure is used to obtain the image information and text information of the sample video, and the video feature extraction model is then trained based on that image information and text information.

FIG. 1 is a schematic diagram of an implementation environment of a training method for a video feature extraction model according to an exemplary embodiment. Referring to FIG. 1, the implementation environment includes a terminal 101 and a server 102.

The terminal 101 may be at least one of a smartphone, a smart watch, a desktop computer, a portable computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, a laptop computer and similar devices. The terminal 101 has a communication function and can access a wired or wireless network. The terminal 101 may refer generally to one of multiple terminals; this embodiment uses the terminal 101 only as an example. Those skilled in the art will appreciate that the number of terminals may be larger or smaller.

The server 102 may be an independent physical server, a server cluster or distributed file system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. In some embodiments, the server 102 and the terminal 101 are connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present disclosure. Optionally, the number of servers 102 may be larger or smaller, which is not limited in the embodiments of the present disclosure. The server 102 may also include other functional servers in order to provide more comprehensive and diversified services.

In some embodiments, the training method of the video feature extraction model provided by the embodiments of the present disclosure is performed by the terminal 101. For example, in response to a training operation on the video feature extraction model, the terminal 101 trains the video feature extraction model using the training method provided by the embodiments of the present disclosure. In other embodiments, the training method is performed jointly by the terminal 101 and the server 102. For example, in response to an upload operation on the training data of the video feature extraction model, the terminal 101 sends the training data to the server 102; the server 102 receives the training data uploaded by the terminal 101 and trains the video feature extraction model using the training method provided by the embodiments of the present disclosure.

It should be noted that the video feature extraction model provided by the embodiments of the present disclosure can be applied to video recommendation, video classification or video search scenarios. In some embodiments, the video feature extraction model trained according to the embodiments of the present disclosure is used to obtain the content description text of a video, and the obtained content description text is then used to implement video recommendation, video classification or video search.

FIG. 2 is a flowchart of a training method for a video feature extraction model according to an exemplary embodiment. As shown in FIG. 2, the method is performed by a computer device, which may be provided as the terminal or the server shown in FIG. 1. Illustratively, the method includes the following steps:

In step 201, the computer device acquires image information, text information, an image label and a text label of a sample video, where the image label represents an image reconstruction feature and the text label represents the content description text of the sample video.

In step 202, the computer device inputs the image information and the text information into the video feature extraction model, performs feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the sample video, processes the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video, and fuses the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the sample video.

In step 203, the computer device performs image restoration on the image features in the fusion feature through the image reconstruction sub-model of the video feature extraction model to obtain an image training result at the original image size, and processes the fusion feature through the text generation sub-model of the video feature extraction model to obtain a text training result.

In step 204, the computer device adjusts the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model based on the image training result, the text training result, and the image label and text label of the sample video, so as to train the video feature extraction model.

The technical solution provided by the embodiments of the present disclosure trains the video feature extraction model using the image information and text information of the sample video together with the text label and image label of the sample video. Building an image feature extraction sub-model into the video feature extraction model allows the image features of the sample video to be extracted accurately. Building a feature fusion sub-model into the video feature extraction model not only yields the text features of the sample video but also fuses its image features and text features. Based on the fusion feature, image reconstruction can then be performed on the sample video to obtain high-quality image features, and the content description text of the sample video can also be generated. This provides a model training method based on two training tasks. With the text generation task as the main task and the image reconstruction task as the auxiliary task, and because the image label of the sample video represents the image reconstruction feature, the training process improves the ability of the video feature extraction model to extract image features, so that high-quality image features are obtained. On that basis, a video feature extraction model with better text generation capability can be trained, which improves the training effect of the video feature extraction model.

acquiring a cover image of the sample video; or acquiring at least one frame of image in the sample video.

adding the type identifier of each type to the fusion feature;

for any type, determining the loss value of the i-th iteration on that type based on the text training result of the i-th iteration on that type and the text label of the sample video on that type;

In some embodiments, before the weighted summation is performed on the loss values of the i-th iteration on the multiple types and the weight coefficients of the video feature extraction network on the multiple types to obtain the text generation loss value of the i-th iteration, the method further includes:

FIG. 3 is a flowchart of a text generation method based on a video feature extraction model according to an exemplary embodiment. As shown in FIG. 3, the method is performed by a computer device, which may be provided as the terminal or the server shown in FIG. 1. Illustratively, the method includes the following steps:

In step 301, the computer device acquires image information and text information of a target video.

In step 302, the computer device inputs the image information and the text information into the video feature extraction model, performs feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the target video, processes the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the target video, and fuses the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the target video.

In step 303, the computer device processes the fusion feature through the text generation sub-model of the video feature extraction model, outputs multiple characters that satisfy a text generation condition, and generates the content description text of the target video based on the multiple characters.
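Step 303 can be pictured as a character-by-character generation loop. The sketch below is a simplified illustration under stated assumptions: greedy selection, the `<eos>` end marker and the `max_len` limit are assumed forms of the text generation condition, and `toy_scores` is a hypothetical stand-in for the text generation sub-model.

```python
END = "<eos>"  # hypothetical end-of-text marker


def generate_text(next_scores, max_len=20):
    """Greedy character-by-character generation.

    next_scores(prefix) returns a dict mapping candidate characters to
    scores. Generation stops when END scores highest or after max_len
    characters; both stopping rules are assumptions.
    """
    chars = []
    while len(chars) < max_len:
        scores = next_scores(chars)
        best = max(scores, key=scores.get)
        if best == END:
            break
        chars.append(best)
    return "".join(chars)


def toy_scores(prefix):
    """Stand-in for the text generation sub-model: spells a fixed word."""
    target = "cooking"
    if len(prefix) < len(target):
        return {target[len(prefix)]: 1.0, END: 0.1}
    return {END: 1.0, "x": 0.1}


print(generate_text(toy_scores))
```

In a real model, the scores would depend on the fusion feature as well as on the characters generated so far.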

In the technical solution provided by the embodiments of the present disclosure, building an image feature extraction sub-model into the video feature extraction model allows the image features of the target video to be extracted accurately. Building a feature fusion sub-model into the video feature extraction model not only yields the text features of the target video but also fuses its image features and text features. Subsequent processing based on the fusion feature outputs multiple characters that satisfy the text generation condition, and the content description text of the target video is generated automatically from those characters. This provides a video feature extraction model based on text generation. The generated content description text carries rich information and characterizes the target video better, which improves the accuracy of the video representation.

FIG. 2 and FIG. 3 above show only the basic flow of the present disclosure. The solution provided by the present disclosure is further described below based on a specific implementation. FIG. 4 is a flowchart of a training method for a video feature extraction model according to an exemplary embodiment. Referring to FIG. 4, the method includes:

In step 401, the computer device acquires image information, text information, an image label and a text label of a sample video, where the image label represents an image reconstruction feature and the text label represents the content description text of the sample video.

In the embodiments of the present disclosure, a sample video refers to a training video used to train the video feature extraction model. In some embodiments, there are multiple sample videos. It should be understood that the image information, text information, image labels and text labels of the sample videos are acquired as a training data set.

In some embodiments, the process by which the computer device acquires the image information of the sample video includes at least one of the following: acquiring a cover image of the sample video; or acquiring at least one frame of image in the sample video.

The at least one frame of image may be one frame, two frames or more than two frames, for example three frames. In some embodiments, the computer device acquires the at least one frame of image as follows: randomly extracting a preset number of images from the multiple frames of images included in the sample video, for example three randomly selected frames; or uniformly extracting a preset number of images from the multiple frames of images included in the sample video, for example three frames at equal intervals; or extracting key frames from the multiple frames of images included in the sample video. The computer device may also acquire at least one frame of image in the sample video in other ways, which is not limited in the embodiments of the present disclosure.
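The random and equal-interval extraction options can be sketched as index selection over the frames of a video. The function names and the choice of taking the centre of each segment are illustrative assumptions; key frame extraction is not shown because the text above does not specify how key frames are identified.

```python
import random


def sample_random(num_frames, k, seed=None):
    """Pick k distinct frame indices at random, returned in ascending order."""
    rng = random.Random(seed)
    k = min(k, num_frames)
    return sorted(rng.sample(range(num_frames), k))


def sample_uniform(num_frames, k):
    """Pick k frame indices at equal intervals.

    The video is split into k equal segments and the centre frame of
    each segment is taken (an assumed convention).
    """
    k = min(k, num_frames)
    if k <= 0:
        return []
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]


# A hypothetical 90-frame video, three frames per strategy.
print(sample_random(90, 3, seed=0))
print(sample_uniform(90, 3))
```

Equal-interval extraction gives reproducible coverage of the whole video, while random extraction adds variety across training epochs.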

In the above embodiments, the image information of the sample video can be acquired quickly from either the cover image of the sample video or the image frames included in the sample video. This keeps the acquisition of image information efficient, enriches the types of image information, and makes the acquisition of image information more flexible.

In some embodiments, the process by which the computer device acquires the text information of the sample video includes at least one of the following: acquiring description information of the sample video; acquiring title information of the sample video; acquiring subtitle information of the sample video; acquiring a text recognition result of the sample video; acquiring an audio recognition result of the sample video.

The description information is information describing the theme of the sample video, such as theme description information or topic description information (caption); alternatively, the description information is information describing the video content of the sample video, such as content description information (hashtag). The title information refers to the video title of the sample video. In some embodiments, the description information and the title information are set by the publisher of the sample video. For example, when the publisher publishes the sample video, the publisher's terminal provides a description information input box and a title information input box, through which the publisher can set the description information and title information of the sample video. The computer device can then obtain the description information and title information of the sample video together with the sample video itself.

The subtitle information refers to the subtitles (text) included in the images of the sample video. In some embodiments, the computer device extracts the subtitle information of the sample video using a subtitle extraction tool. The text recognition result is the result of performing text recognition on at least one frame of image in the sample video. In some embodiments, the computer device performs text recognition on the multiple frames of images included in the sample video using OCR (Optical Character Recognition) technology to obtain the text recognition result of the sample video. The audio recognition result is the result of performing audio recognition on the background audio of the sample video. In some embodiments, the computer device performs audio recognition on the background audio of the sample video using ASR (Automatic Speech Recognition) technology to obtain the audio recognition result of the sample video.

In the above embodiments, the text information of the sample video can be acquired quickly from any of the description information, title information, subtitle information, text recognition result or audio recognition result of the sample video. This keeps the acquisition of text information efficient, enriches the types of text information, and makes the acquisition of text information more flexible.

The image label represents the image reconstruction feature of the sample video after image reconstruction, where image reconstruction refers to restoring the images in the video to obtain complete image features. The text label represents the content description text of the sample video, which is a sentence describing the content of the sample video.

In some embodiments, the content category description text includes multi-level category description texts, for example a first-level category description text and a second-level category description text. The first-level category description text is a sentence describing the first-level category of the video, and the second-level category description text is a sentence describing the second-level category of the video. It should be understood that the first-level category is the general classification of the video, and the second-level category is a sub-classification of the video under the first-level category. The two levels form a tree structure, that is, one first-level category includes multiple second-level categories. For example, the first-level category may be lifestyle, with second-level categories such as life records, product sharing, or health and wellness; or the first-level category may be beauty, with second-level categories such as makeup tutorials, makeup reviews, and skin care.

The content form description text is a sentence describing the content form of the video, where the content form is the presentation form (also called the shooting form) of the video. For example, the content form may be a documentary, a short sketch, or a street interview. The content topic description text is a sentence describing the content topic of the video. The content topic is the theme of the video; for example, it may be the video's topic. The content details description text is a sentence describing the content details of the video, for example a sentence describing what is shown on screen.

In the above embodiments, setting multiple types of content description text makes it possible, on the one hand, to generate more expressive content description text. On the other hand, the multiple types describe the content of the video from different dimensions, which enriches the types of generated content description text and characterizes the video more fully and completely.

In step 402, the computer device inputs the image information and the text information into the video feature extraction model.

In the embodiments of the present disclosure, the video feature extraction model provides image reconstruction and text generation functions. In some embodiments, the video feature extraction model is an encoder-decoder architecture based on a self-attention mechanism. The self-attention mechanism learns the meaning of features from the dependencies between them. For each input feature, the similarity or correlation between that feature and its neighboring features is computed, for example as a vector dot product, as a vector similarity, or by evaluating an additional neural network, which yields a score for the feature and each neighbor. The scores are then converted numerically, for example with the softmax function (an activation function). On the one hand, this turns the scores into a probability distribution whose element weights sum to 1, which normalizes them. On the other hand, the inherent behavior of the softmax function makes the weights of important elements more prominent. Finally, a weighted sum is taken using the element weights, and the self-attention score is output.
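
The score, softmax, and weighted-sum steps described above can be sketched in plain Python. This is a minimal illustration, not the disclosed model: it assumes dot-product scoring, attends over all input features rather than only neighboring ones, omits learned projections, and scales by the square root of the feature dimension, a common convention that the text does not mention. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Return one attended vector per input feature vector."""
    dim = len(features[0])
    outputs = []
    for query in features:
        # Score the query against every feature with a dot product
        # (the division by sqrt(dim) is an assumption, see lead-in).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        # Weighted sum of all feature vectors using the softmax weights.
        attended = [sum(w * f[d] for w, f in zip(weights, features))
                    for d in range(dim)]
        outputs.append(attended)
    return outputs
```

Note that the output contains as many vectors as the input, which is the property the description later relies on when it extracts image features from the fused sequence.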

For example, FIG. 5 is a block diagram of a video feature extraction model according to an exemplary embodiment. Referring to FIG. 5, the video feature extraction model includes an image feature extraction sub-model, a feature fusion sub-model, an image reconstruction sub-model, and a text generation sub-model. The encoder of the video feature extraction model includes the feature fusion sub-model, and the decoder of the video feature extraction model includes the text generation sub-model. The training method provided by the embodiments of the present disclosure is described below with reference to the video feature extraction model shown in FIG. 5.

In step 403, feature extraction is performed on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the sample video.

In the embodiments of the present disclosure, the image feature extraction sub-model extracts the image features of a video. In some embodiments, the image feature extraction sub-model is a ResNet (residual network), a ViT (Vision Transformer), a Swin Tiny model, or the like.

In some embodiments, after the computer device inputs the image information and the text information into the video feature extraction model, the model passes the image information to its image feature extraction sub-model. The image feature extraction sub-model performs feature extraction on the image information and yields image features of a predetermined dimension, such as 512-dimensional image features (or another number of dimensions).

In step 404, the text information is processed through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.

The embedding layer converts numerical values into fixed-size vectors. In some embodiments, after the computer device inputs the image information and the text information into the video feature extraction model, the model passes the text information to its feature fusion sub-model. The embedding layer of the feature fusion sub-model processes the text information and yields text features of a predetermined dimension, such as 512-dimensional text features (or another number of dimensions). Note that the text features have the same dimension as the image features.

In some embodiments, there are multiple pieces of text information (two or more). Accordingly, before feature extraction is performed on the text information, the method further includes: splicing the multiple pieces of text information to obtain spliced text information, and performing step 404 above on the spliced text information. For example, taking the five kinds of text information shown in step 401, the description information, title information, subtitle information, text recognition result, and audio recognition result of the sample video are spliced to obtain spliced text information, which is then input into the above text feature extraction sub-model to perform step 404. In this embodiment, when there are multiple pieces of text information, splicing them and extracting text features from the spliced result draws on several types of text information, which improves the accuracy of text feature extraction.

Further, in some embodiments, after the spliced text information is obtained, the method further includes: extracting the first target number of characters from the spliced text information, and performing step 404 above on the extracted characters. The target number is a preset fixed number, such as 200. In this embodiment, only the first target number of characters of the spliced text information is used for the subsequent text feature extraction. While still providing sufficient text information as input, this reduces the computation of the video feature extraction model and improves the efficiency of text feature extraction.
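
The splicing and truncation described above amount to a concatenation followed by a prefix slice. The sketch below is illustrative only: the function name is hypothetical, the fields are joined with no separator (the text does not specify one), and empty fields are skipped as an assumption.

```python
TARGET_CHARS = 200  # example value given in the text

def build_text_input(text_fields, target_chars=TARGET_CHARS):
    """Splice the available text fields and keep the first target_chars characters."""
    # Fields are expected in a fixed order, e.g. description, title,
    # subtitles, OCR result, ASR result.
    spliced = "".join(field for field in text_fields if field)
    return spliced[:target_chars]
```

For instance, `build_text_input(["desc", "title", "", "ocr", "asr"])` returns the concatenation of the four non-empty fields, and any input longer than 200 characters is cut to its first 200.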

In step 405, feature fusion is performed on the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fused feature of the sample video.

The feature fusion sub-model fuses the image features and the text features in order to output features that represent the video better. In some embodiments, the feature fusion layer is a self-attention layer, such as a Transformer layer based on the self-attention mechanism. Accordingly, the image features and the text features are processed through the self-attention layer included in the feature fusion sub-model to obtain the fused feature of the sample video. In other embodiments, the feature fusion layer is a deep belief network. Accordingly, the image features and the text features are processed through the deep belief network included in the feature fusion sub-model to obtain the fused feature of the sample video. In these embodiments, setting a self-attention layer or a deep belief network in the feature fusion sub-model, and performing feature fusion with the self-attention mechanism or the deep belief network, yields features that represent the video better and improves the accuracy of feature fusion. Of course, in other embodiments, other network layers capable of feature fusion can be set in the feature fusion sub-model to implement feature fusion, which is not limited in the embodiments of the present disclosure.

In the embodiments of the present disclosure, the feature fusion sub-model includes an embedding layer and a feature fusion layer (such as multiple self-attention layers or a deep belief network). The embedding layer of the feature fusion sub-model extracts the text features of the sample video, and the feature fusion layer then fuses them with the image features output by the image feature extraction sub-model. Because the image feature extraction sub-model and the feature fusion sub-model form a single-stream architecture, features of different modalities can be fused more fully, which improves the accuracy of feature fusion.

In step 406, image restoration is performed on the image features in the fused feature through the image reconstruction sub-model of the video feature extraction model to obtain an image training result at the original image size.

The image reconstruction sub-model performs image reconstruction on a video in order to output the image reconstruction feature of the video. In some embodiments, the image reconstruction sub-model performs image restoration on the images in the video in order to output an image reconstruction feature at the original image size. Image restoration is the process of reconstructing a degraded (quality-reduced) image so as to restore it to the original image. The original image size is the size of the images in the video. In some embodiments, the original image size is determined from the horizontal pixels, the vertical pixels, and the color information of the image. Restoring the image features of the sample video to an image reconstruction feature at the original image size yields a high-quality image reconstruction feature, which is then used in the subsequent model training. The image training result is the image reconstruction feature obtained during model training. In some embodiments, the image reconstruction sub-model includes multiple MLP (multilayer perceptron) networks or other network layers capable of image reconstruction, which is not limited in the embodiments of the present disclosure.

In some embodiments, the first preset number of image features are extracted from the fused feature output by the feature fusion sub-model and input into the image reconstruction sub-model of the video feature extraction model. The image reconstruction sub-model processes the extracted image features and outputs the image reconstruction feature of the sample video, that is, the image training result. The preset number is the number of image features. Note that the feature fusion layer of the feature fusion sub-model includes multiple self-attention layers, and a self-attention layer outputs the same number of features as it receives. Extracting image features according to the preset number therefore yields sufficient image features. Moreover, because the image features and the text features have passed through the feature fusion sub-model, the output image features already incorporate the text features, which improves the accuracy of image reconstruction.
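
The extraction of the first preset number of features can be pictured as a prefix slice over the fused sequence. The sketch below assumes that image features are placed before text features in the single-stream input, which the wording "first preset number" suggests but does not state outright; both function names are hypothetical.

```python
def build_single_stream(image_features, text_features):
    """Place image features first, then text features, in one sequence."""
    return list(image_features) + list(text_features)

def take_image_positions(fused_sequence, preset_number):
    """Keep the first preset_number fused vectors (the image positions)."""
    return fused_sequence[:preset_number]
```

Because a self-attention layer returns one output per input, the fused sequence has the same length as the single-stream input, so slicing with the number of image features recovers exactly the image positions.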

In step 407, the fused feature is processed through the text generation sub-model of the video feature extraction model to obtain a text training result.

The text generation sub-model performs text generation on a video in order to output the content description text of the video. The text training result is the content description text generated during model training.

In some embodiments, the text generation sub-model includes multiple self-attention layers. Accordingly, the video feature extraction model passes the fused feature to its text generation sub-model, and the self-attention layers of the text generation sub-model process the fused feature to obtain the text training result. In some embodiments, the multiple self-attention layers of the text generation sub-model process the fused feature and output multiple characters whose self-attention scores meet a text generation condition. The content description text of the sample video is generated from the output characters, that is, the text training result is obtained. For example, the text generation condition may be that the self-attention score reaches a score threshold.

Based on the multiple types of content description text shown in step 401, in an optional embodiment the text generation sub-model can perform multiple types of text generation tasks. The multiple self-attention layers of the text generation sub-model process the fused feature according to the processing mechanism of the text generation task corresponding to each type, to generate multiple types of content description text for the sample video. Further, for any type, the self-attention layers process the first feature sequence segment in the fused feature and output the characters in that segment whose self-attention scores meet the text generation condition. Based on the features output for the first segment, they go on to process the second feature sequence segment and output the characters in it whose self-attention scores meet the text generation condition. Based on the features output for the first and second segments, they go on to process the third feature sequence segment and output the characters in it whose self-attention scores meet the text generation condition. Then, based on the characters already output, the processing and output steps are repeated to output the characters in the next feature sequence segment whose self-attention scores meet the text generation condition. This yields the multiple characters output by the text generation sub-model, which are concatenated in output order to obtain the content description text of the sample video, that is, the text training result above.
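
The step-by-step generation above, where each output depends on everything output so far, resembles an autoregressive decoding loop. The following sketch is a loose analogy under several assumptions: `score_next` is a hypothetical callable standing in for the self-attention layers, the highest-scoring character is taken at each step, and generation stops at an assumed end marker or when the best score falls below the threshold. None of these details is specified in the text.

```python
def generate_text(score_next, max_chars=20, score_threshold=0.5, end_token="<end>"):
    """Emit characters one at a time, conditioning each step on the prefix so far."""
    generated = []
    for _ in range(max_chars):
        candidates = score_next(generated)  # maps candidate character -> score
        best_char, best_score = max(candidates.items(), key=lambda kv: kv[1])
        # Stop when the model signals the end or no candidate meets the condition.
        if best_char == end_token or best_score < score_threshold:
            break
        generated.append(best_char)
    # Concatenate the characters in the order they were output.
    return "".join(generated)

def toy_scorer(prefix):
    """Hypothetical stand-in that always prefers the next character of a fixed phrase."""
    target = "cat video"
    if len(prefix) < len(target):
        return {target[len(prefix)]: 0.9, "?": 0.1}
    return {"<end>": 0.9, "?": 0.1}
```

With the toy scorer, `generate_text(toy_scorer)` rebuilds the fixed phrase character by character and then stops at the end marker.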

In the above embodiments, setting self-attention layers in the text generation sub-model and using the self-attention mechanism to generate the content description text improves the accuracy of text generation.

In some embodiments, the text generation sub-model also provides a mask mechanism. The mask mechanism hides the features in a feature sequence that do not need attention, so that they do not affect the video feature extraction model, which improves the accuracy of the self-attention mechanism.
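
One common way to hide features from attention is to exclude the masked positions before the softmax so that they receive zero weight. The sketch below uses a causal mask, where each position sees only itself and earlier positions. The text does not say which features are masked, so the causal pattern is an assumption, and the function names are hypothetical.

```python
import math

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # stability shift computed over visible positions only
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]
```

For the first position of a three-element sequence, only the first score is visible, so the weights collapse to `[1.0, 0.0, 0.0]` regardless of the other scores.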

In some embodiments, when the text generation sub-model is used to perform multiple types of text generation tasks, before the fused feature is input into the text generation sub-model, the method further includes: adding the type identifier of each type to the fused feature, and performing step 407 above on the fused feature with the type identifiers added. A type identifier indicates the text generation task of the corresponding type. In some embodiments, the video feature extraction model passes the fused feature with the type identifiers added to the text generation sub-model, which processes the fused feature according to the processing mechanism corresponding to each type identifier to obtain the multiple types of content description text. In this embodiment, adding the type identifier of each type to the fused feature allows the text generation sub-model to trigger, for each type identifier, the generation of the content description text of the sample video for that type, which ensures that text generation proceeds smoothly.

Note that steps 406 and 407 above are described with the image training result output first and the text training result output second. In other embodiments, the computer device can output the text training result first and the image training result second, or output both at the same time. The embodiments of the present disclosure do not limit the execution order of step 406 and step 407.

In step 408, the computer device adjusts the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model based on the image training result, the text training result, and the image label and text label of the sample video, so as to train the video feature extraction model.

Regarding steps 402 to 408 above, in some embodiments, in the first iteration of model training the computer device inputs the image information and text information of the sample video into the initial video feature extraction model and triggers the model to perform steps 402 to 407, obtaining the image training result and text training result of the first iteration. Based on these results and the image label and text label of the sample video, the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model in the initial video feature extraction model are adjusted. If the adjusted video feature extraction model does not satisfy the target condition, the next iteration is performed with the adjusted model parameters. In the i-th iteration of model training, the image information and text information of the sample video are input into the video feature extraction model determined in the (i-1)-th iteration, and the model is triggered to perform steps 402 to 407, obtaining the image training result and text training result of the i-th iteration. Based on these results and the image label and text label of the sample video, the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model determined in the (i-1)-th iteration are adjusted. If the adjusted video feature extraction model does not satisfy the target condition, the (i+1)-th iteration is performed with the adjusted model parameters. The training iterations are repeated until the training satisfies the target condition, where i is a positive integer greater than 1.

In some embodiments, the target condition is that the number of training iterations of the video feature extraction model reaches a target number, which is a preset number of training iterations, such as 1000. Alternatively, the target condition is that the model loss value satisfies a target threshold condition, such as the loss value being less than 0.00001. The embodiments of the present disclosure do not limit how the target condition is set.
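
The two stopping conditions can be sketched as a loop guard. This is an illustration only: the text presents the conditions as alternatives, and checking both in one loop is an assumption. `step_fn` is a hypothetical callable that runs one training iteration (steps 402 to 408) and returns that iteration's model loss value.

```python
def train(step_fn, max_iterations=1000, loss_threshold=1e-5):
    """Run step_fn until either stopping condition is met; return (iterations, loss)."""
    loss = float("inf")
    iteration = 0
    # Continue while the iteration budget remains and the loss is not yet small enough.
    while iteration < max_iterations and loss >= loss_threshold:
        iteration += 1
        loss = step_fn(iteration)
    return iteration, loss
```

With a toy `step_fn` whose loss halves on every iteration, the loop ends as soon as the loss drops below the threshold; with a loss that never improves, it ends when the iteration budget is exhausted.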

Regarding the above adjustment of the model parameters of the video feature extraction model, in some embodiments, in the i-th iteration of model training, the model loss value of the i-th iteration is determined based on the image training result and text training result of the i-th iteration and the image label and text label of the sample video. Based on the model loss value of the i-th iteration, the model parameters of the video feature extraction model determined in the (i-1)-th iteration are adjusted. The process by which the computer device determines the model loss value of the i-th iteration is described below in steps 408A to 408C:

In step 408A, the computer device determines the image reconstruction loss value of the i-th iteration based on the image training result of the i-th iteration and the image label of the sample video. The image reconstruction loss value represents the difference between the image training result and the image label.

In some embodiments, the computer device determines the MSELoss (Mean Squared Error Loss) value of the i-th iteration based on the image training result of the i-th iteration and the image label of the sample video, and uses the determined MSELoss value as the image reconstruction loss value.
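
As a reminder of what the mean squared error measures, the sketch below computes it over two flat lists of numbers. Treating the image training result and the image label as flattened vectors of equal length is an assumption made for illustration, and the function name is hypothetical.

```python
def mse_loss(predicted, target):
    """Mean of the squared element-wise differences between two equal-length vectors."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
```

For example, `mse_loss([1.0, 2.0], [1.0, 4.0])` averages the squared differences 0 and 4 and gives 2.0. The loss is zero only when the reconstruction matches the label exactly.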

In step 408B, the computer device determines the text generation loss value of the i-th iteration based on the text training result of the i-th iteration and the text label of the sample video. The text generation loss value represents the difference between the text training result and the text label.

In some embodiments, when the text training result includes multiple types of content description text, for any type, the loss value of the i-th iteration for that type is determined based on the text training result of the i-th iteration for that type and the text label of the sample video for that type. The loss values of the i-th iteration for the multiple types are then combined in a weighted sum, using the weight coefficients of the video feature extraction network for the multiple types, to obtain the text generation loss value of the i-th iteration.

In some embodiments, for any type, the computer device determines the CEloss (Cross Entropy Loss) value of the i-th iteration for that type based on the text training result of the i-th iteration for that type and the text label of the sample video for that type. The CEloss values of the i-th iteration for the multiple types are then combined in a weighted sum, using the weight coefficients of the video feature extraction network for the multiple types, to obtain the text generation loss value of the i-th iteration.
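
The weighted sum over types can be sketched as follows. The type names and the numeric values in the usage note are invented for illustration, and the function name is hypothetical.

```python
def text_generation_loss(per_type_losses, per_type_weights):
    """Weighted sum of the per-type loss values, keyed by type name."""
    return sum(per_type_weights[task_type] * loss
               for task_type, loss in per_type_losses.items())
```

For example, with losses `{"category": 0.5, "topic": 1.0}` and weights `{"category": 2.0, "topic": 0.5}`, the result is 2.0 × 0.5 + 0.5 × 1.0 = 1.5.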

Regarding the above CEloss value, in some embodiments, the process by which the computer device determines the CEloss value includes: for any type, the computer device determines the cross-entropy loss value of the i-th iteration for that type based on the text training result of the i-th iteration for that type, the text label of the sample video for that type, the number of sample videos, and CEloss formula (1).

In the formula, CEloss denotes the cross-entropy loss value; m denotes the number of sample videos in the training data set; y_k denotes the text training result of the video feature extraction model for the k-th sample video; and p denotes the probability that the text training result of the video feature extraction model for the k-th sample video is correct. For example, this correct probability may be the similarity between the text training result and the text label.
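
Formula (1) is referenced above, but its body does not appear in this text. Purely as an illustration, the sketch below assumes the common averaged form CEloss = -(1/m) * sum over k of y_k * log(p_k), with y_k treated as a numeric target value and p_k as the correct probability of the k-th sample video. The function name `ce_loss`, the `eps` clamp, and this exact form are assumptions and are not taken from the disclosure.

```python
import math


def ce_loss(y, p, eps=1e-12):
    """Averaged cross-entropy over m samples (assumed form of formula (1)).

    y:   per-sample target values y_k (assumed numeric, e.g. 1.0 per sample)
    p:   per-sample correct probabilities p_k in (0, 1]
    eps: hypothetical clamp that avoids taking log(0)
    """
    if len(y) != len(p) or len(y) == 0:
        raise ValueError("y and p must be non-empty and of equal length")
    m = len(y)
    total = sum(yk * math.log(max(pk, eps)) for yk, pk in zip(y, p))
    return -total / m
```

Under this assumed form, a higher correct probability yields a lower loss, which matches the role the description gives to p.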

Regarding the weight coefficients corresponding to the multiple types, in some embodiments, the computer device determines them as follows: for any type, a correct proportion of the i-th iteration on that type is determined based on the number of correct texts and the total number of texts of the i-th iteration on that type, where the correct proportion is the ratio of the number of correct texts to the total number of texts in the i-th iteration. The weight coefficient of the video feature extraction network on that type is then determined based on this correct proportion, where the correct proportion is negatively correlated with the weight coefficient.

Here, the number of correct texts is the number of correct text training results generated during model training, for example, the number of text training results whose correct probability reaches a probability threshold. The total number of texts is the total number of text training results generated during model training.

In some embodiments, the computer device determines the weight coefficient of the video feature extraction network on that type based on the number of correct texts of the i-th iteration on that type, the total number of texts of the i-th iteration on that type, and the following weight coefficient formula (2).

W = 1 - (correct / total)    (2)

In the formula, W denotes the weight coefficient of the video feature extraction network on that type; correct denotes the number of correct texts on that type; and total denotes the total number of texts on that type.

In the above embodiment, for each type involved in text generation, the weight coefficient is determined from the correct proportion of that type. Because the correct proportion is the ratio of the number of correct texts to the total number of texts, and because it is negatively correlated with the weight coefficient, types with a large correct proportion receive a smaller weight coefficient and types with a small correct proportion receive a larger one when the text generation loss value is computed. This improves the accuracy of the weight coefficients and therefore of the text generation loss value.
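
The weight computation of formula (2) can be sketched as follows. This is a minimal illustration: the helper names `count_correct` and `type_weight` are hypothetical, and counting a result as correct when its correct probability is at least the threshold is one reading of "reaches a probability threshold".

```python
def count_correct(correct_probs, threshold):
    """Count text training results whose correct probability reaches the threshold."""
    return sum(1 for p in correct_probs if p >= threshold)


def type_weight(correct, total):
    """Weight coefficient for one type: W = 1 - (correct / total), formula (2)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return 1.0 - correct / total


# Illustrative values only: four results for one type, threshold 0.6.
probs = [0.9, 0.4, 0.7, 0.2]
w = type_weight(count_correct(probs, 0.6), len(probs))  # 1 - 2/4
```

A type on which the model is already mostly correct gets a weight near 0, so its loss contributes little to the weighted sum; a type with few correct results gets a weight near 1.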

For the i-th iteration, in some embodiments, after the cross-entropy loss values corresponding to the multiple types are calculated based on CEloss formula (1) and the weight coefficients corresponding to the multiple types are calculated based on weight coefficient formula (2), a weighted sum is computed based on those cross-entropy loss values, those weight coefficients, and the following loss value formula (3), to obtain the text generation loss value of the i-th iteration.

loss_{text generation} = Σ_{s=1}^{n} (W_s * CEloss_s)    (3)

In the formula, loss_{text generation} denotes the text generation loss value; n denotes the number of types; W_s denotes the weight coefficient corresponding to type s; and CEloss_s denotes the cross-entropy loss value corresponding to type s.

In the above embodiment, a weight coefficient is set for each type involved in text generation, and the text generation loss value is determined from the cross-entropy loss value and the weight coefficient of each type, which improves the accuracy of the text generation loss value.
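
The weighted summation described for formula (3) can be sketched as below. The function name `text_generation_loss`, the dictionary-based interface, and the type names "title" and "tag" are hypothetical and chosen only for illustration.

```python
def text_generation_loss(ce_losses, weights):
    """Weighted sum of per-type cross-entropy losses: sum over s of W_s * CEloss_s.

    ce_losses: dict mapping type name -> CEloss_s
    weights:   dict mapping type name -> W_s
    """
    if set(ce_losses) != set(weights):
        raise ValueError("ce_losses and weights must cover the same types")
    return sum(weights[s] * ce_losses[s] for s in ce_losses)


# Hypothetical types and values, for illustration only.
ce_losses = {"title": 0.5, "tag": 2.0}
weights = {"title": 0.2, "tag": 0.8}  # e.g. W_s = 1 - correct_s / total_s
loss = text_generation_loss(ce_losses, weights)  # 0.2 * 0.5 + 0.8 * 2.0
```

Keying both inputs by type name makes a missing weight or a missing loss fail loudly instead of silently dropping a type from the sum.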

In step 408C, the computer device determines the model loss value of the i-th iteration based on the image reconstruction loss value and the text generation loss value.

In some embodiments, a weighted sum is computed based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value, and the weight coefficient corresponding to the text generation loss value, to obtain the model loss value of the i-th iteration.

In some embodiments, a weighted sum is computed based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value, the weight coefficient corresponding to the text generation loss value, and the following loss value formula (4), to obtain the model loss value of the i-th iteration.

Totalloss = W_{image reconstruction} * loss_{image reconstruction} + W_{text generation} * loss_{text generation}    (4)

In the formula, Totalloss denotes the model loss value; W_{image reconstruction} denotes the weight coefficient corresponding to the image reconstruction loss value; loss_{image reconstruction} denotes the image reconstruction loss value; W_{text generation} denotes the weight coefficient corresponding to the text generation loss value; and loss_{text generation} denotes the text generation loss value.
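
Formula (4) amounts to a two-term weighted sum. A minimal sketch follows; the function name `total_loss`, the parameter names, and the example weights are hypothetical, since the text above does not give concrete values for the two task weights.

```python
def total_loss(loss_image, loss_text, w_image, w_text):
    """Model loss of one iteration, following formula (4).

    loss_image: image reconstruction loss value
    loss_text:  text generation loss value
    w_image:    weight coefficient of the image reconstruction loss
    w_text:     weight coefficient of the text generation loss
    """
    return w_image * loss_image + w_text * loss_text


# Illustrative weights only: text generation as the main task,
# image reconstruction as a lighter auxiliary task.
example = total_loss(loss_image=2.0, loss_text=1.0, w_image=0.5, w_text=1.0)
```

Keeping the two weights as explicit arguments makes it easy to shift emphasis between the main task and the auxiliary task without changing the loss code.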

In the above embodiment, in every iteration of model training, the model loss value of that iteration is used to adjust the model parameters of each sub-model in the video feature extraction model, so as to improve the model's text generation capability and thereby train a video feature extraction model with strong text generation capability.

In the above embodiment, constructing an image feature extraction sub-model in the video feature extraction model allows the image features of the sample video to be extracted accurately. Constructing a feature fusion sub-model in the video feature extraction model makes it possible not only to obtain the text features of the sample video but also to fuse the image features and text features of the sample video. Based on the fusion feature, image reconstruction can then be performed on the sample video to obtain high-quality image features, and the content description text of the sample video can also be generated. This provides a training method for a video feature extraction model that combines an image reconstruction task with a text generation task, and it improves the training effect of the video feature extraction model.

The solution shown in FIG. 4 above provides a training method for a video feature extraction model. In some embodiments, a text generation method based on the video feature extraction model can be implemented using the model trained by that training method. FIG. 6 is a flowchart of a text generation method based on a video feature extraction model according to an exemplary embodiment. Referring to FIG. 6, the method includes:

In step 601, the computer device acquires image information and text information of a target video.

In the embodiments of the present disclosure, the target video refers to the video for which text is to be generated. In some embodiments, taking the case where the computer device is provided as a terminal as an example, the target video is a video stored locally on the terminal, a video downloaded by the terminal, or the like. In other embodiments, taking the case where the computer device is provided as a server as an example, the target video is a video in a video database associated with the server, a video uploaded by a terminal, or the like. The embodiments of the present disclosure do not limit the source of the target video.

In some embodiments, the process by which the computer device acquires the image information of the target video includes at least one of the following: acquiring a cover image of the target video; or acquiring at least one frame of the target video. For the process of acquiring the image information of the target video, refer to the process of acquiring the image information of the sample video in step 401; details are not repeated here.

In some embodiments, the process by which the computer device acquires the text information of the target video includes at least one of the following: acquiring description information of the target video; acquiring title information of the target video; acquiring subtitle information of the target video; acquiring a text recognition result of the target video; or acquiring an audio recognition result of the target video. For the process of acquiring the text information of the target video, refer to the process of acquiring the text information of the sample video in step 401; details are not repeated here.

In step 602, the computer device inputs the image information and the text information into the video feature extraction model, and performs feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the target video.

For the process of acquiring the image features of the target video, refer to the process of acquiring the image features of the sample video in step 403; details are not repeated here.

In step 603, the text information is processed through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the target video.

For the process of acquiring the text features of the target video, refer to the process of acquiring the text features of the sample video in step 404; details are not repeated here.

In step 604, the image features and the text features are fused through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the target video.

For the process of acquiring the fusion feature of the target video, refer to the process of acquiring the fusion feature of the sample video in step 405; details are not repeated here.

In step 605, the fusion feature is processed through the text generation sub-model of the video feature extraction model, a plurality of characters satisfying the text generation condition are output, and the content description text of the target video is generated based on the plurality of characters.

For the process of acquiring the content description text of the target video, refer to the process of acquiring the text training result of the sample video in step 407; details are not repeated here.

In the above embodiment, the computer device inputs the image information and the text information into the video feature extraction model, processes the image information and the text information through the video feature extraction model, outputs a plurality of characters satisfying the text generation condition, and generates the content description text of the target video based on the plurality of characters. This provides a model that generates text from both image information and text information. Drawing on multimodal information from the image modality and the text modality increases the amount of information available to the video feature extraction model, which enriches the content description text the model generates and improves the accuracy of its text generation. Here, a modality is a way or form in which information is represented; each medium or form of information, such as image, text, or audio, can be called a modality. In other embodiments, the computer device can also use information of other modalities in the above text generation process, for example, the release information of the target video.
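
The data flow of steps 602 to 605 can be sketched as a simple composition of four sub-models. This is a structural sketch only: the class name, field names, and the use of plain callables are hypothetical, and the real sub-models would be neural network modules rather than the arbitrary callables assumed here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class VideoFeatureExtractionModel:
    # Each field is a hypothetical stand-in for one sub-model or layer.
    image_encoder: Callable[[Any], Any]             # image feature extraction sub-model
    text_embedding: Callable[[Any], Any]            # embedding layer of the fusion sub-model
    fusion_layer: Callable[[Any, Any], Any]         # feature fusion layer
    text_generator: Callable[[Any], Iterable[str]]  # text generation sub-model

    def generate(self, image_info: Any, text_info: Any) -> str:
        image_feat = self.image_encoder(image_info)        # step 602
        text_feat = self.text_embedding(text_info)         # step 603
        fused = self.fusion_layer(image_feat, text_feat)   # step 604
        chars = self.text_generator(fused)                 # step 605: output characters
        return "".join(chars)                              # assemble the description text
```

For instance, constructing the class with toy callables (an encoder that returns a list, a fusion layer that concatenates two lists, and a generator that maps the fused list to characters) is enough to exercise the flow end to end before real sub-models are plugged in.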

In some embodiments, the method further includes: performing image restoration on the image features in the fusion feature through the image reconstruction sub-model of the video feature extraction model to obtain an image reconstruction feature of the target video at the original image size. For the process of acquiring the image reconstruction feature of the target video, refer to the process of acquiring the image training result of the sample video in step 406; details are not repeated here.

In the technical solutions provided by the embodiments of the present disclosure, constructing an image feature extraction sub-model in the video feature extraction model allows the image features of the target video to be extracted accurately. Constructing a feature fusion sub-model in the video feature extraction model makes it possible not only to obtain the text features of the target video but also to fuse the image features and text features of the target video. Based on the fusion feature, image reconstruction can then be performed on the target video to obtain high-quality image features, and the content description text of the target video can also be generated. This provides a video feature extraction model that combines an image reconstruction task with a text generation task, which characterizes the target video better and improves the accuracy of video representation.

FIG. 7 is a block diagram of an apparatus for training a video feature extraction model according to an exemplary embodiment. Referring to FIG. 7, the apparatus includes an acquisition unit 701, an input unit 702, a processing unit 703, and a training unit 704.

The acquisition unit 701 is configured to acquire image information, text information, an image label, and a text label of a sample video, where the image label represents an image reconstruction feature and the text label represents the content description text of the sample video;

The input unit 702 is configured to input the image information and the text information into the video feature extraction model; perform feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the sample video; process the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video; and fuse the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the sample video;

The processing unit 703 is configured to perform image restoration on the image features in the fusion feature through the image reconstruction sub-model of the video feature extraction model to obtain an image training result at the original image size, and to process the fusion feature through the text generation sub-model of the video feature extraction model to obtain a text training result;

The training unit 704 is configured to adjust the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model based on the image training result, the text training result, and the image label and text label of the sample video, so as to train the video feature extraction model.

In some embodiments, the acquisition unit 701 is configured to perform at least one of the following:

In some embodiments, the input unit 702 includes a processing subunit configured to perform any one of the following:

In some embodiments, the processing unit 703 includes a text generation subunit configured to perform:

The apparatus further includes an adding unit configured to add a type identifier of each type to the fusion feature;

The text generation subunit of the processing unit 703 is further configured to input the fusion feature with the type identifiers added into the text generation sub-model and, through the text generation sub-model, process the fusion feature based on the processing mechanism corresponding to each type identifier, to obtain the multiple types of content description text.

In some embodiments, the training unit 704 includes:

For any type, determining the loss value of the i-th iteration on that type based on the text training result of the i-th iteration on that type and the description text label of the sample video on that type;

Computing a weighted sum based on the cross-entropy loss values of the i-th iteration on the multiple types and the weight coefficients of the video feature extraction network on the multiple types, to obtain the text generation loss value of the i-th iteration.

FIG. 8 is a block diagram of a text generation apparatus based on a video feature extraction model according to an exemplary embodiment. Referring to FIG. 8, the apparatus includes an acquisition unit 801, an input unit 802, and a processing unit 803.

The acquisition unit 801 is configured to acquire image information and text information of a target video;

The input unit 802 is configured to input the image information and the text information into the video feature extraction model; perform feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the target video; process the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the target video; and fuse the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion feature of the target video;

The processing unit 803 is configured to process the fusion feature through the text generation sub-model of the video feature extraction model, output a plurality of characters satisfying the text generation condition, and generate the content description text of the target video based on the plurality of characters.

In some embodiments, the processing unit 803 is further configured to perform:

It should be noted that the division into the functional modules described above is only an example of how the apparatus for training a video feature extraction model provided in the above embodiments performs feature extraction. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for training a video feature extraction model provided in the above embodiments and the embodiments of the training method for a video feature extraction model belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.

The computer device mentioned in the embodiments of the present disclosure may be provided as a terminal. FIG. 9 shows a structural block diagram of a terminal 900 provided by an exemplary embodiment of the present disclosure. The terminal 900 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (MPEG-4 Part 14) player, a laptop computer, or a desktop computer. The terminal 900 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.

Generally, the terminal 900 includes a processor 901 and a memory 902.

The processor 901 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor. The main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 901 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.

The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 902 stores at least one program code, which is executed by the processor 901 to implement the process performed by the terminal in the training method for a video feature extraction model or the text generation method based on a video feature extraction model provided by the method embodiments of the present disclosure.

In some embodiments, the terminal 900 may optionally further include a peripheral device interface 903 and at least one peripheral device. The processor 901, the memory 902, and the peripheral device interface 903 may be connected through a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 903 through a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.

The peripheral device interface 903 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 901 and the memory 902. In some embodiments, the processor 901, the memory 902, and the peripheral device interface 903 are integrated on the same chip or circuit board. In some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral device interface 903 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 904 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 904 communicates with communication networks and other communication devices through electromagnetic signals. It converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 904 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 904 may communicate with other terminals through at least one wireless communication protocol, including but not limited to metropolitan area networks, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may further include circuits related to NFC (Near Field Communication), which is not limited in the present disclosure.

The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, it can also collect touch signals on or above its surface. Such a touch signal may be input to the processor 901 as a control signal for processing. In this case, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, arranged on the front panel of the terminal 900. In other embodiments, there may be at least two display screens 905, arranged on different surfaces of the terminal 900 or in a folding design. In still other embodiments, the display screen 905 may be a flexible display screen arranged on a curved surface or a folding surface of the terminal 900. The display screen 905 may even be given a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 905 may be made using materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 906 is used to capture images or video. Optionally, the camera assembly 906 includes a front camera and a rear camera. Usually, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to implement a background blur function, and the main camera and the wide-angle camera can be fused to implement panoramic shooting and VR (Virtual Reality) shooting functions, or other fused shooting functions. In some embodiments, the camera assembly 906 may also include a flash. The flash may be a single color temperature flash or a dual color temperature flash. A dual color temperature flash is a combination of a warm light flash and a cold light flash, and can be used for light compensation under different color temperatures.

The audio circuit 907 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 901 for processing, or to the radio frequency circuit 904 to implement voice communication. For stereo collection or noise reduction, there may be multiple microphones, arranged in different parts of the terminal 900. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans, but also into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 907 may also include a headphone jack.

The positioning component 908 is used to locate the current geographic location of the terminal 900 to implement navigation or LBS (Location Based Service).

The power supply 909 is used to supply power to the components in the terminal 900. The power supply 909 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also support fast charging technology.

In some embodiments, the terminal 900 further includes one or more sensors 910. The one or more sensors 910 include, but are not limited to, an acceleration sensor 911, a gyroscope sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.

The acceleration sensor 911 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with respect to the terminal 900. For example, the acceleration sensor 911 can be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 901 may control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 can also be used to collect motion data of a game or of the user.
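The choice between landscape and portrait view from the gravity components can be sketched as follows. This is a minimal illustration only: the function name `choose_orientation` and the axis convention (y along the long edge of the device, x along the short edge) are assumptions, not details given in the disclosure.

```python
def choose_orientation(gx: float, gy: float) -> str:
    """Pick a view from the gravity components on the device x and y axes.

    Assumed convention (hypothetical): y runs along the long edge of the
    device and x along the short edge.
    """
    # Gravity mostly along the long edge means the device is held upright.
    return "portrait" if abs(gy) >= abs(gx) else "landscape"
```

A practical implementation would likely add hysteresis so that the view does not flip repeatedly when the two components are nearly equal.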

The gyroscope sensor 912 can detect the body direction and rotation angle of the terminal 900, and can cooperate with the acceleration sensor 911 to collect the user's 3D actions on the terminal 900. Based on the data collected by the gyroscope sensor 912, the processor 901 can implement the following functions: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor 913 may be arranged on the side frame of the terminal 900 and/or beneath the display screen 905. When the pressure sensor 913 is arranged on the side frame of the terminal 900, it can detect the user's grip signal on the terminal 900, and the processor 901 performs left-hand/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 913. When the pressure sensor 913 is arranged beneath the display screen 905, the processor 901 controls the operable controls on the UI according to the user's pressure operation on the display screen 905. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The optical sensor 914 is used to collect the ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 according to the ambient light intensity collected by the optical sensor 914. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is decreased. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 914.

The proximity sensor 915, also called a distance sensor, is usually arranged on the front panel of the terminal 900. The proximity sensor 915 is used to collect the distance between the user and the front of the terminal 900. In one embodiment, when the proximity sensor 915 detects that the distance between the user and the front of the terminal 900 gradually decreases, the processor 901 controls the display screen 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 915 detects that the distance between the user and the front of the terminal 900 gradually increases, the processor 901 controls the display screen 905 to switch from the screen-off state to the screen-on state.
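The screen switching behavior driven by the proximity reading can be sketched as a small state update. The function name `next_screen_state`, the state strings, and the use of two consecutive distance samples to decide whether the distance is decreasing or increasing are illustrative assumptions.

```python
def next_screen_state(state: str, prev_distance: float, distance: float) -> str:
    """Return the new screen state ("on" or "off") from two distance samples."""
    if distance < prev_distance:
        # The user is approaching the front panel: turn the screen off.
        return "off"
    if distance > prev_distance:
        # The user is moving away from the front panel: turn the screen on.
        return "on"
    # Distance unchanged: keep the current state.
    return state
```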

Those skilled in the art can understand that the structure shown in FIG. 9 does not constitute a limitation on the terminal 900, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.

The computer device mentioned in the embodiments of the present disclosure may be provided as a server. FIG. 10 is a block diagram of a server according to an exemplary embodiment. The server 1000 may vary greatly depending on configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1001 and one or more memories 1002, where at least one piece of program code is stored in the one or more memories 1002, and the at least one piece of program code is loaded and executed by the one or more processors 1001 to implement the process performed by the server in the training method of the video feature extraction model provided by the above method embodiments. The server 1000 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may also include other components for implementing device functions, which are not described in detail here.

In an exemplary embodiment, a computer-readable storage medium including program code is also provided, for example a memory 1002 including program code, where the program code can be executed by the processor 1001 of the server 1000 to complete the above training method of the video feature extraction model. Optionally, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact-Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.

In an exemplary embodiment, a computer program product is also provided, including a computer program that, when executed by a processor, implements the above training method of the video feature extraction model.

In some embodiments, the computer program involved in the embodiments of the present disclosure may be deployed and executed on one computer device, or executed on multiple computer devices located at one site, or executed on multiple computer devices distributed over multiple sites and interconnected through a communication network. Multiple computer devices distributed over multiple sites and interconnected through a communication network can form a blockchain system.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common general knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A training method for a video feature extraction model, characterized in that the method comprises:

obtaining image information, text information, an image label and a text label of a sample video, where the image label represents an image reconstruction feature, and the text label represents content description text of the sample video;

inputting the image information and the text information into the video feature extraction model; performing feature extraction on the image information through the image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video; processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain text features of the sample video; and performing feature fusion on the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video;

performing image restoration on the image features in the fusion features through the image reconstruction sub-model of the video feature extraction model to obtain an image training result of the original image size, and processing the fusion features through the text generation sub-model of the video feature extraction model to obtain a text training result;

adjusting model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model based on the image training result, the text training result, and the image label and text label of the sample video, so as to train the video feature extraction model.
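The parameter adjustment in the final step of claim 1 is driven by two training signals, one per task. A minimal sketch of how they might be combined into one objective is given below. The specific loss functions (cross-entropy for the text task, mean squared error for the image reconstruction task), the weighting factor `aux_weight`, and all function names are assumptions made for illustration; the claim does not fix them.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, target_ids):
    """Average cross-entropy; logits is one row of scores per text position."""
    total = 0.0
    for row, tid in zip(logits, target_ids):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[tid]
    return total / len(target_ids)


def dual_task_loss(image_pred, image_label, text_logits, text_label_ids,
                   aux_weight=0.5):
    """Main text-generation loss plus a weighted auxiliary reconstruction loss."""
    text_loss = cross_entropy(text_logits, text_label_ids)   # main task
    image_loss = mse(image_pred, image_label)                # auxiliary task
    return text_loss + aux_weight * image_loss
```

Because the combined loss depends on all four sub-models, a gradient step on it would update the image feature extraction, feature fusion, image reconstruction and text generation sub-models together.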

2. The training method of the video feature extraction model according to claim 1, wherein the content description text is at least one of content category description text, content form description text, content subject description text and content detail description text.

3. The training method of the video feature extraction model according to claim 1, wherein performing feature fusion on the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion features of the sample video comprises any of the following:

Through the self-attention layer included in the feature fusion sub-model, the image feature and the text feature are processed to obtain the fusion feature of the sample video;

Through the deep belief network included in the feature fusion sub-model, the image feature and the text feature are processed to obtain the fusion feature of the sample video.
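The self-attention option in claim 3 can be illustrated with a deliberately simplified sketch: image and text features are concatenated into one token sequence, and each token is replaced by an attention-weighted mix of all tokens. This is single-head scaled dot-product attention without the learned query, key and value projections that a trained self-attention layer would have; the function names and the list-of-lists representation are assumptions for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def self_attention_fuse(image_feats, text_feats):
    """Fuse two lists of equal-width feature vectors with self-attention."""
    tokens = image_feats + text_feats   # one joint sequence of tokens
    d = len(tokens[0])
    fused = []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of all tokens: each output mixes image and text content.
        fused.append([sum(w * v[j] for w, v in zip(weights, tokens))
                      for j in range(d)])
    return fused
```

The output has one fused vector per input token, so image positions remain identifiable for the reconstruction sub-model.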

4. The training method of the video feature extraction model according to claim 1, wherein processing the fusion features through the text generation sub-model of the video feature extraction model to obtain the text training result comprises:

Through the self-attention layer included in the text generation sub-model, the fusion feature is processed to obtain the text training result.

5. A text generation method based on a video feature extraction model, wherein the video feature extraction model is trained based on the training method described in any one of the above claims 1 to 4, and the method comprises:

Obtain the image information and text information of the target video;

The image information and the text information are input into the video feature extraction model; feature extraction is performed on the image information through the image feature extraction sub-model of the video feature extraction model to obtain the image features of the target video; the text information is processed through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the target video; and feature fusion is performed on the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain the fusion features of the target video;

The fusion feature is processed by the text generation sub-model of the video feature extraction model, a plurality of characters satisfying text generation conditions are output, and a content description text of the target video is generated based on the plurality of characters.
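The character-by-character output in claim 5 can be sketched as a greedy decoding loop. The claim does not define the text generation conditions, so this sketch assumes two common ones: stop when an end-of-text token is produced or when a maximum length is reached. `greedy_decode`, `toy_step` and the token ids are hypothetical; `toy_step` merely stands in for the text generation sub-model applied to the fusion features.

```python
def greedy_decode(step_fn, end_id, max_len):
    """Generate token ids one at a time, always taking the highest score.

    step_fn(prefix) returns a list of scores, one per vocabulary id.
    """
    out = []
    while len(out) < max_len:
        scores = step_fn(out)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == end_id:
            break   # assumed stopping condition: end-of-text token
        out.append(next_id)
    return out


def toy_step(prefix):
    """Toy stand-in for the model: emits ids 1, 2, 3, then the end id 0."""
    scores = [0.0] * 5
    scores[len(prefix) + 1 if len(prefix) < 3 else 0] = 1.0
    return scores
```

The resulting ids would then be mapped back to characters and joined to form the content description text.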

6. A training device for a video feature extraction model, wherein the device comprises:

an acquisition unit, configured to acquire image information, text information, an image label and a text label of a sample video, where the image label represents an image reconstruction feature, and the text label represents content description text of the sample video;

an input unit, configured to input the image information and the text information into a video feature extraction model, perform feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video, process the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain text features of the sample video, and perform feature fusion on the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video;

a processing unit, configured to perform image restoration on the image features in the fusion features through the image reconstruction sub-model of the video feature extraction model to obtain an image training result of the original image size, and to process the fusion features through the text generation sub-model of the video feature extraction model to obtain a text training result;

a training unit, configured to adjust model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model based on the image training result, the text training result, and the image label and text label of the sample video, so as to train the video feature extraction model.

7. A text generation device based on a video feature extraction model, wherein the video feature extraction model is trained based on the training method described in any one of claims 1 to 4, and the device comprises:

an acquisition unit, configured to perform acquisition of image information and text information of the target video;

an input unit, configured to input the image information and the text information into the video feature extraction model, perform feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video, process the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain text features of the target video, and perform feature fusion on the image features and the text features through the feature fusion layer of the feature fusion sub-model to obtain fusion features of the target video;

a processing unit, configured to process the fusion features through a text generation sub-model of the video feature extraction model, output a plurality of characters satisfying text generation conditions, and generate content description text of the target video based on the plurality of characters.

8. A computer device, characterized in that the computer device comprises:

one or more processors;

a memory for storing program code executable by the processor;

wherein the processor is configured to execute the program code to implement the training method of the video feature extraction model according to any one of claims 1 to 4, or the text generation method based on the video feature extraction model according to claim 5.

9. A computer-readable storage medium, characterized in that, when the program code in the computer-readable storage medium is executed by a processor of a computer device, the computer device is enabled to execute the training method of the video feature extraction model according to any one of claims 1 to 4, or the text generation method based on the video feature extraction model according to claim 5.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the training method of the video feature extraction model according to any one of claims 1 to 4, or the text generation method based on the video feature extraction model according to claim 5.