CN116311279A

CN116311279A - Sample image generation, model training, and character recognition methods, device, and medium

Info

Publication number: CN116311279A
Application number: CN202310300797.6A
Authority: CN
Inventors: 陆峥岩; 李敏杰; 肖潇; 章勇
Original assignee: Suzhou Keda Technology Co Ltd
Current assignee: Suzhou Keda Technology Co Ltd
Priority date: 2023-03-24
Filing date: 2023-03-24
Publication date: 2023-06-23
Anticipated expiration: 2043-03-24
Also published as: CN116311279B

Abstract

The present invention relates to the technical field of image processing, and in particular to methods, a device, and a medium for sample image generation, model training, and character recognition. The generation method includes: acquiring target text and style features in a target scene together with a first sample character, where the target text and style features are obtained by text style encoding of a real character image in the target scene; processing the first sample character and the target text and style features based on a target diffusion model to obtain a sample character image in the target scene, where the target diffusion model is used to transfer the font style of the first sample character into the real character image to generate the sample character image; and splicing the sample character image with a background image in the target scene to obtain a sample image. Because the target diffusion model transfers the font style of the first sample character into the real character image, the style is unified and a realistic sample image is obtained.

Description

Sample image generation, model training, and character recognition methods, device, and medium

Technical field

The present invention relates to the technical field of image processing, and in particular to methods, a device, and a medium for sample image generation, model training, and character recognition.

Background

Optical character recognition (OCR) has many uses in the rapidly developing field of deep learning applications and performs very well in tasks such as document and certificate recognition. In machine vision, this has given rise to demand for character recognition in different scenarios. When sample data is sufficient, a trained character recognition model working in a two-stage mode can recognize the text content of an image well. The sample data therefore affects the accuracy of the character recognition model.

However, in some application scenarios the amount of sample data is small, and sample data obtained through data augmentation has low fidelity. Take character recognition on clothing as an example: to increase the amount of sample data, the existing approach pastes characters directly onto a person. Even with some text image augmentation transforms added, it is hard to make the characters look as if they belong to the garment rather than having been pasted on afterwards, so the sample images have low fidelity.

Summary of the invention

In view of this, embodiments of the present invention provide sample image generation, model training, and character recognition methods, a device, and a medium, to solve the problem of low fidelity of sample images.

According to a first aspect, an embodiment of the present invention provides a method for generating a sample image, including:

acquiring target text and style features in a target scene and a first sample character, where the target text and style features are obtained by text style encoding of a real character image in the target scene;

processing the first sample character and the target text and style features based on a target diffusion model to obtain a sample character image in the target scene, where the target diffusion model is used to transfer the font style of the first sample character into the real character image to generate the sample character image;

splicing the sample character image with a background image in the target scene to obtain a sample image.

The sample image generation method provided by this embodiment of the present invention uses the target text and style features obtained from the real character image in the target scene as the condition of the target diffusion model, and uses the target diffusion model to transfer the font style of the first sample character into the real character image. This ensures that the text and the image in the generated sample character image share a unified style that is closer to the real text style of the target scene. Splicing the sample character image with the background image in the target scene then yields a realistic sample character image.

In some embodiments, acquiring the target text and style features in the target scene includes:

acquiring the real character image;

performing image feature extraction on the real character image to obtain an image style code;

performing text feature extraction on the text content in the real character image to obtain a text code;

fusing the image style code with the text code to obtain the target text and style features.

In the sample image generation method provided by this embodiment of the present invention, image features and text features are extracted separately from the real character image and then fused, so that the resulting target text and style features contain both image features and text features, which improves their reliability.

In some embodiments, fusing the image style code with the text code to obtain the target text and style features includes:

performing attention processing on the image style code and the text code to obtain the attention between the image style code and the text code;

fusing the attention with the text code, and processing the fusion result with a feed-forward network to obtain the target text and style features.

In the sample image generation method provided by this embodiment of the present invention, the image style code and the text code are fused through attention, so that different text content attends to different parts of the given image style code, which improves the reliability and realism of the fusion result. Processing the result with a feed-forward network then extracts richer semantic features and further ensures that the text style encoding is faithful to the real character image. Consequently, when the target text and style features are used as the input condition of the target diffusion model, the diffusion process can produce more realistic sample images.
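The fusion just described can be illustrated with a minimal sketch. The text above only says that attention is computed between the two codes, fused with the text code, and passed through a feed-forward network. The sketch assumes, beyond that, that the text code supplies the queries and the image style code supplies the keys and values, that "fusing" means element-wise addition, and that the feed-forward network has two layers with a ReLU. The names `cross_attention`, `feed_forward`, and `fuse` and the weight matrices `w1` and `w2` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cross_attention(text_code, style_code):
    """Assumed layout: text tokens are queries, style tokens are keys and values."""
    d = len(text_code[0])
    scores = matmul(text_code, transpose(style_code))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, style_code), weights

def feed_forward(x, w1, w2):
    """Two-layer feed-forward network with ReLU (an assumed architecture)."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def fuse(text_code, style_code, w1, w2):
    """Attention -> add to the text code -> feed-forward network."""
    attended, weights = cross_attention(text_code, style_code)
    fused = [[t + a for t, a in zip(t_row, a_row)]
             for t_row, a_row in zip(text_code, attended)]
    return feed_forward(fused, w1, w2), weights
```

Each row of the returned weights shows how strongly one text token attends to each part of the image style code, which is the property the paragraph above relies on.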

In some embodiments, splicing the sample character image with the background image in the target scene to obtain the sample image includes:

acquiring the background image in the target scene, and identifying a region of interest in the background image to obtain a region-of-interest image;

rotating the sample character image by an arbitrary angle to obtain a rotated sample character image;

splicing the rotated sample character image with the region-of-interest image to obtain the sample image.

In the sample image generation method provided by this embodiment of the present invention, identifying the region of interest allows the sample character image to be spliced into that region afterwards, which matches how characters appear in the target scene. In addition, rotating the sample character image by arbitrary angles makes it possible to generate a large number of sample images and thus enlarges the sample set.
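A minimal sketch of the rotate-then-splice step follows. It assumes images are plain 2D grids (lists of rows), that the region of interest is already known as a rectangle `(top, left, height, width)`, and that rotation uses nearest-neighbour sampling. None of these details come from the text above, and `rotate_patch` and `splice` are hypothetical names. How the region of interest is identified is outside the scope of the sketch.

```python
import math

def rotate_patch(patch, angle_deg, fill=None):
    """Rotate a 2D grid about its centre by inverse nearest-neighbour mapping.

    Cells that fall outside the source patch are set to `fill`.
    The sign convention of the angle is an assumption of this sketch.
    """
    h, w = len(patch), len(patch[0])
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Canvas just large enough for the rotated patch; round() absorbs float noise.
    out_w = math.ceil(round(abs(w * cos_t) + abs(h * sin_t), 6))
    out_h = math.ceil(round(abs(w * sin_t) + abs(h * cos_t), 6))
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    ocx, ocy = (out_w - 1) / 2.0, (out_h - 1) / 2.0
    out = [[fill] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            dx, dy = x - ocx, y - ocy
            # Map each output cell back into the source patch.
            sx = cos_t * dx + sin_t * dy + cx
            sy = -sin_t * dx + cos_t * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = patch[iy][ix]
    return out

def splice(background, roi, patch, transparent=None):
    """Paste `patch`, centred, into the ROI of a copy of `background`."""
    top, left, roi_h, roi_w = roi
    ph, pw = len(patch), len(patch[0])
    if ph > roi_h or pw > roi_w:
        raise ValueError("rotated patch does not fit in the region of interest")
    off_y = top + (roi_h - ph) // 2
    off_x = left + (roi_w - pw) // 2
    result = [row[:] for row in background]
    for y in range(ph):
        for x in range(pw):
            if patch[y][x] != transparent:
                result[off_y + y][off_x + x] = patch[y][x]
    return result
```

Calling `rotate_patch` with many different angles on the same character image and splicing each result into the same background is one way to obtain the large number of sample images mentioned above.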

In some embodiments, the training method of the target diffusion model includes:

acquiring a second sample character and the text and style features of a sample image in the target scene;

generating a noise image based on the second sample character during the forward diffusion process of a preset diffusion model;

training the reverse diffusion process of the preset diffusion model based on the noise image and the text and style features of the sample image in the target scene, to determine the target diffusion model.

In the sample image generation method provided by this embodiment of the present invention, the reverse diffusion process of the preset diffusion model is trained with the second sample character as supervision and the text and style features of the sample image in the target scene as the condition, which ensures that the resulting target diffusion model can generate images that fit the target scene.
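The text above does not state which diffusion formulation is used. As an illustration only, the sketch below assumes the common DDPM setup: a linear noise schedule, the closed-form forward step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, and a mean-squared-error loss on the predicted noise. The image is flattened to a list of floats, and `predict_noise` stands in for the conditioned denoising network. All function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule)."""
    if steps == 1:
        return [beta_start]
    return [beta_start + (beta_end - beta_start) * i / (steps - 1) for i in range(steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_diffuse(x0, t, alpha_bars, rng):
    """Forward process: noise a clean character image x0 directly to step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise

def noise_prediction_loss(predict_noise, x0, condition, t, alpha_bars, rng):
    """Reverse-process training signal: MSE between true and predicted noise.

    `condition` plays the role of the text and style features.
    """
    xt, noise = forward_diffuse(x0, t, alpha_bars, rng)
    predicted = predict_noise(xt, t, condition)
    return sum((p - n) ** 2 for p, n in zip(predicted, noise)) / len(noise)
```

In a real training loop the loss would be minimized over the parameters of `predict_noise` with randomly drawn steps `t`; the sketch only shows how the noise image and the condition enter that loss.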

According to a second aspect, an embodiment of the present invention further provides a method for training a character detection model, including:

acquiring a sample image, where the sample image is obtained by the sample image generation method described in the first aspect of the present invention or in any embodiment of the first aspect;

acquiring label data of the sample image, where the label data includes position information of the text content in the sample image and a target rotation angle;

inputting the sample image into the character detection model to obtain predicted position information and a predicted rotation angle of the text content in the sample image;

updating the parameters of the character detection model based on the predicted position information, the predicted rotation angle, and the label data, to obtain a target character detection model.

In the character detection model training method provided by this embodiment of the present invention, the character detection model is trained on a large number of realistic sample images of the target scene, which improves the accuracy of the resulting target character detection model. The trained target character detection model also outputs the rotation angle of the text content, which improves its detection accuracy. Furthermore, during subsequent character recognition the detected text content can be rotated using this angle, which further improves recognition accuracy.
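The text above says the parameters are updated from the predicted position, the predicted rotation angle, and the labels, but does not name a loss function. The sketch below shows one plausible choice, purely as an assumption: a smooth L1 term for the box coordinates plus a periodic angle term `1 - cos(Δθ)`, so that angles differing by a full turn incur no penalty. The box layout and the weighting factor `angle_weight` are hypothetical.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) distance between two scalars."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta

def detection_loss(pred_box, pred_angle_deg, label_box, label_angle_deg, angle_weight=1.0):
    """Combined position and rotation-angle loss for one text region.

    Boxes are sequences of coordinates in the same order, e.g. (cx, cy, w, h).
    """
    box_loss = sum(smooth_l1(p, t) for p, t in zip(pred_box, label_box)) / len(label_box)
    delta = math.radians(pred_angle_deg - label_angle_deg)
    angle_loss = 1.0 - math.cos(delta)  # zero when the angles agree modulo 360 degrees
    return box_loss + angle_weight * angle_loss
```

Because the sample images are generated with a known rotation, the target rotation angle in the label data is available without manual annotation.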

According to a third aspect, an embodiment of the present invention further provides a method for training a character recognition model, including:

acquiring a sample image, where the sample image is obtained by the sample image generation method described in the first aspect of the present invention or in any embodiment of the first aspect;

inputting the sample image into a target character detection model to obtain the position and rotation angle of the text content in the sample image, where the target character detection model is trained by the character detection model training method described in the second aspect of the present invention;

rotating the text content in the sample image using the position of the text content and the rotation angle, to obtain target text content;

updating the parameters of a character recognition model based on the target text content and the text label of the sample image, to obtain a target character recognition model.

In the character recognition model training method provided by this embodiment of the present invention, before the detected text lines are input into the character recognition model, the text content is first rotation-corrected using the predicted position and rotation angle, so that all target text content has a consistent orientation. The parameters of the character recognition model are then updated with this consistently oriented target text content, which further improves the accuracy of the trained target character recognition model.
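The rotation correction can be illustrated geometrically. The sketch below assumes that a detected text region is described by a centre, a width, a height, and an angle in degrees, and that correction means rotating the region back by the predicted angle about its centre. This representation and the names `rotate_point`, `box_corners`, and `correct_rotation` are assumptions, not taken from the text above.

```python
import math

def rotate_point(point, centre, angle_deg):
    """Rotate a 2D point about `centre` by `angle_deg` degrees."""
    theta = math.radians(angle_deg)
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (centre[0] + dx * math.cos(theta) - dy * math.sin(theta),
            centre[1] + dx * math.sin(theta) + dy * math.cos(theta))

def box_corners(centre, width, height, angle_deg):
    """Corners of a rotated text box, starting at the top-left of the upright box."""
    half_w, half_h = width / 2.0, height / 2.0
    upright = [
        (centre[0] - half_w, centre[1] - half_h),
        (centre[0] + half_w, centre[1] - half_h),
        (centre[0] + half_w, centre[1] + half_h),
        (centre[0] - half_w, centre[1] + half_h),
    ]
    return [rotate_point(p, centre, angle_deg) for p in upright]

def correct_rotation(corners, centre, predicted_angle_deg):
    """Undo the predicted rotation so the text region becomes axis-aligned."""
    return [rotate_point(p, centre, -predicted_angle_deg) for p in corners]
```

Applying the same inverse rotation to the pixels inside the region gives every crop the same orientation before it reaches the recognition model.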

According to a fourth aspect, an embodiment of the present invention further provides a character recognition method, including:

acquiring an image to be processed in a target scene;

inputting the image to be processed into a target character detection model to obtain the position and rotation angle of the text content in the image to be processed, where the target character detection model is trained by the character detection model training method described in the second aspect of the present invention;

rotating the text content in the image to be processed using its position and rotation angle, to obtain text content to be recognized;

inputting the text content to be recognized into a target character recognition model to obtain a character recognition result for the image to be processed, where the target character recognition model is trained by the character recognition model training method described in the third aspect of the present invention.

In the character recognition method provided by this embodiment of the present invention, the target character detection model and the target character recognition model are trained on a large number of realistic sample images and therefore detect and recognize characters with high accuracy. Using these two models to perform character recognition on the image to be processed yields more accurate recognition results.
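The four steps of the recognition method compose into a short pipeline. The sketch below only shows the control flow. The three callables `detect`, `rotate_crop`, and `recognize` are hypothetical placeholders for the target character detection model, the rotation correction, and the target character recognition model, and their signatures are assumptions.

```python
def recognize_characters(image, detect, rotate_crop, recognize):
    """Detect text regions, rotate each one upright, then recognize it.

    detect(image)                      -> iterable of (position, angle) pairs
    rotate_crop(image, position, angle) -> upright crop of one text region
    recognize(crop)                    -> recognized string for that crop
    """
    results = []
    for position, angle in detect(image):
        upright = rotate_crop(image, position, angle)
        results.append(recognize(upright))
    return results
```

Keeping the three stages as separate callables mirrors the separation between the second, third, and fourth aspects: each model can be trained or replaced independently.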

According to a fifth aspect, an embodiment of the present invention further provides a sample image generation module, including:

a first acquisition module, configured to acquire target text and style features in a target scene and the font style of a first sample character, where the target text and style features are obtained by text style encoding of a real character image in the target scene;

a style processing module, configured to process the font style of the first sample character and the target text and style features based on a target diffusion model to obtain a sample character image in the target scene, where the target diffusion model is used to transfer the font style of the first sample character into the real character image to generate the sample character image;

a splicing module, configured to splice the sample character image with a background image in the target scene to obtain a sample image.

According to a sixth aspect, an embodiment of the present invention further provides an apparatus for training a character detection model, including:

a second acquisition module, configured to acquire a sample image, where the sample image is obtained by the sample image generation method described in the first aspect of the present invention or in any embodiment of the first aspect;

a third acquisition module, configured to acquire label data of the sample image, where the label data includes position information of the text content in the sample image and a target rotation angle;

a first prediction module, configured to input the sample image into the character detection model to obtain predicted position information and a predicted rotation angle of the text content in the sample image;

a first update module, configured to update the parameters of the character detection model based on the predicted position information, the predicted rotation angle, and the label data, to obtain a target character detection model.

According to a seventh aspect, an embodiment of the present invention further provides an apparatus for training a character recognition model, including:

a fourth acquisition module, configured to acquire a sample image, where the sample image is obtained by the sample image generation method described in the first aspect of the present invention or in any embodiment of the first aspect;

a first detection module, configured to input the sample image into a target character detection model to obtain the position and rotation angle of the text content in the sample image, where the target character detection model is trained by the character detection model training method described in the second aspect of the present invention;

a first rotation module, configured to rotate the text content in the sample image using the position of the text content and the rotation angle, to obtain target text content;

a second update module, configured to update the parameters of a character recognition model based on the target text content and the text label of the sample image, to obtain a target character recognition model.

According to an eighth aspect, an embodiment of the present invention further provides a character recognition module, including:

a fifth acquisition module, configured to acquire an image to be processed in a target scene;

a second detection module, configured to input the image to be processed into a target character detection model to obtain the position and rotation angle of the text content in the image to be processed, where the target character detection model is trained by the character detection model training method described in the second aspect of the present invention;

a second rotation module, configured to rotate the text content in the image to be processed using its position and rotation angle, to obtain text content to be recognized;

a recognition module, configured to input the text content to be recognized into a target character recognition model to obtain a character recognition result for the image to be processed, where the target character recognition model is trained by the character recognition model training method described in the third aspect of the present invention.

According to a ninth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor that are communicatively connected to each other. The memory stores computer instructions, and by executing the computer instructions the processor performs the sample image generation method described in the first aspect or any embodiment of the first aspect, or the character detection model training method described in the second aspect, or the character recognition model training method described in the third aspect, or the character recognition method described in the fourth aspect.

According to a tenth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions that cause a computer to perform the sample image generation method described in the first aspect or any embodiment of the first aspect, or the character detection model training method described in the second aspect, or the character recognition model training method described in the third aspect, or the character recognition method described in the fourth aspect.

It should be noted that, for the beneficial effects of the sample image generation apparatus, the character detection model training apparatus, the character recognition model training apparatus, the character recognition apparatus, the electronic device, and the computer-readable storage medium provided by the embodiments of the present invention, reference may be made to the descriptions above of the corresponding beneficial effects of the sample image generation method, the character detection model training method, the character recognition model training method, and the character recognition method. They are not repeated here.

Description of drawings

To explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for generating a sample image according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for generating a sample image according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of character encoding according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a reverse diffusion process according to an embodiment of the present invention;

Fig. 5 is a flowchart of a method for training a character detection model according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a character detection model according to an embodiment of the present invention;

Fig. 7 is a flowchart of a method for training a character recognition model according to an embodiment of the present invention;

Fig. 8 is a flowchart of a character recognition method according to an embodiment of the present invention;

Fig. 9 is a structural block diagram of an apparatus for generating a sample image according to an embodiment of the present invention;

Fig. 10 is a structural block diagram of an apparatus for training a character detection model according to an embodiment of the present invention;

Fig. 11 is a structural block diagram of an apparatus for training a character recognition model according to an embodiment of the present invention;

Fig. 12 is a structural block diagram of a character recognition apparatus according to an embodiment of the present invention;

Fig. 13 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

As described above, printed-character images generated with a simple character style do not blend well with the background image in the target scene, which has a large negative effect on character recognition. When samples are scarce, related techniques generally use a generative model to expand the sample data set. Most of these generative models are adversarial neural models, but because such models are adversarial by nature, it is hard to train them to equilibrium. For this reason, the sample image generation method provided by the embodiments of the present invention expands the sample data set based on a target diffusion model when the sample data set is small, and thereby improves the fidelity of the resulting sample images. Based on the generated sample images, an embodiment of the present invention further provides a method for training a character detection model, which yields a target character detection model for detecting the position and rotation angle of text content. Further, based on the sample character images, an embodiment of the present application provides a method for training a character recognition model, which yields a target character recognition model for character recognition.

Based on the trained target character recognition model, the character recognition method provided by the embodiments of the present invention can be applied to character recognition on clothing, on banners, and so on. The specific application scenario is set according to actual needs and is not limited here.

According to embodiments of the present invention, embodiments of a sample image generation method, a character recognition model training method, and a character recognition method are provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

在本实施例中提供了一种样本图像的生成方法，可用于电子设备，如电脑、服务器等，图1是根据本发明实施例的样本图像的生成方法的流程图，如图1所示，该流程包括如下步骤：In this embodiment, a method for generating a sample image is provided, which can be used in electronic devices, such as computers, servers, etc. FIG. 1 is a flowchart of a method for generating a sample image according to an embodiment of the present invention, as shown in FIG. 1 , The process includes the following steps:

S11，获取目标场景下的目标文本和风格特征以及第一样本字符。S11. Obtain target text and style features and first sample characters in the target scene.

其中，所述目标文本和风格特征是对目标场景下的真实字符图像进行文本风格编码得到的。Wherein, the target text and style features are obtained by encoding the text style of real character images in the target scene.

如上文所述，目标场景是根据实际需求设置的。例如，所训练出的字符识别模型是用于识别服装上的字符，相应地，目标场景下的真实字符图像为服装上的真实字符图像。通过图像采集设备采集到的真实场景中抠出服装上的字符图像，得到目标场景下的真实字符图像。As mentioned above, the target scene is set according to actual needs. For example, the trained character recognition model is used to recognize characters on clothing, and accordingly, the real character image in the target scene is the real character image on clothing. The character image on the clothing is cut out from the real scene collected by the image acquisition device to obtain the real character image in the target scene.

The first sample characters are printed characters of different styles taken from the network, or characters in other forms. The form and source of the characters are not limited here and are set according to actual needs.

The target text and style features in the target scene are obtained by a text style encoder. Its input is a character image and its output is the text and style features of that image; in other words, the text style encoder extracts the text and style features of the input character image. Accordingly, the encoder is built on a feature extraction model, which includes but is not limited to MobileNetV2, ResNet, or VGG.

S12. Process the first sample characters and the target text and style features based on the target diffusion model to obtain a sample character image in the target scene.

The target diffusion model is used to transfer the font style of the first sample characters to the real character image to generate the sample character image.

The target diffusion model implements font style transfer. After the first sample characters and the target text and style features are obtained, they are input into the target diffusion model to obtain the sample character image in the target scene. Specifically, the target diffusion model transfers the font style of the first sample characters to the real character image to generate the sample character image. Taking character recognition on clothing as the target scene, the font style of the first sample characters is transferred to an actual clothing scene based on the target diffusion model, generating realistic sample character images.

S13. Splice the sample character image with the background image in the target scene to obtain a sample image.

The background image in the target scene is the background of the character image. What is obtained in S12 above is only the sample character image, that is, a character image that incorporates the text and style features of the target scene; to obtain an actual sample image, this character image needs to be blended into the background of the target scene. Specifically, the sample character image is spliced with the background image in the target scene. The splicing may place the sample character image at any position in the background image, or at any position at any angle, and so on. After splicing, a sample image including the background of the target scene and the sample characters is obtained.

The number of generated sample images is not limited here and can be set according to actual needs.

The method for generating a sample image provided in this embodiment uses the target text and style features obtained from real character images in the target scene as the condition of the target diffusion model, and uses the target diffusion model to transfer the font style of the first sample characters to the real character image. This ensures that the text and image in the generated sample character image are unified in style and closer to the real text style of the target scene; splicing the sample character image with the background image in the target scene then yields a realistic sample image. With this method, a large number of realistic sample images in the target scene can be generated, and training the preset character recognition model on this basis can improve the accuracy of the resulting target character recognition model.

This embodiment provides a method for generating a sample image, which can be used in electronic devices such as computers and servers. FIG. 2 is a flowchart of a method for training a character recognition model according to an embodiment of the present invention. As shown in FIG. 2, the process includes the following steps:

S21. Acquire the target text and style features in the target scene and the font style of the first sample characters.

Specifically, S21 above includes:

S211. Acquire a real character image.

The real character image is obtained by cropping the character image out of an image captured in the target scene; the specific cropping method is not limited here.

S212. Perform image feature extraction on the real character image to obtain an image style code.

Image feature extraction may be implemented by an image feature extraction model or by image processing. Taking the feature extraction model as an example, the real character image is input into the feature extraction model, and the output of the model is the image style code; for example, local features or keypoint features are extracted from the input real character image to obtain the image style code.

S213. Perform text feature extraction on the text content in the real character image to obtain a text code.

The text content is the text line in the real character image, and text feature extraction is performed on the text line to obtain the text code. Text feature extraction may also be implemented by a text feature extraction model, which includes but is not limited to feature extraction based on the bag-of-words model, on term frequency-inverse document frequency (TF-IDF), or on word vectors.

S214. Fuse the image style code and the text code to obtain the target text and style features.

The ways of fusing the image style code and the text code include but are not limited to concatenating the two code vectors or taking their weighted sum; the target text and style features are obtained after fusion.
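The two fusion options named above can be sketched as follows. This is only a minimal illustration and not the implementation of the embodiment: the function names `fuse_concat` and `fuse_weighted_sum` and the weight `alpha` are assumptions introduced here, and plain Python lists stand in for the code vectors.

```python
def fuse_concat(style_code, text_code):
    """Fuse by concatenation: the result keeps every component of both codes."""
    return list(style_code) + list(text_code)


def fuse_weighted_sum(style_code, text_code, alpha=0.5):
    """Fuse by weighted sum; alpha (hypothetical) is the weight of the style code."""
    if len(style_code) != len(text_code):
        raise ValueError("a weighted sum needs codes of equal length")
    return [alpha * s + (1.0 - alpha) * t for s, t in zip(style_code, text_code)]


style = [0.2, 0.4, 0.6]   # toy image style code
text = [1.0, 0.0, 1.0]    # toy text code
print(fuse_concat(style, text))
print(fuse_weighted_sum(style, text, alpha=0.25))
```

Concatenation keeps both codes intact at the cost of a longer vector, whereas the weighted sum keeps the dimension fixed but requires both codes to have the same length.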

In some embodiments, S214 above includes:

(1) Perform attention processing on the image style code and the text code to obtain the attention between the image style code and the text code.

(2) Fuse the attention with the text code, and pass the fusion result through a feed-forward network to obtain the target text and style features.

Attention processing is implemented by a multi-head attention module. Its input is the image style code and the text code, and its output is the attention between them, so that different text characters attend to different parts of the given style sample. After the attention is obtained, it is fused with the text code to obtain a fusion result, which is then passed through a feed-forward network to obtain the target text and style features. The fusion of the attention and the text code may be a stacking of multi-dimensional matrices along the channel direction. The role of the feed-forward network is to integrate the image features of the character region with the semantic features of the character content in that region.
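The attention-then-fuse step described above can be illustrated with a simplified sketch. It assumes a single attention head with no learned projections, and it omits the feed-forward network; the names `cross_attention` and `fuse_by_stacking` are hypothetical. Each text-character vector acts as a query over the image style features, and the attended result is stacked with the text code along the channel axis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_codes, style_feats):
    """Each text-character vector (query) attends over the style features (keys and values)."""
    dim = len(style_feats[0])
    attended = []
    for query in text_codes:
        # Scaled dot-product score between this character and every style feature.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in style_feats
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, style_feats)) for d in range(dim)]
        )
    return attended


def fuse_by_stacking(attended, text_codes):
    """Stack attention output and text code along the channel axis (list concatenation)."""
    return [a + t for a, t in zip(attended, text_codes)]


texts = [[1.0, 0.0], [0.0, 1.0]]        # two toy character embeddings
styles = [[10.0, 0.0], [0.0, 10.0]]     # two toy local style features
fused = fuse_by_stacking(cross_attention(texts, styles), texts)
print(fused)
```

In this toy input each character aligns with a different style feature, so the attention weights concentrate on different parts of the style sample for different characters.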

Fusing the image style code and the text code through attention lets different text content attend to different parts of the given image style code, which improves the reliability and authenticity of the fusion result. Processing by the feed-forward network on this basis extracts richer semantic features and further ensures the authenticity of the text style encoding of the real character image, so that when the target text and style features are used as the input condition of the target diffusion model, the target diffusion model can produce more realistic sample images through diffusion.

As a specific application example, take character recognition on clothing. FIG. 3 shows a specific application example of text style encoding. Specifically, as shown in FIG. 3, the target text and style features are determined as follows. The character image on the clothing is cropped out of the real scene captured by the camera and input into a MobileNetV2 model pre-trained on ImageNet; the top fully connected layer is discarded, and an average pooling layer is used to extract local features as the image style code. The MobileNetV2 model is essentially a classification model, which is why a fully connected layer is attached at its end to implement classification. In this embodiment the model is needed only as a feature extractor, so the fully connected layer is removed; in other words, the network structure before the fully connected layer of the classification module is the feature extraction module, and average pooling is added in order to filter the features. Besides average pooling, a max pooling layer or the like may also be used, which is not limited here. Further, an embedding layer encodes the current text line according to the text content in the character image, following the character order of English letters and digits, to represent the text code. The extracted image style code and the text code are input to a multi-head attention layer; by computing the attention between the text code and the extracted features, different text characters attend to different parts of the given style sample. Finally, the output of the multi-head attention mechanism is added to the text code, and the result is passed through the feed-forward network to obtain the final target text and style feature output.

S215. Acquire the first sample characters.

S22. Process the first sample characters and the target text and style features based on the target diffusion model to obtain a sample character image in the target scene.

As mentioned above, the target diffusion model implements font style transfer, and its output is the sample character image.

In some embodiments, the method for training the target diffusion model includes:

(1) Acquire second sample characters and the text and style features of a sample image in the target scene.

(2) In the forward diffusion process of a preset diffusion model, generate a noise image based on the second sample characters.

(3) Based on the noise image and the text and style features of the sample image in the target scene, train the reverse diffusion process of the preset diffusion model to determine the target diffusion model.

The second sample characters are randomly generated images; for example, OpenCV is used to generate second sample characters of random length, random color, and random size as the original image. The text and style features of the sample image in the target scene are obtained in a way similar to the target text and style features described above, which is not repeated here.

The preset diffusion model includes a forward diffusion process and a reverse diffusion process. The forward diffusion process turns the distribution of the original image into a standard Gaussian distribution, and the reverse diffusion process generates character images in the target scene. Taking character recognition on clothing as an example, in the forward diffusion process noise is randomly added to the original image. For the noise schedule $\beta_1, \ldots, \beta_T$, $\beta_t = 0.02 + \mathrm{Exponential}(1 \times 10^{-5}, 0.4)$ is used, where $\mathrm{Exponential}(1 \times 10^{-5}, 0.4)$ denotes floating-point numbers between $\log(1 \times 10^{-5})$ and $\log(0.4)$. After $T$ iterations the distribution of the original image is finally turned into a standard Gaussian distribution. Here $\beta_i$ denotes the noise added at the $i$-th iteration ($i = 1, 2, \ldots, T$); the specific value of $T$ is set according to actual needs and is not limited here.

The reverse diffusion process is implemented based on the UNet model shown in FIG. 4. As shown in FIG. 4, the UNet model consists of downsampling blocks and upsampling blocks with long-range convolutional skip connections, and mainly uses two types of blocks: convolutional blocks and attention blocks. A convolutional block consists of 3 convolutional layers and one convolutional skip connection, and applies a conditional affine transformation to the output of each convolutional layer; the scale and bias of each conditional affine transformation are parameterized by the output of a fully connected layer. An attention block consists of 2 multi-head attention layers and a feed-forward network. The first attention layer performs attention between the text sequence latent variable of the input noise image and the text and style features of the sample image, while the second attention layer performs self-attention. Layer normalization with a conditional affine transformation is applied after each attention layer and after the feed-forward network. As shown in FIG. 4, the input of the attention block has two parts: one is the original image used as supervision, and the other is the text and style features of the sample image in the target scene. That is, with the noise image as supervision, the text and style features of the sample image in the real clothing scene are used as the condition for training the reverse process of the diffusion model.

In the reverse process, a noise image output by the forward process is sampled from a spherical Gaussian to obtain the noise images of a long Markov chain, and a conditional latent variable containing the semantics of the clothing characters (that is, the text and style features of the sample image) is used to denoise the noise images in the long Markov chain, generating character images in the style of the clothing scene.

Training the reverse diffusion process of the preset diffusion model with the second sample characters as supervision and the text and style features of the sample image in the target scene as the condition ensures that the resulting target diffusion model can generate images that fit the target scene.

S23. Splice the sample character image with the background image in the target scene to obtain a sample image.

Specifically, S23 above includes:

S231. Acquire the background image in the target scene, and identify the region of interest in the background image to obtain a region-of-interest image.

When the sample character image is spliced into the background image, it is generally spliced into the region of interest of the background image. For example, characters on clothing are generally located on a person's upper body; the person's upper body therefore needs to be identified in the captured background image, and the upper body is then the region of interest in the background image.

The region of interest is identified in the background image to determine its position, and that position is then used to crop a partial image out of the background image, giving the region-of-interest image.

S232. Rotate the sample character image by an arbitrary angle to obtain a rotated sample character image.

The rotation angle of the sample character image is set according to actual needs; rotating the sample character image by any angle gives the rotated sample character image. To make the rotation angle easy to record, it is defined as the acute angle between the short side of the rotated target box and the positive x-axis, with counterclockwise angles positive and clockwise angles negative. The angle range is therefore [-90, 90).
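The angle convention above can be sketched as a small helper that folds any angle into [-90, 90) and records it next to the paste position. The label layout in `make_label` and all function names are assumptions made for illustration; the text only states that position and rotation angle are recorded.

```python
import random


def normalize_angle(theta_deg):
    """Map any angle in degrees to the equivalent side direction in [-90, 90)."""
    return (theta_deg + 90.0) % 180.0 - 90.0


def sample_rotation(rng):
    """Draw an arbitrary rotation and fold it into the [-90, 90) convention."""
    return normalize_angle(rng.uniform(-180.0, 180.0))


def make_label(cx, cy, width, height, theta_deg):
    """Record where the text was pasted and at what angle (hypothetical label layout)."""
    return {"cx": cx, "cy": cy, "w": width, "h": height, "angle": normalize_angle(theta_deg)}


rng = random.Random(0)
print(make_label(96, 160, 120, 32, sample_rotation(rng)))
```

Folding works because a box side rotated by 180 degrees points along the same line, so angles that differ by a multiple of 180 describe the same box orientation.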

S233. Splice the rotated sample character image with the region-of-interest image to obtain the sample image.

The splicing is done by pasting the rotated sample character image onto the region-of-interest image, which gives the sample image.

Further, to obtain a large number of sample images, a large number of first sample characters in different styles can be used as the input of the target diffusion model, and real scene images of different styles in the target scene can be collected to obtain their target text and style features. With the target text and style features as the condition of the target diffusion model, a large number of sample character images, and hence a large number of sample images, are generated.

The method for generating a sample image provided in this embodiment extracts image features and text features separately from the real character image and then fuses the two, so that the resulting target text and style features contain both image features and text features, which improves their reliability. Identifying the region of interest allows the sample character image to be spliced into that region, which matches how characters appear in the target scene. In addition, rotating the sample character image by arbitrary angles makes it possible to generate a large number of sample images and so enlarges the sample set.

This embodiment provides a method for training a character detection model, which can be used in electronic devices such as computers and servers. FIG. 5 is a flowchart of the method for training a character detection model according to an embodiment of the present invention. As shown in FIG. 5, the process includes the following steps:

S31. Acquire a sample image.

The sample image is obtained by the method for generating a sample image described in any of the above. For the generation of the sample image, refer to the description above, which is not repeated here.

S32. Acquire label data of the sample image.

The label data includes the position information of the text content in the sample image and the target rotation angle.

The position information of the text content and the target rotation angle are recorded when the sample character image is spliced with the background image in the target scene as described above.

S33. Input the sample image into the character detection model to obtain the predicted position information and predicted rotation angle of the text content in the sample image.

The input of the character detection model is the sample image, and its output is the predicted position information and predicted rotation angle of the text line in the sample image. The predicted position information may be expressed as a center point with a length and a width, or as the upper-left and lower-right corner coordinates of the prediction box containing the text line, and so on; this is not limited here.

S34. Based on the predicted position information, the predicted rotation angle, and the label data, update the parameters of the character detection model to determine the target character detection model.

As a specific application example of the character detection model in this embodiment, the character detection model is built with resnet-18 as the backbone network, followed by four output feature-map branches: the target center point heatmap, the sampling offset, the size of the target box, and the rotation angle of the target box. These four branches are trained to predict the specific position of the target. Specifically, the sample image is scaled to a resolution of 3*320*192 and used as the input of the character detection model, which outputs a center point heatmap, an offset feature map, a target box size feature map, and an angle feature map. Combined with the labels of the sample image, the loss between the labels and the outputs of the character detection model is computed iteratively and the model parameters are updated through training, which finally determines the target character detection model.

Specifically, for the size of the feature map output by the backbone network, the ground-truth value of the heatmap is computed with a Gaussian kernel. When the position of the enumerated cell nearly coincides with the coordinates of the true center keypoint, the output of the Gaussian kernel is close to 1; when the position of the enumerated cell is far from the true center keypoint, the output is close to 0. The center point heatmap is output at one quarter of the original image size; since the only detection category is the character box, the output heatmap size is 80*48*1. During training, the center point heatmap is trained with focal loss. During inference, 3×3 max pooling is applied to the predicted center point heatmap to find the center points that meet the detection threshold.
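The heatmap construction and the 3×3 max-pooling peak search described above can be sketched in plain Python. The Gaussian spread `sigma`, the threshold value, the function names, and the choice of which heatmap dimension (80 or 48) is the height are all assumptions for illustration; the focal loss itself is not shown.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Ground-truth heatmap: 1 at the centre cell, decaying towards 0 away from it."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def find_peaks(heatmap, threshold=0.5):
    """Keep cells that equal their 3x3 neighbourhood maximum and reach the threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(rows):
        for x in range(cols):
            value = heatmap[y][x]
            if value < threshold:
                continue
            neighbourhood = [
                heatmap[j][i]
                for j in range(max(0, y - 1), min(rows, y + 2))
                for i in range(max(0, x - 1), min(cols, x + 2))
            ]
            if value >= max(neighbourhood):
                peaks.append((x, y, value))
    return peaks


heatmap = gaussian_heatmap(48, 80, cx=20, cy=33)   # assumed layout: 48 columns, 80 rows
print(find_peaks(heatmap))
```

Comparing each cell with the maximum of its 3×3 neighbourhood plays the same role as max pooling followed by an equality check: only local maxima survive, so nearby high responses around one character box collapse to a single center.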

Since the image is downsampled by R = 4, the offsets of the center point along the x-axis and y-axis are computed from the downsampling rate of the feature map, and a loss function on the offset is set so that the trained network can compensate for the center point offset and correct the position of the detection box. This branch outputs a feature map of size 80*48*2, used to predict the x-axis and y-axis offsets; the offset is trained with the L1 loss.
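A small worked sketch of the offset target may help. It assumes, as in common keypoint detectors, that the offset is the fractional part lost when a full-resolution center is divided by the stride R = 4 and rounded down; the function names are hypothetical.

```python
import math


def center_offset(cx, cy, stride=4):
    """Split a full-resolution centre into its feature-map cell and the sub-cell offset."""
    fx, fy = cx / stride, cy / stride
    cell = (math.floor(fx), math.floor(fy))
    offset = (fx - cell[0], fy - cell[1])
    return cell, offset


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def restore_center(cell, offset, stride=4):
    """Map a feature-map cell plus its offset back to full-resolution pixels."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)


cell, offset = center_offset(101, 50)
print(cell, offset, restore_center(cell, offset))
print(l1_loss([0.0, 0.0], offset))
```

For a center at (101, 50) the heatmap cell is (25, 12) and the offset is (0.25, 0.5); without the offset branch the restored center would be (100, 48), so the branch recovers up to R − 1 pixels of error per axis.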

The target box size feature map output by the character detection model has size 80*48*2, with the 2 channels corresponding to the predicted height and width. The predicted height and width are mapped back to the original image size, and this branch is trained with the L1 loss so that the height and width predicted by the character detection model gradually converge to the character box in the original image.

After the feature extractor of the backbone network, the character detection model adds a detection head for angle information. This feature map has size 80*48*1 and is used to predict the rotation angle of the target box; the error in the target angle is fed back to the character detection model so that the model can learn the angle information of the target.

The method for training a character detection model provided in this embodiment trains the character detection model on a large number of realistic sample images in the target scene, which can improve the accuracy of the resulting target character detection model. The trained target character detection model also outputs the rotation angle of the text content, which improves its detection accuracy. Moreover, in subsequent character recognition the detected text content can be rotated using the rotation angle to further improve recognition accuracy.

This embodiment provides a method for training a character recognition model, which can be used in electronic devices such as computers and servers. FIG. 6 is a flowchart of the method for training a character recognition model according to an embodiment of the present invention. As shown in FIG. 6, the process includes the following steps:

S41. Acquire a sample image.

S42. Input the sample image into the target character detection model to obtain the position and rotation angle of the text content in the sample image.

The target character detection model is trained by the method for training a character detection model described above. For the details of the target character detection model, refer to the description above, which is not repeated here.

S43. Rotate the text content in the sample image using the position and rotation angle of the text content to obtain the target text content.

The text content may be rotated by first extracting the text line using its position and then rotating it by the rotation angle to obtain the target text content. Alternatively, keystone correction may be applied using the position and rotation angle: a perspective transformation matrix is computed from the position and rotation angle and is then used to apply a perspective transformation to the text content in the sample image, giving the target text content.
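One way to turn a predicted position and rotation angle into the point pairs that a perspective transformation needs is sketched below. It assumes image coordinates with y pointing down, a positive angle that appears counterclockwise on screen, and an angle measured along the box's width axis; these conventions and the function names are illustrative assumptions rather than details given in the text.

```python
import math


def rotated_box_corners(cx, cy, width, height, theta_deg):
    """Corners of a box centred at (cx, cy) and rotated by theta_deg.

    Order: top-left, top-right, bottom-right, bottom-left in the box's own frame.
    With y pointing down, a positive angle turns the box counterclockwise on screen.
    """
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = width / 2.0, height / 2.0
    local = [(-half_w, -half_h), (half_w, -half_h), (half_w, half_h), (-half_w, half_h)]
    return [(cx + dx * cos_t + dy * sin_t, cy - dx * sin_t + dy * cos_t) for dx, dy in local]


def upright_corners(width, height):
    """Destination corners of the deskewed text line, in the same corner order."""
    return [(0.0, 0.0), (float(width), 0.0), (float(width), float(height)), (0.0, float(height))]


src = rotated_box_corners(160, 96, 120, 32, 30)
dst = upright_corners(120, 32)
print(src)
print(dst)
```

With OpenCV, for example, the two point lists could be passed to `cv2.getPerspectiveTransform` to obtain the matrix and then to `cv2.warpPerspective` to produce the upright text line; that pairing is a suggestion, not something the text specifies.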

Of course, other ways of rotating the text content based on its position and rotation angle may also be used to obtain the target text content, which is not limited here.

S44. Based on the target text content and the text label of the sample image, update the parameters of the character recognition model to obtain the target character recognition model.

The character recognition model may be a pre-trained character recognition model, for example a pre-trained classifier for English letters and digits. During pre-training, the parameters of the character recognition model are adjusted until they are fixed. In this adjustment process, the generated sample image may be scaled to a resolution of 3*32*384 while preserving the aspect ratio of the original image. If the scaled width is less than 384, each of the three channels is padded with its own channel mean; if the width is greater than 384, the image is cut into segments of width 384 that are fed into the character recognition model in batches. The character recognition model is built with resnet18 as the backbone network and classifies the 52 uppercase and lowercase English letters and the 10 digits through a softmax activation function. Training uses the CTC loss (CTCloss) for text classification, with blank, which is set for repeated characters in the input, treated as one additional category, so that the final classification feature map is 1*48*63 (62 characters plus blank). The CTC loss is computed on this feature map, and the Adam optimizer continually adjusts the network parameters to fit the correct character string, thereby fixing the parameters of the character recognition model and yielding the pre-trained character recognition model. Further, the pre-trained character recognition model is applied to the target scene: the target character detection model detects the sample image, the text content is rotated based on its position and rotation angle to obtain the target text content, the pre-trained character recognition model is used to compute the loss between the target text content and the text label of the sample image, and the parameters of the pre-trained character recognition model are updated based on this loss to obtain the target character recognition model, so that the target character recognition model is better adapted to character/text recognition in the target scene.
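As a non-limiting illustration, the scaling, padding, and segmentation described above might be sketched as follows. The function names, the nearest-neighbour resize, the use of the segment's own per-channel mean as the padding value, and the padding of the last segment of a wide image are assumptions made for this sketch and are not specified by the embodiment.

```python
import numpy as np

TARGET_H, TARGET_W = 32, 384  # height and width quoted in the description

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of an H*W*3 array (stand-in for a real resize routine)."""
    h, w = img.shape[:2]
    rows = np.minimum(np.arange(new_h) * h // new_h, h - 1)
    cols = np.minimum(np.arange(new_w) * w // new_w, w - 1)
    return img[rows][:, cols]

def prepare_chunks(img):
    """Return a list of 3*32*384 float arrays for one text-line image."""
    h, w = img.shape[:2]
    new_w = max(1, round(w * TARGET_H / h))        # keep the aspect ratio
    scaled = resize_nearest(img, TARGET_H, new_w).astype(np.float32)
    chunks = []
    for start in range(0, new_w, TARGET_W):        # cut into 384-wide segments
        piece = scaled[:, start:start + TARGET_W]
        if piece.shape[1] < TARGET_W:
            # Assumption: pad each channel with that channel's mean value.
            mean = piece.mean(axis=(0, 1))
            padded = np.empty((TARGET_H, TARGET_W, 3), dtype=np.float32)
            padded[:] = mean
            padded[:, :piece.shape[1]] = piece
            piece = padded
        chunks.append(piece.transpose(2, 0, 1))    # HWC -> CHW
    return chunks
```

Each returned array has shape 3*32*384 and could be stacked into a batch before being passed to the recognition network.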

In the character recognition model training method provided by this embodiment, before a detected text line is input into the character recognition model, the text content is rotation-corrected using the predicted position and rotation angle, so that the resulting target text content has a uniform angle. The parameters of the character recognition model are then updated using this angle-unified target text content, which further improves the accuracy of the trained target character recognition model.

This embodiment provides a character recognition method that can be used in electronic devices such as computers, servers, and mobile terminals. FIG. 8 is a flowchart of a character recognition method according to an embodiment of the present invention. As shown in FIG. 8, the process includes the following steps:

S51. Acquire an image to be processed in the target scene.

The image to be processed is an image in the target scene. It may be captured by an acquisition device in the target scene, or it may be stored in the electronic device, and so on; its source is not limited here.

S52. Input the image to be processed into the target character detection model to obtain the position and rotation angle of the text content in the image to be processed.

The target character detection model is trained according to the character detection model training method described above. For the structural details of the target character detection model, refer to the description above; they are not repeated here.

S53. Rotate the text content in the image to be processed using its position and rotation angle, to obtain the text content to be recognized.

The rotation processing is similar to that in S43 above and is not repeated here.
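As a non-limiting illustration, rotation correction of a detected text region might be sketched as an inverse-mapped crop. The function name, the box parameterisation (centre, size, angle in degrees), the angle convention, and the nearest-neighbour sampling are assumptions made for this sketch; the embodiment only states that the position and rotation angle are used to rotate the text content.

```python
import numpy as np

def crop_rotated(img, center, size, angle_deg):
    """Extract an upright (h, w) crop of a box rotated by angle_deg about center.

    Assumed convention: in image coordinates (x to the right, y downwards),
    the width axis of the box points along (cos a, sin a).
    """
    cx, cy = center
    w, h = size
    a = np.deg2rad(angle_deg)
    cos_a, sin_a = np.cos(a), np.sin(a)
    # Offsets of every output pixel from the centre of the crop.
    u = np.arange(w) - (w - 1) / 2.0
    v = np.arange(h) - (h - 1) / 2.0
    uu, vv = np.meshgrid(u, v)                     # both of shape (h, w)
    # Inverse mapping: source location of each output pixel.
    src_x = cx + uu * cos_a - vv * sin_a
    src_y = cy + uu * sin_a + vv * cos_a
    xi = np.clip(np.rint(src_x).astype(int), 0, img.shape[1] - 1)
    yi = np.clip(np.rint(src_y).astype(int), 0, img.shape[0] - 1)
    return img[yi, xi]
```

With an angle of 0 the function reduces to an ordinary axis-aligned crop; a production system would more likely use an interpolating affine warp from an image library.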

S54. Input the text content to be recognized into the target character recognition model to obtain the character recognition result of the image to be processed.

The target character recognition model is trained according to the character recognition model training method described above.
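As a non-limiting illustration, a recognition model trained with a CTC loss over 62 characters plus a blank class is commonly read out by greedy decoding, in which repeated predictions are collapsed and blanks are dropped. The embodiment does not describe the decoding step; the character ordering, the position of the blank class at index 62, and the function name below are assumptions made for this sketch.

```python
import numpy as np

# 52 English letters and 10 digits; the ordering is an assumption.
CHARSET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
BLANK = len(CHARSET)  # index 62, assumed position of the blank class

def ctc_greedy_decode(probs):
    """probs: (T, 63) per-step class probabilities -> (text, per-character scores)."""
    best = probs.argmax(axis=1)
    text, scores = [], []
    prev = BLANK
    for t, k in enumerate(best):
        if k != BLANK and k != prev:   # collapse repeats, drop blanks
            text.append(CHARSET[k])
            scores.append(float(probs[t, k]))
        prev = k
    return "".join(text), scores
```

The blank class is what allows a genuinely doubled character to survive the collapse step: two runs of the same class separated by a blank decode to two characters.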

In some implementations, if the trained target character recognition model can recognize only English characters and digits, Chinese characters may still be detected and passed to it, which causes misclassification: the model outputs a long garbled string of mixed English letters and digits. To handle this, a ratio threshold for Chinese versus English occurrences is set. Because the prediction scores of Chinese characters are generally low, when the ratio of the number of characters whose score is below the classification score threshold to the length of the character string exceeds the set ratio threshold, the string is considered to be Chinese and is not output; in all other cases the character recognition result is output.
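As a non-limiting illustration, the ratio test described above might be sketched as follows. The function name, the default threshold values, and the handling of an empty string are assumptions made for this sketch; the embodiment does not give numerical thresholds.

```python
def filter_recognition(text, scores, score_threshold=0.5, ratio_threshold=0.5):
    """Return text, or None when too many characters have low scores.

    scores holds one classification score per recognized character.
    Both thresholds are hypothetical values chosen for illustration.
    """
    if not text:
        return None
    low = sum(1 for s in scores if s < score_threshold)
    if low / len(text) > ratio_threshold:
        return None  # treated as Chinese (unsupported) text; nothing is output
    return text
```

For example, a string whose four characters score 0.2, 0.3, 0.9, and 0.1 has a low-score ratio of 0.75 and would be suppressed under these default values.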

Because the rotation correction relies on the output of the target character detection model rather than on a separate rotation detection step, the text content is rotated on this basis to obtain the corrected text content to be recognized, and character recognition is then performed on it, which improves the accuracy of the character recognition result.

In the character recognition method provided by this embodiment, the target character detection model and the target character recognition model are trained with a large number of realistic sample images and therefore have high character detection and character recognition accuracy. Performing character recognition on the image to be processed with these two models yields a more accurate character recognition result.

This embodiment also provides a device for generating a sample image, a training device for a character detection model, a training device for a character recognition model, and a character recognition device. These devices are used to implement the above embodiments and preferred implementations; what has already been explained is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

This embodiment provides a device for generating a sample image which, as shown in FIG. 9, includes:

a first acquisition module 61, configured to acquire target text and style features in a target scene and a first sample character, the target text and style features being obtained by performing text style encoding on a real character image in the target scene;

a style processing module 62, configured to process the first sample character and the target text and style features based on a target diffusion model to obtain a sample character image in the target scene, the target diffusion model being used to transfer the font style of the first sample character into the real character image to generate the sample character image;

a splicing module 63, configured to splice the sample character image with a background image in the target scene to obtain a sample image.

In some implementations, the first acquisition module 61 includes:

a first acquisition unit, configured to acquire the real character image;

a first feature extraction unit, configured to perform image feature extraction on the real character image to obtain an image style code;

a second feature extraction unit, configured to perform text feature extraction on the text content in the real character image to obtain a text code;

a fusion unit, configured to fuse the image style code with the text code to obtain the target text and style features.

In some implementations, the fusion unit includes:

an attention processing subunit, configured to perform attention processing on the image style code and the text code to obtain the attention between the image style code and the text code;

a fusion subunit, configured to fuse the attention with the text code and to process the fusion result through a feed-forward network to obtain the target text and style features.
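As a non-limiting illustration, the attention processing and fusion performed by these subunits might be sketched as follows. The use of scaled dot-product attention with the text code as the query, additive fusion, a two-layer ReLU feed-forward network, and all names and shapes are assumptions made for this sketch; the embodiment does not fix these details here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(text_code, style_code, w1, b1, w2, b2):
    """text_code: (T, d), style_code: (S, d) -> text and style features (T, d_out)."""
    d = text_code.shape[-1]
    # Attention between the text code (query) and the image style code (key/value).
    weights = softmax(text_code @ style_code.T / np.sqrt(d))   # (T, S)
    attention = weights @ style_code                           # (T, d)
    fused = text_code + attention          # assumed additive fusion with the text code
    hidden = np.maximum(0.0, fused @ w1 + b1)                  # feed-forward, ReLU
    return hidden @ w2 + b2
```

In a trained model the weight matrices w1 and w2 and the biases would be learned; learned query, key, and value projections are omitted here for brevity.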

In some implementations, the splicing module 63 includes:

a second acquisition unit, configured to acquire a background image in the target scene and to identify a region of interest in the background image to obtain a region-of-interest image;

a first rotation unit, configured to rotate the sample character image by an arbitrary angle to obtain a rotated sample character image;

a splicing unit, configured to splice the rotated sample character image with the region-of-interest image to obtain the sample image.
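As a non-limiting illustration, the rotation and splicing performed by these units might be sketched as follows. The function names, the nearest-neighbour rotation, the mask-based paste, and the requirement that the rotated patch fit entirely inside the region-of-interest image are assumptions made for this sketch.

```python
import numpy as np

def rotate_with_mask(patch, angle_deg):
    """Rotate an h*w*3 patch about its centre (nearest neighbour).

    Returns the rotated patch on an enlarged canvas and a boolean mask
    marking the canvas pixels that are covered by the patch.
    """
    h, w = patch.shape[:2]
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    new_w = int(np.ceil(np.round(abs(w * c) + abs(h * s), 6)))
    new_h = int(np.ceil(np.round(abs(w * s) + abs(h * c), 6)))
    ys, xs = np.mgrid[0:new_h, 0:new_w]
    dx = xs - (new_w - 1) / 2.0
    dy = ys - (new_h - 1) / 2.0
    # Inverse rotation: locate the source pixel for every canvas pixel.
    src_x = np.rint(c * dx + s * dy + (w - 1) / 2.0).astype(int)
    src_y = np.rint(-s * dx + c * dy + (h - 1) / 2.0).astype(int)
    mask = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros((new_h, new_w, 3), dtype=patch.dtype)
    out[mask] = patch[src_y[mask], src_x[mask]]
    return out, mask

def splice(roi, patch, angle_deg, top, left):
    """Paste the rotated patch into a copy of the ROI image at (top, left).

    Assumes the rotated patch fits entirely inside the ROI image.
    """
    rotated, mask = rotate_with_mask(patch, angle_deg)
    h, w = mask.shape
    result = roi.copy()
    region = result[top:top + h, left:left + w]   # view into result
    region[mask] = rotated[mask]
    return result
```

Because the paste position and the rotation angle are chosen by the generator, they can be recorded alongside the spliced image for later use as label data.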

In some implementations, the training module of the target diffusion model includes:

a fourth acquisition unit, configured to acquire a second sample character and the text and style features of a sample image in the target scene;

a generation unit, configured to generate a noise image based on the second sample character during the forward diffusion process of a preset diffusion model;

a training unit, configured to train the reverse diffusion process of the preset diffusion model based on the noise image and the text and style features of the sample image in the target scene, so as to determine the target diffusion model.
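As a non-limiting illustration, the forward diffusion step and the training objective might be sketched in the widely used denoising-diffusion form, in which Gaussian noise is mixed into a clean image according to a cumulative schedule and a network is trained to predict that noise. The linear schedule, the number of steps, the mean-squared-error objective, and all names below are assumptions made for this sketch; here x0 stands for the image derived from the second sample character.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample a noise image x_t from x0 and return it with the noise that was added."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def denoising_loss(predicted_noise, noise):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((predicted_noise - noise) ** 2))
```

In training, the reverse-process network would receive the noise image, the step index, and the text and style features as conditioning, and its noise prediction would be scored with the loss above; that network is not shown.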

This embodiment provides a training device for a character detection model which, as shown in FIG. 10, includes:

a second acquisition module 71, configured to acquire a sample image, the sample image being obtained according to the method for generating a sample image described in the first aspect of the present invention or any implementation of the first aspect;

a third acquisition module 72, configured to acquire label data of the sample image, the label data including position information of the text content in the sample image and a target rotation angle;

a first prediction module 73, configured to input the sample image into the character detection model to obtain predicted position information and a predicted rotation angle of the text content in the sample image;

a first update module 74, configured to update the parameters of the character detection model based on the predicted position information, the predicted rotation angle, and the label data, to obtain a target character detection model.

This embodiment provides a training device for a character recognition model which, as shown in FIG. 11, includes:

a fourth acquisition module 81, configured to acquire a sample image, the sample image being obtained according to the method for generating a sample image described in the first aspect of the present invention or any implementation of the first aspect;

a first detection module 82, configured to input the sample image into a target character detection model to obtain the position and rotation angle of the text content in the sample image, the target character detection model being trained according to the character detection model training method described in the second aspect of the present invention;

a first rotation module 83, configured to rotate the text content in the sample image using the position of the text content and the rotation angle to obtain target text content;

a second update module 84, configured to update the parameters of the character recognition model based on the target text content and the text label of the sample image, to obtain a target character recognition model.

This embodiment provides a character recognition device which, as shown in FIG. 12, includes:

a fifth acquisition module 91, configured to acquire an image to be processed in a target scene;

a second detection module 92, configured to input the image to be processed into a target character detection model to obtain the position and rotation angle of the text content in the image to be processed, the target character detection model being trained according to the character detection model training method described in the second aspect of the present invention;

a second rotation module 93, configured to rotate the text content in the image to be processed using its position and rotation angle to obtain text content to be recognized;

a recognition module 94, configured to input the text content to be recognized into a target character recognition model to obtain a character recognition result of the image to be processed, the target character recognition model being trained according to the character recognition model training method described in the third aspect of the present invention.

The character recognition model training device and the character recognition device in this embodiment are presented in the form of functional units. A unit here refers to an ASIC circuit, a processor and memory that execute one or more software or firmware programs, and/or other components that can provide the above functions.

Further functional descriptions of the above modules are the same as in the corresponding embodiments above and are not repeated here.

An embodiment of the present invention also provides an electronic device that has the device for generating a sample image shown in FIG. 9, the training device for a character detection model shown in FIG. 10, the training device for a character recognition model shown in FIG. 11, or the character recognition device shown in FIG. 12.

Referring to FIG. 13, which is a schematic structural diagram of a terminal provided by an optional embodiment of the present invention, the terminal may include at least one processor 101, for example a CPU (Central Processing Unit), at least one communication interface 103, a memory 104, and at least one communication bus 102. The communication bus 102 is used to implement connection and communication between these components. The communication interface 103 may include a display and a keyboard, and may optionally also include a standard wired interface and a wireless interface. The memory 104 may be a high-speed RAM (Random Access Memory, a volatile memory) or a non-volatile memory, for example at least one disk memory. Optionally, the memory 104 may also be at least one storage device located remotely from the processor 101. The processor 101 may be combined with the device described with reference to FIG. 9, FIG. 10, FIG. 11, or FIG. 12; an application program is stored in the memory 104, and the processor 101 calls the program code stored in the memory 104 to perform any of the above method steps.

The communication bus 102 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 102 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in FIG. 13, but this does not mean that there is only one bus or one type of bus.

The memory 104 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 104 may also include a combination of the above types of memory.

The processor 101 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor 101 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

Optionally, the memory 104 is also used to store program instructions. The processor 101 can call the program instructions to implement the method for generating a sample image, the training method for a character detection model, the training method for a character recognition model, or the character recognition method shown in any embodiment of the present application.

An embodiment of the present invention also provides a non-transitory computer storage medium storing computer-executable instructions that can execute the method for generating a sample image, the training method for a character detection model, the training method for a character recognition model, or the character recognition method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above types of memory.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A method for generating a sample image, comprising:

acquiring target text and style features in a target scene and a first sample character, wherein the target text and style features are obtained by performing text style encoding on a real character image in the target scene;

processing the first sample character and the target text and style features based on a target diffusion model to obtain a sample character image in the target scene, wherein the target diffusion model is used to transfer the font style of the first sample character into the real character image to generate the sample character image; and

splicing the sample character image with a background image in the target scene to obtain a sample image.

2. The method according to claim 1, wherein the acquiring target text and style features in a target scene comprises:

acquiring the real character image;

performing image feature extraction on the real character image to obtain an image style code;

performing text feature extraction on the text content in the real character image to obtain a text code; and

fusing the image style code with the text code to obtain the target text and style features.

3. The method according to claim 2, wherein the fusing the image style code with the text code to obtain the target text and style features comprises:

performing attention processing on the image style code and the text code to obtain the attention between the image style code and the text code; and

fusing the attention with the text code, and processing the fusion result through a feed-forward network to obtain the target text and style features.

4. The method according to claim 1, wherein the splicing the sample character image with a background image in the target scene to obtain a sample image comprises:

acquiring a background image in the target scene, and identifying a region of interest in the background image to obtain a region-of-interest image;

rotating the sample character image by an arbitrary angle to obtain a rotated sample character image; and

splicing the rotated sample character image with the region-of-interest image to obtain the sample image.

5. The method according to any one of claims 1-4, wherein a training method of the target diffusion model comprises:

acquiring a second sample character and the text and style features of a sample image in the target scene;

generating a noise image based on the second sample character during the forward diffusion process of a preset diffusion model; and

training the reverse diffusion process of the preset diffusion model based on the noise image and the text and style features of the sample image in the target scene, to determine the target diffusion model.

6. A training method for a character detection model, comprising:

acquiring a sample image, wherein the sample image is obtained according to the method for generating a sample image according to any one of claims 1-5;

acquiring label data of the sample image, the label data including position information of the text content in the sample image and a target rotation angle;

inputting the sample image into the character detection model to obtain predicted position information and a predicted rotation angle of the text content in the sample image; and

updating the parameters of the character detection model based on the predicted position information, the predicted rotation angle, and the label data, to obtain a target character detection model.

7. A training method for a character recognition model, comprising:

inputting a sample image into a target character detection model to obtain the position and rotation angle of the text content in the sample image, wherein the target character detection model is trained according to the training method for a character detection model according to claim 6;

rotating the text content in the sample image using the position of the text content and the rotation angle to obtain target text content; and

updating the parameters of the character recognition model based on the target text content and the text label of the sample image, to obtain a target character recognition model.

8. A character recognition method, comprising:

acquiring an image to be processed in a target scene;

inputting the image to be processed into a target character detection model to obtain the position and rotation angle of the text content in the image to be processed, wherein the target character detection model is trained according to the training method for a character detection model according to claim 6;

rotating the text content in the image to be processed using the position and rotation angle of the text content in the image to be processed, to obtain text content to be recognized; and

inputting the text content to be recognized into a target character recognition model to obtain a character recognition result of the image to be processed, wherein the target character recognition model is trained according to the training method for a character recognition model according to claim 7.

9. An electronic device, comprising:

a memory and a processor that are communicatively connected to each other, wherein computer instructions are stored in the memory, and the processor, by executing the computer instructions, performs the method for generating a sample image according to any one of claims 1-5, or the training method for a character detection model according to claim 6, or the training method for a character recognition model according to claim 7, or the character recognition method according to claim 8.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute the method for generating a sample image according to any one of claims 1-5, or the training method for a character detection model according to claim 6, or the training method for a character recognition model according to claim 7, or the character recognition method according to claim 8.