CN116958766A

CN116958766A - Image processing method

Info

Publication number: CN116958766A
Application number: CN202310814356.8A
Authority: CN
Inventors: 陈汐; 黄梁华; 刘宇; 赵德丽
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2023-07-04
Filing date: 2023-07-04
Publication date: 2023-10-27
Anticipated expiration: 2043-07-04
Also published as: CN116958766B

Abstract

The embodiments of the present specification provide an image processing method, including: determining a scene image, an object image of a target object, and object position information for placing the target object in the scene image; inputting the scene image, the object image, and the object position information into an image processing model, and performing feature extraction on the object image by using a first feature extraction network in the image processing model to obtain object image features; performing feature processing on the scene image, the object image, and the object position information by using a second feature extraction network in the image processing model to obtain fusion image features; and obtaining, according to the object image features and the fusion image features, a target fusion image containing the target object and the scene image. A realistic and vivid target fusion image is thereby generated, and the problem that the position of the target object in the target fusion image cannot be accurately controlled is avoided.

Description

Image processing method

Technical Field

The embodiment of the specification relates to the technical field of computers, in particular to an image processing method.

Background

With the continuous development of Artificial Intelligence (AI) technology, in the field of image generation, an image synthesis technology for automatically fusing a plurality of images into one customized image by using the AI technology is also widely used in various computer service scenes.

In the process of image synthesis in the prior art, a plurality of images to be fused can be input into a neural network model, and a customized image is generated by utilizing the neural network model. However, in the image generation method using the neural network model in the prior art, when a plurality of images to be fused are synthesized, the position of a specific object in the images to be fused in the customized image cannot be controlled, so that a real and vivid image cannot be generated. Therefore, how to accurately control the position of a specific object in a customized image, so as to generate a truly vivid customized image becomes a problem to be solved.

Disclosure of Invention

In view of this, the embodiments of the present specification provide two image processing methods. One or more embodiments of the present specification also relate to two image processing apparatuses, a computing device, a computer-readable storage medium, and a computer program, so as to overcome the technical deficiencies of the prior art.

According to a first aspect of embodiments of the present specification, there is provided an image processing method including:

determining a scene image, an object image of a target object, and object position information to place the target object in the scene image;

inputting the scene image, the object image, and the object position information into an image processing model, and performing feature extraction on the object image by using a first feature extraction network in the image processing model to obtain object image features;

performing feature processing on the scene image, the object image and the object position information by using a second feature extraction network in the image processing model to obtain fusion image features;

and obtaining a target fusion image containing the target object and the scene image according to the object image characteristics and the fusion image characteristics.

According to a second aspect of embodiments of the present specification, there is provided an image processing apparatus comprising:

an image determination module configured to determine a scene image, an object image of a target object, and object position information to place the target object in the scene image;

a first feature extraction module configured to input the scene image, the object image, and the object position information into an image processing model, and perform feature extraction on the object image by using a first feature extraction network in the image processing model to obtain object image features;

a second feature extraction module configured to perform feature processing on the scene image, the object image, and the object position information by using a second feature extraction network in the image processing model to obtain fusion image features;

and an image generation module configured to obtain a target fusion image containing the target object and the scene image according to the object image features and the fusion image features.

According to a third aspect of embodiments of the present specification, there is provided an image processing method applied to a cloud-side apparatus, including:

receiving an image processing request sent by an end-side device, wherein the image processing request carries a scene image, an object image of a target object, and object position information for placing the target object in the scene image;

obtaining a target fusion image containing the target object and the scene image according to the object image features and the fusion image features;

and sending the target fusion image to the end-side device.

According to a fourth aspect of embodiments of the present specification, there is provided an image processing apparatus applied to a cloud-side device, including:

a request receiving module configured to receive an image processing request sent by an end-side device, wherein the image processing request carries a scene image, an object image of a target object, and object position information for placing the target object in the scene image;

an image generation module configured to obtain a target fusion image containing the target object and the scene image according to the object image features and the fusion image features;

and an image sending module configured to send the target fusion image to the end-side device.

According to a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the two image processing methods described above.

According to a sixth aspect of the embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the two image processing methods described above.

According to a seventh aspect of the embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the two image processing methods described above.

The image processing method provided in the present specification includes: determining a scene image, an object image of a target object, and object position information for placing the target object in the scene image; inputting the scene image, the object image, and the object position information into an image processing model, and performing feature extraction on the object image by using a first feature extraction network in the image processing model to obtain object image features; performing feature processing on the scene image, the object image, and the object position information by using a second feature extraction network in the image processing model to obtain fusion image features; and obtaining a target fusion image containing the target object and the scene image according to the object image features and the fusion image features.

Specifically, in the image processing method provided in the present specification, in the process of generating the target fusion image using the image processing model, the scene image, the object image of the target object, and the object position information to be placed in the scene image may be input together to the image processing model. Therefore, in the process of generating the target fusion image, the position of the target object in the target fusion image can be guided by utilizing the object position information, so that a real and vivid target fusion image is generated according to the characteristics of the object image and the characteristics of the fusion image, and the problem that the position of the target object in the target fusion image cannot be accurately controlled is avoided.

Drawings

FIG. 1 is a schematic illustration of a process flow for an image synthesis-like scheme provided in one embodiment of the present disclosure;

FIG. 2 is a customized image schematic diagram of an image synthesis-like scheme provided by one embodiment of the present disclosure;

FIG. 3 is a schematic diagram of an application scenario of an image processing method according to an embodiment of the present disclosure;

FIG. 4 is a flow chart of an image processing method provided in one embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a training sample processing procedure in an image processing method according to an embodiment of the present disclosure;

FIG. 6 is a process flow diagram of an image processing method according to one embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present specification;

FIG. 8 is a block diagram of a computing device provided in one embodiment of the present description.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, the present specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the present specification is therefore not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various kinds of information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".

It should be noted that, in one or more embodiments of the present disclosure, user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data used for analysis, stored data, presented data, etc.) are information and data authorized by a user or sufficiently authorized by each party, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or rejection.

In one or more embodiments of the present specification, a large model refers to a deep learning model with large-scale model parameters, typically hundreds of millions, billions, or even trillions of parameters or more. A large model may also be called a foundation model (Foundation Model): it is trained on large-scale unlabeled corpora to produce a pre-trained model with more than one hundred million parameters, which can adapt to a wide range of downstream tasks and has good generalization capability. Examples include a large language model (Large Language Model, LLM) and a multi-modal pre-trained model.

When a large model is applied in practice, the pre-trained model can be adapted to different tasks by fine-tuning with only a small number of samples. Large models can be widely applied in fields such as natural language processing (Natural Language Processing, NLP) and computer vision. Specifically, they can be applied to computer vision tasks such as visual question answering (Visual Question Answering, VQA), image captioning (Image Caption, IC), and image generation, and to natural language processing tasks such as text-based sentiment classification, text summarization, and machine translation. The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.

First, terms related to one or more embodiments of the present specification will be explained.

Diffusion model (Diffusion Model): a deep generative model inspired by non-equilibrium thermodynamics that generates images from Gaussian noise by iterative denoising; it is currently a preferred technical route among deep generative models.

Customized generation (Customized Generation): generating images of a concept in different scenes, actions, and forms from a single reference image or a few reference images of the concept provided by a user.

Image synthesis (Image Composition): a technique that, given a foreground image and a background image, naturally places the target object in the foreground image at a specified position in the background image.

UNet model: a fully convolutional neural network model.

Low-pass filtering (Low-pass filter): a filtering method in which low-frequency signals pass through normally, while high-frequency signals above a set cutoff value are blocked or attenuated.

Transformer model: a deep learning model consisting of an encoder and a decoder.

DINO v2: a self-supervised computer vision model.

Sobel operator: an important processing method in the field of computer vision, mainly used to obtain the first-order gradient of a digital image; its most common application, and its physical meaning, is edge detection.

ControlNet: a neural network architecture that controls a diffusion model by adding extra conditions.

Control Copy: a control copy.
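
As an illustration of the Sobel operator described in the terms above, the following minimal Python sketch computes the gradient magnitude of a small grayscale image with the standard 3x3 Sobel kernels. The function name `sobel_gradient_magnitude`, the list-of-lists image representation, and the choice to leave border pixels at zero are assumptions made for this example and are not taken from the specification.

```python
import math


def sobel_gradient_magnitude(image):
    """Return the Sobel gradient magnitude of a 2-D grayscale image.

    Border pixels are left at 0.0 (an assumption of this sketch).
    """
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal derivative kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical derivative kernel
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += kx[dy][dx] * pixel
                    gy += ky[dy][dx] * pixel
            out[y][x] = math.sqrt(gx * gx + gy * gy)
    return out


# Left half dark, right half bright: a vertical edge between columns 2 and 3.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
edges = sobel_gradient_magnitude(image)
```

In this toy image the response is non-zero only at the two columns adjacent to the brightness step, which is the edge-detection behaviour the definition refers to.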

With the continuous development of artificial intelligence (AI) technology, image synthesis technology, which uses AI to automatically fuse a plurality of images into one customized image, has been widely used in various computer service scenarios in the field of image generation. In prior-art image synthesis, a plurality of images to be fused can be input into a diffusion model, and a customized image is generated by using the diffusion model. The diffusion model (Diffusion Model) is currently a preferred technical route for image generation. Its core principle is that, during training, noise is added to a given target image for different numbers of steps, and a UNet model is trained to predict the added noise given the step number and the noised image. On this basis, during testing, the trained UNet model denoises step by step starting from Gaussian noise, thereby generating an image.
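
The training principle described above can be sketched in a few lines of Python. This is a minimal illustration and not the model of the specification: the linear noise schedule, the closed-form noising formula commonly used for denoising diffusion models, and all names (`make_alpha_bars`, `add_noise`, `noise_prediction_loss`) are assumptions introduced for the example, and the UNet itself is not implemented.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(1, num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Noise a clean sample x0 to step t; return the noised sample and the noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * v + noise_scale * e for v, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between the predicted noise and the noise actually added."""
    return sum((p - q) ** 2 for p, q in zip(predicted_eps, true_eps)) / len(true_eps)


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -0.25, 1.0, 0.0]                 # stand-in for a flattened target image
t = rng.randrange(len(alpha_bars))          # a different step number per training sample
x_t, eps = add_noise(x0, t, alpha_bars, rng)
# A trained UNet would predict eps from (x_t, t); a perfect prediction gives zero loss.
loss_if_perfect = noise_prediction_loss(eps, eps)
```

At test time the same noise predictor would be applied repeatedly, starting from pure Gaussian noise and removing a little noise at each step.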

In the embodiments provided in this specification, it is noted that, based on the principle of the diffusion model, many controllable image generation methods that use various conditions to guide a diffusion model have been developed. For example, in one image generation scheme, the generated content can be controlled by a language description; in another image generation scheme, the generated content can be guided by a variety of conditions such as a text description, a depth map, an edge map, or a sketch.

For generating an image of a specified target according to a customized generation requirement, the present specification describes two schemes. The first is a language-based customized image generation scheme, which fine-tunes a language descriptor on reference images of a given target so as to learn a descriptor that accurately represents the target. However, this method requires lengthy fine-tuning on multiple images, which limits large-scale use of the scheme. The second is a scheme similar to image synthesis, which fuses the target image into a specified position of the background image without fine-tuning a language description. For the flow of this scheme, refer to FIG. 1, which is a schematic process flow diagram of an image synthesis-like scheme provided in an embodiment of the present disclosure. In the image synthesis process, the scheme obtains a reference image x_r from the background image x_s by image segmentation (cropping) and performs reference augmentation (Reference augmentation) on the reference image x_r. The augmented reference image is then input to CLIP, whose image encoder encodes it as a feature vector. The feature vector is projected by a multi-layer perceptron (MLP), and the resulting feature vector c is input into the diffusion model as guidance. At the same time, a mask region is added to the background image x_s at the position of the reference image (Mask shape augmentation), and the masked background image and the noise y_t are input to the diffusion model, which generates content in the given mask region to obtain a customized image y_{t-1} containing the background image and the reference image. For the effect of the generated customized image, refer to FIG. 2, which is a schematic diagram of a customized image of the image synthesis-like scheme provided in an embodiment of the present disclosure. As shown in FIG. 2, the basic idea of this image synthesis scheme is that the user provides a reference image and a background image and paints a region mask on the background image; a customized image can then be generated in the mask region according to the reference image.
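
The masking step of the image synthesis-like scheme above can be illustrated with a small Python sketch. It only shows how a rectangular region of a background image might be blanked out and recorded as a binary mask before being passed to a generator. The function name `mask_region`, the inclusive `(left, top, right, bottom)` box convention, and the fill value are assumptions made for this example.

```python
def mask_region(image, box, fill=0.0):
    """Blank out a rectangular region of a 2-D image.

    `box` is (left, top, right, bottom) with inclusive pixel indices.
    Returns the masked image and a binary mask (1 inside the box, 0 outside).
    """
    left, top, right, bottom = box
    masked, mask = [], []
    for y, row in enumerate(image):
        inside_row = top <= y <= bottom
        mask_row = [1 if inside_row and left <= x <= right else 0
                    for x in range(len(row))]
        masked.append([fill if m else value for value, m in zip(row, mask_row)])
        mask.append(mask_row)
    return masked, mask


# A 3x4 background whose pixel value encodes its position (10 * row + column).
background = [[float(10 * y + x) for x in range(4)] for y in range(3)]
masked, mask = mask_region(background, (1, 0, 2, 1))
```

Pixels outside the box are left untouched, so a generator that fills only the masked region keeps the rest of the background unchanged.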

However, although the above-described scheme of image synthesis is capable of generating a customized image, the scheme has a poor effect of maintaining the characteristics of a target object in a reference image in the operation of generating the customized image, and thus cannot generate a true and vivid image.

Based on this, the present specification provides two image processing methods, and also relates to two image processing apparatuses, a computing device, a computer-readable storage medium, and a computer program, which are described in detail one by one in the following embodiments.

FIG. 3 is a schematic diagram of an application scenario of an image processing method according to an embodiment of the present disclosure. Existing customized image generation methods mostly fine-tune descriptors or model parameters using reference images of a specified concept. Such fine-tuning mostly requires several reference images and takes minutes per concept, which poses challenges for large-scale application of the technology. To address the cost of fine-tuning, the present method generates an image of a given target using only a single reference image. Referring to FIG. 3, a user inputs into a terminal a scene image (the chair image in FIG. 3), an object image (the pet dog image in FIG. 3), and object position information for placing the target object in the scene image. After receiving them, the terminal inputs the scene image, the object image, and the object position information into a diffusion model, and a first feature extraction network layer in the diffusion model performs feature extraction on the object image to obtain object image features. A second feature extraction network layer in the diffusion model fuses the scene image, the object image, and the object position information to obtain a fused image, and extracts image features from the fused image to obtain fusion image features. The object image features and the fusion image features are then input to the UNet model in the diffusion model, thereby obtaining a target fusion image of the scene image and the object image (the image of the pet dog sitting on the chair in FIG. 3).
In the image processing method provided in the present specification, the object state of the target object in the target fusion image may differ from the object state of the target object in the object image. The object state may be understood as the pose, action, form, etc. of the target object. The method can therefore fuse an input target object into a given position of any scene image while adjusting the form, action, and viewing angle of the generated target object to suit the background scene, so that the generated target object interacts with the environment and blends in naturally.
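
The data flow of this application scenario can be summarized in a short Python sketch. It shows only the order in which the three inputs pass through the two feature branches and the generator; the three networks are replaced by toy stand-in functions that record what they receive. All names here (`generate_fused_image`, `toy_first_net`, `toy_second_net`, `toy_unet`), the file names, and the box coordinates are hypothetical and introduced for illustration only.

```python
def generate_fused_image(scene, obj, box, first_net, second_net, unet):
    """Data flow only: two feature branches feed one generator."""
    object_features = first_net(obj)                 # branch 1: object image only
    fusion_features = second_net(scene, obj, box)    # branch 2: scene + object + position
    return unet(object_features, fusion_features)


# Toy stand-ins (not real networks) that simply record their inputs.
def toy_first_net(obj):
    return ("object_features", obj)


def toy_second_net(scene, obj, box):
    return ("fusion_features", scene, obj, box)


def toy_unet(object_features, fusion_features):
    return {"inputs": (object_features[0], fusion_features[0]),
            "box": fusion_features[3]}


result = generate_fused_image(
    "chair.png", "dog.png", (40, 60, 120, 160),
    toy_first_net, toy_second_net, toy_unet,
)
```

Passing the networks in as parameters keeps the sketch independent of any particular model implementation.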

Based on the above, the image processing method provided in this specification can generate a realistic and vivid target fusion image according to the object image features and the fusion image features, and avoids the problem that the position of the target object in the target fusion image cannot be accurately controlled. By contrast, the "customized generation" scheme of the above embodiment generates an image of the target concept from a language description as input and cannot precisely control the scene, while the "image synthesis" method can place a foreground object at a given position of an arbitrary scene image but cannot change the object.

Fig. 4 shows a flowchart of an image processing method according to an embodiment of the present disclosure, which specifically includes the following steps.

Step 402: determining a scene image, an object image of a target object, and object position information for placing the target object in the scene image.

The target object may be understood as an object that needs to appear in the target fusion image. The object may be a living thing such as a person, an animal, or a plant, or an inanimate thing such as furniture, clothing, a house, a cloud, or a spoon. The target object may be set according to the actual scenario and is not particularly limited herein. The object image may be understood as an image of the target object, such as the reference image or the foreground image in the above embodiments. The scene image may be understood as an image of the environment in which the target object is to be located in the target fusion image. For example, the scene image may be an image of a lawn, a beach, the sky, a living room, or the like; it may be set according to the actual scenario and is not particularly limited herein. The scene image corresponds to the background image in the above embodiments. The object position information may be understood as information indicating the placement position of the target object in the scene image, and may be a region (e.g., a rectangular region or an irregular polygonal region), coordinate information, or the like, which is not particularly limited herein. The object position information may be a position box that frames the position of the target object in the scene image. In practical applications, the object position information may be determined automatically by an algorithm or specified by a user. For example, the device side (a server or a client) to which the image processing method provided in the present specification is applied can receive an object position setting parameter sent by a user terminal or another server.
The object position setting parameter may be coordinate information for placing the target object in the scene image, from which the device side can determine the object position information. Alternatively, in other embodiments provided in the present specification, the scene image may be displayed to the user through the user terminal, and the user may set a placement region (i.e., a position box) for the target object in the scene image; for example, the user may pick two corners of the position box in the scene image and click and drag to draw the box, or the user may paint a mask on the scene image. This makes the interaction for determining the position box simpler and easier. A mask input by the user is also supported, in which case the circumscribed rectangle of the mask can be computed automatically and used as the position box. The user terminal then sends the placement region information (e.g., the position box) to the device side as the object position setting parameter, so that the device side can determine the object position information from the placement region.
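
The two ways of obtaining a position box mentioned above (two user-selected corners, or the circumscribed rectangle of a painted mask) can be sketched as follows. The function names `box_from_corners` and `box_from_mask` and the inclusive `(left, top, right, bottom)` convention are assumptions made for this example.

```python
def box_from_corners(corner_a, corner_b):
    """Normalize two clicked (x, y) corners into (left, top, right, bottom)."""
    (xa, ya), (xb, yb) = corner_a, corner_b
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


def box_from_mask(mask):
    """Circumscribed rectangle of the non-zero pixels of a 2-D mask.

    Returns (left, top, right, bottom) with inclusive indices,
    or None if the mask contains no painted pixel.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


# The user drags from the lower-right corner to the upper-left corner.
dragged_box = box_from_corners((7, 9), (3, 2))

# The user paints a small diamond-shaped mask.
painted = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
mask_box = box_from_mask(painted)
```

Normalizing both inputs to the same box format means the rest of the pipeline does not need to know how the user specified the position.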

Specifically, in the image processing method provided in the present specification, in the process of generating the target fusion image, the scene image, the object image of the target object, and the object position information for placing the target object in the scene image need to be determined first. The scene image, the object image, and the object position information may be sent by a user. On this basis, in an embodiment provided in the present specification, the image processing method includes: receiving a scene image, an object image of a target object, and object position information for placing the target object in the scene image, all sent by a user; or receiving an image processing request sent by a user, wherein the image processing request carries the scene image, the object image of the target object, and the object position information. In this way, the user can fuse a plurality of custom images into a target fusion image according to the user's own needs.
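
As a hedged illustration of what such an image processing request might carry, the sketch below bundles the three inputs into one record and checks that the position box lies inside the scene image. The class name `ImageProcessingRequest`, its field names, the nested-list image type, and the validation rule are all assumptions made for this example; the specification does not prescribe a request format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[float]]  # hypothetical representation: rows of grayscale pixels


@dataclass(frozen=True)
class ImageProcessingRequest:
    scene_image: Image
    object_image: Image
    position_box: Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive

    def validate(self) -> None:
        """Raise ValueError if an image is empty or the box leaves the scene."""
        if not self.scene_image or not self.object_image:
            raise ValueError("scene_image and object_image must be non-empty")
        height, width = len(self.scene_image), len(self.scene_image[0])
        left, top, right, bottom = self.position_box
        if not (0 <= left <= right < width and 0 <= top <= bottom < height):
            raise ValueError("position_box must lie inside the scene image")


scene = [[0.0] * 8 for _ in range(6)]    # 6 rows x 8 columns
dog = [[1.0] * 2 for _ in range(2)]
request = ImageProcessingRequest(scene, dog, (2, 1, 5, 4))
request.validate()                        # passes: the box fits inside the scene
```

Validating the box on receipt lets a server reject a malformed request before any model is run.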

In an embodiment provided in the present disclosure, a device side (e.g., a client side or a server) to which the image processing method is applied can obtain a scene image, an object image, and object location information from other device sides. The other device side may be a user terminal, other client side than the device side to which the image processing method is applied, or other service side.

Step 404: inputting the scene image, the object image, and the object position information into an image processing model, and performing feature extraction on the object image by using a first feature extraction network in the image processing model to obtain object image features.

The image processing model can be understood as a model capable of processing the scene image, the object image, and the object position information to obtain the object image features and the fusion image features required for generating the target fusion image. For example, the image processing model may be a diffusion model. In practical applications, after the image processing model obtains the object image features and the fusion image features, it may also be used to process them so as to generate the target fusion image. In one embodiment provided in this specification, the image processing model may be a large model.

The first feature extraction network may be understood as a network layer in the image processing model that performs feature extraction operations on the object image. The first feature extraction network may be comprised of one or more models. The object image feature may be understood as a feature vector corresponding to the object image. In one embodiment provided in this specification, the object image feature may be understood as a feature vector for representing salient feature information of a target object in the object image. The salient feature information may be information such as an overall profile of the target object, an overall posture of the target object, and the like, and may be set according to an actual application scenario, which is not specifically limited in this specification.

Specifically, in the image processing method provided in the present specification, after obtaining the scene image, the object image, and the object position information, the scene image, the object image, and the object position information may be input into an image processing model, and a feature extraction operation is performed on the object image by using a first feature extraction network in the image processing model, so as to obtain corresponding object image features.

In the embodiment provided in the present specification, to ensure the quality of the extracted object image features, the image processing method preprocesses the object image and then performs feature extraction on the preprocessed object image, specifically as follows.

The feature extraction of the object image by using the first feature extraction network in the image processing model to obtain the object image feature includes steps 4042 to 4044:

Step 4042: inputting the object image into the first feature extraction network in the image processing model, and performing image processing on the object image through an image preprocessing module in the first feature extraction network to obtain a preprocessed object image.

The image preprocessing module may be understood as a module capable of performing preprocessing operations on the object image. It may be one or more network layers in the first feature extraction network, or one of a plurality of models constituting the first feature extraction network.

Specifically, the inputting the object image into the first feature extraction network in the image processing model, and performing image processing on the object image through the image preprocessing module in the first feature extraction network to obtain a preprocessed object image includes:

inputting the object image into a first feature extraction network in the image processing model, and extracting low-frequency features of the object image through a low-frequency feature extraction module in the first feature extraction network to obtain a preprocessed object image.

The low-frequency feature extraction module may be understood as a module in the first feature extraction network that is capable of extracting low-frequency features from the object image. The low-frequency feature extraction module may be one or more network layers, or a model. For example, it may be a low-pass filtering module (pooling), i.e., a module implementing a low-pass filter function. Furthermore, the low-frequency feature can be understood as low-frequency information, so the low-frequency feature extraction module may also be regarded as a low-frequency information extraction module.

For example, taking the application of the image processing method provided in the present specification to a customized image generation/synthesis scene as an example, the low-frequency feature extraction module may be a low-pass filtering module, the image processing model may be a diffusion model, and the object image may be a pet dog image. On this basis, low-frequency feature extraction can be performed on the pet dog image by the low-pass filtering module in the diffusion model to obtain a pet dog image from which the low-frequency features have been extracted, thereby ensuring the performance of the object image features extracted subsequently. In practical applications, the preprocessed object image may be a 224x224x3 image.
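
As a rough illustration of how pooling acts as a low-pass filter, the following minimal sketch averages non-overlapping blocks of a grayscale image. The function name `average_pool`, the kernel size and the fact that this variant reduces the resolution are all assumptions made for illustration; the specification does not state the pooling parameters or whether the image size is preserved.

```python
def average_pool(image, k):
    """Low-pass filter a 2D grayscale image by averaging non-overlapping k x k blocks.

    `image` is a list of rows of numbers; both sides must be divisible by k.
    (Hypothetical helper; the kernel size k is an illustrative assumption.)
    """
    height, width = len(image), len(image[0])
    if height % k or width % k:
        raise ValueError("image sides must be divisible by k")
    pooled = []
    for top in range(0, height, k):
        row = []
        for left in range(0, width, k):
            block = [image[y][x]
                     for y in range(top, top + k)
                     for x in range(left, left + k)]
            row.append(sum(block) / (k * k))
        pooled.append(row)
    return pooled


# A checkerboard is almost pure high-frequency content; pooling flattens it.
checkerboard = [[(x + y) % 2 for x in range(4)] for y in range(4)]
smoothed = average_pool(checkerboard, 2)
```

After pooling, every value of the checkerboard collapses to the block mean, which shows that fine detail is suppressed while the coarse appearance is kept.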

Step 4044: performing feature extraction on the preprocessed object image through an image extraction module in the first feature extraction network to obtain the object image features of the object image.

The image extraction module may be understood as a network layer in the first feature extraction network that is capable of extracting features from the preprocessed object image. The image extraction module may be one or more network layers in the first feature extraction network. Alternatively, the image extraction module may be one of a plurality of models constituting the first feature extraction network. For example, the image extraction module may be a DINO v2 model.

In an embodiment provided in the present specification, the object image feature may be a macroscopic feature vector expressing salient feature information of the target object in the object image. For example, the object image feature may be an ID token. A token is a term specific to Transformer models and denotes the minimum unit of semantic features. In an embodiment provided in the present disclosure, the image extraction module (i.e., the ID extractor, a module for extracting the ID token) may be a Transformer model named DINO v2, whose features can macroscopically express the salient feature information of the target object in the object image. Therefore, in the case where the image extraction module is a DINO v2 model, the object image feature can be extracted using the DINO v2 model. The specific process is as follows: when the model takes an object image (224x224x3, i.e., height x width x number of channels) as input, the output feature is 256x1536, i.e., 256 tokens, each of which is a 1536-dimensional vector, and this 256x1536 feature is the ID token. Based on this, the ID token can be regarded as a macroscopic object image feature expressing the salient information of the target object in the object image.

Continuing the above example, in the process of extracting the ID token, the image processing method provided in the present specification may use the DINO v2 model as the ID extraction module (i.e., the ID extractor), and input the preprocessed pet dog image (a 224x224x3 image) into the ID extraction module to obtain the ID tokens output by the module (i.e., 256 tokens, each of which is a 1536-dimensional vector).
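
The 256x1536 shape is consistent with a Vision Transformer that splits a 224x224 image into square patches of 14x14 pixels: 224 / 14 = 16 patches per side, and 16 x 16 = 256 tokens. The patch size of 14 is an assumption here; it is not stated in this specification, although it matches the publicly released DINO v2 models. The sketch below only reproduces this arithmetic and does not run any model; `id_token_shape` is a hypothetical helper.

```python
def id_token_shape(image_size, patch_size, embed_dim):
    """Return (num_tokens, embed_dim) for a square image split into square patches.

    patch_size is an assumed value; the specification only gives the
    input size (224x224x3) and the output size (256x1536).
    """
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side, embed_dim


shape = id_token_shape(image_size=224, patch_size=14, embed_dim=1536)
print(shape)  # (256, 1536)
```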

In an embodiment provided in the present disclosure, before feature extraction is performed on the object image, the image processing model may further be used to remove content other than the target object in the object image, such as the image background and image noise. Removing the image background ensures the performance of the object image features extracted later and avoids the problem of inaccurate object image features caused by the image background. The specific manner is as follows.

Before the feature extraction is performed on the object image by using the first feature extraction network in the image processing model to obtain the object image features, the method further includes:

and filtering the image interference content except the target object in the object image by using an image filtering module in the image processing model to obtain a filtered object image.

The image interference content other than the target object can be understood as unnecessary or redundant interference information in the object image other than the target object. For example, in the case where the object image is a pet dog image, the image interference content includes the image background, image noise, and other interfering elements in the pet dog image other than the pet dog.

The image filtering module may be understood as a network layer in the image processing model that is capable of filtering image interference content in the object image. Alternatively, the image filtering module may be understood as one of the models constituting the image processing model, used for filtering image interference content in the object image. For example, the image filtering module may be an image segmentation module for segmenting out image interference content (e.g., the image background) other than the target object in the object image. The image segmentation module may be an automatic segmentation model (e.g., SAM) or an interactive segmentation model (e.g., FocalClick). This specification does not specifically limit this.

Specifically, after the object image is input into the image processing model, the image filtering module in the image processing model is utilized to filter the image interference content except for the target object in the object image, so that the object image without the image interference content is obtained, and the object image characteristics with higher performance can be obtained based on the object image.

Continuing the above example, in the image processing method provided in the present specification, the image filtering module is an image segmentation module. On this basis, after the pet dog image is input into the diffusion model, the diffusion model performs foreground segmentation on the pet dog image. The specific process is as follows: given a pet dog image as a reference image (Reference), the reference image is filtered by a segmentation module (Seg module) to obtain a segmented reference image.
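
Once a segmentation model has produced a foreground mask, removing the background reduces to a per-pixel selection. The following minimal sketch assumes the mask is already available as a binary grid (the segmentation model itself is not reproduced here); the function name `remove_background` and the white fill value are illustrative assumptions.

```python
def remove_background(image, mask, fill=255):
    """Keep pixels where the mask is 1 and replace all other pixels with `fill`.

    `image` and `mask` are equally sized lists of rows. The mask is assumed
    to come from a segmentation model; the fill value is an arbitrary choice.
    """
    return [
        [pixel if keep else fill for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


reference = [[10, 20, 30],
             [40, 50, 60],
             [70, 80, 90]]
foreground_mask = [[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]]
segmented = remove_background(reference, foreground_mask)
```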

Step 406: performing feature processing on the scene image, the object image and the object position information by using a second feature extraction network in the image processing model to obtain fused image features.

The second feature extraction network may be understood as a network layer that performs feature processing operation on the scene image, the object image, and the object position information in the image processing model. The second feature extraction network may be comprised of one or more models. The object image may be the object image after the filtering process.

In an embodiment provided in the present specification, the fused image feature may be understood as a feature vector obtained by performing feature extraction on an initial fused image. The initial fused image may be understood as an image obtained by fusing the scene image and the object image, and the manner of fusing them can be set according to the actual application scene. For example, a collage image obtained by collaging the detail features extracted from the object image onto the scene image may be used as the initial fused image. The detail features may be high-frequency components, or high-frequency detail features, of the object image. A high-frequency component may be understood as a feature that emphasizes detail information (e.g., detailed outlines, colors, etc.) of the target object in the object image. For example, and without limitation, where the target object is a pet dog, the high-frequency component may be a feature that emphasizes detail information such as spots on the pet dog's body; where the target object is clothing, the high-frequency component may be a feature that emphasizes detail information such as a pattern or a logo on the clothing.

Thus, in the case where the initial fused image is a collage image onto which detail features (e.g., high-frequency components) have been collaged, the fused image features may be feature maps extracted from the initial fused image. The feature maps can be extracted using a UNet Encoder (the encoder of a UNet model). The specific process is as follows: the initial fused image is input into the UNet Encoder to obtain the output feature maps. Because the initial fused image contains detail features, the output feature maps maintain a high-resolution spatial scale. Based on this, the fused image feature can also be understood as a feature vector representing detail features in both the scene image and the object image.

In an embodiment provided in the present disclosure, in order to ensure that a real and vivid target fusion image can be generated, the image processing method provided in the present disclosure determines the fused image features according to the scene image, the object image and the object position information, and then generates the target fusion image by using these fused image features, thereby ensuring that the target fusion image is real and vivid. The specific procedure for determining the fused image features is as follows.

The feature processing is performed on the scene image, the object image and the object position information by using a second feature extraction network in the image processing model to obtain a fused image feature, which includes steps 4062 to 4064:

Step 4062: fusing the scene image, the object image and the object position information by using the second feature extraction network in the image processing model to obtain an initial fused image.

The initial fusion image is an image obtained by primarily fusing a scene image and an object image.

Specifically, the fusing of the scene image, the object image and the object position information by using the second feature extraction network in the image processing model to obtain an initial fused image includes:

extracting the characteristics of the object image by using a to-be-fused characteristic extraction module in the second characteristic extraction network to obtain to-be-fused image characteristics;

and determining a target position corresponding to the object position information from the scene image, and adding the image feature to be fused to the target position in the scene image to obtain an initial fused image.

The feature extraction module to be fused can be understood as a network layer capable of extracting features of the object image in the second feature extraction network. The feature extraction module to be fused may be one or more network layers in the second feature extraction network. Or the feature extraction module to be fused may be one of a plurality of models constituting the second feature extraction network. The image features to be fused may be understood as features corresponding to the object image that need to be fused with the scene image, for example, the detail features in the above embodiment.

The target position may be understood as the position in the scene image corresponding to the object position information, that is, the position in the scene image where the target object needs to be placed.

Continuing the above example, in the image processing method provided in the present specification, the object position information is a position frame, and the scene image is a lawn image. On this basis, after the pet dog image, the lawn image and the position frame are input into the diffusion model, the diffusion model performs feature extraction and image collage (Collage). Feature extraction means that the diffusion model performs feature extraction on the pet dog image by using a feature extraction network layer to obtain the image detail features of the pet dog image. Image collage means that the diffusion model collages the image detail features onto the corresponding position of the scene image according to the position frame to obtain a collage image.
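
The collage step can be pictured as resizing the detail map to the size of the position frame and pasting it into a copy of the scene image. The sketch below is a simplified single-channel illustration under several assumptions: the frame is given as `(top, left, bottom, right)` with exclusive bottom/right bounds, resizing uses nearest-neighbour sampling, and the names `resize_nearest` and `collage` are hypothetical.

```python
def resize_nearest(patch, out_h, out_w):
    """Resize a 2D grid to out_h x out_w with nearest-neighbour sampling."""
    in_h, in_w = len(patch), len(patch[0])
    return [
        [patch[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def collage(scene, detail_map, box):
    """Paste `detail_map` into a copy of `scene` at `box`.

    box = (top, left, bottom, right), bottom/right exclusive (assumed convention).
    The original scene is left unmodified.
    """
    top, left, bottom, right = box
    resized = resize_nearest(detail_map, bottom - top, right - left)
    result = [row[:] for row in scene]
    for y in range(top, bottom):
        result[y][left:right] = resized[y - top]
    return result


scene = [[0] * 4 for _ in range(4)]   # stand-in for the lawn image
detail = [[1, 2], [3, 4]]             # stand-in for the detail features
collaged = collage(scene, detail, (1, 1, 3, 3))
```

Because the paste location is taken directly from the position frame, the placement of the target object in the initial fused image is fully determined by the object position information.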

In an embodiment provided in the present specification, the feature extraction module to be fused may be a module for performing high-frequency feature extraction. The high-frequency feature extraction module can extract the high-frequency feature in the object image, so that the high-frequency feature in the object image is obtained in the following specific mode.

The feature extraction of the object image by using the feature extraction module to be fused in the second feature extraction network to obtain the feature of the image to be fused comprises the following steps:

and carrying out high-frequency feature extraction on the object image by a high-frequency feature extraction module in the second feature extraction network to obtain an image high-frequency feature.

The high-frequency feature extraction module may be understood as a module capable of extracting a high-frequency feature in the object image in the second feature extraction network. The high frequency feature extraction module may be one or more network layers, or may be one of a plurality of models constituting the second feature extraction network. For example, the high frequency feature extraction module may be a Sobel operator.

Continuing the above example, after the pet dog image, the lawn image and the position frame are input into the diffusion model, the diffusion model performs high-frequency feature extraction on the pet dog image. Here, high-frequency feature extraction means that the diffusion model uses a Sobel operator to extract the edge gradients of the pet dog image; the edge gradients are the high-frequency features, and a high-frequency feature map is thus obtained.
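
For reference, a Sobel operator estimates horizontal and vertical gradients with two 3x3 kernels and combines them into a gradient magnitude. The following minimal sketch applies it to a single-channel image; leaving border pixels at zero and combining the two gradients with the Euclidean norm are illustrative choices, not details given in this specification.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude of a 2D image; border pixels stay 0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# A vertical step edge: left half dark, right half bright.
step = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(step)
```

Flat regions produce a zero response, while the columns adjacent to the step produce a strong response, which is why the result emphasizes outlines and texture rather than overall appearance.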

Step 4064: performing feature extraction on the initial fused image by using the fused image feature extraction module in the second feature extraction network to obtain the fused image features.

The fused image feature extraction module may be understood as a module in the second feature extraction network that is capable of performing feature extraction on the initial fused image. The fused image feature extraction module may be one or more network layers, or may be one of a plurality of models constituting the second feature extraction network. The fused image feature extraction module may be a detail extraction module that extracts the detail features in the initial fused image. The detail extraction module may be initialized from the control copy in ControlNet and then trained.

Continuing the above example, after the collage image is obtained, the collage image is input into the detail extraction module to obtain a detail feature map.

Step 408: obtaining a target fusion image containing the target object and the scene image according to the object image features and the fused image features.

The target fusion image may be understood as an image that contains the target object and the scene image, in which the position of the target object in the scene image corresponds to the object position information. For example, the target object is a pet dog, the scene image is an image of a chair, and the object position information is the seat position of the chair. Based on this, the generated target fusion image should be an image of the pet dog sitting on the seat of the chair. In an embodiment provided in the present disclosure, after the target fusion image is obtained, the server side to which the method is applied may send the target fusion image to the user terminal for display to the user. Alternatively, after the target fusion image is obtained, the client to which the method is applied can directly display the target fusion image to the user.

In the embodiments provided in the present specification, in order to improve the generation efficiency of the target fusion image, an image generation network may be configured in the image processing model. The target fusion image is generated by using the image generation network, so that the efficiency is improved. Specifically, the obtaining, according to the object image feature and the fusion image feature, a target fusion image including the target object and the scene image includes:

and processing the object image features and the fusion image features by using an image generation network in the image processing model to obtain a target fusion image containing the target object and the scene image.

The image generation network can be understood as a network layer in the image processing model for generating the target fusion image. Alternatively, the image generation network may be an image generation model within the image processing model. For example, the image generation network may be a UNet model.

Specifically, the image processing method provided in the present specification can input the object image features and the fusion image features into the image generation network in the image processing model, and perform image synthesis processing on the object image features and the fusion image features by using the image generation network, so as to obtain a target fusion image including a target object and a scene image.

In an embodiment provided in the present specification, the processing the object image feature and the fused image feature by using the image generating network in the image processing model to obtain a target fused image including the target object and the scene image includes steps 4082 to 4084:

Step 4082: determining the image generation features of an image generation module in the image generation network, and fusing the image generation features, the object image features and the fused image features to obtain fused image generation features.

The image generation features may be understood as the features required by the image generation module in generating the target fusion image. In one embodiment provided in this specification, the diffusion model (Stable Diffusion) has a UNet model part, the output of each layer of the UNet model is a feature map of size HxWxC, and the values of H, W and C differ from layer to layer. Here H is the height, W is the width, and C is the number of channels, i.e., the dimension or depth of the feature map. This feature map may be understood as the image generation feature. Alternatively, the feature map output by a specific decoder layer (decoding layer) of the UNet model may also be understood as the image generation feature.

Based on this, following the above example, after the ID token and the detail feature map are obtained, the feature map output by each network layer or a specific network layer in the UNet model in the diffusion model is determined, and the feature map is subjected to feature recombination with the ID token and the detail feature map, so as to obtain a recombined feature map.

In an embodiment provided in the present disclosure, the determining the image generation feature of the image generation module in the image generation network, and fusing the image generation feature, the object image feature, and the fused image feature to obtain a fused image generation feature includes:

determining a first image generation feature and a second image generation feature of an image generation module in the image generation network;

fusing the object image features with the first image generation feature to obtain first fused image generation features;

splicing the fusion image features with the second image generation feature to obtain second fused image generation features;

and taking the first fusion image generation feature and the second fusion image generation feature as fusion image generation features.

The first image generation feature may be understood as the feature map output by each network layer in the image generation network, for example, the feature map output by each layer of the UNet model, where the values of H, W and C of the feature maps differ from layer to layer. The second image generation feature may be understood as the feature map output by a specific network layer in the image generation network, for example, the feature map output by a decoder layer in the UNet model.

Continuing the above example, after the ID token and the detail feature map are obtained, the ID token and the detail feature map undergo feature recombination with the feature maps output by the network layers of the UNet model in the diffusion model, which may be divided into two parts. The first part is: the ID token is fused, through a cross-attention module (Cross-attention), with the pre-trained and initialized Stable Diffusion model. The specific contents are as follows:

The diffusion model (Stable Diffusion) provided in the present specification may use the ID tokens as guidance for generating the result. Meanwhile, the diffusion model has a UNet part, the output of each layer of the UNet model is a feature map of size HxWxC, and the values of H, W and C differ from layer to layer. Therefore, after encoding the pet dog image into the ID token, the image processing method provided in the present specification performs cross-attention fusion (Cross-attention) between the feature map of each layer of the UNet model and the encoded ID token.
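
To make the cross-attention fusion concrete, the sketch below computes scaled dot-product attention in which each spatial position of a (flattened) UNet feature map acts as a query and the ID tokens act as both keys and values. This is a simplified illustration under stated assumptions: the learned query/key/value projections of a real cross-attention layer are omitted, the feature map and the tokens are assumed to share the same dimension, and `cross_attention` is a hypothetical name.

```python
import math


def cross_attention(queries, tokens):
    """Scaled dot-product cross-attention with keys = values = tokens.

    queries: N vectors (flattened H*W feature-map positions), each of length d.
    tokens:  M ID-token vectors, each of length d.
    Learned projections are omitted for brevity (illustrative assumption).
    """
    d = len(tokens[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
        )
    return attended


feature_map = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # 3 positions, d = 2
id_tokens = [[4.0, 0.0], [0.0, 4.0]]                # 2 tokens, d = 2
fused = cross_attention(feature_map, id_tokens)
```

Each output position is a weighted mixture of the ID tokens, with the weights determined by how well that position matches each token, so identity information is injected where it is most relevant.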

The second part is: in the decoder part (decoder layers) of the UNet model of the diffusion model (Stable Diffusion), the detail feature map is concatenated along the channel dimension with the feature map output by each decoding layer. The specific contents are as follows:

The diffusion model (Stable Diffusion) provided in the present specification has a UNet model part, the output of each network layer of the UNet model is a feature map of size HxWxC, and the values of H, W and C differ from layer to layer. The detail extraction module likewise produces a set of HxWxC detail feature maps corresponding to the layers of the UNet model. At a specific decoder layer, the method concatenates the two HxWxC feature maps into one HxWx2C feature map. The purpose is to feed the information extracted by the detail extraction module into the Stable Diffusion model as guidance, so that the model generates the specific background and a foreground with the specific detail features.
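
The channel-wise concatenation can be checked on small nested lists: two HxWxC maps with matching spatial size yield one HxWx2C map. The sketch below uses plain Python lists in height, width, channel order; the function name `concat_channels` and the toy sizes are assumptions for illustration.

```python
def concat_channels(decoder_map, detail_map):
    """Concatenate two H x W x C feature maps into one H x W x 2C map.

    Both inputs are nested lists indexed as [row][column][channel] and
    must have the same spatial size.
    """
    if (len(decoder_map) != len(detail_map)
            or len(decoder_map[0]) != len(detail_map[0])):
        raise ValueError("spatial sizes must match")
    return [
        [dec_px + det_px for dec_px, det_px in zip(dec_row, det_row)]
        for dec_row, det_row in zip(decoder_map, detail_map)
    ]


H, W, C = 2, 3, 4
decoder_map = [[[0.0] * C for _ in range(W)] for _ in range(H)]
detail_map = [[[1.0] * C for _ in range(W)] for _ in range(H)]
merged = concat_channels(decoder_map, detail_map)
```

Concatenation keeps both sources of information intact at every spatial location, leaving it to the subsequent layers to decide how to combine them.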

Based on the two partial contents for feature recombination, a recombined feature map (i.e. fusion image generation feature) can be obtained.

Step 4084: performing image generation processing on the fused image generation features by using the image generation module to obtain a target fusion image containing the target object and the scene image.

Continuing the above example, after the recombined feature map is determined, the UNet model is guided by the recombined feature map, during step-by-step denoising, to generate a customized image having the specific background and a foreground with the specific detail features, so that an image of the target object at the specified position in the scene image can be generated by the modified Stable Diffusion model. The form, action and viewing angle of the generated target object are adjusted according to the scene image, so that the generated target object interacts with the environment and blends in naturally.

In an embodiment provided in the present specification, before determining the scene image, the object image of the target object, and the object position information for placing the target object in the scene image, the method further includes:

determining a sample scene image, a sample object image of a sample object, object location information to place the sample object in the sample scene image, and a sample tag;

inputting the sample scene image, the sample object image and the object position information into an image processing model, and extracting the characteristics of the sample object image by using a first characteristic extraction network in the image processing model to obtain sample object image characteristics;

performing feature processing on the sample scene image, the sample object image and the object position information by using a second feature extraction network in the image processing model to obtain sample fusion image features;

obtaining a sample target fusion image containing the sample object and the sample scene image according to the sample object image characteristics and the sample fusion image characteristics;

training the image processing model based on the sample target fusion image and the sample label until a training stop condition is reached, to obtain a trained image processing model.

The sample scene image may be understood as a scene image used as a sample, and the sample object image may be understood as an object image used as a sample. The training stop condition can be set according to the actual application scene and is not specifically limited in this specification; for example, it may be that a specific number of training rounds has been completed or that the model loss has converged.

It should be noted that, for the training steps of the image processing model in the present disclosure, reference may be made to the corresponding content on the image processing model in the above embodiments, which will not be repeated here. By training the image processing model, the scene image, the object image and the object position information can subsequently be processed based on the trained image processing model, so that a real and vivid target fusion image can be generated.

The training of the image processing model based on the sample target fusion image and the sample label may be understood as calculating a loss function based on the sample target fusion image and the sample label, and training the image processing model through the loss function.

In an embodiment provided in the present specification, the determining a sample scene image, a sample object image of a sample object, object position information to place the sample object in the sample scene image, and a sample tag includes:

determining an image to be processed containing the sample object, and determining a first image to be processed, a second image to be processed and a third image to be processed according to the image to be processed;

extracting a sample object image from the first image to be processed, and taking the second image to be processed as a sample label;

clearing the sample object in the third image to be processed, and taking the third image to be processed after clearing the sample object as a sample scene image;

object position information to be placed in the sample scene image is set according to object position setting parameters.

Wherein the image to be processed can be understood as an image containing the sample object; the image to be processed may be one or more; the image to be processed may be a video frame extracted from a sample object video. The sample object video may be understood as video data containing a sample object, for example, a pet dog, and the sample object video may be video data containing pet dog play, pet dog walking, etc.

The first, second and third images to be processed may be understood as a plurality of images containing sample objects. The object states of the sample objects contained in the first to-be-processed image, the second to-be-processed image, and the third to-be-processed image are different. The object state can be understood as information of the gesture of the object, the scene in which the object is located, and the action shown.

In the case that the number of images to be processed is at least three, determining the first image to be processed, the second image to be processed and the third image to be processed according to the images to be processed may be understood as dividing the at least three images to be processed into the first, second and third images to be processed. In the case that the number of images to be processed is one or two, it may be understood as converting the images to be processed into three images, namely the first, second and third images to be processed, by means such as image copying, image flipping, image rotation and expansion, and adding image noise.

The object position setting parameter can be understood as a parameter for setting the object position information. In practical applications, the device side to which the image processing method provided in the present specification is applied can receive the object position setting parameter sent by a user terminal or another service side. The object position setting parameter may be coordinate information for placing the sample object in the sample scene image, based on which the device side may determine the object position information. Alternatively, in other embodiments provided herein, the method may present the sample scene image to the user via the user terminal, and the user may set the placement area of the sample object in the sample scene image. The user terminal transmits the placement area information (e.g., a position frame) as the object position setting parameter to the device side, thereby enabling the device side to determine the object position information from the placement area.

In the case that the image to be processed is one, determining the first image to be processed, the second image to be processed, and the third image to be processed according to the image to be processed may be understood as performing flipping and rotation expansion on one image, thereby obtaining a plurality of images.
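
As an illustration of expanding a single image into several, the sketch below derives a flipped and a rotated variant alongside a copy of the original. Which variant serves as the first, second or third image to be processed is not specified in the text, so the ordering returned by the hypothetical `expand_to_three` helper is an arbitrary assumption.

```python
def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def expand_to_three(image):
    """Derive three images from one: a copy, a flip and a rotation.

    The assignment of variants to the first/second/third image is an
    illustrative assumption.
    """
    return [row[:] for row in image], flip_horizontal(image), rotate_90_clockwise(image)


original = [[1, 2],
            [3, 4]]
first, second, third = expand_to_three(original)
```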

For example, the above embodiment is described by taking as an example the image processing method provided in the present specification obtaining training samples from a pet dog play video. Referring to fig. 5, fig. 5 is a schematic diagram of a training sample processing procedure in an image processing method according to an embodiment of the present disclosure, in which the sample object video is a pet dog play video. The model training performed by the method is supervised training, and the training data requires pictures of the same target under different viewing angles, postures and actions. On this basis, the method can collect training data from a large-scale video data set, and can additionally use static image data, from which images of the same target in different states are obtained by expansion methods such as rotation and distortion. The specific contents are as follows: a video showing various actions of one pet dog is determined, and two frames in which the state of the pet dog differs are extracted from the video. See fig. 5 for the specific process. First, the target mask (shown as the black pet dog) in the video frame is acquired. Then, for one frame (the left image), the target is cropped out according to the target mask to obtain a reference image; for the other frame (the right image), the mask area is expanded to a position frame, which is filled with white, to obtain a background image. The left image, which is not filled with white, is used as a supervision image. The mask labels are segmentation mask labels, which can be annotated manually or extracted by a model. Thus, a reference image, a background image and a supervision image are obtained. The training process takes the reference image, the background image and the position frame as input, and supervises the model to generate the supervision image. It should be noted that, for a still image, the single image may be flipped, rotated or otherwise expanded to obtain, by analogy, a pair of images equivalent to two video frames.
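
The step of expanding the mask area to a white-filled position frame can be sketched as follows: compute the tight bounding box of the mask and overwrite that rectangle in a copy of the frame. The helper names `mask_to_box` and `make_background`, the `(top, left, bottom, right)` convention with exclusive bottom/right bounds, and the use of a tight box without extra margin are all assumptions for illustration.

```python
def mask_to_box(mask):
    """Return the tight bounding box (top, left, bottom, right) of nonzero mask pixels.

    bottom and right are exclusive (assumed convention).
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def make_background(frame, mask, fill=255):
    """Fill the mask's bounding box in a copy of `frame` with `fill` (white)."""
    top, left, bottom, right = mask_to_box(mask)
    background = [row[:] for row in frame]
    for y in range(top, bottom):
        for x in range(left, right):
            background[y][x] = fill
    return background, (top, left, bottom, right)


frame = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 1, 0]]
background, box = make_background(frame, mask)
```

The returned box doubles as the position frame supplied to the model, so the background image and the object position information stay consistent by construction.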

In another embodiment provided in the present specification, the method provides a time-step sampling strategy. Specifically, for video data, the time step is sampled from the range [0, 1000] with 50% probability and from the range [500, 1000] with the other 50% probability. For still image data, the time step is sampled from the range [0, 1000] with 50% probability and from the range [0, 500] with the other 50% probability.
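The sampling strategy above can be sketched with the Python standard library as follows. The function name `sample_timestep` is hypothetical, and treating the ranges as inclusive integer intervals with uniform sampling inside each range is an assumption; the specification gives only the ranges and the 50% split.

```python
import random


def sample_timestep(is_video, rng, t_max=1000):
    """Sample a diffusion time step with the two-branch strategy described above."""
    half = t_max // 2
    if rng.random() < 0.5:
        # First branch (50%): the full range, for both data types.
        low, high = 0, t_max
    elif is_video:
        # Second branch for video data: the later half, [500, 1000].
        low, high = half, t_max
    else:
        # Second branch for still image data: the earlier half, [0, 500].
        low, high = 0, half
    # Assumption: uniform integer sampling, both endpoints inclusive.
    return rng.randint(low, high)
```

Under these assumptions, roughly three quarters of the time steps drawn for video data fall in [500, 1000], whereas roughly three quarters of those drawn for still images fall in [0, 500].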

In the image processing method provided by the present specification, in the process of generating the target fusion image using the image processing model, the scene image, the object image of the target object, and the object position information for placing the target object in the scene image may be input together into the image processing model. Therefore, in the process of generating the target fusion image, the position of the target object in the target fusion image can be guided by the object position information, so that a realistic and vivid target fusion image is generated according to the object image features and the fusion image features, and the problem that the position of the target object in the target fusion image cannot be accurately controlled is avoided.

The image processing method provided in the present specification is further described below with reference to fig. 6, taking as an example its application in a customized image generation/synthesis scenario without fine-tuning. Fig. 6 is a flowchart illustrating a processing procedure of an image processing method according to an embodiment of the present disclosure. The image processing method provides a customized generation method: given a reference image (Reference), i.e., the object image in the above embodiment; a scene image (Scene), i.e., the scene image in the above embodiment; and a position box on the scene image, i.e., the object position information in the above embodiment, diverse images (i.e., target fusion images in the above embodiment) of the reference target (i.e., the target object) at the specified position in the scene image can be generated. In the model inference process, the inputs of the algorithm model are: a reference image (Reference), denoted R; a scene image (Scene), denoted S; and the position box in which the target object is desired to appear in the scene image, denoted B. The method specifically includes the following steps.

Step 602: Perform foreground segmentation. The given reference image (Reference) is input into the segmentation module (Seg) of the model to filter out the background, and a segmented reference image, denoted R_s, is obtained.

Step 604: Extract low-frequency features. R_s is input into a low-pass filtering module (pooling) for processing, resulting in a 224x224x3 image, denoted R_l.

Step 606: Extract ID tokens. R_l is input into an ID extraction module to obtain the ID tokens.

Step 608: Extract high-frequency features. The edge gradient of R_s, which constitutes the high-frequency feature, is extracted using a Sobel operator, and a high-frequency feature map R_h is obtained.

Step 610: Perform image collage (Collage). R_h is pasted at the specified position on the scene image S according to the position box B, and a collage, denoted C, is obtained.

Step 612: The collage C is input into a detail extraction module for detail feature extraction to obtain a detail feature map F_d.

Step 614: The ID tokens and the detail feature map F_d are input into a pre-trained diffusion model for feature recombination, and the diffusion model after feature recombination is used to generate a customized image of the target object in the scene image.
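Steps 604, 608 and 610 describe plain image operations, which can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the modules of the embodiment: the function names are hypothetical, the pooling is assumed to be average pooling over evenly spaced bins, the Sobel responses are assumed to be combined per channel into a gradient magnitude, and R_h is assumed to be resized to the position box by nearest-neighbour sampling. The learned modules (segmentation, ID extraction, detail extraction and the diffusion model) are not covered.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T


def low_pass_pool(r_s, out_size=224):
    """Step 604 sketch: average-pool an H x W x C image to out_size x out_size x C."""
    h, w, c = r_s.shape
    idx = np.arange(out_size)
    # Bin edges: floor for the start of each bin, ceil for its end.
    y0, y1 = (idx * h) // out_size, -((-(idx + 1) * h) // out_size)
    x0, x1 = (idx * w) // out_size, -((-(idx + 1) * w) // out_size)
    r_l = np.empty((out_size, out_size, c), dtype=np.float32)
    for i in range(out_size):
        band = r_s[y0[i]:y1[i]].astype(np.float32).mean(axis=0)  # W x C
        for j in range(out_size):
            r_l[i, j] = band[x0[j]:x1[j]].mean(axis=0)
    return r_l


def filter3x3(image, kernel):
    """Correlate each channel with a 3x3 kernel, using edge-replicated padding."""
    h, w, _ = image.shape
    padded = np.pad(image.astype(np.float32), ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_high_frequency(r_s):
    """Step 608 sketch: per-channel Sobel gradient magnitude of R_s."""
    gx = filter3x3(r_s, SOBEL_X)
    gy = filter3x3(r_s, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


def make_collage(scene, r_h, box):
    """Step 610 sketch: paste R_h into a copy of S inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    # Nearest-neighbour resize of R_h to the box size (an assumed choice).
    rows = np.arange(box_h) * r_h.shape[0] // box_h
    cols = np.arange(box_w) * r_h.shape[1] // box_w
    collage = scene.astype(np.float32)  # astype returns a copy by default
    collage[y0:y1, x0:x1] = r_h[rows][:, cols]
    return collage
```

In this sketch, `low_pass_pool(R_s)` would feed the ID extraction module of step 606, while `make_collage(S, sobel_high_frequency(R_s), B)` would produce the collage C that is passed to the detail extraction module in step 612.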

Specifically, after the ID tokens and the detail feature map are obtained, the scheme recombines them with the feature maps output by the network layers of the Unet model in the diffusion model. This may be divided into two parts.

The first part is: the ID tokens are fused, by means of a cross-attention module (Cross-attention), with a Stable Diffusion model initialized from pre-trained weights. The specific contents are as follows:

A recombined feature map is obtained through feature recombination based on the above two parts. Finally, based on the recombined feature map, the Unet model is guided, in the process of removing noise step by step, to generate a customized image with the specific background and a foreground having the specific detail features.

The image processing method provided by the specification can be understood as a customized image generation/image synthesis method that requires no fine-tuning. In the process of customized image generation/image synthesis, a large-scale self-supervised representation model (such as DINO V2) is used as the ID extractor. In addition, a detail prior is provided to the model in the form of a collage (Collage), which includes extracting information from the high-frequency components of the collaged portion. Therefore, the input target object can be fused into a given position of any scene image, and the form, action and viewing angle of the generated target are adjusted according to the background scene, so that the generated target object interacts with the environment and blends in naturally.

Meanwhile, in the model training process, images of the same target in different postures, scenes and actions are extracted from large-scale video data, so that a large number of customized training samples are constructed; the training data sampling method can combine large-scale image data and video data to provide repeated sample data for model training.

Another image processing method provided in an embodiment of the present disclosure is applied to a cloud-side device, and specifically includes the following steps.

and sending the target fusion image to the end-side equipment.

The cloud-side device may be understood as a device that is located in the cloud and is capable of providing cloud services for the end-side device. For example, the cloud-side device may be one or more servers or one or more hosts. In an embodiment provided in the present specification, the cloud-side device may further consist of a cloud-side computing device and/or a cloud-side storage device. The cloud-side computing device may be understood as a device that is located in the cloud and is capable of providing computing services for the end-side device; for example, the cloud-side computing device may be one or more servers. The cloud-side storage device may be understood as a device that is located in the cloud and is capable of providing storage services for the end-side device, such as one or more database storage servers, cloud disks, etc. The end-side device may be understood as a device that exists as the counterpart of the cloud-side device and is capable of using cloud services provided by the cloud-side device. The end-side device includes, but is not limited to, a client, a terminal, a computer, a server, a mobile phone, an intelligent mobile device, and the like.

Specifically, the image processing method provided in the present specification can be applied to a cloud-side device. When an image processing request sent by an end-side device is received, carrying data such as a scene image, an object image of a target object, and object position information for placing the target object in the scene image, the scene image, the object image, and the object position information can be acquired and input into an image processing model. Feature extraction is performed on the object image by using a first feature extraction network in the image processing model to obtain object image features; feature processing is performed on the scene image, the object image and the object position information by using a second feature extraction network in the image processing model to obtain fusion image features; then, the image processing model is used to perform image generation processing on the object image features and the fusion image features, so that a target fusion image containing the target object and the scene image is obtained, and the target fusion image is sent to the end-side device so as to be displayed to a user.

The above is a schematic scheme of another image processing method of the present embodiment. It should be noted that the technical solution of the other image processing method and the technical solution of the one image processing method belong to the same concept; for details of the technical solution of the other image processing method that are not described in detail, reference may be made to the corresponding description in the technical solution of the one image processing method, which is not repeated herein.

Based on this, in another image processing method applied to a cloud-side device provided in the present specification, in the process of generating a target fusion image using an image processing model, a scene image provided by an end-side device, an object image of a target object, and object position information to be placed in the scene image may be input together to the image processing model for processing. Therefore, in the process of generating the target fusion image, the position of the target object in the target fusion image can be guided by utilizing the object position information, so that a real and vivid target fusion image is generated according to the characteristics of the object image and the characteristics of the fusion image, and the problem that the position of the target object in the target fusion image cannot be accurately controlled is avoided. And the target fusion image is sent to the end side device, so that smooth operation of an image processing model is ensured, and the cloud side device can stably provide image processing service for the end side device.

Corresponding to the above method embodiments, the present disclosure further provides an image processing apparatus embodiment, and fig. 7 shows a schematic structural diagram of an image processing apparatus according to one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:

An image determination module 702 configured to determine a scene image, an object image of a target object, and object position information to place the target object in the scene image;

a first feature extraction module 704 configured to input the scene image, the object image, and the object position information into an image processing model, and perform feature extraction on the object image by using a first feature extraction network in the image processing model to obtain object image features;

a second feature extraction module 706 configured to perform feature processing on the scene image, the object image, and the object position information using a second feature extraction network in the image processing model to obtain a fused image feature;

an image generation module 708 configured to obtain a target fusion image comprising the target object and the scene image from the object image features and the fusion image features.

Optionally, the first feature extraction module 704 is further configured to:

inputting the object image into a first feature extraction network in the image processing model, and performing image processing on the object image through an image preprocessing module in the first feature extraction network to obtain a preprocessed object image;

and carrying out feature extraction on the preprocessed object image through an image extraction module in the first feature extraction network to obtain object image features of the object image.

Optionally, the first feature extraction module 704 is further configured to:

Optionally, the second feature extraction module 706 is further configured to:

fusing the scene image, the object image and the object position information by using a second feature extraction network in the image processing model to obtain a fused image;

and carrying out feature extraction on the fused image by utilizing a fused image feature extraction module in the second feature extraction network to obtain fused image features.

Optionally, the second feature extraction module 706 is further configured to:

Optionally, the image generation module 708 is further configured to:

determining image generation characteristics of an image generation module in the image generation network, and fusing the image generation characteristics, the object image characteristics and the fused image characteristics to obtain fused image generation characteristics;

and carrying out image generation processing on the fusion image generation characteristics by utilizing the image generation module to obtain a target fusion image containing the target object and the scene image.

Optionally, the image generation module 708 is further configured to:

Optionally, the image processing apparatus further comprises a model training module configured to:

Optionally, the model training module is further configured to:

The image processing apparatus provided in the present specification may, in the process of generating the target fusion image using the image processing model, input the scene image, the object image of the target object, and the object position information for placing the target object in the scene image together into the image processing model. Therefore, in the process of generating the target fusion image, the position of the target object in the target fusion image can be guided by the object position information, so that a realistic and vivid target fusion image is generated according to the object image features and the fusion image features, and the problem that the position of the target object in the target fusion image cannot be accurately controlled is avoided.

The above is a schematic scheme of an image processing apparatus of the present embodiment. It should be noted that, the technical solution of the image processing apparatus and the technical solution of the image processing method belong to the same concept, and details of the technical solution of the image processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the image processing method.

Corresponding to the method embodiment, the present specification also provides another embodiment of an image processing apparatus, where the apparatus is applied to a cloud side device, and includes:

Based on this, in another image processing apparatus applied to a cloud-side device provided in the present specification, in generating a target fusion image using an image processing model, a scene image provided by an end-side device, an object image of a target object, and object position information to be placed in the scene image may be input together to the image processing model for processing. Therefore, in the process of generating the target fusion image, the position of the target object in the target fusion image can be guided by utilizing the object position information, so that a real and vivid target fusion image is generated according to the characteristics of the object image and the characteristics of the fusion image, and the problem that the position of the target object in the target fusion image cannot be accurately controlled is avoided. And the target fusion image is sent to the end side device, so that smooth operation of an image processing model is ensured, and the cloud side device can stably provide image processing service for the end side device.

The above is a schematic scheme of another image processing apparatus of the present embodiment. It should be noted that the technical solution of the other image processing apparatus and the technical solution of the other image processing method belong to the same concept; for details of the technical solution of the other image processing apparatus that are not described in detail, reference may be made to the corresponding description in the technical solution of the other image processing method, which is not repeated herein.

Fig. 8 illustrates a block diagram of a computing device 800 provided in accordance with one embodiment of the present description. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830 and database 850 is used to hold data.

Computing device 800 also includes access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 840 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.

Computing device 800 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.

Wherein the processor 820 is configured to execute computer-executable instructions that, when executed by the processor 820, perform the steps of the two image processing methods described above.

The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solutions of the two image processing methods belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solutions of the two image processing methods.

An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the two image processing methods described above.

The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solutions of the two image processing methods belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solutions of the two image processing methods.

An embodiment of the present specification also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the two image processing methods described above.

The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solutions of the two image processing methods belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solutions of the two image processing methods.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims

1. An image processing method, comprising:

2. The image processing method according to claim 1, wherein the feature extraction of the object image by using the first feature extraction network in the image processing model, to obtain object image features, includes:

3. The image processing method according to claim 2, the inputting the object image into the first feature extraction network in the image processing model, performing image processing on the object image by the image preprocessing module in the first feature extraction network, obtaining a preprocessed object image, comprising:

4. The image processing method according to claim 1 or 2, wherein the feature extraction of the object image by using the first feature extraction network in the image processing model, before obtaining the object image feature, further comprises:

5. The image processing method according to claim 1, wherein the feature processing of the scene image, the object image, and the object position information using the second feature extraction network in the image processing model to obtain the fused image feature includes:

6. The image processing method according to claim 5, wherein the fusing the scene image, the object image, and the object position information using the second feature extraction network in the image processing model to obtain a fused image, comprises:

7. The image processing method according to claim 6, wherein the feature extraction of the object image by using the feature extraction module to be fused in the second feature extraction network, to obtain the feature of the image to be fused, includes:

8. The image processing method according to claim 1, the obtaining a target fusion image including the target object and the scene image from the object image feature and the fusion image feature, comprising:

9. The image processing method according to claim 8, wherein the processing the object image feature and the fused image feature using the image generation network in the image processing model to obtain a target fused image including the target object and the scene image, comprises:

10. The image processing method according to claim 9, wherein the determining the image generation feature of the image generation module in the image generation network, and fusing the image generation feature, the object image feature, and the fused image feature to obtain a fused image generation feature, includes:

11. The image processing method according to claim 1, said determining a scene image, an object image of a target object, and object position information to place the target object in the scene image, further comprising:

12. The image processing method of claim 11, the determining a sample scene image, a sample object image of a sample object, object location information to place the sample object in the sample scene image, and a sample tag, comprising:

13. An image processing method applied to cloud side equipment comprises the following steps:

and sending the target fusion image to the end-side equipment.

14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the image processing method of any one of claims 1 to 12 and the image processing method of claim 13.