CN117710234B - Picture generation method, device, equipment and medium based on large model

Info

Publication number: CN117710234B
Application number: CN202410166499.7A
Authority: CN
Inventors: 邓邱伟; 王迪; 苏明月; 尹飞; 孙涛; 王中飞
Original assignee: Qingdao Haier Technology Co Ltd; Qingdao Haier Intelligent Home Appliance Technology Co Ltd; Haier Uplus Intelligent Technology Beijing Co Ltd
Current assignee: Qingdao Haier Technology Co Ltd; Qingdao Haier Intelligent Home Appliance Technology Co Ltd; Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date: 2024-02-06
Filing date: 2024-02-06
Publication date: 2024-05-24
Anticipated expiration: 2044-02-06
Also published as: CN117710234A

Abstract

The present application provides a method, device, equipment and medium for generating a picture based on a large model, and relates to the field of smart home/smart family technology. The method includes: inputting a first background picture and a text description into a position prediction model, and predicting the target position information of a first target object in the first background picture; obtaining, from a target picture, a second target object of the same type as the first target object, and segmenting the second target object from the target picture according to the mask of the second target object; determining the scaling ratio of the segmented second target object according to the size relationship between the target position size in the target position information and the segmented second target object; and fusing the segmented second target object into the first background picture according to the target position information and the scaling ratio. The method can accurately fuse the target product into the background picture.

Description

Picture generation method, device, equipment and medium based on large model

Technical Field

The application relates to the technical field of smart homes/smart families, and in particular to a picture generation method, device, equipment and medium based on a large model.

Background

Picture generation technology based on large models can be applied in various fields, including multimedia, commercial promotion, and entertainment; for example, when picture generation is used for product marketing, the target commodity needs to be fused into background pictures of various themes.

In the prior art, when a target commodity is fused with a background picture without manual intervention, the fused picture may be unreasonable: for example, a refrigerator is placed on a table, or a television is left floating in mid-air instead of being placed on a television cabinet. With manual intervention, such unreasonable fused pictures can be avoided to some extent, but additional operation steps are required.

Therefore, how to reasonably fuse target commodities and background pictures without manual intervention is a problem to be solved at present.

Disclosure of Invention

The application provides a picture generation method, device, equipment and medium based on a large model, which are used for solving the prior-art problem that fused pictures generated without manual intervention are prone to being unreasonable.

In a first aspect, the present application provides a method for generating a picture based on a large model, including:

Inputting a first background picture and text description into a position prediction model, and predicting to obtain target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture;

acquiring, from a target picture, a second target object of the same type as the first target object, and segmenting the second target object from the target picture according to a mask of the second target object;

determining the scaling of the segmented second target object according to the size relation between the target position size in the target position information and the segmented second target object;

and fusing the segmented second target object into the first background picture according to the target position information and the scaling.

In one possible implementation manner, the inputting the first background picture and the text description into the position prediction model predicts the target position information of the first target object in the first background picture, including:

Acquiring global picture features of the first background picture and global text features of the text description through a decoder in the position prediction model, wherein the decoder comprises a picture decoder and a text decoder;

performing feature alignment and fusion on the global picture features and the global text features to obtain first fusion features;

And carrying out position regression on the first fusion characteristic, and predicting to obtain target position information of the first target object in the first background picture.

In one possible implementation manner, the performing feature alignment and fusion on the global picture feature and the global text feature to obtain a first fusion feature includes:

Dividing the global picture feature into local picture features, dividing the global text feature into first local text features and word frequency information, wherein the word frequency information comprises occurrence frequencies of word features associated with the first target object in the first local text features;

Performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature;

Based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features;

and carrying out feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.
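The de-duplication step above can be illustrated with a minimal sketch. It assumes, purely for readability, that each first local text feature is represented by its word; the function name `deduplicate_local_text_features` is hypothetical, and the patent text does not specify how the word frequency information is computed or applied.

```python
from collections import Counter


def deduplicate_local_text_features(first_local_features):
    """Return (second_local_features, word_frequency).

    Assumption: each local text feature is represented here by its word.
    A word whose frequency is greater than 1 is kept only at its first
    occurrence; the order of the remaining words is preserved.
    """
    word_frequency = Counter(first_local_features)
    seen = set()
    second_local_features = []
    for word in first_local_features:
        if word_frequency[word] > 1 and word in seen:
            continue  # repeated word: keep only its first occurrence
        seen.add(word)
        second_local_features.append(word)
    return second_local_features, word_frequency


features, frequency = deduplicate_local_text_features(
    ["refrigerator", "left", "of", "sofa", "refrigerator"]
)
```

In a real model the local features would be vectors rather than strings, and the de-duplication criterion could be a similarity threshold instead of exact equality.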

In one possible implementation, the global feature alignment of the local picture feature and the second local text feature based on the attention mechanism includes:

Confirming a query vector of the attention mechanism according to the second local text feature and the projection of the second local text feature on the local picture feature;

confirming a key vector and a value vector of the attention mechanism according to the local picture feature and the projection of the local picture feature on the second local text feature;

According to the query vector, the key vector and the value vector, local attention weight matrices are obtained, and the local attention weight matrices are aggregated into a global attention weight matrix;

and according to the global attention weight matrix, carrying out global feature alignment on the local picture features and the second local text features.
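As a rough illustration of the attention-based alignment described above, the following sketch uses standard scaled dot-product cross-attention, with queries taken from the text side and keys and values taken from the picture side. This is an assumption: the patent does not give the exact formula, and the learned projections it mentions are replaced here by identity projections. The names `softmax` and `cross_attention` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, picture_feats):
    """Align local text features with local picture features.

    Queries come from the text features; keys and values come from the
    picture features. Projection matrices are omitted (identity assumed).
    Returns the stacked attention weights and the aligned features.
    """
    dim = len(text_feats[0])
    weights = []
    for query in text_feats:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in picture_feats
        ]
        weights.append(softmax(scores))  # one local attention weight row
    aligned = [
        [sum(w * value[d] for w, value in zip(row, picture_feats))
         for d in range(dim)]
        for row in weights
    ]
    # The stacked rows play the role of the global attention weight matrix.
    return weights, aligned


weights, aligned = cross_attention(
    text_feats=[[1.0, 0.0]],
    picture_feats=[[1.0, 0.0], [0.0, 1.0]],
)
```

Each row of `weights` sums to 1, so every text feature is expressed as a weighted mixture of the picture features it attends to.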

In a possible implementation manner, the performing feature fusion on the aligned global picture feature and the global text feature by using a cross-modal feature fusion component to obtain the first fusion feature includes:

inputting the aligned global picture features into a first layer structure of the cross-modal feature fusion component, and inputting the aligned global text features into a second layer structure of the cross-modal feature fusion component, wherein the first layer structure comprises a plurality of first network layers, and the second layer structure comprises a plurality of second network layers;

Fusing a first output of each non-last first network layer with a second output of the corresponding non-last second network layer to obtain a fusion input, wherein the first output and the fusion input are fused and fed into the next first network layer, and the second output and the fusion input are fused and fed into the next second network layer;

and fusing the third output of the last first network layer and the fourth output of the last second network layer to obtain the first fusion characteristic.
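The layer-by-layer exchange described above can be sketched as two parallel stacks that share a fused signal between layers. This is a minimal sketch under stated assumptions: the fusion operator is not specified in the text, so an element-wise mean (`fuse`) stands in for it, and the network layers are plain Python callables rather than trained layers.

```python
def fuse(a, b):
    """Element-wise mean; a stand-in for the unspecified fusion operator."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def cross_modal_fusion(picture_feat, text_feat, picture_layers, text_layers):
    """Run two parallel layer stacks that exchange a fused signal.

    For every non-last layer pair, the two outputs are fused into a fusion
    input, which is then fused with each output to form the input of the
    next layer on that side. The outputs of the last pair are fused into
    the first fusion feature.
    """
    if not picture_layers or len(picture_layers) != len(text_layers):
        raise ValueError("both stacks need the same non-zero number of layers")
    pic_in, txt_in = picture_feat, text_feat
    last = len(picture_layers) - 1
    for i, (pic_layer, txt_layer) in enumerate(zip(picture_layers, text_layers)):
        pic_out, txt_out = pic_layer(pic_in), txt_layer(txt_in)
        fusion = fuse(pic_out, txt_out)
        if i == last:
            return fusion  # first fusion feature
        pic_in = fuse(pic_out, fusion)
        txt_in = fuse(txt_out, fusion)


def double(v):
    return [2 * x for x in v]


def identity(v):
    return list(v)


first_fusion_feature = cross_modal_fusion(
    [1.0, 2.0], [3.0, 4.0], [double, double], [identity, identity]
)
```

Other fusion operators (concatenation followed by a linear layer, gated sums, and so on) would fit the same control flow; only `fuse` would change.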

In a possible implementation manner, before the performing position regression on the first fusion feature and predicting the target position information of the first target object in the first background picture, the method further includes:

Acquiring a triplet loss function and keyword confidence in the text description, and establishing a loss function of the position prediction model according to the keyword confidence and the triplet loss function; wherein the loss function is used to train the position prediction model.

In one possible implementation, the obtaining the triplet loss function in the text description includes:

acquiring a second fusion characteristic obtained in a training process, and confirming a first distance between an anchoring sample and a positive sample in the second fusion characteristic and a second distance between the anchoring sample and a negative sample in the second fusion characteristic, wherein the positive sample is a sample of the same type as the anchoring sample, and the negative sample is a sample of a different type from the anchoring sample;

And confirming the triplet loss function according to a preset constant, the first distance and the second distance.
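For illustration, the commonly used form of the triplet loss is sketched below, with the preset constant acting as the margin. The exact formula is not given in this passage, so the hinge form `max(d_pos - d_neg + margin, 0)` and the Euclidean distance are assumptions, as are the function names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge-style triplet loss (assumed form).

    first_distance:  anchor sample to positive sample (same type)
    second_distance: anchor sample to negative sample (different type)
    margin:          the preset constant
    """
    first_distance = euclidean(anchor, positive)
    second_distance = euclidean(anchor, negative)
    return max(first_distance - second_distance + margin, 0.0)


easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative is far away
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])  # negative is closer
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, which is what pushes same-type samples together in the fused feature space.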

In one possible implementation manner, the building the loss function of the location prediction model according to the keyword confidence and the triplet loss function includes:

Acquiring a first loss coefficient according to the first weight and the keyword confidence coefficient;

Acquiring a second loss coefficient according to the second weight and the triplet loss function;

and obtaining a loss function of the position prediction model according to the first loss coefficient and the second loss coefficient.
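One simple reading of the steps above is a weighted sum, sketched below. The passage does not say how each weight is combined with its term or how the two coefficients are combined, so multiplication and addition are assumptions here, as are the default weights and the name `position_model_loss`.

```python
def position_model_loss(keyword_confidence, triplet_loss_value,
                        first_weight=0.5, second_weight=0.5):
    """Combine keyword confidence and triplet loss into one training loss.

    Assumed form: each coefficient is weight * term, and the total loss is
    the sum of the two coefficients. The source does not fix these operators.
    """
    first_loss_coefficient = first_weight * keyword_confidence
    second_loss_coefficient = second_weight * triplet_loss_value
    return first_loss_coefficient + second_loss_coefficient


total = position_model_loss(0.8, 0.5, first_weight=0.25, second_weight=0.75)
```

In practice the confidence term might enter through a transformation such as a negative log, so that higher confidence lowers the loss; that choice is left open here.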

In one possible implementation, before the inputting the first background picture and the text description into the position prediction model, the method further includes:

acquiring a second background picture in a database;

pre-labeling the position of a third target object in the second background picture to obtain a semantic text;

And carrying out data cleaning on the second background picture and the semantic text, and training the second background picture and the semantic text after data cleaning as the input of an initial position prediction model to obtain the position prediction model.

In one possible implementation manner, the segmenting the second target object from the target picture according to the mask of the second target object includes:

Determining the second target object according to preset target parameters, wherein the preset target parameters are used for selecting the second target object;

If a plurality of second target objects exist, segmenting each second target object from the target picture according to the mask of each second target object;

the fusing the segmented second target object into the first background picture includes:

Acquiring the first background picture matched with the plurality of second target objects, and fusing the plurality of segmented second target objects into the first background picture based on a cross-modal generation large model.

In one possible implementation manner, the segmenting each second target object from the target picture according to the mask of each second target object includes:

Performing semantic segmentation on the target picture to obtain the mask of each second target object;

And segmenting each second target object from the target picture according to the mask of each second target object.

In a second aspect, the present application provides a large model-based picture generation apparatus, including:

The prediction module is used for inputting the first background picture and the text description into the position prediction model, and predicting to obtain the target position information of the first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture;

The segmentation module is used for acquiring, from a target picture, a second target object of the same type as the first target object, and segmenting the second target object from the target picture according to the mask of the second target object;

the scaling module is used for confirming the scaling of the segmented second target object according to the size relation between the target position size in the target position information and the segmented second target object;

and the fusion module is used for fusing the segmented second target object into the first background picture according to the target position information and the scaling.

In a third aspect, the present application provides large model-based picture generation equipment, including: at least one processor and a memory;

the memory stores computer-executable instructions;

The at least one processor executes the computer-executable instructions stored by the memory, causing the at least one processor to perform the large model-based picture generation method as described above.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a large model based picture generation method as described above.

According to the picture generation method, device, equipment and medium based on the large model, a first background picture and text description are input into a position prediction model, and target position information of a first target object in the first background picture is obtained through prediction; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture; acquiring a second target object which is the same type as the first target object in a target picture, and dividing the second target object from the target picture according to a mask of the second target object; confirming the scaling of the segmented second object according to the size relation between the target position size in the target position information and the segmented second object; and fusing the segmented second target object into the first background picture according to the target position information and the scaling.

In the method, a position prediction model that can predict the position of a target object in a background picture is trained in advance, and the position prediction model predicts the position information of the target object according to the global picture features of the first background picture and the global text features of the text description.

The first background picture to be fused and the text description are input into the position prediction model, and the position prediction model outputs the target position information of a first target object in the first background picture, where the first target object is a target object of a given type that has been determined to be fused into the first background picture. The target picture contains a second target object that needs to be fused into the first background picture, and the type of the second target object is the same as that of the first target object. After the second target object is extracted from the target picture, the scaling of the second target object is determined according to the size relation between the second target object and the target position in the target position information, and the segmented second target object is fused into the first background picture according to the target position information and the scaling. The target position information is the predicted placement position of the second target object, and the scaling is the ratio by which the second target object is to be resized, so that the second target object can finally be fused into the first background picture in a reasonable way.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a hardware environment diagram of a large model-based picture generation method provided by an embodiment of the application;

Fig. 2 is a schematic flow chart of a large model-based picture generation method according to an embodiment of the present application;

Fig. 3 is a second schematic flow chart of a large model-based picture generation method according to an embodiment of the present application;

Fig. 4 is a third schematic flow chart of a large model-based picture generation method according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a cross-modal feature fusion component provided by an embodiment of the present application;

Fig. 6 is a fourth schematic flow chart of a large model-based picture generation method according to an embodiment of the present application;

Fig. 7 is a schematic diagram of a large model-based picture generation device according to an embodiment of the present invention;

Fig. 8 is a hardware schematic diagram of a large model-based picture generation device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In the prior art, when a target commodity is fused with a background picture without manual intervention, the fused picture may be unreasonable: for example, a refrigerator is placed on a table, or a television is left floating in mid-air instead of being placed on a television cabinet. With manual intervention, such unreasonable fused pictures can be avoided to some extent, but the number of operation steps and the labor cost increase.

In order to solve the problems, the application provides a picture generation method based on a large model.

The following describes the implementation process of a large model-based picture generation method according to the present application with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a hardware environment diagram of a large model-based picture generation method provided by an embodiment of the application. According to an aspect of an embodiment of the present application, a method for generating a picture based on a large model is provided. The method is widely applicable to whole-house intelligent digital control scenarios such as the smart home (Smart Home), smart home device ecosystems, and intelligent house (Intelligence House) ecosystems. Optionally, in the present embodiment, the picture generation method may be applied to a hardware environment constituted by the terminal device 102 and the server 104 as shown in Fig. 1. As shown in Fig. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or for a client installed on the terminal. A database may be set up on the server or independently of the server to provide data storage services for the server 104, and cloud computing and/or edge computing services may be configured on the server or independently of the server to provide data computing services for the server 104.

The network may include, but is not limited to, at least one of: a wired network and a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, and a local area network; the wireless network may include, but is not limited to, at least one of: Wi-Fi (Wireless Fidelity) and Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, an intelligent air conditioner, an intelligent range hood, an intelligent refrigerator, an intelligent oven, an intelligent cooking range, an intelligent washing machine, an intelligent water heater, an intelligent washing device, an intelligent dish washer, an intelligent projection device, an intelligent television, an intelligent clothes hanger, an intelligent curtain, an intelligent video, an intelligent socket, an intelligent sound box, an intelligent fresh air device, an intelligent kitchen and toilet device, an intelligent bathroom device, an intelligent sweeping robot, an intelligent window cleaning robot, an intelligent mopping robot, an intelligent air purifying device, an intelligent steam box, an intelligent microwave oven, an intelligent kitchen appliance, an intelligent purifier, an intelligent water dispenser, an intelligent door lock, and the like.

Fig. 2 is a schematic flow chart of a large model-based picture generation method according to an embodiment of the present application. As shown in fig. 2, the method includes:

S201, inputting a first background picture and a text description into a position prediction model, and predicting to obtain target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture.

The position prediction model is a pre-trained model for predicting the placement position of a target object in a background picture; during training, it mainly relies on feature information to determine the position of the target object in the background picture. The position prediction model may take both the first background picture and the text description as input, or may take only the first background picture as input.

During prediction, the position prediction model likewise extracts feature information from its input and predicts the target position information according to that feature information. The input comprises the first background picture and the text description, and the feature information comprises the global picture features of the first background picture and the global text features of the text description.

The first background picture generally does not contain the first target object. The text description describes where in the first background picture the first target object should be placed, and thereby helps constrain the placement position. For example, if the first target object is a refrigerator and the first background picture contains a sofa and a cabinet, the text description may be "to the left of the sofa", "inside the cabinet", and so on. Constrained by the first background picture and the text description, the position prediction capability of the position prediction model is used to determine the specific position of the refrigerator and obtain the target position information.

S202, a second object which is the same type as the first object in the target picture is obtained, and the second object is segmented from the target picture according to a mask of the second object.

The target picture contains the second target object that needs to be fused into the first background picture, and the type of the second target object is the same as that of the first target object. "Same type" may mean the same product type, for example, the first target object and the second target object are both refrigerators; or it may additionally require the same product attributes (color, shape, etc.), for example, the first target object and the second target object are both white refrigerators, in which case a gray refrigerator and a white refrigerator are of different types.

The purpose of fusing the target picture with the first background picture is to give the second target object a background with a suitable theme. However, the second target object in the target picture may already have its own background, in which case the second target object needs to be identified and extracted from the target picture; the second target object can be segmented from the target picture by means of its mask.
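Mask-based segmentation as described above can be sketched with plain nested lists. The representation is an assumption made for the example: the picture is a row-major grid of pixel values, the mask is a same-sized grid of 0/1 flags, and `None` marks transparent pixels in the cutout. The function name `segment_with_mask` is hypothetical.

```python
def segment_with_mask(picture, mask):
    """Cut the masked object out of `picture`.

    Pixels outside the mask become None (transparent), and the result is
    cropped to the bounding box of the mask. An empty mask yields [].
    """
    rows = [r for r, line in enumerate(mask) if any(line)]
    cols = [c for line in mask for c, flag in enumerate(line) if flag]
    if not rows:
        return []  # nothing to segment
    top, bottom = rows[0], rows[-1]
    left, right = min(cols), max(cols)
    return [
        [picture[r][c] if mask[r][c] else None
         for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


picture = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 0, 1]]
cutout = segment_with_mask(picture, mask)
```

The cutout keeps only the object's pixels, so the original background of the target picture does not leak into the first background picture during fusion.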

S203, confirming the scaling of the segmented second object according to the size relation between the target position size in the target position information and the segmented second object.

The position prediction model predicts the target position information of the first target object in the first background picture. From the target position coordinates in the target position information, the size of the target position in which the first target object may be placed can be determined, which is also the size at which the second target object from the target picture is to be placed. The second target object segmented from the target picture may differ in size from this permitted placement position; therefore, the scaling of the segmented second target object needs to be determined according to the size relationship between the target position size in the target position information and the segmented second target object.
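A minimal sketch of the scaling computation follows. It assumes that the target position information is an axis-aligned box `(x_min, y_min, x_max, y_max)` and that the object should keep its aspect ratio while fitting inside the box; neither assumption is stated in the text, and the function name `scaling_ratio` is illustrative.

```python
def scaling_ratio(target_box, object_size):
    """Return the factor by which the segmented object should be resized.

    target_box:  (x_min, y_min, x_max, y_max) of the predicted position
    object_size: (width, height) of the segmented object
    The smaller of the two axis ratios is used so that the scaled object
    fits inside the box without distortion (an assumed policy).
    """
    x_min, y_min, x_max, y_max = target_box
    target_w, target_h = x_max - x_min, y_max - y_min
    object_w, object_h = object_size
    if min(target_w, target_h, object_w, object_h) <= 0:
        raise ValueError("sizes must be positive")
    return min(target_w / object_w, target_h / object_h)


ratio = scaling_ratio((10, 20, 110, 220), (200, 800))
```

Here the target position is 100 by 200 pixels and the object is 200 by 800 pixels, so the height is the limiting dimension and the object is shrunk to a quarter of its size.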

S204, fusing the segmented second object into the first background picture according to the object position information and the scaling.

Once the target position information and the scaling are determined, both the specific position at which the second target object is fused into the first background picture and the size to which it is scaled are known, so that the second target object can be fused accurately with the first background picture to generate a reasonable fused picture.
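The final placement can be illustrated with a deliberately simple sketch: nearest-neighbour resizing followed by pasting the non-transparent pixels at the predicted position. This is a simplification and an assumption; the application also mentions fusing by means of a cross-modal generation large model, which is not reproduced here. Pixels are plain values in nested lists, `None` marks transparency, and both function names are hypothetical.

```python
def resize_nearest(cutout, scale):
    """Resize a row-major pixel grid by `scale` using nearest neighbour."""
    src_h, src_w = len(cutout), len(cutout[0])
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    return [
        [cutout[min(src_h - 1, int(r / scale))][min(src_w - 1, int(c / scale))]
         for c in range(dst_w)]
        for r in range(dst_h)
    ]


def fuse_into_background(background, cutout, top_left, scale):
    """Paste the scaled cutout into a copy of `background`.

    top_left is (x, y) of the predicted target position. Transparent
    (None) pixels and pixels falling outside the background are skipped.
    """
    x0, y0 = top_left
    fused = [row[:] for row in background]  # leave the input untouched
    for r, line in enumerate(resize_nearest(cutout, scale)):
        for c, pixel in enumerate(line):
            y, x = y0 + r, x0 + c
            if pixel is not None and 0 <= y < len(fused) and 0 <= x < len(fused[0]):
                fused[y][x] = pixel
    return fused


background = [[0] * 4 for _ in range(4)]
fused = fuse_into_background(background, [[5, None], [7, 9]], (1, 1), 1.0)
```

A production pipeline would typically blend edges and adjust lighting as well; the sketch only shows how the position and the scaling determine where each object pixel lands.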

The first background picture may have target position information corresponding to a plurality of first target objects, and a plurality of second target objects may be segmented from the target picture, so the second target objects and the first background picture need to be matched:

For example, the segmenting the second target object from the target picture according to the mask of the second target object includes:

The preset target parameters may be determined by a target detection model. During training, the target detection model gradually learns to identify the second target objects, so that, when the model is used, it can calculate the preset target parameters, which are then used to select one or more second target objects in the target picture. If a plurality of second target objects exist, each second target object is segmented from the target picture according to its mask.

After the second target objects are segmented, a first background picture suited to the segmented second target objects needs to be found among a plurality of first background pictures. For example, if two refrigerators are extracted, a first background picture predicted to have two cabinet positions is found, and the refrigerators are fused into it.
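The matching idea in the refrigerator example can be sketched as a lookup by the number of predicted positions. Matching purely by count is an assumption drawn from that example; the file names, the dictionary layout, and the function name `match_background` are all hypothetical.

```python
def match_background(num_objects, predicted_positions_by_background):
    """Return the first background whose number of predicted target
    positions equals the number of segmented objects, or None."""
    for name, positions in predicted_positions_by_background.items():
        if len(positions) == num_objects:
            return name
    return None


# Hypothetical candidates: background name -> predicted target boxes.
candidates = {
    "living_room.png": [(0, 0, 10, 20)],
    "kitchen.png": [(0, 0, 10, 20), (30, 0, 40, 20)],
}
chosen = match_background(2, candidates)
```

A fuller matcher might also compare object types and sizes against each predicted position instead of relying on the count alone.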

Semantic segmentation is used so that each second target object is segmented completely:

For example, semantic segmentation is performed on the target picture to obtain a mask of each second target object.

The target picture is segmented by means of semantic segmentation to obtain the masks of the second target objects that the semantic segmentation model can identify, and each second target object is segmented from the target picture according to its mask.

In the embodiment of the application, a position prediction model that can predict the position of a target object in a background picture is trained in advance, and the position prediction model predicts the position information of the target object according to the global picture features of the first background picture and the global text features of the text description.

Fig. 3 is a second flow chart of a large model-based picture generation method according to an embodiment of the present application. As shown in fig. 3, the method includes:

S301, acquiring global picture features of the first background picture and global text features of the text description through a decoder in the position prediction model, wherein the decoder comprises a picture decoder and a text decoder.

In the position prediction model, pictures and text can be processed in an encoding and decoding manner, and feature extraction can be performed on the picture and the text by the decoder of the position prediction model. Correspondingly, the decoder that acquires the global picture features is the picture decoder, and the decoder that acquires the global text features is the text decoder. In this embodiment, the global picture features of the first background picture and the global text features of the text description corresponding to the first background picture are acquired.

S302, carrying out feature alignment and fusion on the global picture features and the global text features to obtain first fusion features.

The global picture features represent the feature information of the different target contents in the first background picture, each target content being represented by a feature matrix formed from a plurality of global picture feature values. Likewise, the global text features represent the feature information of the different target contents in the text description, each target content being represented by a feature matrix formed from a plurality of global text feature values. For example, a sofa and a cabinet are different target contents, and the global picture feature values corresponding to the sofa and to the cabinet are different;

To fully determine the location of the target content, the global picture features need to be aligned with the global text features so that features referring to the same target content correspond to each other, for example, aligning the cabinet-related features in the global picture features with the cabinet-related features in the global text features; the aligned global picture features and global text features are then fused to obtain the first fusion features.

S303, carrying out position regression on the first fusion characteristic, and predicting to obtain target position information of the first target object in the first background picture.

Position regression is carried out on the first fusion feature using an end-to-end model algorithm to obtain the target position information of the first target object in the first background picture.

In the embodiment of the application, based on the global picture features and the global text features, model reasoning is carried out on the target position of the first target object in the first background picture, so that the position coordinate of the first target object in the first background picture is accurately predicted, and the subsequent fusion of the target object and the background picture is facilitated.

Fig. 4 is a flowchart illustrating a large-model-based picture generation method according to an embodiment of the present application. Fig. 5 is a cross-modal feature fusion component provided by an embodiment of the present application. As shown in fig. 4, the method includes:

S401, segmenting the global picture feature into local picture features, and segmenting the global text feature into first local text features and word frequency information, wherein the word frequency information comprises occurrence frequencies of word features associated with the first object in the first local text features.

The global picture features are picture features of the whole picture of the first background picture, and the local picture features are picture features of the local picture of the first background picture; dividing global picture features of a single first background picture into a plurality of local picture features according to a first preset dividing value; for example, 256×256 global picture features are cut into a plurality of 16×16 local picture features;
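The segmentation of a 256×256 global picture feature into 16×16 local picture features can be sketched as below. The function name `split_into_patches` and the row-major patch ordering are assumptions made for illustration.

```python
def split_into_patches(feature_map, patch_size):
    """Cut a 2-D feature map into non-overlapping square patches (row-major order)."""
    height = len(feature_map)
    width = len(feature_map[0])
    if height % patch_size or width % patch_size:
        raise ValueError("feature map size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                row[left:left + patch_size]
                for row in feature_map[top:top + patch_size]
            ]
            patches.append(patch)
    return patches


# A 256x256 stand-in for the global picture feature; each value encodes its position.
global_feature = [[r * 256 + c for c in range(256)] for r in range(256)]
local_features = split_into_patches(global_feature, 16)
```

With a first preset dividing value of 16, this yields 16 × 16 = 256 local picture features, each of size 16×16.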

The global text feature comprises all text features of the text description, each first local text feature comprises part of the text features of the text description, and the text features of two first local text features may be identical. For example, if the text description is "in a cabinet in need of a refrigerator", the global text feature characterizes "in a cabinet in need of a refrigerator", and the first local text features characterize "in need of a refrigerator" and "in a cabinet" respectively; if "refrigerator" is described repeatedly in the text description, a plurality of first local text features may characterize "refrigerator"; words similar to "in need of" are word features not associated with the first target object and are to be deleted;

The word frequency information records the frequency of occurrence of each word feature associated with the first target object; e.g., if a word feature characterizes "refrigerator" and two first local text features characterize "refrigerator", the frequency recorded for that word feature is two.

S402, de-duplicating the first local text feature according to the word frequency information to obtain a second local text feature.

When several first local text features represent the same text feature, the duplicates need to be removed so that only one of them is left; for example, if two first local text features characterize "refrigerator", one of them can be deleted. The first local text features left after de-duplication are the second local text features.
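Steps S401 and S402 on the text side can be sketched as follows. Real first local text features are vectors; plain strings are used here as stand-ins, and the helper names `word_frequency` and `deduplicate` are hypothetical.

```python
from collections import Counter


def word_frequency(first_local_features, target_words):
    """Count how often each word feature tied to the target object occurs."""
    counts = Counter(first_local_features)
    return {word: counts[word] for word in target_words if counts[word] > 0}


def deduplicate(first_local_features, frequency):
    """Keep only the first occurrence of any feature that appears more than once."""
    seen = set()
    second_local_features = []
    for feature in first_local_features:
        if frequency.get(feature, 0) > 1 and feature in seen:
            continue  # a repeated feature: drop the duplicate
        seen.add(feature)
        second_local_features.append(feature)
    return second_local_features


# Strings stand in for first local text features (assumed data).
first_local = ["refrigerator", "cabinet", "refrigerator"]
frequency = word_frequency(first_local, target_words=["refrigerator", "cabinet"])
second_local = deduplicate(first_local, frequency)
```

Here the word frequency information records that "refrigerator" occurs twice, so one of the two duplicates is dropped and the remaining features form the second local text features.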

S403, based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features.

When the local picture features and the second local text features are aligned, the weights corresponding to the features of the same target content of the picture and the text are increased at the same time; whether the local picture feature and the second local text feature are accurately aligned or not can be measured by the attention mechanism until the condition of the attention mechanism is satisfied:

For example, confirming the query vector of the attention mechanism according to the second local text feature and the projection of the second local text feature on the local picture feature;

Obtaining projection of the second local text feature on the local picture feature to obtain a first projection matrix W;

obtaining a query vector Q according to the first projection matrix and the second local text feature X;

Obtaining projection of the local picture features on the second local text features to obtain a second projection matrix U;

Obtaining a key vector K and a value vector V according to the second projection matrix and the local picture feature Y;

Optionally, Q = XW and K = V = YU;

specifically, whether the local picture features and the second local text features are aligned accurately can be measured through a local attention weight matrix under an attention mechanism, and if the local attention weight matrix meets a second weight threshold, the alignment is confirmed;

The local attention weight matrix expression is:

$A(X, Y) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$

wherein $A(X, Y)$ is the local attention weight matrix; $\mathrm{softmax}$ is a normalization function; X is the second local text feature; Y is the local picture feature; Q is the query vector; K is the key vector, K = V; V is the value vector; d is the number of rows of the second projection matrix U of the second local text feature on the local picture feature; T denotes the transpose;
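A minimal sketch of the attention computation using the quantities defined above (Q = XW, K = V = YU, and d taken as the number of rows of U). The softmax normalization and the division by √d follow the standard scaled dot-product formulation and are assumed here; the toy matrices and the function names are illustrative only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Normalize each row so that its entries are positive and sum to 1."""
    result = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def local_attention(X, Y, W, U):
    """Return (weights, output) with Q = XW and K = V = YU."""
    Q = matmul(X, W)
    K = matmul(Y, U)
    V = K
    d = len(U)  # number of rows of U, as stated in the text
    scores = matmul(Q, transpose(K))
    weights = softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])
    return weights, matmul(weights, V)


# Toy data (assumed): 2 text features, 3 picture features, identity projections.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
weights, output = local_attention(X, Y, W, U)
```

Each row of `weights` shows how strongly one second local text feature attends to each local picture feature; larger weights indicate features that refer to the same target content.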

The local picture features are obtained by segmenting the global picture feature, so one global picture feature corresponds to a plurality of local picture features. If the global picture feature corresponds to the whole picture and the local picture features correspond to the 3×3 sub-pictures (3 rows and 3 columns of sub-pictures) cut from the whole picture, then the corresponding 3×3 local attention weight matrices can be calculated; a global attention weight matrix is built from the 3×3 local attention weight matrices according to the positions of the sub-pictures in the whole picture, and global feature alignment is completed according to the global attention weight matrix, so that the aligned global picture feature and global text feature are obtained;

For example, the global attention weight matrix C may be expressed as:

$C = \begin{bmatrix} A_{11} & \cdots & A_{1y} \\ \vdots & \ddots & \vdots \\ A_{x1} & \cdots & A_{xy} \end{bmatrix}$

wherein $A_{11}$ is the local attention weight matrix corresponding to the sub-picture in row 1, column 1; $A_{x1}$ is the local attention weight matrix corresponding to the sub-picture in row x, column 1; $A_{1y}$ is the local attention weight matrix corresponding to the sub-picture in row 1, column y; and $A_{xy}$ is the local attention weight matrix corresponding to the sub-picture in row x, column y.
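Building the global attention weight matrix from a grid of local attention weight matrices amounts to assembling a block matrix. The sketch below assumes all local matrices have the same shape and uses a hypothetical helper `assemble_global`; the block contents are placeholder values chosen so that each block is identifiable.

```python
def assemble_global(grid):
    """grid[i][j] is the local matrix of the sub-picture in row i, column j."""
    global_rows = []
    for block_row in grid:
        block_height = len(block_row[0])
        for r in range(block_height):
            row = []
            for block in block_row:
                row.extend(block[r])  # place blocks side by side
            global_rows.append(row)
    return global_rows


# 3x3 grid of 2x2 local matrices; block (i, j) is filled with the value 10*i + j.
grid = [
    [[[10 * i + j] * 2 for _ in range(2)] for j in range(3)]
    for i in range(3)
]
C = assemble_global(grid)
```

The result is a 6×6 matrix in which each 2×2 block sits at the position of its sub-picture in the whole picture.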

S404, performing feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.

The cross-modal feature fusion component is a component for feature fusion, and can perform feature fusion on the aligned global picture features and the aligned global text features to obtain first fusion features:

According to the structure of the cross-modal feature fusion component, the process of fusing the global picture features and the global text features is as follows:

for example, inputting the aligned global picture feature into a first layer structure of the cross-modal feature fusion component, and inputting the aligned global text feature into a second layer structure of the cross-modal feature fusion component, wherein the first layer structure comprises a plurality of first network layers, and the second layer structure comprises a plurality of second network layers;

As shown in fig. 5, the cross-modal feature fusion component includes two layer structures: the first layer structure includes a plurality of first network layers, and the second layer structure includes a plurality of second network layers. The numbers of first network layers and second network layers can be determined according to actual requirements; the first network layers may have the same or different neural network structures, and the second network layers may likewise have the same or different neural network structures. The global picture feature may be input to the first of the first network layers, and the global text feature may be input to the first of the second network layers; the first output of that first network layer and the second output of that second network layer are fused to obtain a fusion input;

fusing the first output and the fusion input through a multiplier to obtain a first fusion output; fusing the second output and the fusion input through a multiplier to obtain a second fusion output; the first fusion output continues to be input to the next first network layer, and the second fusion output continues to be input to the next second network layer;

The last first network layer obtains a third output according to the first fusion output of the previous first network layer; the last second network layer obtains a fourth output according to the second fusion output of the previous second network layer; and fusing the third output and the fourth output through an adder to obtain a first fusion characteristic.
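The data flow through the cross-modal feature fusion component can be sketched as below. Several points are assumptions made only for illustration: both layer structures have the same number of layers, the fusion input is formed by element-wise addition (the text does not specify how the two outputs are fused), and each network layer is modeled as a plain function on a list of numbers.

```python
def multiply(a, b):
    """Element-wise multiplier."""
    return [x * y for x, y in zip(a, b)]


def add(a, b):
    """Element-wise adder."""
    return [x + y for x, y in zip(a, b)]


def cross_modal_fusion(picture_feature, text_feature, picture_layers, text_layers):
    if not picture_layers or len(picture_layers) != len(text_layers):
        raise ValueError("this sketch assumes equally many layers in both structures")
    pic, txt = picture_feature, text_feature
    # All layers except the last exchange information through the fusion input.
    for pic_layer, txt_layer in zip(picture_layers[:-1], text_layers[:-1]):
        first_output = pic_layer(pic)
        second_output = txt_layer(txt)
        fusion_input = add(first_output, second_output)  # assumed fusion rule
        pic = multiply(first_output, fusion_input)   # first fusion output
        txt = multiply(second_output, fusion_input)  # second fusion output
    third_output = picture_layers[-1](pic)
    fourth_output = text_layers[-1](txt)
    return add(third_output, fourth_output)  # first fusion feature


# Toy layers (assumed): doubling and identity stand in for neural network layers.
def double(v):
    return [2 * x for x in v]


def identity(v):
    return list(v)


first_fusion_feature = cross_modal_fusion(
    [1, 2], [3, 4],
    picture_layers=[double, identity],
    text_layers=[identity, double],
)
```

With these toy layers the first outputs are [2, 4] and [3, 4], the fusion input is [5, 8], and the final adder produces [40, 96].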

In the embodiment of the application, the global picture features and the global text features are subjected to feature alignment based on an attention mechanism; performing feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features; and the global picture features and the global text features are fully fused, so that the accuracy of the later position regression is ensured.

Fig. 6 is a flow chart diagram of a large model-based picture generation method according to an embodiment of the present application. As shown in fig. 6, the method includes:

S601, acquiring a triplet loss function and keyword confidence in the text description.

When the position prediction model is trained, rewards and penalties are applied through the triplet loss function and the keyword confidence during position regression on the first fusion feature; the triplet loss function and the keyword confidence are therefore obtained first;

the process for constructing the triplet loss function comprises the following steps:

For example, a second fusion feature obtained in a training process is obtained, and a first distance between an anchor sample and a positive sample in the second fusion feature and a second distance between the anchor sample and a negative sample in the second fusion feature are confirmed, wherein the positive sample is a sample of the same type as the anchor sample, and the negative sample is a sample of a different type from the anchor sample;

The triplet loss function L can be expressed as:

$L = \max\left(0,\ d(r, p) - d(r, n) + m\right)$

wherein $\max(0, k)$ obtains the maximum value between 0 and k; $d(r, p)$ is the distance between r and p, i.e. the first distance; $d(r, n)$ is the distance between r and n, i.e. the second distance; r is the anchor sample in the second fusion feature; p is the positive sample in the second fusion feature; n is the negative sample in the second fusion feature; m is a preset constant;

The second fusion feature is obtained while training the position prediction model and includes feature representations of the target contents. For example, a white refrigerator and a gray refrigerator are both target contents and are both refrigerators; if the feature corresponding to the white refrigerator is selected as the anchor sample, other features related to the white refrigerator are positive samples, and features inconsistent with the white refrigerator are negative samples.
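A minimal sketch of a standard triplet loss with margin, max(0, d(r, p) − d(r, n) + m), where r, p and n are the anchor, positive and negative samples and m is the preset constant. The Euclidean distance is an assumption; the text does not state which distance is used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    first_distance = euclidean(anchor, positive)
    second_distance = euclidean(anchor, negative)
    return max(0.0, first_distance - second_distance + margin)


# The negative is far from the anchor: the loss vanishes.
loss_far_negative = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=1.0)
# The negative is closer than the positive: the loss is positive.
loss_near_negative = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0)
```

The loss is zero once the negative sample is at least m farther from the anchor than the positive sample, which pushes same-type features together and different-type features apart.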

S602, establishing a loss function of the position prediction model according to the keyword confidence and the triplet loss function; wherein the loss function is used to train the position prediction model.

By combining the keyword confidence coefficient and the triplet loss function, proper weights are respectively set for the keyword confidence coefficient and the triplet loss function, and a loss function suitable for a position prediction model can be constructed:

Illustratively, a first loss coefficient is obtained according to a first weight and the keyword confidence coefficient;

The loss function Loss of the position prediction model is: $Loss = jN + gL$; wherein j is the first weight, applied to the keyword confidence; g is the second weight, applied to the triplet loss function; N is the keyword confidence; L is the triplet loss function.
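The weighted combination of the keyword confidence and the triplet loss can be written as a one-line function. The numeric weights and inputs below are arbitrary illustrative values, not values taken from the application.

```python
def position_model_loss(keyword_confidence, triplet_loss_value, first_weight, second_weight):
    """Weighted sum of the keyword confidence and the triplet loss."""
    return first_weight * keyword_confidence + second_weight * triplet_loss_value


# Illustrative values only: j = 0.25, g = 0.75, N = 0.8, L = 2.0.
total_loss = position_model_loss(0.8, 2.0, first_weight=0.25, second_weight=0.75)
```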

In the embodiment of the application, a suitable loss function is constructed for the position prediction model, and the position prediction model is trained with this loss function by back propagation until training of the position prediction model is completed, after which the loss function is fixed in the position prediction model, which is then used for prediction.

When training a position prediction model, a large amount of data can be acquired for training:

For example, in the database, a second background picture is obtained;

The second background picture is a background picture which is obtained from a database and is different from the first background picture, the position of a third object in the second background picture can be pre-marked through a picture understanding large model, and a corresponding semantic text is obtained, wherein the semantic text is used for describing the position of the third object in the second background picture; and according to the requirements, carrying out data cleaning (data screening) on the processed second background picture and the semantic text, and training the second background picture and the semantic text after the data cleaning as the input of an initial position prediction model (an untrained position prediction model) until the training is completed to obtain the position prediction model.

Fig. 7 is a diagram of a large model-based picture generating device according to an embodiment of the present invention, where, as shown in fig. 7, the device includes: a prediction module 701, a segmentation module 702, a scaling module 703 and a fusion module 704;

The prediction module 701 is configured to input a first background picture and a text description into a position prediction model, and predict to obtain target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture.

The prediction module 701 is further configured to obtain, through a decoder in the position prediction model, a global picture feature of the first background picture and a global text feature of the text description, where the decoder includes a picture decoder and a text decoder;

The prediction module 701 is further configured to segment the global picture feature into local picture features, segment the global text feature into first local text features and word frequency information, where the word frequency information includes occurrence frequencies of text features associated with the first object in the respective first local text features; performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature;

The prediction module 701 is further configured to confirm a query vector of the attention mechanism according to the second local text feature and a projection of the second local text feature on the local picture feature;

The prediction module 701 is further configured to input the aligned global picture feature into a first layer structure of the cross-modal feature fusion component, and input the aligned global text feature into a second layer structure of the cross-modal feature fusion component, where the first layer structure includes a plurality of first network layers, and the second layer structure includes a plurality of second network layers;

The segmentation module 702 is configured to obtain a second object of the same type as the first object in the target picture, and segment the second object from the target picture according to a mask of the second object.

And a scaling module 703, configured to confirm the scaling of the segmented second object according to the size relationship between the target position size in the target position information and the segmented second object.

And a fusion module 704, configured to fuse the segmented second object into the first background picture according to the object position information and the scaling.
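The work of the scaling module and the fusion module can be sketched as below. The rule used for the scaling ratio (the largest ratio at which the object still fits inside the target position) and the nearest-neighbour resize are assumptions made for illustration; the function names are hypothetical.

```python
def scaling_ratio(target_size, object_size):
    """Largest ratio at which the object still fits the target position (assumed rule)."""
    target_w, target_h = target_size
    object_w, object_h = object_size
    return min(target_w / object_w, target_h / object_h)


def resize_nearest(pixels, ratio):
    """Nearest-neighbour resize of a 2-D pixel grid."""
    src_h, src_w = len(pixels), len(pixels[0])
    dst_h = max(1, round(src_h * ratio))
    dst_w = max(1, round(src_w * ratio))
    return [
        [
            pixels[min(src_h - 1, int(r / ratio))][min(src_w - 1, int(c / ratio))]
            for c in range(dst_w)
        ]
        for r in range(dst_h)
    ]


def paste(background, patch, top, left):
    """Copy non-transparent patch pixels into a copy of the background."""
    result = [row[:] for row in background]
    for r, patch_row in enumerate(patch):
        for c, pixel in enumerate(patch_row):
            if pixel is not None:  # None marks pixels outside the mask
                result[top + r][left + c] = pixel
    return result


# Toy data (assumed): 4x4 background, 1x1 object, 2x2 target position at (1, 1).
background = [[0] * 4 for _ in range(4)]
segmented_object = [[7]]
ratio = scaling_ratio(target_size=(2, 2), object_size=(1, 1))
fused = paste(background, resize_nearest(segmented_object, ratio), top=1, left=1)
```

The segmented object is enlarged to the size of the predicted target position and written into the background at that position, leaving the remaining background pixels unchanged.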

The application also provides a large model-based picture generation device, which comprises: at least one processor and memory;

the memory stores computer-executable instructions;

The at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform a large model-based picture generation method.

Fig. 8 is a hardware schematic diagram of a large model-based picture generation device according to an embodiment of the present invention. As shown in fig. 8, the large model-based picture generation apparatus 80 provided in the present embodiment includes: at least one processor 801 and a memory 802. The device 80 further comprises a communication component 803. The processor 801, the memory 802, and the communication section 803 are connected via a bus 804.

In a specific implementation, the at least one processor 801 executes computer-executable instructions stored in the memory 802, so that the at least one processor 801 performs the above large model-based picture generation method.

The specific implementation process of the processor 801 may refer to the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.

In the embodiment shown in fig. 8, it should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), and the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution.

The memory may include high-speed random access memory (Random Access Memory, RAM) and may further include non-volatile memory (Non-volatile Memory, NVM), such as at least one disk memory.

The bus may be an industry standard architecture (Industry Standard Architecture, ISA) bus, a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, the bus in the drawings of the present application is not limited to only one bus or to one type of bus.

The present application also provides a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the large model-based picture generation method as described above.

The computer readable storage medium described above may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. A readable storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. In the alternative, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC). Alternatively, the processor and the readable storage medium may reside as discrete components in a device.

The division of the units is merely a logic function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains and as may be applied to the precise construction hereinbefore set forth and shown in the drawings and as follows in the scope of the appended claims.

Claims

1. A large model-based picture generation method, comprising:

According to the target position information and the scaling, fusing the segmented second target object into the first background picture;

Inputting the first background picture and the text description into a position prediction model, predicting to obtain target position information of a first target object in the first background picture, wherein the method comprises the following steps:

Position regression is carried out on the first fusion characteristic, and target position information of the first target object in the first background picture is obtained through prediction;

the step of carrying out feature alignment and fusion on the global picture features and the global text features to obtain first fusion features, including:

2. The large model based picture generation method of claim 1, wherein the performing global feature alignment on the local picture feature and the second local text feature based on an attention mechanism comprises:

3. The method for generating a large model-based picture according to claim 1, wherein the feature-fusing the aligned global picture feature and the global text feature by a cross-modal feature-fusion component to obtain the first fused feature comprises:

4. The method for generating a large model-based picture according to claim 1, wherein before the performing position regression on the first fusion feature to predict the target position information of the first target object in the first background picture, the method further comprises:

5. The large model based picture generation method according to claim 4, wherein the obtaining a triplet loss function in the text description comprises:

6. The method of claim 4, wherein the creating a loss function of the location prediction model based on the keyword confidence and the triplet loss function comprises:

7. The large model-based picture generation method according to claim 1, wherein before the inputting of the first background picture and the text description into the position prediction model, the method further comprises:

acquiring a second background picture in a database;

8. The large model-based picture generation method according to claim 1, wherein the dividing the second object from the target picture according to the mask of the second object includes:

9. The large model-based picture generation method according to claim 8, wherein the dividing each of the second objects from the target picture according to the mask of each of the second objects includes:

10. A large model-based picture generation apparatus, comprising:

the fusion module is used for fusing the segmented second object into the first background picture according to the object position information and the scaling;

The prediction module is specifically configured to obtain, through a decoder in the position prediction model, a global picture feature of the first background picture and a global text feature of the text description, where the decoder includes a picture decoder and a text decoder; performing feature alignment and fusion on the global picture features and the global text features to obtain first fusion features; position regression is carried out on the first fusion characteristic, and target position information of the first target object in the first background picture is obtained through prediction;

The prediction module is specifically configured to segment the global picture feature into local picture features and segment the global text feature into first local text features and word frequency information when performing feature alignment and fusion on the global picture feature and the global text feature to obtain a first fusion feature, where the word frequency information includes occurrence frequencies of word features associated with the first target object in each of the first local text features; performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature; based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features; and carrying out feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.

11. A large model-based picture generation apparatus, comprising: at least one processor and memory;

the memory stores computer-executable instructions;

The at least one processor executing computer-executable instructions stored in the memory, causing the at least one processor to perform the large model-based picture generation method of any one of claims 1-9.

12. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the large model based picture generation method as claimed in any one of claims 1-9.