CN117710234B - Picture generation method, device, equipment and medium based on large model - Google Patents

Info

Publication number
CN117710234B
Authority
CN
China
Prior art keywords
picture
global
features
feature
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410166499.7A
Other languages
Chinese (zh)
Other versions
CN117710234A (en
Inventor
邓邱伟
王迪
苏明月
尹飞
孙涛
王中飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Qingdao Haier Intelligent Home Appliance Technology Co Ltd
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Qingdao Haier Intelligent Home Appliance Technology Co Ltd
Haier Uplus Intelligent Technology Beijing Co Ltd
Application filed by Qingdao Haier Technology Co Ltd, Qingdao Haier Intelligent Home Appliance Technology Co Ltd, Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202410166499.7A priority Critical patent/CN117710234B/en
Publication of CN117710234A publication Critical patent/CN117710234A/en
Publication of CN117710234B publication Critical patent/CN117710234B/en

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a method, device, equipment and medium for generating a picture based on a large model, and relates to the field of smart home/smart family technology. The method includes: inputting a first background picture and a text description into a position prediction model, and predicting target position information of a first target object in the first background picture; obtaining, from a target picture, a second target object of the same type as the first target object, and segmenting the second target object from the target picture according to the mask of the second target object; confirming the scaling ratio of the segmented second target object according to the size relationship between the target position size in the target position information and the segmented second target object; and fusing the segmented second target object into the first background picture according to the target position information and the scaling ratio. The method of the present application can accurately fuse the target product into the background picture.

Description

Picture generation method, device, equipment and medium based on large model
Technical Field
The application relates to the technical field of smart home/smart family, in particular to a picture generation method, device, equipment and medium based on a large model.
Background
Picture generation technology based on large models can be applied in many fields, including multimedia, commercial advertising, and entertainment; for example, when picture generation is used for product marketing, the target commodity needs to be merged into background pictures of various themes.
In the prior art, when a target commodity and a background picture are fused without manual intervention, the fused picture may be unreasonable: for example, a refrigerator is placed on a table, or a television appears suspended in mid-air instead of standing on a television cabinet. With manual intervention, such unreasonable fused pictures can largely be avoided, but the operation process becomes longer.
Therefore, how to reasonably fuse target commodities and background pictures without manual intervention is a problem to be solved at present.
Disclosure of Invention
The application provides a picture generation method, device, equipment and medium based on a large model, which are used for solving the prior-art problem that the fused picture is prone to being unreasonable when no manual intervention is performed.
In a first aspect, the present application provides a method for generating a picture based on a large model, including:
Inputting a first background picture and text description into a position prediction model, and predicting to obtain target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture;
acquiring a second target object which is the same type as the first target object in a target picture, and dividing the second target object from the target picture according to a mask of the second target object;
confirming the scaling of the segmented second object according to the size relation between the target position size in the target position information and the segmented second object;
and fusing the segmented second target object into the first background picture according to the target position information and the scaling.
In one possible implementation manner, the inputting the first background picture and the text description into the position prediction model predicts the target position information of the first target object in the first background picture, including:
Acquiring global picture features of the first background picture and global text features of the text description through a decoder in the position prediction model, wherein the decoder comprises a picture decoder and a text decoder;
performing feature alignment and fusion on the global picture features and the global text features to obtain first fusion features;
And carrying out position regression on the first fusion characteristic, and predicting to obtain target position information of the first target object in the first background picture.
In one possible implementation manner, the performing feature alignment and fusion on the global picture feature and the global text feature to obtain a first fusion feature includes:
Dividing the global picture feature into local picture features, dividing the global text feature into first local text features and word frequency information, wherein the word frequency information comprises occurrence frequencies of word features associated with the first target object in the first local text features;
Performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature;
Based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features;
and carrying out feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.
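The de-duplication step in this implementation can be illustrated with a minimal sketch. It assumes that each local text feature is tied to one token and that repeated tokens are dropped after their first occurrence; the function name `deduplicate_text_features` and the toy data are hypothetical and are not taken from the application.

```python
from collections import Counter

def deduplicate_text_features(tokens, features):
    """Keep one feature per distinct token, guided by word-frequency counts.

    Assumption: tokens[i] is the word that produced features[i].
    """
    if len(tokens) != len(features):
        raise ValueError("tokens and features must have the same length")
    word_freq = Counter(tokens)  # plays the role of the word frequency information
    seen = set()
    kept_tokens, kept_features = [], []
    for token, feat in zip(tokens, features):
        # A token that occurs more than once is kept only at its first occurrence.
        if word_freq[token] > 1 and token in seen:
            continue
        seen.add(token)
        kept_tokens.append(token)
        kept_features.append(feat)
    return kept_tokens, kept_features, word_freq

# Toy first local text features for "refrigerator left sofa refrigerator".
tokens = ["refrigerator", "left", "sofa", "refrigerator"]
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.1, 0.2]]
kept_tokens, kept_features, word_freq = deduplicate_text_features(tokens, features)
```

The returned `kept_features` stand in for the second local text features; a real model would operate on learned embeddings rather than hand-written vectors.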
In one possible implementation, the global feature alignment of the local picture feature and the second local text feature based on the attention mechanism includes:
Confirming a query vector of the attention mechanism according to the second local text feature and the projection of the second local text feature on the local picture feature;
confirming a key vector and a value vector of the attention mechanism according to the local picture feature and the projection of the local picture feature on the second local text feature;
According to the query vector, the key vector and the value vector, a local attention weight matrix is obtained, and the local attention weight matrix is aggregated into a global attention weight matrix;
and according to the global attention weight matrix, carrying out global feature alignment on the local picture features and the second local text features.
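As a rough illustration of attention-based alignment, the sketch below uses plain scaled dot-product cross-attention: text features act as queries and picture features act as keys and values. This is a simplification under stated assumptions. The mutual projections described in this implementation are omitted, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_feats, pic_feats):
    """Align picture features to text features.

    Assumption: queries are the text features themselves, and keys and
    values are both the picture features (no learned projections).
    Returns (weights, aligned): one weight row per text feature, stacked
    into a single matrix, plus the attention-weighted picture features.
    """
    dim = len(pic_feats[0])
    scale = math.sqrt(dim)
    weights, aligned = [], []
    for q in text_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in pic_feats]
        row = softmax(scores)  # local attention weights for this query
        weights.append(row)
        aligned.append(
            [sum(w * v[d] for w, v in zip(row, pic_feats)) for d in range(dim)]
        )
    return weights, aligned

text_feats = [[1.0, 0.0], [0.0, 1.0]]
pic_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, aligned = cross_attention(text_feats, pic_feats)
```

Stacking the per-query rows into `weights` is one simple way to read "aggregating local attention weights into a global matrix"; the application may aggregate differently.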
In a possible implementation manner, the performing feature fusion on the aligned global picture feature and the global text feature by using a cross-modal feature fusion component to obtain the first fusion feature includes:
inputting the aligned global picture features into a first layer structure of the cross-modal feature fusion component, and inputting the aligned global text features into a second layer structure of the cross-modal feature fusion component, wherein the first layer structure comprises a plurality of first network layers, and the second layer structure comprises a plurality of second network layers;
Fusing a first output of each current non-last first network layer with a second output of each current non-last second network layer to obtain a fusion input, wherein the first output and the fusion input are fused and then input to the next first network layer, and the second output and the fusion input are fused and then input to the next second network layer;
and fusing the third output of the last first network layer and the fourth output of the last second network layer to obtain the first fusion characteristic.
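The layer-by-layer data flow of the cross-modal feature fusion component can be sketched as follows. The sketch assumes that "fusing" is an element-wise average and that each network layer is an arbitrary callable; both are placeholders chosen for illustration, and the names `fuse` and `cross_modal_fusion` are hypothetical.

```python
def fuse(a, b):
    """Element-wise average; a stand-in for the real fusion operator."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

def cross_modal_fusion(pic_feat, text_feat, pic_layers, text_layers):
    """Run two parallel layer stacks that exchange a fused signal.

    pic_layers / text_layers: equal-length lists of callables standing in
    for the first and second network layers.
    """
    if not pic_layers or len(pic_layers) != len(text_layers):
        raise ValueError("both stacks need the same non-zero number of layers")
    x_pic, x_text = pic_feat, text_feat
    last = len(pic_layers) - 1
    for i, (pic_layer, text_layer) in enumerate(zip(pic_layers, text_layers)):
        out_pic = pic_layer(x_pic)
        out_text = text_layer(x_text)
        if i == last:
            # Outputs of the last layers are fused into the first fusion feature.
            return fuse(out_pic, out_text)
        fusion_input = fuse(out_pic, out_text)
        # Each branch's own output is fused with the shared fusion input
        # before entering the next layer of that branch.
        x_pic = fuse(out_pic, fusion_input)
        x_text = fuse(out_text, fusion_input)

def identity(v):
    """Trivial layer used only to make the data flow easy to trace."""
    return list(v)

first_fusion_feature = cross_modal_fusion(
    [1.0, 3.0], [3.0, 1.0], [identity, identity], [identity, identity]
)
```

With identity layers the two branches converge toward their average, which makes the exchange of information between the stacks easy to follow by hand.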
In a possible implementation manner, before the performing position regression on the first fusion feature and predicting the target position information of the first target object in the first background picture, the method further includes:
Acquiring a triplet loss function and keyword confidence in the text description, and establishing a loss function of the position prediction model according to the keyword confidence and the triplet loss function; wherein the loss function is used to train the position prediction model.
In one possible implementation, the obtaining the triplet loss function in the text description includes:
acquiring a second fusion characteristic obtained in a training process, and confirming a first distance between an anchoring sample and a positive sample in the second fusion characteristic and a second distance between the anchoring sample and a negative sample in the second fusion characteristic, wherein the positive sample is a sample of the same type as the anchoring sample, and the negative sample is a sample of a different type from the anchoring sample;
And confirming the triplet loss function according to a preset constant, the first distance and the second distance.
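For reference, the widely used form of the triplet loss matches the ingredients listed here: a constant (the margin), an anchor-positive distance, and an anchor-negative distance. The sketch below assumes Euclidean distance and the standard hinge form; the application does not state which distance or exact formula it uses.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(d(a, p) - d(a, n) + margin, 0).

    margin plays the role of the preset constant (assumed value 1.0).
    """
    first_distance = euclidean(anchor, positive)   # anchor to positive sample
    second_distance = euclidean(anchor, negative)  # anchor to negative sample
    return max(first_distance - second_distance + margin, 0.0)

# The negative is far away, so the margin is already satisfied.
loss_easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# The negative is only slightly farther than the positive, so a penalty remains.
loss_hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

The loss is zero once the negative sample is at least `margin` farther from the anchor than the positive sample, which pushes same-type samples together in the fused feature space.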
In one possible implementation manner, the building the loss function of the location prediction model according to the keyword confidence and the triplet loss function includes:
Acquiring a first loss coefficient according to the first weight and the keyword confidence coefficient;
Acquiring a second loss coefficient according to the second weight and the triplet loss function;
and obtaining a loss function of the position prediction model according to the first loss coefficient and the second loss coefficient.
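One plausible reading of this implementation is a weighted sum, sketched below. The application does not spell out how the keyword confidence enters the loss or how the two coefficients are combined, so the multiplication and the addition here are assumptions, as are the function name and the default weights.

```python
def position_model_loss(keyword_confidence, triplet_loss_value,
                        first_weight=0.5, second_weight=0.5):
    """Combine keyword confidence and triplet loss into one scalar.

    Assumptions: each coefficient is weight * term, and the total loss
    is the sum of the two coefficients.
    """
    first_loss_coefficient = first_weight * keyword_confidence
    second_loss_coefficient = second_weight * triplet_loss_value
    return first_loss_coefficient + second_loss_coefficient

total_loss = position_model_loss(0.8, 0.5, first_weight=0.25, second_weight=0.75)
```

In practice the two weights would be tuned so that neither term dominates training.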
In one possible implementation, before the inputting the first background picture and the text description into the position prediction model, the method further includes:
acquiring a second background picture in a database;
pre-labeling the position of a third target object in the second background picture to obtain a semantic text;
And carrying out data cleaning on the second background picture and the semantic text, and training the second background picture and the semantic text after data cleaning as the input of an initial position prediction model to obtain the position prediction model.
In one possible implementation manner, the dividing the second object from the object picture according to the mask of the second object includes:
Confirming the second target object according to preset target parameters, wherein the preset target parameters are used for selecting the second target object;
If a plurality of second targets exist, dividing each second target from the target picture according to the mask of each second target;
the fusing the segmented second object into the first background picture includes:
Acquiring the first background picture matched with the plurality of second target objects, and fusing the plurality of segmented second target objects into the first background picture based on a cross-modal generation large model.
In one possible implementation manner, the dividing each second object from the target picture according to the mask of each second object includes:
Performing semantic segmentation on the target pictures to obtain masks of each second target object;
And dividing each second object from the object picture according to the mask of each second object.
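Cutting an object out of a picture with a binary mask can be sketched as below. The sketch assumes the mask is already available (for example from a semantic segmentation model, which is not shown), that the picture and the mask are row-major nested lists of equal shape, and that pixels outside the mask are replaced by a placeholder. The function name `segment_by_mask` is hypothetical.

```python
def segment_by_mask(picture, mask, background=None):
    """Crop the masked object to its bounding box.

    picture: rows of pixel values; mask: same shape, truthy for object pixels.
    Returns (cutout, (left, top, right, bottom)) or None for an empty mask.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    cutout = [
        # Pixels outside the mask become the placeholder background value.
        [picture[r][c] if mask[r][c] else background for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]
    return cutout, (left, top, right, bottom)

picture = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
cutout, box = segment_by_mask(picture, mask)
```

When several second target objects are present, the same function can be called once per mask.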
In a second aspect, the present application provides a large model-based picture generation apparatus, including:
The prediction module is used for inputting the first background picture and the text description into the position prediction model, and predicting to obtain the target position information of the first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture;
The segmentation module is used for acquiring a second object which is the same type as the first object in the target picture and segmenting the second object from the target picture according to the mask of the second object;
the scaling module is used for confirming the scaling of the segmented second target object according to the size relation between the target position size in the target position information and the segmented second target object;
and the fusion module is used for fusing the segmented second target object into the first background picture according to the target position information and the scaling.
In a third aspect, the present application provides a large model-based picture generation apparatus, including: at least one processor and memory;
the memory stores computer-executable instructions;
The at least one processor executes the computer-executable instructions stored by the memory, causing the at least one processor to perform the large model-based picture generation method as described above.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a large model based picture generation method as described above.
According to the picture generation method, device, equipment and medium based on the large model, a first background picture and text description are input into a position prediction model, and target position information of a first target object in the first background picture is obtained through prediction; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture; acquiring a second target object which is the same type as the first target object in a target picture, and dividing the second target object from the target picture according to a mask of the second target object; confirming the scaling of the segmented second object according to the size relation between the target position size in the target position information and the segmented second object; and fusing the segmented second target object into the first background picture according to the target position information and the scaling.
In the method, a position prediction model which can be used for predicting the position of the target object in the background picture is trained in advance, and the position prediction model predicts the position information of the target object according to the global picture characteristic of the first background picture and the global text characteristic of the text description;
Inputting the first background picture and text description to be fused into a position prediction model, wherein the position prediction model can output target position information of a first target object in the first background picture, and the first target object is a certain type of target object which is determined to be fused into the first background picture; the target picture is provided with a second target object which needs to be fused to the first background picture, and the type of the second target object is the same as that of the first target object; after the second object is extracted from the object picture, confirming the scaling of the second object according to the size relation between the second object and the object position in the object position information, and fusing the segmented second object into the first background picture according to the object position information and the scaling; the target position information is the predicted accurate placement position of the second target, the scaling is the confirmed adjustable proportion of the second target, and finally the second target can be reasonably fused into the first background picture.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a hardware environment diagram of a large model-based picture generation method provided by an embodiment of the application;
FIG. 2 is a schematic flowchart of a large model-based picture generation method according to an embodiment of the present application;
FIG. 3 is a second schematic flowchart of a large model-based picture generation method according to an embodiment of the present application;
FIG. 4 is a third schematic flowchart of a large model-based picture generation method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a cross-modal feature fusion component provided by an embodiment of the present application;
FIG. 6 is a further schematic flowchart of a large model-based picture generation method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a large model-based picture generation device according to an embodiment of the present invention;
FIG. 8 is a hardware schematic diagram of a large model-based picture generation device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the prior art, when a target commodity and a background picture are fused without manual intervention, the fused picture may be unreasonable: for example, a refrigerator is placed on a table, or a television appears suspended in mid-air instead of standing on a television cabinet. With manual intervention, such unreasonable fused pictures can largely be avoided, but the operation process and the labor cost increase.
In order to solve the problems, the application provides a picture generation method based on a large model.
The following describes the implementation process of a large model-based picture generation method according to the present application with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a hardware environment diagram of a large model-based picture generation method provided by an embodiment of the application. According to an aspect of an embodiment of the present application, a method for generating a picture based on a large model is provided. The method is widely applicable to whole-house intelligent digital control scenarios such as Smart Home, the smart home device ecosystem, and the Intelligence House ecosystem. Optionally, in this embodiment, the picture generation method may be applied to a hardware environment constituted by the terminal device 102 and the server 104 as shown in FIG. 1. As shown in FIG. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or for a client installed on the terminal; a database may be set up on the server or independently of the server to provide data storage services for the server 104; and cloud computing and/or edge computing services may be configured on the server or independently of the server to provide data computing services for the server 104.
The network may include, but is not limited to, at least one of: a wired network, a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network; the wireless network may include, but is not limited to, at least one of: Wi-Fi (Wireless Fidelity), Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, a smart washing device, a smart dishwasher, a smart projection device, a smart television, a smart clothes hanger, a smart curtain, a smart video device, a smart socket, a smart speaker, a smart fresh air device, a smart kitchen and bathroom device, a smart bathroom device, a smart sweeping robot, a smart window cleaning robot, a smart mopping robot, a smart air purifying device, a smart steam oven, a smart microwave oven, a smart kitchen appliance, a smart purifier, a smart water dispenser, a smart door lock, and the like.
Fig. 2 is a schematic flow chart of a large model-based picture generation method according to an embodiment of the present application. As shown in fig. 2, the method includes:
S201, inputting a first background picture and a text description into a position prediction model, and predicting target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, and predicting the target position information according to the global picture features and the global text features; the text description is used for describing the position of the first target object in the first background picture.
The position prediction model is a pre-trained model for predicting where a target object should be placed in a background picture; during training it relies mainly on feature information to determine the position of the target object in the background picture. The position prediction model can take the first background picture together with the text description as input, or the first background picture alone.
During prediction, the position prediction model likewise extracts feature information from its input and predicts the target position information from that feature information; the input comprises the first background picture and the text description, and the feature information comprises the global picture features of the first background picture and the global text features of the text description.
The first background picture generally does not contain the first target object. The text description states where in the first background picture the first target object should be placed and thus helps constrain the placement position. For example, if the first target object is a refrigerator and the first background picture contains a sofa and a cabinet, the text description may be "to the left of the sofa", "inside the cabinet", and so on. Constrained by the first background picture and the text description, the position prediction model uses its position prediction capability to determine the specific position of the refrigerator, yielding the target position information.
S202, a second object which is the same type as the first object in the target picture is obtained, and the second object is segmented from the target picture according to a mask of the second object.
The target picture is provided with a second target object which needs to be fused to the first background picture, and the type of the second target object is the same as that of the first target object; the first object and the second object may be the same type of product, e.g., the first object and the second object are both refrigerators; or on the basis of the same product type, the product attributes (color, shape, etc.) are the same, for example, the first object and the second object are white refrigerators, and the gray refrigerators and the white refrigerators are of different types;
the purpose of fusing the target picture with the first background picture is to provide the second target object with a background of a suitable theme; however, the second target object in the target picture may itself have a background, in which case the second target object needs to be identified and extracted from the target picture; the second target object can be segmented from the target picture by means of its mask.
S203, confirming the scaling of the segmented second object according to the size relation between the target position size in the target position information and the segmented second object.
The position prediction model predicts the target position information of the first target object in the first background picture. From the plurality of target position coordinates in the target position information, the size of the target position in which the first target object may be placed can be confirmed, that is, the size available for placing the second target object from the target picture; the second target object segmented from the target picture may differ in size from that placement position, and therefore the scaling of the segmented second target object needs to be confirmed according to the size relationship between the target position in the target position information and the segmented second target object.
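The size comparison in S203 can be illustrated with a minimal sketch. It assumes the target position information is an axis-aligned box given as (x_min, y_min, x_max, y_max) and that the aspect ratio of the second target object is preserved; neither assumption is stated in the embodiment, and the function name `confirm_scaling` is hypothetical.

```python
def confirm_scaling(target_box, object_size):
    """Return a uniform scale factor that fits the object inside the target box.

    target_box:  (x_min, y_min, x_max, y_max), assumed form of the predicted
                 target position coordinates.
    object_size: (width, height) of the segmented second target object.
    """
    x_min, y_min, x_max, y_max = target_box
    box_w, box_h = x_max - x_min, y_max - y_min
    obj_w, obj_h = object_size
    if min(box_w, box_h, obj_w, obj_h) <= 0:
        raise ValueError("box and object must have positive width and height")
    # Assumption: the aspect ratio is kept, so the tighter dimension decides.
    return min(box_w / obj_w, box_h / obj_h)


# A 400x1000 object placed into a 200x400 target position.
scale = confirm_scaling((100, 50, 300, 450), (400, 1000))
```

Here the height is the tighter constraint (400 / 1000), so the object would be shrunk to 40% of its original size.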
S204, fusing the segmented second object into the first background picture according to the object position information and the scaling.
Once the target position information and the scaling are confirmed, both the specific position at which the second target object is fused on the first background picture and its scaled size are known, so that the second target object can be fused accurately with the first background picture to generate a reasonable fused picture.
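As a rough illustration of S204, the sketch below resizes a segmented object and pastes it onto a copy of the background wherever the mask is set. Pictures are modelled as lists of pixel rows, nearest-neighbour resizing is used, and the names `resize_nearest` and `fuse` are hypothetical; the embodiment does not specify the resampling or blending method.

```python
def resize_nearest(image, scale):
    """Resize a picture (list of pixel rows) by nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [
        [image[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def fuse(background, obj, mask, top, left):
    """Paste obj onto a copy of background wherever mask is non-zero.

    (top, left) is the assumed upper-left corner of the target position.
    Pixels that would fall outside the background are skipped.
    """
    fused = [row[:] for row in background]
    for r, (obj_row, mask_row) in enumerate(zip(obj, mask)):
        for c, (pixel, keep) in enumerate(zip(obj_row, mask_row)):
            rr, cc = top + r, left + c
            if keep and 0 <= rr < len(fused) and 0 <= cc < len(fused[0]):
                fused[rr][cc] = pixel
    return fused


background = [[0] * 4 for _ in range(4)]
obj = resize_nearest([[9]], 2)    # a 1x1 object scaled up to 2x2
mask = resize_nearest([[1]], 2)   # the mask is resized the same way
result = fuse(background, obj, mask, top=1, left=1)
```

Resizing the mask together with the object keeps the two aligned, so only object pixels overwrite the background.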
The first background picture may have target position information corresponding to a plurality of first targets, and the second targets segmented from the target picture may also have a plurality of second targets, so the second targets and the first background picture need to be matched:
For example, the dividing the second object from the target picture according to the mask of the second object includes:
Confirming the second target object according to preset target parameters, wherein the preset target parameters are used for selecting the second target object;
If a plurality of second targets exist, dividing each second target from the target picture according to the mask of each second target;
the fusing the segmented second object into the first background picture includes:
Acquiring the first background picture matched with the plurality of second targets, and fusing the plurality of segmented second targets into the first background picture based on a cross-modal generative large model.
The target detection model can be used for determining preset target parameters, and in the process of training the target detection model, the target detection model can gradually identify the second target objects, so that the preset target parameters can be calculated through the target detection model when the target detection model is used, and one or more second target objects in the target picture are selected by the preset target parameters; if a plurality of second targets exist, dividing each second target from the target picture according to the mask of each second target;
after the second target object is segmented, a certain first background picture which is suitable for the segmented second target object is needed to be found out from the plurality of first background pictures; for example, two refrigerators are extracted, then the first background picture predicted to have two cabinet positions is found correspondingly, and the two are fused.
Semantic segmentation is carried out on each second object, and each second object is completely segmented:
For example, performing semantic segmentation on the target picture to obtain a mask of each second target object;
And dividing each second object from the object picture according to the mask of each second object.
Dividing the target picture in a semantic dividing mode to obtain masks of second targets which can be identified by the semantic dividing model, and dividing each second target from the target picture according to the mask of each second target.
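The mask-based segmentation described above can be sketched as follows, assuming the semantic segmentation model has already produced a binary mask with the same height and width as the target picture. The helper name `segment_by_mask` is hypothetical, and background pixels inside the bounding box are simply set to 0.

```python
def segment_by_mask(picture, mask):
    """Cut an object out of a picture using its binary mask.

    Returns (object, cropped_mask), both cropped to the bounding box of
    the mask. Pixels outside the mask are set to 0 in the returned object.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    obj = [
        [picture[r][c] if mask[r][c] else 0 for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]
    cropped_mask = [mask[r][left:right + 1] for r in range(top, bottom + 1)]
    return obj, cropped_mask


picture = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = [[0, 0, 0], [0, 1, 1], [0, 0, 1]]
obj, obj_mask = segment_by_mask(picture, mask)
```

When several second targets exist, the same function would be called once per mask.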
In the embodiment of the application, a position prediction model which can be used for predicting the position of a target object in a background picture is trained in advance, and the position prediction model predicts the position information of the target according to the global picture characteristic of a first background picture and the global text characteristic of text description;
Inputting the first background picture and text description to be fused into a position prediction model, wherein the position prediction model can output target position information of a first target object in the first background picture, and the first target object is a certain type of target object which is determined to be fused into the first background picture; the target picture is provided with a second target object which needs to be fused to the first background picture, and the type of the second target object is the same as that of the first target object; after the second object is extracted from the object picture, confirming the scaling of the second object according to the size relation between the second object and the object position in the object position information, and fusing the segmented second object into the first background picture according to the object position information and the scaling; the target position information is the predicted accurate placement position of the second target, the scaling is the confirmed adjustable proportion of the second target, and finally the second target can be reasonably fused into the first background picture.
Fig. 3 is a second flow chart of a large model-based picture generation method according to an embodiment of the present application. As shown in fig. 3, the method includes:
S301, acquiring global picture features of the first background picture and global text features of the text description through a decoder in the position prediction model, wherein the decoder comprises a picture decoder and a text decoder.
In the position prediction model, a picture and a text can be input and output in a coding and decoding mode, wherein the picture and the text can be subjected to feature extraction through a decoder of the position prediction model; correspondingly, the decoder for acquiring the global picture feature is a picture decoder, and the decoder for acquiring the global text feature is a text decoder, and the global picture feature of the first background picture and the global text feature of the text description corresponding to the first background picture are acquired in this embodiment.
S302, carrying out feature alignment and fusion on the global picture features and the global text features to obtain first fusion features.
The global picture features are used for representing the feature information of different target contents in the first background picture, the different target contents are represented by feature matrixes formed by a plurality of global picture feature values, the global text features are used for representing the feature information of different target contents in the text description, and the same is true, the different target contents are represented by the feature matrixes formed by a plurality of global text feature values; for example, the sofa and the cabinet are different target contents, and the global picture characteristic values corresponding to the sofa and the cabinet are different;
To fully define the location of the target content, it is necessary to feature align the global picture feature with the global text feature, to align the global picture feature with the global text feature with the same target content, for example, to align the feature related to the cabinet in the global picture feature with the feature related to the cabinet in the global text feature; and fusing the aligned global picture features and the global text features to obtain first fusion features.
S303, carrying out position regression on the first fusion characteristic, and predicting to obtain target position information of the first target object in the first background picture.
And carrying out position regression on the first fusion characteristic by using an end-to-end model algorithm to obtain the target position information of the first target object in the first background picture.
In the embodiment of the application, based on the global picture features and the global text features, model reasoning is carried out on the target position of the first target object in the first background picture, so that the position coordinate of the first target object in the first background picture is accurately predicted, and the subsequent fusion of the target object and the background picture is facilitated.
Fig. 4 is a flow chart of a large model-based picture generation method according to an embodiment of the present application. Fig. 5 is a schematic diagram of a cross-modal feature fusion component provided by an embodiment of the present application. As shown in fig. 4, the method includes:
S401, segmenting the global picture feature into local picture features, and segmenting the global text feature into first local text features and word frequency information, wherein the word frequency information comprises occurrence frequencies of word features associated with the first object in the first local text features.
The global picture features are picture features of the whole picture of the first background picture, and the local picture features are picture features of the local picture of the first background picture; dividing global picture features of a single first background picture into a plurality of local picture features according to a first preset dividing value; for example, 256×256 global picture features are cut into a plurality of 16×16 local picture features;
The global text feature comprises all text features of the text description, each first local text feature comprises part of the text features of the text description, and two first local text features may carry identical text features; for example, if the text description is "a refrigerator needs to be in the cabinet", the global text feature characterizes the whole description, and the first local text features respectively characterize "needs", "refrigerator" and "in the cabinet"; if "refrigerator" is described repeatedly in the text description, a plurality of first local text features may characterize "refrigerator"; words such as "needs" are word features not associated with the first object and are to be deleted;
The word frequency information may record the occurrence frequency of each word feature associated with the first object; for example, if a word feature characterizes "refrigerator" and two first local text features characterize "refrigerator", the occurrence frequency recorded for that word feature is two.
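The segmentation of global picture features in S401 can be sketched as a plain grid split. The sketch assumes the global picture feature is a two-dimensional map whose side lengths are multiples of the patch size (the "first preset dividing value"); the function name `split_into_patches` is hypothetical.

```python
def split_into_patches(feature_map, patch):
    """Split a 2-D feature map into a grid of patch x patch local features.

    Returns grid[i][j], the local feature for row i, column j of the grid.
    """
    h, w = len(feature_map), len(feature_map[0])
    if h % patch or w % patch:
        raise ValueError("feature map size must be a multiple of the patch size")
    return [
        [[row[c:c + patch] for row in feature_map[r:r + patch]]
         for c in range(0, w, patch)]
        for r in range(0, h, patch)
    ]


# A 256x256 global picture feature cut into 16x16 local picture features.
global_feature = [[r * 256 + c for c in range(256)] for r in range(256)]
grid = split_into_patches(global_feature, 16)
```

With these sizes the grid has 16 rows and 16 columns, that is, 256 local picture features in total.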
S402, de-duplicating the first local text feature according to the word frequency information to obtain a second local text feature.
When several first local text features represent the same text feature, they need to be de-duplicated so that only one of them remains; for example, if two first local text features characterize "refrigerator", one of them can be deleted; the first local text features remaining after de-duplication are the second local text features.
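The word frequency bookkeeping and the de-duplication in S402 can be sketched with plain strings standing in for text features. The set `associated_words` (the words considered related to the first object or its position) and the function name `deduplicate` are hypothetical; a real implementation would compare feature vectors rather than strings.

```python
from collections import Counter


def deduplicate(first_local_features, associated_words):
    """Return (second_local_features, word_frequency).

    first_local_features: tokens standing in for first local text features.
    associated_words:     assumed set of tokens associated with the object.
    """
    # Word frequency information: occurrences of each associated token.
    word_frequency = Counter(
        token for token in first_local_features if token in associated_words
    )
    second_local_features, seen = [], set()
    for token in first_local_features:
        # Unassociated tokens are dropped; repeated tokens are kept once.
        if token in word_frequency and token not in seen:
            seen.add(token)
            second_local_features.append(token)
    return second_local_features, word_frequency


tokens = ["needs", "refrigerator", "refrigerator", "cabinet"]
second, freq = deduplicate(tokens, {"refrigerator", "cabinet"})
```

In this run "refrigerator" is recorded with frequency two and kept once, while "needs" is removed as unrelated.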
S403, based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features.
When the local picture features and the second local text features are aligned, the weights corresponding to the features of the same target content of the picture and the text are increased at the same time; whether the local picture feature and the second local text feature are accurately aligned or not can be measured by the attention mechanism until the condition of the attention mechanism is satisfied:
For example, confirming the query vector of the attention mechanism according to the second local text feature and the projection of the second local text feature on the local picture feature;
confirming a key vector and a value vector of the attention mechanism according to the local picture feature and the projection of the local picture feature on the second local text feature;
According to the query vector, the key vector and the value vector, a local attention weight matrix is obtained, and the local attention weight matrix is collected to be a global attention weight matrix;
and according to the global attention weight matrix, carrying out global feature alignment on the local picture features and the second local text features.
Obtaining projection of the second local text feature on the local picture feature to obtain a first projection matrix W;
obtaining a query vector Q according to the first projection matrix and the second local text feature X;
Obtaining projection of the local picture features on the second local text features to obtain a second projection matrix U;
Obtaining a key vector K and a value vector V according to the second projection matrix and the local picture feature Y;
Optionally, Q = XW and K = V = YU;
specifically, whether the local picture features and the second local text features are aligned accurately can be measured through a local attention weight matrix under an attention mechanism, and if the local attention weight matrix meets a second weight threshold, the alignment is confirmed;
The local attention weight matrix expression is:
$$A(X, Y) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
wherein A(X, Y) is the local attention weight matrix; softmax is the normalization function; X is the second local text feature; Y is the local picture feature; Q is the query vector; K is the key vector, K = V; V is the value vector; d is the number of rows of the second projection matrix U; T denotes the transpose;
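The attention computation for one sub-picture can be sketched as below, assuming the usual scaled dot-product form softmax(QKᵀ/√d)V, which is what the listed symbols (normalization function, Q, K, V, d, transpose) suggest. Matrices are lists of rows, the helper names are hypothetical, and d is taken as the number of rows of U as the text defines it.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def local_attention(X, Y, W, U):
    """Return (attended, weights) for text features X and picture features Y."""
    Q = matmul(X, W)          # query vector from the second local text feature
    K = V = matmul(Y, U)      # key and value vectors from the local picture feature
    d = len(U)                # number of rows of U
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    weights = softmax_rows(scores)
    return matmul(weights, V), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = local_attention(X, Y, identity, identity)
```

Each row of `weights` sums to 1 and shows how strongly one text feature attends to each picture feature.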
The local picture features are obtained by segmenting the global picture features, so one global picture feature corresponds to a plurality of local picture features; for example, if the global picture feature corresponds to the whole picture and the local picture features correspond to 3×3 sub-pictures (3 rows and 3 columns) cut from the whole picture, then 3×3 corresponding local attention weight matrices can be calculated; these local attention weight matrices are assembled into the global attention weight matrix according to the positions of the sub-pictures in the whole picture, and global feature alignment can be completed according to the global attention weight matrix, so that the aligned global picture features and global text features are obtained;
For example, the global attention weight matrix C may be expressed as:
$$C = \begin{bmatrix} A_{11} & \cdots & A_{1y} \\ \vdots & \ddots & \vdots \\ A_{x1} & \cdots & A_{xy} \end{bmatrix}$$
wherein A_{11} is the local attention weight matrix corresponding to the sub-picture in row 1, column 1; A_{x1} is the local attention weight matrix corresponding to the sub-picture in row x, column 1; A_{1y} is the local attention weight matrix corresponding to the sub-picture in row 1, column y; and A_{xy} is the local attention weight matrix corresponding to the sub-picture in row x, column y.
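One way to collect the local matrices into a global one is a block-matrix layout that follows the sub-picture positions. This layout is an assumption, since the text only states that the global matrix is built according to those positions; the function name `assemble_global` is hypothetical, and all blocks in one grid row are assumed to have the same height.

```python
def assemble_global(local_grid):
    """Stack local matrices into one block matrix.

    local_grid[i][j] is the local attention weight matrix (list of rows)
    for the sub-picture in row i, column j of the whole picture.
    """
    global_rows = []
    for block_row in local_grid:
        height = len(block_row[0])
        for r in range(height):
            # Concatenate row r of every block in this grid row.
            global_rows.append([v for block in block_row for v in block[r]])
    return global_rows


local_grid = [
    [[[1, 2]], [[3, 4]]],   # blocks for sub-picture row 1
    [[[5, 6]], [[7, 8]]],   # blocks for sub-picture row 2
]
C = assemble_global(local_grid)
```

For a 3×3 split of the picture, `local_grid` would simply hold three rows of three blocks each.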
S404, performing feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.
The cross-modal feature fusion component is a component for feature fusion, and can perform feature fusion on the aligned global picture features and the aligned global text features to obtain first fusion features:
According to the structure of the cross-modal feature fusion component, the process of fusing the global picture features and the global text features is as follows:
for example, inputting the aligned global picture feature into a first layer structure of the cross-modal feature fusion component, and inputting the aligned global text feature into a second layer structure of the cross-modal feature fusion component, wherein the first layer structure comprises a plurality of first network layers, and the second layer structure comprises a plurality of second network layers;
Fusing a first output of each current non-last first network layer with a second output of each current non-last second network layer to obtain a fusion input, wherein the first output and the fusion input are used for fusing and inputting the next first network layer, and the second output and the fusion input are used for fusing and inputting the next second network layer;
and fusing the third output of the last first network layer and the fourth output of the last second network layer to obtain the first fusion characteristic.
As shown in fig. 5, the cross-modal feature fusion component includes two layer structures, the first layer structure includes a plurality of first network layers, the second layer structure includes a plurality of second network layers, the number of the first network layers and the second network layers can be determined according to actual requirements, each first network layer can be the same or different neural network structures, and each second network layer can be the same or different neural network structures; the global picture feature may be input to a first network layer of the first layer structure and the global text feature may be input to a first second network layer of the second layer structure; fusing the first output of the first network layer and the second output of the first second network layer to obtain fused input;
fusing the first output and the fusion input through a multiplier to obtain a first fusion output; fusing the second output and the fusion input through a multiplier to obtain a second fusion output; the first fusion output continues to be input to the next first network layer, and the second fusion output continues to be input to the next second network layer;
The last first network layer obtains a third output according to the first fusion output of the previous first network layer; the last second network layer obtains a fourth output according to the second fusion output of the previous second network layer; and fusing the third output and the fourth output through an adder to obtain a first fusion characteristic.
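The data flow through the cross-modal feature fusion component can be sketched with features as flat lists and network layers as plain callables. Two points are assumptions rather than statements of the embodiment: both branches have the same number of layers, and the fusion input is formed as the element-wise mean of the two branch outputs (the text does not say how that step is computed). The function name `cross_modal_fuse` is hypothetical.

```python
def cross_modal_fuse(picture_feature, text_feature, picture_layers, text_layers):
    """Run two parallel branches with per-layer cross fusion.

    picture_layers / text_layers: lists of callables mapping a list to a list.
    """
    if not picture_layers or len(picture_layers) != len(text_layers):
        raise ValueError("both branches need the same non-zero number of layers")
    p, t = picture_feature, text_feature
    for p_layer, t_layer in zip(picture_layers[:-1], text_layers[:-1]):
        first_out, second_out = p_layer(p), t_layer(t)
        # Assumption: fusion input = element-wise mean of the two outputs.
        fusion_in = [(a + b) / 2 for a, b in zip(first_out, second_out)]
        # Multiplier: each branch output is multiplied with the fusion input.
        p = [a * f for a, f in zip(first_out, fusion_in)]
        t = [b * f for b, f in zip(second_out, fusion_in)]
    third_out, fourth_out = picture_layers[-1](p), text_layers[-1](t)
    # Adder: the outputs of the two last layers are summed.
    return [a + b for a, b in zip(third_out, fourth_out)]


def double(v):
    return [2 * x for x in v]


def identity(v):
    return list(v)


first_fusion_feature = cross_modal_fuse(
    [1.0, 2.0], [3.0, 4.0], [double, identity], [identity, identity]
)
```

Real network layers would replace `double` and `identity`; the multiplier and adder steps stay the same.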
In the embodiment of the application, the global picture features and the global text features are subjected to feature alignment based on an attention mechanism; performing feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features; and the global picture features and the global text features are fully fused, so that the accuracy of the later position regression is ensured.
Fig. 6 is a flow chart of a large model-based picture generation method according to an embodiment of the present application. As shown in fig. 6, the method includes:
S601, acquiring a triplet loss function and a keyword confidence in the text description.
When a position prediction model is trained, in the process of carrying out position regression on the first fusion feature, carrying out rewarding and punishment through the triplet loss function and the keyword confidence coefficient, so that the triplet loss function and the keyword confidence coefficient are firstly obtained;
the process for constructing the triplet loss function comprises the following steps:
For example, a second fusion feature obtained in a training process is obtained, and a first distance between an anchor sample and a positive sample in the second fusion feature and a second distance between the anchor sample and a negative sample in the second fusion feature are confirmed, wherein the positive sample is a sample of the same type as the anchor sample, and the negative sample is a sample of a different type from the anchor sample;
And confirming the triplet loss function according to a preset constant, the first distance and the second distance.
The triplet loss function L can be expressed as:
$$L = \max\big(d(r, p) - d(r, n) + m,\ 0\big)$$
wherein max(k, 0) is used for obtaining the maximum value between 0 and k; d(r, p) is the distance between r and p, i.e. the first distance; d(r, n) is the distance between r and n, i.e. the second distance; r is the anchor sample in the second fusion feature; p is the positive sample in the second fusion feature; n is the negative sample in the second fusion feature; m is a preset constant;
The second fusion feature is obtained while training the position prediction model and comprises feature representations of the target contents. For example, a white refrigerator and a gray refrigerator are both target contents and both refrigerators; if the feature corresponding to the selected white refrigerator is the anchor sample, other features related to the white refrigerator are positive samples, and features inconsistent with the white refrigerator are negative samples.
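A minimal sketch of the triplet loss built from the first distance, the second distance and the preset constant is given below. Euclidean distance is an assumption, since the embodiment does not name the distance measure, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    first_distance = euclidean(anchor, positive)
    second_distance = euclidean(anchor, negative)
    return max(first_distance - second_distance + margin, 0.0)


# Positive closer than negative by more than the margin: no penalty.
easy = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=1.0)
# Negative closer than positive: the loss is positive.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the preset constant.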
S602, establishing a loss function of the position prediction model according to the keyword confidence and the triplet loss function; wherein the loss function is used to train the position prediction model.
By combining the keyword confidence coefficient and the triplet loss function, proper weights are respectively set for the keyword confidence coefficient and the triplet loss function, and a loss function suitable for a position prediction model can be constructed:
Illustratively, a first loss coefficient is obtained according to a first weight and the keyword confidence coefficient;
Acquiring a second loss coefficient according to the second weight and the triplet loss function;
and obtaining a loss function of the position prediction model according to the first loss coefficient and the second loss coefficient.
The loss function Loss of the position prediction model is: Loss = jN + gL; wherein j is the first weight, applied to the keyword confidence; g is the second weight, applied to the triplet loss function; N is the keyword confidence; L is the triplet loss function.
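The weighted combination can be checked with a short worked example. The weight values and the confidence value below are purely illustrative, and the function name `position_loss` is hypothetical.

```python
def position_loss(keyword_confidence, triplet_loss_value, j, g):
    """Loss = j * N + g * L, with N the keyword confidence and L the triplet loss."""
    first_loss_coefficient = j * keyword_confidence
    second_loss_coefficient = g * triplet_loss_value
    return first_loss_coefficient + second_loss_coefficient


# Illustrative values only: N = 0.8, L = 4.5, j = 0.25, g = 0.5.
loss = position_loss(0.8, 4.5, j=0.25, g=0.5)
```

With these values the first loss coefficient is 0.2 and the second is 2.25, so the total is 2.45.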
In the embodiment of the application, when the position prediction model is trained, a proper loss function is constructed for the position prediction model, the position prediction model is trained by using the loss function in a back propagation mode until the position prediction model is trained, and the loss function is fixed in the position prediction model and is used for prediction.
When training a position prediction model, a large amount of data can be acquired for training:
For example, in the database, a second background picture is obtained;
pre-labeling the position of a third target object in the second background picture to obtain a semantic text;
And carrying out data cleaning on the second background picture and the semantic text, and training the second background picture and the semantic text after data cleaning as the input of an initial position prediction model to obtain the position prediction model.
The second background picture is a background picture which is obtained from a database and is different from the first background picture, the position of a third object in the second background picture can be pre-marked through a picture understanding large model, and a corresponding semantic text is obtained, wherein the semantic text is used for describing the position of the third object in the second background picture; and according to the requirements, carrying out data cleaning (data screening) on the processed second background picture and the semantic text, and training the second background picture and the semantic text after the data cleaning as the input of an initial position prediction model (an untrained position prediction model) until the training is completed to obtain the position prediction model.
Fig. 7 is a diagram of a large model-based picture generating device according to an embodiment of the present invention, where, as shown in fig. 7, the device includes: a prediction module 701, a segmentation module 702, a scaling module 703 and a fusion module 704;
The prediction module 701 is configured to input a first background picture and a text description into a position prediction model, and predict to obtain target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture.
The prediction module 701 is further configured to obtain, through a decoder in the position prediction model, a global picture feature of the first background picture and a global text feature of the text description, where the decoder includes a picture decoder and a text decoder;
performing feature alignment and fusion on the global picture features and the global text features to obtain first fusion features;
And carrying out position regression on the first fusion characteristic, and predicting to obtain target position information of the first target object in the first background picture.
The prediction module 701 is further configured to segment the global picture feature into local picture features, segment the global text feature into first local text features and word frequency information, where the word frequency information includes occurrence frequencies of text features associated with the first object in the respective first local text features; performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature;
Based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features;
and carrying out feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.
The prediction module 701 is further configured to confirm a query vector of the attention mechanism according to the second local text feature and a projection of the second local text feature on the local picture feature;
confirming a key vector and a value vector of the attention mechanism according to the local picture feature and the projection of the local picture feature on the second local text feature;
According to the query vector, the key vector and the value vector, a local attention weight matrix is obtained, and the local attention weight matrix is collected to be a global attention weight matrix;
and according to the global attention weight matrix, carrying out global feature alignment on the local picture features and the second local text features.
The prediction module 701 is further configured to input the aligned global picture feature into a first layer structure of the cross-modal feature fusion component, and input the aligned global text feature into a second layer structure of the cross-modal feature fusion component, where the first layer structure includes a plurality of first network layers, and the second layer structure includes a plurality of second network layers;
Fusing a first output of each current non-last first network layer with a second output of each current non-last second network layer to obtain a fusion input, wherein the first output and the fusion input are used for fusing and inputting the next first network layer, and the second output and the fusion input are used for fusing and inputting the next second network layer;
and fusing the third output of the last first network layer and the fourth output of the last second network layer to obtain the first fusion characteristic.
The segmentation module 702 is configured to obtain a second object of the same type as the first object in the target picture, and segment the second object from the target picture according to a mask of the second object.
And a scaling module 703, configured to confirm the scaling of the segmented second object according to the size relationship between the target position size in the target position information and the segmented second object.
And a fusion module 704, configured to fuse the segmented second object into the first background picture according to the object position information and the scaling.
The application also provides a large model-based picture generation device, which comprises: at least one processor and memory;
the memory stores computer-executable instructions;
The at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform a large model-based picture generation method.
Fig. 8 is a hardware schematic diagram of a large model-based picture generation device according to an embodiment of the present invention. As shown in fig. 8, the large model-based picture generation apparatus 80 provided in the present embodiment includes: at least one processor 801 and a memory 802. The device 80 further comprises a communication component 803. The processor 801, the memory 802, and the communication section 803 are connected via a bus 804.
In a specific implementation, the at least one processor 801 executes computer-executable instructions stored in the memory 802, so that the at least one processor 801 performs the above large model-based picture generation method.
The specific implementation process of the processor 801 may refer to the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.
In the embodiment shown in fig. 8, it should be understood that the processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution.
The memory may include high-speed random access memory (RAM), and may further include non-volatile memory (NVM), such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or to one type of bus.
The present application also provides a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the large model-based picture generation method as described above.
The computer readable storage medium described above may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. A readable storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Alternatively, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). The processor and the readable storage medium may also reside as discrete components in a device.
The division of the units is merely a logic function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments described above may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains and as may be applied to the precise construction hereinbefore set forth and shown in the drawings and as follows in the scope of the appended claims.

Claims (12)

1. A large model-based picture generation method, comprising:
Inputting a first background picture and text description into a position prediction model, and predicting to obtain target position information of a first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture;
acquiring a second target object which is the same type as the first target object in a target picture, and dividing the second target object from the target picture according to a mask of the second target object;
confirming the scaling of the segmented second object according to the size relation between the target position size in the target position information and the segmented second object;
According to the target position information and the scaling, fusing the segmented second target object into the first background picture;
wherein the inputting the first background picture and the text description into the position prediction model and predicting to obtain the target position information of the first target object in the first background picture comprises:
Acquiring global picture features of the first background picture and global text features of the text description through a decoder in the position prediction model, wherein the decoder comprises a picture decoder and a text decoder;
performing feature alignment and fusion on the global picture features and the global text features to obtain first fusion features;
Position regression is carried out on the first fusion characteristic, and target position information of the first target object in the first background picture is obtained through prediction;
the step of carrying out feature alignment and fusion on the global picture features and the global text features to obtain first fusion features, including:
Dividing the global picture feature into local picture features, dividing the global text feature into first local text features and word frequency information, wherein the word frequency information comprises occurrence frequencies of word features associated with the first target object in the first local text features;
Performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature;
Based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features;
and carrying out feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.
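As an illustration of the scaling step recited in claim 1, the following minimal sketch derives a scale factor from the target position size and the size of the segmented object. The function names `fit_scale` and `scaled_size`, the `(width, height)` tuple layout, and the choice of the smaller axis ratio (which preserves the aspect ratio) are assumptions made for illustration; the claim only requires that the scaling be confirmed from the size relation.

```python
def fit_scale(target_size, object_size):
    """Return a uniform scale factor that fits the object inside the target box.

    target_size, object_size: (width, height) in pixels (assumed layout).
    """
    target_w, target_h = target_size
    object_w, object_h = object_size
    if object_w <= 0 or object_h <= 0:
        raise ValueError("object size must be positive")
    # Assumption: use the smaller axis ratio so the scaled object
    # never exceeds the predicted target box.
    return min(target_w / object_w, target_h / object_h)


def scaled_size(object_size, scale):
    """Apply the scale factor and round to whole pixels (at least 1 px)."""
    return tuple(max(1, round(side * scale)) for side in object_size)


# A 400x400 object placed into a 200x100 target box.
scale = fit_scale((200, 100), (400, 400))
new_size = scaled_size((400, 400), scale)
```

Here the height is the limiting axis, so the object is scaled by 0.25 to 100x100 pixels before it is fused into the first background picture.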
2. The large model based picture generation method of claim 1, wherein the performing global feature alignment on the local picture feature and the second local text feature based on an attention mechanism comprises:
Confirming a query vector of the attention mechanism according to the second local text feature and the projection of the second local text feature on the local picture feature;
confirming a key vector and a value vector of the attention mechanism according to the local picture feature and the projection of the local picture feature on the second local text feature;
obtaining local attention weight matrices according to the query vector, the key vector and the value vector, and aggregating the local attention weight matrices into a global attention weight matrix;
and according to the global attention weight matrix, carrying out global feature alignment on the local picture features and the second local text features.
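The attention-based alignment of claim 2 can be pictured with the simplified sketch below. It swaps in plain scaled dot-product cross-attention, with text features as queries and picture features as keys and values. The projections between the two modalities and the aggregation of local weight matrices into a global one are omitted, and all names (`softmax`, `cross_attention`, the toy vectors) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, picture_feats):
    """Attend from text tokens (queries) to picture patches (keys and values).

    Returns (weights, aligned): weights[i][j] is how strongly text token i
    attends to picture patch j, and aligned[i] is the weighted sum of the
    picture patch vectors for that token.
    """
    dim = len(picture_feats[0])
    weights = []
    for q in text_feats:
        # Scaled dot product between one query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in picture_feats]
        weights.append(softmax(scores))
    aligned = [
        [sum(w * v[d] for w, v in zip(row, picture_feats)) for d in range(dim)]
        for row in weights
    ]
    return weights, aligned


text = [[1.0, 0.0], [0.0, 1.0]]                  # two toy text-token vectors
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy picture patches
weights, aligned = cross_attention(text, patches)
```

Each row of `weights` sums to 1, so every text token distributes its attention over the picture patches; a patch orthogonal to the token receives the smallest weight.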
3. The method for generating a large model-based picture according to claim 1, wherein the feature-fusing the aligned global picture feature and the global text feature by a cross-modal feature-fusion component to obtain the first fused feature comprises:
inputting the aligned global picture features into a first layer structure of the cross-modal feature fusion component, and inputting the aligned global text features into a second layer structure of the cross-modal feature fusion component, wherein the first layer structure comprises a plurality of first network layers, and the second layer structure comprises a plurality of second network layers;
fusing a first output of each current non-last first network layer with a second output of each current non-last second network layer to obtain a fusion input, wherein the first output and the fusion input are fused and then input into the next first network layer, and the second output and the fusion input are fused and then input into the next second network layer;
and fusing the third output of the last first network layer and the fourth output of the last second network layer to obtain the first fusion characteristic.
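The layer-by-layer exchange described in claim 3 may be easier to follow as a small sketch. The fusion operator is not specified in the claim, so an element-wise average is assumed here; the layers are stand-in callables, the two stacks are assumed to have equal depth, and every name (`average`, `fuse_streams`, `double`, `identity`) is hypothetical.

```python
def average(a, b):
    """Element-wise mean of two equal-length vectors (assumed fusion operator)."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def fuse_streams(picture_feat, text_feat, picture_layers, text_layers, fuse=average):
    """Run two layer stacks in lockstep, exchanging a fused signal between layers."""
    if not picture_layers or len(picture_layers) != len(text_layers):
        raise ValueError("this sketch assumes non-empty stacks of equal depth")
    pic_in, txt_in = picture_feat, text_feat
    last = len(picture_layers) - 1
    for i, (pic_layer, txt_layer) in enumerate(zip(picture_layers, text_layers)):
        pic_out, txt_out = pic_layer(pic_in), txt_layer(txt_in)
        if i == last:
            # Last layers: their outputs are fused into the final feature.
            return fuse(pic_out, txt_out)
        # Non-last layers: build the shared fusion input, then mix it back
        # into each stream before the next layer.
        shared = fuse(pic_out, txt_out)
        pic_in, txt_in = fuse(pic_out, shared), fuse(txt_out, shared)


def double(vector):
    """Toy picture-side layer: doubles every component."""
    return [2 * x for x in vector]


def identity(vector):
    """Toy text-side layer: returns a copy of its input."""
    return list(vector)


fused = fuse_streams([1.0, 2.0], [3.0, 4.0], [double, double], [identity, identity])
```

With these toy layers the first exchange yields the shared vector `[2.5, 4.0]`, and the final fused feature is `[3.625, 6.0]`.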
4. The large model-based picture generation method according to claim 1, wherein before the performing position regression on the first fusion feature to predict the target position information of the first target object in the first background picture, the method further comprises:
Acquiring a triplet loss function and keyword confidence in the text description, and establishing a loss function of the position prediction model according to the keyword confidence and the triplet loss function; wherein the loss function is used to train the position prediction model.
5. The large model based picture generation method according to claim 4, wherein the obtaining a triplet loss function in the text description comprises:
acquiring a second fusion characteristic obtained in a training process, and confirming a first distance between an anchoring sample and a positive sample in the second fusion characteristic and a second distance between the anchoring sample and a negative sample in the second fusion characteristic, wherein the positive sample is a sample of the same type as the anchoring sample, and the negative sample is a sample of a different type from the anchoring sample;
And confirming the triplet loss function according to a preset constant, the first distance and the second distance.
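Claim 5 names the ingredients of the triplet loss (a preset constant, a first distance, and a second distance) without giving the formula. The sketch below assumes the commonly used hinge form, max(0, d(anchor, positive) - d(anchor, negative) + margin), with Euclidean distance; both choices, and the function names, are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss; `margin` plays the role of the preset constant."""
    first_distance = euclidean(anchor, positive)    # anchor to same-type sample
    second_distance = euclidean(anchor, negative)   # anchor to different-type sample
    return max(0.0, first_distance - second_distance + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1.0 from the anchor
near_negative = [0.0, 1.5]  # distance 1.5: not yet separated by the margin
far_negative = [3.0, 4.0]   # distance 5.0: well separated
loss_near = triplet_loss(anchor, positive, near_negative)
loss_far = triplet_loss(anchor, positive, far_negative)
```

The loss is positive only while the negative sample is not at least `margin` farther from the anchor than the positive sample, which is what pushes different-type samples apart during training.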
6. The method of claim 4, wherein the creating a loss function of the location prediction model based on the keyword confidence and the triplet loss function comprises:
Acquiring a first loss coefficient according to the first weight and the keyword confidence coefficient;
Acquiring a second loss coefficient according to the second weight and the triplet loss function;
and obtaining a loss function of the position prediction model according to the first loss coefficient and the second loss coefficient.
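Claim 6 does not state how the weights, the keyword confidence, and the triplet loss are combined. Purely as an illustration, the sketch below assumes that each loss coefficient is a product of a weight and its term and that the overall loss is their sum; the function name and default weights are hypothetical.

```python
def total_loss(keyword_confidence, triplet_loss_value,
               first_weight=0.5, second_weight=0.5):
    """Combine the two terms of claim 6 under an assumed weighted-sum form."""
    # Assumption: first loss coefficient = first weight x keyword confidence.
    first_coefficient = first_weight * keyword_confidence
    # Assumption: second loss coefficient = second weight x triplet loss.
    second_coefficient = second_weight * triplet_loss_value
    return first_coefficient + second_coefficient


example = total_loss(0.8, 0.5, first_weight=0.25, second_weight=0.75)
```

Under these assumptions the two weights simply set the relative influence of the keyword term and the triplet term on training.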
7. The large model-based picture generation method according to claim 1, wherein before the inputting of the first background picture and the text description into the position prediction model, the method further comprises:
acquiring a second background picture in a database;
pre-labeling the position of a third target object in the second background picture to obtain a semantic text;
And carrying out data cleaning on the second background picture and the semantic text, and training the second background picture and the semantic text after data cleaning as the input of an initial position prediction model to obtain the position prediction model.
8. The large model-based picture generation method according to claim 1, wherein the dividing the second object from the target picture according to the mask of the second object includes:
Confirming the second target object according to preset target parameters, wherein the preset target parameters are used for selecting the second target object;
If a plurality of second targets exist, dividing each second target from the target picture according to the mask of each second target;
the fusing the segmented second object into the first background picture includes:
acquiring the first background picture matched with the plurality of second targets, and fusing the plurality of segmented second targets into the first background picture based on a cross-modal generative large model.
9. The large model-based picture generation method according to claim 8, wherein the dividing each of the second objects from the target picture according to the mask of each of the second objects includes:
Performing semantic segmentation on the target picture to obtain a mask of each second target object;
And dividing each second object from the object picture according to the mask of each second object.
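The mask-based segmentation and the subsequent paste can be sketched with nested lists standing in for pictures. The mask is assumed to be already available (for example from a semantic segmentation model, which is not shown), `None` is used as a hypothetical marker for transparent pixels, and the helper names `cut_out` and `paste` are illustrative. No bounds checking is done on the paste position.

```python
def cut_out(picture, mask):
    """Crop the masked object; pixels outside the mask become None (transparent)."""
    rows = [r for r, line in enumerate(mask) if any(line)]
    cols = [c for c in range(len(mask[0])) if any(line[c] for line in mask)]
    if not rows:
        raise ValueError("mask is empty")
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [
        [picture[r][c] if mask[r][c] else None for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


def paste(background, cutout, top, left):
    """Return a copy of the background with the non-transparent pixels pasted in."""
    result = [list(line) for line in background]
    for dr, line in enumerate(cutout):
        for dc, pixel in enumerate(line):
            if pixel is not None:
                result[top + dr][left + dc] = pixel
    return result


picture = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 0, 1]]
obj = cut_out(picture, mask)   # bounding box covers rows 1-2, columns 1-2
canvas = paste([[0] * 4 for _ in range(3)], obj, top=0, left=2)
```

Only the pixels selected by the mask reach the background, so the unmasked corner of the bounding box leaves the background pixel untouched.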
10. A large model-based picture generation apparatus, comprising:
The prediction module is used for inputting the first background picture and the text description into the position prediction model, and predicting to obtain the target position information of the first target object in the first background picture; the position prediction model is used for acquiring global picture features of the first background picture and global text features of the text description, predicting the target position information according to the global picture features and the global text features, and the text description is used for describing the position of the first target object in the first background picture;
The segmentation module is used for acquiring a second object which is the same type as the first object in the target picture and segmenting the second object from the target picture according to the mask of the second object;
the scaling module is used for confirming the scaling of the segmented second target object according to the size relation between the target position size in the target position information and the segmented second target object;
the fusion module is used for fusing the segmented second object into the first background picture according to the object position information and the scaling;
The prediction module is specifically configured to obtain, through a decoder in the position prediction model, a global picture feature of the first background picture and a global text feature of the text description, where the decoder includes a picture decoder and a text decoder; performing feature alignment and fusion on the global picture features and the global text features to obtain first fusion features; position regression is carried out on the first fusion characteristic, and target position information of the first target object in the first background picture is obtained through prediction;
The prediction module is specifically configured to segment the global picture feature into local picture features and segment the global text feature into first local text features and word frequency information when performing feature alignment and fusion on the global picture feature and the global text feature to obtain a first fusion feature, where the word frequency information includes occurrence frequencies of word features associated with the first target object in each of the first local text features; performing de-duplication on the first local text feature according to the word frequency information to obtain a second local text feature; based on an attention mechanism, carrying out global feature alignment on the local picture features and the second local text features to obtain aligned global picture features and global text features; and carrying out feature fusion on the aligned global picture features and the global text features through a cross-modal feature fusion component to obtain the first fusion features.
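The word-frequency de-duplication performed by the prediction module can be illustrated on plain word strings. The patent operates on text features rather than raw words, so this is only an analogy under that assumption; the function name `deduplicate` and the keep-first-occurrence policy are hypothetical.

```python
from collections import Counter


def deduplicate(tokens):
    """Return (unique_tokens, word_frequency), keeping first occurrences in order."""
    frequency = Counter(tokens)
    seen = set()
    unique = []
    for token in tokens:
        if frequency[token] > 1 and token in seen:
            continue  # repeated word: drop every later occurrence
        seen.add(token)
        unique.append(token)
    return unique, frequency


unique, frequency = deduplicate(["cup", "on", "the", "table", "cup"])
```

The frequency table is kept alongside the de-duplicated sequence, so the information about how often a word tied to the target object occurred is not lost.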
11. A large model-based picture generation apparatus, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
The at least one processor executing computer-executable instructions stored in the memory, causing the at least one processor to perform the large model-based picture generation method of any one of claims 1-9.
12. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the large model based picture generation method as claimed in any one of claims 1-9.
CN202410166499.7A 2024-02-06 2024-02-06 Picture generation method, device, equipment and medium based on large model Active CN117710234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410166499.7A CN117710234B (en) 2024-02-06 2024-02-06 Picture generation method, device, equipment and medium based on large model

Publications (2)

Publication Number Publication Date
CN117710234A (en) 2024-03-15
CN117710234B (en) 2024-05-24

Family

ID=90146597

Country Status (1)

Country Link
CN (1) CN117710234B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant