CN118509542B

CN118509542B - Video generation method, device, computer equipment and storage medium

Info

Publication number: CN118509542B
Application number: CN202410962785.4A
Authority: CN
Inventors: 沈靖程; 吴大为
Original assignee: Pi Technology Changzhou Co ltd
Current assignee: Pi Technology Changzhou Co ltd
Priority date: 2024-07-18
Filing date: 2024-07-18
Publication date: 2024-11-29
Anticipated expiration: 2044-07-18
Also published as: CN118509542A

Abstract

The application provides a video generation method, a device, computer equipment and a storage medium. The method comprises: obtaining a panoramic video; performing human body target tracking on each video frame to obtain a target bounding box of a human body target; drawing a rectangular rendering picture according to the target bounding box so that the target bounding box lies in the middle area of the rectangular rendering picture; performing human body detection on the target bounding box to obtain a body region of interest; adjusting the field angle of the rendering picture so that the body region of interest occupies a set proportion of the rendering picture; generating a planar video of the human body target from the adjusted rendering picture; and adjusting the frame rate of the planar video to generate a slow-motion video at a target frame rate. The application can automatically generate a planar special-effect video of a dynamic moment without requiring complex video editing by the user.

Description

Video generation method, device, computer equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of video processing, in particular to a video generation method, a video generation device, computer equipment and a storage medium.

Background

With the development of video processing technology, people have become increasingly enthusiastic about special video processing effects, especially those involving moving objects, for example close-up shots of athletes during sports. At present, processing and generating video of a moving object is more complex than ordinary video processing: it places high demands on the user's shooting technique and on the video editing skills needed to post-process the footage into a special-effect video. This is particularly true for panoramic video, whose 360-degree view contains more content and makes processing more complex and tedious. There is therefore an urgent need for a method of generating dynamic special-effect video that uses a simple shooting mode and does not require the user to perform complex post-production editing.

Disclosure of Invention

The embodiment of the application provides a video generation method, a video generation device, computer equipment and a storage medium, aiming to solve the following technical problem of the prior art: existing methods for generating special-effect video containing moving objects are complex and tedious, and require the user to have considerable video processing expertise.

To solve the above technical problem, the application adopts the technical scheme of providing a video generation method, a device, computer equipment and a storage medium, wherein the video generation method comprises the following steps:

Acquiring a panoramic video;

Tracking human body targets on all video frames of the panoramic video to obtain target boundary frames of the human body targets, and drawing a rectangular rendering picture according to the target boundary frames so that the target boundary frames are positioned in the middle area of the rectangular rendering picture;

Performing human body detection on the target boundary box to obtain a body region of interest;

Adjusting the angle of view of the rendered picture so that the body region of interest occupies a set proportion of the rendered picture;

Generating a planar video about the human body target according to the adjusted rendering picture;

and carrying out frame rate adjustment on the planar video to generate slow motion video with target frame rate.

In a specific embodiment, the performing human body detection on the target bounding box to obtain a body region of interest includes:

detecting human body key points of the target boundary box to obtain N human body skeleton key points of the human body target, wherein N is a natural number and is more than or equal to 10 and less than or equal to 20;

calculating displacement amounts of N human skeleton key points of the human body target in a plurality of video frames;

determining a target human skeleton key point set according to the displacement amounts of N human skeleton key points in a plurality of video frames;

And determining the human body part corresponding to the target human body skeleton key point set as the interested body region according to the mapping relation between the target human body skeleton key point set and the human body.

In a specific embodiment, the detecting the human body key points of the target bounding box to obtain N human body skeleton key points of the human body target includes:

inputting the target bounding box into a human skeleton key point detection model, and detecting N human skeleton key points in the target bounding box of each video frame through the human skeleton key point detection model to obtain a coordinate sequence of the N human skeleton key points.

In a specific embodiment, the calculating the displacement amounts of the N human skeleton key points of the human target in the plurality of video frames includes:

setting the displacement of a human skeleton key point from frame i-1 to the next frame i as $MF_{i,k} = \mathrm{norm}(P_{i-1,k}, P_{i,k})$, wherein $P_{i-1,k}$ is the coordinate sequence of the human skeleton key point in frame i-1, norm denotes a norm calculation, k is a natural number denoting the k-th human skeleton key point, and 0 ≤ k ≤ N;

The displacement of the N human skeleton key points of the human body target over the M consecutive video frames starting from frame i-M is taken as the displacement of the N human skeleton key points in the plurality of video frames, namely $MM_{i,k} = \sum_{i-M \le j < i} MF_{j,k}$, 0 ≤ k ≤ N.
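The two displacement quantities can be illustrated with a minimal sketch. It assumes that each frame is given as a list of (x, y) key point coordinates and that the norm is the L2 (Euclidean) distance; the function names `frame_displacement` and `window_displacement` are hypothetical and do not come from the application.

```python
import math

def frame_displacement(prev_pts, curr_pts):
    """MF for one frame step: per-key-point L2 distance between two consecutive frames."""
    return [math.dist(p, q) for p, q in zip(prev_pts, curr_pts)]

def window_displacement(frames, i, m):
    """MM at frame i: sum of MF_j for i-m <= j < i, computed per key point.

    frames[j] is the list of (x, y) key points of frame j (assumed layout).
    """
    if i - m - 1 < 0 or i > len(frames):
        raise ValueError("window needs frames i-m-1 .. i-1 to exist")
    totals = [0.0] * len(frames[0])
    for j in range(i - m, i):
        for k, d in enumerate(frame_displacement(frames[j - 1], frames[j])):
            totals[k] += d
    return totals

# Two key points tracked over four frames; only the first one moves much.
frames = [
    [(0, 0), (0, 0)],
    [(3, 4), (0, 0)],
    [(3, 4), (1, 0)],
    [(6, 8), (1, 0)],
]
mm = window_displacement(frames, i=4, m=3)
```

In this toy input the first key point accumulates a much larger displacement than the second, which is the property the later selection step relies on.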

In a specific embodiment, the determining the target set of human skeleton key points according to the displacement amounts of the N human skeleton key points in the plurality of video frames includes:

According to the displacement of each human skeleton key point in a plurality of video frames, selecting L human skeleton key points with the largest displacement as target human skeleton key points, and determining the target human skeleton key point set according to the target human skeleton key points.

In a specific embodiment, the determining, according to the mapping relationship between the target set of human skeleton key points and the human body, the human body part corresponding to the target set of human skeleton key points as the body region of interest includes:

and mapping the human body parts corresponding to the target human body skeleton key point set based on the human body skeleton key point detection model, and determining the corresponding human body parts as the interested body area.

In a specific embodiment, N is 18, L is 2, and the body area of interest occupies 5% -10% of the rendered screen.

In one embodiment, the method further comprises a video generating device, the device comprising:

The acquisition module is used for acquiring panoramic video;

The tracking module is used for tracking human body targets of all video frames of the panoramic video to obtain target boundary frames of the human body targets, and drawing a rectangular rendering picture according to the target boundary frames so that the target boundary frames are positioned in the middle area of the rectangular rendering picture;

The human body detection module is used for detecting the human body of the target boundary box to obtain a body region of interest;

The visual angle adjusting module is used for adjusting the visual angle of the rendering picture so that the interested body area occupies a set proportion of the rendering picture;

The plane video generation module is used for generating a plane video about the human body target according to the adjusted rendering picture;

And the frame rate adjusting module is used for adjusting the frame rate of the planar video and generating slow motion video with a target frame rate.

In a specific embodiment, the video generating method further comprises a computer device, and the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the video generating method according to any of the specific embodiments.

In a specific embodiment, a computer readable storage medium is further included, on which a computer program is stored, which when executed by a processor, implements the steps of the video generation method according to any of the above specific embodiments.

The beneficial effects of the embodiments of the application are as follows. Unlike the prior art, the embodiments of the application provide a video generation method, a device, computer equipment and a storage medium, wherein the video generation method comprises the following steps:

Acquiring a panoramic video;

From the above technical solutions, the embodiment of the present application has the following advantages:

In the method, a target bounding box of a human body target is obtained through a target tracking algorithm and input into a human skeleton key point model to obtain N human skeleton key points of the human body target. A body region of interest is then obtained by analyzing the N human skeleton key points, and the key part of the human body target to be highlighted is determined from the body region of interest, so that a slow-motion planar video with a dynamic special effect is obtained.

The dynamic special-effect video generation method requires neither complex shooting skills from the photographer nor complex subsequent video editing. A dynamic special-effect slow-motion planar video can be generated quickly and efficiently simply by shooting a video, which adds a brand-new way of using panoramic video and makes it more enjoyable for the user.

Drawings

To illustrate the embodiments of the present disclosure or the prior art more clearly, the drawings required in the detailed description or the prior art are briefly described below. The drawings in the following description show some embodiments of the present disclosure; a person of ordinary skill in the art may obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of an implementation of a video generating method according to a first embodiment of the present application;

Fig. 2 is a schematic diagram of key points of a human skeleton in a video generating method according to an embodiment of the present application;

Fig. 3 is a schematic block diagram of a video generating apparatus according to a second embodiment of the present application;

Fig. 4 is a schematic diagram of an internal structure of a computer device according to a third embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and embodiments. It is to be noted that the following embodiments are only for illustrating the present application, but do not limit the scope of the present application. Likewise, the following embodiments are only some, but not all, of the embodiments of the present application, and all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.

The terms "comprising" and "having" and any variations thereof herein are intended to cover a non-exclusive inclusion. A process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to only those steps or modules but may, in the alternative, include steps or modules not listed or inherent to such process, method, article, or apparatus.

The above terms are merely for convenience of description and should not be construed as limiting the present technical solution.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.

The following describes in detail the implementation of the present application in connection with specific embodiments:

Embodiment one:

Fig. 1 shows a flow of implementation of a video generating method according to the first embodiment of the present application, and for convenience of explanation, only the portions relevant to the embodiment of the present application are shown, which is described in detail below.

Acquiring a panoramic video;

It should be noted that the embodiment of the present application is applicable to video image processing apparatuses such as video display and video acquisition devices. The panoramic video is shot by a panoramic camera and stitched together.

Human body target tracking on each video frame of the panoramic video uses a target detection algorithm and a target tracking algorithm.

The target detection algorithm may be a commonly used detection method. It may be one or more detection methods based on hand-crafted features, including template matching, key point matching and key feature methods. It may also be one or more detection methods based on convolutional neural networks, which may adopt one or more of the following models: YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), R-CNN (Region-based Convolutional Neural Networks) or Mask R-CNN (Mask Region-based Convolutional Neural Networks). Illustratively, the YOLO model may be one or more of the versions YOLOv-YOLOv5, YOLOR, YOLOX, etc.

When the human body target is detected with the target detection algorithm, a target bounding box of the human body target is obtained. The target bounding box is a rectangular box containing all the features of the human body target. A rectangular rendering picture is drawn according to this rectangular box; the rectangular rendering picture is a video picture of the size of the display screen, and the rectangular box is located in its middle area so as to highlight the position of the human body target in the rectangular rendering picture.
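One way to keep the target bounding box in the middle of the rendering picture is to point the virtual camera at the centre of the box. The following is a rough sketch only. It assumes the panoramic frame is stored in equirectangular projection, which the text does not state, and the function name and the (x, y, width, height) box layout are hypothetical.

```python
def view_direction_for_bbox(bbox, pano_width, pano_height):
    """Return (yaw, pitch) in degrees that centres the view on a bounding box.

    Assumes an equirectangular frame: x spans 360 degrees of yaw and
    y spans 180 degrees of pitch, with (0, 0) at the top-left corner.
    bbox is (x, y, width, height) in pixels.
    """
    x, y, w, h = bbox
    cx = (x + w / 2.0) % pano_width   # wrap boxes that cross the 360-degree seam
    cy = y + h / 2.0
    yaw = (cx / pano_width - 0.5) * 360.0      # 0 at the horizontal centre of the frame
    pitch = (0.5 - cy / pano_height) * 180.0   # positive above the horizon
    return yaw, pitch

# A box centred a quarter turn to the right and halfway up towards the pole.
yaw, pitch = view_direction_for_bbox((2780, 380, 200, 200), 3840, 1920)
```

Rendering a rectilinear view in that direction places the box centre at the centre of the picture; the rendering itself is outside the scope of this sketch.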

During human body target tracking, the video frame in which the human body target is first detected is taken as the first video picture, and target tracking is performed with a preset target tracking algorithm to obtain the target rectangular box of the first video picture. The tracker may be a discriminative correlation filter (DCF) or another tracker of the correlation filter class, a Siamese Region Proposal Network (SiamRPN) or another tracker based on CNN technology, or any other tracker.

The body region of interest is the body region of the human body target that needs to be emphasized in the rectangular rendering picture, for example a body part with a larger movement amplitude such as an arm or a thigh, so as to form a dynamic-moment special effect.

One embodiment of the present invention is that the detecting the human body of the target bounding box to obtain the body region of interest, including:

It should be noted that human body key point detection (Human Keypoints Detection), also called human pose estimation, is a fairly basic task in computer vision and a prerequisite for human action recognition, behavior analysis, human-computer interaction and the like. It can generally be divided into single-person/multi-person key point detection and 2D/3D key point detection. Some algorithms also track the key points after detecting them, which is called human pose tracking.

Human key point detection algorithms include, but are not limited to, OpenPose and AlphaPose.

The target human skeleton key point set consists of the human skeleton key points with the largest displacement amplitude among the N human skeleton key points. Computing this set makes it possible to determine which part of the human body target moves the most, and hence to determine the body region of interest.

In one specific embodiment, the detecting the human body key points of the target bounding box to obtain N human body skeleton key points of the human body target includes:

The human skeleton key point detection model is based on an image data set containing various human pose features, including but not limited to MPII (Max Planck Institut Informatik) and MS COCO. MPII is a data set for human pose estimation from the Max Planck Institute for Informatics. It has about 25,000 images containing more than 40,000 human bodies with annotated key points. Its data are mainly multi-person, and it provides validation and test sets for single-frame single-person pose, single-frame multi-person pose and video multi-person pose; most methods mainly use the single-frame multi-person pose test set. At most 16 possible whole-body key points are annotated, and the test set also records body part occlusion and 3D torso and head orientation annotations. MS COCO is a large and rich data set for object detection, segmentation, captioning and human key points. Its human key point part is a mainstream data set for multi-person pose estimation, comprising more than 200,000 images and 250,000 person instances annotated with key points. At most 17 possible whole-body key points are annotated, as shown in the human skeleton key point model schematic of Fig. 2: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. A single image contains 2 persons on average and 13 persons at most.

In one specific embodiment, the calculating the displacement amounts of the N human skeleton key points of the human target in the plurality of video frames includes:

It should be noted that a norm is a function embodying the concept of "length" and can be used to measure the length or size of each vector in a vector space (or of a matrix). In the embodiment of the application, the norm calculation uses the L1 norm or the L2 norm: with the L1 norm, the result is the sum of the displacement amounts of each human skeleton key point of the human body target over the plurality of video frames; with the L2 norm, it is the Euclidean distance of each human skeleton key point over the plurality of video frames.
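As a small worked example of the two norms, the following sketch compares them for one key point between two frames. It assumes the standard definitions (L1 as the sum of absolute coordinate differences, L2 as the Euclidean distance); the coordinates are made up for illustration.

```python
def l1_norm(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_norm(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Illustrative pixel coordinates of one key point in frames i-1 and i.
prev_pt = (100.0, 200.0)
curr_pt = (103.0, 196.0)

l1 = l1_norm(prev_pt, curr_pt)  # |3| + |-4|
l2 = l2_norm(prev_pt, curr_pt)  # sqrt(3^2 + 4^2)
```

Either norm ranks a fast-moving key point above a slow one; the L1 value is never smaller than the L2 value for the same pair of points.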

In one embodiment, the determining the target set of human skeleton key points according to the displacement amounts of the N human skeleton key points in the plurality of video frames includes:

A human skeleton key point with a larger displacement indicates a larger movement amplitude of the corresponding human body part. The L human skeleton key points with the largest displacement are therefore taken as the target human skeleton key points, and the human body parts corresponding to these L key points are confirmed as the body region of interest, where L is an integer.
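Selecting the L key points with the largest displacement amounts to a top-L ranking. A minimal sketch, assuming the accumulated displacements are held in a list indexed by key point number (the function name `select_target_keypoints` is hypothetical):

```python
def select_target_keypoints(displacements, top_l=2):
    """Return the indices of the top_l key points with the largest displacement.

    Indices are returned in order of decreasing displacement; ties keep
    the lower key point index first because sorted() is stable.
    """
    ranked = sorted(range(len(displacements)),
                    key=lambda k: displacements[k],
                    reverse=True)
    return ranked[:top_l]

# Illustrative accumulated displacements for five key points.
target_set = select_target_keypoints([0.5, 3.0, 1.2, 9.8, 0.1], top_l=2)
```

With L = 2, as in the specific embodiment, only two indices need to be carried forward to the body part mapping step.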

In one embodiment, the determining, according to the mapping relationship between the target human skeleton key point set and the human body, the human body part corresponding to the target human skeleton key point set as the body region of interest includes:

It should be noted that each human skeleton key point, or several human skeleton key points together, can be mapped to a corresponding human body part. As shown in Fig. 2, human skeleton key point 13 maps to the head, 0 maps to the neck, 1 maps to the right shoulder, and 4 maps to the left shoulder; the set 1, 2, 3 maps to the right hand part, the set 4, 5, 6 maps to the left hand part, the set 7, 8, 9 maps to the right leg part, and the set 10, 11, 12 maps to the left leg part. Depending on which human skeleton key points of the model in Fig. 2 are included in the target human skeleton key point set, the human body part corresponding to that set is determined as the body region of interest. For example, if the target human skeleton key point set includes human skeleton key points 7 and 8, the right thigh is determined as the body region of interest.
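The mapping from key point numbers to body parts can be held in a lookup table. The sketch below uses the numbering described for Fig. 2 and the right thigh example from the text. The fallback rule (choose the group that shares the most key points with the target set) is an assumption made for illustration, and the names `BODY_PARTS` and `body_region_of_interest` are hypothetical.

```python
# Key point groups as described for Fig. 2; the {7, 8} entry comes from the
# right thigh example in the text.
BODY_PARTS = {
    frozenset({13}): "head",
    frozenset({0}): "neck",
    frozenset({1}): "right shoulder",
    frozenset({4}): "left shoulder",
    frozenset({7, 8}): "right thigh",
    frozenset({1, 2, 3}): "right hand part",
    frozenset({4, 5, 6}): "left hand part",
    frozenset({7, 8, 9}): "right leg part",
    frozenset({10, 11, 12}): "left leg part",
}

def body_region_of_interest(target_keypoints):
    """Map a target key point set to a body part name, or None if unmapped."""
    target = frozenset(target_keypoints)
    if target in BODY_PARTS:
        return BODY_PARTS[target]
    # Assumed fallback: largest overlap wins, smaller groups break ties.
    best = max(BODY_PARTS, key=lambda group: (len(group & target), -len(group)))
    return BODY_PARTS[best] if best & target else None

region = body_region_of_interest([7, 8])
```

A production table would need an entry, or a tie-breaking policy, for every key point pair the selection step can produce, since the two fastest key points may belong to different limbs.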

In one specific embodiment, N is 18, L is 2, and the body region of interest occupies 5%-10% of the rendered picture.

In human skeleton key point detection, 18 human skeleton key points are usually detected, which covers the key point parts of the human body to the greatest extent. Based on extensive experimental experience, selecting 2 human skeleton key points for the target human skeleton key point set is sufficient to confirm the body region of interest, so that the body region of interest is detected efficiently and excessive computation is avoided.

Once the body region of interest is determined, it is highlighted by adjusting the proportion of the rendered picture that it occupies; the set proportion is 5%-10%, preferably 5%.
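How the field angle relates to the occupied proportion can be sketched under simplifying assumptions that are not stated in the application: the rendered picture is a rectilinear (pinhole) view, the body region of interest is centred in it, and the region is described by its angular width and height. The function name and parameters are hypothetical.

```python
import math

def fov_for_roi_ratio(roi_w_deg, roi_h_deg, aspect, ratio):
    """Horizontal field angle (degrees) at which a centred region covers
    `ratio` of the picture area.

    On an image plane at unit distance the picture spans 2*T by 2*T/aspect,
    with T = tan(fov_h / 2), and the region spans 2*tan(w/2) by 2*tan(h/2).
    Setting region area / picture area = ratio and solving for T gives
    T = sqrt(aspect * tan(w/2) * tan(h/2) / ratio).
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    tw = math.tan(math.radians(roi_w_deg) / 2)
    th = math.tan(math.radians(roi_h_deg) / 2)
    half_fov = math.atan(math.sqrt(aspect * tw * th / ratio))
    return math.degrees(2 * half_fov)

# Field angles for a 10 x 10 degree region in a 16:9 picture at 5% and 10%.
fov_5 = fov_for_roi_ratio(10.0, 10.0, 16 / 9, 0.05)
fov_10 = fov_for_roi_ratio(10.0, 10.0, 16 / 9, 0.10)
```

A smaller set proportion yields a wider field angle, so the preferred 5% setting shows more of the surroundings than 10% does.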

Another possible implementation highlights the body region of interest by adjusting its viewing angle within the whole rendered picture, for example by keeping the body region of interest in the middle area of the rendered picture as far as possible.

In the planar video of the human body target, the human body target is located in the middle area of the display picture of the planar video.

Slow-motion processing is applied to the human body target in its motion state, thereby obtaining a planar video with a dynamic-moment special effect that features the body region of interest of the human body target.
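The application does not specify how the frame rate adjustment is carried out. One simple possibility, shown here purely as an assumed sketch, is to retime the planar video by choosing, for every output frame at the target frame rate, which source frame to display; frames are repeated when the source has too few. The function name and the `slowdown` parameter are hypothetical.

```python
from fractions import Fraction

def slow_motion_frame_indices(num_frames, source_fps, target_fps, slowdown):
    """Source frame index to show for each output frame of the slow-motion video.

    slowdown = 4 means the action plays four times slower than real time.
    Exact rational arithmetic avoids off-by-one errors from float rounding.
    """
    # Number of source frames advanced per output frame.
    step = Fraction(source_fps) / (Fraction(target_fps) * Fraction(slowdown))
    out_count = int(num_frames / step)
    return [min(int(n * step), num_frames - 1) for n in range(out_count)]

# 120 fps capture shown at 30 fps and 4x slower: every source frame is used once.
high_speed = slow_motion_frame_indices(4, 120, 30, 4)

# 30 fps capture shown at 30 fps and 2x slower: every source frame is shown twice.
repeated = slow_motion_frame_indices(3, 30, 30, 2)
```

Frame repetition is the crudest option; frame interpolation could replace it where smoother motion is needed.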

Embodiment two:

Fig. 3 is a schematic block diagram of a video generating apparatus according to a second embodiment of the present application, and for convenience of explanation, only a portion related to the second embodiment of the present application is shown.

An embodiment of the present application provides a video generating apparatus, including:

The acquisition module is used for acquiring panoramic video;

The field angle adjustment module is used for adjusting the field angle of the rendered picture so that the body region of interest occupies a set proportion of the rendered picture;

The video generating device of the embodiment of the application is used for realizing the video generating method provided by any one of the specific implementation modes of the embodiment of the application:

Acquiring a panoramic video;

It should be noted that each module in the video generating apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

It should be noted that, for the technical details not described in detail in the present embodiment, reference may be made to the video generating method provided in the first embodiment, and details thereof are not repeated here.

Embodiment III:

In an embodiment of the present application, a computer device is provided, and Fig. 4 is a schematic diagram of an internal structure of the computer device.

The computer device includes a memory and a processor, where the memory stores a computer program, and the processor implements the video generating method according to any one of the embodiments of the present application when executing the computer program, and the method specifically includes:

Acquiring a panoramic video;

It should be noted that the computer device may be a terminal, and the internal structure diagram thereof may be as shown in fig. 4. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video generation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

Embodiment four:

In an embodiment of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a video generating method according to any one of the embodiments of the present application, including:

Acquiring a panoramic video;

Unlike the prior art, the embodiment of the application provides a video generation method, a device, a computer device and a storage medium, wherein the video generation method comprises the following steps:

Acquiring a panoramic video;

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.

The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in part, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a terminal device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

While the application has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit and scope of the embodiments of the application.

Claims

1. A method of video generation, the method comprising:

Acquiring a panoramic video;

Performing frame rate adjustment on the planar video to generate a slow motion video with a target frame rate;

The detecting the human body of the target boundary box to obtain a body region of interest includes:

2. The method of claim 1, wherein the detecting the human body key points of the target bounding box to obtain N human body skeleton key points of the human body target comprises:

3. The method according to claim 2, wherein said calculating the displacement amounts of N of said human skeletal key points of said human target in a plurality of said video frames comprises:

Setting the displacement of the human skeleton key point from frame i-1 to the next frame i as $MF_{i,k} = \mathrm{norm}(P_{i-1,k}, P_{i,k})$, wherein $P_{i-1,k}$ is the coordinate sequence of the human skeleton key point in frame i-1, norm is a norm calculation, k is a natural number denoting the k-th human skeleton key point, and 0 ≤ k ≤ N;

the displacement of the N human skeleton key points of the human body target over the M consecutive video frames starting from frame i-M is taken as the displacement of the N human skeleton key points in the plurality of video frames, namely $MM_{i,k} = \sum_{i-M \le j < i} MF_{j,k}$, 0 ≤ k ≤ N.

4. A video generating method according to claim 3, wherein said determining a target set of human skeleton key points based on the amounts of displacement of N human skeleton key points in a plurality of said video frames comprises:

5. The method of generating video according to claim 4, wherein determining, according to the mapping relationship between the target set of human skeleton key points and the human body, the human body part corresponding to the target set of human skeleton key points as the body region of interest includes:

6. A video generating method according to claim 4 or 5, wherein N is 18, L is 2, and the body region of interest occupies 5% -10% of the rendered screen.

7. A video generating apparatus, the apparatus comprising:

The acquisition module is used for acquiring panoramic video;

According to the mapping relation between the target human skeleton key point set and the human body, determining the human body part corresponding to the target human skeleton key point set as the interested body area;

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the video generation method of any of claims 1 to 6 when the computer program is executed.

9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the video generation method of any one of claims 1 to 8.