KR20260004340A

KR20260004340A - Frame enhancement using diffusion models

Info

Publication number: KR20260004340A
Application number: KR1020257035340A
Authority: KR
Inventors: 옌스 페터젠; 미하우 야쿱 스티풀코브스키; 고우스 누르 파티마 카눔 모하메드; 아우커 요리스 비허르스; 기욤 콘라드 소띠에르
Original assignee: 퀄컴 인코포레이티드
Priority date: 2023-05-09
Filing date: 2024-05-01
Publication date: 2026-01-08
Also published as: CN121039695A; US20240378698A1; EP4710296A1

Abstract

Systems and techniques for processing image data are provided. According to some aspects, a computing device can determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution. The computing device can warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution. The computing device can process a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

Description

Frame enhancement using diffusion models

The present disclosure relates generally to video processing. For example, aspects of the present disclosure relate to systems and techniques for performing frame enhancement (e.g., video super-resolution, sharpening, etc.) using a diffusion model.

Many devices and systems allow a scene to be captured by generating images (or frames) of the scene and/or video data (including multiple frames). For example, a camera or a device including a camera can capture a sequence of frames of a scene (e.g., a video of the scene). In some cases, the sequence of frames can be processed to perform one or more functions, output for display, and/or output for processing and/or consumption by other devices, among other uses.

Artificial neural networks (ANNs) attempt to replicate, using computer technology, the logical reasoning performed by the biological neural networks that make up animal brains. Deep neural networks, such as convolutional neural networks (CNNs), are widely used in numerous applications, including object detection, object classification, object tracking, and big data analysis, among others. For example, CNNs can extract high-level features, such as facial shapes, from an input image and use these high-level features to output, for example, the probability that the input image contains a particular object.

The following presents a simplified summary of one or more aspects disclosed herein. The following summary should therefore not be considered a comprehensive overview of all contemplated aspects, nor should it be considered to identify key or critical elements of all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the sole purpose of the following summary is to present certain concepts relating to one or more aspects of the mechanisms disclosed herein in a simplified form that precedes the detailed description presented below.

Systems, methods, apparatuses, and computer-readable media for performing frame enhancement (e.g., video super-resolution, sharpening, etc.) using a diffusion model are disclosed. According to at least one illustrative example, a method of processing image data of a current frame is provided. The method includes: determining an optical flow between a current frame having a first resolution and a first previous frame having the first resolution; warping a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and processing a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

In another illustrative example, an apparatus for processing image data of a current frame is provided. The apparatus includes at least one memory configured to store the image data, and at least one processor coupled to the at least one memory, wherein the at least one processor is configured to: determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution; warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and process a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

In another illustrative example, a non-transitory computer-readable storage medium is provided comprising instructions stored thereon, which, when executed by at least one processor, cause the at least one processor to: determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution; warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and process a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

In another illustrative example, an apparatus for processing image data of a current frame is provided. The apparatus includes: means for determining an optical flow between a current frame having a first resolution and a first previous frame having the first resolution; means for warping a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and means for processing a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.
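The warping step recited in the examples above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the flow is a per-pixel (dx, dy) displacement field estimated at the first (lower) resolution, the scale factor between the two resolutions is an integer, and sampling is nearest-neighbor with border clamping. The function names `upscale_flow` and `backward_warp` are hypothetical, and the disclosure does not prescribe any of these choices.

```python
def upscale_flow(flow, scale):
    """Upsample a low-resolution flow field by an integer factor.

    `flow` is a list of rows of (dx, dy) tuples. Each vector is repeated over
    a scale x scale block and multiplied by `scale`, because displacements
    are measured in pixels of the target resolution.
    """
    hi = []
    for row in flow:
        hi_row = []
        for dx, dy in row:
            hi_row.extend([(dx * scale, dy * scale)] * scale)
        for _ in range(scale):
            hi.append(list(hi_row))
    return hi


def backward_warp(prev_hi, flow_hi):
    """Backward-warp a frame: out[y][x] = prev_hi[y + dy][x + dx].

    Source coordinates are rounded to the nearest pixel and clamped to the
    frame border (an assumption made for simplicity).
    """
    h, w = len(prev_hi), len(prev_hi[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow_hi[y][x]
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(prev_hi[sy][sx])
        out.append(row)
    return out


# Low-resolution flow (2x2): content moved one pixel to the right, so each
# current pixel samples the previous frame one pixel to its left.
flow_lo = [[(-1, 0), (-1, 0)], [(-1, 0), (-1, 0)]]
# High-resolution previous frame (4x4), i.e. a scale factor of 2.
prev_hi = [[10 * r + c for c in range(4)] for r in range(4)]
warped = backward_warp(prev_hi, upscale_flow(flow_lo, 2))
```

The flow is computed once on the small frames and reused at the higher resolution, which is why its vectors are scaled along with its grid. A production system would more likely use bilinear sampling than the nearest-neighbor lookup shown here.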

Aspects generally include a method, apparatus, system, computer program product, non-transitory computer-readable medium, user device, user equipment, wireless communication device, and/or processing system as substantially described with reference to, and as illustrated by, the drawings and the specification.

The foregoing has outlined rather broadly the features and technical advantages of examples according to the present disclosure so that the detailed description that follows may be better understood. Additional features and advantages will be described below. The concepts and specific examples disclosed may readily be utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Such equivalent constructions do not depart from the scope of the appended claims. The characteristics of the concepts disclosed herein, both their organization and method of operation, together with associated advantages, will be better understood from the following description when considered in conjunction with the accompanying drawings. Each of the drawings is provided for purposes of illustration and description and is not intended as a definition of the limits of the claims. The foregoing, together with other features and aspects, will become more apparent upon reference to the following specification, claims, and accompanying drawings.

This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The claimed subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all of the drawings, and each claim.

The accompanying drawings are presented to aid in the description of various aspects of the present disclosure and are provided solely for illustration of the aspects and not limitation thereof. So that the above-recited features of the present disclosure can be understood in detail, a more particular description, briefly summarized above, may be had by reference to aspects, some of which are illustrated in the accompanying drawings. It should be noted, however, that the appended drawings illustrate only certain typical aspects of the present disclosure and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects. The same reference numerals in different drawings may identify the same or similar elements.
FIG. 1 illustrates an example implementation of a system-on-a-chip (SOC), according to some aspects.
FIG. 2A illustrates an example of a fully connected neural network, according to some aspects.
FIG. 2B illustrates an example of a locally connected neural network, according to some aspects.
FIG. 3 is a diagram illustrating the forward diffusion process and the reverse diffusion process of a diffusion model, according to some aspects.
FIG. 4 is a diagram illustrating an example of a system for performing frame enhancement using a diffusion model, according to some aspects.
FIG. 5 is a diagram illustrating another example of a system for performing frame enhancement using a diffusion model, according to some aspects.
FIG. 6 is a diagram illustrating yet another example of a system for performing frame enhancement using a diffusion model, according to some aspects.
FIG. 7 is a diagram illustrating yet another example of a system for performing frame enhancement using a diffusion model, according to some aspects.
FIG. 8 is a flowchart illustrating an example of a process for processing frame data, according to some aspects.
FIG. 9 is a block diagram illustrating an example of a computing system for implementing certain aspects described herein.

Certain aspects of the present disclosure are provided below for illustrative purposes. Alternative aspects may be devised without departing from the scope of the present disclosure. Additionally, well-known elements of the present disclosure will not be described in detail or will be omitted so as not to obscure the relevant details of the present disclosure. Some of the aspects described herein can be applied independently, and some can be applied in combination, as would be apparent to those skilled in the art. In the following description, for purposes of explanation, specific details are set forth to provide a thorough understanding of the aspects of the present application. However, it will be apparent that various aspects may be practiced without these specific details. The drawings and description are not intended to be limiting.

The following description provides example aspects and is not intended to limit the scope, applicability, or configuration of the present disclosure. Rather, the following description of the example aspects will provide those skilled in the art with an enabling description for implementing an example aspect. It should be understood that various changes may be made in the function and arrangement of elements without departing from the scope of the present application as set forth in the appended claims.

The demand for and consumption of image and video data has increased significantly in both consumer and professional settings. As described above, devices and systems are commonly equipped with capabilities for capturing and processing image and video data. For example, a camera or a computing device including a camera (e.g., a smartphone or mobile phone including one or more cameras) can capture video and/or images of a scene, a person, an object, etc. The images and/or videos can be captured and processed, and then output (and/or stored) for consumption. The images and/or videos can be further processed by one or more frame enhancement techniques for specific effects or improvements (e.g., improvements in quality, bitrate, etc.), such as super-resolution (upscaling or resolution enhancement), compression (also referred to as encoding), frame rate upconversion, sharpening, color space conversion, image enhancement, high dynamic range (HDR), denoising, and low-light compensation, among others.
Images and/or videos can also be further processed for specific applications such as computer vision, extended reality (e.g., augmented reality, virtual reality, etc.), image recognition (e.g., facial recognition, object recognition, scene recognition, etc.), and autonomous driving, among others. In some examples, the images and/or videos can be processed using one or more image or video artificial intelligence (AI) models, which can include, but are not limited to, AI quality enhancement and AI augmentation models. As used herein, the terms "image processing" and "video processing" may be used interchangeably in describing, for example, an image processing neural network and a video processing neural network (e.g., based on video data comprising a series of frames (e.g., images) that can be processed sequentially).

Image and video processing operations can be computationally intensive. In some cases, image and video processing operations can become increasingly computationally intensive as the resolution of the input image or frame of video data increases (e.g., as the number of pixels to be processed per input image or frame of video data increases). For example, a frame of video data having a 4K resolution can contain approximately four times as many individual pixels as a frame of video data having a Full HD (e.g., 1080p) resolution.
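The "approximately four times" figure can be checked with a short calculation. The sketch below assumes that "4K" means the common 3840x2160 (UHD) format and that Full HD means 1920x1080. The source names only "4K" and "1080p", so the exact dimensions are an assumption.

```python
def pixel_count(width, height):
    """Number of individual pixels in a frame of the given dimensions."""
    return width * height


# Assumed frame dimensions (the text only names "Full HD (1080p)" and "4K").
full_hd = pixel_count(1920, 1080)  # 2,073,600 pixels
uhd_4k = pixel_count(3840, 2160)   # 8,294,400 pixels

ratio = uhd_4k / full_hd           # 4.0
```

Doubling both the width and the height quadruples the pixel count, so any per-pixel operation costs roughly four times as much on a 4K frame as on a Full HD frame.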

An example of an image or video processing technique is video super-resolution (VSR), which can be used for spatial super-resolution (to increase resolution) and/or temporal super-resolution (e.g., to increase frame rate). For example, VSR can be used to increase the spatial resolution of a video, such as from 720p to 1080p or from 1080p to 4K. VSR can be used to improve the user experience at low cost while reducing the burden on computationally expensive or bandwidth-intensive upstream tasks, such as video synthesis (e.g., in gaming) or video coding. An example of a VSR technique for gaming is to generate very high-quality but expensive frames at a lower resolution (e.g., by performing ray tracing), and then "cheaply" interpolate the frames to a higher resolution to keep the computational requirements at a reasonable level. For example, ray tracing can be used to generate high-quality low-resolution frames, and a VSR technique can be used to upsample the low-resolution frames to a higher resolution. In some cases, such a VSR technique can also perform temporal super-resolution (e.g., frame interpolation).
In one illustrative example using a video coding setting, bitrate can be saved by encoding a 1080p video at 720p and performing video super-resolution at the receiver end, trading bitrate for computation (e.g., resulting in a lower bitrate at a higher computational cost). For example, media hosting platforms can use this technique to save storage space at the price of incurring computational cost on the receiver device.

VSR can be a highly complex, high-dimensional, one-to-many mapping. Traditional VSR techniques optimize for distortion metrics, such as peak signal-to-noise ratio (PSNR), which do not align well with human-perceived quality. For example, when optimizing for distortion, the reconstructed frame can be very similar to the original frame in terms of distortion but can have low perceptual quality. Some machine learning-based VSR techniques (e.g., GAN-based VSR models) optimize for perceptual quality, resulting in high perceptual quality (and thus results that look pleasing from a human perspective). However, these machine learning-based VSR techniques focus on still-image super-resolution rather than on video or sequences of frames, and introduce temporal inconsistencies (e.g., flickering) when applied to video. Some techniques also assume the availability of future frames and are therefore unsuitable for low-latency applications such as teleconferencing and video streaming. Some techniques can achieve both high quality and temporal consistency, but are limited to specific applications (e.g., gaming) that require additional inputs (e.g., optical flow, depth maps, etc.) or to high-latency applications such as offline upscaling of historical videos.
A VSR technique that balances perceptual quality, temporal consistency, and latency is needed.
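To make the distortion metric mentioned above concrete, the following is a minimal sketch of PSNR using its standard definition, 10·log10(MAX² / MSE). It assumes frames are given as flat lists of pixel values with a peak value of 255 (8-bit samples). The function name and signature are illustrative and are not taken from the disclosure.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat lists of pixel values; `max_value` is the largest value
    a pixel can take (255 for 8-bit samples, an assumption here).
    """
    if len(reference) != len(reconstructed):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)
```

PSNR depends only on the mean squared pixel error. A slightly blurred frame and a sharp frame with small misplaced details can therefore score similarly even though viewers rate them very differently.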

Systems, apparatuses, processes (also referred to as methods), and computer-readable media (collectively referred to as "systems and techniques") for performing frame enhancement (e.g., video super-resolution, sharpening, etc.) using a diffusion model (referred to herein as a video enhancement diffusion model) are described herein. The video enhancement diffusion model provides a general-purpose, high-quality, low-latency frame enhancement (e.g., super-resolution, sharpening, etc.) algorithm that enables more general applications than existing super-resolution systems. For example, the video enhancement diffusion model can enable low-latency frame enhancement (e.g., super-resolution, sharpening, etc.) with high perceptual quality and high temporal consistency based on at least three aspects: diffusion-based training (which can provide perceptual quality); temporal modeling through pixel-space or feature-space warping, using optical flow estimation with backward warping or using deformable convolutions (which can provide temporal consistency); and recurrent upsampling instead of parallel upsampling (which can provide low latency).
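The recurrent (rather than parallel) structure described above can be sketched as a loop in which each output frame depends only on the current low-resolution frame and on state carried over from the previous step, never on future frames. This is a hedged illustration, not the disclosed implementation: every callable (`estimate_flow`, `warp`, `denoise`, `make_noise`, `initial_upsample`) is a hypothetical stand-in. Two further assumptions are made: the high-resolution previous frame is the previous output, and the first frame is handled by a separate upsampler.

```python
def enhance_video(lr_frames, estimate_flow, warp, denoise, make_noise,
                  initial_upsample):
    """Recurrently enhance a sequence of low-resolution frames.

    All callables are caller-supplied placeholders (assumptions):
      estimate_flow(cur_lr, prev_lr) -> flow at the low resolution
      warp(prev_hr, flow)            -> warped high-resolution previous frame
      denoise(noise, cur_lr, warped) -> high-resolution output frame
      make_noise()                   -> noise frame to start the reverse process
      initial_upsample(cur_lr)       -> output for the very first frame
    """
    outputs = []
    prev_lr = prev_hr = None
    for cur_lr in lr_frames:
        if prev_hr is None:
            # Assumption: no previous frame exists yet, so fall back.
            cur_hr = initial_upsample(cur_lr)
        else:
            flow = estimate_flow(cur_lr, prev_lr)
            warped_prev = warp(prev_hr, flow)
            cur_hr = denoise(make_noise(), cur_lr, warped_prev)
        outputs.append(cur_hr)
        # Carry only the most recent frames forward (recurrent state).
        prev_lr, prev_hr = cur_lr, cur_hr
    return outputs


# Toy stand-ins on scalar "frames", only to exercise the control flow.
result = enhance_video(
    [1, 2, 4],
    estimate_flow=lambda cur, prev: cur - prev,
    warp=lambda hr, flow: hr + 2 * flow,
    denoise=lambda noise, lr, warped: warped + noise,
    make_noise=lambda: 0,
    initial_upsample=lambda lr: 2 * lr,
)
```

Because the loop emits each frame as soon as its low-resolution input arrives, latency is bounded by one step of processing rather than by the length of a frame window.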

Various aspects of the present disclosure will be described with reference to the drawings.

FIG. 1 illustrates an example implementation of a system-on-a-chip (SOC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU configured to perform one or more of the functions described herein. Among other information, parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., a neural network with weights), delays, frequency bin information, and task information may be stored in a memory block associated with a neural processing unit (NPU) 108, in a memory block associated with the CPU 102, in a memory block associated with a graphics processing unit (GPU) 104, in a memory block associated with a digital signal processor (DSP) 106, or in a memory block 118, and/or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from the memory block 118.

The SOC 100 may also include additional processing blocks tailored to specific functions, such as the GPU 104, the DSP 106, a connectivity block 110, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 112 that may, for example, detect and recognize gestures. In one implementation, the NPU is implemented in the CPU 102, the DSP 106, and/or the GPU 104. The SOC 100 may also include a sensor processor 114, image signal processors (ISPs) 116, and/or a navigation module 120, which may include a global positioning system. In some examples, the sensor processor 114 may be associated with or connected to one or more sensors that provide sensor input(s) to the sensor processor 114. For example, the one or more sensors and the sensor processor 114 may be provided in, coupled to, or otherwise associated with the same computing device.

The SOC 100 may be based on an ARM instruction set. In one aspect of the present disclosure, the instructions loaded into the CPU 102 may include code for looking up, in a lookup table (LUT), a stored multiplication result corresponding to the product of an input value and a filter weight. The instructions loaded into the CPU 102 may also include code for disabling a multiplier during the multiplication operation when a lookup table hit for the product is detected. In addition, the instructions loaded into the CPU 102 may include code for storing the computed product of the input value and the filter weight when a lookup table miss for the product is detected. The SOC 100 and/or components thereof may be configured to perform image processing using machine learning techniques according to aspects of the present disclosure discussed herein. For example, the SOC 100 and/or components thereof may be configured to perform semantic image segmentation and/or object detection according to aspects of the present disclosure.
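The lookup-table behavior described above (reuse a stored product on a hit, compute and store it on a miss) can be modeled in software roughly as follows. This is a behavioral sketch only: the class name `LutMultiplier`, the dictionary-backed table, and the hit/miss counters are assumptions for illustration. In hardware the point of a hit is that the multiplier circuit can stay disabled, which a software model can only count.

```python
class LutMultiplier:
    """Software model of LUT-assisted multiplication (illustrative only)."""

    def __init__(self):
        self.table = {}   # (input_value, filter_weight) -> stored product
        self.hits = 0     # lookups served from the table
        self.misses = 0   # lookups that required a real multiplication

    def multiply(self, input_value, filter_weight):
        key = (input_value, filter_weight)
        if key in self.table:
            # Hit: return the stored result; the multiplier is not used.
            self.hits += 1
            return self.table[key]
        # Miss: compute the product and store it for later reuse.
        self.misses += 1
        product = input_value * filter_weight
        self.table[key] = product
        return product


# Accumulate a small filter response; repeated operand pairs hit the table.
mul = LutMultiplier()
inputs = [3, 3, 5, 3]
weights = [2, 2, 2, 2]
acc = sum(mul.multiply(x, w) for x, w in zip(inputs, weights))
```

The scheme pays off when operands are quantized to a small set of values, as is common for low-precision inference, so the same (input, weight) pairs recur often.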

Machine learning (ML) can be considered a subset of artificial intelligence (AI). ML systems can include algorithms and statistical models that computer systems can use to perform various tasks by relying on patterns and inference, without the use of explicit instructions. One example of an ML system is a neural network (also referred to as an artificial neural network), which can include an interconnected group of artificial neurons (e.g., neuron models). Neural networks can be used for a variety of applications and/or devices, such as image and/or video coding, image analysis and/or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, and service robots, among others.

Individual nodes in a neural network may emulate biological neurons by taking input data and performing simple operations on the data. The results of the simple operations performed on the input data are selectively passed on to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how input data is related to output data. For example, the input data of each node may be multiplied by a corresponding weight value, and the products may be summed. The sum of the products may be adjusted by an optional bias, and an activation function may be applied to the result, yielding the node's output signal or "output activation" (sometimes referred to as an activation map or feature map). The weight values may initially be determined by an iterative flow of training data through the network (e.g., weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics).
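The node computation described above (multiply inputs by weights, sum the products, add an optional bias, apply an activation function) can be sketched as follows. The choice of a sigmoid activation and the function names are assumptions made for illustration; the disclosure does not name a specific activation function.

```python
import math


def sigmoid(z):
    """One common activation function, chosen here only as an example."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs, adjusted by a bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


# 1.0 * 0.5 + 2.0 * (-0.25) = 0.0, and sigmoid(0.0) = 0.5
out = node_output([1.0, 2.0], [0.5, -0.25], bias=0.0)
```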

Different types of neural networks exist, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), multilayer perceptron (MLP) neural networks, transformer neural networks, and diffusion-based neural networks, among others. For instance, CNNs are a type of feed-forward artificial neural network. Convolutional neural networks may include collections of artificial neurons that each have a receptive field (e.g., a spatially localized region of an input space) and that collectively tile the input space. RNNs work on the principle of saving the output of a layer and feeding this output back to the input to help in predicting an outcome of the layer. A GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can generate new synthetic outputs that reasonably could have come from the original dataset. A GAN can include two neural networks that operate together: a generative neural network that generates a synthesized output and a discriminative neural network that evaluates the output for authenticity. In MLP neural networks, data may be fed into an input layer, and one or more hidden layers provide levels of abstraction of the data. Predictions may then be made at an output layer based on the abstracted data.

Deep learning (DL) is one example of a machine learning technique and can be considered a subset of ML. Many DL approaches are based on a neural network, such as an RNN or a CNN, and utilize multiple layers. The use of multiple layers in deep neural networks can permit progressively higher-level features to be extracted from a given input of raw data. For example, the output of a first layer of artificial neurons becomes the input to a second layer of artificial neurons, the output of the second layer becomes the input to a third layer of artificial neurons, and so on. Layers that are located between the input and output of the overall deep neural network are often referred to as hidden layers. The hidden layers learn (e.g., are trained) to transform an intermediate input from a preceding layer into a slightly more abstract and composite representation that can be provided to a subsequent layer, until a final or desired representation is obtained as the final output of the deep neural network.

As noted above, a neural network is an example of a machine learning system and can include an input layer, one or more hidden layers, and an output layer. Data is provided from input nodes of the input layer, processing is performed by hidden nodes of the one or more hidden layers, and an output is produced through output nodes of the output layer. Deep learning networks typically include multiple hidden layers. Each layer of the neural network can include feature maps or activation maps that can include artificial neurons (or nodes). A feature map can include a filter, a kernel, or the like. The nodes can include one or more weights used to indicate an importance of the nodes of one or more of the layers. In some cases, a deep learning network can have a series of many hidden layers, with early layers being used to determine simple, low-level characteristics of an input, and later layers building up a hierarchy of more complex and abstract characteristics.

A deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases.

Deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure. For example, the classification of motorized vehicles may benefit from first learning to recognize wheels, windshields, and other features. These features may be combined at higher layers in different ways to recognize cars, trucks, and airplanes.

Neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. As described above, a hierarchical representation may be built up in successive layers of a feed-forward network. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence (e.g., examples of a recurrent neural network architecture are shown in FIGS. 4 to 6 and described in more detail below). A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept can aid in discriminating particular low-level features of an input.

The connections between layers of a neural network may be fully connected or locally connected. FIG. 2A illustrates an example of a fully connected neural network (202). In a fully connected neural network (202), a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer receives input from every neuron in the first layer. FIG. 2B illustrates an example of a locally connected neural network (204). In a locally connected neural network (204), a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, a locally connected layer of the locally connected neural network (204) may be configured so that each neuron in a layer has the same or a similar connectivity pattern, but with connection strengths that may have different values (e.g., 210, 212, 214, and 216). The locally connected connectivity pattern may give rise to spatially distinct receptive fields in a higher layer, because the higher-layer neurons in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.
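The difference between the two connectivity patterns can be made concrete with a small one-dimensional sketch. It assumes a sliding window of adjacent inputs as the "limited number" of connections and gives each output neuron its own weights; the function names and the `window` parameter are hypothetical, and biases and activations are omitted for brevity.

```python
def fully_connected(inputs, weights):
    """weights[j][i] is the connection strength from input i to output neuron j.

    Every output neuron receives every input.
    """
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def locally_connected(inputs, weights, window):
    """Output neuron j only sees inputs[j : j + window].

    All neurons share the same connectivity pattern (a window of adjacent
    inputs), but each has its own connection strengths weights[j].
    """
    return [
        sum(w * x for w, x in zip(weights[j], inputs[j:j + window]))
        for j in range(len(inputs) - window + 1)
    ]


inputs = [1, 2, 3, 4]
fc_out = fully_connected(inputs, [[1, 1, 1, 1], [1, 0, 0, 0]])
lc_out = locally_connected(inputs, [[1, 0], [0, 1], [1, 1]], window=2)
```

Here each fully connected neuron needs four weights, while each locally connected neuron needs only two, which is what produces the spatially restricted receptive fields mentioned above.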

As noted above, one class of machine learning models includes diffusion models (e.g., diffusion-based neural networks), which may also be referred to as diffusion probabilistic models. A diffusion model is a latent variable model. For example, a diffusion model defines a Markov chain of diffusion steps that slowly adds random noise (e.g., Gaussian noise) to data, and then learns to reverse the diffusion process to construct desired data samples from the noise. For example, a diffusion model may be trained using a (fixed) forward diffusion process and a (learned) reverse diffusion process. A diffusion model can be trained so that it can perform a generative process (e.g., a denoising process). One example goal of a diffusion model is to be able to remove any noise that has been added to input data (e.g., a video).

FIG. 3 provides two sets (300) of images illustrating the (fixed) forward diffusion process and the (learned) reverse diffusion process of a diffusion model. As shown in the forward diffusion process of FIG. 3, noise (303) is gradually added to a first set (302) of images at different time steps over a total of T time steps (e.g., making up a Markov chain), producing a sequence of noisy samples X₁ through X_T. In one illustrative example, the noise (303) is Gaussian noise. Each time step may correspond to a respective successive image of the first set (302) of images shown in FIG. 3. The initial image X₀ of FIG. 3 is of a cat. The addition of the noise (303) to each image (corresponding to the noisy samples X₁ through X_T) results in a gradual diffusion of the pixels in each image until the final image (corresponding to sample X_T) essentially matches the noise distribution. For example, as noise is added, each data sample X₁ through X_T gradually loses its distinguishable features as the time step becomes larger, until the final sample X_T is equivalent to the target noise distribution, for example a unit-variance, zero-centered Gaussian.
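To make the forward process more tangible, the following sketch adds Gaussian noise to a toy "image" over T steps and checks that the final sample is close to a unit-variance, zero-centered Gaussian. The specific update rule (a variance-preserving step with a constant per-step noise level `beta`) is an assumption borrowed from common diffusion-model practice; the text above does not specify a noise schedule.

```python
import math
import random


def forward_diffusion(x0, betas, rng):
    """Return the chain [X_0, X_1, ..., X_T] for a flat list of pixel values.

    Assumed update per step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,
    with eps drawn from a standard Gaussian.
    """
    samples = [list(x0)]
    for beta in betas:  # one Markov step per time step
        prev = samples[-1]
        nxt = [
            math.sqrt(1.0 - beta) * p + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for p in prev
        ]
        samples.append(nxt)
    return samples


rng = random.Random(0)
x0 = [1.0] * 2000          # toy image: 2000 pixels, all equal to 1.0
betas = [0.05] * 200       # T = 200 steps with a constant (assumed) schedule
chain = forward_diffusion(x0, betas, rng)

x_T = chain[-1]
mean_T = sum(x_T) / len(x_T)
var_T = sum((v - mean_T) ** 2 for v in x_T) / len(x_T)
```

After enough steps the contribution of the original image decays toward zero, so `mean_T` is near 0 and `var_T` is near 1 regardless of the starting image.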

The second set (304) of images illustrates the reverse diffusion process, in which X_T is the starting point with a noisy image (e.g., one having Gaussian noise). The diffusion model can be trained to reverse the diffusion process (e.g., by training a model p_θ(x_{t-1} | x_t)) in order to generate new data. In one illustrative example, the diffusion model can be trained by finding the reverse Markov transitions that maximize the likelihood of the training data. By traversing backward along the chain of time steps, the diffusion model can generate new data. For example, as shown in FIG. 3, the reverse diffusion process proceeds to generate X₀ as the image of the cat.
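For readers who prefer formulas, one widely used parameterization of the two processes is given below. This is an assumption taken from the general diffusion-model literature rather than from this disclosure: β_t denotes an assumed per-step noise variance, and μ_θ and Σ_θ denote the mean and covariance predicted by the learned model.

```latex
% Fixed forward step and learned reverse step (assumed Gaussian parameterization)
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

% Generation traverses the chain backward from pure noise x_T to data x_0
p_\theta(x_{0:T}) = p(x_T)\,\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
```

Under this form, "finding the reverse Markov transitions that maximize the likelihood of the training data" amounts to fitting θ so that the marginal of x₀ under the last expression assigns high probability to the training samples.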

As noted previously, once captured, images and video may be further processed for particular effects. For example, video enhancement aims to restore to high quality a sequence of frames that has been corrupted by some degradation process. This general class of problems includes, for example, super-resolution, compressed video enhancement, and denoising.

Existing machine learning (e.g., neural network) based video enhancement methods can be divided into unidirectional and bidirectional approaches, a distinction that indicates whether a method accesses future frames when enhancing the current frame. Both categories have advantages and disadvantages. Unidirectional approaches access only previous-frame information or, sometimes, up to one future (unenhanced) frame. Such methods can therefore potentially run in real time, but may struggle with long-term dependencies and temporal consistency. Bidirectional approaches enhance all frames (past and future) simultaneously and can freely use past and future information. Using past and future frames leads to improved temporal consistency and allows for highly parallelized inference, but at the cost of increased memory usage and latency (e.g., not allowing real-time operation for applications such as teleconferencing, video streaming, etc.).

In some cases, diffusion models (or diffusion probabilistic models (DPMs)) may be used for video enhancement. DPM-based video enhancement approaches typically fall into the bidirectional category, using architectures that process multiple frames simultaneously with three-dimensional (3D) convolutions or temporal attention layers. Processing multiple frames simultaneously enables parallel sampling of adjacent frames, but provides little flexibility (e.g., the same fixed number of steps is used for all frames, even when consecutive frames are similar). In addition, many DPM-based works omit frame alignment and instead leave the model to perform the necessary processing without it. Other DPM-based video enhancement methods provide context information (e.g., depth maps) as input to the DPM to improve fidelity in video-to-video translation. Even for these, however, frame alignment is usually implicit. One could argue that converging on a bidirectional approach without explicit alignment is an advantage, because a custom flow estimator or warping operator is no longer needed. However, removing these priors generally comes at a computational cost, because the DPM must now perform these tasks implicitly.

As previously noted, systems and techniques are described herein for performing frame enhancement (e.g., video super-resolution, sharpening, etc.) using a video enhancement diffusion model that provides a general-purpose, high-quality, low-latency frame enhancement algorithm, enabling more general applications than existing super-resolution systems. For example, the video enhancement diffusion model can be trained based on diffusion-based training, which can provide perceptual quality. In some cases, optical flow can be used for temporal modeling (e.g., in pixel space using optical flow estimation and backward warping, or in feature space using deformable convolutions), which can provide temporal consistency. For example, the model can use explicit motion estimation between input frames and can be conditioned on the motion between the input frames.

Recurrent upsampling can also be performed instead of parallel upsampling (e.g., by performing the diffusion process sequentially, frame by frame). Processing frames in a recurrent manner introduces a sequential dependency, in which case parallel sampling may not be possible. Recurrent processing of frames may therefore be slower than "parallel" sampling of all or multiple frames of a video. The inherent latency of recurrent processing (e.g., due to the lack of parallel sampling) can be compensated for by reusing information from a previous time step, such as by reusing noise and previous diffusion latents of the sampling scheme, skipping sampling steps, reusing sampling steps, and/or any adaptation thereof. For example, the latency can be compensated for by reusing or skipping sampling steps between adjacent time steps. In some cases, the video enhancement diffusion model described herein can operate in a unidirectional video enhancement setting.

Examples will be described herein using video super-resolution as an illustrative example of a frame enhancement operation that can be performed by the video enhancement diffusion model. However, in some aspects, the video enhancement diffusion model can be trained to perform other frame enhancement operations.

FIG. 4 is a diagram illustrating an example of a video model (400) that includes a video enhancement diffusion model (408), in accordance with aspects described herein. As shown, the input to the video enhancement diffusion model (408) includes an input low-resolution frame (404), input noise (406), and one or more previous upsampled frames (402). The one or more previous upsampled frames (402) can include a frame that precedes the current input low-resolution frame (404) (e.g., frame x_{t-1} when the input low-resolution frame (404) is frame x_t). The input noise (406) can include a noise map. The video enhancement diffusion model (408) can generate an output upsampled frame (410) based on the input low-resolution frame (404), the input noise (406), and the one or more previous upsampled frames (402). The output upsampled frame (410) has a higher resolution than the input low-resolution frame (404), along with high perceptual quality and high temporal consistency with previous frames.
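The data flow of FIG. 4 can be sketched as a loop that visits frames strictly in order and feeds each output back as the "previous upsampled frame" for the next step. The sketch below is not the disclosed model: `toy_enhance` is a hypothetical placeholder that performs nearest-neighbor upsampling and ignores the noise and the previous output, and it exists only so the loop can run. All function names and the `scale` parameter are assumptions.

```python
import random


def make_noise(height, width, rng):
    """Draw a Gaussian noise map at the output resolution."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def toy_enhance(low_res, noise, prev_upsampled, scale):
    """Placeholder for the diffusion model: nearest-neighbor upsampling only.

    A real model would denoise `noise` conditioned on `low_res` and
    `prev_upsampled`; both extra inputs are ignored in this stand-in.
    """
    out_h, out_w = len(low_res) * scale, len(low_res[0]) * scale
    return [[low_res[y // scale][x // scale] for x in range(out_w)]
            for y in range(out_h)]


def enhance_video(low_res_frames, scale=2, seed=0):
    rng = random.Random(seed)
    outputs = []
    prev_upsampled = None            # nothing to condition on for the first frame
    for frame in low_res_frames:     # in order; no future frame is ever read
        noise = make_noise(len(frame) * scale, len(frame[0]) * scale, rng)
        upsampled = toy_enhance(frame, noise, prev_upsampled, scale)
        outputs.append(upsampled)
        prev_upsampled = upsampled   # fed back when enhancing the next frame
    return outputs


frames = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
video = enhance_video(frames)
```

Because each iteration depends only on the current low-resolution frame and earlier outputs, a frame can be emitted as soon as it arrives, which is what makes the arrangement suitable for low-latency use.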

As previously noted, most existing video super-resolution methods access future low-resolution frames and are therefore not suitable for low-latency applications. As shown in FIG. 4, future low-resolution frames are not needed to perform video super-resolution. Using a diffusion model for the video enhancement diffusion model (408) results in higher perceptual quality compared to other machine learning approaches.

FIG. 5 is a diagram illustrating another example of a video model (500) that includes a video enhancement diffusion model (508), in accordance with aspects described herein. Similar to the video model (400) of FIG. 4, the input to the video enhancement diffusion model (508) of FIG. 5 includes an input low-resolution frame (504) and input noise (506). The input noise (506) can include a noise map.

The video model (500) also includes a flow estimation engine (514). The flow estimation engine (514) can estimate or determine an optical flow between the current input low-resolution frame (504) and a previous low-resolution frame (505) (e.g., a frame preceding the current input low-resolution frame (504), such as frame x_{t-1} when the input low-resolution frame (504) is frame x_t). In some aspects, the flow estimation engine (514) is a machine learning based optical flow estimator, such as a neural network based optical flow estimator. For example, the flow estimation engine (514) can be a neural network model (e.g., a CNN, an RNN, etc.) trained to estimate optical flow between frames. The neural network model can be trained using any training technique, such as a supervised learning technique, a semi-supervised learning technique, an unsupervised learning technique, a self-supervised learning technique, or another training technique. Using supervised learning as an illustrative example, a training dataset can include one or more videos, and the ground truth for the training can include optical flow maps for frames of the videos. A loss (e.g., a mean squared error (MSE) or other loss) can be determined based on a comparison of the estimated optical flow maps output by the flow estimation engine (514) with the ground truth optical flow maps. The loss can be used to perform backpropagation to tune the parameters (e.g., weights, biases, etc.) of the neural network of the flow estimation engine (514).
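As a small illustration of the supervised loss mentioned above, the following sketch computes the mean squared error between an estimated flow map and a ground truth flow map. The representation (nested lists of `(dx, dy)` tuples) and the function name `flow_mse` are assumptions for illustration; a training framework would compute the same quantity on tensors and backpropagate through it.

```python
def flow_mse(estimated, ground_truth):
    """Mean squared error over all flow components of two H x W maps of (dx, dy)."""
    total, count = 0.0, 0
    for est_row, gt_row in zip(estimated, ground_truth):
        for (est_dx, est_dy), (gt_dx, gt_dy) in zip(est_row, gt_row):
            total += (est_dx - gt_dx) ** 2 + (est_dy - gt_dy) ** 2
            count += 2  # two components (horizontal and vertical) per pixel
    return total / count


estimated = [[(1.0, 0.0), (0.0, 0.0)]]
ground_truth = [[(0.0, 0.0), (0.0, 2.0)]]
loss = flow_mse(estimated, ground_truth)  # (1 + 4) / 4 components
```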

In some aspects, the flow estimation engine (514) can perform optical flow motion estimation on a pixel-by-pixel basis. For example, for each pixel in the previous low-resolution frame (505), a motion estimate f defines the location of the corresponding pixel in the input low-resolution frame (504). The motion estimate f for each pixel can include an optical flow vector (e.g., a motion vector) that indicates the movement of the pixel between the frames. In some examples, an optical flow map (also referred to as a motion vector map) can be generated based on the computation of the optical flow vectors between frames. In some cases, the optical flow map can include an optical flow vector for each pixel in a frame, where each vector indicates the movement of a pixel between the frames. For example, dense optical flow can be computed between adjacent frames to generate an optical flow vector for each pixel in a frame, which can be included in a dense optical flow map. In some cases, the optical flow map can include vectors for fewer than all of the pixels in a frame. In some examples, the optical flow vector for a pixel can be a displacement vector (e.g., indicating a horizontal (x) displacement and a vertical (y) displacement) that indicates the movement of the pixel from the previous low-resolution frame (505) to the input low-resolution frame (504).

A warping engine (516) of the video model (500) can warp a previous upsampled frame (503) (e.g., upsampled by the video model (500)) using the optical flow determined between the input low-resolution frame (504) and the previous low-resolution frame (505) to generate a warped previous upsampled frame (517). For example, to generate the warped previous upsampled frame (517), the warping engine (516) can adjust (e.g., move) each pixel of the previous upsampled frame (503) by the amount (e.g., in the horizontal and vertical directions) indicated by the respective optical flow vector (or motion vector) determined for that pixel.
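The warping step can be sketched as below. Because the flow is estimated at low resolution while the warped frame is at high resolution, this sketch assumes that each high-resolution pixel reuses the flow vector of the low-resolution pixel it covers and that the displacement is multiplied by the upsampling factor; the text above does not spell out this detail. The sketch moves each pixel forward along its flow vector with nearest-pixel rounding and fills uncovered positions with 0. The function name, the `scale` parameter, and the hole value are all hypothetical.

```python
def warp_previous_frame(prev_hr, flow_lr, scale):
    """Move each pixel of the high-resolution frame along its flow vector.

    prev_hr: (H * scale) x (W * scale) nested list of pixel values.
    flow_lr: H x W nested list of (dx, dy) displacements in low-res pixels.
    """
    height, width = len(prev_hr), len(prev_hr[0])
    warped = [[0] * width for _ in range(height)]  # 0 marks positions nothing moved to
    for y in range(height):
        for x in range(width):
            dx, dy = flow_lr[y // scale][x // scale]
            # A displacement of one low-res pixel spans `scale` high-res pixels.
            tx = x + round(dx * scale)
            ty = y + round(dy * scale)
            if 0 <= tx < width and 0 <= ty < height:
                warped[ty][tx] = prev_hr[y][x]
    return warped


prev_hr = [[1, 2, 3, 4],
           [5, 6, 7, 8]]
flow_lr = [[(1, 0), (1, 0)]]   # everything moves one low-res pixel to the right
warped = warp_previous_frame(prev_hr, flow_lr, scale=2)
```

Practical implementations often use backward warping with bilinear interpolation instead, which avoids holes; the forward variant is used here only because it mirrors the per-pixel "move by the flow vector" description most directly.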

As shown in FIG. 5, the warped previous upsampled frame (517) generated by the warping engine (516) based on the optical flow can be used as an additional input to the video enhancement diffusion model (508). The video enhancement diffusion model (508) can generate an output upsampled frame (510) based on the input low-resolution frame (504), the input noise (506), and the warped previous upsampled frame (517). The output upsampled frame (510) has a higher resolution than the input low-resolution frame (504), along with high perceptual quality and high temporal consistency with previous frames. For example, by estimating the optical flow using the low-resolution frames (e.g., the input low-resolution frame (504) and the previous low-resolution frame (505)) and then warping the previous upsampled frame (503) using that optical flow, the upsampling prediction (the output upsampled frame (510)) will be closer to, and thus more temporally consistent with, the previous upsampled frame (503).

In some aspects, one or more deformable convolutions can perform feature-space warping using the optical flow determined between the input low-resolution frame (504) and the previous low-resolution frame (505), where the warping is applied in feature space (e.g., to features output by one or more layers of a machine learning system of the video model (500)). For example, a deformable convolution kernel or filter can process features (e.g., feature vectors or another feature representation) output by one or more layers of a neural network making up the video model (500), together with an optical flow map, to perform warping based on the optical flow. In some cases, to perform the warping, the deformable convolution can replace a pixel value with a weighted sum of other pixel values in a particular region or neighborhood of pixels, where the optical flow map can specify the locations of the pixel values within that region or neighborhood around the pixel.

As illustrated in FIG. 5, no auxiliary inputs (e.g., depth maps, albedo maps, etc.) are required, which makes the solution applicable to a wide variety of scenarios or use cases. Many other video super-resolution (VSR) methods, as well as other applications that use flow estimation, rely on such auxiliary inputs.

In some cases, the input noise provided to one or more of the video enhancement diffusion models described herein (e.g., the input noise (406) provided to the video enhancement diffusion model (408) of FIG. 4 or the input noise (506) provided to the video enhancement diffusion model (508) of FIG. 5) can be resampled from a target distribution of noise (e.g., unit-variance, zero-centered Gaussian noise) at particular points or intervals. For example, as noted above, the input noise (e.g., the noise (406) and/or the noise (506)) can include a noise map (or noise image). In one illustrative example, the input noise can be a map/image of Gaussian noise at a particular resolution (e.g., 4K resolution). Resampling the input noise refers to obtaining a new noise map from the target distribution of noise (e.g., a new map of Gaussian noise that differs from the previous input noise map), for example by drawing a new sample from the Gaussian noise distribution.

In one illustrative example, the input noise can be resampled at every time step (or at a subset of the time steps) of the input video sequence. In another illustrative example, the input noise can be resampled once per sequence of frames of the input video sequence (e.g., a sequence of frames can include 12 frames, 24 frames, 60 frames, 120 frames, or any other number of frames). In such an example, the noise can be held constant for the number of frames in each sequence (e.g., 12 frames, 24 frames, etc.) and resampled at the first frame of each sequence. In another illustrative example, the input noise can be resampled when a scene cut is detected (e.g., based on a change in the scene from one frame to another). In another illustrative example, the input noise can be resampled at a set time interval (e.g., every 0.5 seconds, every 1 second, or at another time interval).
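The resampling policies listed above can be sketched as a small generator. This is an illustrative sketch only: the name `noise_stream`, its parameters, and the idea of passing scene cuts as a precomputed set of frame indices are assumptions for the example (scene-cut detection itself is not shown).

```python
import numpy as np

def noise_stream(num_frames, shape, frames_per_sequence=None, scene_cuts=(), seed=0):
    """Yield one noise map per frame, resampling only at the chosen points.

    frames_per_sequence=None resamples at every frame; otherwise the map is
    held constant within each group of `frames_per_sequence` frames.
    `scene_cuts` holds frame indices at which a cut was detected (hypothetical input).
    """
    rng = np.random.default_rng(seed)
    cuts = set(scene_cuts)
    noise = None
    for i in range(num_frames):
        start_of_sequence = frames_per_sequence is None or i % frames_per_sequence == 0
        if noise is None or start_of_sequence or i in cuts:
            noise = rng.standard_normal(shape)  # zero-mean, unit-variance Gaussian
        yield noise

# Six frames, resampled every three frames and at a scene cut on frame 4.
maps = list(noise_stream(6, (4, 4), frames_per_sequence=3, scene_cuts={4}))
```

A fixed time interval (e.g., every 0.5 seconds) reduces to the same logic once the interval is converted to a frame count using the frame rate.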

FIG. 6 is a diagram illustrating another example of a video model (600) that includes a video enhancement diffusion model (608), according to aspects described herein. Similar to the video model (400) of FIG. 4, the video model (600) includes a flow estimation engine (614) and a frame warping engine (616). The flow estimation engine (614) can perform the same operations as the flow estimation engine (514) described above with respect to FIG. 5. For example, the flow estimation engine (614) can estimate or determine the optical flow between a current input low-resolution frame (604) and a previous low-resolution frame (605).

The frame warping engine (616) can perform the same operations as the warping engine (516) described above with respect to FIG. 5. For example, the frame warping engine (616) can warp a previous upsampled frame (603) (e.g., upsampled by the video model (600)) using the optical flow determined between the input low-resolution frame (604) and the previous low-resolution frame (605), generating a warped previous upsampled frame (617). In some cases, to generate the warped previous upsampled frame (617), the frame warping engine (616) can adjust (e.g., move) each pixel of the previous upsampled frame (603) by the amount (e.g., in the horizontal and vertical directions) indicated by the respective optical flow vector (or motion vector) determined for that pixel.

The video model (600) further includes a noise warping engine (624). In some aspects, the noise warping engine (624) and the frame warping engine (616) can be the same warping engine (e.g., a shared warping engine can warp both the previous noise input (622) and the previous upsampled frame (603)). The noise warping engine (624) can warp the previous noise input (622) using the optical flow determined between the input low-resolution frame (604) and the previous low-resolution frame (605) to generate warped noise (619). For example, the previous noise input (622) can include a frame whose pixel values are sampled from a noise distribution (e.g., a Gaussian noise distribution). The noise warping engine (624) can adjust (e.g., move) each pixel of the previous noise input (622) by the amount (e.g., in the horizontal and vertical directions) indicated by the respective optical flow vector (or motion vector) determined for each pixel of the input low-resolution frame (604).

In addition to the input low-resolution frame (604) and the input noise (606) (e.g., a noise map or image), the warped previous upsampled frame (617) generated by the frame warping engine (616) and the warped noise (619) generated by the noise warping engine (624) can be used as inputs to the video enhancement diffusion model (608). The video enhancement diffusion model (608) can generate an output upsampled frame (610) based on the input low-resolution frame (604), the input noise (606), the warped previous upsampled frame (617), and the warped noise (619). The output upsampled frame (610) has a higher resolution than the input low-resolution frame (604), with high perceptual quality and high temporal consistency with previous frames.

In addition to using the optical flow to warp the previous upsampled frame (603), the video model (600) of FIG. 6 can also warp the input noise (606) (e.g., a noise map) to generate warped noise (619) that can be used as an input to the video enhancement diffusion model (608). In some cases, the warped noise (619) can replace the input noise (606) or can be combined with the input noise (606) (e.g., as a linear combination of the previously sampled noise and the currently resampled noise). As previously noted, in some cases the input noise provided to one or more of the video enhancement diffusion models described herein (e.g., the input noise (406) provided to the video enhancement diffusion model (408) of FIG. 4 or the input noise (506) provided to the video enhancement diffusion model (508) of FIG. 5) can be resampled at particular points or intervals (e.g., at each time step, for each sequence of frames, at a predetermined time interval, when a scene cut or change is detected, any combination thereof, and/or at other points or intervals). As illustrated in FIG. 6, warping the noise map from one or more previous frames with the optical flow can help enforce a consistent texture on objects in the scene (e.g., moving objects). In some aspects, a combination of the warped previous noise and newly sampled noise can be used, which can help with occlusions (e.g., where parts of the scene become occluded), scene cuts (e.g., where a new scene is presented from one frame to the next), and the like.
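The linear combination of warped previous noise and freshly sampled noise can be sketched as below. The text only states that a linear combination can be used; the square-root weighting (which keeps the result at unit variance when the two inputs are independent unit-variance maps), the function name `blend_noise`, and the parameter `keep` are assumptions of this example. A uniform one-pixel shift stands in for the flow-based warp.

```python
import numpy as np

def blend_noise(warped_prev_noise, fresh_noise, keep=0.8):
    """Linearly combine warped previous noise with new noise.

    The sqrt weights are an assumed choice that preserves unit variance
    for independent, unit-variance inputs.
    """
    return np.sqrt(keep) * warped_prev_noise + np.sqrt(1.0 - keep) * fresh_noise

rng = np.random.default_rng(0)
prev_noise = rng.standard_normal((64, 64))
# Stand-in for flow-based warping: shift the whole map one pixel to the right.
warped_prev_noise = np.roll(prev_noise, shift=1, axis=1)
fresh_noise = rng.standard_normal((64, 64))
mixed = blend_noise(warped_prev_noise, fresh_noise, keep=0.8)
```

Setting `keep=1.0` corresponds to replacing the input noise with the warped noise, and `keep=0.0` corresponds to full resampling, for example after a scene cut.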

In some aspects, the video enhancement diffusion models described herein (e.g., the video enhancement diffusion model (508) of FIG. 5, the video enhancement diffusion model (608) of FIG. 6, etc.) can be latent-variable models that learn to perform a reverse diffusion process in order to invert a forward diffusion process that destroys data over T steps, for example, as illustrated below:
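The equation referenced at this point is not legible in this text. One common way to write a forward diffusion process of this kind, offered here as an assumption rather than as the original equation, uses latents $z_t$, data $x$, and schedule coefficients $\alpha_t$ and $\sigma_t$:

```latex
q(z_t \mid x) = \mathcal{N}\!\left(z_t;\ \alpha_t x,\ \sigma_t^2 I\right),
\qquad t = 1, \dots, T \tag{1}
```

Under this form, $\alpha_t$ shrinks the signal and $\sigma_t$ grows the noise as $t$ increases, so that $z_T$ is approximately pure Gaussian noise.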

In equation (1), $\alpha_t$ and $\sigma_t$ control the noise schedule (which refers to the progression of the forward diffusion process). The generative model is then a Markov chain with Gaussian transitions, with a fixed diagonal covariance according to the schedule $\sigma_t^2$ and user-specified conditioning c:
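The equation for the generative model is likewise not legible in this text. A standard conditional Markov chain matching the description above (Gaussian transitions, fixed diagonal covariance $\sigma_t^2 I$, conditioning c) could be written as follows; the mean function $\mu_\theta$ and the latent symbols $z_t$ (with $z_0 = x$) are assumed notation:

```latex
p_\theta(z_{0:T} \mid c) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t, c),
\qquad
p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \sigma_t^2 I\right)
```

Only the mean $\mu_\theta$ is learned in this form; the covariance follows the fixed schedule.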

Sampling from $p_\theta(x)$ then involves T steps of ancestral sampling. To reduce the computational load of sampling from the diffusion model, a smaller number of sampling steps can be performed at test time, for example by respacing the sampling schedule (as described below) or through more advanced techniques such as step distillation.

In some cases, the video enhancement diffusion model (e.g., the video enhancement diffusion model (508) of FIG. 5, the video enhancement diffusion model (608) of FIG. 6, etc.) can be trained using v-parameterization. For example, v-parameterization is an alternative objective for diffusion models that handles low signal-to-noise ratio (SNR) cases, which in effect stabilizes training. Low SNR becomes increasingly likely when training for fewer and fewer diffusion steps, which is why the SNR+1 objective (also known as v-parameterization) is used. In some cases, v-parameterization can be better suited to a low number of steps than commonly used objectives. For example, the video enhancement system described herein (including the video enhancement diffusion model) can be designed to use a low number of steps in order to achieve both low complexity and low latency, since more steps use more computation and require more time. The system can perform diffusion in residual space, reuse noise from previous time steps, and use particular scheduling and noise (e.g., a denoising diffusion implicit model (DDIM)), resulting in a low number of steps.

In some aspects, the space of residuals r can be modeled instead of directly modeling the space of images. When the conditioning frame has a lower resolution than the target frame, naive upsampling (e.g., bicubic interpolation) can be used as a preparatory step. For example, in some examples, at least two options for the residuals can be used, including a spatial residual and a temporal residual. For the spatial residual, the residual is taken relative to the low-resolution frame. For the temporal residual, the residual is taken relative to the previous upsampled frame warped using the optical flow vector field f.
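The residual definitions are not legible in this text. One plausible way to write them, using assumed notation ($x^{HR}$ for the target high-resolution frame, $x^{LR}$ for the low-resolution frame, $\hat{x}^{HR}_{\text{prev}}$ for the previous upsampled frame, $\mathrm{up}$ for naive upsampling, and $\mathrm{warp}$ for flow-based warping), is:

```latex
r_{\text{spatial}} = x^{HR} - \mathrm{up}\!\left(x^{LR}\right),
\qquad
r_{\text{temporal}} = x^{HR} - \mathrm{warp}\!\left(\hat{x}^{HR}_{\text{prev}},\ f\right)
```

In both cases the model only has to generate the detail missing from an inexpensive baseline, which is typically a smaller and easier signal than the full image.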

Accordingly, the video enhancement diffusion model can learn to predict the v-parameterization target, where $\alpha_t$ and $\sigma_t$ control the noise schedule and r denotes the residuals described previously. In one illustrative example, as noted above, the SNR+1 training objective (or v-parameterization) can be used, which is based on a weighting factor equal to the SNR plus 1:
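The objective itself is not legible in this text. In the widely used v-parameterization, applied here to the residuals r as an assumption, the target and the SNR+1 weighted loss can be written as follows, with noisy latent $z_t = \alpha_t r + \sigma_t \epsilon$, noise $\epsilon \sim \mathcal{N}(0, I)$, and $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$:

```latex
v_t = \alpha_t \epsilon - \sigma_t r,
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{r,\epsilon,t}\!\left[\left(1 + \frac{\alpha_t^2}{\sigma_t^2}\right)
\left\lVert r - \hat{r}_\theta(z_t, t, c) \right\rVert_2^2\right]
```

When $\alpha_t^2 + \sigma_t^2 = 1$, this weighted loss on the residual equals the unweighted squared error between $v_t$ and the model's prediction of it.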

Here, the predicted term in the objective can be the output of the video enhancement diffusion model. Note that the output can differ depending on the exact loss formulation, and that SNR+1 (v-parameterization) is given only as an example.
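The link between the SNR+1 weighting and v-parameterization can be checked numerically. The sketch below assumes the standard definitions $z = \alpha r + \sigma \epsilon$ and $v = \alpha \epsilon - \sigma r$ with $\alpha^2 + \sigma^2 = 1$; these symbols and the random stand-in for the model output are assumptions of the example, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.6
sigma = np.sqrt(1.0 - alpha**2)  # variance-preserving schedule: alpha^2 + sigma^2 = 1

r = rng.standard_normal((8, 8))    # residual to be modeled
eps = rng.standard_normal((8, 8))  # Gaussian noise
z = alpha * r + sigma * eps        # noisy latent
v = alpha * eps - sigma * r        # v-parameterization target

# Stand-in for an imperfect model prediction of v.
v_hat = v + 0.1 * rng.standard_normal((8, 8))
# Residual estimate implied by the predicted v.
r_hat = alpha * z - sigma * v_hat

snr = alpha**2 / sigma**2
loss_v = np.mean((v - v_hat) ** 2)
loss_r_weighted = (snr + 1.0) * np.mean((r - r_hat) ** 2)
```

The two losses agree because $r - \hat{r} = -\sigma (v - \hat{v})$ and $\mathrm{SNR} + 1 = 1 / \sigma^2$ under this schedule.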

In some aspects, an image model can be used to perform super-resolution on single frames of a sequence of frames (e.g., a video), such as a first or initial frame, or other individual frames for which no previous frames are available. For example, when performing image enhancement (e.g., super-resolution) on frames, the image model can be used for the first or initial frame of the sequence, when no previous low-resolution frame is available. The image model can be a conditional diffusion model, where the conditioning c can be the first low-resolution frame (e.g., the frame to be enhanced). Any super-resolution method can be used. Using a diffusion model with an architecture similar to that of the video model allows the weights to be reused. In such aspects, a video model that includes a video enhancement diffusion model described herein (e.g., the video model (500) including the video enhancement diffusion model (508), the video model (600) including the video enhancement diffusion model (608), etc.) can be applied to the remaining frames of the sequence of frames (e.g., the frames occurring after the first or initial frame).

As noted above, the video enhancement diffusion model can also take previously enhanced frames into account, in which case the conditioning c can include the current low-resolution frame and the (warped) previously enhanced frame. FIG. 7 is a diagram illustrating an example of a system (700) that includes a video enhancement diffusion model that uses this conditioning. As shown in FIG. 7, the flow estimation engine (714) receives as input a previous low-resolution frame (705) and a current low-resolution frame (704). The previous low-resolution frame (705) can be a frame that precedes the current input low-resolution frame (704). Similar to that described with respect to FIG. 5 and FIG. 6, the flow estimation engine (714) can estimate or determine an optical flow (725) between the current input low-resolution frame (704) and the previous low-resolution frame (705). The optical flow (725) is shown as an optical flow frame that includes optical flow values, such as displacement values, for each pixel of the previous low-resolution frame (705).

The warping engine (716) can warp a previous upsampled frame (703) (e.g., upsampled by the system (700)) using the determined optical flow (725), resulting in a warped previous upsampled frame (717). For example, the warping engine (716) can adjust (e.g., move the position of) each pixel of the previous upsampled frame (703) by the amount (e.g., in the horizontal and vertical directions) indicated by the respective optical flow vector (or motion vector) given for that pixel in the optical flow (725).

The current low-resolution frame (704), the warped previous upsampled frame (717), and input noise (706) can be input to the video enhancement diffusion model (708). The video enhancement diffusion model (708) can generate an output upsampled frame (710) based on the current low-resolution frame (704), the input noise (706), and the warped previous upsampled frame (717). As described herein, by estimating the optical flow (725) from the current low-resolution frame (704) and the previous low-resolution frame (705) and using the optical flow (725) to warp the previous upsampled frame (703), the output upsampled frame (710) is closer to (e.g., more temporally consistent with) the previous upsampled frame (703). As a result, the output upsampled frame (710) has a higher resolution than the current low-resolution frame (704), with high perceptual quality and high temporal consistency with respect to previous frames (including the previous upsampled frame (703)).

In some aspects, to improve temporal consistency, the previously enhanced frame can be aligned with the current frame (e.g., the input low-resolution frame (504) of FIG. 5, the input low-resolution frame (604) of FIG. 6, etc.) before the current frame is provided as input to the video enhancement diffusion model, for example using optical flow, feature-based warping (e.g., with deformable convolutions), any combination thereof, and/or other temporal modeling. In some cases, the diffusion model can also be conditioned directly on the optical flow field (e.g., a motion vector field) or on the previously aligned (previously warped) frame. In some examples, a pre-trained neural network optical flow estimator (e.g., using recurrent all-pairs field transforms (RAFT) as the optical flow estimator) can be used (e.g., as the flow estimation engine (514) and/or the flow estimation engine (614)) to estimate or determine the motion between the non-enhanced frames (e.g., between the input low-resolution frame (504) and the previous low-resolution frame (505)). To ensure that the non-enhanced frames have the correct resolution and to avoid retraining the flow estimator, bicubic upsampling of the low-resolution frames can be performed before the optical flow is determined (e.g., by the video enhancement diffusion model or before input to the video enhancement diffusion model). As previously described, the estimated optical flow can be applied directly to the previously enhanced frame using backward warping, resulting in an aligned enhanced frame.

In some aspects, training the diffusion model can include two training stages: 1) training an image model to enhance isolated frames (e.g., the initial or first frame of a video, or other individual frames), and 2) training a video model that is warm-started using the image model weights (e.g., using the trained weights of the image model as the starting point for the video model). In one illustrative example, the first training stage can include approximately 500,000 steps and the second training stage can include approximately 500,000 steps. In some cases, when training the video model, the output of the fully trained image model can be used as conditioning. This requires super-resolving all frames of the training dataset, which can be expensive in some examples. In other cases, the ground truth image can be perturbed with a fixed downsampling and upsampling operation to generate a proxy for the enhanced previous frame.
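The proxy described in the last sentence above can be sketched as follows. The text only specifies a fixed downsampling and upsampling operation; the particular choice of average pooling followed by nearest-neighbor upsampling, and the name `degrade_proxy`, are assumptions for illustration.

```python
import numpy as np

def degrade_proxy(gt, scale=2):
    """Perturb an (H, W, C) ground-truth frame by a fixed down/up-sampling round trip.

    Average pooling and nearest-neighbor upsampling are assumed choices;
    H and W are cropped to multiples of `scale`.
    """
    h, w = gt.shape[:2]
    h, w = h - h % scale, w - w % scale
    blocks = gt[:h, :w].reshape(h // scale, scale, w // scale, scale, -1)
    low = blocks.mean(axis=(1, 3))  # average-pool downsampling
    return np.repeat(np.repeat(low, scale, axis=0), scale, axis=1)

rng = np.random.default_rng(0)
gt_prev = rng.random((8, 8, 3))           # ground-truth previous frame
proxy_prev = degrade_proxy(gt_prev, 2)    # cheap stand-in for an enhanced previous frame
```

The proxy avoids running the trained image model over the whole dataset, at the cost of a mismatch with the artifacts that real model outputs would contain.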

To sample from the image model, a respacing procedure can be performed and a denoising diffusion implicit model (DDIM) schedule can be used (e.g., due to its robustness when only a few sampling steps are used). For example, a respacing can be a different discretization of the same continuous function. In one illustrative example, using the function f(x) = x (linear), the schedule can be one in which time is represented as a continuous number between 0 and 1, where 0 represents the image distribution and 1 represents the noise distribution. If the reverse process has T steps, continuous time 0 to 1 can be mapped to discrete time 0 to T. The function can be discretized by sampling it at a number of points (e.g., 100 points, 1000 points, 2 points, etc.). The number of points corresponds to the number of steps to be taken in the diffusion process. The fewer the steps, the larger the difference between the input and output of each diffusion step. Too many steps can underutilize the model and unnecessarily increase computation, while too few steps can leave the model unable to represent such large differences and lead to errors. DDIM is a respacing procedure that is more robust to a mismatch between the number of steps used in training (e.g., 1000 steps) and in evaluation (e.g., 100 steps). In one illustrative example, a default of 75 sampling steps can be used.
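The respacing idea and a deterministic DDIM-style update can be sketched as below. The function names `respace` and `ddim_step` are hypothetical, the evenly spaced selection is one simple choice of discretization, and the update uses the $\alpha$/$\sigma$ notation with an assumed noise prediction `eps_hat`; none of these specifics come from the text.

```python
import numpy as np

def respace(num_train_steps, num_sample_steps):
    """Pick evenly spaced training timesteps, ordered from noisiest to cleanest."""
    steps = np.linspace(0, num_train_steps - 1, num_sample_steps)
    return np.unique(np.rint(steps).astype(int))[::-1]

def ddim_step(z_t, eps_hat, alpha_t, sigma_t, alpha_s, sigma_s):
    """Deterministic DDIM update from noise level t to a lower noise level s."""
    x_hat = (z_t - sigma_t * eps_hat) / alpha_t  # implied clean-data estimate
    return alpha_s * x_hat + sigma_s * eps_hat   # re-noise to level s, no fresh noise

# A model trained with 1000 steps, sampled with 100 steps.
timesteps = respace(1000, 100)
```

Because the update adds no new randomness, the same starting latent and conditioning always produce the same sample, which is the property relied on when noise is reused across frames.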

In some cases, to sample from the video model, latents from previously enhanced frames can be used to speed up sampling. For example, using previous latents can be beneficial for reducing the number of sampling steps for videos with similar or identical frames (e.g., in the extreme case of a still video in which consecutive enhanced frames are exactly identical, zero sampling steps are needed and the previous output can be reused). Reusing latents from the previous frame can also help with temporal consistency. For example, for image models $p_\theta(x)$, two samples that start from the same latent end up closer together, and can be identical when deterministic transitions are used. Assuming that a deterministic sampling scheme and architecture are used, the latent sampled at time step T exactly determines the sample for a given conditioning. Intuitively, this means that the motion vectors provide information about how the noise should be reused. If little or no motion occurs, there is likely no need to resample, whereas if large motion occurs, a new noise vector should be sampled.
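The intuition that motion indicates whether noise should be reused can be sketched as a simple decision rule. The function name `should_resample_noise`, the use of mean flow magnitude, and the threshold value are all hypothetical choices for illustration; the text does not specify a particular criterion.

```python
import numpy as np

def should_resample_noise(flow, threshold=1.0):
    """Return True when the mean motion magnitude (in pixels) exceeds `threshold`.

    `flow` is an (H, W, 2) optical flow field; the threshold is an assumed value.
    """
    mean_motion = float(np.mean(np.linalg.norm(flow, axis=-1)))
    return mean_motion > threshold

still_flow = np.zeros((4, 4, 2))       # no motion: keep the previous noise
fast_flow = np.full((4, 4, 2), 5.0)    # large motion: draw a new noise map
decisions = (should_resample_noise(still_flow), should_resample_noise(fast_flow))
```

A per-pixel variant of the same rule could keep warped noise in static regions and draw new noise only where motion or occlusion is large.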

In some cases, the video super-resolution techniques described herein can make use of optimized hardware acceleration. For example, machine learning (ML) accelerators implemented by a computing device such as a mobile computing device (e.g., a smartphone, a tablet computer, etc.) and/or another edge computing device can be referred to as mobile ML accelerators. ML accelerators can be implemented as specialized microprocessors for accelerating the various computations associated with performing inference using an ML model running on the computing device. ML accelerators can be provided as general-purpose hardware capable of accelerating various types of ML operations and/or various ML networks and architectures. For example, a mobile ML accelerator included in a smartphone can be used to accelerate image processing, natural language processing (NLP), speech recognition, and the like. Different ML operations, networks, and/or architectures can be associated with different numbers of input channels. For example, image processing machine learning networks can use three-channel inputs (e.g., one red channel, one green channel, and one blue channel for an RGB image input). NLP machine learning networks can use inputs with dozens of channels or more. For example, an NLP machine learning network can use separate channels for different word embeddings, vocabularies, phrases, and so on.

FIG. 8 is a flowchart illustrating an example of a process (800) for processing frame data to perform frame enhancement on one or more frames using the techniques described herein. The process (800) can be performed by a computing device (or apparatus) or by a component of a computing device (e.g., a chipset, a codec, etc.). The computing device can be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device (e.g., a virtual reality (VR) device or an augmented reality (AR) device), a vehicle or a component or system of a vehicle, or another type of computing device. The operations of the process (800) can be implemented as software components that are executed and run on one or more processors (e.g., the CPU (102), the GPU (104), the DSP (106), and/or the NPU (108) of FIG. 1, the processor (910) of FIG. 9, or other processor(s)). In addition, the transmission and reception of signals by the computing device in the process (800) can be enabled, for example, by one or more antennas, one or more transceivers (e.g., wireless transceiver(s)), and/or other components of the computing device.

At block (802), the computing device (or a component thereof) can determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution. For example, the first previous frame can be the frame immediately preceding the current frame in the video (e.g., frame x_{t-1} if the current frame is frame x_t). As an illustrative example, referring to FIG. 5, the flow estimation engine (514) can estimate or determine the optical flow between the current input low-resolution frame (504) and the previous low-resolution frame (505). In some cases, the optical flow includes a respective motion vector for each pixel of the current frame (e.g., a first motion vector for a first pixel of the current frame, a second motion vector for a second pixel of the current frame, a third motion vector for a third pixel of the current frame, etc.).
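As an illustrative, non-limiting sketch, the per-pixel motion vectors of block (802) can be pictured as a grid holding one (dx, dy) pair per pixel of the current frame. The function `estimate_flow`, its `radius` and `patch` parameters, and the exhaustive block-matching search are hypothetical stand-ins chosen for brevity; the disclosure does not state that the flow estimation engine (514) works this way, and a learned flow estimator could be used instead.

```python
def estimate_flow(cur, prev, radius=1, patch=1):
    """Toy dense optical flow by exhaustive block matching (an assumption).

    cur, prev: 2-D lists of equal size (grayscale low-resolution frames).
    Returns flow[y][x] = (dx, dy) such that the patch around cur[y][x]
    best matches the patch around prev[y + dy][x + dx].
    """
    h, w = len(cur), len(cur[0])

    def px(img, y, x):
        # Clamp coordinates so that patches near the border stay in range.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    flow = [[(0, 0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best, best_cost = (0, 0), float("inf")
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Sum of absolute differences over the patch.
                    cost = sum(
                        abs(px(cur, y + j, x + i) - px(prev, y + dy + j, x + dx + i))
                        for j in range(-patch, patch + 1)
                        for i in range(-patch, patch + 1)
                    )
                    if cost < best_cost:
                        best, best_cost = (dx, dy), cost
            flow[y][x] = best  # one motion vector per current-frame pixel
    return flow
```

In this sketch each vector points from a pixel of the current frame to its matching location in the previous frame, which is the convention a backward warp would consume.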

At block (804), the computing device (or a component thereof) may warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, where the second resolution is higher than the first resolution. For example, the second previous frame may be an upsampled version of the first previous frame. As an illustrative example, referring to FIG. 5, the warping engine (516) may warp the previous upsampled frame (503) to generate the warped previous upsampled frame (517).

At block (806), the computing device (or a component thereof) may process a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model (e.g., a video enhancement diffusion model described herein, such as the video enhancement diffusion model (508), the video enhancement diffusion model (608), etc.) to generate an output frame having the second resolution (e.g., an upsampled version of the current frame). As an illustrative example, referring to FIG. 5, the video enhancement diffusion model (508) may process the warped previous upsampled frame (517), the input low-resolution frame (504), and the input noise (506) to generate the output upsampled frame (510). In some aspects, the noise frame is sampled from a Gaussian noise distribution. As noted above, in some cases, the optical flow includes a respective motion vector for each pixel of the current frame. In such cases, to warp the second previous frame, the computing device (or a component thereof) may adjust each pixel of the second previous frame by an amount indicated by the respective motion vector of the optical flow.
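As an illustrative, non-limiting sketch, the warping of block (804) can be written as a backward warp in which every pixel of the warped frame is fetched from the previous high-resolution frame at a position offset by a motion vector. Because the flow is determined at the first (lower) resolution while the warped frame has the second (higher) resolution, the sketch assumes an integer `scale` factor, nearest-neighbor upsampling of the flow, and multiplication of each vector by `scale`. Those choices, the function name `warp_previous_frame`, and the border clamping are assumptions and are not taken from the disclosure.

```python
def warp_previous_frame(prev_hr, flow_lr, scale):
    """Backward-warp a high-resolution previous frame with a low-resolution flow.

    prev_hr: 2-D list (H x W), the previous frame at the second resolution.
    flow_lr: 2-D list (H/scale x W/scale) of (dx, dy) vectors at the first resolution.
    scale:   integer ratio between the two resolutions (assumed).
    """
    h, w = len(prev_hr), len(prev_hr[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Nearest-neighbor lookup of the low-resolution motion vector.
            dx, dy = flow_lr[y // scale][x // scale]
            # Rescale the vector to high-resolution pixel units and clamp to the frame.
            sx = min(max(x + dx * scale, 0), w - 1)
            sy = min(max(y + dy * scale, 0), h - 1)
            warped[y][x] = prev_hr[sy][sx]
    return warped
```

A production implementation would more likely use sub-pixel (e.g., bilinear) sampling on an accelerator; integer vectors keep the sketch short.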

In some aspects, the computing device (or a component thereof) may warp a previous noise frame based on the determined optical flow to generate a warped noise frame. As an illustrative example, referring to FIG. 6, the noise warping engine (624) may warp the previous noise input (622) based on the optical flow determined by the optical flow estimation engine (614) to generate the warped noise (619). In such aspects, the computing device (or a component thereof) may generate the output frame further based on processing the warped noise frame using the diffusion machine learning model (e.g., as illustrated in FIG. 6). In some cases, the computing device (or a component thereof) may use the warped noise frame instead of the noise frame to generate the output frame. For example, the computing device (or a component thereof) may process the warped noise frame, the current frame, and the warped previous frame using the diffusion machine learning model to generate the output frame having the second resolution. In some cases, the computing device (or a component thereof) may use a combination (e.g., a linear combination) of the noise frame and the warped noise frame to generate the output frame. For example, the computing device (or a component thereof) may process the combination of the noise frame and the warped noise frame, the current frame, and the warped previous frame using the diffusion machine learning model to generate the output frame having the second resolution.
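As an illustrative, non-limiting sketch, one possible linear combination of a freshly sampled noise frame and a warped noise frame is shown below. The helper names `sample_noise_frame` and `combine_noise`, the mixing weight `alpha`, and the square-root weighting are assumptions: the disclosure states only that a combination (e.g., a linear combination) may be used. The square-root weights are chosen so that their squares sum to one, which keeps unit variance when the two inputs are independent unit-variance noise.

```python
import math
import random


def sample_noise_frame(height, width, seed=None):
    """Sample a noise frame from a standard Gaussian distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def combine_noise(fresh, warped, alpha):
    """Linear combination sqrt(alpha) * warped + sqrt(1 - alpha) * fresh.

    alpha = 1.0 uses only the warped noise frame; alpha = 0.0 uses only the
    freshly sampled noise frame. The weighting is an assumed design choice.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a, b = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [
        [a * wv + b * fv for wv, fv in zip(warped_row, fresh_row)]
        for warped_row, fresh_row in zip(warped, fresh)
    ]
```

Setting `alpha` to 1.0 corresponds to the case in which the warped noise frame replaces the noise frame, and intermediate values correspond to the combined case.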

In some cases, the computing device (or a component thereof) may process a plurality of consecutive frames of a video sequentially (or recurrently). As previously described, recurrent upsampling of consecutive frames (e.g., instead of upsampling multiple frames in parallel) may provide low delay and is therefore suited to low-latency applications. In some aspects, the computing device (or a component thereof) may reuse a previous diffusion latent of the diffusion machine learning model, reuse at least one sampling step between time steps of the diffusion machine learning model, and/or skip one or more sampling steps between time steps of the diffusion machine learning model. As described above, reusing such information may compensate for the delay incurred in performing the recurrent/sequential processing.
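As an illustrative, non-limiting sketch, the recurrent processing of consecutive frames can be organized as a loop in which the output frame for one time step becomes the second previous frame for the next. The function `enhance_video` and its callable parameters (`estimate_flow`, `warp`, `denoise`, `sample_noise`) are hypothetical placeholders for blocks (802), (804), and (806). The handling of the first frame, for which no previous frame exists, is an assumption, since the passage above does not address it.

```python
def enhance_video(lr_frames, scale, estimate_flow, warp, denoise, sample_noise):
    """Recurrently enhance a sequence of low-resolution frames.

    estimate_flow(cur_lr, prev_lr)       -> optical flow      (block 802)
    warp(prev_hr, flow, scale)           -> warped previous   (block 804)
    denoise(noise, cur_lr, warped_prev)  -> output frame      (block 806)
    sample_noise(cur_lr, scale)          -> noise frame
    """
    outputs = []
    prev_lr, prev_hr = None, None
    for cur_lr in lr_frames:
        if prev_lr is None:
            # Assumption: the first frame is enhanced without a warped previous frame.
            warped_prev = None
        else:
            flow = estimate_flow(cur_lr, prev_lr)
            warped_prev = warp(prev_hr, flow, scale)
        noise = sample_noise(cur_lr, scale)
        cur_hr = denoise(noise, cur_lr, warped_prev)
        outputs.append(cur_hr)
        # The output frame feeds the next iteration (recurrence).
        prev_lr, prev_hr = cur_lr, cur_hr
    return outputs
```

Because each output depends only on the current frame and on state carried over from the preceding frame, a frame can be emitted as soon as it arrives rather than after a batch of frames has been collected.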

In some cases, the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) configured to carry out the steps of the processes described herein. In some examples, the computing device may include a display, a network interface configured to communicate and/or receive data, any combination thereof, and/or other component(s). The network interface may be configured to communicate and/or receive Internet Protocol (IP)-based data or other types of data.

The components of the computing device may be implemented in circuitry. For example, the components may include and/or be implemented using electronic circuits or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.

The process (800) is illustrated as a logical flow diagram, the operations of which represent a sequence of operations that can be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.

Additionally, the process (800) and/or any other process described herein may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or by a combination thereof. As noted above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.

FIG. 9 illustrates an example computing device architecture (900) of an example computing device that can implement the various techniques described herein. In some examples, the computing device can include a mobile device, a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a vehicle (or a computing device of a vehicle), or another device. The components of the computing device architecture (900) are shown in electrical communication with each other using a connection (905), such as a bus. The example computing device architecture (900) includes a processing unit (CPU or processor) (910) and a computing device connection (905) that couples various computing device components, including computing device memory (915) such as read-only memory (ROM) (920) and random-access memory (RAM) (925), to the processor (910).

The computing device architecture (900) can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of the processor (910). The computing device architecture (900) can copy data from the memory (915) and/or the storage device (930) to the cache (912) for quick access by the processor (910). In this way, the cache can provide a performance boost that avoids processor (910) delays while waiting for data. These and other engines can control or be configured to control the processor (910) to perform various actions. Other computing device memory (915) may be available for use as well. The memory (915) can include multiple different types of memory with different performance characteristics. The processor (910) can include any general-purpose processor and a hardware or software service, such as service 1 (932), service 2 (934), and service 3 (936) stored in the storage device (930), configured to control the processor (910), as well as a special-purpose processor in which software instructions are incorporated into the processor design. The processor (910) may be a self-contained system containing multiple cores or processors, a bus, a memory controller, a cache, and so on. A multi-core processor may be symmetric or asymmetric.

To enable user interaction with the computing device architecture (900), an input device (945) can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, speech, and so forth. An output device (935) can also be one or more of a number of output mechanisms known to those skilled in the art, such as a display, a projector, a television, a speaker device, and the like. In some cases, multimodal computing devices can enable a user to provide multiple types of input to communicate with the computing device architecture (900). The communication interface (940) can generally govern and manage the user input and the computing device output. There is no restriction on operating on any particular hardware arrangement, and therefore the basic features here may readily be substituted for improved hardware or firmware arrangements as they are developed.

The storage device (930) is a non-volatile memory and can be a hard disk or another type of computer-readable medium that can store data accessible by a computer, such as magnetic cassettes, flash memory cards, solid-state memory devices, digital versatile disks, cartridges, random-access memories (RAMs) (925), read-only memory (ROM) (920), and hybrids thereof. The storage device (930) can include the services (932, 934, 936) for controlling the processor (910). Other hardware or software modules or engines are contemplated. The storage device (930) can be connected to the computing device connection (905). In one aspect, a hardware module that performs a particular function can include a software component stored in a computer-readable medium in connection with the hardware components necessary to carry out that function, such as the processor (910), the connection (905), the output device (935), and so forth.

Aspects of the present disclosure are applicable to any suitable electronic device (such as security systems, smartphones, tablets, laptop computers, vehicles, drones, or other devices) that includes or is coupled to one or more active depth sensing systems. While described below with respect to a device having or coupled to one light projector, aspects of the present disclosure are applicable to devices having any number of light projectors and are therefore not limited to specific devices.

The term "device" is not limited to one or a specific number of physical objects (e.g., one smartphone, one controller, one processing system, etc.). As used herein, a device may be any electronic device with one or more parts that can implement at least some portions of this disclosure. While the description and examples below use the term "device" to describe various aspects of this disclosure, the term "device" is not limited to a specific configuration, type, or number of objects. Additionally, the term "system" is not limited to multiple components or to specific aspects. For example, a system may be implemented on one or more printed circuit boards or other substrates and may have movable or static components. While the description and examples below use the term "system" to describe various aspects of this disclosure, the term "system" is not limited to a specific configuration, type, or number of objects.

Specific details are provided in the description above to give a thorough understanding of the aspects and examples provided herein. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details. For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form so as not to obscure the aspects in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.

Individual aspects may be described above as a process or method that is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process is terminated when its operations are completed, but it could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.

Processes and methods according to the above-described examples can be implemented using computer-executable instructions that are stored on or otherwise available from computer-readable media. Such instructions can include, for example, instructions and data that cause or otherwise configure a general-purpose computer, a special-purpose computer, or a processing device to perform a certain function or group of functions. Portions of the computer resources used can be accessible over a network. The computer-executable instructions may be, for example, binaries or intermediate format instructions such as assembly language, firmware, source code, and the like.

The term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data can be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape; optical storage media such as a compact disc (CD) or digital versatile disc (DVD); flash memory; memory or memory devices; USB devices provided with non-volatile memory; networked storage devices; or any suitable combination thereof. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, an engine, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, and the like may be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, or the like.

In some aspects, the computer-readable storage devices, media, and memories can include a cable or wireless signal containing a bitstream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.

Devices implementing processes and methods according to these disclosures can include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer program product) may be stored in a computer-readable or machine-readable medium. One or more processors may perform the necessary tasks. Typical examples of form factors include laptops, smartphones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. The functionality described herein can also be embodied in peripherals or add-in cards. By way of further example, such functionality can also be implemented on a circuit board among different chips or among different processes executing in a single device.

The instructions, the media for conveying such instructions, the computing resources for executing them, and the other structures for supporting such computing resources are example means for providing the functions described in this disclosure.

In the foregoing description, aspects of the application are described with reference to specific aspects thereof, but those skilled in the art will recognize that the application is not limited thereto. Thus, while illustrative aspects of the application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art. Various features and aspects of the above-described application may be used individually or jointly. Further, the aspects can be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. For purposes of illustration, methods were described in a particular order. It should be appreciated that in alternative aspects, the methods may be performed in a different order than that described.

Those skilled in the art will appreciate that the less than ("<") and greater than (">") symbols or terminology used herein can be replaced with less than or equal to ("≤") and greater than or equal to ("≥") symbols, respectively, without departing from the scope of this description.

Where components are described as being "configured to" perform certain operations, such configuration can be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., microprocessors or other suitable electronic circuits) to perform the operation, or by any combination thereof.

The phrase "coupled to" refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component either directly or indirectly (e.g., connected to the other component over a wired or wireless connection and/or another suitable communication interface).

Claim language or other language reciting "at least one of" a set and/or "one or more" of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting "at least one of A and B" or "at least one of A or B" means A, B, or A and B. In another example, claim language reciting "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language "at least one of" a set and/or "one or more" of a set does not limit the set to the items listed in the set. For example, claim language reciting "at least one of A and B" or "at least one of A or B" can mean A, B, or A and B, and can additionally include items not listed in the set of A and B. The phrases "at least one" and "one or more" are used interchangeably herein.

Claim language or other language reciting "at least one processor configured to," "at least one processor being configured to," "one or more processors configured to," "one or more processors being configured to," or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s). For example, claim language reciting "at least one processor configured to: X, Y, and Z" means that a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z; or that a group of multiple processors works together to perform operations X, Y, and Z. In another example, claim language reciting "at least one processor configured to: X, Y, and Z" can mean that any single processor may perform only at least a subset of operations X, Y, and Z.

Where reference is made to one or more elements performing functions (e.g., steps of a method), one element may perform all of the functions, or more than one element may collectively perform the functions. When more than one element collectively performs the functions, each function need not be performed by each of those elements (e.g., different functions may be performed by different elements) and/or each function need not be performed in whole by only one element (e.g., different elements may perform different sub-functions of a function). Similarly, where reference is made to one or more elements configured to cause another element (e.g., a device) to perform functions, one element may be configured to cause the other element to perform all of the functions, or more than one element may collectively be configured to cause the other element to perform the functions.

Where reference is made to an entity (e.g., any entity or device described herein) performing functions or being configured to perform functions (e.g., steps of a method), the entity may be configured to cause one or more elements (individually or collectively) to perform the functions. The one or more components of the entity may include at least one memory, at least one processor, at least one communication interface, another component configured to perform one or more (or all) of the functions, and/or any combination thereof. Where reference is made to the entity performing functions, the entity may be configured to cause one component to perform all of the functions, or to cause more than one component to collectively perform the functions. When the entity is configured to cause more than one component to collectively perform the functions, each function need not be performed by each of those components (e.g., different functions may be performed by different components) and/or each function need not be performed in whole by only one component (e.g., different components may perform different sub-functions of a function).

The various illustrative logical blocks, modules, engines, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, engines, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices, such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses, including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized, at least in part, by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. The computer-readable medium may include memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like. Additionally or alternatively, the techniques may be realized, at least in part, by a computer-readable communication medium, such as propagated signals or waves, that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer.

The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such a processor may be configured to perform any of the techniques described in this disclosure. A general-purpose processor may be a microprocessor; alternatively, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures, any combination of the foregoing structures, or any other structure or apparatus suitable for implementation of the techniques described herein.

Exemplary aspects of the present disclosure include:

Aspect 1. A method of processing image data of a current frame, the method comprising: determining an optical flow between a current frame having a first resolution and a first previous frame having the first resolution; warping a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and processing, using a diffusion machine learning model, a noise frame, the current frame, and the warped previous frame to generate an output frame having the second resolution.
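The data flow of Aspect 1 can be sketched as follows. This is only an illustration under stated assumptions: the names `enhance_frame`, `estimate_flow`, `warp`, and `diffusion_model` are hypothetical, and the three components are passed in as callables because the aspect does not fix how any of them is implemented.

```python
from typing import Callable, List

Frame = List[List[float]]  # hypothetical representation: rows of pixel values


def enhance_frame(
    current_lr: Frame,
    previous_lr: Frame,
    previous_hr: Frame,
    noise_frame: Frame,
    estimate_flow: Callable[[Frame, Frame], object],
    warp: Callable[[Frame, object], Frame],
    diffusion_model: Callable[[Frame, Frame, Frame], Frame],
) -> Frame:
    """Run the three steps of Aspect 1 with caller-supplied components."""
    # Step 1: optical flow between two frames at the first (lower) resolution.
    flow = estimate_flow(current_lr, previous_lr)
    # Step 2: warp the previous frame held at the second (higher) resolution.
    warped_previous_hr = warp(previous_hr, flow)
    # Step 3: the diffusion model receives the noise frame, the current frame,
    # and the warped previous frame, and returns the higher-resolution output.
    return diffusion_model(noise_frame, current_lr, warped_previous_hr)
```

Note that the flow is estimated only from the two lower-resolution frames, while the warp is applied to the higher-resolution previous frame.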

Aspect 2. The method of Aspect 1, further comprising warping a previous noise frame based on the determined optical flow to generate a warped noise frame, wherein the output frame is generated further based on processing the warped noise frame using the diffusion machine learning model.
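A minimal sketch of the noise warping in Aspect 2, assuming the flow has already been brought to the noise frame's resolution and holds integer displacements. The function name `warp_noise`, the `(dx, dy)` layout, the "vector points at the source pixel" convention, and the border clamping are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Frame = List[List[float]]
Flow = List[List[Tuple[int, int]]]  # hypothetical: one integer (dx, dy) per pixel


def warp_noise(previous_noise: Frame, flow: Flow) -> Frame:
    """Rearrange a noise frame with the same flow used for the image frame."""
    height, width = len(previous_noise), len(previous_noise[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            # Assumed convention: the vector points at the source pixel;
            # coordinates are clamped at the frame border.
            src_x = min(max(x + dx, 0), width - 1)
            src_y = min(max(y + dy, 0), height - 1)
            row.append(previous_noise[src_y][src_x])
        warped.append(row)
    return warped


rng = random.Random(0)  # fixed seed so the example is repeatable
previous_noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]
shift_left = [[(1, 0)] * 4 for _ in range(2)]  # every pixel reads its right neighbor
warped_noise = warp_noise(previous_noise, shift_left)
```

Because the warped noise frame reuses the previous frame's noise values at motion-compensated positions, the noise pattern follows the scene content from one frame to the next.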

Aspect 3. The method of any one of Aspects 1 or 2, wherein the optical flow comprises an individual motion vector for each pixel of the current frame.

Aspect 4. The method of Aspect 3, wherein warping the second previous frame comprises adjusting each pixel of the second previous frame by an amount indicated by each individual motion vector of the optical flow.
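The per-pixel warp of Aspects 3 and 4 can be illustrated as below. The sketch assumes an integer scale factor between the two resolutions, integer motion vectors, nearest-neighbor lookup, and a backward-warp convention in which each vector points from an output pixel to its source pixel. None of these choices is stated in the aspects, and `warp_high_res` is a hypothetical name.

```python
from typing import List, Tuple

Frame = List[List[float]]
Flow = List[List[Tuple[int, int]]]  # one (dx, dy) motion vector per low-res pixel


def warp_high_res(previous_hr: Frame, flow_lr: Flow, scale: int) -> Frame:
    """Warp a high-resolution frame with a flow estimated at low resolution."""
    height, width = len(previous_hr), len(previous_hr[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Each high-res pixel uses the vector of the low-res pixel it
            # falls inside; the displacement is scaled to high-res units.
            dx, dy = flow_lr[y // scale][x // scale]
            src_x = min(max(x + dx * scale, 0), width - 1)
            src_y = min(max(y + dy * scale, 0), height - 1)
            warped[y][x] = previous_hr[src_y][src_x]
    return warped


previous_hr = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
flow_lr = [[(1, 0), (-1, 0)]]  # a 1x2 low-res flow for a 2x4 high-res frame
warped_hr = warp_high_res(previous_hr, flow_lr, scale=2)
```

In this example the left half of the output reads from the right half of the previous frame and vice versa, because a one-pixel low-resolution displacement becomes a two-pixel displacement at the higher resolution.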

Aspect 5. The method of any one of Aspects 1 to 4, wherein the output frame is an upsampled version of the current frame.

Aspect 6. The method of any one of Aspects 1 to 5, wherein the second previous frame is an upsampled version of the first previous frame.

Aspect 7. The method of any one of Aspects 1 to 6, wherein the first previous frame is a frame immediately preceding the current frame in a video.

Aspect 8. The method of Aspect 7, further comprising sequentially processing a plurality of consecutive frames of the video.
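Sequential processing as in Aspect 8 can be sketched as a loop in which each output frame becomes the higher-resolution previous frame for the next step. The names `enhance_video`, `enhance_frame`, and `bootstrap` are hypothetical, and the handling of the very first frame (which has no previous frame) is an assumption that the aspects do not address.

```python
from typing import Any, Callable, Iterable, Iterator


def enhance_video(
    frames_lr: Iterable[Any],
    enhance_frame: Callable[[Any, Any, Any], Any],
    bootstrap: Callable[[Any], Any],
) -> Iterator[Any]:
    """Yield one enhanced frame per input frame, in order."""
    previous_lr = None
    previous_hr = None
    for current_lr in frames_lr:
        if previous_lr is None:
            # Assumption: the first frame is enhanced without temporal context.
            output_hr = bootstrap(current_lr)
        else:
            # Later frames use the immediately preceding low-res frame and
            # the immediately preceding enhanced output.
            output_hr = enhance_frame(current_lr, previous_lr, previous_hr)
        yield output_hr
        previous_lr, previous_hr = current_lr, output_hr
```

Writing the loop as a generator keeps only one previous frame pair in memory, which matches the frame-by-frame dependency described in Aspects 7 and 8.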

Aspect 9. The method of any one of Aspects 1 to 8, further comprising at least one of: reusing a previous diffusion latent of the diffusion machine learning model; reusing at least one sampling step between time steps of the diffusion machine learning model; or skipping one or more sampling steps between time steps of the diffusion machine learning model.
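One common way to skip sampling steps is to visit only a strided subset of the time steps. The sketch below illustrates that general idea and is not taken from the disclosure: the function name `strided_timesteps`, the fixed stride, and the choice to always finish at time step 0 are assumptions.

```python
from typing import List


def strided_timesteps(num_timesteps: int, stride: int) -> List[int]:
    """Return a descending schedule that skips stride - 1 steps between visits."""
    if num_timesteps < 1 or stride < 1:
        raise ValueError("num_timesteps and stride must be positive")
    steps = list(range(num_timesteps - 1, -1, -stride))
    # Assumption: sampling always ends at the final (least noisy) time step.
    if steps[-1] != 0:
        steps.append(0)
    return steps


full_schedule = strided_timesteps(10, 1)     # visits every time step
skipped_schedule = strided_timesteps(10, 3)  # visits roughly every third step
```

A shorter schedule means fewer evaluations of the diffusion model per frame, at the cost of a coarser sampling trajectory.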

Aspect 10. The method of any one of Aspects 1 to 9, wherein the noise frame is sampled from a Gaussian noise distribution.
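Sampling a noise frame as in Aspect 10 can be illustrated with the Python standard library. The zero mean and unit standard deviation are assumptions (the aspect only says the distribution is Gaussian), and `sample_noise_frame` is a hypothetical name.

```python
import random
from typing import List, Optional


def sample_noise_frame(
    height: int, width: int, seed: Optional[int] = None
) -> List[List[float]]:
    """Draw one independent Gaussian sample per pixel."""
    rng = random.Random(seed)  # a seed makes the frame reproducible
    # Assumption: standard normal noise (mean 0.0, standard deviation 1.0).
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


noise_frame = sample_noise_frame(3, 4, seed=0)
```

The noise frame would be sized to match the second (higher) resolution, since the diffusion model produces its output at that resolution.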

Aspect 11. A device for processing image data of a current frame, the device comprising: at least one memory configured to store the image data; and at least one processor coupled to the at least one memory, wherein the at least one processor is configured to: determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution; warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and process, using a diffusion machine learning model, a noise frame, the current frame, and the warped previous frame to generate an output frame having the second resolution.

Aspect 12. The device of Aspect 11, wherein the at least one processor is configured to: warp a previous noise frame based on the determined optical flow to generate a warped noise frame; and generate the output frame further based on processing the warped noise frame using the diffusion machine learning model.

Aspect 13. The device of any one of Aspects 11 or 12, wherein the optical flow comprises an individual motion vector for each pixel of the current frame.

Aspect 14. The device of Aspect 13, wherein, to warp the second previous frame, the at least one processor is configured to adjust each pixel of the second previous frame by an amount indicated by each individual motion vector of the optical flow.

Aspect 15. The device of any one of Aspects 11 to 14, wherein the output frame is an upsampled version of the current frame.

Aspect 16. The device of any one of Aspects 11 to 15, wherein the second previous frame is an upsampled version of the first previous frame.

Aspect 17. The device of any one of Aspects 11 to 16, wherein the first previous frame is a frame immediately preceding the current frame in a video.

Aspect 18. The device of Aspect 17, wherein the at least one processor is configured to sequentially process a plurality of consecutive frames of the video.

Aspect 19. The device of any one of Aspects 11 to 18, wherein the at least one processor is configured to at least one of: reuse a previous diffusion latent of the diffusion machine learning model; reuse at least one sampling step between time steps of the diffusion machine learning model; or skip one or more sampling steps between time steps of the diffusion machine learning model.

Aspect 20. The device of any one of Aspects 11 to 19, wherein the noise frame is sampled from a Gaussian noise distribution.

Aspect 21. A non-transitory computer-readable storage medium comprising instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform operations according to any one of Aspects 1 to 10.

Aspect 22. A device for processing image data, the device comprising one or more means for performing operations according to any one of Aspects 1 to 10.

Claims

A device for processing image data of a current frame, the device comprising:
at least one memory configured to store the image data; and
at least one processor coupled to the at least one memory, wherein the at least one processor is configured to:
determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution;
warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and
process a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

The device of claim 1, wherein the at least one processor is configured to:
warp a previous noise frame based on the determined optical flow to generate a warped noise frame; and
generate the output frame further based on processing the warped noise frame using the diffusion machine learning model.

A device for processing image data of a current frame, wherein the optical flow comprises an individual motion vector for each pixel of the current frame.

The device of claim 3, wherein, to warp the second previous frame, the at least one processor is configured to adjust each pixel of the second previous frame by an amount indicated by each individual motion vector of the optical flow.

A device for processing image data of a current frame, wherein the output frame is an upsampled version of the current frame.

A device for processing image data of a current frame, wherein the second previous frame is an upsampled version of the first previous frame.

A device for processing image data of a current frame, wherein the first previous frame is a frame immediately preceding the current frame in a video.

A device for processing image data of a current frame, wherein the at least one processor is configured to sequentially process a plurality of consecutive frames of the video.

The device of claim 1, wherein the at least one processor is configured to at least one of:
reuse a previous diffusion latent of the diffusion machine learning model;
reuse at least one sampling step between time steps of the diffusion machine learning model; or
skip one or more sampling steps between time steps of the diffusion machine learning model.

The device of claim 1, wherein the noise frame is sampled from a Gaussian noise distribution.

A method of processing image data of a current frame, the method comprising:
determining an optical flow between a current frame having a first resolution and a first previous frame having the first resolution;
warping a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and
processing a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

The method of claim 11, further comprising:
warping a previous noise frame based on the determined optical flow to generate a warped noise frame,
wherein the output frame is generated further based on processing the warped noise frame using the diffusion machine learning model.

A method for processing image data of a current frame, wherein the optical flow comprises an individual motion vector for each pixel of the current frame.

A method for processing image data of a current frame, wherein the step of warping the second previous frame comprises the step of adjusting each pixel of the second previous frame by an amount indicated by each individual motion vector of the optical flow.

A method for processing image data of a current frame, wherein the output frame is an upsampled version of the current frame.

A method for processing image data of a current frame, wherein the second previous frame is an upsampled version of the first previous frame.

A method for processing image data of a current frame, wherein the first previous frame is a frame immediately preceding the current frame in a video.

The method of claim 17, further comprising sequentially processing a plurality of consecutive frames of the video.

The method of claim 11, further comprising at least one of:
reusing a previous diffusion latent of the diffusion machine learning model;
reusing at least one sampling step between time steps of the diffusion machine learning model; or
skipping one or more sampling steps between time steps of the diffusion machine learning model.

The method of claim 11, wherein the noise frame is sampled from a Gaussian noise distribution.

A non-transitory computer-readable storage medium comprising instructions stored thereon, wherein the instructions, when executed by at least one processor, cause the at least one processor to:
determine an optical flow between a current frame having a first resolution and a first previous frame having the first resolution;
warp a second previous frame having a second resolution based on the determined optical flow to generate a warped previous frame having the second resolution, wherein the second resolution is higher than the first resolution; and
process a noise frame, the current frame, and the warped previous frame using a diffusion machine learning model to generate an output frame having the second resolution.

The non-transitory computer-readable storage medium of claim 21, wherein the instructions, when executed by the at least one processor, cause the at least one processor to:
warp a previous noise frame based on the determined optical flow to generate a warped noise frame; and
generate the output frame further based on processing the warped noise frame using the diffusion machine learning model.

The non-transitory computer-readable storage medium of claim 21, wherein the optical flow comprises an individual motion vector for each pixel of the current frame.

A non-transitory computer-readable storage medium, wherein, in order to warp the second previous frame, the instructions, when executed by the at least one processor, cause the at least one processor to adjust each pixel of the second previous frame by an amount indicated by each individual motion vector of the optical flow.

The non-transitory computer-readable storage medium of claim 21, wherein the output frame is an upsampled version of the current frame.

The non-transitory computer-readable storage medium of claim 21, wherein the second previous frame is an upsampled version of the first previous frame.

The non-transitory computer-readable storage medium of claim 21, wherein the first previous frame is a frame immediately preceding the current frame in a video.

The non-transitory computer-readable storage medium of claim 27, wherein the instructions, when executed by the at least one processor, cause the at least one processor to sequentially process a plurality of consecutive frames of the video.

The non-transitory computer-readable storage medium of claim 21, wherein the instructions, when executed by the at least one processor, cause the at least one processor to at least one of:
reuse a previous diffusion latent of the diffusion machine learning model;
reuse at least one sampling step between time steps of the diffusion machine learning model; or
skip one or more sampling steps between time steps of the diffusion machine learning model.

The non-transitory computer-readable storage medium of claim 21, wherein the noise frame is sampled from a Gaussian noise distribution.