WO2025057379A1 - Posture estimation system, posture estimation method, and program

Info

Publication number: WO2025057379A1
Application number: PCT/JP2023/033602
Authority: WO
Inventors: 克彦松浦; 祥悟佐藤; 泰史奥村; 徹悟稲田
Original assignee: Sony Interactive Entertainment Inc
Current assignee: Sony Interactive Entertainment Inc
Priority date: 2023-09-14
Filing date: 2023-09-14
Publication date: 2025-03-20
Anticipated expiration: 2026-03-14

Abstract

According to the present invention, a posture estimation system that aims to estimate posture from key points more appropriately acquires information indicating a portion of an object that is hidden by a hand (S204, S402); determines, on the basis of that information, the three-dimensional positions of a plurality of key points for estimating the posture of the object (S208, S405); trains a machine learning model for estimating the positions of the determined key points in an input image (S203, S407); acquires estimated key point positions in an image capturing the object and the hand, on the basis of the output that the trained machine learning model produces for that image; and determines the estimated posture of the object in three-dimensional space on the basis of the estimated key point positions.

Description

Posture estimation system, posture estimation method, and program

The present invention relates to a posture estimation system, a posture estimation method, and a program.

There is a technique for estimating the positions of an object's keypoints from an image of the object, and then estimating the object's pose from the estimated keypoints. The three-dimensional positions of the object's keypoints are determined in advance. For example, a machine learning model is trained to predict the positions of keypoints in an image, and the trained model is then used to estimate the keypoint positions in a captured image.

Sida Peng et al. published the paper "PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation" at the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). The paper discloses training a machine learning model with training data that includes input images generated from a 3D model and correct output images, and calculating the image positions of the keypoints used for pose estimation based on the model's output when a captured image is input.

When estimating pose, there are cases where it is difficult to estimate keypoint positions from an image, for example when the object is hidden by a hand. This can reduce the accuracy of pose estimation or its processing speed.

The present invention was made in view of the above circumstances, and its purpose is to provide a technique that enables posture estimation to be performed more appropriately.

To solve the above problem, the posture estimation system of the present invention includes one or more processors. The one or more processors acquire information indicating a portion of an object that is hidden by a hand; determine, based on the acquired information, the three-dimensional positions of a plurality of key points for estimating the posture of the object; train a machine learning model for estimating the positions of the determined key points in an input image; acquire estimated key point positions in an image of the object and a hand, based on the output when that image is input to the trained machine learning model; and determine an estimated posture of the object in three-dimensional space based on the estimated key point positions.

In one embodiment of the present invention, the information indicating the portion hidden by the hand is a plurality of images in which the object is held by the hand, and the one or more processors may determine the three-dimensional positions of the plurality of key points based on the frequency with which a plurality of key point candidates, determined by a predetermined procedure, are hidden by the hand in the plurality of images in which the object is held by the hand.

In one embodiment of the present invention, the information indicating the portion hidden by the hand may be a portion of the object that is designated by the user and that is held by the hand.

In one embodiment of the present invention, the information indicating the portion hidden by the hand is information indicating a portion of the object that is designated by the user and with which a tag is associated. The one or more processors may determine, based on an image of the object and a hand, whether the portion associated with the tag is being operated by the hand, and may execute processing according to the tag if it is determined that the portion is being operated.

In one embodiment of the present invention, the one or more processors may determine, based on an image of the object and a hand, whether the portion associated with the tag is being operated by the hand, and, if it is determined that the portion is being operated, may execute processing according to the tag based on the magnitude of the operation by the hand.

The posture estimation method of the present invention includes, by one or more processors: acquiring information indicating a portion of an object that is hidden by a hand; determining, based on the acquired information, the three-dimensional positions of a plurality of key points for estimating the posture of the object; acquiring a trained machine learning model for estimating the positions of the determined key points in an input image; acquiring estimated key point positions in an image of the object and a hand, based on the output when that image is input to the acquired machine learning model; and estimating the posture of the object in three-dimensional space based on the estimated key point positions.

The program of the present invention causes a computer to function as: an acquisition means for acquiring information indicating a portion of an object that is hidden by a hand; a key point determination means for determining, based on the acquired information, the three-dimensional positions of a plurality of key points for estimating the posture of the object; a model acquisition means for acquiring a trained machine learning model for estimating the positions of the determined key points in an input image; a position acquisition means for acquiring estimated key point positions in an image of the object and a hand, based on the output when that image is input to the acquired machine learning model; and a posture estimation means for estimating the posture of the object in three-dimensional space based on the estimated key point positions.

The present invention makes it possible to estimate posture using key points more appropriately.

The drawings, in order, show the following.
A diagram illustrating an example of the configuration of an information processing system according to an embodiment of the present invention.
A block diagram showing an example of the functions implemented in the information processing system according to an embodiment of the present invention.
A diagram showing an example of a captured image of an object.
A diagram illustrating an example of tag regions associated with functional tags.
A flow diagram schematically showing the processing of the information processing system.
A flow diagram showing an example of the process of determining keypoints and training the estimation model.
A diagram illustrating an example of keypoint candidates generated from an object.
A flow diagram showing an example of the process of generating training data and training the estimation model.
A diagram showing an example of ground truth data.
A flow diagram showing another example of the process of determining keypoints and training the estimation model.

An embodiment of the present invention is described in detail below with reference to the drawings. This embodiment describes a case in which the invention is applied to an information processing system that receives an image of an object, estimates the object's posture, and draws an image according to the estimated posture.

This information processing system includes a machine learning model that outputs information indicating the estimated posture of an object from an image in which the object is captured.

FIG. 1 is a diagram showing an example of the configuration of an information processing system according to one embodiment of the present invention. The information processing system according to this embodiment includes an information processing device 10. The information processing device 10 is a computer such as a game console, a personal computer, or a VR headset. As shown in FIG. 1, the information processing device 10 includes, for example, a processor 11, a storage unit 12, a communication unit 13, an operation unit 16, a display unit 18, and an imaging unit 20. The information processing system may consist of a single information processing device 10 or of multiple devices including the information processing device 10, and, for example, the imaging unit 20 or the display unit 18 may be located in a housing separate from the information processing device 10.

The processor 11 is a program-controlled device, such as a CPU, that operates according to a program installed in the information processing device 10.

The storage unit 12 consists of at least one of memory elements such as ROM and RAM and an external storage device such as a solid-state drive. The storage unit 12 stores, among other things, the programs executed by the processor 11.

The communication unit 13 is a communication interface for wired or wireless communication, such as a network interface card, and exchanges data with other computers and terminals via a computer network such as the Internet.

The operation unit 16 is an input device such as a keyboard, mouse, touch panel, or game console controller. It receives operation input from the user and outputs a signal indicating the content of that input to the processor 11.

The display unit 18 is a display device such as a liquid crystal display, and displays various images according to instructions from the processor 11. The display unit 18 may be built into a VR headset, or may be a device that outputs a video signal to an external display device.

The imaging unit 20 is an imaging device that includes an image sensor. The imaging unit 20 may be a camera capable of acquiring visible RGB images, or a camera capable of acquiring visible RGB images together with depth information synchronized with those images. The imaging unit 20 in this embodiment may be, for example, a camera capable of capturing moving images, or a camera built into a VR headset. The imaging unit 20 may be external to the information processing device 10, in which case the two may be connected via the communication unit 13 or the input/output unit described below.

The information processing device 10 may also include audio input/output devices such as a microphone and a speaker. It may further include, for example, a communication interface such as a network board, an optical disk drive that reads optical disks such as DVD-ROMs and Blu-ray (registered trademark) disks, and an input/output unit (a USB (Universal Serial Bus) port) for exchanging data with external devices.

FIG. 2 is a block diagram showing an example of the functions implemented in the information processing system according to one embodiment of the present invention. As shown in FIG. 2, the information processing system functionally includes a posture estimation unit 25, a tag processing unit 29, an image rendering unit 30, a shape model acquisition unit 31, an occlusion information acquisition unit 32, and a learning control unit 35. The posture estimation unit 25 functionally includes an estimation model 26, a position acquisition unit 27, and a posture determination unit 28. The learning control unit 35 functionally includes a key point determination unit 36 and an estimation learning unit 37. The estimation model 26 is a type of machine learning model.

These functions are mainly implemented by the processor 11 and the storage unit 12. More specifically, they may be implemented by having the processor 11 execute a program that is installed in the information processing device 10, which is a computer, and that includes execution instructions corresponding to these functions. The program may be supplied to the information processing device 10 via a computer-readable information storage medium such as an optical disk, a magnetic disk, or a flash memory, or via the Internet or the like.

Note that the information processing system according to this embodiment does not necessarily have to implement all of the functions shown in FIG. 2, and may implement functions other than those shown in FIG. 2.

The posture estimation unit 25 estimates the posture of the target object based on the information output when an input image is input to the estimation model 26. The input image is an image of the object captured by the imaging unit 20. The estimation model 26 is a machine learning model trained with training data; when input data is input, the trained estimation model 26 outputs data as an estimation result.

FIG. 3 is a diagram showing an example of a captured image of an object. The target object 51 shown in FIG. 3 is held, for example, by a hand 53, and is captured by the imaging unit 20.

Information on an image of the target object is input to the trained estimation model 26, and the estimation model 26 outputs information indicating the positions of the keypoints used to estimate the posture of the object. More specifically, for each of the plurality of keypoints set for the object, the estimation model 26 outputs an image indicating the position of that keypoint. A separate estimation model 26 may exist for each keypoint or for each keypoint candidate.

The training data for the estimation model 26 includes multiple training images rendered from a three-dimensional shape model of the target object, and ground truth data indicating the positions of the object's keypoints in the training images. Keypoints are virtual points within the object that are used to calculate the posture. The data output by the estimation model 26 may be a position image in which each point indicates the positional relationship between that point and a keypoint (for example, the relative direction), or a position image that is a heat map in which each point indicates the probability that a keypoint exists there. The training of the estimation model 26 is described in detail later.

The input image may be a processed version of the image of the object captured by the imaging unit 20. For example, it may be an image in which the area other than the target object is masked, or an image that has been enlarged or reduced so that the object has a predetermined size in the image.

The position acquisition unit 27 determines the two-dimensional positions of the keypoints in the input image based on the output of the trained estimation model 26 when an image of the object and a hand is input to it. For example, the position acquisition unit 27 determines candidates for the two-dimensional position of a keypoint in the input image based on the position image output from the estimation model 26. The position acquisition unit 27 calculates a candidate point for the keypoint from each combination of two arbitrary points in the position image and generates, for each of the candidate points, a score indicating how well it agrees with the directions indicated by the points of the position image. The position acquisition unit 27 may take the candidate point with the highest score as the estimated position of the keypoint. The position acquisition unit 27 repeats this process for each keypoint.
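The candidate generation and scoring described above can be sketched as follows. This is a minimal illustration, not the implementation in the specification: it assumes the position image is given as a dictionary from pixel coordinates to unit vectors pointing toward the keypoint, samples pixel pairs at random rather than enumerating every combination, and scores a candidate by counting pixels whose vector has a cosine of at least `threshold` with the direction to the candidate. The names `intersect`, `score`, `vote_keypoint`, `num_pairs`, and `threshold` are hypothetical.

```python
import math
import random


def intersect(p1, d1, p2, d2):
    """Intersection of the lines p1 + t*d1 and p2 + s*d2, or None if parallel."""
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-9:
        return None
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / cross
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])


def score(candidate, field, threshold=0.99):
    """Count pixels whose unit vector points toward the candidate (assumed metric)."""
    votes = 0
    for (px, py), (dx, dy) in field.items():
        vx, vy = candidate[0] - px, candidate[1] - py
        norm = math.hypot(vx, vy)
        if norm > 0 and (vx * dx + vy * dy) / norm >= threshold:
            votes += 1
    return votes


def vote_keypoint(field, num_pairs=50, seed=0):
    """Return the best-scoring candidate point and its vote count."""
    rng = random.Random(seed)
    pixels = list(field)
    best, best_votes = None, -1
    for _ in range(num_pairs):
        a, b = rng.sample(pixels, 2)
        # Each pixel pair proposes one candidate: where their two rays cross.
        candidate = intersect(a, field[a], b, field[b])
        if candidate is None:
            continue
        votes = score(candidate, field)
        if votes > best_votes:
            best, best_votes = candidate, votes
    return best, best_votes
```

Running `vote_keypoint` once per keypoint mirrors the per-keypoint repetition in the text.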

The posture determination unit 28 estimates the posture of the object based on information indicating the two-dimensional positions of the keypoints in the input image and information indicating the three-dimensional positions of the keypoints in the three-dimensional shape model of the target object, and outputs posture data indicating the estimated posture. The posture of the object is estimated using a known algorithm. For example, it may be estimated using a solver for the Perspective-n-Point (PnP) problem, such as EPnP. The posture determination unit 28 may estimate not only the posture of the object but also the position of the object in the input image, and the posture data may include information indicating that position.

The camera intrinsic parameters of the imaging unit 20 are assumed to have been obtained in advance through calibration. These parameters are used when solving the PnP problem.
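To show how the intrinsic parameters enter the PnP problem, the sketch below implements only the forward model: it projects 3D keypoints into the image for a given rotation and translation and measures the reprojection error that a PnP solver tries to make small. It is not a PnP solver. A pinhole camera without lens distortion is assumed, and the names `project`, `reprojection_error`, and the `(fx, fy, cx, cy)` tuple layout are illustrative choices.

```python
import math


def project(point_obj, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point in object coordinates to pixel coordinates."""
    # Camera-space point: X_cam = R * X_obj + t
    x, y, z = (
        sum(rotation[r][c] * point_obj[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Pinhole model with focal lengths (fx, fy) and principal point (cx, cy).
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_error(keypoints_3d, keypoints_2d, rotation, translation, intrinsics):
    """Mean pixel distance between observed and reprojected keypoints."""
    total = 0.0
    for p3, p2 in zip(keypoints_3d, keypoints_2d):
        u, v = project(p3, rotation, translation, *intrinsics)
        total += math.hypot(u - p2[0], v - p2[1])
    return total / len(keypoints_3d)
```

In practice an existing solver would normally be used for the estimation itself, for example OpenCV's `cv2.solvePnP` with the `cv2.SOLVEPNP_EPNP` flag, with a function like `reprojection_error` serving as a check on the result.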

The details of the estimation model 26, the position acquisition unit 27, and the posture determination unit 28 may be those described in the paper "PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation".

The tag processing unit 29 determines whether the portion of the object with which a functional tag is associated is being operated by the hand, based on information indicating that portion and on an image of the object and the hand. If the portion is determined to be operated, the tag processing unit 29 executes processing according to the functional tag. In that case, the tag processing unit 29 may execute the processing according to the tag based on the magnitude of the operation by the hand.
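One possible shape for this tag dispatch is sketched below. The specification does not state how "operated" or "magnitude" is computed, so everything here is an assumption: each tag region is modeled as a sphere in object coordinates, a fingertip position is assumed to be already available in the same coordinates (for example from hand tracking combined with the estimated posture), and the magnitude is taken as how deep the fingertip is inside the sphere. `TagRegion`, `operation_magnitude`, and `process_tags` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class TagRegion:
    tag: str       # functional tag name, e.g. "switch" (hypothetical)
    center: tuple  # region center in object coordinates
    radius: float  # region radius in the same units


def operation_magnitude(region, fingertip):
    """0.0 when the fingertip is outside the region, rising to 1.0 at its center."""
    distance = math.dist(region.center, fingertip)
    return max(0.0, 1.0 - distance / region.radius)


def process_tags(regions, fingertip, handlers):
    """Run the handler of every tag whose region is being operated."""
    results = {}
    for region in regions:
        magnitude = operation_magnitude(region, fingertip)
        if magnitude > 0.0 and region.tag in handlers:
            # The handler receives the magnitude so it can scale its effect.
            results[region.tag] = handlers[region.tag](magnitude)
    return results
```

A handler for a switch-like tag might ignore the magnitude, while one for a slider-like tag might map it to a continuous value.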

FIG. 4 is a diagram illustrating an example of tag regions 61, 62 associated with functional tags. Each of the tag regions 61, 62 is a partial region of the target object 51. The tag regions 61, 62 may be regions on the surface of the target object 51, or three-dimensional regions that include its interior. The tag regions 61, 62 are associated with mutually different functional tags. For example, the tag region 61 may be associated with a switch functional tag, and the tag region 62 with a functional tag for a part to be grasped. Each of the tag regions 61, 62 may be a region that the user can grasp, a region that displays a virtual touch screen and allows touch interaction, or a region from which a light source or particles are emitted for use as a light, and each may be associated with the functional tag corresponding to that region.

The image rendering unit 30 renders an image based on the estimated posture of the object. The image rendering unit 30 may render a three-dimensional image of the object based on the estimated posture of the object and the three-dimensional shape model. The image rendering unit 30 may also determine, based on the estimated posture of the object, the posture of an object to be rendered, such as an object in a VR image, and render that object.

The shape model acquisition unit 31 acquires multiple captured images of the target object taken by the imaging unit 20, and generates and acquires a three-dimensional shape model of the object from those images. More specifically, the shape model acquisition unit 31 extracts, from each of the captured images, multiple feature vectors indicating local features, and determines the three-dimensional position of the point from which each feature vector was extracted, using the mutually corresponding feature vectors extracted from the multiple captured images and the positions in the captured images at which they were extracted. The shape model acquisition unit 31 then acquires the three-dimensional shape model of the object based on those three-dimensional positions. This is a well-known method that is also used in software implementing so-called SfM and Visual SLAM, so a detailed explanation is omitted.

The occlusion information acquisition unit 32 acquires information indicating the portion of the target object that is hidden by a hand. The hand is assumed to be holding the object. More specifically, the information indicating the portion hidden by the hand is at least one of the following: multiple images in which the target object is held by the hand, and information indicating a portion of the target object that the user has designated as the portion held by the hand.

The occlusion information acquisition unit 32 may acquire, as the information indicating the portion hidden by the hand, multiple images captured by the imaging unit 20 in which the target object is held by the hand.

The occlusion information acquisition unit 32 may acquire, as the information indicating the portion hidden by the hand, information indicating a portion of the object that is designated by the user and held by the hand. The tag regions 61, 62 may be designated as such portions of the object.

The occlusion information acquisition unit 32 may input information about the target object into a trained machine learning model that estimates the region held by a hand, and identify the portion of the object from the output of that model. Such machine learning models are publicly known, so a detailed description is omitted.

The learning control unit 35 determines the keypoints of the target object based on the three-dimensional shape model of the object, and trains the estimation model 26.

The key point determination unit 36 may determine the three-dimensional positions of the multiple keypoints used to estimate the posture of the target object based on the three-dimensional shape model of the target object and the information indicating the portion hidden by the hand. If the information indicating the portion hidden by the hand is multiple images in which the object is held by the hand, the key point determination unit 36 may select the keypoints based on the frequency with which each of multiple keypoint candidates, determined by a predetermined method, is hidden by the hand in those images, and may determine the three-dimensional positions of the selected keypoints.
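A frequency-based selection of this kind might look like the sketch below. It assumes that, for each image, the set of candidate indices hidden by the hand has already been computed by some separate step, and that candidates hidden less often are preferred; the text states only that the selection is "based on the frequency", so the ranking rule is an assumption. The function name and arguments are hypothetical.

```python
def select_keypoints_by_occlusion(hidden_per_image, num_candidates, num_keypoints):
    """Return the indices of the candidates hidden least often, plus all frequencies."""
    if not hidden_per_image:
        raise ValueError("at least one image is required")
    counts = [0] * num_candidates
    for hidden in hidden_per_image:
        for index in hidden:
            counts[index] += 1
    # Fraction of images in which each candidate is hidden by the hand.
    frequency = [count / len(hidden_per_image) for count in counts]
    # sorted() is stable, so ties keep the original candidate order.
    ranked = sorted(range(num_candidates), key=lambda i: frequency[i])
    return sorted(ranked[:num_keypoints]), frequency
```

For example, with four images in which candidate 1 is always hidden and candidates 3 and 4 never are, selecting three of five candidates drops candidate 1 first.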

The key point determination unit 36 may generate a set of multiple keypoint candidates using, for example, the well-known Farthest Point algorithm. The number of keypoints N may be any integer of 4 or more. The number of keypoint candidates may be any integer larger than N (for example, at least 1.3 times the number of keypoints).
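As an illustration, a greedy farthest point sampling over the vertices of the shape model could be written as below. This is one common form of the technique, assuming the model is given as a list of `(x, y, z)` tuples and that the first sample is chosen by index; `candidate_count` is a hypothetical helper that applies the "larger than N, for example at least 1.3 times N" rule from the text.

```python
import math


def farthest_point_sampling(points, k, start_index=0):
    """Pick k mutually distant points; returns their indices in selection order."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start_index]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < k:
        # The next sample is the point farthest from the current selection.
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


def candidate_count(num_keypoints, ratio=1.3):
    """An integer larger than N that is also at least ratio * N."""
    return max(num_keypoints + 1, math.ceil(num_keypoints * ratio))
```

With N = 8 keypoints, `candidate_count(8)` gives 11 candidates, which would then be narrowed down to 8 using the occlusion information.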

If the information indicating the portion hidden by the hand is information indicating a portion of the object that is designated by the user and held by the hand, the key point determination unit 36 may select multiple keypoints from the multiple keypoint candidates based on that portion, and may determine the three-dimensional positions of the selected keypoints.
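A simple way to use a user-designated grip portion is to discard the candidates that fall inside it. The sketch below assumes each designated portion is an axis-aligned box in object coordinates, given as a `(box_min, box_max)` pair; the specification does not fix the representation of the portion or the selection rule, so both are assumptions, as are the function names.

```python
def inside_box(point, box_min, box_max):
    """True if the point lies inside the axis-aligned box (bounds inclusive)."""
    return all(lo <= c <= hi for c, lo, hi in zip(point, box_min, box_max))


def keypoints_outside_grip(candidates, grip_boxes, num_keypoints):
    """Keep the first num_keypoints candidates that lie outside every grip box."""
    kept = [
        p for p in candidates
        if not any(inside_box(p, lo, hi) for lo, hi in grip_boxes)
    ]
    if len(kept) < num_keypoints:
        raise ValueError("too few candidates remain outside the gripped parts")
    return kept[:num_keypoints]
```

Generating more candidates than keypoints, as described above, is what leaves room for this kind of filtering.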

The key point determination unit 36 may select the multiple keypoints from the multiple keypoint candidates further based on the reliability of posture estimation using the keypoints. The method of calculating the reliability is described later.

The estimation learning unit 37 trains the estimation model 26, which is a machine learning model for estimating the positions of the determined keypoints in an input image. More specifically, the estimation learning unit 37 generates the training data used to train the estimation model 26, and trains the estimation model 26 with that training data.

The training data includes multiple training images rendered from the three-dimensional shape model of the target object, and ground truth data indicating the positions of the object's keypoints in the training images. At least in the initial training data, the keypoints for which the estimation learning unit 37 generates ground truth data may be those included in the set of keypoint candidates. The estimation learning unit 37 may generate ground truth data for all keypoint candidates included in the initial set, and train the estimation model 26.

More specifically, the estimation learning unit 37 may determine the positions of the keypoint candidates in a training image based on the posture of the rendered object, and generate, for each keypoint candidate, a ground truth position image according to its position. The training data may also include training images in which the object is photographed, together with position images generated from the posture of the object in those training images as estimated by so-called SfM or Visual SLAM.

　本実施形態では、推定学習部３７は、キーポイント候補のそれぞれについて推定モデル２６を学習させる。また選択されたキーポイント候補についての推定モデル２６は、キーポイントの推定モデル２６として、入力画像に対する姿勢推定（推論処理）に利用される。 In this embodiment, the estimation learning unit 37 trains an estimation model 26 for each of the keypoint candidates. The estimation model 26 for the selected keypoint candidate is used as the keypoint estimation model 26 for pose estimation (inference processing) for the input image.

　以下では、情報処理システムの処理について説明する。図５は、情報処理システムの処理を概略的に示すフロー図である。 The processing of the information processing system is explained below. Figure 5 is a flow diagram that shows an overview of the processing of the information processing system.

　はじめに情報処理システムは、対象となるオブジェクトが撮影された画像に基づいて、公知の手法により、そのオブジェクトの３次元形状モデルを生成する（Ｓ１０１）。 First, the information processing system generates a three-dimensional shape model of a target object using a known method based on an image of the object (S101).

　そして情報処理システムに含まれる学習制御部３５は、３次元形状モデルおよび手により隠される部分を示す情報に基づいて、キーポイントの位置を決定するとともに、姿勢推定のための推定モデル２６を学習させる（Ｓ１０２）。 Then, the learning control unit 35 included in the information processing system determines the positions of the key points based on the three-dimensional shape model and information indicating the parts hidden by the hand, and trains the estimation model 26 for pose estimation (S102).

　推定モデル２６が学習されると、姿勢推定部２５はオブジェクトが撮影された入力画像を学習済の推定モデル２６に入力し（Ｓ１０３）、その推定モデル２６が出力するデータを取得する。そして、その推定モデル２６の出力に基づいて、画像中のキーポイントの２次元位置を決定する（Ｓ１０４）。 Once the estimation model 26 has been trained, the posture estimation unit 25 inputs an input image of an object into the trained estimation model 26 (S103) and obtains data output by the estimation model 26. Then, based on the output of the estimation model 26, the two-dimensional positions of key points in the image are determined (S104).

　より具体的には、推定モデル２６の出力が、各点がキーポイントとの相対方向を示す位置画像である場合には、姿勢推定部２５に含まれる位置取得部２７は、位置画像の各点からキーポイントの位置の候補を算出し、その候補に基づいてキーポイントの位置を決定する。推定モデル２６の出力がヒートマップの位置画像である場合には、位置取得部２７は公知の方法により最も確率の高い点の位置をキーポイントの位置として決定する。 More specifically, if the output of the estimation model 26 is a position image in which each point indicates the relative direction to a keypoint, the position acquisition unit 27 included in the posture estimation unit 25 calculates candidates for the position of the keypoint from each point of the position image, and determines the position of the keypoint based on the candidates. If the output of the estimation model 26 is a position image of a heat map, the position acquisition unit 27 determines the position of the most probable point as the position of the keypoint using a known method.
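The two ways of determining a key point position described above can be sketched as follows. The sketch assumes that the candidates derived from the direction image are combined by a least-squares intersection of the rays, and that the heat map is a list of rows; both choices and all function names are hypothetical, since the embodiment only refers to known methods.

```python
def keypoint_from_vector_field(pixels, directions):
    """Least-squares intersection of the rays p_i + t * d_i in the image plane.

    pixels: list of (x, y); directions: list of (dx, dy) pointing at the key point.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy) in zip(pixels, directions):
        norm = (dx * dx + dy * dy) ** 0.5
        if norm == 0.0:
            continue  # a pixel without a direction casts no vote
        dx, dy = dx / norm, dy / norm
        # M = I - d d^T measures the squared distance of a point from the ray.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("directions are parallel; the key point is not determined")
    # Solve the 2x2 normal equations with Cramer's rule.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def keypoint_from_heatmap(heatmap):
    """Return (x, y) of the cell with the highest value in a row-major heat map."""
    best_value, best_xy = None, None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy
```

A voting scheme with outlier rejection could be substituted for the least-squares step; the closed-form solve is used here only to keep the sketch short.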

　姿勢推定部２５は、決定されたキーポイントの２次元位置と、３次元形状モデルにおけるそのキーポイントの３次元位置とに基づいて、オブジェクトの姿勢を推定する（Ｓ１０５）。 The posture estimation unit 25 estimates the posture of the object based on the two-dimensional positions of the determined keypoints and the three-dimensional positions of those keypoints in the three-dimensional shape model (S105).

　またタグ処理部２９は、入力画像から手のポーズを示す情報を取得する（Ｓ１０６）。手のポーズとして、撮影された手指の入力画像または３次元空間における関節点の座標が取得されてよい。手のポーズの取得において、画像と、関節点を示す正解データとにより学習された機械学習モデルが用いられてよい。入力画像は可視画像だけでなく深度画像も含んでよい。手のポーズを取得する手法は公知であるので、詳細な説明は省略する。 The tag processing unit 29 also acquires information indicating the hand pose from the input image (S106). As the hand pose, the coordinates of the joint points of the photographed hand and fingers, either in the input image or in three-dimensional space, may be acquired. In acquiring the hand pose, a machine learning model trained using images and ground truth data indicating the joint points may be used. The input image may include not only a visible image but also a depth image. Techniques for acquiring the hand pose are well known, so a detailed explanation is omitted.

　タグ処理部２９は、取得された手のポーズを示す情報に基づいて、機能タグに対応するオブジェクトの部分に手が触れているか判定する（Ｓ１０７）。タグ処理部２９は、手のいずれかの関節点の３次元座標と機能タグに関連付けられた部分（例えばタグ領域６１，６２）との距離が閾値以下であるか否かによって手が触れているか判定してもよい。 The tag processing unit 29 determines whether the hand is touching a part of the object corresponding to the function tag based on the acquired information indicating the hand pose (S107). The tag processing unit 29 may determine whether the hand is touching based on whether the distance between the three-dimensional coordinates of any joint point of the hand and a part associated with the function tag (e.g., tag areas 61, 62) is equal to or less than a threshold.
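The threshold-based touch determination can be sketched as below. The sketch assumes that each tag area is represented by a set of sampled three-dimensional surface points in the same coordinate system as the joint points; that representation and the names `touched_tags` and `tag_regions` are hypothetical.

```python
import math


def touched_tags(joint_points, tag_regions, threshold):
    """Return the function tags whose area lies within `threshold` of any joint point.

    joint_points: list of (x, y, z) hand joint coordinates.
    tag_regions: dict mapping a function tag name to a list of (x, y, z) points
                 sampled on the tag area (assumed representation).
    """
    touched = []
    for tag, region_points in tag_regions.items():
        if any(
            math.dist(joint, point) <= threshold
            for joint in joint_points
            for point in region_points
        ):
            touched.append(tag)
    return touched
```

The returned list can then be used to decide whether the processing for each function tag is executed or skipped.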

　機能タグに対応する部分に手が触れていると判定された場合には（Ｓ１０７）、タグ処理部２９は、その部分に対応する機能タグに応じた処理を実行する（Ｓ１０８）。一方、機能タグに対応する部分に手が触れていないと判定された場合には、Ｓ１０８の処理はスキップされる。 If it is determined that the hand is touching the part corresponding to the function tag (S107), the tag processing unit 29 executes processing according to the function tag corresponding to that part (S108). On the other hand, if it is determined that the hand is not touching the part corresponding to the function tag, the processing of S108 is skipped.

　その後、画像描画部３０は、推定された姿勢に基づいて画像を描画し（Ｓ１０９）、描画された画像を表示部１８に表示させる。画像の表示先は他のディスプレイであってもよい。 Then, the image drawing unit 30 draws an image based on the estimated posture (S109) and displays the drawn image on the display unit 18. The image may also be displayed on another display.

　図５の例ではＳ１０３からＳ１０９の処理が１回行われる記載となっているが、実際には、Ｓ１０３からＳ１０９の処理が繰り返し実行され、オブジェクトの移動に応じて姿勢の推定および画像の描画がリアルタイムに行われてよい。 In the example of FIG. 5, the process from S103 to S109 is described as being performed once, but in reality, the process from S103 to S109 may be executed repeatedly, and the posture may be estimated and the image may be drawn in real time in response to the movement of the object.

　図６は、キーポイントの決定および推定モデル２６の学習の処理の一例を示すフロー図である。図６は、図５におけるＳ１０２の処理をより詳細に説明する図である。 FIG. 6 is a flow diagram showing an example of the process of determining key points and training the estimation model 26. FIG. 6 explains the process of S102 in FIG. 5 in more detail.

　はじめにキーポイント決定部３６は、複数のキーポイント候補を生成する（Ｓ２０１）。より具体的には、キーポイント決定部３６は、オブジェクトの３次元形状モデル（より具体的には３次元形状モデルに含まれる頂点の情報）から、複数のキーポイント候補およびその３次元位置を、例えば公知のFarthest Point アルゴリズムにより生成してよい。 First, the key point determination unit 36 generates multiple key point candidates (S201). More specifically, the key point determination unit 36 may generate multiple key point candidates and their three-dimensional positions from a three-dimensional shape model of the object (more specifically, information on vertices included in the three-dimensional shape model), for example, by using the well-known Farthest Point algorithm.

　図７は、オブジェクトから生成されるキーポイント候補の一例を説明する図である。図７では図３，４とは別のオブジェクトを対象とする場合について生成されるキーポイントの例を示している。図７では説明の容易のため、７つのキーポイント候補Ｋ１～Ｋ７が記載されているが、より多くのキーポイント候補が生成されてよい。 FIG. 7 is a diagram illustrating an example of keypoint candidates generated from an object. FIG. 7 shows an example of keypoints generated when a different object from those in FIGS. 3 and 4 is targeted. For ease of explanation, seven keypoint candidates K1 to K7 are shown in FIG. 7, but more keypoint candidates may be generated.

　キーポイント候補が生成されると、推定学習部３７は、推定モデル２６の訓練データを生成する（Ｓ２０２）。訓練データは、３次元形状モデルに基づいてレンダリングされた訓練画像と、訓練画像におけるキーポイント候補のそれぞれの位置を示す正解データとを含む。 Once the keypoint candidates are generated, the estimation learning unit 37 generates training data for the estimation model 26 (S202). The training data includes a training image rendered based on the three-dimensional shape model and ground truth data indicating the positions of each of the keypoint candidates in the training image.

　図８は、訓練データを生成する処理の一例を示すフロー図である。図８はＳ２０２の処理をより詳細に説明する図である。はじめに推定学習部３７は、オブジェクトの３次元形状モデルのデータを取得する（Ｓ３０１）。そして、推定学習部３７はレンダリングのための複数の視点を取得する（Ｓ３０２）。より厳密には、推定学習部３７はレンダリングのための複数のカメラ視点と、カメラ視点に応じた撮影方向とを取得する。複数のカメラ視点は３次元形状モデルの原点からの距離が一定となる位置に設けられてよく、撮影方向はカメラ視点から３次元形状モデルの原点に向かう方向である。 FIG. 8 is a flow diagram showing an example of a process for generating training data. FIG. 8 is a diagram explaining the process of S202 in more detail. First, the estimation learning unit 37 acquires data of a three-dimensional shape model of an object (S301). Then, the estimation learning unit 37 acquires multiple viewpoints for rendering (S302). More precisely, the estimation learning unit 37 acquires multiple camera viewpoints for rendering and shooting directions corresponding to the camera viewpoints. The multiple camera viewpoints may be provided at positions at a constant distance from the origin of the three-dimensional shape model, and the shooting direction is a direction from the camera viewpoints toward the origin of the three-dimensional shape model.
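A minimal sketch of acquiring camera viewpoints at a constant distance from the origin of the three-dimensional shape model, each with a shooting direction toward the origin, is shown below. The Fibonacci-lattice placement and the function name are assumptions; the embodiment does not state how the viewpoints are distributed.

```python
import math


def sphere_viewpoints(count, radius):
    """Return `count` (position, direction) pairs on a sphere around the origin.

    Positions follow a Fibonacci lattice (an assumed distribution); each
    direction is a unit vector from the camera position toward the origin.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(count):
        z = 1.0 - 2.0 * (i + 0.5) / count  # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)
        theta = golden_angle * i
        position = (
            radius * r * math.cos(theta),
            radius * r * math.sin(theta),
            radius * z,
        )
        direction = tuple(-c / radius for c in position)
        views.append((position, direction))
    return views
```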

　視点が取得されると、推定学習部３７は３次元形状モデルに基づいて、視点のそれぞれについてオブジェクトの画像をレンダリングする（Ｓ３０３）。画像は公知の手法によりレンダリングされてよい。 Once the viewpoints are acquired, the estimation learning unit 37 renders an image of the object for each viewpoint based on the three-dimensional shape model (S303). The images may be rendered using a known method.

　画像がレンダリングされると、推定学習部３７はレンダリングされた画像を訓練画像として、視点とともに訓練データに追加する（Ｓ３０４）。ここで推定学習部３７は、レンダリングされた画像に対して所定のデータ拡張を実施し、変換された画像を訓練画像としてもよい。データ拡張手法において、例えば、レンダリングされた画像に対して、画像の輝度、彩度、色相のうち少なくとも一部に対する擾乱を与えたり、画像の一部を切り抜いて元と同じサイズにリサイズする、といった変換がレンダリングされた画像に対して行われてもよい。 Once the image is rendered, the estimation learning unit 37 adds the rendered image to the training data as a training image, together with its viewpoint (S304). Here, the estimation learning unit 37 may apply a predetermined data augmentation to the rendered image and use the transformed image as the training image. In the data augmentation, for example, the rendered image may be transformed by perturbing at least one of its luminance, saturation, and hue, or by cropping out a portion of the image and resizing it to the original size.

　推定学習部３７は、さらに視点付きのオブジェクトの撮影画像を訓練画像に追加してもよい。この撮影画像は、３次元形状モデルの生成に用いられた撮影画像であってよい。撮影画像のカメラ視点は３次元形状モデルの生成の際に取得されたカメラ視点であってよい。 The estimation learning unit 37 may further add a photographed image of the object with a viewpoint to the training image. This photographed image may be the photographed image used to generate the three-dimensional shape model. The camera viewpoint of the photographed image may be the camera viewpoint acquired when generating the three-dimensional shape model.

　訓練画像が整備されると、推定学習部３７は、訓練画像のそれぞれについて、キーポイント候補の３次元位置と、訓練画像の視点とに基づいて、訓練画像におけるキーポイントの位置を示す正解データを生成する（Ｓ３０５）。推定学習部３７は、訓練画像ごとに、キーポイント候補のそれぞれに対して正解データを生成する。 Once the training images are prepared, the estimation learning unit 37 generates, for each training image, ground truth data indicating the positions of the keypoints in that training image, based on the three-dimensional positions of the keypoint candidates and the viewpoint of the training image (S305). The estimation learning unit 37 generates ground truth data for each keypoint candidate in each training image.
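Deriving the two-dimensional position of a keypoint candidate from its three-dimensional position and the viewpoint of a training image can be sketched with a pinhole camera model. The intrinsic parameters (`focal`, `cx`, `cy`), the `up` vector, and the function names are assumptions introduced for illustration; the embodiment does not specify the camera model.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector (is cam_dir parallel to up?)")
    return tuple(c / n for c in v)


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_keypoint(point, cam_pos, cam_dir, focal, cx, cy, up=(0.0, 0.0, 1.0)):
    """Project a 3D point into the image of a camera looking along cam_dir.

    Returns (u, v) in pixels with v growing downward, or None when the point
    is behind the camera.
    """
    forward = _normalize(cam_dir)
    right = _normalize(_cross(forward, up))
    true_up = _cross(right, forward)
    rel = tuple(p - c for p, c in zip(point, cam_pos))
    depth = _dot(rel, forward)
    if depth <= 0.0:
        return None
    u = cx + focal * _dot(rel, right) / depth
    v = cy - focal * _dot(rel, true_up) / depth
    return (u, v)
```

The resulting pixel position is what a position image, such as the one in FIG. 9, would then be generated around.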

　図９は、正解データの一例を模式的に示す図である。正解データは、訓練画像におけるオブジェクトのキーポイントの２次元位置を示す情報であり、各点がその点とキーポイントとの位置関係（例えば方向）を示す位置画像であってよい。 FIG. 9 is a diagram showing an example of the correct answer data. The correct answer data is information indicating the two-dimensional positions of key points of an object in a training image, and may be a position image in which each point indicates the positional relationship (e.g., direction) between that point and the key point.

　位置画像は、キーポイントの種類ごとに生成されてよい。位置画像は、各点におけるその点とキーポイントとの相対的な方向を示す。図９に示される位置画像では、各点の値に応じたパターンが記載され、各点の値は、その点の座標とキーポイントの座標との方向を示している。図９はあくまで模式的な図であり、各点の実際の値は連続的に変化する。図９の位置画像は、各点におけるその点を基準としたキーポイントの相対的な方向を示すVector Field画像である。 A position image may be generated for each type of keypoint. At each point, the position image indicates the relative direction between that point and the keypoint. In the position image shown in FIG. 9, a pattern corresponding to the value of each point is drawn, and the value of each point indicates the direction from the coordinates of that point to the coordinates of the keypoint. FIG. 9 is merely schematic, and the actual values change continuously from point to point. The position image in FIG. 9 is a vector field image in which each point indicates the relative direction of the keypoint as seen from that point.
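A vector field position image of the kind described above could be generated as in the following sketch. Storing a unit vector `(dx, dy)` per pixel, and a zero vector at the key point itself, are assumptions; the embodiment only states that each point indicates the relative direction of the key point.

```python
import math


def vector_field_image(width, height, keypoint):
    """Return field[y][x] = unit vector from pixel (x, y) toward the key point."""
    kx, ky = keypoint
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = kx - x, ky - y
            norm = math.hypot(dx, dy)
            # The pixel that coincides with the key point has no direction.
            row.append((dx / norm, dy / norm) if norm > 0.0 else (0.0, 0.0))
        field.append(row)
    return field
```

One such image would be generated per key point type and per training image.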

　図８に示す処理により、訓練画像と正解データとを含む訓練データが生成される。 The process shown in Figure 8 generates training data that includes training images and correct answer data.

　訓練データが生成されると、推定学習部３７は、訓練データによりキーポイント候補ごとの推定モデル２６を学習させる（Ｓ２０３）。学習済の推定モデル２６は、例えば以下に示す手法により、手により隠されるオブジェクトの部分を検出するために用いられる。 Once the training data is generated, the estimation learning unit 37 uses the training data to train an estimation model 26 for each keypoint candidate (S203). The trained estimation model 26 is used to detect parts of an object that are hidden by a hand, for example, by the method described below.

　推定モデル２６が学習されると、キーポイント決定部３６は、ユーザに対して、撮影部２０の前で、手により把持されたオブジェクトを動かす指示を出力する。ユーザはその指示に従い把持したオブジェクトを撮影部２０の前で動かす。 Once the estimation model 26 has been trained, the key point determination unit 36 outputs an instruction to the user to move the object being held in the user's hand in front of the image capture unit 20. The user follows the instruction to move the object being held in front of the image capture unit 20.

　そして、キーポイント決定部３６は、撮影部２０により撮影された、手に把持されたオブジェクトの画像を取得し、さらにその画像におけるオブジェクトの姿勢を取得する（Ｓ２０４）。キーポイント決定部３６は、動画を構成しオブジェクトが撮影された画像を取得してよい。オブジェクトの姿勢の取得において、キーポイント決定部３６は、画像を学習済の推定モデル２６に入力した際に出力された情報に基づいてキーポイント候補の２次元位置を決定し、そのキーポイント候補の２次元位置と３次元形状モデルにおける位置とに基づいて、姿勢決定部２８と同様の処理によりその姿勢を取得してよい。なお、取得された姿勢と、以前の画像から取得された姿勢との違いが閾値以下である場合、または、オブジェクトの画像と以前に撮影された画像とが類似する場合には、キーポイント決定部３６はその画像を破棄し、Ｓ２０４の処理を繰り返してよい。 Then, the key point determination unit 36 acquires an image, captured by the image capture unit 20, of the object held in the hand, and further acquires the posture of the object in that image (S204). The key point determination unit 36 may acquire images that constitute a video and in which the object is captured. In acquiring the posture of the object, the key point determination unit 36 may determine the two-dimensional positions of the key point candidates based on the information output when the image is input to the trained estimation model 26, and may acquire the posture, by processing similar to that of the posture determination unit 28, based on the two-dimensional positions of the key point candidates and their positions in the three-dimensional shape model. Note that if the difference between the acquired posture and a posture acquired from a previous image is equal to or less than a threshold, or if the image of the object is similar to a previously captured image, the key point determination unit 36 may discard the image and repeat the processing of S204.

　なお、キーポイントの推定に失敗するなどの理由から、姿勢の取得ができなかった場合には、キーポイント決定部３６は、ユーザにオブジェクトを指定された姿勢となるよう調整させることで、オブジェクトの画像および姿勢を取得してもよい。キーポイント決定部３６は、ＶＲヘッドセットなどに指定されたオブジェクトのレンダリング画像を表示させ、そのレンダリング画像と重なるように把持しているオブジェクトの位置及び姿勢を調整させてよい。 If the posture cannot be acquired, for example because estimation of the key points fails, the key point determination unit 36 may acquire the image and posture of the object by having the user adjust the object so that it assumes a specified posture. The key point determination unit 36 may display a rendered image of the object in the specified posture on a VR headset or the like, and have the user adjust the position and posture of the held object so that it overlaps the rendered image.

　画像が取得されると、キーポイント決定部３６は、その画像のそれぞれから手の領域を抽出する（Ｓ２０５）。手の領域の抽出は、単に色に基づいて行われてもよいし、公知の学習済の機械学習モデルによって行われてもよい。 Once the images are acquired, the key point determination unit 36 extracts hand regions from each of the images (S205). The hand regions may be extracted simply based on color, or may be extracted using a publicly known trained machine learning model.

　キーポイント決定部３６は、キーポイント候補のそれぞれが、抽出された手の領域に隠されるか判定する（Ｓ２０６）。キーポイント決定部３６は、画像におけるキーポイント候補の位置が、抽出された手の領域内にある場合に、そのキーポイント候補が隠されると判定してよい。 The key point determination unit 36 determines whether each of the key point candidates is hidden by the extracted hand region (S206). The key point determination unit 36 may determine that a key point candidate is hidden when the position of the key point candidate in the image lies within the extracted hand region.
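The occlusion determination and the per-candidate frequency used later for key point selection can be sketched as follows. The sketch assumes that the hand region is available as a binary mask indexed as `mask[y][x]`, and that the frequency is the fraction of images in which a candidate falls inside the mask; the data layout and function names are hypothetical.

```python
def is_occluded(position, hand_mask):
    """True when the (x, y) position of a candidate lies inside the hand mask."""
    x, y = int(round(position[0])), int(round(position[1]))
    inside = 0 <= y < len(hand_mask) and 0 <= x < len(hand_mask[0])
    return inside and bool(hand_mask[y][x])


def occlusion_frequency(frames):
    """Fraction of frames in which each candidate is hidden by the hand.

    frames: iterable of (positions, hand_mask), where positions maps a
            candidate name to its (x, y) position in that frame.
    """
    hidden_counts = {}
    total = 0
    for positions, hand_mask in frames:
        total += 1
        for name, position in positions.items():
            hit = 1 if is_occluded(position, hand_mask) else 0
            hidden_counts[name] = hidden_counts.get(name, 0) + hit
    if total == 0:
        return {}
    return {name: count / total for name, count in hidden_counts.items()}
```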

　そして、キーポイント決定部３６は、繰り返し終了条件を満たしているか確認する（Ｓ２０７）。繰り返し終了条件は、これまでに判定の対象となった画像の数が閾値以上であることであってもよいし、オブジェクトを囲む仮想的な球の表面を複数の部分に分割した場合にすべての部分が姿勢に対応付けられることであってもよい。画像から取得された姿勢が示す方向にある部分が、姿勢に対応付けられた部分であってよい。 Then, the key point determination unit 36 checks whether a repetition end condition is met (S207). The repetition end condition may be that the number of images judged so far is equal to or greater than a threshold, or that, when the surface of a virtual sphere surrounding the object is divided into multiple parts, every part has been associated with a posture. The part lying in the direction indicated by a posture acquired from an image may be regarded as the part associated with that posture.
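The repetition end condition can be sketched as below. The sketch assumes that the virtual sphere is divided into a latitude/longitude grid and that each acquired posture is summarized by one direction vector; the grid layout, its default sizes, and the function names are hypothetical.

```python
import math


def sphere_cell(direction, n_lat=4, n_lon=8):
    """Map a direction vector to a (lat, lon) cell of the divided sphere."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    polar = math.acos(max(-1.0, min(1.0, z / norm)))  # 0 .. pi
    azimuth = math.atan2(y, x) % (2.0 * math.pi)  # 0 .. 2*pi
    lat = min(int(polar / math.pi * n_lat), n_lat - 1)
    lon = min(int(azimuth / (2.0 * math.pi) * n_lon), n_lon - 1)
    return lat, lon


def should_stop(directions, max_images, n_lat=4, n_lon=8):
    """True when enough images were judged or every sphere cell has a posture."""
    if len(directions) >= max_images:
        return True
    covered = {sphere_cell(d, n_lat, n_lon) for d in directions}
    return len(covered) == n_lat * n_lon
```

An equal-area division of the sphere could be used instead of the latitude/longitude grid, which makes cells near the poles smaller.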

　繰り返し終了条件を満たさない場合には（Ｓ２０７のＮ）、Ｓ２０４以降の処理が繰り返される。一方、繰り返し終了条件を満たす場合には（Ｓ２０７のＹ）、キーポイント決定部３６は、キーポイント候補のそれぞれが隠されたと判定された頻度と、姿勢推定の信頼度とに基づいて、キーポイントを決定する（Ｓ２０８）。 If the repetition end condition is not met (N in S207), the processes from S204 onwards are repeated. On the other hand, if the repetition end condition is met (Y in S207), the keypoint determination unit 36 determines keypoints based on the frequency with which each of the keypoint candidates is determined to be occluded and the reliability of the pose estimation (S208).

　より具体的には、キーポイント決定部３６は、キーポイント候補のそれぞれが隠されたと判定された頻度に基づいて、複数のキーポイント候補からキーポイントの仮のセットを選択する。初期の仮のセットとして、キーポイント候補のうち隠される頻度が低いものから所定の数のキーポイントが選択されてよい。キーポイント決定部３６は、そのキーポイントについて、Ｓ２０４で取得された画像に対して姿勢推定部２５により姿勢を推定した場合の姿勢推定の信頼度を取得する。 More specifically, the keypoint determination unit 36 selects a provisional set of keypoints from the multiple keypoint candidates based on the frequency with which each of the keypoint candidates is determined to be occluded. As the initial provisional set, a predetermined number of keypoints that are less frequently occluded among the keypoint candidates may be selected. The keypoint determination unit 36 obtains the reliability of the pose estimation for the keypoints when the pose estimation unit 25 estimates the pose for the image acquired in S204.

　信頼度は、姿勢決定部２８により推定されたオブジェクトの姿勢と、その正解の姿勢とに基づいて決定されてよい。例えば、キーポイント決定部３６は、画像からＳＬＡＭ技術等により求められた姿勢を正解として算出し、その正解の姿勢と、推定された姿勢との差に基づいて信頼度を算出してよい。 The reliability may be determined based on the pose of the object estimated by the pose determination unit 28 and the correct pose. For example, the key point determination unit 36 may calculate the pose obtained from the image using SLAM technology or the like as the correct answer, and calculate the reliability based on the difference between the correct pose and the estimated pose.

　また、キーポイント決定部３６は、仮のキーポイントにより推定された姿勢と、キーポイント候補の３次元位置とに基づいて、画像におけるキーポイント候補のそれぞれの位置を再投影し、再投影された位置を記憶部１２に格納してよい。この場合、キーポイント決定部３６は、キーポイント候補のそれぞれについて、推定モデル２６の出力により推定される位置と、再投影された位置との距離の平均を信頼度として算出してよい。 The keypoint determination unit 36 may also reproject the position of each of the keypoint candidates in the image based on the pose estimated by the tentative keypoints and the three-dimensional positions of the keypoint candidates, and store the reprojected positions in the storage unit 12. In this case, the keypoint determination unit 36 may calculate, as the reliability, the average of the distances between the position estimated by the output of the estimation model 26 and the reprojected position for each of the keypoint candidates.
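The averaging step described above can be sketched as follows, assuming that both the model-estimated and the reprojected positions are available as `(x, y)` pixel coordinates keyed by candidate name. The function name and data layout are hypothetical.

```python
import math


def mean_reprojection_distance(estimated, reprojected):
    """Average pixel distance between estimated and reprojected positions.

    estimated, reprojected: dicts mapping a candidate name to (x, y).
    Only candidates present in both dicts are compared.
    """
    names = estimated.keys() & reprojected.keys()
    if not names:
        raise ValueError("no common keypoint candidates to compare")
    total = sum(math.dist(estimated[n], reprojected[n]) for n in names)
    return total / len(names)
```

Because a smaller average distance indicates better agreement, an implementation that compares the reliability against a lower threshold would presumably convert this value (for example by negating or inverting it); that conversion is an assumption and is not stated in the description.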

　キーポイント決定部３６は、その信頼度が閾値より高い場合にはそのセットのキーポイントを正式なキーポイントとして決定し、そうでない場合には複数のキーポイント候補からこれまでのセットと異なるキーポイントからなる仮のセットを選択し、信頼度の取得以降の処理を繰り返す。新たな仮のセットは、例えば、元のセット内のキーポイントのうち、ランダムに選択されたキーポイントを、未選択のキーポイント候補のうちいずれかと交換することにより生成されてよい。交換の対象となるキーポイント候補は、例えば、頻度の小ささおよび既存のセット内のキーポイントとの距離の大きさに基づいて算出されるスコアに基づいて決定されてよい。 If the reliability is higher than a threshold, the keypoint determination unit 36 adopts the keypoints in the set as the final keypoints; otherwise, it selects from the multiple keypoint candidates a provisional set that differs from the sets tried so far, and repeats the processing from the acquisition of the reliability onward. The new provisional set may be generated, for example, by exchanging a randomly selected keypoint in the current set for one of the unselected keypoint candidates. The keypoint candidate to be swapped in may be determined, for example, based on a score calculated from how low its occlusion frequency is and how far it is from the keypoints in the existing set.
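The iterative selection can be sketched as a small search loop. Everything beyond the steps named in the description is an assumption: `reliability_of` and `swap_score` are hypothetical callbacks supplied by the caller, and the bound `max_rounds` and the rule of skipping sets that were already tried are illustrative choices.

```python
import random


def initial_set(frequency, num_keypoints):
    """Pick the candidates that are hidden least often as the first provisional set."""
    return sorted(sorted(frequency, key=frequency.get)[:num_keypoints])


def select_keypoints(frequency, num_keypoints, reliability_of, threshold,
                     swap_score, max_rounds=100, seed=0):
    """Swap keypoints until a set's reliability exceeds the threshold.

    frequency: dict candidate name -> occlusion frequency.
    reliability_of(names): hypothetical callback returning the reliability of a set.
    swap_score(name, remaining): hypothetical callback; higher is a better swap-in.
    Returns the accepted set, or None if no set is accepted within max_rounds.
    """
    rng = random.Random(seed)
    current = initial_set(frequency, num_keypoints)
    tried = {frozenset(current)}
    for _ in range(max_rounds):
        if reliability_of(current) > threshold:
            return current
        unselected = [name for name in frequency if name not in current]
        if not unselected:
            break
        removed = rng.choice(current)  # randomly selected keypoint to exchange
        remaining = [name for name in current if name != removed]
        ranked = sorted(unselected, key=lambda n: swap_score(n, remaining), reverse=True)
        for added in ranked:
            candidate = frozenset(remaining + [added])
            if candidate not in tried:  # only try sets that differ from earlier ones
                tried.add(candidate)
                current = sorted(candidate)
                break
    return None
```

A `swap_score` that combines a low occlusion frequency with a large distance to the remaining keypoints would match the scoring idea in the description; its exact form is left open here.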

　キーポイントのセットが決定されると、学習制御部３５は、その決定されたキーポイントのセットを用いて姿勢を推定するように姿勢推定部２５を設定する（Ｓ２０９）。なお、実際の姿勢推定に用いられる推定モデル２６はそのキーポイント（キーポイント候補）について学習済のものであってよい。 Once the set of key points is determined, the learning control unit 35 sets the posture estimation unit 25 to estimate the posture using the determined set of key points (S209). Note that the estimation model 26 used for actual posture estimation may be one that has already been trained on the key points (key point candidates).

　Ｓ２０４からＳ２０８の処理により、姿勢推定の精度を推定する際に悪影響を及ぼす蓋然性の高い、手により隠されやすいキーポイントを効率的に除外し、姿勢推定をより効率的に実施することが可能になる。また姿勢推定の信頼度も用いることで、例えばキーポイントが狭いエリアに集中することに起因する、姿勢推定の精度の低下を防ぐことができる。また、姿勢推定への寄与が低いキーポイント候補を予め排除できるため、姿勢推定における精度と処理速度とを両立させ、処理を効率化することができる。 The processing from S204 to S208 efficiently excludes keypoints that are easily hidden by the hand and are therefore likely to degrade pose estimation accuracy, so that pose estimation can be performed more efficiently. Using the reliability of pose estimation as well prevents a drop in pose estimation accuracy caused, for example, by keypoints concentrating in a small area. Furthermore, because keypoint candidates that contribute little to pose estimation can be excluded in advance, both accuracy and processing speed can be achieved in pose estimation, making the processing more efficient.

　なお、Ｓ２０８の処理の対象となるキーポイント候補は、隠される頻度が閾値以下のものであってもよい。ここで、キーポイント候補の数がキーポイントの数に所定の数（例えば２）を足した値より少ない場合には、キーポイント決定部３６は、隠される頻度が閾値を超えるキーポイント候補と交換するための追加のキーポイント候補を生成し、交換されたキーポイント候補についてＳ２０２以降の処理が再実行されてもよい。 The keypoint candidates that are the subject of the processing in S208 may be those whose frequency of obscuration is equal to or less than a threshold. Here, if the number of keypoint candidates is less than the number of keypoints plus a predetermined number (e.g., 2), the keypoint determination unit 36 may generate additional keypoint candidates to replace the keypoint candidates whose frequency of obscuration exceeds the threshold, and re-execute the processing from S202 onwards for the replaced keypoint candidates.

　図６に示される処理では、実際に手により隠される部分を取得しているが、代わりに、手により隠される部分を示す情報として、ユーザにより指定されたオブジェクトの部分であって、手により把持される部分を示す情報を取得してもよい。 In the process shown in FIG. 6, the part actually hidden by the hand is acquired. Instead, information indicating a part of the object that is designated by the user and is held by the hand may be acquired as the information indicating the part hidden by the hand.

　図１０は、キーポイントの決定および推定モデル２６の学習の処理の他の一例を示すフロー図である。この例では、ディスプレイに表示されるオブジェクトの画像に対して、ユーザが手により隠される部分をタグ領域として手動で指定する。 FIG. 10 is a flow diagram showing another example of the process of determining key points and learning the estimation model 26. In this example, the user manually specifies, as a tag region, a portion of an image of an object shown on a display that is obscured by a hand.

　はじめにキーポイント決定部３６は、３次元形状モデルに基づいて、オブジェクトの画像を表示させる（Ｓ４０１）。次に、ユーザの画像に対する操作に基づいて、オブジェクトのうちユーザが指定するタグ領域と、タグ領域に対して指定された機能タグとを取得する（Ｓ４０２）。 First, the key point determination unit 36 displays an image of an object based on a three-dimensional shape model (S401). Next, based on the user's operation on the image, it acquires a tag area of the object designated by the user and a function tag designated for the tag area (S402).

　キーポイント決定部３６は、オブジェクトの画像とともに塗料パレットを含むペイントツールのアイコンを表示させ、ユーザが塗料パレットにより指定した色を用いてスキャンしたモデルに着色させる処理を実行してよい。なお、ユーザは、オブジェクトを持ちながら任意の位置にペイントしてもよい。この場合、実際のオブジェクトの画像に同形状の仮想オブジェクトの画像を透過的に重畳させ、仮想オブジェクトに着色することにより、実物に塗られたように領域が可視化されてよい。また塗料の色と機能タグとが関連付けられてよい。 The key point determination unit 36 may display an icon of a paint tool including a paint palette along with an image of the object, and execute a process of coloring the scanned model with a color specified by the user using the paint palette. The user may paint any position while holding the object. In this case, an image of a virtual object of the same shape may be transparently superimposed on the image of the actual object, and the virtual object may be colored to visualize the area as if it were painted on the real thing. The paint color may also be associated with a functional tag.

　キーポイント決定部３６は、この着色された領域をタグ領域として取得してよい。また、キーポイント決定部３６は、ペイントツールによる着色の代わりに、ユーザがそれぞれ機能タグに対応する複数の仮想シールのうちいずれかを選び、選ばれた仮想シールをオブジェクトに貼り付けることでタグ領域を指定してもよい。 The key point determination unit 36 may acquire this colored area as a tag area. Alternatively, instead of coloring with a paint tool, the key point determination unit 36 may specify the tag area by having the user select one of a plurality of virtual stickers that each correspond to a functional tag and attach the selected virtual sticker to the object.

　キーポイント決定部３６は、Ｓ４０１，Ｓ４０２と並行して、複数のキーポイント候補を生成する（Ｓ４０３）。この処理はＳ２０１と同様であるので詳細の説明を省略する。 The keypoint determination unit 36 generates multiple keypoint candidates (S403) in parallel with S401 and S402. This process is similar to S201, so a detailed explanation will be omitted.

　キーポイント決定部３６は、キーポイント候補のそれぞれが、所定の条件を満たすタグ領域に隠れる蓋然性を算出する（Ｓ４０４）。所定の条件を満たすタグ領域は、例えば把持される領域として指定されたタグ領域や、スイッチなど手が触れる機能と対応づけられたタグ領域であってよい。 The key point determination unit 36 calculates the probability that each of the key point candidates is hidden in a tag area that satisfies a predetermined condition (S404). The tag area that satisfies the predetermined condition may be, for example, a tag area designated as an area to be grasped, or a tag area associated with a function that is touched by the hand, such as a switch.

　キーポイント決定部３６は、例えば、キーポイント候補から複数の（等方的な）方向にレイを飛ばし、さらにタグ領域に当たるレイの比率を示す値をそのキーポイント候補における蓋然性として算出してよい。またタグ領域が３次元的な領域である場合には、キーポイント決定部３６はキーポイント候補がその領域内にある場合に蓋然性の値を１とし、そうでない場合に蓋然性の値を０にしてよい。 The keypoint determination unit 36 may, for example, cast rays in multiple (isotropic) directions from the keypoint candidate, and further calculate a value indicating the ratio of rays that hit the tag region as the probability of that keypoint candidate. Furthermore, if the tag region is a three-dimensional region, the keypoint determination unit 36 may set the probability value to 1 if the keypoint candidate is within that region, and may set the probability value to 0 if it is not.
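The ray-based probability can be illustrated with a minimal sketch. It assumes, purely for illustration, that the tag area is approximated by a sphere and that the isotropic directions come from a Fibonacci lattice; a real tag area would be a painted region of the mesh, and the function names are hypothetical.

```python
import math


def isotropic_directions(count):
    """Roughly uniform unit directions on the sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for i in range(count):
        z = 1.0 - 2.0 * (i + 0.5) / count
        r = math.sqrt(1.0 - z * z)
        theta = golden_angle * i
        directions.append((r * math.cos(theta), r * math.sin(theta), z))
    return directions


def ray_hits_sphere(origin, direction, center, radius):
    """True when the ray from origin along a unit direction meets the sphere."""
    oc = tuple(c - o for c, o in zip(center, origin))
    dist_sq = sum(c * c for c in oc)
    if dist_sq <= radius * radius:
        return True  # the origin is already inside the region
    t = sum(a * b for a, b in zip(oc, direction))  # distance along the ray
    return t > 0.0 and dist_sq - t * t <= radius * radius


def occlusion_probability(candidate, tag_center, tag_radius, ray_count=1000):
    """Ratio of rays cast from the candidate that hit the (spherical) tag area."""
    hits = sum(
        ray_hits_sphere(candidate, d, tag_center, tag_radius)
        for d in isotropic_directions(ray_count)
    )
    return hits / ray_count


def containment_probability(candidate, tag_center, tag_radius):
    """1.0 when the candidate lies inside a three-dimensional tag area, else 0.0."""
    return 1.0 if math.dist(candidate, tag_center) <= tag_radius else 0.0
```

With a mesh-based tag area, the sphere test would be replaced by a ray-triangle intersection against the tagged faces.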

　キーポイント決定部３６は、キーポイント候補が隠れる蓋然性に基づいて、最終的な姿勢推定に用いるキーポイントを選択する（Ｓ４０５）。ここで、キーポイント決定部３６は、キーポイント候補のうち蓋然性が低いものから所定の数のキーポイントを選択してよい。 The keypoint determination unit 36 selects the keypoints to be used for the final pose estimation based on the probability that each keypoint candidate is hidden (S405). Here, the keypoint determination unit 36 may select a predetermined number of keypoints from the keypoint candidates in ascending order of that probability.

　他には、キーポイント決定部３６は、信頼度と蓋然性とに基づいてキーポイントを選択してもよい。例えば、キーポイント決定部３６は仮のキーポイントを決定し、その仮のキーポイントに基づいて信頼度を算出してよい。キーポイント決定部３６は、キーポイントごとの信頼度と蓋然性とから、キーポイントとしての適性を示すスコアを生成し、そのスコアに基づいてキーポイントを決定する。信頼度の算出は、以下の手順で行われてもよい。はじめに、その仮のキーポイントについて学習された推定モデル２６を用いて、姿勢推定部２５が３次元形状モデルを生成する際に撮影された画像について姿勢を推定する。次に、キーポイント決定部３６はその推定された姿勢と、キーポイント候補の３次元位置とに基づいて、画像におけるキーポイント候補のそれぞれの位置を再投影する。そして、キーポイント決定部３６は、キーポイント候補のそれぞれについて、推定モデル２６の出力により推定される位置と、再投影された位置との距離の平均を信頼度として算出する。なお、Ｓ２０８と同様の手法で反復的にキーポイントを決定してもよい。 Alternatively, the keypoint determination unit 36 may select keypoints based on both the reliability and the probability. For example, the keypoint determination unit 36 may determine tentative keypoints and calculate the reliability based on those tentative keypoints. The keypoint determination unit 36 generates, from the reliability and the probability of each keypoint, a score indicating its suitability as a keypoint, and determines the keypoints based on that score. The reliability may be calculated by the following procedure. First, using the estimation model 26 trained for the tentative keypoints, the posture estimation unit 25 estimates the posture for the images that were captured when the three-dimensional shape model was generated. Next, the keypoint determination unit 36 reprojects the position of each of the keypoint candidates in the image based on the estimated posture and the three-dimensional positions of the keypoint candidates. Then, for each of the keypoint candidates, the keypoint determination unit 36 calculates, as the reliability, the average of the distances between the position estimated from the output of the estimation model 26 and the reprojected position. Note that the keypoints may be determined iteratively using a method similar to S208.

　キーポイントが決定されると、推定学習部３７はそのキーポイント候補のそれぞれについて推定モデル２６の学習のための訓練データを生成する（Ｓ４０６）。また推定学習部３７はキーポイントのそれぞれについて推定モデル２６を学習させる（Ｓ４０７）。なお、事前にキーポイント（キーポイント候補）について学習されている場合には、推定モデル２６の学習を重複して行わなくてもよい。 Once the keypoints are determined, the estimation learning unit 37 generates training data for learning the estimation model 26 for each of the keypoint candidates (S406). The estimation learning unit 37 also trains the estimation model 26 for each of the keypoints (S407). Note that if the keypoints (keypoint candidates) have been learned in advance, there is no need to redundantly train the estimation model 26.

　図１０に示される手法でも、手で把持した場合の姿勢推定に悪影響を及ぼす蓋然性の高い、手により隠されやすいキーポイントを効率的に除外し、姿勢推定をより効率的に実施することが可能になる。また、姿勢推定への寄与が低いキーポイント候補を予め排除できるため、姿勢推定における精度と処理速度とを両立させ、処理を効率化することができる。 The method shown in FIG. 10 also efficiently excludes keypoints that are easily hidden by the hand and are therefore likely to adversely affect pose estimation when the object is held in the hand, so that pose estimation can be performed more efficiently. In addition, because keypoint candidates that contribute little to pose estimation can be excluded in advance, both accuracy and processing speed can be achieved in pose estimation, making the processing more efficient.

　また、図１０に示される手法では、何らかの機能を割り当てることにより手により隠される部分が特定される。そのため、手により隠される部分そのものを特定するための操作を省略することができ、利便性を向上させることができる。 In addition, in the method shown in FIG. 10, the part hidden by the hand is identified by assigning some kind of function. Therefore, the operation for identifying the part hidden by the hand itself can be omitted, improving convenience.

　なお、上記の具体的な数値及び図面中のオブジェクトや数値は例示であり、これらの例には限定されず、必要に応じて改変されてよい。 Note that the specific numerical values above and the objects and numerical values in the drawings are merely examples, and are not limited to these examples and may be modified as necessary.

Claims

1. A posture estimation system comprising one or more processors,
wherein the one or more processors:
acquire information indicating a portion of an object that is hidden by a hand;
determine three-dimensional positions of a plurality of keypoints for estimating a posture of the object based on the acquired information;
train a machine learning model for estimating positions of the determined plurality of keypoints in an input image;
acquire estimated positions of the keypoints in an image capturing the object and the hand, based on an output produced by the trained machine learning model when the image is input; and
determine an estimated posture of the object in a three-dimensional space based on the estimated positions of the keypoints.

2. The posture estimation system according to claim 1, wherein
the information indicating the portion hidden by the hand is a plurality of images in which the object is held by the hand, and
the one or more processors determine the three-dimensional positions of the plurality of keypoints based on a frequency with which each of a plurality of keypoint candidates determined by a predetermined procedure is hidden by the hand in the plurality of images in which the object is held by the hand.

3. The posture estimation system according to claim 1, wherein
the information indicating the portion hidden by the hand indicates a portion of the object that is designated by a user and held by the hand.

4. The posture estimation system according to claim 3, wherein
the information indicating the portion hidden by the hand is information indicating a portion of the object that is designated by a user and with which a tag is associated, and
the one or more processors determine, based on an image of the object and the hand, whether the portion associated with the tag is being operated by the hand, and execute a process corresponding to the tag when it is determined that the portion is being operated.

5. The posture estimation system according to claim 4, wherein
the one or more processors determine, based on an image of the object and the hand, whether the portion associated with the tag is being operated by the hand, and, when it is determined that the portion is being operated, execute a process corresponding to the tag based on a magnitude of the operation by the hand.

6. A posture estimation method comprising, by one or more processors:
acquiring information indicating a portion of an object that is hidden by a hand;
determining three-dimensional positions of a plurality of keypoints for estimating a posture of the object based on the acquired information;
acquiring a trained machine learning model for estimating positions of the determined plurality of keypoints in an input image;
acquiring estimated positions of the keypoints in an image capturing the object and the hand, based on an output produced by the acquired machine learning model when the image is input; and
estimating a posture of the object in a three-dimensional space based on the estimated positions of the keypoints.

7. A program that causes a computer to function as:
an acquisition means for acquiring information indicating a portion of an object that is hidden by a hand;
a keypoint determination means for determining three-dimensional positions of a plurality of keypoints for estimating a posture of the object based on the acquired information;
a model acquisition means for acquiring a trained machine learning model for estimating positions of the determined plurality of keypoints in an input image;
a position acquisition means for acquiring estimated positions of the keypoints in an image capturing the object and the hand, based on an output produced by the acquired machine learning model when the image is input; and
a posture estimation means for estimating a posture of the object in a three-dimensional space based on the estimated positions of the keypoints.