WO2025057379A1 - Posture estimation system, posture estimation method, and program - Google Patents

Publication number
WO2025057379A1
Authority
WO
WIPO (PCT)
Prior art keywords
hand
image
pose
positions
keypoint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/033602
Other languages
French (fr)
Japanese (ja)
Inventor
克彦 松浦
祥悟 佐藤
泰史 奥村
徹悟 稲田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Interactive Entertainment Inc
Original Assignee
Sony Interactive Entertainment Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc
Priority to PCT/JP2023/033602
Publication of WO2025057379A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras

Definitions

  • the present invention relates to a posture estimation system, a posture estimation method, and a program.
  • the three-dimensional positions of the object's keypoints are determined in advance. For example, a machine learning model is trained to predict the positions of keypoints in an image, and the machine learning model is used to estimate the positions of keypoints in an image from a captured image.
  • the present invention was made in consideration of the above situation, and its purpose is to provide a technology that enables posture estimation to be performed more appropriately.
  • the posture estimation system of the present invention includes one or more processors, which acquire information indicating parts of an object that are hidden by a hand, determine the three-dimensional positions of a number of key points for estimating the posture of the object based on the acquired information, train a machine learning model for estimating the positions of the determined number of key points in an input image, acquire positions of estimated key points in the image based on the output when an image of an object and a hand is input to the trained machine learning model, and determine an estimated posture of the object in three-dimensional space based on the positions of the estimated key points.
  • the information indicating the parts hidden by the hand may be a plurality of images in which the object is held by the hand, and the one or more processors may determine the three-dimensional positions of the plurality of keypoints from among a plurality of keypoint candidates determined by a predetermined procedure, based on the frequency with which the hand hides each keypoint candidate in the plurality of images.
  • the information indicating the portion hidden by the hand may be the portion of the object indicated by the user that is being held by the hand.
  • the information indicating the parts hidden by the hand may be information indicating a part of the object designated by the user and associated with a tag, and the one or more processors may determine whether the part associated with the tag is being operated by the hand based on an image of the object and the hand, and execute processing according to the tag if it is determined that the part is being operated.
  • the one or more processors may determine whether a part associated with the tag is being operated by the hand based on an image of the object and hand, and if it is determined that the part is being operated, may execute processing corresponding to the tag based on the magnitude of the operation by the hand.
  • the posture estimation method of the present invention involves using one or more processors to acquire information indicating parts of an object that are hidden by the hand, determining the three-dimensional positions of a number of key points for estimating the posture of the object based on the acquired information, acquiring a trained machine learning model for estimating the positions of the determined number of key points in an input image, acquiring the positions of the estimated key points in the image based on the output when an image of an object and a hand is input to the acquired machine learning model, and estimating the posture of the object in three-dimensional space based on the positions of the estimated key points.
  • the program of the present invention causes a computer to function as: an acquisition means for acquiring information indicating parts of an object that are hidden by a hand; a key point determination means for determining three-dimensional positions of a number of key points for estimating the posture of the object based on the acquired information; a model acquisition means for acquiring a trained machine learning model for estimating the positions of the determined number of key points in an input image; a position acquisition means for acquiring positions of the estimated key points in the image based on an output when an image of an object and a hand is input to the acquired machine learning model; and a posture estimation means for estimating the posture of the object in three-dimensional space based on the positions of the estimated key points.
  • the present invention makes it possible to more appropriately estimate posture using keypoints.
  • FIG. 1 is a diagram illustrating an example of a configuration of an information processing system according to an embodiment of the present invention.
  • FIG. 2 is a block diagram showing an example of functions implemented in an information processing system according to an embodiment of the present invention.
  • FIG. 3 is a diagram showing an example of a photographed image of an object.
  • FIG. 4 is a diagram illustrating an example of tag regions associated with functional tags.
  • FIG. 5 is a flow diagram illustrating an outline of processing of the information processing system.
  • FIG. 6 is a flow diagram illustrating an example of a process for determining keypoints and training an estimation model.
  • FIG. 7 is a diagram illustrating an example of keypoint candidates generated from an object.
  • FIG. 8 is a flow diagram showing an example of a process for generating training data and training an estimation model.
  • FIG. 9 is a diagram illustrating an example of correct answer data.
  • FIG. 11 is a flow diagram showing another example of a process for determining keypoints and training an estimation model
  • This information processing system includes a machine learning model that outputs information indicating the estimated pose of an object from an image in which the object is captured.
  • FIG. 1 is a diagram showing an example of the configuration of an information processing system according to one embodiment of the present invention.
  • the information processing system includes an information processing device 10.
  • the information processing device 10 is, for example, a computer such as a game console, a personal computer, or a VR headset.
  • the information processing device 10 includes, for example, a processor 11, a storage unit 12, a communication unit 13, an operation unit 16, a display unit 18, and an imaging unit 20.
  • the information processing system may be composed of one information processing device 10, or may be composed of multiple devices including the information processing device 10, and for example, the imaging unit 20 or the display unit 18 may be located in a housing separate from the information processing device 10.
  • the processor 11 is, for example, a program-controlled device such as a CPU that operates according to a program installed in the information processing device 10.
  • the storage unit 12 is composed of at least a portion of a memory element such as a ROM or RAM, or an external storage device such as a solid-state drive.
  • the storage unit 12 stores programs executed by the processor 11, etc.
  • the communication unit 13 is a communication interface for wired or wireless communication, such as a network interface card, and transmits and receives data between other computers and terminals via a computer network such as the Internet.
  • the operation unit 16 is an input device such as a keyboard, mouse, touch panel, or game console controller, and receives operation input from the user and outputs a signal indicating the content of the input to the processor 11.
  • the display unit 18 is a display device such as a liquid crystal display, and displays various images according to instructions from the processor 11.
  • the display unit 18 may be built into the VR headset, or may be a device that outputs a video signal to an external display device.
  • the imaging unit 20 is a photographing device including an image sensor.
  • the imaging unit 20 may be a camera capable of acquiring visible RGB images.
  • the imaging unit 20 may be a camera capable of acquiring visible RGB images and depth information synchronized with the RGB images.
  • the imaging unit 20 in this embodiment may be, for example, a camera capable of capturing moving images, or may be a camera built into a VR headset.
  • the imaging unit 20 may be outside the information processing device 10, in which case the information processing device 10 and the imaging unit 20 may be connected via the communication unit 13 or an input/output unit described below.
  • the information processing device 10 may also include audio input/output devices such as a microphone and a speaker.
  • the information processing device 10 may also include, for example, a communication interface such as a network board, an optical disk drive that reads optical disks such as DVD-ROMs and Blu-ray (registered trademark) disks, and an input/output unit (USB (Universal Serial Bus) port) for inputting and outputting data to and from external devices.
  • FIG. 2 is a block diagram showing an example of functions implemented in an information processing system according to one embodiment of the present invention.
  • the information processing system functionally includes a posture estimation unit 25, a tag processing unit 29, an image rendering unit 30, a shape model acquisition unit 31, an occlusion information acquisition unit 32, and a learning control unit 35.
  • the posture estimation unit 25 functionally includes an estimation model 26, a position acquisition unit 27, and a posture determination unit 28.
  • the learning control unit 35 functionally includes a key point determination unit 36 and an estimation learning unit 37.
  • the estimation model 26 is a type of machine learning model.
  • these functions are mainly implemented by the processor 11 and the storage unit 12. More specifically, these functions may be implemented by having the processor 11 execute a program that is installed in the information processing device 10, which is a computer, and that includes execution instructions corresponding to the above functions.
  • this program may be supplied to the information processing device 10 via, for example, a computer-readable information storage medium such as an optical disk, a magnetic disk, or a flash memory, or via the Internet, etc.
  • the posture estimation unit 25 estimates the posture of the target object based on the information output when an input image is input to the estimation model 26.
  • the input image is an image of the object captured by the imaging unit 20.
  • the estimation model 26 is a machine learning model that is trained using training data, and when input data is input, the trained estimation model 26 outputs data as an estimation result.
  • FIG. 3 is a diagram showing an example of an image of a photographed object.
  • the target object 51 shown in FIG. 3 is held, for example, by a hand 53, and is photographed by the imaging unit 20.
  • Information on an image of a target object is input to the trained estimation model 26, and the estimation model 26 outputs information indicating the positions of keypoints for estimating the posture of the object. More specifically, the estimation model 26 outputs an image indicating the positions of each of a number of keypoints set for the object.
  • An estimation model 26 may exist for each keypoint or each keypoint candidate.
  • the training data for the estimation model 26 includes multiple learning images rendered from a three-dimensional shape model of the target object, and ground truth data indicating the positions of the object's keypoints in the learning images. Keypoints are virtual points within an object that are used to calculate the pose.
  • the data output by the estimation model 26 may be a position image in which each point indicates the positional relationship between that point and a keypoint (e.g., relative direction), or a position image that is a heat map in which each point indicates the probability that a keypoint exists.
  • the learning of the estimation model 26 will be described in detail later.
  • the input image may be an image obtained by processing an image of the object captured by the imaging unit 20. For example, it may be an image in which the area other than the target object is masked, or an image that has been enlarged or reduced so that the object appears at a predetermined size.
  • the position acquisition unit 27 determines the two-dimensional position of the keypoint in the input image based on the output of the estimation model 26 when an image of an object and a hand is input to the trained estimation model 26. For example, the position acquisition unit 27 determines candidates for the two-dimensional position of the keypoint in the input image based on the position image output from the estimation model 26. For example, the position acquisition unit 27 calculates candidate points for the keypoint from each combination of any two points in the position image, and generates a score indicating whether the multiple candidate points match the direction indicated by each point in the position image. The position acquisition unit 27 may estimate the candidate point with the largest score as the position of the keypoint. The position acquisition unit 27 also repeats the above process for each keypoint.
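The voting procedure described above can be illustrated with a minimal Python sketch. This is not the implementation of the embodiment: it assumes the position image has already been converted into a list of pixel coordinates and unit direction vectors, and the names `intersect` and `vote_keypoint`, the number of hypotheses, and the cosine threshold are hypothetical choices made only for illustration.

```python
import math
import random


def intersect(p1, d1, p2, d2):
    """Intersect the lines p1 + s*d1 and p2 + t*d2; return None if (nearly) parallel."""
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-9:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    s = (rx * d2[1] - ry * d2[0]) / cross
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def vote_keypoint(pixels, directions, n_hypotheses=50, cos_thresh=0.99, seed=0):
    """Pick the candidate point that agrees with the most per-pixel directions.

    pixels:     list of (x, y) coordinates
    directions: list of unit vectors, each pointing from its pixel to the keypoint
    """
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(n_hypotheses):
        # A candidate is generated from a combination of two points.
        i, j = rng.sample(range(len(pixels)), 2)
        cand = intersect(pixels[i], directions[i], pixels[j], directions[j])
        if cand is None:
            continue
        # The score counts the pixels whose direction points toward the candidate.
        score = 0
        for (px, py), (dx, dy) in zip(pixels, directions):
            vx, vy = cand[0] - px, cand[1] - py
            norm = math.hypot(vx, vy)
            if norm > 1e-9 and (vx * dx + vy * dy) / norm >= cos_thresh:
                score += 1
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Sampling a limited number of pairs instead of enumerating every combination is a design choice of this sketch; it keeps the cost linear in the number of hypotheses.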
  • the posture determination unit 28 estimates the pose of the object based on information indicating the two-dimensional positions of keypoints in the input image and information indicating the three-dimensional positions of keypoints in a three-dimensional shape model of the target object, and outputs pose data indicating the estimated pose.
  • the pose of the object is estimated using a known algorithm. For example, it may be estimated using a solution to the Perspective-n-Point (PnP) problem for pose estimation (e.g., EPnP).
  • the posture determination unit 28 may estimate not only the pose of the object but also the position of the object in the input image, and the pose data may include information indicating that position.
  • the internal camera parameters of the imaging unit 20 are assumed to have been acquired in advance through calibration. These parameters are used when solving the PnP problem.
  • the estimation model 26, the position acquisition unit 27, and the posture determination unit 28 may be those described in the paper PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation.
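To make the PnP step concrete, the following Python sketch recovers a rotation and translation from 2D-3D keypoint correspondences and known camera intrinsics. It is an illustration only and does not implement EPnP: it substitutes the simpler linear DLT solution, which assumes at least six non-coplanar keypoints and no outliers, whereas the text allows as few as four keypoints. The function name `solve_pnp_dlt` is hypothetical; in practice a library solver such as OpenCV's `cv2.solvePnP` would typically be used.

```python
import numpy as np


def solve_pnp_dlt(points_3d, points_2d, K):
    """Estimate R, t such that pixel ~ K (R X + t), using a linear DLT solution.

    Assumes >= 6 non-coplanar points and exact (outlier-free) correspondences.
    """
    X = np.asarray(points_3d, dtype=float)
    uv1 = np.column_stack([np.asarray(points_2d, dtype=float), np.ones(len(X))])
    # Remove the intrinsics to obtain normalized image coordinates (x, y, 1).
    xn = (np.linalg.inv(np.asarray(K, dtype=float)) @ uv1.T).T
    Xh = np.column_stack([X, np.ones(len(X))])

    # Each correspondence contributes two linear equations in the 12 entries of [R|t].
    rows = []
    for (x, y, _), P in zip(xn, Xh):
        rows.append(np.concatenate([P, np.zeros(4), -x * P]))
        rows.append(np.concatenate([np.zeros(4), P, -y * P]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    M = Vt[-1].reshape(3, 4)  # [R|t] up to an unknown scale and sign

    # Fix the sign so that the rotation part has a positive determinant.
    if np.linalg.det(M[:, :3]) < 0:
        M = -M
    # Project the 3x3 part onto the nearest rotation and undo the scale.
    U, S, Vt3 = np.linalg.svd(M[:, :3])
    R = U @ Vt3
    t = M[:, 3] / S.mean()
    return R, t
```

The intrinsic matrix `K` corresponds to the internal camera parameters obtained through calibration; everything else in the sketch is an assumption made for illustration.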
  • the tag processing unit 29 determines whether the part associated with the functional tag is being operated by the hand, based on information indicating the part of the object with which the functional tag is associated and an image of the object and the hand. If it is determined that the part is being operated, the tag processing unit 29 executes processing according to the functional tag. In that case, the tag processing unit 29 may execute the processing according to the functional tag based on the magnitude of the operation by the hand.
  • FIG. 4 is a diagram illustrating an example of tag regions 61, 62 associated with functional tags.
  • Each of the tag regions 61, 62 is a part of the target object 51.
  • the tag regions 61, 62 may be regions on the surface of the target object 51, or may be three-dimensional regions including the interior.
  • the tag regions 61, 62 are associated with different functional tags.
  • for example, the tag region 61 may be associated with the functional tag of a switch, and the tag region 62 may be associated with the functional tag of a part to be grasped.
  • the tag regions 61, 62 may be any of a region that can be grasped by the user, a region that displays a virtual touch screen and allows touch interaction, and a region that emits a light source or particles for lighting purposes, and each may be associated with a functional tag corresponding to that region.
  • the image rendering unit 30 renders an image based on the estimated orientation of the object.
  • the image rendering unit 30 may render a three-dimensional image of the object based on the estimated orientation of the object and a three-dimensional shape model.
  • the image rendering unit 30 may determine the orientation of an object to be rendered, such as an object in a VR image, based on the estimated orientation of the object, and render the object to be rendered.
  • the shape model acquisition unit 31 acquires multiple captured images of a target object captured by the imaging unit 20.
  • the shape model acquisition unit 31 generates and acquires a three-dimensional shape model of the object from the multiple captured images. More specifically, the shape model acquisition unit 31 extracts multiple feature vectors indicating local features from each of the captured images, and determines the three-dimensional position of each point at which a feature vector was extracted, based on the corresponding feature vectors extracted from the multiple captured images and the positions in the captured images at which they were extracted.
  • the shape model acquisition unit 31 then acquires a three-dimensional shape model of the object based on the three-dimensional position.
  • This method is a well-known method that is also used in software that realizes so-called SfM and Visual SLAM, so a detailed explanation will be omitted.
  • the occlusion information acquisition unit 32 acquires information indicating the parts of the target object that are hidden by the hand. At this time, it is assumed that the hand is holding the object. More specifically, the information indicating the parts hidden by the hand is at least one of a plurality of images in which the target object is held by the hand, and information indicating the part of the target object that the user has designated as the part held by the hand.
  • the occlusion information acquisition unit 32 may acquire multiple images of the target object being held by the hand, captured by the imaging unit 20, as information indicating the parts hidden by the hand.
  • the occlusion information acquisition unit 32 may acquire information indicating a part of an object designated by a user and held by the hand as information indicating a part hidden by the hand.
  • Tag regions 61, 62 may be designated as parts of the object.
  • the occlusion information acquisition unit 32 may input information about the target object into a trained machine learning model that estimates the area held by the hand, and identify parts of the object using the output of the machine learning model.
  • This machine learning model is publicly known, so a detailed description of it will be omitted.
  • the learning control unit 35 determines the key points of the target object based on the three-dimensional shape model of the object and trains the estimation model 26.
  • the key point determination unit 36 may determine the three-dimensional positions of multiple key points for estimating the posture of the target object based on a three-dimensional shape model of the target object and information indicating the parts hidden by the hands. If the information indicating the parts hidden by the hands is multiple images of the object being held by the hands, the key point determination unit 36 may determine multiple key points based on the frequency with which multiple key point candidates determined by a predetermined method are hidden by the hands in the multiple images, and may determine the three-dimensional positions of the determined key points.
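The frequency-based selection described above can be sketched as follows in Python. The text only states that the keypoints are determined based on how often each candidate is hidden by the hand; this sketch assumes that the candidates hidden least often are kept, and the function name `select_keypoints` and its input representation (one set of hidden candidate indices per image) are hypothetical.

```python
def select_keypoints(n_candidates, hidden_per_image, n_keypoints):
    """Return indices of the n_keypoints candidates hidden least often by the hand.

    hidden_per_image: iterable of sets; each set holds the indices of the
    candidates that the hand hides in one image (assumed to be given).
    """
    counts = [0] * n_candidates
    for hidden in hidden_per_image:
        for idx in hidden:
            counts[idx] += 1
    # Sort by occlusion count; ties are broken by the candidate index.
    ranked = sorted(range(n_candidates), key=lambda i: (counts[i], i))
    return sorted(ranked[:n_keypoints])
```

For example, with six candidates and three images in which candidate 0 is always hidden, candidate 0 is the last one to be selected.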
  • the keypoint determination unit 36 may generate a set of multiple keypoint candidates, for example, by using the well-known Farthest Point algorithm.
  • the number of keypoints N may be an integer greater than or equal to 4.
  • the number of keypoint candidates may be an integer greater than the number of keypoints N (for example, greater than or equal to 1.3 times the number of keypoints).
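As a rough illustration of candidate generation, the following Python sketch applies farthest point sampling to a list of model vertices. It is a minimal version under stated assumptions: the starting vertex and the function name `farthest_point_sampling` are hypothetical, and the vertices are assumed to be given as 3D coordinate tuples.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily choose k indices so that each new point is as far as possible
    from the points chosen so far."""
    chosen = [start]
    # dist[i] is the distance from point i to its nearest chosen point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(idx)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[idx]))
    return chosen
```

With N keypoints, the number of candidates k passed to the function could be chosen as, for example, `math.ceil(1.3 * N)`, following the ratio mentioned above.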
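  • if the information indicating the parts hidden by the hand is a part of the object designated by the user, the key point determination unit 36 may determine multiple key points from the multiple key point candidates based on that part, and may determine the three-dimensional positions of the determined key points.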
  • the keypoint determination unit 36 may determine multiple keypoints from multiple keypoint candidates based on the reliability of the posture estimation using the keypoints. The method of calculating the reliability will be described later.
  • the estimation learning unit 37 trains the estimation model 26, which is a machine learning model for estimating the positions of the determined multiple key points in the input image. More specifically, the estimation learning unit 37 generates training data used to train the estimation model 26, and trains the estimation model 26 using the training data.
  • the training data includes a number of training images rendered using a three-dimensional shape model of the target object, and ground truth data indicating the positions of the object's keypoints in the training images.
  • the keypoints for which the estimation learning unit 37 generates ground truth data may be those included in a set of keypoint candidates.
  • the estimation learning unit 37 may generate ground truth data for all keypoint candidates included in the initial set, and train the estimation model 26.
  • the estimation learning unit 37 may determine the positions of keypoint candidates in the learning images based on the pose of the rendered object, and generate a correct position image for each of the keypoint candidates according to its position.
  • the training data may include learning images in which the object is photographed, and position images generated from the pose of the object in the learning images estimated by so-called SfM or Visual SLAM.
  • the estimation learning unit 37 trains an estimation model 26 for each of the keypoint candidates.
  • the estimation model 26 for the selected keypoint candidate is used as the keypoint estimation model 26 for pose estimation (inference processing) for the input image.
  • FIG. 5 is a flow diagram that shows an overview of the processing of the information processing system.
  • the information processing system generates a three-dimensional shape model of a target object using a known method based on an image of the object (S101).
  • the learning control unit 35 included in the information processing system determines the positions of the key points based on the three-dimensional shape model and information indicating the parts hidden by the hand, and trains the estimation model 26 for pose estimation (S102).
  • the posture estimation unit 25 inputs an input image of an object into the trained estimation model 26 (S103) and obtains data output by the estimation model 26. Then, based on the output of the estimation model 26, the two-dimensional positions of key points in the image are determined (S104).
  • if the output of the estimation model 26 is a position image in which each point indicates the direction to a keypoint, the position acquisition unit 27 included in the posture estimation unit 25 calculates candidates for the position of the keypoint from the points of the position image, and determines the position of the keypoint based on the candidates. If the output of the estimation model 26 is a heat-map position image, the position acquisition unit 27 determines the position of the most probable point as the position of the keypoint using a known method.
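For the heat-map case, one simple known method is to take the pixel with the highest value. The following Python sketch shows this under the assumption that the heat map is a 2D NumPy array indexed as [row, column]; the function name `heatmap_peak` is hypothetical, and sub-pixel refinement is omitted.

```python
import numpy as np


def heatmap_peak(heatmap):
    """Return the (x, y) position of the most probable point in a 2D heat map."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col), float(row)
```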
  • the posture estimation unit 25 estimates the posture of the object based on the two-dimensional positions of the determined keypoints and the three-dimensional positions of those keypoints in the three-dimensional shape model (S105).
  • the tag processing unit 29 also acquires information indicating the hand pose from the input image (S106).
  • as the hand pose, the coordinates in three-dimensional space of the joint points of the fingers captured in the input image may be acquired.
  • to acquire the hand pose, a machine learning model trained using images and ground truth data indicating the joint points may be used.
  • the input image may include not only a visible image but also a depth image. Techniques for acquiring the hand pose are well known, so detailed explanations are omitted.
  • the tag processing unit 29 determines whether the hand is touching a part of the object corresponding to the functional tag based on the acquired information indicating the hand pose (S107). The tag processing unit 29 may determine whether the hand is touching based on whether the distance between the three-dimensional coordinates of any joint point of the hand and a part associated with the functional tag (e.g., tag regions 61, 62) is equal to or less than a threshold.
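The distance test in S107 can be sketched in Python as follows. The sketch assumes that the tag region is represented by a set of sampled 3D points in the same coordinate system as the hand joint points; this representation, the function name `is_touching`, and the threshold value used in the example are assumptions made for illustration.

```python
import math


def is_touching(joint_points, tag_region_points, threshold):
    """Return True if any hand joint lies within `threshold` of the tag region.

    joint_points:      3D coordinates of the hand joint points
    tag_region_points: 3D points sampled from the tag region (assumed representation)
    """
    return any(
        math.dist(joint, point) <= threshold
        for joint in joint_points
        for point in tag_region_points
    )


# Example: a joint about 5 mm from the region, with a 1 cm threshold (units: meters).
touching = is_touching([(0.1, 0.0, 0.0)], [(0.1, 0.005, 0.0)], threshold=0.01)
```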
  • if it is determined that the hand is touching the part corresponding to the functional tag, the tag processing unit 29 executes processing according to the functional tag corresponding to that part (S108). On the other hand, if it is determined that the hand is not touching the part corresponding to the functional tag, the processing of S108 is skipped.
  • the image rendering unit 30 renders an image based on the estimated posture (S109) and displays the rendered image on the display unit 18.
  • the image may also be displayed on another display.
  • the process from S103 to S109 is described as being performed once, but in reality, the process from S103 to S109 may be executed repeatedly, and the posture may be estimated and the image may be drawn in real time in response to the movement of the object.
  • FIG. 6 is a flow diagram showing an example of the process of determining key points and learning the estimation model 26.
  • FIG. 6 is a diagram explaining the process of S102 in FIG. 5 in more detail.
  • the key point determination unit 36 generates multiple key point candidates (S201). More specifically, the key point determination unit 36 may generate multiple key point candidates and their three-dimensional positions from a three-dimensional shape model of the object (more specifically, information on vertices included in the three-dimensional shape model), for example, by using the well-known Farthest Point algorithm.
  • FIG. 7 is a diagram illustrating an example of keypoint candidates generated from an object.
  • FIG. 7 shows an example of keypoints generated when a different object from those in FIGS. 3 and 4 is targeted.
  • seven keypoint candidates K1 to K7 are shown in FIG. 7, but more keypoint candidates may be generated.
  • the estimation learning unit 37 generates training data for the estimation model 26 (S202).
  • the training data includes a training image rendered based on the three-dimensional shape model and ground truth data indicating the positions of each of the keypoint candidates in the training image.
  • FIG. 8 is a flow diagram showing an example of a process for generating training data.
  • FIG. 8 is a diagram explaining the process of S202 in more detail.
  • the estimation learning unit 37 acquires data of a three-dimensional shape model of an object (S301).
  • the estimation learning unit 37 acquires multiple viewpoints for rendering (S302). More precisely, the estimation learning unit 37 acquires multiple camera viewpoints for rendering and shooting directions corresponding to the camera viewpoints.
  • the multiple camera viewpoints may be provided at positions at a constant distance from the origin of the three-dimensional shape model, and the shooting direction is a direction from the camera viewpoints toward the origin of the three-dimensional shape model.
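One way to obtain such viewpoints is sketched below in Python. The text only requires viewpoints at a constant distance from the origin with the shooting direction toward the origin; distributing them with a Fibonacci spiral on the sphere is an assumption of this sketch, as is the function name `sample_viewpoints`.

```python
import math


def sample_viewpoints(n, radius):
    """Return n (position, direction) pairs on a sphere of the given radius.

    Positions follow a Fibonacci spiral (an assumed distribution); each
    direction is the unit vector from the camera position toward the origin.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n        # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)           # radius of the circle at that height
        theta = golden_angle * i
        pos = (radius * r * math.cos(theta), radius * r * math.sin(theta), radius * z)
        direction = tuple(-c / radius for c in pos)
        views.append((pos, direction))
    return views
```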
  • the estimation learning unit 37 renders an image of the object for each viewpoint based on the three-dimensional shape model (S303).
  • the images may be rendered using a known method.
  • the estimation learning unit 37 adds the rendered image as a training image to the training data together with the viewpoint (S304).
  • the estimation learning unit 37 may perform a predetermined data augmentation on the rendered image and use the converted image as the training image.
  • the rendered image may be transformed by perturbing at least one of the luminance, saturation, and hue of the image, or by cropping out a portion of the image and resizing it to the same size as the original.
  • the estimation learning unit 37 may further add a photographed image of the object with a viewpoint to the training image.
  • This photographed image may be the photographed image used to generate the three-dimensional shape model.
  • the camera viewpoint of the photographed image may be the camera viewpoint acquired when generating the three-dimensional shape model.
  • the estimation learning unit 37 generates correct answer data indicating the positions of the keypoints in each training image based on the three-dimensional positions of the keypoint candidates and the viewpoint of the training image for each training image (S305). The estimation learning unit 37 generates correct answer data for each keypoint candidate for each training image.
  • FIG. 9 is a diagram showing an example of the correct answer data.
  • the correct answer data is information indicating the two-dimensional positions of key points of an object in a training image, and may be a position image in which each point indicates the positional relationship (e.g., direction) between that point and the key point.
  • a position image may be generated for each type of keypoint.
  • the position image indicates the relative direction of each point between that point and the keypoint.
  • in FIG. 9, a pattern is drawn according to the value of each point, and the value of each point indicates the direction from the coordinates of that point to the coordinates of the keypoint.
  • FIG. 9 is merely a schematic diagram, and the actual value of each point changes continuously.
  • the position image in FIG. 9 is a vector field image that indicates the relative direction of the keypoint with respect to each point.
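A vector field image of this kind can be generated as in the following Python sketch, given the projected 2D position of a keypoint. The function name `make_vector_field` and the array layout (height, width, 2) are assumptions; the sketch also fills the entire image, whereas a practical implementation might restrict the field to the pixels belonging to the object.

```python
import numpy as np


def make_vector_field(height, width, keypoint_xy):
    """Return an array of shape (height, width, 2) holding, for each pixel,
    the unit vector (dx, dy) pointing from that pixel toward the keypoint."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    dx = keypoint_xy[0] - xs
    dy = keypoint_xy[1] - ys
    norm = np.hypot(dx, dy)
    norm[norm == 0] = 1.0  # the pixel at the keypoint itself keeps a zero vector
    return np.stack([dx / norm, dy / norm], axis=-1)
```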
  • the process shown in FIG. 8 generates training data that includes training images and correct answer data.
  • the estimation learning unit 37 uses the training data to train an estimation model 26 for each keypoint candidate (S203).
  • the trained estimation model 26 is used to detect parts of an object that are hidden by a hand, for example, by the method described below.
  • the key point determination unit 36 outputs an instruction to the user to move the object being held in the user's hand in front of the image capture unit 20.
  • the user follows the instruction to move the object being held in front of the image capture unit 20.
  • the key point determination unit 36 acquires an image of the object held in the hand captured by the image capture unit 20, and further acquires the posture of the object in the image (S204).
  • the key point determination unit 36 may acquire an image of the object that constitutes a video.
  • the key point determination unit 36 may determine the two-dimensional position of a key point candidate based on information output when the image is input to the trained estimation model 26, and acquire the posture by processing similar to that of the posture determination unit 28 based on the two-dimensional position of the key point candidate and its position in the three-dimensional shape model. Note that if the difference between the acquired posture and the posture acquired from the previous image is equal to or less than a threshold value, or if the image of the object and the previously captured image are similar, the key point determination unit 36 may discard the image and repeat the processing of S204.
  • the key point determination unit 36 may acquire an image and posture of the object by having the user adjust the object so that it assumes a specified posture.
  • the key point determination unit 36 may display a rendered image of the object in the specified posture on a VR headset or the like, and have the user adjust the position and posture of the object being held so that it overlaps with the rendered image.
  • the key point determination unit 36 extracts hand regions from each of the images (S205).
  • the hand regions may be extracted simply based on color, or may be extracted using a publicly known trained machine learning model.
  • the key point determination unit 36 determines whether each of the key point candidates is occluded in the extracted hand region (S206). The key point determination unit 36 may determine that a key point candidate is occluded when the position of the key point candidate in the image is within the extracted hand region.
  • the key point determination unit 36 checks whether a repetition end condition is met (S207).
  • the repetition end condition may be that the number of images that have been subject to judgment so far is equal to or greater than a threshold value, or that when the surface of a virtual sphere surrounding the object is divided into multiple parts, all parts are associated with the posture.
  • the part that is in the direction indicated by the posture obtained from the image may be the part associated with the posture.
  • the keypoint determination unit 36 determines keypoints based on the frequency with which each of the keypoint candidates is determined to be occluded and the reliability of the pose estimation (S208).
  • the keypoint determination unit 36 selects a provisional set of keypoints from the multiple keypoint candidates based on the frequency with which each of the keypoint candidates is determined to be occluded. As the initial provisional set, a predetermined number of keypoints that are less frequently occluded among the keypoint candidates may be selected. The keypoint determination unit 36 obtains the reliability of the pose estimation for the keypoints when the pose estimation unit 25 estimates the pose for the image acquired in S204.
  • the reliability may be determined based on the pose of the object estimated by the pose determination unit 28 and the correct pose.
  • the key point determination unit 36 may calculate the pose obtained from the image using SLAM technology or the like as the correct answer, and calculate the reliability based on the difference between the correct pose and the estimated pose.
  • the keypoint determination unit 36 may also reproject the position of each of the keypoint candidates in the image based on the pose estimated by the tentative keypoints and the three-dimensional positions of the keypoint candidates, and store the reprojected positions in the storage unit 12. In this case, the keypoint determination unit 36 may calculate, as the reliability, the average of the distances between the position estimated by the output of the estimation model 26 and the reprojected position for each of the keypoint candidates.
  • if the obtained reliability satisfies a predetermined condition, the keypoint determination unit 36 determines the keypoints in the provisional set as the official keypoints; if not, it selects from the multiple keypoint candidates a provisional set that differs from the previous set, and repeats the process from obtaining the reliability onwards.
  • the new provisional set may be generated, for example, by replacing a randomly selected keypoint from the original set with one of the unselected keypoint candidates.
  • the keypoint candidate to be replaced may be determined, for example, based on a score calculated based on low frequency and distance from the keypoints in the existing set.
  • the learning control unit 35 sets the posture estimation unit 25 to estimate the posture using the determined set of key points (S209).
  • the estimation model 26 used for actual posture estimation may be one that has already been trained on the key points (key point candidates).
  • the processing from S204 to S208 makes it possible to efficiently eliminate keypoints that are likely to be obscured by hands and that are likely to have a negative effect on pose estimation accuracy, making it possible to perform pose estimation more efficiently.
  • by using the reliability of pose estimation, it is possible to prevent a decrease in pose estimation accuracy caused, for example, by keypoints concentrating in a small area.
  • because keypoint candidates that contribute little to pose estimation can be eliminated in advance, it is possible to achieve both accuracy and processing speed in pose estimation, making processing more efficient.
  • the keypoint candidates that are the subject of the processing in S208 may be those whose frequency of obscuration is equal to or less than a threshold.
  • the keypoint determination unit 36 may generate additional keypoint candidates to replace the keypoint candidates whose frequency of obscuration exceeds the threshold, and re-execute the processing from S202 onwards for the replaced keypoint candidates.
  • in the processing described so far, the part that is actually hidden by the hand is obtained; instead, information indicating the part of the object, specified by the user, that is held by the hand may be obtained as the information indicating the part hidden by the hand.
  • FIG. 10 is a flow diagram showing another example of the process of determining key points and learning the estimation model 26.
  • the user manually specifies, as a tag region, a portion of an image of an object shown on a display that is obscured by a hand.
  • the key point determination unit 36 displays an image of an object based on a three-dimensional shape model (S401). Next, based on the user's operation on the image, it acquires a tag area of the object designated by the user and a function tag designated for the tag area (S402).
  • the key point determination unit 36 may display an icon of a paint tool including a paint palette along with an image of the object, and execute a process of coloring the scanned model with a color specified by the user using the paint palette.
  • the user may paint any position while holding the object.
  • an image of a virtual object of the same shape may be transparently superimposed on the image of the actual object, and the virtual object may be colored to visualize the area as if it were painted on the real thing.
  • the paint color may also be associated with a functional tag.
  • the key point determination unit 36 may acquire this colored area as a tag area.
  • the key point determination unit 36 may specify the tag area by having the user select one of a plurality of virtual stickers that each correspond to a functional tag and attach the selected virtual sticker to the object.
  • the keypoint determination unit 36 generates multiple keypoint candidates (S403) in parallel with S401 and S402. This process is similar to S201, so a detailed explanation will be omitted.
  • the key point determination unit 36 calculates the probability that each of the key point candidates is hidden in a tag area that satisfies a predetermined condition (S404).
  • the tag area that satisfies the predetermined condition may be, for example, a tag area designated as an area to be grasped, or a tag area associated with a function that is touched by the hand, such as a switch.
  • the keypoint determination unit 36 may, for example, cast rays in multiple (isotropic) directions from the keypoint candidate, and further calculate a value indicating the ratio of rays that hit the tag region as the probability of that keypoint candidate. Furthermore, if the tag region is a three-dimensional region, the keypoint determination unit 36 may set the probability value to 1 if the keypoint candidate is within that region, and may set the probability value to 0 if it is not.
  • the keypoint determination unit 36 selects keypoints to be used for final pose estimation based on the probability that the keypoint candidates are occluded (S405).
  • the keypoint determination unit 36 may select a predetermined number of keypoints from the keypoint candidates with low probability.
  • the keypoint determination unit 36 may select a keypoint based on the reliability and the probability. For example, the keypoint determination unit 36 may determine a tentative keypoint and calculate the reliability based on the tentative keypoint. The keypoint determination unit 36 generates a score indicating the suitability of the keypoint as a keypoint from the reliability and the probability of each keypoint, and determines the keypoint based on the score.
  • the reliability may be calculated in the following manner. First, the posture estimation unit 25 uses the estimation model 26 trained for the tentative keypoints to estimate the posture of the object in the images captured when the three-dimensional shape model was generated. Next, the keypoint determination unit 36 reprojects the position of each of the keypoint candidates into the image based on the estimated posture and the three-dimensional positions of the keypoint candidates.
  • the keypoint determination unit 36 calculates the average of the distance between the position estimated by the output of the estimation model 26 and the reprojected position as the reliability for each of the keypoint candidates. Note that the keypoints may be determined iteratively using a method similar to S208.
  • the estimation learning unit 37 generates training data for learning the estimation model 26 for each of the keypoint candidates (S406).
  • the estimation learning unit 37 also trains the estimation model 26 for each of the keypoints (S407). Note that if the keypoints (keypoint candidates) have been learned in advance, there is no need to redundantly train the estimation model 26.
  • the method shown in Figure 10 also makes it possible to efficiently eliminate keypoints that are likely to be hidden by the hand and thus have a high probability of adversely affecting pose estimation when the object is held in the hand, making it possible to perform pose estimation more efficiently.
  • because keypoint candidates that contribute little to pose estimation can be eliminated in advance, it is possible to achieve both accuracy and processing speed in pose estimation, making processing more efficient.
  • the part hidden by the hand is identified by assigning some kind of function. Therefore, the operation for identifying the part hidden by the hand itself can be omitted, improving convenience.
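The ray-casting estimate described for S404 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the tag region is a sphere (the source does not specify the region's geometry), and it uses a Fibonacci lattice to obtain roughly isotropic ray directions. The names `fibonacci_directions`, `ray_hits_sphere`, and `occlusion_probability` are hypothetical and do not appear in the source.

```python
import math


def fibonacci_directions(n):
    """Return n roughly isotropic unit vectors (Fibonacci lattice on a sphere)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = golden_angle * i
        directions.append((r * math.cos(phi), r * math.sin(phi), z))
    return directions


def ray_hits_sphere(origin, direction, center, radius):
    """True when the ray origin + t * direction (t >= 0) touches the sphere.

    `direction` must be a unit vector. The sphere is only a stand-in for a
    tag region; a real region would need a mesh or volume intersection test.
    """
    oc = [c - o for o, c in zip(origin, center)]
    oc_sq = sum(a * a for a in oc)
    if oc_sq <= radius * radius:
        return True  # the origin lies inside the region
    along = sum(a * b for a, b in zip(oc, direction))
    closest_sq = oc_sq - along * along
    return along > 0.0 and closest_sq <= radius * radius


def occlusion_probability(candidate, center, radius, n_rays=2000):
    """Ratio of rays cast from the keypoint candidate that hit the tag region."""
    directions = fibonacci_directions(n_rays)
    hits = sum(ray_hits_sphere(candidate, d, center, radius) for d in directions)
    return hits / n_rays
```

A candidate located inside the assumed region yields a ratio of 1, which matches the rule stated above for three-dimensional tag regions; candidates far from the region yield values near 0.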

Abstract

According to the present invention, in order to estimate a posture using key points more appropriately, a posture estimation system acquires information indicating a portion of an object that is hidden by a hand (S204, S402), determines three-dimensional positions of a plurality of key points for estimating the posture of the object on the basis of the information (S208, S405), trains a machine learning model for estimating the positions of the determined plurality of key points in an input image (S203, S407), acquires the estimated positions of the key points in an image capturing the object and the hand on the basis of the output produced by the trained machine learning model in response to receiving the image, and determines the estimated posture of the object in a three-dimensional space on the basis of the estimated positions of the key points.

Description

Posture estimation system, posture estimation method, and program

The present invention relates to a posture estimation system, a posture estimation method, and a program.

There is a technique for estimating the positions of an object's keypoints from an image of the object, and then estimating the object's pose from the estimated keypoints. The three-dimensional positions of the object's keypoints are determined in advance. For example, a machine learning model is trained to predict the positions of keypoints in an image, and the machine learning model is used to estimate the positions of the keypoints in a captured image.

Sida Peng et al. published the paper "PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation" at the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). This paper discloses training a machine learning model using training data that includes input images generated from a 3D model and correct output images, and calculating the image positions of the keypoints used for pose estimation based on the output produced when a captured image is input to the machine learning model.

When estimating pose, there are cases where it is difficult to estimate the positions of keypoints from an image, for example when the object is hidden by a hand. This can lead to a decrease in the accuracy of pose estimation or a decrease in processing speed.

The present invention was made in consideration of the above situation, and its purpose is to provide a technology that enables posture estimation to be performed more appropriately.

In order to solve the above problems, the posture estimation system of the present invention includes one or more processors. The one or more processors acquire information indicating a part of an object that is hidden by a hand, determine the three-dimensional positions of a plurality of keypoints for estimating the posture of the object based on the acquired information, train a machine learning model for estimating the positions of the determined plurality of keypoints in an input image, acquire the estimated positions of the keypoints in an image of the object and the hand based on the output produced when that image is input to the trained machine learning model, and determine an estimated posture of the object in three-dimensional space based on the estimated positions of the keypoints.

In one form of the invention, the information indicating the part hidden by the hand is a plurality of images in which the object is held by the hand, and the one or more processors may determine the three-dimensional positions of the plurality of keypoints based on the frequency with which a plurality of keypoint candidates, determined by a predetermined procedure, are hidden by the hand in the plurality of images in which the object is held by the hand.
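As a rough illustration of this frequency-based selection, the following sketch counts how often each keypoint candidate falls inside an extracted hand region and keeps the least frequently hidden candidates. It is a minimal sketch under stated assumptions: the hand region is given as a boolean mask per image, the 2D positions of the candidates in each image are already known, and all names (`is_occluded`, `occlusion_frequency`, `select_least_occluded`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: mask[y][x] is True inside the hand region.
Mask = Sequence[Sequence[bool]]
Point = Tuple[float, float]


def is_occluded(point: Point, hand_mask: Mask) -> bool:
    """True when the candidate's image position lies inside the hand region."""
    x, y = int(round(point[0])), int(round(point[1]))
    if 0 <= y < len(hand_mask) and 0 <= x < len(hand_mask[0]):
        return bool(hand_mask[y][x])
    return False  # positions outside the image are not counted as hidden


def occlusion_frequency(
    projections: Sequence[Dict[str, Point]], hand_masks: Sequence[Mask]
) -> Dict[str, float]:
    """Fraction of images in which each candidate is hidden by the hand.

    projections[i][name] is the 2D position of candidate `name` in image i.
    """
    counts: Dict[str, int] = {}
    for per_image, mask in zip(projections, hand_masks):
        for name, point in per_image.items():
            counts[name] = counts.get(name, 0) + int(is_occluded(point, mask))
    n_images = len(hand_masks)
    return {name: count / n_images for name, count in counts.items()}


def select_least_occluded(freq: Dict[str, float], k: int) -> List[str]:
    """Pick the k candidates with the lowest occlusion frequency."""
    return sorted(freq, key=lambda name: (freq[name], name))[:k]
```

Ties are broken by candidate name only to keep the result deterministic; the source does not prescribe a tie-breaking rule.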

In one embodiment of the present invention, the information indicating the part hidden by the hand may be information indicating a part of the object, designated by the user, that is held by the hand.

In one embodiment of the present invention, the information indicating the part hidden by the hand is information indicating a part of the object that is designated by the user and associated with a tag, and the one or more processors may determine, based on an image of the object and the hand, whether the part associated with the tag is being operated by the hand, and execute processing according to the tag if it is determined that the part is being operated.

In one embodiment of the present invention, the one or more processors may determine, based on an image of the object and the hand, whether the part associated with the tag is being operated by the hand, and if it is determined that the part is being operated, may execute processing corresponding to the tag based on the magnitude of the operation by the hand.

In addition, in the posture estimation method of the present invention, one or more processors acquire information indicating a part of an object that is hidden by a hand, determine the three-dimensional positions of a plurality of keypoints for estimating the posture of the object based on the acquired information, acquire a trained machine learning model for estimating the positions of the determined plurality of keypoints in an input image, acquire the estimated positions of the keypoints in an image of the object and the hand based on the output produced when that image is input to the acquired machine learning model, and estimate the posture of the object in three-dimensional space based on the estimated positions of the keypoints.

The program of the present invention causes a computer to function as: an acquisition means for acquiring information indicating a part of an object that is hidden by a hand; a keypoint determination means for determining the three-dimensional positions of a plurality of keypoints for estimating the posture of the object based on the acquired information; a model acquisition means for acquiring a trained machine learning model for estimating the positions of the determined plurality of keypoints in an input image; a position acquisition means for acquiring the estimated positions of the keypoints in an image of the object and the hand based on the output produced when that image is input to the acquired machine learning model; and a posture estimation means for estimating the posture of the object in three-dimensional space based on the estimated positions of the keypoints.

The present invention makes it possible to more appropriately estimate posture using keypoints.

FIG. 1 is a diagram showing an example of the configuration of an information processing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of functions implemented in an information processing system according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of a photographed image of an object.
FIG. 4 is a diagram illustrating an example of a tag area associated with a function tag.
FIG. 5 is a flow diagram schematically showing the processing of the information processing system.
FIG. 6 is a flow diagram showing an example of the process of determining keypoints and training an estimation model.
FIG. 7 is a diagram illustrating an example of keypoint candidates generated from an object.
FIG. 8 is a flow diagram showing an example of the process of generating training data and training an estimation model.
FIG. 9 is a diagram showing an example of correct answer data.
FIG. 10 is a flow diagram showing another example of the process of determining keypoints and training an estimation model.

Below, one embodiment of the present invention will be described in detail with reference to the drawings. In this embodiment, a case will be described in which the invention is applied to an information processing system that receives an image of an object, estimates the object's posture, and draws an image according to the estimated posture.

This information processing system includes a machine learning model that, from an image in which an object is captured, outputs information indicating the estimated posture of that object.

FIG. 1 is a diagram showing an example of the configuration of an information processing system according to one embodiment of the present invention. The information processing system according to this embodiment includes an information processing device 10. The information processing device 10 is, for example, a computer such as a game console, a personal computer, or a VR headset. As shown in FIG. 1, the information processing device 10 includes, for example, a processor 11, a storage unit 12, a communication unit 13, an operation unit 16, a display unit 18, and an image capture unit 20. The information processing system may be composed of a single information processing device 10, or of multiple devices including the information processing device 10; for example, the image capture unit 20 or the display unit 18 may be located in a housing separate from the information processing device 10.

The processor 11 is, for example, a program-controlled device such as a CPU that operates according to a program installed in the information processing device 10.

The storage unit 12 is composed of at least some of memory elements such as ROM and RAM and external storage devices such as a solid-state drive. The storage unit 12 stores, among other things, the programs executed by the processor 11.

The communication unit 13 is a communication interface for wired or wireless communication, such as a network interface card, and exchanges data with other computers and terminals via a computer network such as the Internet.

The operation unit 16 is an input device such as a keyboard, mouse, touch panel, or game console controller. It receives operation input from the user and outputs a signal indicating the content of the input to the processor 11.

The display unit 18 is a display device such as a liquid crystal display, and displays various images according to instructions from the processor 11. The display unit 18 may be built into a VR headset, or may be a device that outputs a video signal to an external display device.

The image capture unit 20 is an imaging device including an image sensor. The image capture unit 20 may be a camera capable of acquiring visible RGB images, or a camera capable of acquiring visible RGB images together with depth information synchronized with those RGB images. The image capture unit 20 in this embodiment may be, for example, a camera capable of capturing moving images, or a camera built into a VR headset. The image capture unit 20 may be external to the information processing device 10, in which case the information processing device 10 and the image capture unit 20 may be connected via the communication unit 13 or the input/output unit described below.

The information processing device 10 may also include audio input/output devices such as a microphone and a speaker. The information processing device 10 may also include, for example, a communication interface such as a network board, an optical disk drive that reads optical disks such as DVD-ROMs and Blu-ray (registered trademark) disks, and an input/output unit (a USB (Universal Serial Bus) port) for exchanging data with external devices.

FIG. 2 is a block diagram showing an example of functions implemented in an information processing system according to one embodiment of the present invention. As shown in FIG. 2, the information processing system functionally includes a posture estimation unit 25, a tag processing unit 29, an image rendering unit 30, a shape model acquisition unit 31, an occlusion information acquisition unit 32, and a learning control unit 35. The posture estimation unit 25 functionally includes an estimation model 26, a position acquisition unit 27, and a posture determination unit 28. The learning control unit 35 functionally includes a key point determination unit 36 and an estimation learning unit 37. The estimation model 26 is a type of machine learning model.

These functions are mainly implemented by the processor 11 and the storage unit 12. More specifically, these functions may be implemented by having the processor 11 execute a program that is installed in the information processing device 10, which is a computer, and that includes execution instructions corresponding to the above functions. This program may be supplied to the information processing device 10 via a computer-readable information storage medium such as an optical disk, a magnetic disk, or a flash memory, or via the Internet or the like.

Note that the information processing system according to this embodiment does not necessarily have to implement all of the functions shown in FIG. 2, and may also implement functions other than those shown in FIG. 2.

The posture estimation unit 25 estimates the posture of the target object based on the information output when an input image is input to the estimation model 26. The input image is an image of the object captured by the image capture unit 20. The estimation model 26 is a machine learning model that is trained using training data; when input data is input, the trained estimation model 26 outputs data as an estimation result.

FIG. 3 is a diagram showing an example of a photographed image of an object. The target object 51 shown in FIG. 3 is held, for example, by a hand 53, and is photographed by the image capture unit 20.

Information on an image of the target object is input to the trained estimation model 26, and the estimation model 26 outputs information indicating the positions of keypoints for estimating the posture of the object. More specifically, the estimation model 26 outputs, for each of the plurality of keypoints set for the object, an image indicating the position of that keypoint. An estimation model 26 may exist for each keypoint or for each keypoint candidate.

The training data for the estimation model 26 includes multiple training images rendered from a three-dimensional shape model of the target object, and correct answer data indicating the positions of the object's keypoints in the training images. Keypoints are virtual points within the object that are used to calculate the posture. The data output by the estimation model 26 may be a position image in which each point indicates the positional relationship (for example, the relative direction) between that point and a keypoint, or a position image that is a heat map in which each point indicates the probability that a keypoint exists there. The training of the estimation model 26 will be described in detail later.
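To make the first kind of position image concrete, the following minimal sketch builds a direction field in which every pixel stores the unit vector pointing from that pixel toward a single keypoint. The function name `direction_field` and the nested-list representation are assumptions made for illustration; the source does not specify how the position image is encoded.

```python
import math


def direction_field(width, height, keypoint):
    """Return field[y][x] = unit vector (dx, dy) from pixel (x, y) to the keypoint.

    The pixel that coincides with the keypoint has no defined direction
    and is assigned (0.0, 0.0).
    """
    kx, ky = keypoint
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = kx - x, ky - y
            norm = math.hypot(dx, dy)
            row.append((dx / norm, dy / norm) if norm > 0 else (0.0, 0.0))
        field.append(row)
    return field
```

For example, `direction_field(5, 5, (2, 2))[2][0]` is `(1.0, 0.0)`, because the pixel at the left edge of the middle row points straight to the right toward the keypoint.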

 The input image may be an image obtained by processing the image of the object captured by the imaging unit 20. For example, it may be an image in which the area other than the target object is masked, or an image enlarged or reduced so that the object has a predetermined size in the image.

 The position acquisition unit 27 determines the two-dimensional positions of the keypoints in the input image based on the output of the trained estimation model 26 when an image in which the object and the hand are captured is input to it. For example, the position acquisition unit 27 determines candidates for the two-dimensional position of a keypoint in the input image based on the position image output from the estimation model 26. Specifically, the position acquisition unit 27 may calculate a candidate point for the keypoint from each combination of two arbitrary points in the position image, and generate for each candidate point a score indicating how well it matches the directions indicated by the points of the position image. The position acquisition unit 27 may then estimate the candidate point with the highest score as the position of the keypoint. The position acquisition unit 27 repeats this process for each keypoint.
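The voting step described above can be sketched as follows. This is a minimal illustration that assumes the position image has already been converted into an array of pixel coordinates and an array of unit direction vectors; the function name, the number of hypotheses, and the cosine threshold are hypothetical choices and do not come from the source.

```python
import numpy as np

def vote_keypoint(pixels, directions, num_hypotheses=64, cos_thresh=0.99, seed=0):
    """Estimate one 2D keypoint from per-pixel unit directions by RANSAC-style voting."""
    rng = np.random.default_rng(seed)
    best_point, best_score = None, -1
    n = len(pixels)
    for _ in range(num_hypotheses):
        i, j = rng.choice(n, size=2, replace=False)
        # Intersect the rays p_i + a*d_i and p_j + b*d_j to get a candidate point.
        A = np.column_stack((directions[i], -directions[j]))
        if abs(np.linalg.det(A)) < 1e-8:
            continue  # nearly parallel rays give no stable intersection
        a, _b = np.linalg.solve(A, pixels[j] - pixels[i])
        candidate = pixels[i] + a * directions[i]
        # Score: number of pixels whose direction agrees with the candidate.
        to_candidate = candidate - pixels
        norms = np.linalg.norm(to_candidate, axis=1)
        valid = norms > 1e-8
        cos = np.einsum("ij,ij->i", to_candidate[valid], directions[valid]) / norms[valid]
        score = int(np.sum(cos >= cos_thresh))
        if score > best_score:
            best_point, best_score = candidate, score
    return best_point, best_score
```

Sampling a fixed number of random pairs, rather than enumerating every pair, keeps the cost linear in the number of hypotheses; the description itself does not say how the pairs are chosen.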

 The posture determination unit 28 estimates the posture of the object based on information indicating the two-dimensional positions of the keypoints in the input image and information indicating the three-dimensional positions of the keypoints in the three-dimensional shape model of the target object, and outputs posture data indicating the estimated posture. The posture of the object is estimated by a known algorithm. For example, it may be estimated by a solver for the Perspective-n-Point (PnP) problem, such as EPnP. The posture determination unit 28 may estimate not only the posture of the object but also the position of the object in the input image, and the posture data may include information indicating that position.

 It is assumed that the intrinsic camera parameters of the imaging unit 20 have been obtained in advance through calibration. These parameters are used when solving the PnP problem.
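To make the role of the intrinsic parameters concrete, the following sketch recovers a rotation and translation from 2D-3D keypoint matches. It is not EPnP: it uses the simpler direct linear transform (DLT), which needs at least six non-coplanar keypoints and noise-free or low-noise inputs, and it is shown only as an illustration. The function name and the intrinsic matrix `K` passed to it are assumptions.

```python
import numpy as np

def solve_pnp_dlt(points_3d, points_2d, K):
    """Recover rotation R and translation t from >= 6 non-coplanar 2D-3D matches."""
    # Convert pixels to normalized camera coordinates using the intrinsics.
    ones = np.ones((len(points_2d), 1))
    normalized = (np.linalg.inv(K) @ np.hstack((points_2d, ones)).T).T
    X = np.hstack((points_3d, ones))
    rows = []
    for Xh, (x, y, _) in zip(X, normalized):
        rows.append(np.concatenate((Xh, np.zeros(4), -x * Xh)))
        rows.append(np.concatenate((np.zeros(4), Xh, -y * Xh)))
    # The null vector of the stacked system is the 3x4 matrix [R|t] up to scale.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    P = vt[-1].reshape(3, 4)
    if np.linalg.det(P[:, :3]) < 0:
        P = -P  # resolve the sign ambiguity of the null vector
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt              # nearest rotation matrix to the scaled 3x3 block
    t = P[:, 3] / S.mean()  # undo the unknown scale
    return R, t
```

In practice a library solver that supports four or more points and handles noise robustly would normally be preferred over this sketch.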

 The details of the estimation model 26, the position acquisition unit 27, and the posture determination unit 28 may be those described in the paper "PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation".

 The tag processing unit 29 determines whether a part of the object associated with a functional tag is being operated by the hand, based on information indicating that part and on an image in which the object and the hand are captured. When the part is determined to be operated, the tag processing unit 29 executes processing according to the functional tag. In that case, the tag processing unit 29 may execute the processing according to the tag based on the magnitude of the operation by the hand.

 FIG. 4 is a diagram illustrating an example of tag regions 61 and 62 associated with functional tags. Each of the tag regions 61 and 62 is a partial region of the target object 51. The tag regions 61 and 62 may be regions on the surface of the target object 51 or three-dimensional regions that include its interior. The tag regions 61 and 62 are associated with mutually different functional tags. For example, the tag region 61 may be associated with a functional tag for a switch, and the tag region 62 may be associated with a functional tag for a part to be gripped. Each of the tag regions 61 and 62 may be any of a region that the user can grip, a region that displays a virtual touch screen and allows touch interaction, and a region that emits a light source or particles for use as a light, and may be associated with the functional tag corresponding to that region.

 The image rendering unit 30 renders an image based on the estimated posture of the object. The image rendering unit 30 may render a three-dimensional image of the object based on the estimated posture and the three-dimensional shape model. The image rendering unit 30 may also determine, based on the estimated posture, the posture of an object for rendering, such as an object in a VR image, and render that object.

 The shape model acquisition unit 31 acquires multiple images of the target object captured by the imaging unit 20, and generates and acquires a three-dimensional shape model of the object from those captured images. More specifically, the shape model acquisition unit 31 extracts, from each of the captured images, multiple feature vectors indicating local features. From mutually corresponding feature vectors extracted from different captured images, and from the positions in the captured images at which they were extracted, it obtains the three-dimensional position of the point from which the feature vectors were extracted. The shape model acquisition unit 31 then acquires the three-dimensional shape model of the object based on those three-dimensional positions. This is a known method that is also used in software implementing so-called SfM and Visual SLAM, so a detailed description is omitted.

 The occlusion information acquisition unit 32 acquires information indicating a part of the target object that is hidden by the hand. Here, the hand is assumed to be holding the object. More specifically, the information indicating the part hidden by the hand is at least one of the following: multiple images in which the target object is held by the hand, and information indicating the part of the target object designated by the user as the part held by the hand.

 The occlusion information acquisition unit 32 may acquire, as the information indicating the part hidden by the hand, multiple images captured by the imaging unit 20 in which the target object is held by the hand.

 The occlusion information acquisition unit 32 may acquire, as the information indicating the part hidden by the hand, information indicating a part of the object that is designated by the user as the part held by the hand. The tag regions 61 and 62 may be designated as such parts of the object.

 The occlusion information acquisition unit 32 may input information on the target object to a trained machine learning model that estimates the region held by a hand, and identify the part of the object from the output of that model. Such machine learning models are publicly known, so a detailed description is omitted.

 The learning control unit 35 determines the keypoints of the target object based on the three-dimensional shape model of the object, and trains the estimation model 26.

 The keypoint determination unit 36 may determine the three-dimensional positions of multiple keypoints for estimating the posture of the target object, based on the three-dimensional shape model of the target object and the information indicating the part hidden by the hand. When that information is multiple images in which the object is held by the hand, the keypoint determination unit 36 may determine the keypoints based on how frequently each of multiple keypoint candidates, determined by a predetermined method, is hidden by the hand in those images, and may determine the three-dimensional positions of the determined keypoints.

 The keypoint determination unit 36 may generate the set of keypoint candidates by, for example, the well-known Farthest Point algorithm. The number of keypoints N may be any integer of 4 or more. The number of keypoint candidates may be any integer greater than N (for example, at least 1.3 times the number of keypoints).

 When the information indicating the part hidden by the hand is information indicating a part of the object designated by the user as the part held by the hand, the keypoint determination unit 36 may determine the keypoints from the keypoint candidates based on that part, and may determine the three-dimensional positions of the determined keypoints.

 The keypoint determination unit 36 may determine the keypoints from the keypoint candidates further based on the reliability of posture estimation using those keypoints. The method of calculating the reliability is described later.

 The estimation learning unit 37 trains the estimation model 26, which is a machine learning model for estimating the positions of the determined keypoints in an input image. More specifically, the estimation learning unit 37 generates the training data used to train the estimation model 26 and trains the estimation model 26 with that training data.

 The training data includes multiple training images rendered from the three-dimensional shape model of the target object, and ground truth data indicating the positions of the object's keypoints in the training images. At least in the initial training data, the keypoints for which the estimation learning unit 37 generates ground truth data may be those included in the set of keypoint candidates. The estimation learning unit 37 may generate ground truth data for all keypoint candidates included in the initial set and train the estimation model 26.

 More specifically, the estimation learning unit 37 may determine the positions of the keypoint candidates in a training image based on the posture of the rendered object, and generate, for each keypoint candidate, a ground truth position image corresponding to its position. The training data may also include training images in which the object is actually captured, together with position images generated from the posture of the object in those images as estimated by so-called SfM or Visual SLAM.

 In this embodiment, the estimation learning unit 37 trains an estimation model 26 for each keypoint candidate. The estimation models 26 for the keypoint candidates that are selected are then used as the keypoint estimation models 26 for posture estimation (inference) on input images.

 The processing of the information processing system is described below. FIG. 5 is a flow diagram schematically showing the processing of the information processing system.

 First, the information processing system generates a three-dimensional shape model of the target object by a known method, based on images in which the object is captured (S101).

 The learning control unit 35 included in the information processing system then determines the positions of the keypoints based on the three-dimensional shape model and the information indicating the part hidden by the hand, and trains the estimation model 26 for posture estimation (S102).

 Once the estimation model 26 has been trained, the posture estimation unit 25 inputs an input image in which the object is captured to the trained estimation model 26 (S103) and acquires the data that the estimation model 26 outputs. It then determines the two-dimensional positions of the keypoints in the image based on that output (S104).

 More specifically, when the output of the estimation model 26 is a position image in which each point indicates the relative direction to the keypoint, the position acquisition unit 27 included in the posture estimation unit 25 calculates candidates for the keypoint position from the points of the position image and determines the keypoint position based on those candidates. When the output of the estimation model 26 is a heat-map position image, the position acquisition unit 27 determines, by a known method, the position of the point with the highest probability as the keypoint position.

 The posture estimation unit 25 estimates the posture of the object based on the determined two-dimensional positions of the keypoints and the three-dimensional positions of those keypoints in the three-dimensional shape model (S105).

 The tag processing unit 29 also acquires information indicating the hand pose from the input image (S106). As the hand pose, an input image of the captured fingers, or the coordinates of joint points in three-dimensional space, may be acquired. A machine learning model trained with images and ground truth data indicating joint points may be used to acquire the hand pose. The input image may include not only a visible-light image but also a depth image. Methods for acquiring a hand pose are known, so a detailed description is omitted.

 Based on the acquired information indicating the hand pose, the tag processing unit 29 determines whether the hand is touching a part of the object corresponding to a functional tag (S107). The tag processing unit 29 may make this determination according to whether the distance between the three-dimensional coordinates of any joint point of the hand and the part associated with the functional tag (for example, the tag regions 61 and 62) is equal to or less than a threshold.
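The distance test in S107 can be sketched as follows, assuming the tag region is represented by a set of sampled three-dimensional points and that the joint coordinates are expressed in the same coordinate system. The function name and this point-set representation are illustrative assumptions; the description does not specify how a tag region is stored.

```python
import math

def is_touching(joints, region_points, threshold):
    """Return True if any hand joint lies within `threshold` of any sampled region point."""
    return any(
        math.dist(joint, point) <= threshold
        for joint in joints
        for point in region_points
    )
```

For example, `is_touching([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.005)], threshold=0.01)` evaluates to `True`, because the joint is 5 mm from the region when the units are metres.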

 When it is determined that the hand is touching the part corresponding to a functional tag (S107), the tag processing unit 29 executes processing according to the functional tag corresponding to that part (S108). When it is determined that the hand is not touching such a part, the processing of S108 is skipped.

 The image rendering unit 30 then renders an image based on the estimated posture (S109) and causes the display unit 18 to display the rendered image. The image may instead be displayed on another display.

 In the example of FIG. 5, the processing from S103 to S109 is shown as being performed once. In practice, the processing from S103 to S109 may be executed repeatedly, so that posture estimation and image rendering follow the movement of the object in real time.

 FIG. 6 is a flow diagram showing an example of the processing for determining the keypoints and training the estimation model 26. FIG. 6 illustrates the processing of S102 in FIG. 5 in more detail.

 First, the keypoint determination unit 36 generates multiple keypoint candidates (S201). More specifically, the keypoint determination unit 36 may generate the keypoint candidates and their three-dimensional positions from the three-dimensional shape model of the object (specifically, from the vertex information included in the three-dimensional shape model) by, for example, the well-known Farthest Point algorithm.
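One common reading of the Farthest Point algorithm is greedy farthest point sampling over the model vertices, sketched below. The choice of the first point (the vertex nearest the centroid) and the function name are assumptions made for this illustration; the description does not state how the first point is chosen.

```python
import numpy as np

def farthest_point_sampling(vertices, num_candidates):
    """Return indices of vertices chosen greedily so that each is far from those already chosen."""
    vertices = np.asarray(vertices, dtype=float)
    # Start from the vertex closest to the centroid (an arbitrary but deterministic choice).
    center = vertices.mean(axis=0)
    selected = [int(np.argmin(np.linalg.norm(vertices - center, axis=1)))]
    # min_dist[i] is the distance from vertex i to its nearest selected vertex.
    min_dist = np.linalg.norm(vertices - vertices[selected[0]], axis=1)
    while len(selected) < num_candidates:
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(vertices - vertices[idx], axis=1))
    return selected
```

Because each new point maximizes its distance to the points already chosen, the candidates spread over the surface of the model, which suits candidates that are later filtered by occlusion frequency.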

 FIG. 7 is a diagram illustrating an example of keypoint candidates generated from an object. FIG. 7 shows an example of the keypoint candidates generated for an object different from the one in FIGS. 3 and 4. For ease of explanation, FIG. 7 shows seven keypoint candidates K1 to K7, but more keypoint candidates may be generated.

 Once the keypoint candidates have been generated, the estimation learning unit 37 generates training data for the estimation model 26 (S202). The training data includes training images rendered from the three-dimensional shape model, and ground truth data indicating the position of each keypoint candidate in the training images.

 FIG. 8 is a flow diagram showing an example of the processing for generating training data. FIG. 8 illustrates the processing of S202 in more detail. First, the estimation learning unit 37 acquires the data of the three-dimensional shape model of the object (S301). The estimation learning unit 37 then acquires multiple viewpoints for rendering (S302). More precisely, it acquires multiple camera viewpoints for rendering and the shooting direction for each camera viewpoint. The camera viewpoints may be placed at positions at a constant distance from the origin of the three-dimensional shape model, and the shooting direction is the direction from the camera viewpoint toward that origin.
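Viewpoints at a constant distance from the model origin, each looking toward the origin, can be generated as in the following sketch. The Fibonacci-spiral placement is one possible way to spread the viewpoints and is an assumption of this example; the description only requires a constant distance and an origin-facing direction.

```python
import math

def sphere_viewpoints(count, radius):
    """Return (position, direction) pairs spread over a sphere centred on the model origin."""
    golden = math.pi * (3.0 - math.sqrt(5.0))  # golden angle in radians
    views = []
    for i in range(count):
        z = 1.0 - 2.0 * (i + 0.5) / count      # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)
        theta = golden * i
        unit = (r * math.cos(theta), r * math.sin(theta), z)
        position = tuple(radius * c for c in unit)
        direction = tuple(-c for c in unit)    # unit vector from the camera toward the origin
        views.append((position, direction))
    return views
```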

 Once the viewpoints have been acquired, the estimation learning unit 37 renders an image of the object for each viewpoint based on the three-dimensional shape model (S303). The images may be rendered by a known method.

 Once the images have been rendered, the estimation learning unit 37 adds each rendered image, together with its viewpoint, to the training data as a training image (S304). Here, the estimation learning unit 37 may apply predetermined data augmentation to a rendered image and use the transformed image as the training image. The augmentation may, for example, perturb at least one of the luminance, saturation, and hue of the rendered image, or crop part of the image and resize it to the original size.

 The estimation learning unit 37 may additionally add captured images of the object, each with its viewpoint, to the training images. These captured images may be the ones used to generate the three-dimensional shape model, and their camera viewpoints may be the camera viewpoints obtained when the three-dimensional shape model was generated.

 Once the training images are prepared, the estimation learning unit 37 generates, for each training image, ground truth data indicating the positions of the keypoints in that image, based on the three-dimensional positions of the keypoint candidates and the viewpoint of the training image (S305). The estimation learning unit 37 generates ground truth data for every keypoint candidate in every training image.
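The two-dimensional position of a keypoint candidate in a training image follows from its three-dimensional position and the viewpoint by the standard pinhole projection. The sketch below assumes that the viewpoint is given as a rotation `R` and translation `t` from model to camera coordinates and that `K` is the intrinsic matrix; these names are illustrative.

```python
import numpy as np

def project_points(points_3d, R, t, K):
    """Project Nx3 model-space points to Nx2 pixel coordinates with a pinhole camera."""
    cam = np.asarray(points_3d, dtype=float) @ R.T + t  # model -> camera coordinates
    uvw = cam @ K.T                                     # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]                     # perspective divide
```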

 FIG. 9 is a diagram schematically showing an example of the ground truth data. The ground truth data is information indicating the two-dimensional position of a keypoint of the object in a training image, and may be a position image in which each point indicates the positional relationship (for example, the direction) between that point and the keypoint.

 A position image may be generated for each type of keypoint. At each point, the position image indicates the relative direction between that point and the keypoint. In the position image shown in FIG. 9, a pattern is drawn according to the value of each point, and the value of each point indicates the direction from the coordinates of that point to the coordinates of the keypoint. FIG. 9 is only schematic; the actual values change continuously from point to point. The position image in FIG. 9 is a vector field image that indicates, at each point, the relative direction of the keypoint as seen from that point.
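A vector field position image of this kind can be computed directly from the keypoint's two-dimensional position, as sketched below. The function name is hypothetical, and the sketch fills every pixel; in PVNet-style training the field is typically evaluated only on pixels belonging to the object, which is not shown here.

```python
import numpy as np

def direction_field(height, width, keypoint_xy):
    """Return an HxWx2 array of unit vectors pointing from each pixel toward the keypoint."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    offsets = np.stack((keypoint_xy[0] - xs, keypoint_xy[1] - ys), axis=-1)
    norms = np.linalg.norm(offsets, axis=-1, keepdims=True)
    # The pixel exactly at the keypoint has no direction and is left as a zero vector.
    return offsets / np.maximum(norms, 1e-8)
```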

 The processing shown in FIG. 8 generates training data that includes the training images and the ground truth data.

 Once the training data has been generated, the estimation learning unit 37 uses it to train the estimation model 26 for each keypoint candidate (S203). The trained estimation models 26 are then used to detect the parts of the object hidden by the hand, for example by the method described below.

 Once the estimation model 26 has been trained, the keypoint determination unit 36 outputs an instruction asking the user to move the object, held in the hand, in front of the imaging unit 20. Following the instruction, the user moves the held object in front of the imaging unit 20.

 The keypoint determination unit 36 then acquires an image, captured by the imaging unit 20, of the object held in the hand, and further acquires the posture of the object in that image (S204). The keypoint determination unit 36 may acquire images of the object that are frames of a video. To acquire the posture, the keypoint determination unit 36 may determine the two-dimensional positions of the keypoint candidates based on the information output when the image is input to the trained estimation model 26, and obtain the posture from those two-dimensional positions and the corresponding positions in the three-dimensional shape model by the same processing as the posture determination unit 28. If the difference between the acquired posture and a posture acquired from an earlier image is equal to or less than a threshold, or if the image of the object is similar to a previously captured image, the keypoint determination unit 36 may discard the image and repeat the processing of S204.
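One way to measure the difference between two postures is the rotation angle between their rotation matrices, as in the sketch below. The description does not specify the metric, so the angle measure, the 10-degree default, and both function names are assumptions.

```python
import numpy as np

def rotation_angle(R_a, R_b):
    """Angle in radians of the rotation that takes posture R_a to posture R_b."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def is_new_pose(R, kept_rotations, min_angle=np.deg2rad(10.0)):
    """Keep a frame only if its posture differs enough from every posture kept so far."""
    return all(rotation_angle(R, prev) > min_angle for prev in kept_rotations)
```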

 If the posture cannot be acquired, for example because keypoint estimation fails, the keypoint determination unit 36 may acquire the image and posture of the object by having the user adjust the object to a specified posture. The keypoint determination unit 36 may display a rendered image of the object in the specified posture on a VR headset or the like, and have the user adjust the position and posture of the held object so that it overlaps the rendered image.

 Once an image has been acquired, the keypoint determination unit 36 extracts the hand region from that image (S205). The hand region may be extracted simply based on color, or by a known trained machine learning model.

 The keypoint determination unit 36 determines whether each keypoint candidate is hidden by the extracted hand region (S206). The keypoint determination unit 36 may determine that a keypoint candidate is hidden when the position of that candidate in the image lies within the extracted hand region.
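The occlusion test, and the per-candidate frequency that is later used to select keypoints, can be sketched as follows. The sketch assumes that each frame provides a boolean hand mask indexed as `mask[row][col]` and the projected two-dimensional position of every candidate; the function name and data layout are hypothetical.

```python
def occlusion_frequency(frames):
    """frames: list of (hand_mask, keypoints_xy). Returns the hidden fraction per candidate."""
    if not frames:
        return []
    hidden = [0] * len(frames[0][1])
    for mask, keypoints in frames:
        for k, (x, y) in enumerate(keypoints):
            row, col = int(round(y)), int(round(x))
            inside = 0 <= row < len(mask) and 0 <= col < len(mask[0])
            # A candidate counts as hidden when its pixel falls inside the hand region.
            if inside and mask[row][col]:
                hidden[k] += 1
    return [count / len(frames) for count in hidden]
```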

 The keypoint determination unit 36 then checks whether the repetition end condition is satisfied (S207). The repetition end condition may be that the number of images examined so far is equal to or greater than a threshold, or that, when the surface of a virtual sphere surrounding the object is divided into multiple patches, every patch has been associated with a posture. A patch is associated with a posture when it lies in the direction indicated by a posture acquired from an image.
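The end condition can be sketched by dividing the sphere into azimuth and elevation bins and recording which bins have been observed. The bin counts, the image limit, and the representation of each posture as a viewing direction vector are assumptions of this example; the description does not say how the sphere is divided.

```python
import math

def patch_index(direction, n_azimuth=8, n_elevation=4):
    """Map a non-zero 3D direction to the index of the sphere patch that contains it."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x) % (2.0 * math.pi)
    elevation = math.acos(max(-1.0, min(1.0, z / norm)))  # angle from +z, in 0..pi
    a = min(int(azimuth / (2.0 * math.pi) * n_azimuth), n_azimuth - 1)
    e = min(int(elevation / math.pi * n_elevation), n_elevation - 1)
    return e * n_azimuth + a

def should_stop(view_directions, max_images=200, n_azimuth=8, n_elevation=4):
    """Stop when enough images were examined or every patch has been observed."""
    covered = {patch_index(d, n_azimuth, n_elevation) for d in view_directions}
    return len(view_directions) >= max_images or len(covered) == n_azimuth * n_elevation
```

Equal-angle bins are simple but not equal in area, so patches near the poles are smaller than those near the equator.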

 If the repetition end condition is not satisfied (N in S207), the processing from S204 onward is repeated. If it is satisfied (Y in S207), the keypoint determination unit 36 determines the keypoints based on the frequency with which each keypoint candidate was determined to be hidden and on the reliability of posture estimation (S208).

 More specifically, the keypoint determination unit 36 selects a provisional set of keypoints from the keypoint candidates based on the frequency with which each candidate was determined to be hidden. As the initial provisional set, a predetermined number of keypoint candidates may be selected in ascending order of how often they are hidden. For the keypoints in the set, the keypoint determination unit 36 acquires the reliability of posture estimation obtained when the posture estimation unit 25 estimates the posture for the images acquired in S204.

 The reliability may be determined based on the posture of the object estimated by the posture determination unit 28 and the correct posture. For example, the keypoint determination unit 36 may treat a posture obtained from the image by SLAM or a similar technique as the correct posture, and calculate the reliability from the difference between that correct posture and the estimated posture.

 Alternatively, the keypoint determination unit 36 may reproject the position of each keypoint candidate into the image, based on the posture estimated with the provisional keypoints and on the three-dimensional positions of the keypoint candidates, and store the reprojected positions in the storage unit 12. In this case, the keypoint determination unit 36 may calculate, as the reliability, the average over the keypoint candidates of the distance between the position estimated from the output of the estimation model 26 and the reprojected position.
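The average distance between estimated and reprojected positions can be computed as in the sketch below. Note that a smaller average distance indicates better agreement; how this value is mapped to a reliability that is compared against a threshold is not stated in the description, so the sketch returns only the raw distance. The function name is hypothetical.

```python
import numpy as np

def mean_reprojection_distance(estimated_xy, reprojected_xy):
    """Average pixel distance between model-estimated and reprojected keypoint positions."""
    diffs = np.asarray(estimated_xy, dtype=float) - np.asarray(reprojected_xy, dtype=float)
    return float(np.linalg.norm(diffs, axis=1).mean())
```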

If the reliability is higher than a threshold, the keypoint determination unit 36 adopts the keypoints in the set as the final keypoints; otherwise, it selects from the keypoint candidates a provisional set that differs from the previous sets, and repeats the processing from the acquisition of the reliability onward. The new provisional set may be generated, for example, by exchanging a randomly selected keypoint in the original set for one of the unselected keypoint candidates. The keypoint candidate to be swapped in may be determined, for example, based on a score calculated from how low its occlusion frequency is and how far it is from the keypoints in the existing set.
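The iterative search described above can be sketched as follows. The scoring formula in `swap_score`, the `evaluate` callback that stands in for running the pose estimator, the iteration cap, and the toy data are all hypothetical; the text only says that the score depends on low occlusion frequency and on distance from the keypoints already in the set.

```python
import math
import random


def swap_score(candidate, kept, frequency, positions, distance_weight=1.0):
    """Favor candidates that are rarely occluded and far from the kept keypoints."""
    nearest = min(
        (math.dist(positions[candidate], positions[k]) for k in kept),
        default=0.0,
    )
    return (1.0 - frequency[candidate]) + distance_weight * nearest


def refine_keypoint_set(initial_set, frequency, positions, evaluate,
                        threshold, max_iters=100, seed=0):
    """Swap keypoints until `evaluate` exceeds `threshold`, or give up (None)."""
    rng = random.Random(seed)
    current = list(initial_set)
    tried = {frozenset(current)}
    for _ in range(max_iters):
        if evaluate(current) > threshold:
            return current
        unselected = [c for c in positions if c not in current]
        dropped = rng.choice(current)  # randomly selected keypoint to exchange
        kept = [k for k in current if k != dropped]
        ranked = sorted(
            unselected,
            key=lambda c: swap_score(c, kept, frequency, positions),
            reverse=True,
        )
        for candidate in ranked:
            proposal = kept + [candidate]
            if frozenset(proposal) not in tried:  # must differ from earlier sets
                tried.add(frozenset(proposal))
                current = proposal
                break
    return None


positions = {"k0": (0, 0, 0), "k1": (1, 0, 0), "k2": (0.5, 0, 0), "k3": (0, 5, 0)}
frequency = {"k0": 0.0, "k1": 0.1, "k2": 0.9, "k3": 0.1}
toy_evaluate = lambda s: 1.0 if "k3" in s else 0.0  # stand-in for real reliability
result = refine_keypoint_set(["k0", "k1"], frequency, positions, toy_evaluate, 0.5)
print(result)
```

Recording the sets already tried is one simple way to guarantee that each new provisional set differs from the previous ones; other bookkeeping would work equally well.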

Once the set of keypoints is determined, the learning control unit 35 configures the pose estimation unit 25 to estimate the pose using the determined set of keypoints (S209). Note that the estimation model 26 used for actual pose estimation may be one that has already been trained on those keypoints (keypoint candidates).

The processing from S204 to S208 efficiently excludes keypoints that are easily hidden by the hand and are therefore likely to degrade pose estimation accuracy, allowing pose estimation to be performed more efficiently. Using the reliability of the pose estimation as well prevents a drop in accuracy caused, for example, by keypoints concentrating in a narrow area. Furthermore, because keypoint candidates that contribute little to pose estimation can be excluded in advance, both accuracy and processing speed can be achieved in pose estimation, making the processing more efficient.

Note that the keypoint candidates subject to the processing in S208 may be limited to those whose occlusion frequency is equal to or less than a threshold. Here, if the number of keypoint candidates is less than the number of keypoints plus a predetermined number (e.g., 2), the keypoint determination unit 36 may generate additional keypoint candidates to replace the candidates whose occlusion frequency exceeds the threshold, and the processing from S202 onward may be re-executed for the replacement candidates.
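The threshold filter and the check on the number of candidates might look like the sketch below. It assumes the count being compared is the number of candidates that pass the threshold, which is one reading of the paragraph; the function name and return layout are hypothetical.

```python
def filter_candidates(frequency, threshold, num_keypoints, margin=2):
    """Split candidates by occlusion frequency and report whether more are needed.

    `margin` plays the role of the "predetermined number (e.g., 2)" in the text.
    """
    kept = [cid for cid, f in frequency.items() if f <= threshold]
    rejected = [cid for cid, f in frequency.items() if f > threshold]
    needs_more = len(kept) < num_keypoints + margin
    return kept, rejected, needs_more


freq = {"k0": 0.1, "k1": 0.6, "k2": 0.2, "k3": 0.9}
kept, rejected, needs_more = filter_candidates(freq, threshold=0.5, num_keypoints=2)
print(kept, rejected, needs_more)
```

When `needs_more` is true, the rejected candidates would be replaced with newly generated ones and the occlusion measurement repeated for the replacements.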

In the process shown in FIG. 6, the portion that is actually hidden by the hand is acquired. Instead, information indicating a portion of the object that is designated by the user and is grasped by the hand may be acquired as the information indicating the portion hidden by the hand.

FIG. 10 is a flow diagram showing another example of the process of determining keypoints and training the estimation model 26. In this example, the user manually designates, as a tag region, the portion hidden by the hand on an image of the object shown on a display.

First, the keypoint determination unit 36 displays an image of the object based on the three-dimensional shape model (S401). Next, based on the user's operations on the image, it acquires the tag region of the object designated by the user and the function tag designated for that tag region (S402).

The keypoint determination unit 36 may display, together with the image of the object, an icon of a paint tool that includes a paint palette, and color the scanned model with the color the user selects from the palette. The user may also paint at any position while holding the object. In this case, an image of a virtual object of the same shape may be transparently superimposed on the image of the real object and the virtual object colored, so that the region is visualized as if the real object had been painted. Each paint color may also be associated with a function tag.

The keypoint determination unit 36 may acquire this colored region as the tag region. Alternatively, instead of coloring with the paint tool, the tag region may be designated by having the user select one of a plurality of virtual stickers, each corresponding to a function tag, and attach the selected sticker to the object.

In parallel with S401 and S402, the keypoint determination unit 36 generates multiple keypoint candidates (S403). This processing is the same as S201, so a detailed description is omitted.

The keypoint determination unit 36 calculates the probability that each keypoint candidate is hidden by a tag region that satisfies a predetermined condition (S404). A tag region that satisfies the predetermined condition may be, for example, a tag region designated as a region to be grasped, or a tag region associated with a function that the hand touches, such as a switch.

For example, the keypoint determination unit 36 may cast rays from a keypoint candidate in multiple (isotropic) directions and calculate a value indicating the proportion of rays that hit the tag region as the probability for that keypoint candidate. If the tag region is a three-dimensional region, the keypoint determination unit 36 may set the probability to 1 when the keypoint candidate lies inside the region and to 0 otherwise.
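The ray-casting estimate can be sketched under a strong simplification: the tag region is modeled as a single sphere (a hypothetical stand-in for the painted region), directions are spread roughly uniformly with a Fibonacci lattice, and the object's own geometry is ignored. None of these choices come from the application.

```python
import math


def fibonacci_directions(n):
    """Return n unit vectors spread approximately uniformly over the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        directions.append((r * math.cos(golden_angle * i),
                           r * math.sin(golden_angle * i), z))
    return directions


def ray_hits_sphere(origin, direction, center, radius):
    """True if the ray from `origin` along unit `direction` meets the sphere."""
    oc = [c - o for o, c in zip(origin, center)]
    dist_sq = sum(a * a for a in oc)
    t = sum(a * b for a, b in zip(oc, direction))  # closest approach along the ray
    if t < 0.0:
        return dist_sq <= radius * radius  # sphere behind: hit only if inside it
    return dist_sq - t * t <= radius * radius


def occlusion_probability(candidate, center, radius, n_rays=2000):
    """Fraction of isotropic rays from the candidate that hit the tag region."""
    hits = sum(ray_hits_sphere(candidate, d, center, radius)
               for d in fibonacci_directions(n_rays))
    return hits / n_rays


def inside_region_probability(candidate, center, radius):
    """Binary rule for a three-dimensional tag region: 1 inside, 0 outside."""
    return 1.0 if math.dist(candidate, center) <= radius else 0.0


p = occlusion_probability((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 1.0)
print(round(p, 3))
```

For a unit sphere whose center is two units away, the exact covered fraction of directions is (1 - sqrt(0.75)) / 2, roughly 0.067, which the sampled estimate should approximate.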

The keypoint determination unit 36 selects the keypoints to be used for final pose estimation based on the probability that each keypoint candidate is hidden (S405). Here, the keypoint determination unit 36 may select a predetermined number of keypoints from the candidates in ascending order of probability.

Alternatively, the keypoint determination unit 36 may select keypoints based on both the reliability and the probability. For example, the keypoint determination unit 36 may determine provisional keypoints and calculate the reliability based on them. The keypoint determination unit 36 generates, from the reliability and the probability of each keypoint, a score indicating its suitability as a keypoint, and determines the keypoints based on that score. The reliability may be calculated by the following procedure. First, using the estimation model 26 trained on the provisional keypoints, the pose estimation unit 25 estimates the pose for the images captured when the three-dimensional shape model was generated. Next, the keypoint determination unit 36 reprojects the position of each keypoint candidate into the image based on the estimated pose and the three-dimensional positions of the keypoint candidates. Then, for each keypoint candidate, the keypoint determination unit 36 calculates, as the reliability, the average of the distances between the position estimated from the output of the estimation model 26 and the reprojected position. Note that the keypoints may be determined iteratively by a method similar to S208.
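One possible way to combine the two quantities into a suitability score is sketched below. The formula, the `error_scale` constant, and the function names are purely illustrative assumptions; the application does not state how the reliability and the probability are combined.

```python
def suitability(mean_error_px, occlusion_probability, error_scale=5.0):
    """Hypothetical score: high when reprojection error and occlusion are both low."""
    reliability = 1.0 / (1.0 + mean_error_px / error_scale)
    return reliability * (1.0 - occlusion_probability)


def select_keypoints(candidates, num_keypoints):
    """Pick the `num_keypoints` candidates with the highest suitability score.

    candidates maps an id to (mean reprojection distance in pixels,
    probability of being hidden by a tag region).
    """
    ranked = sorted(candidates, key=lambda cid: suitability(*candidates[cid]),
                    reverse=True)
    return ranked[:num_keypoints]


candidates = {
    "k0": (1.0, 0.0),
    "k1": (2.0, 0.8),   # accurate but usually covered by the grip
    "k2": (20.0, 0.1),  # rarely covered but poorly localized
    "k3": (3.0, 0.05),
}
chosen = select_keypoints(candidates, num_keypoints=2)
print(chosen)
```

A multiplicative score is used here so that a candidate that is almost always hidden is rejected no matter how well it is localized; a weighted sum would be an equally plausible choice.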

Once the keypoints are determined, the estimation learning unit 37 generates training data for training the estimation model 26 for each of those keypoint candidates (S406). The estimation learning unit 37 then trains the estimation model 26 for each of the keypoints (S407). Note that if the model has already been trained on the keypoints (keypoint candidates) in advance, the estimation model 26 need not be trained again.

The method shown in FIG. 10 also efficiently excludes keypoints that are easily hidden by the hand and are therefore likely to degrade pose estimation when the object is held in the hand, allowing pose estimation to be performed more efficiently. In addition, because keypoint candidates that contribute little to pose estimation can be excluded in advance, both accuracy and processing speed can be achieved in pose estimation, making the processing more efficient.

In addition, in the method shown in FIG. 10, the portion hidden by the hand is identified through the assignment of a function. A separate operation for identifying the hidden portion itself can therefore be omitted, which improves convenience.

Note that the specific numerical values above and the objects and numerical values in the drawings are examples; the invention is not limited to these examples, and they may be modified as necessary.

Claims (7)

1. A posture estimation system comprising one or more processors,
wherein the one or more processors:
acquire information indicating a portion of an object that is hidden by a hand;
determine three-dimensional positions of a plurality of keypoints for estimating a pose of the object based on the acquired information;
train a machine learning model for estimating positions of the determined plurality of keypoints in an input image;
obtain estimated positions of the keypoints in an image in which the object and the hand are captured, based on an output of the trained machine learning model when the image is input; and
determine an estimated pose of the object in three-dimensional space based on the estimated positions of the keypoints.
2. The posture estimation system according to claim 1,
wherein the information indicating the portion hidden by the hand is a plurality of images in which the object is grasped by the hand, and
the one or more processors determine the three-dimensional positions of the plurality of keypoints based on a frequency with which a plurality of keypoint candidates determined by a predetermined procedure are hidden by the hand in the plurality of images in which the object is grasped by the hand.
3. The posture estimation system according to claim 1,
wherein the information indicating the portion hidden by the hand is a portion of the object that is designated by a user and is grasped by the hand.
4. The posture estimation system according to claim 3,
wherein the information indicating the portion hidden by the hand is information indicating a portion of the object that is designated by the user and with which a tag is associated, and
the one or more processors determine, based on an image in which the object and the hand are captured, whether the portion with which the tag is associated is being operated by the hand, and execute processing corresponding to the tag when the portion is determined to be operated.
5. The posture estimation system according to claim 4,
wherein the one or more processors determine, based on the image in which the object and the hand are captured, whether the portion with which the tag is associated is being operated by the hand, and, when the portion is determined to be operated, execute the processing corresponding to the tag based on a magnitude of the operation by the hand.
6. A posture estimation method comprising, by one or more processors:
acquiring information indicating a portion of an object that is hidden by a hand;
determining three-dimensional positions of a plurality of keypoints for estimating a pose of the object based on the acquired information;
obtaining a trained machine learning model for estimating positions of the determined plurality of keypoints in an input image;
obtaining estimated positions of the keypoints in an image in which the object and the hand are captured, based on an output of the obtained machine learning model when the image is input; and
estimating a pose of the object in three-dimensional space based on the estimated positions of the keypoints.
7. A program for causing a computer to function as:
acquisition means for acquiring information indicating a portion of an object that is hidden by a hand;
keypoint determination means for determining three-dimensional positions of a plurality of keypoints for estimating a pose of the object based on the acquired information;
model acquisition means for acquiring a trained machine learning model for estimating positions of the determined plurality of keypoints in an input image;
position acquisition means for acquiring estimated positions of the keypoints in an image in which the object and the hand are captured, based on an output of the acquired machine learning model when the image is input; and
pose estimation means for estimating a pose of the object in three-dimensional space based on the estimated positions of the keypoints.
PCT/JP2023/033602 2023-09-14 2023-09-14 Posture estimation system, posture estimation method, and program Pending WO2025057379A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/033602 WO2025057379A1 (en) 2023-09-14 2023-09-14 Posture estimation system, posture estimation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/033602 WO2025057379A1 (en) 2023-09-14 2023-09-14 Posture estimation system, posture estimation method, and program

Publications (1)

Publication Number Publication Date
WO2025057379A1 true WO2025057379A1 (en) 2025-03-20

Family

ID=95021067

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/033602 Pending WO2025057379A1 (en) 2023-09-14 2023-09-14 Posture estimation system, posture estimation method, and program

Country Status (1)

Country Link
WO (1) WO2025057379A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240404106A1 (en) * 2023-06-01 2024-12-05 International Business Machines Corporation Training a pose estimation model to determine anatomy keypoints in images

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008046750A (en) * 2006-08-11 2008-02-28 Canon Inc Image processing apparatus and method

Similar Documents

Publication Publication Date Title
US11308347B2 (en) Method of determining a similarity transformation between first and second coordinates of 3D features
CN112330730B (en) Image processing methods, devices, equipment and storage media
US10453235B2 (en) Image processing apparatus displaying image of virtual object and method of displaying the same
CN106062862B (en) System and method for immersive and interactive multimedia generation
JP6323040B2 (en) Image processing apparatus, image processing method, and program
CN104937635B (en) Model-based multi-hypothesis target tracker
KR101687017B1 (en) Hand localization system and the method using head worn RGB-D camera, user interaction system
US9734393B2 (en) Gesture-based control system
CN111510701A (en) Display method, apparatus, electronic device, and computer-readable medium for virtual content
CN116917949A (en) Model objects based on monocular camera output
US10950056B2 (en) Apparatus and method for generating point cloud data
TW201814438A (en) Virtual reality scene-based input method and device
EP2371434A2 (en) Image generation system, image generation method, and information storage medium
JP2011095797A (en) Image processing device, image processing method and program
JP2010040037A (en) Method and device for real-time detection of interaction between user and augmented-reality scene
WO2016029939A1 (en) Method and system for determining at least one image feature in at least one image
WO2012106070A2 (en) Using a three-dimensional environment model in gameplay
WO2019012632A1 (en) Recognition processing device, recognition processing method, and program
KR20200117685A (en) Method for recognizing virtual objects, method for providing augmented reality content using the virtual objects and augmented brodadcasting system using the same
WO2025057379A1 (en) Posture estimation system, posture estimation method, and program
KR101308184B1 (en) Augmented reality apparatus and method of windows form
JP7765611B2 (en) Information processing device, information processing method, and program
TWI815021B (en) Device and method for depth calculation in augmented reality
CN108401452A (en) Apparatus and method for performing real target detection and control using virtual reality head mounted display system
WO2025069312A1 (en) Information processing system, information processing method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23952269

Country of ref document: EP

Kind code of ref document: A1