Explainer

Proposal to add Machine Learning attributes as metadata in the AV1 specification

Problem we are trying to solve

Today, applying machine-vision techniques driven by Machine Learning, such as real-time object recognition on the client side, requires either dedicated hardware-accelerator capability with a correspondingly heavy software stack, or cloud support.

Let's take object detection as an example. Today, each client has to decode every video frame and run the machine-vision algorithm on it. This does not scale, from either a hardware or a software perspective, when every client repeats the same compute-intensive operations. On the other hand, if there is a generic metadata format for storing the position coordinates of objects, only the video content producers need to perform the object detection or frame classification. The object metadata can be embedded in the compressed bitstream, and then it takes just a few lines of software changes for most video players to draw bounding boxes around the objects in each video frame. This would enable even legacy platforms to offer features like object detection and classification of video frames at minimal implementation cost.
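To make the "few lines of software changes" concrete, here is a minimal Python sketch of the player-side overlay step, assuming the decoder has already parsed the proposed metadata into per-object records. The MlObject type and draw_boxes function are hypothetical illustrations, not an existing API.

from dataclasses import dataclass
from typing import List

@dataclass
class MlObject:
    # One localized object from the proposed metadata (hypothetical type).
    label: str
    x: int           # object_bounding_box_x_coordinate
    y: int           # object_bounding_box_y_coordinate
    width: int       # object_bounding_box_width
    height: int      # object_bounding_box_height
    confidence: int  # 8-bit confidence_value

def draw_boxes(frame: bytearray, stride: int, height: int, objects: List[MlObject]) -> None:
    # Overlay one-pixel box outlines on an 8-bit (grayscale) frame buffer,
    # clamping each box to the frame bounds.
    for obj in objects:
        x0, y0 = max(0, obj.x), max(0, obj.y)
        x1 = min(stride - 1, obj.x + obj.width - 1)
        y1 = min(height - 1, obj.y + obj.height - 1)
        for x in range(x0, x1 + 1):      # top and bottom edges
            frame[y0 * stride + x] = 255
            frame[y1 * stride + x] = 255
        for y in range(y0, y1 + 1):      # left and right edges
            frame[y * stride + x0] = 255
            frame[y * stride + x1] = 255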

[Figure: media pipeline today]

[Figure: media pipeline proposed]

ML Metadata

Add Machine Learning metadata to the AV1 spec; the data carries scene-specific information for:

- Image classification data for each frame
- Object localization and tracking data
- Semantic segmentation

The information consists of image labels, object coordinate data, scene semantics, and label confidence, together with the scheme used to produce each of these: the model name with version, the dataset name, and a model accuracy metric.

Image classification data

- Generic image classification label : str
    String representing the scene information
- Probability of confidence
  - Model architecture (e.g., ResNet50_v3) : str
      Quality of classification varies with the model
  - Dataset name (model meta, e.g., ImageNet) : str
      Quality of classification varies with the dataset
  - Confidence value : int
      Classification accuracy based on the model
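Collected into one record, the classification attributes might look like the following sketch. The field names are assumptions that mirror the list above, and the example values come from the sample bitstream later in this document.

from dataclasses import dataclass

@dataclass
class SceneClassification:
    description: str         # e.g. "Person, sheep, dog"
    model_architecture: str  # e.g. "ResNet50_v3"
    dataset_name: str        # e.g. "ImageNet"
    confidence: int          # 8-bit value, e.g. 0x5F = 95 (%)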

[Figure: classification vs. object detection]

Localized Object Annotation*

[Figure: object detection example]

- Number of objects : int
- Coordinate information : []
- Height, width : []
- Labels : str[]
- Probability of confidence : []
  - Model architecture (e.g., Mask-RCNN_v1, Yolo) : str
  - Dataset name (model meta, e.g., Coco*) : str
  - Confidence value : int
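A matching in-memory shape for the object annotation, again as a sketch with illustrative, non-normative names:

from dataclasses import dataclass
from typing import List

@dataclass
class ObjectAnnotation:
    label: str       # e.g. "dog"
    x: int           # 16-bit bounding-box X coordinate
    y: int           # 16-bit bounding-box Y coordinate
    width: int       # 16-bit bounding-box width
    height: int      # 16-bit bounding-box height
    confidence: int  # 8-bit confidence_value

@dataclass
class FrameObjectAnnotations:
    objects: List[ObjectAnnotation]  # up to 255 objects with an 8-bit count
    model_architecture: str          # e.g. "Mask-RCNN_v1", "Yolo"
    dataset_name: str                # e.g. "Coco"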

Metadata OBU extension (5.8.1)

metadata_obu( ) {
    ..
    ..
    else if ( metadata_type == METADATA_TYPE_ML_ATTRIBUTES )
        metadata_ml_attributes( )
}
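In AV1, metadata_type is leb128-coded. The sketch below shows how a reader might recognize the new type and hand the remaining payload to the ML-attributes parser; METADATA_TYPE_ML_ATTRIBUTES = 0x06 is an assumption taken from the sample bitstream below, not a registered value.

METADATA_TYPE_ML_ATTRIBUTES = 0x06  # assumed value, per the sample bitstream below

def read_leb128(data: bytes, pos: int):
    # Decode an AV1 leb128 value; returns (value, next_byte_position).
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return value, pos
        shift += 7

def dispatch_metadata_obu(payload: bytes):
    metadata_type, pos = read_leb128(payload, 0)
    if metadata_type == METADATA_TYPE_ML_ATTRIBUTES:
        return ("ml_attributes", payload[pos:])  # hand off to ML-attributes parsing
    return ("other", payload[pos:])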

Metadata ML attributes syntax (5.8.8)

metadata_ml_attributes( ) {                                    Type
    persist_scene_classification_flag                          f(1)
    if ( !persist_scene_classification_flag )
        scene_classification_data_present_flag                 f(1)
    if ( scene_classification_data_present_flag ) {
        n = scene_classification_data_description_length       f(8)
        scene_classification_data_description                  string(n)
        model_architecture_name_present_flag                   f(1)
        if ( model_architecture_name_present_flag ) {
            n = model_architecture_name_length                 f(8)
            model_architecture_name                            string(n)
        }
        model_data_set_name_present_flag                       f(1)
        if ( model_data_set_name_present_flag ) {
            n = model_data_set_name_length                     f(8)
            model_data_set_name                                string(n)
        }
        confidence_value                                       f(8)
    }
    object_annotation_present_flag                             f(1)
    if ( object_annotation_present_flag ) {
        N = number_of_identified_objects                       f(8)   // up to 255 objects per frame (8-bit count)
        for ( i = 0; i < N; i++ ) {
            object_label_name_present_flag                     f(1)
            if ( object_label_name_present_flag ) {
                n = object_label_name_length                   f(8)
                object_label_name                              string(n)
            }
            object_bounding_box_x_coordinate                   f(16)
            object_bounding_box_y_coordinate                   f(16)
            object_bounding_box_width                          f(16)
            object_bounding_box_height                         f(16)
            confidence_value                                   f(8)
        }
        model_architecture_name_present_flag                   f(1)
        if ( model_architecture_name_present_flag ) {
            n = model_architecture_name_length                 f(8)
            model_architecture_name                            string(n)
        }
        model_data_set_name_present_flag                       f(1)
        if ( model_data_set_name_present_flag ) {
            n = model_data_set_name_length                     f(8)
            model_data_set_name                                string(n)
        }
    }
}
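A decoder-side parser for this syntax could look like the sketch below, assuming MSB-first bit packing for f(n) and UTF-8 bytes for string(n) (both assumptions; the proposal does not pin these down). When persist_scene_classification_flag is set, the sketch treats the presence flag as 0, i.e., the previous classification is reused.

class BitReader:
    # Minimal MSB-first bit reader for the f(n)/string(n) descriptors above.
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def f(self, n: int) -> int:
        v = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            v = (v << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return v

    def string(self, n: int) -> str:
        return bytes(self.f(8) for _ in range(n)).decode("utf-8")

def parse_metadata_ml_attributes(r: BitReader) -> dict:
    md = {"persist_scene_classification_flag": r.f(1)}
    present = 0 if md["persist_scene_classification_flag"] else r.f(1)
    if present:
        md["scene_classification_data_description"] = r.string(r.f(8))
        if r.f(1):  # model_architecture_name_present_flag
            md["model_architecture_name"] = r.string(r.f(8))
        if r.f(1):  # model_data_set_name_present_flag
            md["model_data_set_name"] = r.string(r.f(8))
        md["confidence_value"] = r.f(8)
    if r.f(1):  # object_annotation_present_flag
        objects = []
        for _ in range(r.f(8)):  # number_of_identified_objects
            obj = {}
            if r.f(1):  # object_label_name_present_flag
                obj["object_label_name"] = r.string(r.f(8))
            obj["x"] = r.f(16)
            obj["y"] = r.f(16)
            obj["width"] = r.f(16)
            obj["height"] = r.f(16)
            obj["confidence_value"] = r.f(8)
            objects.append(obj)
        md["objects"] = objects
        if r.f(1):  # model_architecture_name_present_flag (object model)
            md["object_model_architecture_name"] = r.string(r.f(8))
        if r.f(1):  # model_data_set_name_present_flag (object model)
            md["object_model_data_set_name"] = r.string(r.f(8))
    return md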

Sample Bitstream format

Metadata OBU extension (5.8.1): ….. 0x06 ….. // METADATA_TYPE_ML_ATTRIBUTES

Field order (per the syntax above):

scene_classification_data_present_flag / scene_classification_data_description_length / scene_classification_data_description / model_architecture_name_length / model_architecture_name / model_data_set_name_length / model_data_set_name / confidence_value / object_annotation_present_flag / number_of_identified_objects / object_label_name_length / object_label_name / object_bounding_box_x_coordinate / object_bounding_box_y_coordinate / object_bounding_box_width / object_bounding_box_height / model_architecture_name_length / model_architecture_name / model_data_set_name_length / model_data_set_name / confidence_value

Values:

/ TRUE / 0x12 / “Person, sheep, dog” / 0x08 / “ResNet50” / 0x08 / “ImageNet” / 0x5F (= 95%)
/ TRUE / 0x07
/ 0x03 / “dog” / X-dog / Y-dog / W-dog / H-dog
/ 0x03 / “MAN” / X-man / Y-man / W-man / H-man
/ 0x06 / “sheep1” / X-sheep1 / Y-sheep1 / W-sheep1 / H-sheep1
/ 0x06 / “sheep2” / X-sheep2 / Y-sheep2 / W-sheep2 / H-sheep2
/ 0x06 / “sheep3” / X-sheep3 / Y-sheep3 / W-sheep3 / H-sheep3
/ 0x06 / “sheep4” / X-sheep4 / Y-sheep4 / W-sheep4 / H-sheep4
/ 0x06 / “sheep5” / X-sheep5 / Y-sheep5 / W-sheep5 / H-sheep5
/ 0x04 / “Yolo” / 0x04 / “Coco” / 0x5A (= 90%) /

   X-{xxx} & Y-{xxx} indicate the X and Y coordinates of the object bounding box.
   W-{xxx} & H-{xxx} indicate the width and height of the object bounding box.
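For completeness, here is a writer-side sketch that serializes this sample bit-for-bit per the syntax table (same MSB-first/UTF-8 assumptions as the parser sketch). Note that the prose rendering above elides the per-object present flags and confidence values; a conformant writer must emit them, as this sketch does. The zero coordinates are placeholders standing in for X-dog, Y-dog, and so on.

class BitWriter:
    # Minimal MSB-first writer matching the f(n)/string(n) descriptors.
    def __init__(self):
        self.bits = []

    def f(self, value: int, n: int):
        self.bits.extend((value >> (n - 1 - i)) & 1 for i in range(n))

    def string(self, s: str):
        for b in s.encode("utf-8"):
            self.f(b, 8)

    def tobytes(self) -> bytes:
        bits = self.bits + [0] * (-len(self.bits) % 8)  # zero-pad to a byte boundary
        return bytes(sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
                     for k in range(0, len(bits), 8))

w = BitWriter()
w.f(0, 1)                                    # persist_scene_classification_flag = 0
w.f(1, 1)                                    # scene_classification_data_present_flag = TRUE
desc = "Person, sheep, dog"
w.f(len(desc), 8)                            # 0x12
w.string(desc)
w.f(1, 1); w.f(8, 8); w.string("ResNet50")   # model architecture present, length 0x08
w.f(1, 1); w.f(8, 8); w.string("ImageNet")   # dataset present, length 0x08
w.f(0x5F, 8)                                 # confidence_value = 95
w.f(1, 1)                                    # object_annotation_present_flag = TRUE
labels = ["dog", "MAN", "sheep1", "sheep2", "sheep3", "sheep4", "sheep5"]
w.f(len(labels), 8)                          # number_of_identified_objects = 0x07
for label in labels:
    w.f(1, 1)                                # object_label_name_present_flag
    w.f(len(label), 8)                       # 0x03 or 0x06
    w.string(label)
    for coord in (0, 0, 0, 0):               # placeholder X / Y / W / H values
        w.f(coord, 16)
    w.f(0x5A, 8)                             # per-object confidence_value = 90
w.f(1, 1); w.f(4, 8); w.string("Yolo")       # object model architecture
w.f(1, 1); w.f(4, 8); w.string("Coco")       # object model dataset
payload = w.tobytes()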

ToDo

a) Semantic segmentation annotation** of each frame (optional)

  • Can be used to understand what a scene represents and the relationships between objects.

  • The required attributes are the number of segments, labels, and per-pixel masks for each segment, along with the model name, version, and dataset.

  • Per-pixel masks may be too much information to put in metadata; a separate video stream could be an option.

    ** https://arxiv.org/pdf/1909.11065v2.pdf
