Proposal to add Machine Learning attributes as metadata in the AV1 specification
Today, applying machine-vision techniques driven by Machine Learning, such as real-time object recognition, on the client side requires dedicated hardware-accelerator capability and a correspondingly heavy software stack, or cloud support.
Take object detection as an example: today, every client has to decode each video frame and run the machine-vision algorithm on it. This does not scale from either a hardware or a software perspective when each client repeats the same compute-intensive operations. If, on the other hand, there were a generic metadata container for the position coordinates of objects, only the video content producer would need to perform object detection or frame classification. The object metadata can be embedded in the compressed bitstream, after which it takes only a few lines of software changes for most video players to draw bounding boxes around the objects in each video frame, as sketched below. This would let even legacy platforms offer features like object detection and classification of video frames at minimal implementation cost.
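To make the "few lines of software" claim concrete, here is a minimal player-side sketch in C. It assumes the metadata has already been parsed into a simple box struct; the type and function names are illustrative, not part of the proposal, and bounds checks are omitted for brevity.

```c
#include <stdint.h>

/* Illustrative box type; field names are assumptions, not part of the proposal. */
typedef struct {
    uint16_t x, y, w, h;
} MlBoundingBox;

/* Draw a 1-pixel rectangle outline into an 8-bit luma plane of a decoded
 * frame. frame: Y plane base pointer; stride: bytes per row.
 * Assumes the box lies fully inside the frame. */
static void draw_box(uint8_t *frame, int stride, MlBoundingBox b, uint8_t luma)
{
    for (uint16_t i = 0; i < b.w; i++) {   /* top and bottom edges */
        frame[b.y * stride + b.x + i] = luma;
        frame[(b.y + b.h - 1) * stride + b.x + i] = luma;
    }
    for (uint16_t j = 0; j < b.h; j++) {   /* left and right edges */
        frame[(b.y + j) * stride + b.x] = luma;
        frame[(b.y + j) * stride + b.x + b.w - 1] = luma;
    }
}
```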
The proposal is to add Machine Learning metadata to the AV1 spec. The data carries scene-specific information for:
- Image classification data for each frame
- Object localization and tracking data
- Semantic segmentation
The information consists of image labels, object coordinate data, scene semantics, and label confidence, together with the scheme used to produce each of them: the model name and version, the dataset name, and the model's accuracy metric. The attributes break down as follows:
- Classify generic image
    - String representing the scene information : str
    - Probability of confidence
        - Model architecture (e.g. ResNet50_v3) : str; quality of classification varies with the model
        - Dataset name (model meta, e.g. ImageNet) : str; quality of classification varies with the dataset
        - Confidence value : int; classification accuracy based on the model
- Localize and track objects
    - Number of objects : int
    - Coordinate information : []
    - Height, width : []
    - Labels : str[]
    - Probability of confidence : []
        - Model architecture (e.g. Mask-RCNN_v1, Yolo) : str
        - Dataset name (model meta, e.g. Coco) : str
        - Confidence value : int
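As a non-normative illustration, the attributes above could map onto an in-memory structure like the following C sketch. All names and fixed array sizes are assumptions made here for illustration; the normative layout is the bitstream syntax defined below.

```c
#include <stdint.h>

/* Illustrative in-memory form of the proposed attributes; names and
 * fixed sizes are assumptions, not part of the proposal. */
typedef struct {
    char     label[256];          /* object label, e.g. "dog"          */
    uint16_t x, y, width, height; /* bounding box, in pixels           */
    uint8_t  confidence;          /* per-object confidence, 0..100 (%) */
} MlObject;

typedef struct {
    /* Frame classification */
    char     scene_description[256]; /* e.g. "Person, sheep, dog" */
    char     cls_model[256];         /* e.g. "ResNet50_v3"        */
    char     cls_dataset[256];       /* e.g. "ImageNet"           */
    uint8_t  cls_confidence;         /* 0..100 (%)                */
    /* Object localization */
    uint8_t  num_objects;            /* number of detected objects */
    MlObject objects[256];
    char     det_model[256];         /* e.g. "Yolo" */
    char     det_dataset[256];       /* e.g. "Coco" */
} MlAttributes;
```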
```
metadata_obu( ) {
    ..
    ..
    else if ( metadata_type == METADATA_TYPE_ML_ATTRIBUTES )
        metadata_ml_attributes( )
}
```
| metadata_ml_attributes( ) { | Type |
| :-- | :-- |
| &nbsp;&nbsp;persist_scene_classification_flag | f(1) |
| &nbsp;&nbsp;if ( !persist_scene_classification_flag ) | |
| &nbsp;&nbsp;&nbsp;&nbsp;scene_classification_data_present_flag | f(1) |
| &nbsp;&nbsp;if ( scene_classification_data_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;n = scene_classification_data_description_length | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;scene_classification_data_description | string(n) |
| &nbsp;&nbsp;&nbsp;&nbsp;model_architecture_name_present_flag | f(1) |
| &nbsp;&nbsp;&nbsp;&nbsp;if ( model_architecture_name_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n = model_architecture_name_length | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;model_architecture_name | string(n) |
| &nbsp;&nbsp;&nbsp;&nbsp;} | |
| &nbsp;&nbsp;&nbsp;&nbsp;model_data_set_name_present_flag | f(1) |
| &nbsp;&nbsp;&nbsp;&nbsp;if ( model_data_set_name_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n = model_data_set_name_length | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;model_data_set_name | string(n) |
| &nbsp;&nbsp;&nbsp;&nbsp;} | |
| &nbsp;&nbsp;&nbsp;&nbsp;confidence_value | f(8) |
| &nbsp;&nbsp;} | |
| &nbsp;&nbsp;object_annotation_present_flag | f(1) |
| &nbsp;&nbsp;if ( object_annotation_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;N = number_of_identified_objects | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;for ( i = 0; i < N; i++ ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object_label_name_present_flag | f(1) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if ( object_label_name_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n = object_label_name_length | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object_label_name | string(n) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;} | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object_bounding_box_x_coordinate | f(16) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object_bounding_box_y_coordinate | f(16) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object_bounding_box_width | f(16) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object_bounding_box_height | f(16) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;confidence_value | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;} | |
| &nbsp;&nbsp;&nbsp;&nbsp;model_architecture_name_present_flag | f(1) |
| &nbsp;&nbsp;&nbsp;&nbsp;if ( model_architecture_name_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n = model_architecture_name_length | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;model_architecture_name | string(n) |
| &nbsp;&nbsp;&nbsp;&nbsp;} | |
| &nbsp;&nbsp;&nbsp;&nbsp;model_data_set_name_present_flag | f(1) |
| &nbsp;&nbsp;&nbsp;&nbsp;if ( model_data_set_name_present_flag ) { | |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n = model_data_set_name_length | f(8) |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;model_data_set_name | string(n) |
| &nbsp;&nbsp;&nbsp;&nbsp;} | |
| &nbsp;&nbsp;} | |
| } | |

Note: number_of_identified_objects is coded as f(8), allowing up to 256 objects per frame.
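For clarity, here is a non-normative C sketch of a parser for the table above. The bit reader is a stand-in for the spec's f(n) and string(n) descriptors (MSB-first, as in the rest of the AV1 syntax); it reuses the illustrative MlAttributes/MlObject structs from the earlier sketch, and error handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader standing in for f(n)/string(n). */
typedef struct { const uint8_t *buf; size_t pos; } BitReader;

static uint32_t read_bits(BitReader *r, int n)            /* f(n) */
{
    uint32_t v = 0;
    while (n--) {
        v = (v << 1) | ((r->buf[r->pos >> 3] >> (7 - (r->pos & 7))) & 1);
        r->pos++;
    }
    return v;
}

static void read_string(BitReader *r, char *dst, int n)   /* string(n) */
{
    for (int i = 0; i < n; i++) dst[i] = (char)read_bits(r, 8);
    dst[n] = '\0';
}

/* Follows metadata_ml_attributes() from the table above. */
static void parse_ml_attributes(BitReader *r, MlAttributes *m)
{
    int n, persist = read_bits(r, 1);     /* persist_scene_classification_flag */
    int scene_present = 0;
    if (!persist)
        scene_present = read_bits(r, 1);  /* scene_classification_data_present_flag */
    if (scene_present) {
        n = read_bits(r, 8);
        read_string(r, m->scene_description, n);
        if (read_bits(r, 1)) {            /* model_architecture_name_present_flag */
            n = read_bits(r, 8);
            read_string(r, m->cls_model, n);
        }
        if (read_bits(r, 1)) {            /* model_data_set_name_present_flag */
            n = read_bits(r, 8);
            read_string(r, m->cls_dataset, n);
        }
        m->cls_confidence = (uint8_t)read_bits(r, 8);
    }
    if (read_bits(r, 1)) {                /* object_annotation_present_flag */
        m->num_objects = (uint8_t)read_bits(r, 8);
        for (int i = 0; i < m->num_objects; i++) {
            MlObject *o = &m->objects[i];
            if (read_bits(r, 1)) {        /* object_label_name_present_flag */
                n = read_bits(r, 8);
                read_string(r, o->label, n);
            }
            o->x          = (uint16_t)read_bits(r, 16);
            o->y          = (uint16_t)read_bits(r, 16);
            o->width      = (uint16_t)read_bits(r, 16);
            o->height     = (uint16_t)read_bits(r, 16);
            o->confidence = (uint8_t)read_bits(r, 8);
        }
        if (read_bits(r, 1)) {            /* model_architecture_name_present_flag */
            n = read_bits(r, 8);
            read_string(r, m->det_model, n);
        }
        if (read_bits(r, 1)) {            /* model_data_set_name_present_flag */
            n = read_bits(r, 8);
            read_string(r, m->det_dataset, n);
        }
    }
}
```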
Metadata OBU extension (Section 5.8.1): metadata_type value 0x06 identifies METADATA_TYPE_ML_ATTRIBUTES.
Worked example of a payload:

| Syntax element | Value |
| :-- | :-- |
| scene_classification_data_present_flag | TRUE |
| scene_classification_data_description_length | 0x12 |
| scene_classification_data_description | "Person, sheep, dog" |
| model_architecture_name_length | 0x08 |
| model_architecture_name | "ResNet50" |
| model_data_set_name_length | 0x08 |
| model_data_set_name | "ImageNet" |
| confidence_value | 0x5F (95%) |
| object_annotation_present_flag | TRUE |
| number_of_identified_objects | 0x07 |

Then, for each of the seven objects:

| object_label_name_length | object_label_name | x | y | width | height |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0x03 | "dog" | X-dog | Y-dog | W-dog | H-dog |
| 0x03 | "man" | X-man | Y-man | W-man | H-man |
| 0x06 | "sheep1" | X-sheep1 | Y-sheep1 | W-sheep1 | H-sheep1 |
| 0x06 | "sheep2" | X-sheep2 | Y-sheep2 | W-sheep2 | H-sheep2 |
| 0x06 | "sheep3" | X-sheep3 | Y-sheep3 | W-sheep3 | H-sheep3 |
| 0x06 | "sheep4" | X-sheep4 | Y-sheep4 | W-sheep4 | H-sheep4 |
| 0x06 | "sheep5" | X-sheep5 | Y-sheep5 | W-sheep5 | H-sheep5 |

Followed by the detection model description:

| Syntax element | Value |
| :-- | :-- |
| model_architecture_name_length | 0x04 |
| model_architecture_name | "Yolo" |
| model_data_set_name_length | 0x04 |
| model_data_set_name | "Coco" |
| confidence_value | 0x5A (90%) |
X-{xxx} and Y-{xxx} indicate the X and Y coordinates of the object bounding box; W-{xxx} and H-{xxx} indicate its width and height.
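Conversely, a producer-side tool could pack the scene-classification half of this example as follows. This is a minimal sketch, assuming the same MSB-first bit packing as the parser above and a zero-initialized output buffer; it is illustrative, not normative.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit writer, mirroring read_bits() above.
 * Assumes w->buf is zero-initialized and large enough. */
typedef struct { uint8_t *buf; size_t pos; } BitWriter;

static void write_bits(BitWriter *w, uint32_t v, int n)
{
    while (n--) {
        w->buf[w->pos >> 3] |= ((v >> n) & 1) << (7 - (w->pos & 7));
        w->pos++;
    }
}

static void write_string(BitWriter *w, const char *s)
{
    while (*s) write_bits(w, (uint8_t)*s++, 8);
}

/* Emits the scene-classification fields of the worked example. */
static void write_example_scene(BitWriter *w)
{
    write_bits(w, 0, 1);      /* persist_scene_classification_flag */
    write_bits(w, 1, 1);      /* scene_classification_data_present_flag */
    write_bits(w, 0x12, 8);   /* description length (18 bytes) */
    write_string(w, "Person, sheep, dog");
    write_bits(w, 1, 1);      /* model_architecture_name_present_flag */
    write_bits(w, 0x08, 8);
    write_string(w, "ResNet50");
    write_bits(w, 1, 1);      /* model_data_set_name_present_flag */
    write_bits(w, 0x08, 8);
    write_string(w, "ImageNet");
    write_bits(w, 0x5F, 8);   /* confidence_value = 95% */
}
```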
a) Semantic segmentation annotation of each frame (optional)
- Can be used to understand what the scene represents and the relationships between objects.
- The attributes needed are the number of segments, the labels, and per-pixel masks for each segment, along with the model name, version, and dataset.
- Per-pixel masks could be too much information to embed as metadata (even a 1-bit mask of a 1920x1080 frame is roughly 260 KB per segment per frame); a separate video stream carrying the masks could be an option.