How Omnimatte improves Google's video and image editing process
Mattes are an essential part of image and video editing. They are used to combine a foreground image, such as actors on a set, with a background image, such as a sprawling city. Newer computer vision techniques can produce high-quality mattes for videos and images. However, the scene effects created by a subject, including reflections, smoke, and shadows, are still ignored by these methods.
To fill this void, Google introduced a novel method for creating mattes that uses layered neural rendering to break a video into layers known as omnimattes. An omnimatte captures both a subject and all of the effects that subject causes in the scene. Like traditional mattes, omnimattes are RGBA images that can be edited with readily available image or video editing software and can be used anywhere traditional mattes are used, for example, to add text to a video underneath a smoke trail.
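Because an omnimatte is just an ordinary RGBA image, an edit like the smoke-trail example can be reproduced with everyday tools. The snippet below is a minimal sketch using Pillow; the file names, text, and positions are illustrative assumptions, not assets from the paper.

```python
# Minimal sketch: compositing text *under* a smoke-trail omnimatte so the
# smoke renders on top of it. File names are assumptions; all images must
# share the same dimensions.
from PIL import Image, ImageDraw

background = Image.open("background.png").convert("RGBA")        # hypothetical plate
smoke_omnimatte = Image.open("smoke_layer.png").convert("RGBA")  # hypothetical RGBA omnimatte

# Draw the text on its own transparent layer.
text_layer = Image.new("RGBA", background.size, (0, 0, 0, 0))
ImageDraw.Draw(text_layer).text((40, 40), "HELLO", fill=(255, 255, 255, 255))

# Composite back to front: background, then text, then the smoke omnimatte.
frame = Image.alpha_composite(Image.alpha_composite(background, text_layer),
                              smoke_omnimatte)
frame.save("edited_frame.png")
```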
How it works
The work is presented in the paper titled “Omnimatte: Associating Objects and Their Effects in Video”. To generate omnimattes, the researchers split each input video into a number of layers:
- One layer for each moving subject
- An additional layer for stationary background objects
Take, for example, a boy who is out walking his dog. The subjects, the boy and the dog, each get their own layer, and a separate layer is added for the stationary background, the street. Finally, all of these layers are merged using traditional alpha blending, which reproduces the input video. To produce the results, the researchers used:
- Mask R-CNN to segment the input objects.
- STM, a video object segmenter trained on the DAVIS dataset, to track objects across frames.
- RAFT to compute the optical flow between consecutive frames.
- Panoptic segmentation to segment dynamic background elements such as tree branches, treating the segments as additional objects (a minimal sketch of the segmentation step follows this list).
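As a rough illustration of the segmentation step, the sketch below extracts per-object masks from a single frame with torchvision's off-the-shelf Mask R-CNN. This is a hedged approximation, not the paper's code: the frame path, score threshold, and pretrained weights are assumptions, and the full pipeline goes on to track such masks with STM, compute flow with RAFT, and train the layered neural renderer.

```python
# Hedged sketch of per-frame mask extraction with torchvision's Mask R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# "frame_0000.png" is a hypothetical path to one video frame.
frame = to_tensor(Image.open("frame_0000.png").convert("RGB"))
with torch.no_grad():
    pred = model([frame])[0]

# Keep confident detections; each mask marks one candidate subject layer.
keep = pred["scores"] > 0.8                   # 0.8 threshold is an assumption
masks = pred["masks"][keep, 0] > 0.5          # (N, H, W) boolean masks
print(f"{masks.shape[0]} subject masks extracted")
```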
Image credit: “Omnimatte: Associating Objects and Their Effects in Video”
Results
The results of the paper include:
- Successful association of subjects with their scene effects.
- The method can remove a dynamic object from a video, either by binarizing its omnimatte and feeding it to a separate video completion method such as FGVC, or by simply excluding the object's omnimatte layer from the reconstruction (see the compositing sketch after this list).
- The model presented in the paper outperformed ISD, the state-of-the-art method for shadow detection.
- It successfully captures warps, reflections, and shadows from generic, much simpler input.
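To make the layer-exclusion idea concrete, here is a minimal sketch of the back-to-front alpha ("over") compositing that rebuilds a frame from RGBA omnimatte layers; dropping a layer from the list removes that subject along with its shadows and reflections. The function, array shapes, and variable names are illustrative assumptions.

```python
# Hedged sketch of "over" compositing for RGBA omnimatte layers.
import numpy as np

def composite(background, layers):
    """background: (H, W, 3) floats in [0, 1];
    layers: list of (H, W, 4) RGBA arrays, ordered back to front."""
    out = background.copy()
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out  # standard alpha "over"
    return out

# Object removal: rebuild the frame without the dog's layer (names assumed).
# frame_without_dog = composite(bg_plate, [boy_layer])  # dog_layer excluded
```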
However, the model is unable to separate objects or effects that remain completely stationary relative to the background throughout the video. “These problems could be addressed by building a background rendering that explicitly models the 3D structure of the scene,” the paper concluded.
Tracking AI in video editing
In 2016, IBM used its Watson supercomputer to curate footage and create a trailer for the horror thriller Morgan, one of the first uses of AI in video editing. Watson used machine learning to study previous trailers and then applied what it learned to select parts of the film it deemed appropriate for the trailer. The AI did the job in a fraction of the time it would have taken a human, who would need hours or days to view all of the footage and produce the final video.
Also in 2016, Adobe launched its in-house AI and ML platform, Adobe Sensei, which offers several useful features across its products. For example, Sensei can quickly adjust and fix errors in images, videos, and other media in Adobe Creative Cloud products, including Photoshop, Premiere, and Illustrator, and powers advanced search in Adobe Stock and Lightroom. Similar AI-based tools include QuikStories from GoPro, the end-to-end video marketing tool Magisto, the online video editor RawShorts, and Lumen5, among many others.
This ability of AI to interpret video opens up possibilities in almost every kind of editing tool, from color correction to object removal, image stabilization, and visual effects. However, deepfakes, such as videos of politicians uttering words they never said, remain a problem. Ethical and legal frameworks therefore need to be created to address these issues.
Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), embarks on a journey as a tech journalist at AIM. A keen observer of national and IR-related news. He loves going to the gym. Contact: [email protected]