Nokia is bringing Volumetric Video services closer to consumers
Immersive content is steadily making its way into our daily media consumption. New multisensory devices, and algorithms that allow volumetric content to be captured and experienced with high fidelity at reduced cost, are driving this trend. The advancement of the Metaverse is also accelerating the adoption of immersive services. And yet, the industry is still in search of high-revenue immersive services that could attract everyday consumers on a massive scale.
Volumetric video is one of the potential answers. Just like traditional two-dimensional (2D) video, volumetric video is a sequence of frames. But while a traditional video frame can only provide a flat view of the environment, each volumetric video frame represents a 3D space which the user can immerse themselves in and move through.
Volumetric video is undoubtedly one of the breakthrough technologies of the near future, with the potential to revolutionize the way we experience content. However, volumetric video formats, codecs, storage systems and transmission technologies need to be standardized to achieve uniformity and ensure that devices and systems work seamlessly together.
A family of standards around volumetric video coding
From very early on, Nokia’s engineers recognized the possibilities of volumetric video, and for the past several years we have actively contributed to standardization efforts in the Moving Picture Experts Group (MPEG) to create a family of standards around Visual Volumetric Video-based Coding (V3C) technology.
In V3C, each volumetric video frame is transformed from its 3D representation into multiple 2D video representations and associated metadata known as atlas data. After the conversion from 3D to 2D, the resulting 2D video representations are compressed using traditional video coding standards while atlas data are compressed using a separate encoding mechanism. As with video data, atlas data is composed of standardized network abstraction layer (NAL) units enabling efficient encapsulation into file formats and application layer protocol units.
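To make the atlas data layer concrete, here is a minimal sketch of unpacking the two-byte atlas NAL unit header as laid out in ISO/IEC 23090-5 (the function name and example bytes are illustrative, not from any Nokia implementation):

```python
def parse_atlas_nal_header(data: bytes) -> dict:
    """Unpack the two-byte V3C atlas NAL unit header (ISO/IEC 23090-5):
    a 1-bit forbidden zero bit, a 6-bit NAL unit type, a 6-bit layer id,
    and a 3-bit temporal id (stored as temporal_id + 1)."""
    b0, b1 = data[0], data[1]
    return {
        "forbidden_zero_bit": b0 >> 7,
        "nal_unit_type": (b0 >> 1) & 0x3F,
        "nal_layer_id": ((b0 & 0x01) << 5) | (b1 >> 3),
        "nal_temporal_id_plus1": b1 & 0x07,
    }

hdr = parse_atlas_nal_header(bytes([0x02, 0x01]))
# hdr["nal_unit_type"] == 1, hdr["nal_layer_id"] == 0
```

The NAL unit type tells a receiver how to handle the unit (for example, whether it carries parameter sets or coded tile data) without parsing the payload itself.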
V3C can be reused for a variety of applications. When volumetric video is represented by point clouds, it can be coded using Video-based Point Cloud Compression (V-PCC); when it is represented by multi-view video and depth, it can be coded using the MPEG Immersive Video (MIV) standard.
The missing piece of the puzzle for real-time delivery: RTP
Making volumetric video available to consumers requires much more than just coding technology – system-level aspects also need to be defined for last-mile delivery. Nokia’s standardization efforts in MPEG have resulted in the definition of an end-to-end system for delivering volumetric video using Dynamic Adaptive Streaming over HTTP (DASH). Even though DASH is an extremely popular video delivery standard, its underlying transport-protocol-level limitations make it unsuitable for conversational use cases. Therefore, a standard for transporting V3C over the Real-time Transport Protocol (RTP) was needed.
To enable real-time delivery of a specific media type, supplemental RTP specifications known as RTP payload formats have to be defined. An RTP payload format specifies, among other things, how the media payload is packetized into RTP packets and how RTP header fields map to events in the media bitstream.
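As a rough sketch of the layer these payload formats build on, the fixed 12-byte RTP header defined in RFC 3550 can be assembled as below (the helper name and example values are my own; a payload format then maps media events onto these fields, e.g. the marker bit is commonly set on the last packet of a coded frame):

```python
import struct

def rtp_header(payload_type: int, seq: int, timestamp: int,
               ssrc: int, marker: bool = False) -> bytes:
    """Pack the 12-byte fixed RTP header (RFC 3550)."""
    version, padding, extension, csrc_count = 2, 0, 0, 0
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | csrc_count
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    # network byte order: flags, marker+PT, sequence, timestamp, SSRC
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

hdr = rtp_header(payload_type=96, seq=1, timestamp=90000,
                 ssrc=0x1234, marker=True)
```

The payload-format-specific structures (such as the V3C atlas payload headers) follow immediately after this fixed header inside each packet.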
Nokia led the standardization of the V3C RTP payload format
Nokia took the lead in the standardization effort in the Internet Engineering Task Force (IETF) and proposed the first draft of “RTP Payload Format for Visual Volumetric Video-based Coding” in January 2022. Over the course of 18 months, the draft was rigorously improved, and at the IETF meeting in July 2023 it reached the Last Call milestone, the final step before becoming a Proposed Standard.
The RTP payload format for V3C defines the packetization/de-packetization rules for the atlas data component, whereas the video-encoded components are packetized/de-packetized according to the respective video codec’s RTP payload format. The V3C RTP payload format also defines additions to the Session Description Protocol (SDP) that allow different V3C RTP component streams to be grouped and static atlas data to be exchanged as part of the initial session negotiation.
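As a purely illustrative sketch (the grouping token, media descriptions and port numbers below are placeholders and are not copied from the specification), a session description might tie one atlas stream and two video component streams together using SDP grouping with `mid` identifiers, along these lines:

```
v=0
o=- 0 0 IN IP4 203.0.113.1
s=V3C session (illustrative)
t=0 0
a=group:V3C atlas geo tex
m=application 40000 RTP/AVP 96
a=rtpmap:96 v3c/90000
a=mid:atlas
m=video 40002 RTP/AVP 97
a=rtpmap:97 H265/90000
a=mid:geo
m=video 40004 RTP/AVP 98
a=rtpmap:98 H265/90000
a=mid:tex
```

Here the `a=group` line expresses that the three component streams belong to one volumetric video, so a receiver can synchronize and reconstruct them jointly; the exact attribute names and semantics are defined by the specification itself.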
There are three types of packets defined for atlas data: the single NAL unit packet, the Aggregation Packet (AP) and the Fragmentation Unit (FU). A single NAL unit packet contains exactly one atlas NAL unit. Aggregation Packets reduce packetization overhead by carrying several small NAL units in the same packet. Fragmentation Units split a single NAL unit that exceeds the Maximum Transmission Unit (MTU) size across multiple RTP packets.
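The decision between the three packet types can be sketched as follows. This is a simplified illustration of the mode selection only – it does not reproduce the exact payload and FU header layouts from the specification, and the function name is my own:

```python
def packetize(nal_units, mtu):
    """Map atlas NAL units to RTP payload modes: 'single' for one unit,
    'AP' for several small units aggregated together, 'FU' for one
    oversized unit fragmented across packets. Illustrative sketch only."""
    packets = []
    pending = []  # small NAL units waiting to be aggregated

    def flush():
        if pending:
            packets.append(("AP" if len(pending) > 1 else "single",
                            list(pending)))
            pending.clear()

    for nal in nal_units:
        if len(nal) > mtu:
            # Unit exceeds the MTU: fragment it into FU-sized pieces.
            flush()
            frags = [nal[i:i + mtu] for i in range(0, len(nal), mtu)]
            packets.append(("FU", frags))
        elif sum(map(len, pending)) + len(nal) <= mtu:
            # Unit fits alongside the pending ones: aggregate.
            pending.append(nal)
        else:
            flush()
            pending.append(nal)
    flush()
    return packets

pkts = packetize([b"a" * 10, b"b" * 10, b"c" * 100], mtu=50)
# two small units become one AP; the 100-byte unit becomes two FU fragments
```

A real implementation would also prepend the payload headers and set RTP fields such as the marker bit and timestamp per the specification, but the size-driven choice between single, AP and FU packets follows this pattern.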
To demonstrate the viability of the standard and test it in the real world, Nokia implemented the specification and released it to the multimedia developer community. The released atlas data payloader and de-payloader for the RTP Payload Format for V3C is written as a plugin for the cross-platform GStreamer multimedia framework. Released under the permissive BSD 3-Clause Clear License, it should accelerate the development of future immersive services.
The future of media is immersive
Volumetric video will not only transform the way we experience entertainment content such as gaming and movies. It will also disrupt other sectors, for example by enhancing employee safety and increasing productivity through improved remote operations. It can also support the creation of 3D models of structures for engineering, or even of human organs for medical purposes.
A lot still needs to be done before volumetric video services make their way into our everyday lives. However, thanks to the efforts of Nokia and other technology developers, important steps are being taken every day.