Facing the Engineering Reality of Mixed Reality with Intel AI

21 min readDec 21, 2019

Breaking down the problem of Mixed Reality.
Analysis of the computational workload of the different components of Mixed Reality.
The motivation for designing a Hybrid Mixed Reality Cloud.
Intel Architecture for Mixed Reality

Introduction

In the digital age of today, we humans and almost all things we interact with on a daily basis have a digital self in addition to physical existence. Mixed Reality (MR) is the idea of creating an immersive and seamless experience by blending digital entities in a real-world environment. The following video is a demonstration of this idea.

Mixed reality applications can render accurate to-scale digital models of objects — both static and dynamic — and allow the user to interact with them using intuitive gestures that resemble the way they would interact with the real object. High-precision physics simulation allows accurate reproduction of real-world outcomes of these interactions — thus providing a highly capable tool for education, design and planning. Mixed reality is also capable of virtual teleportation of human users by mapping their actions and expressions on virtual avatars, thus taking remote collaboration to a whole new level. Mixed reality tools are set to revolutionize engineering, design, productivity, education and entertainment. This makes mixed reality one of the most actively researched areas of computer science at the moment. To know more about the widespread applications of Mixed Reality, please read my earlier blog. All these amazing capabilities of MR come at a huge computational cost. As a result, deploying MR at scale poses unique engineering challenges. This article breaks down the technical details of Mixed Reality and analyses the computational cost involved in each of the key steps. Mixed Reality owes its recent success hugely to modern computer architectures that involve an optimal workload distribution between the Cloud and the Edge. In this article, we explain the design philosophy of these compute architectures.

Breaking down the problem of Mixed Reality

A typical Mixed Reality pipeline consists of the following components:

Spatial mapping and 3D reconstruction

Spatial mapping is the process of creating a 3D map of the environment that is being scanned by the MR device. This allows the MR device to understand the environment and model interactions with the real world. Spatial mapping is essential for occlusion, collision avoidance, motion planning and realistic blending of the real and virtual worlds.

The three important components that serve as the building blocks of spatial mapping and 3D reconstruction are:

Axes: In order to make measurements in the real world and embed virtual objects of the right scale, the MR device must define its global coordinate system. Depending upon the application, the device may use a 2D or a 3D coordinate system. In a 2D coordinate system, the x-axis represents the horizontal direction and y-axis is vertical. On the other hand, in a 3D coordinate system, the x-axis represents the transverse, y-axis, the longitudinal and z-axis, the vertical direction with respect to the initial pose of the user during calibration (that is defined as the origin of the coordinate system).
Pose: The combination of position and orientation is termed as the pose of an object, even though this concept is sometimes used only to describe the orientation. Exterior orientation and translation are also used as synonyms of the pose. In computer vision and robotics, a typical task is to identify specific objects in an image and to determine each object’s position and orientation with respect to some coordinate system. This information can then be used, for example, to allow a robot to manipulate an object or to avoid crashing into the object.
Mesh: In 3D computer graphics and solid modelling, a polygon mesh is defined as a collection of vertices, edges and faces that constitutes the surface of a polyhedral object. In order to simplify rendering, surfaces are often modelled using triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons). However, in general, digital models of surfaces can be composed of concave and convex polygons or even polygons with holes.

Scene understanding

Scene understanding refers to the task of analyzing objects in the context of the 3D structure of the scene. This includes the layout of different objects and the spatial, functional, and semantic relationships between objects. Scene understanding comprises the following steps:

Depth estimation: Depth estimation or extraction refers to the task of obtaining a measure of the distance of (ideally) each point of the scene within the current field of view. This is crucial to obtain a representation of the spatial structure of a scene.
3D object segmentation: In computer vision, image segmentation is the process of partitioning a digital image into multiple segments following a given criterion. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Plane classification: Plane detection is achieved by classifying multiple overlapping image regions, in order to obtain an initial estimate of angularity for a set of points, which are segmented into planar and non-planar regions using a sequence of Markov random fields.
Object detection: Object detection is the task of detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Object detection has applications in many other areas of computer vision, such as face detection in security systems, pedestrian detection in autonomous vehicles, image retrieval and video surveillance.

3D object rendering

Rendering or image synthesis is the automatic process of generating a photo-realistic or non-photo realistic image or object from a 2D or 3D model (or models — collectively known as a scene file) by means of computer programs. he results of displaying such an image or object is called render. Rendering pipelines in MR typically consist of the following components.

Anchoring: Anchors, also referred to as anchor points, are points in the environment that an MR device knows should always hold their respective digital objects (often referred to as abductors). This applies specifically to static digital objects. For example, say you want to place a digital lamp on a table. You would set the anchor to be on top of the table which the device has already discovered and recognized as a horizontal plane. Anchoring is what ensures that a 3D asset remains in place and behaves the way an object actually would if it were resting at that point in physical space even as the MR device changes its position and orientation in space. This is made possible by the ability of the MR device and its virtual camera to keep track of their positions and orientations in space with respect to the anchor points and render the appropriate surfaces and perspectives of the abductor object(s) as they move around the abductor(s) placed via anchor points. Anchor points are the key to creating realistic MR/AR experiences in a convenient way.
Alignment: To scale or not to scale — alignment answers this question. When it comes to augmented reality, setting the right camera alignment and scale is absolutely essential to bridging the gap between real and virtual worlds. Since mixed reality (like, augmented reality) has to visualize virtual objects in real environments, the correct pose and alignment of the displayed objects are of critical importance for immersive user experience, for example, in surgical navigation. Each MR experience session is preceded by a calibration procedure that is aimed to compute the transformation function(s) that enable the augmented virtual objects to be represented to-scale in the same coordinate system as the real-world objects. Proper calibration is also necessary for displaying landmarks in the real world.
Persistence: Object persistence allows the MR device to remember the relative positions and orientations of virtual objects in the environment, thus making them appear stable to the user.
Occlusion: Occlusion commonly refers to objects blocking other objects located behind them from a given viewpoint. In MR and AR, occlusion is essential for experiences to be immersive; virtual objects (or parts thereof) should be displayed only in places where there are no physical objects between them and the camera.

Manipulation and Interaction

In the real world, “manipulation” refers to any changes to an object applied by/through the human hand. Similarly, in VR/AR/MR environments, a real hand or an interface tool is used to grasp and move 3D virtual artefacts. This virtual interaction enables a rich set of direct manipulations of these 3D objects. The way a virtual object behaves when touched by human hand is (in most cases) governed by the same laws of Physics as their real-world counterparts. So modelling such interactions depends on our ability to accurately simulate physics within our computational budget. The performance of a computational physics engine depends on the following six factors [source]:

The Simulator Paradigm: The computational paradigm of the simulator determines which aspects of the world can be accurately simulated by it. Some of the popular simulator paradigms are constraint-based, impulse-based and penalty based. For a comprehensive overview of simulator paradigms please refer to [Erleben, 2004].
The integrator: Numerical integration constitutes a large fraction of the computations involved in the physics simulation. The numerical accuracy of the integrator directly impacts the fidelity of human-object and object-object interactions in the virtual world. Some popular integration methods used in Mixed Reality applications have been discussed by [Baraff, 1997] and Erleben [2004].
Object representation: Finding the correct representation of objects in the simulation environment (which in the case of MR is a virtual space overlaid in the real world) is crucial for fast and accurate modelling of their interactions. Various aspects of object representation have been discussed in [Hadap et al. 2004] and [Ratcliff 2007].
Collision detection and contact determination: Modeling the physics of contact is one of the biggest challenges faced by physics engines. Representation of objects plays an important part here. Fast and reliable detection of contact and collision and finding the immediate outcome of such an interaction is crucial for an immersive MR experience and key to productivity using MR tools in practical settings. Collision detection methods have been discussed in [Kavan, 2003] and [Hadap et al. 2004].
Material properties: Sometimes certain properties of materials, like Coulomb friction, need to be considered in modelling their interactions. Modelling friction properties in a physics simulator has been covered by Kaufman [2005].
Constraint implementation: Most object interactions must abide by physical constraints. For example, virtual fish swimming inside a closed rigid aquarium can not leave the boundaries of the aquarium, no matter how we interact with the fish. Different physics engines have different methods of constraint implementation which determine what family of constraints are supported and how accurately they can be simulated. See [Erleben, 2004] for a detailed discussion.

Multimedia gesture recognition

In modern mixed reality applications such as Microsoft Hololens, the user interfaces are based completely on eye-gaze, hand gestures and voice commands. Eye gaze is used to target and track an object and hand gestures and voice are used to grab the object and manipulate/take actions on it. The eye gaze — hand gesture-based UI is oftentimes not as precise as a mouse-keyboard or touch UI in a traditional computer, but the sheer simplicity of putting a HoloLens on and immediately interacting with content is purely magical and allows users to get to work without any other accessories.

Multi-user experience

The power of mixed reality really shines when it comes to shared experiences in the same (possibly remote) environment. Such shared experiences are made possible by co-location of multiple users through virtual teleportation. A similar situation may arise in which a group is distributed across geographies and at least one participant is not in the same physical space as the others (as is often the case with VR). This is known as a remote experience. Often, there can be a blend of these two cases in which a group has both co-located and remote participants (e.g. two groups in different conference rooms).

The following videos demonstrate the two different scenarios:

Users in the same physical space

Remote: All your users will be in separate physical spaces.

Computational complexity

Having briefly introduced the hierarchy of computational tasks that form the backbone of any Mixed Reality experience, in this section, we analyse the computational workload that each of these tasks poses. This will lead us to identify an optimal workload distribution scheme that conforms with our cost, computational and power constraints.

Mapping and reconstruction

3D surface reconstruction is accomplished by correspondence matching across the digital image sequences. A set of distinctive feature-points are detected in each image. These feature points are searched for correspondence across the rest of the images. Corresponding feature points are then triangulated to obtain their 3D coordinates. Correspondence matching thus results in a dense 3D point cloud of the target object. The worst-case complexity of these algorithms grows quadratically with the number of feature points used.

After 3D point cloud generation, surface reconstruction algorithms are used to fit a watertight surface over those points. The Poisson Surface Reconstruction and Ball-Pivoting algorithm is a popular choice. The computational complexity of Poisson reconstruction is given by its Octree Depth, Solver Divide and Angle Threshold. We will discuss these parameters in brief.

Octree Depth: Octree depth is the depth of the octree which is used during the reconstruction process. An octree of depth D produces a three-dimensional mesh of resolution 2D × 2D × 2D. As the octree depth increases, mesh resolution increases exponentially. The effect is increased CPU resource and memory consumption. The following plots from [citation] demonstrate how reconstruction time increases with Octree Depth for different point cloud resolutions.

Solver Divide: Solver divide specifies the depth up to which a conjugate gradient solver is used to solve the Poisson equation. Beyond this depth, Gauss-Seidel relaxation is used. If the solver divide is increased, computation time decreases as, at a greater value, Gauss-Seidel relaxation is used instead of gradient solver to solve the equation.

Angle threshold: Angle threshold is the value of the maximum allowable angle between the active edge and the new edge created by the rolling ball in the Ball Pivoting algorithm. If the angle exceeds the threshold value, the ball rolling is stopped for that region. This stops the creation of a new edge and the corresponding vertex is omitted. Increase in angle threshold value results in an increase of computation time, as more edges are added to the 3D mesh by the rolling ball. The quality of the reconstructed surface improves with an increase in angle threshold value. Limiting the value of the angle threshold stops the creation of a triangular mesh with high obtuse angles, thus reducing time complexity at the cost of the fidelity of the reconstructed shape. The following figure from [1] shows the variation of computational time with angle threshold.

Scene understanding

Scene understanding is the process, often in real-time, of perceiving, analyzing and elaborating an interpretation of a 3D dynamic scene observed through a network of sensors. In recent years, Deep Learning Convolutional Neural Networks (CNN) has become the de-facto standard computation framework in computer vision. CNN’s can deliver near-human accuracy in most computer vision applications, such as classification, detection and segmentation.

The high accuracy, however, comes at the price of large computational demand. Computational throughput, power consumption and memory efficiency are three important metrics for analyzing the computational burden of deep learning. The computational workload of an algorithm is measured in terms of the number of Floating Point Operations (FLOPs). The following table taken from [2] presents the computational workload posed by popular deep learning network architectures used in computer vision.

We can observe a general trend of increasing computational workload with increasing accuracy of prediction. As Mixed Reality is highly constrained by the computational resources that are available on the device, the problem thus becomes to pursue the optimal accuracy under a limited computational budget (memory and/or FLOPs). This calls for using lightweight network architectures such as SqueezeNet, MobileNet, ShuffleNet, ShiftNet and FE-Net that offer reasonable accuracy while being low on FLOPs.

In supervised learning, a machine learning algorithm builds a model (e.g. an artificial neural network) by examining many examples and attempting to find a model that minimizes loss — a process that is known as empirical risk minimization. Training a model means learning (determining) good/optimum values for all the parameters of the model from labelled training examples. After training comes inference. It is the stage in which a trained model is used to infer/predict the test samples. Like training, inference performs a forward pass to predict the output for a given input but, unlike training, inference doesn’t include a backward pass to compute the error and update weights. The success of a Mixed Reality application that requires complex scene understanding is contingent on very high-speed inference of deep neural networks that are trained offline.

The throughput of object detection from a deep neural network running on a certain hardware platform is measured in terms of the number of Frames processed Per Second (FPS). High FPS object detection is crucial for immersive MR experiences. The following table presents the different state of the art deep learning-based object detection algorithms and their FPS numbers while running on a state of the art NVIDIA Titan-X Pascal GPU.

High throughput object detection and instance segmentation require parallel processing hardware like GPUs. The high core count of these processors makes them extremely power-hungry.

The figure below presents the peak performance of recent Nvidia GPUs for single-precision floating-point (FP32) arithmetic measured by GigaFLOPs or GFLOPs and power consumption gauged by Thermal Design Power (TDP).

The following table compares CPU, GPU and FPGA (Field Programmable Gate Arrays) processors in terms of their throughput and power demand for running different deep learning neural network architectures.

The high power requirements along with the size, weight and cost of specialised hardware for deep learning makes it impractical for inclusion in mixed reality setups that are mostly head-mounted and have strict constraints on power consumption, thermal profile, comfort and price. In an age of burgeoning affordable internet connectivity options and with 5G on the horizon, most mixed reality applications have taken to off-loading almost all deep learning computation to cloud-based AI services that offer GPUs, FPGAs and application-specific processors (ASICs) that are optimized for Deep Learning workloads.

Remote Rendering

Rendering or transcoding is extremely processor-intensive tasks that can occupy all available resources. Remote rendering is a name given to the process of offloading rendering and transcoding tasks to another computer. In Mixed Reality, we need to render enormous high-resolution 3D videos. [26] Instead of clogging on-device compute with this heavy task, the 3D model data can be stored in a secure cloud server, under the control of the content owner, and only 2D rendered images of the models can be streamed back to client requests. This method relies on good network bandwidth between the client and server and may require significant server resources to do the rendering for all client requests, but it is crucial for high-fidelity immersive MR experiences.

Manipulation and interaction

The computational performance of physics simulation depends on the number and complexity of the geometric features present in the CAD model. Even the presence of a single, relatively small geometric feature can increase the size of the underlying discrete physical simulation problem by as much as 10-fold.

Approximations are an effective method of reducing the computational load of the physics simulation. A 2-D model can be dimensionally reduced to a 1-D beam and a 3D solid can be replaced with a 2D surface, which is comparatively computationally less expensive to analyze. In most of the cases, model simplification has little effect on the accuracy of physics simulation provided the objects and their interactions are represented appropriately.

The following diagram from [27] compares the performance of some leading physics engines for a task of stacking objects:

Multimedia gesture recognition

As discussed earlier, Mixed Reality applications rely on eye gaze, hand gestures and voice commands for immersive human-machine interaction. Reliable detection of these inputs under a plethora of real-world user conditions is a highly challenging task that is being actively researched to date. Human bodies are nonrigid objects with a high degree of variability, scales, orientations, poses, and illumination offer additional factors of variability to the commands.

In order to address this variability, all modern algorithms for these tasks use some form of machine learning and AI. However, AI is computationally expensive and in most cases, the raw user input data may not be transmitted to the cloud for inference because of privacy concerns. Additionally, these inputs must be registered without any noticeable latency and the network bandwidth can act as a bottleneck and ruin the experience by introducing latency. For these reasons, modern mixed reality setups have dedicated on-device chips for low power, low latency AI inference.

Hybrid MR cloud

In the previous section, we discussed the computational workloads presented by each of the tasks that come together to make an immersive Mixed Reality experience possible. MR devices are mostly mobile wearable devices. Hence they must be low power, lightweight and comfortable. Also, the affordability of these devices is key to the widespread adoption of MR. However, some parts of the MR pipeline like scene understanding and rendering require high-power high-performance compute-resources. As high power parallel computing hardware can not be included in a wearable MR headset, running these tasks on the device would introduce high latency and ruin the immersive experience. Off-loading the heavy lifting to cloud servers and streaming the results to the device is a viable solution to this problem. Such a cloud computing service can offer a selection of CPUs, GPUs, FPGAs, and ASICs for each task to choose for minimizing latency and maximizing throughput. This makes these cloud services hybrid.

We must note that not everything can be offloaded and processed in the cloud because of the latency introduced by limited network bandwidth. Hence, jobs that do not require high performance compute are run on the device itself. These jobs especially include those that are the most crucial for generating spontaneous responses to the users’ inputs. The other reason is for processing these operations on the device itself is the privacy concerns about the user data that cannot be uploaded to the cloud.

Intel AI and Mixed Reality

Intel’s diverse portfolio of hardware and software tools has been poised to meet the unique demands of AI-powered applications like Mixed Reality. As discussed earlier, low latency deep neural network inference is the key to an immersive MR experience. Intel Architecture (IA) is currently the market leader in this segment.

IA for AI

The Instruction Set Architecture (ISA) of the Processor Graphics SIMD execution units is well suited to Deep Learning. This ISA offers rich data type support for 32-bit FP, 16-bit FP, 32-bit integer, a 16-bit integer with SIMD multiply-accumulate instructions. At a theoretical peak, these operations can complete on every clock for every execution unit. Additionally, the ISA offers rich sub-register region addressing to enable efficient cross lane sharing for optimized convolution implementations, or efficient horizontal scan-reduce operations. Finally, the ISA provides efficient memory block loads to quickly load data tiles for optimized convolution or optimized generalized matrix multiply implementations.

Intel Deep Learning Accelerator (Intel DLA)

Intel’s DLA (Deep Learning Accelerator) is a software-programmable hardware overlay on Arria 10 and Stratix 10 families of Intel FPGAs to realize the ease of use of software programmability and the efficiency of custom hardware designs. Neural network topologies can be mapped onto the FPGA architecture to form an electrical circuit that is dedicated to the computation of a forward pass through the neural network (inference). Such a dedicated electrical circuit provides significantly lower latency than a general-purpose processor. DLA has partitioned configurable parameters in the runtime for compile-time tunability of performance while using different neural network frameworks. This is achieved through a VLIW network that delivers full programmability without incurring any performance or efficiency overhead [12]. This is an impressive engineering feat since overlays always tend to come with a serious overhead.

Xbar interconnect: A neural network forward pass computation is composed of a relatively small number of distinct operations (like Convolution, Pooling, Local Response Normalization (LRN), etc.) that are repeated over and over again. The functions corresponding to frequently used operations can be pre-optimized as kernels for hardware overlay onto FPGAs. Intel DLA uses Xbar interconnect — a custom interconnect — to efficiently connect and assemble members from a suite of pre-optimized kernels and auxiliary kernels.

Graph compiler: Intel DLA has a unique graph compiler that breaks down a neural net into subgraphs, schedules subgraph execution, and allocates explicit cache buffers to optimize the use of a custom-developed stream buffer and filter caches [12]. This allows for more efficient hardware implementations of neural networks.

Intel DLA along with Intel Xeon CPUs form the backbone of Intel’s offerings for cloud-based AI inference and is being used in applications like Project Evo of Microsoft.

Intel Processor Graphics

Intel Processor Graphics (Intel® HD Graphics, Intel® Iris® graphics and Intel® Iris® Pro graphics) provides a balance of fixed-function acceleration with programmability to deliver a good performance/power trade-off across a wide spectrum of AI workloads [13]. These hardware modules are ubiquitous as Intel Processor Graphics ship as part of Intel’s SOCs. As of today, more than a billion devices ranging from servers to PCs to embedded devices have Intel Processor Graphics running. This makes it a widely available platform to run on-device AI inference. Intel Processor Graphics is available in a broad set of power/performance offerings from Intel Atom® processors, Intel® Core™ processors, and Intel® Xeon® processors. Intel Processor Graphics is integrated on-die with the CPU. This integration enables the CPU and Processor Graphics to share system memory, memory controller, and portions of the cache hierarchy. Such a shared memory architecture enables efficient input/output data transfer and even “zero-copy” buffer sharing. Unlike discrete graphics (GPU) acceleration for deep learning, input and output data need not be transferred from system memory to discrete graphics memory on every execution — thus avoiding the cost of increased latency and power.

Intel AI Software

In order to complement its state-of-the-art hardware offerings, Intel provides a suite of software tools for accelerated inference using deep neural networks.

clDNN: Intel Processor Graphics can be programmed for Deep Learning using Intel’s open-source clDNN library. clDNN is a library of kernels based on OpenCL that accelerate many of the common function calls in popular neural network topologies like AlexNet, VGG-Net, GoogLeNet, ResNet, Faster-RCNN, SqueezeNet and FCN. During network compilation, clDNN optimizes the neural network operations in three stages as described in the figure below:

OpenVINO: To utilize the hardware resources of Intel Processor Graphics easily and effectively, Intel provides the Deep Learning Deployment Toolkit, available via the OpenVINO toolkit. This toolkit takes a trained model and tailors it to run optimally for specific endpoint device characteristic. In addition, it delivers a unified API to integrate inference with application logic. The Deep Learning Deployment Toolkit comprises two main components: a) Model Optimizer — a cross-platform command-line tool that performs static model analysis and adjusts deep learning models for optimal execution on end-point target devices like CPU, GPU and FPGA — and b) Inference Engine — a runtime that delivers a unified API to integrate the inference with application logic. For acceleration on CPU, OpenVINO uses the MKL-DNN plugin based off Intel® Math Kernel Library (Intel® MKL) which includes functions necessary to accelerate most popular image recognition topologies. Acceleration on FPGA is supported using a plugin for Intel® Deep Learning Inference Accelerator (Intel® DLIA). For GPU, OpenVINO uses clDNN.

Model flow through the Deep Learning Deployment Toolkit. Source: Link.

Video communication and rendering tools: Apart from AI tools, Intel also provides industry-leading tools for encoding, decoding and processing video data. Intel® Quick Sync Video technology is based on the dedicated media capabilities of Intel Processor Graphics to improve the performance and power efficiency of media applications — specifically speeding up functions like decode, encode and video processing. Intel® Media SDK or Intel® Media Server Studio provides an API to access these media capabilities and hardware-accelerated codecs for Windows and Linux.

Conclusion

In this article, we analysed Mixed Reality from a Systems Engineering perspective. We broke the problem into its components and analysed the nature of computational workloads associated with them. Finally, we motivated the design of Hybrid Mixed Reality Cloud as a backbone for supporting scalable and immersive Mixed Reality applications.

Thank you for taking the time to read my article. If you like my work, please consider hitting the clap button and following me on Twitter and Medium.

I would like to thank my wonderful team at GridRaster Inc. for giving me an opportunity to explore this amazing field and participate in cutting edge development and I would also grateful to the Intel Software Innovator program for giving exposure to the cutting edge technology being developed at Intel.

References

Facing the Engineering Reality of Mixed Reality with Intel AI

Contents

Introduction

Breaking down the problem of Mixed Reality

Computational complexity

Hybrid MR cloud

Intel AI and Mixed Reality

Conclusion

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Manorama Jha

No responses yet