How to MR

Chapter 1: Alignment in Mixed Reality

Manorama Jha
9 min read · Dec 31, 2020

Introduction

Mixed Reality (MR) is emerging as a driver of the next big revolution in manufacturing, healthcare, and remote collaboration. The internet is full of articles¹ ² and case studies³ on how MR headsets, such as the Microsoft HoloLens and the Magic Leap One, are being used by aerospace companies and hospitals to achieve unprecedented engineering and medical feats. However, MR is much more than the headsets, and few of these articles explain how such magical experiences can actually be created using off-the-shelf MR hardware. In this series, “How to MR”, we will discuss various computer science challenges that need to be addressed in order to create useful and immersive MR experiences. In the first chapter of this series, we discuss the problem of alignment in MR.

MR allows the user to render and interact with virtual 3D holograms in a real-world environment. In a simple interior design use case, MR would allow the user to render the hologram of a couch in an empty room and move it around with hand gestures to find out where it would look best. One of the primary use cases of this technology is to overlay a virtual reference model of an object on top of a real instance of the same object for comparison. For example, in manufacturing, a virtual reference model of a precision engineering part can be overlaid on a damaged instance for inspection and repair. In remote assistance, a circuit diagram can be overlaid on a real switchboard and remote instructions can be issued for debugging the connections. In surgery, pre-operative scans (MRI, CT, etc.) can be overlaid on the patient’s organ in real time for precise guidance to the point of surgery.

Definition: The problem of alignment in MR is defined as the task of overlaying a virtual model of an object on top of a real-world instance of the same object such that the position, orientation, and scale of the virtual model match the pose and dimensions of the real-world object perfectly.

Accurate alignment of the virtual model with the real object requires accurately estimating the rendering transformation between the coordinate systems of the headset and the object in real time. This is particularly challenging because the headset, which is mounted on the user, is a moving frame of reference. The real-world object can also move and can even change its shape if it is deformable (as in the example of surgeons aligning pre-operative scans with live organs). Recomputing the transformation every time the user moves their head or the object moves can be computationally expensive. Additionally, the system should be robust to varying illumination and noise in the environment.

In this article, we break down the challenges of alignment in MR and present a brief overview of the methods used to address them. We will use the Microsoft HoloLens 2 as the platform for our discussion.

Problem Description

Figure 1: The alignment problem in MR illustrated. The task is to accurately estimate the transformation matrix that aligns the virtual model in the device (HoloLens, in this example) coordinate system with the real object in the world coordinate system.

In order to understand alignment, we must first understand the coordinate systems involved. Real-world objects reside in a “world” coordinate system. When the user looks at these objects through the HoloLens, the user’s frame of reference corresponds to the HoloLens coordinate system. In order to render the 3D model at the location and pose of the object in the real world, the system must find the correspondence between every point of the real-world object and a canonical model of the object in the HoloLens coordinate system. This amounts to finding a transformation function T that maps the coordinates of each point of the canonical model to the coordinates of the corresponding point of the real-world object.

For most practical purposes, we can assume T to be an affine transformation, in which case it can be represented by a 4 × 4 transformation matrix M in homogeneous coordinates (see Figure 1).

The problem of alignment thus boils down to estimating M accurately and efficiently. Since this calculation must be performed in the HoloLens coordinate system, the problem may also include representing real-world points in the HoloLens coordinate system.
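To make the representation concrete, here is a minimal sketch (in Python with NumPy; the rotation, translation, and scale values are hypothetical) of building such a matrix M and applying it to points of a canonical model:

```python
# A minimal sketch: an affine transform as a 4x4 matrix M in homogeneous
# coordinates, applied to points of the canonical model. All numeric
# values below are hypothetical.
import numpy as np

s = 0.5                                    # uniform scale factor (assumed)
theta = np.deg2rad(30)                     # rotation about the z-axis (assumed)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
t = np.array([1.0, 0.2, -0.5])             # translation in metres (assumed)

M = np.eye(4)
M[:3, :3] = s * R
M[:3, 3] = t

# Canonical model points (N x 3), lifted to homogeneous coordinates (N x 4).
model_points = np.array([[0.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
homogeneous = np.hstack([model_points, np.ones((len(model_points), 1))])

# Transform into the real-world object's pose: p_world = M @ p_model.
world_points = (M @ homogeneous.T).T[:, :3]
print(world_points)
```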

Estimating the Transformation Matrix for Alignment: Point Cloud Registration

The problem of estimating the transformation matrix can be formalized as a Point Cloud Registration (PCR) problem. PCR has been studied extensively in computer vision applications such as medical imaging and robot localization. Pomerleau et al. present a detailed review of PCR algorithms and their applications. Applications of PCR in Augmented and Mixed Reality are described by Kevin Kjellén in their master’s thesis. Given a pair of point clouds (in our case, one for the virtual model and one for the scanned real-world object), the task of PCR is to solve the following two sub-problems:

  1. Pairwise correspondence: For each point on the virtual model, find the matching point on the real-world object scan that it corresponds to.
  2. Finding the transformation matrix: Given a correspondence mapping between the two point clouds, estimate the transformation matrix M that aligns the virtual model with the real-world object. In the common “rigid-body” special case (rotation and translation only), this sub-problem has a closed-form solution, sketched below.
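The sketch below shows that closed-form solution for the rigid-body case, known as the Kabsch (or Umeyama) algorithm. It is an illustration of sub-problem 2 under the assumption that the correspondences are already known:

```python
# A minimal sketch of sub-problem 2 (the article does not prescribe a
# method; Kabsch/Umeyama is one standard choice). Given matched point
# pairs, it finds the least-squares rigid transform via SVD.
import numpy as np

def rigid_transform(P, Q):
    """Find R (3x3) and t (3,) minimising sum ||R @ P[i] + t - Q[i]||^2.

    P: (N, 3) points on the virtual model.
    Q: (N, 3) corresponding points on the real-world scan.
    """
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)   # centroids
    H = (P - p_bar).T @ (Q - q_bar)                 # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                        # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = q_bar - R @ p_bar
    return R, t
```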

The Iterative Closest Point (ICP) algorithm is one of the most popular approaches to solving this problem.

The algorithm starts with an initial guess of the transformation, matches each point of one cloud to its closest point in the other, re-estimates the transformation from those matches, and repeats. Each iteration improves the estimates of both the correspondence mapping and the transformation matrix.
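As a concrete illustration, here is a minimal sketch of running point-to-point ICP with the open-source Open3D library. The library choice, file names, and distance threshold are illustrative assumptions, not something the method requires:

```python
# A minimal ICP sketch using Open3D; file names are placeholders.
import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("virtual_model.ply")   # hypothetical file
target = o3d.io.read_point_cloud("object_scan.ply")     # hypothetical file

threshold = 0.02           # max correspondence distance in metres (assumed)
initial_guess = np.eye(4)  # identity as the initial alignment

result = o3d.pipelines.registration.registration_icp(
    source, target, threshold, initial_guess,
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

print(result.transformation)  # the estimated 4x4 matrix M
```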

ICP often suffers from optimization-related issues, such as getting stuck in local optima. Many approaches have been proposed to get around these challenges; Rusinkiewicz et al. describe some of them. Recently, machine learning approaches have also been used to improve the accuracy of PCR. We will save that discussion for a future post.

Methods of Improving Alignment

In this section, we briefly describe the challenges involved in achieving accurate alignment in uncontrolled environments and some methods used to address them.

Point Cloud Denoising

There are two main ways in which point cloud data is captured by MR systems:

  1. Multi-view stereo RGB images are used to reconstruct the 3D point cloud of the scene. Example: COLMAP.
  2. Depth data is recorded separately alongside RGB images and fused into a 3D representation. Example: TSDF fusion.

The second approach usually gives more robust and accurate point clouds, but it requires dedicated hardware for depth capture. In both approaches, however, the reconstructed point cloud can be corrupted by noise and outliers caused by sensor error and errors in depth estimation. Measurement noise is especially high around edges and corners. Reconstruction algorithms that work with multi-view images introduce noise wherever they fail to resolve ambiguities in correspondence among the images. This noise undermines the performance of surface reconstruction algorithms and impairs downstream geometry processing, since fine details are lost and the underlying manifold structure is prone to deformation.
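For reference, here is a minimal sketch of the second capture approach, fusing RGB and depth frames into a TSDF volume with Open3D. The file names, intrinsics, and camera poses are placeholders; in a real system the poses would come from device tracking:

```python
# A minimal TSDF-fusion sketch with Open3D; all inputs are placeholders.
import numpy as np
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.004,  # 4 mm voxels (assumed)
    sdf_trunc=0.04,      # truncation distance for the signed distance field
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# Default PrimeSense intrinsics as a stand-in for real calibration.
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

N_FRAMES = 10  # hypothetical number of captured frames
for i in range(N_FRAMES):
    color = o3d.io.read_image(f"color_{i}.jpg")   # hypothetical files
    depth = o3d.io.read_image(f"depth_{i}.png")
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_trunc=3.0, convert_rgb_to_intensity=False)
    camera_pose = np.eye(4)  # placeholder: would come from device tracking
    volume.integrate(rgbd, intrinsic, np.linalg.inv(camera_pose))

point_cloud = volume.extract_point_cloud()
```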


This noise motivates the use of point cloud denoising algorithms. Denoising 3D point clouds is challenging because point sets are unordered and carry no explicit connectivity information, so the local surface topology must be inferred from neighboring points. The problem of accurately estimating the underlying surface is made even harder by the presence of outliers: since outliers carry no information about the surface and deteriorate the performance of any denoising approach, they must be identified and removed before further processing of the point cloud.

Point-repositioning algorithms for denoising aim to preserve the sharp and fine-scale 3D features of surfaces while smoothing out noise. There are various procedures for point cloud denoising [review paper]; robust outlier detection and removal, bilateral normal mollification, and point-set repositioning are some of the most popular approaches.
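As a simple example of the outlier-removal step, here is a minimal sketch using Open3D’s statistical outlier removal; the neighborhood size and deviation threshold are assumed values:

```python
# A minimal outlier-removal sketch with Open3D; thresholds are assumptions.
import open3d as o3d

pcd = o3d.io.read_point_cloud("object_scan.ply")  # hypothetical file

# Statistical outlier removal: discard points whose mean distance to their
# k nearest neighbours deviates too far from the global average distance.
clean_pcd, inlier_indices = pcd.remove_statistical_outlier(
    nb_neighbors=20, std_ratio=2.0)
```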

Segmentation

In a real-world use case of alignment, the target object may sit in a cluttered environment. When we scan such a scene, the point cloud includes every object in the field of view, only one of which is the target of interest. To make the job easier for point cloud registration, it is customary to first segment the target object out of the point cloud. Nguyen et al. present a survey of point cloud segmentation algorithms.

Figure 2: Example of 3D point cloud semantic segmentation.
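As one illustrative (and deliberately simple) pipeline, the sketch below removes the dominant plane (e.g., a table top) with RANSAC and clusters the remaining points with DBSCAN, using Open3D; the thresholds and the choice of cluster are assumptions:

```python
# A minimal segmentation sketch with Open3D: plane removal + clustering.
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("cluttered_scene.ply")  # hypothetical file

# Fit and remove the largest plane in the scene (e.g., a table top).
plane_model, plane_indices = pcd.segment_plane(
    distance_threshold=0.01, ransac_n=3, num_iterations=1000)
objects = pcd.select_by_index(plane_indices, invert=True)

# Cluster what remains; each label is a candidate object segment
# (label -1 marks noise points).
labels = np.array(objects.cluster_dbscan(eps=0.02, min_points=10))
target = objects.select_by_index(np.where(labels == 0)[0])  # pick one cluster
```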

High-Resolution Depth Data

As discussed in the previous section, point clouds reconstructed using dedicated depth sensors alongside RGB cameras are more robust and accurate. However, these systems suffer from noise in the depth data and from a resolution mismatch between the depth sensor and the color cameras, since the range sensors currently available in the market have significantly lower resolution than color cameras. High-resolution depth maps can be obtained using stereo matching, but stereo often fails to construct accurate depth maps of weakly or repetitively textured scenes, or of scenes with complex self-occlusions. Depth sensors, on the other hand, provide coarse depth information regardless of the presence or absence of texture.

To overcome the weaknesses of the individual sensors and capture the best of both worlds, modern mixed reality setups use a calibrated system, comprising a time-of-flight (depth) camera and a stereoscopic camera pair, that allows for data fusion. Such a system can provide accurate and fast 3D reconstruction of a scene at high resolution, e.g., 1200 × 1600 pixels at 30 frames/second.

Popular MR headsets like the Microsoft HoloLens and Magic Leap are designed to be completely portable, for obvious reasons of user-friendliness and versatility. However, this form factor introduces resource constraints that prevent these devices from including a source of high-quality real-time depth data, which severely restricts their use as research tools for many use cases.

To overcome this restriction, many research groups (such as Garon et al.) have devised methods to augment a HoloLens headset with a much higher-resolution depth sensor. Generally, the external depth sensor (for example, an Intel RealSense or a Microsoft Azure Kinect) is mounted on the HoloLens, calibrated against it, and connected to a compute stick that communicates with the HoloLens in real time.
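Once the extrinsic calibration between the mounted sensor and the headset is known, mapping sensor measurements into the HoloLens frame is a fixed change of coordinates. Here is a minimal sketch of that step (an assumption about the setup, not Garon et al.’s code), where the calibration matrix is presumed to have been estimated offline:

```python
# A minimal sketch: map points from the external depth sensor's frame into
# the HoloLens frame using a fixed, pre-calibrated extrinsic matrix.
import numpy as np

# 4x4 extrinsics obtained offline from calibration (value hypothetical).
T_sensor_to_hololens = np.eye(4)

def to_hololens_frame(points_sensor):
    """Map (N, 3) depth-sensor points into the HoloLens coordinate system."""
    homogeneous = np.hstack([points_sensor, np.ones((len(points_sensor), 1))])
    return (T_sensor_to_hololens @ homogeneous.T).T[:, :3]
```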

Real-Time Process Optimization

Real-time processing is crucial for a robust alignment experience in MR: when the environment changes, the target object moves, or the user’s frame of reference shifts, the alignment should appear unaffected. The entire workflow of capture, processing, segmentation, registration, and rendering should run in real time. Running all of these computationally heavy processes on the MR device is not possible due to power and resource constraints. This is where a server (cloud) and client (MR headset) setup becomes relevant: we can divide the processing between the server and the client and achieve virtually real-time operation. Read my earlier article for a detailed discussion of the engineering challenges of creating immersive MR experiences and the design of hybrid compute architectures for this purpose.
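To make the split concrete, here is a minimal sketch of the client side of such an architecture. Everything here is hypothetical (the endpoint, payload format, and response schema are illustrative only): the headset streams a downsampled scan to the server, which runs segmentation and registration and returns the 4 × 4 alignment matrix for the headset to apply at render time:

```python
# A hypothetical client-side sketch of the server/client split. The server
# URL, payload format, and response schema are illustrative assumptions.
import numpy as np
import requests

def request_alignment(scan_points, server_url="http://mr-server:8080/align"):
    """Send an (N, 3) scan to the server; receive the 4x4 alignment matrix M.

    The heavy work (denoising, segmentation, registration) runs server-side;
    the headset only applies the returned matrix when rendering.
    """
    payload = {"points": scan_points.tolist()}
    # A tight timeout (assumed) keeps the render loop responsive; a real
    # system would fall back to the last known matrix on timeout.
    response = requests.post(server_url, json=payload, timeout=0.1)
    return np.array(response.json()["transformation"])
```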

Conclusion

In this article, we discussed the task of alignment in mixed reality. We motivated the task by citing some exciting real-world use cases. We formalized the task as a point cloud registration problem and introduced the Iterative Closest Point family of algorithms. We also discussed point cloud denoising, the acquisition of high-resolution depth data, and real-time processing architectures, all of which are crucial for a robust and high-fidelity experience in real-world applications of alignment.

I would like to express my heartfelt thanks to you, the reader, for taking the time to read my article. I hope it was helpful, and I would appreciate it if you could leave your feedback in the comments section below. MR is a super exciting field, and I am grateful to my employer, GridRaster Inc., for giving me the opportunity to study it as part of my day-to-day job and make meaningful contributions to the field. To learn how GridRaster is developing cutting-edge artificial intelligence based 3D computer vision solutions for enterprise mixed reality applications (including alignment), check out this article.

References

  1. 3D Point cloud denoising via Deep Neural Network based local surface estimation
  2. Research on scattered points cloud denoising algorithm
  3. Robust Feature-Preserving Denoising of 3D Point Clouds
  4. Total Denoising: Unsupervised Learning of 3D Point Cloud Cleaning
  5. Real-Time High Resolution 3D Data on the HoloLens
  6. High-resolution depth maps based on TOF-stereo fusion
  7. A review of algorithms for filtering the 3D point cloud


Manorama Jha

Software Development Engineer at Gridraster Inc. | Mixed Reality | Computer Vision | www.manoramajha.com