Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View
We present Im2Pano3D, a convolutional neural network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation (≤ 50%) in the form of an RGB-D image. To make this possible, Im2Pano3D leverages strong contextual priors learned from large-scale synthetic and real-world indoor scenes. To ease the prediction of 3D structure, we propose to parameterize 3D surfaces with their plane equations and train the model to predict these parameters directly. To provide meaningful training supervision, we use multiple loss functions that consider both pixel-level accuracy and global context consistency. Experiments demonstrate that Im2Pano3D is able to predict the semantics and 3D structure of the unobserved scene with more than 56% pixel accuracy and less than 0.52 m average distance error, which is significantly better than alternative approaches.
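To make the plane-equation parameterization concrete, the following is a minimal sketch and not the paper's implementation. It assumes an equirectangular panorama layout, a per-pixel unit normal n, and a scalar plane offset p defined by n · X = p with the camera at the origin. The function names (`ray_direction`, `point_to_plane`, `plane_to_point`) and the sign convention are hypothetical choices made for illustration.

```python
import math

def ray_direction(u, v, width, height):
    """Unit viewing ray for pixel (u, v), assuming an equirectangular panorama."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi   # longitude in (-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi  # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def point_to_plane(point, normal):
    """Plane offset p such that normal . X = p for the plane through `point`."""
    return dot(normal, point)

def plane_to_point(ray, normal, offset, eps=1e-6):
    """Intersect the viewing ray with the plane normal . X = offset.

    Returns the 3D point, or None when the ray is (nearly) parallel to the plane.
    """
    denom = dot(normal, ray)
    if abs(denom) < eps:
        return None
    t = offset / denom  # distance along the unit ray
    return tuple(t * r for r in ray)
```

Under these assumptions, all pixels on one planar surface (a wall, say) share the same normal and offset, so the regression target is piecewise constant instead of varying smoothly with depth across the surface.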
Here we demonstrate how Im2Pano3D can generalize to different camera configurations. The camera configurations we consider include: single or multiple registered RGB-D cameras such as Matterport cameras (a-d), a single RGB-D camera capturing a short video sequence (e), a color-only panoramic camera (f), and color panoramic cameras paired with a single depth camera (g). To improve the ability of the network to generalize to different input observation patterns, we use a random view mask during training. Tab. 3 shows the qualitative evaluation. For all of these camera configurations, Im2Pano3D provides a unified framework that effectively fills in the missing 3D structure and semantic information of the unobserved scene.
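The idea of a random view mask can be illustrated with a short sketch. The text does not specify how the masks are sampled, so everything here is an assumption: the mask is a single full-height vertical band with horizontal wrap-around, the observed fraction is drawn uniformly between 10% and `max_observed`, and the names `random_view_mask` and `observed_fraction` are hypothetical.

```python
import random

def random_view_mask(width, height, max_observed=0.5, rng=None):
    """Binary mask over a panorama (1 = observed, 0 = to be predicted).

    Assumed sampling scheme: one vertical band of random width and random
    starting column, wrapping around the 360-degree seam.
    """
    rng = rng or random.Random()
    frac = rng.uniform(0.1, max_observed)   # assumed range of observed fraction
    band = max(1, int(frac * width))        # band width in pixels
    start = rng.randrange(width)            # random starting column
    mask = [[0] * width for _ in range(height)]
    for row in mask:
        for k in range(band):
            row[(start + k) % width] = 1    # wrap around the panorama seam
    return mask

def observed_fraction(mask):
    """Fraction of pixels marked as observed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

Sampling a fresh mask for every training example exposes the network to many observation patterns, which is the stated motivation for using a random view mask.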
In this example, the input RGB-D observation contains a view of a room with a TV on the right wall and part of a picture on the left wall.
In the output, the network not only completes the unobserved part of the painting and the chair but also correctly predicts the location of a bed.
The network also correctly predicts the existence of a window, but at a different location from the ground truth.
On the other hand, the prediction misses several objects, such as cabinets and pillows.
In this example, the input RGB-D observation contains only a view of a white wall and a white door. The network completes the scene as if it were a dining room with a table and chairs surrounding it. Although the completed scene looks plausible, it is very different from the ground truth, which is a hallway with a partial view of a bedroom (through a doorway). This example demonstrates cases where the partial input observation does not contain sufficient information for the network to make a prediction close to the ground truth.
In this example, the input RGB-D observation contains a view of a bathroom (through a doorway) and half of a closet. These elements typically co-exist in a bedroom. As a result, the network predicts the scene category to be a bedroom. In particular, the network predicts the semantics and 3D structure of a bed, a window, and a cabinet in the missing region, without any direct observation of these objects in the input. While the network correctly predicts the existence of these objects and makes a reasonable prediction of the room layout, the predicted bed is smaller than that of the ground truth. Also, the predicted window and cabinets are in different locations compared to the ground truth. From the probability distribution maps, we can also observe that the network has several hypotheses for the potential locations of doors, but with lower probability, indicating the uncertainty of the predictions.