2D-3D Geometric Fusion Network using Multi-Neighbourhood Graph Convolution for RGB-D Indoor Scene Classification



Universitat Politècnica de Catalunya

Introduction
Multi-modal fusion has been shown to enhance the performance of scene classification tasks. This paper presents a 2D-3D Fusion stage that combines 3D Geometric Features with 2D Texture Features obtained by 2D Convolutional Neural Networks. To obtain a robust 3D Geometric embedding, we propose a network that uses two novel layers. The first layer, Multi-Neighbourhood Graph Convolution, learns a more robust geometric descriptor of the scene by combining two different neighbourhoods: one in the Euclidean space and the other in the Feature space. The second proposed layer, Nearest Voxel Pooling, improves the performance of the well-known Voxel Pooling. Experimental results on the NYU-Depth-V2 and SUN RGB-D datasets show that the proposed method outperforms the current state of the art in the RGB-D indoor scene classification task.

If you find this work useful, please consider citing:

	Albert Mosella-Montoro, Javier Ruiz-Hidalgo, 2D-3D Geometric Fusion network using Multi-Neighbourhood Graph Convolution for RGB-D indoor scene classification, Information Fusion, 2021, ISSN 1566-2535, https://doi.org/10.1016/j.inffus.2021.05.002
	
@article{MOSELLAMONTORO2021,
title = {2D–3D Geometric Fusion network using Multi-Neighbourhood Graph Convolution for RGB-D indoor scene classification},
journal = {Information Fusion},
volume = {76},
pages = {46-54},
year = {2021},
issn = {1566-2535},
doi = {https://doi.org/10.1016/j.inffus.2021.05.002},
author = {Albert Mosella-Montoro and Javier Ruiz-Hidalgo},
}

Check out our paper here.

Method
The proposed network is composed of two branches: the 3D Geometric branch and the 2D Texture branch. The 3D Geometric branch is built from two novel layers, Multi-Neighbourhood Graph Convolution (MUNEGC) and Nearest Voxel Pooling. Its input is a 3D point cloud, obtained either directly from a LiDAR sensor or from the depth map and intrinsic camera parameters of an RGB-D sensor. Each node of the input point cloud encodes depth using the HHA encoding, which maps depth into three channels in the 0-255 range: horizontal disparity, height above ground, and angle with the inferred gravity direction. The 2D Texture branch uses the well-known ResNet-18 architecture as a backbone; its input is the 2D RGB image of the same capture used by the 3D Geometric branch. The extracted 3D Geometric and 2D Texture features are then fused by the 2D-3D Fusion stage, and the result is passed to the Classification network to predict the scene class. The proposed network is depicted in the following figure:

[Figure: Overview of the proposed network architecture]
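
To make the input of the 3D Geometric branch concrete, the sketch below back-projects a depth map into a point cloud with a pinhole camera model. This is a minimal illustration, not code from the paper; the intrinsic values and image size are placeholder assumptions.

```python
import numpy as np

# Minimal sketch: back-project a depth map into a 3D point cloud using
# pinhole intrinsics (fx, fy, cx, cy). All values below are placeholders.
fx, fy, cx, cy = 518.8, 519.4, 325.6, 253.7
depth = np.random.rand(480, 640).astype(np.float32)  # depth in metres

v, u = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]  # pixel coordinates
x = (u - cx) * depth / fx
y = (v - cy) * depth / fy
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)  # (N, 3) cloud
```

In the proposed pipeline, each point would then carry its HHA channels as node features.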



The main contributions of this paper are:

The proposal of the Multi-Neighbourhood Graph Convolution (MUNEGC) operation, which considers the neighbours of the centre point in both the Feature and the Euclidean space (a sketch follows the figure below).


[Figure: Multi-Neighbourhood Graph Convolution (MUNEGC)]
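
As a rough illustration of the two neighbourhoods MUNEGC combines, the sketch below builds one k-NN graph in Euclidean space and another in Feature space with PyTorch Geometric. The value of k and the simple edge-set merge are assumptions for the example, not the paper's exact formulation.

```python
import torch
from torch_geometric.nn import knn_graph

# Illustrative sketch of the two neighbourhoods combined by MUNEGC.
# Tensor shapes and k are assumptions for the example only.
pos = torch.rand(1024, 3)    # point positions (Euclidean space)
feat = torch.rand(1024, 64)  # point features (Feature space)

edge_euclidean = knn_graph(pos, k=9)   # neighbours in Euclidean space
edge_feature = knn_graph(feat, k=9)    # neighbours in Feature space

# A MUNEGC-style layer aggregates messages over both neighbourhoods;
# merging the edge sets is one simple way to expose them together.
edge_index = torch.cat([edge_euclidean, edge_feature], dim=1)
```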


The Nearest Voxel Pooling algorithm, an improved version of the standard Voxel Pooling algorithm that mitigates the noise introduced by sensors (a baseline sketch follows).
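
For reference, the sketch below shows plain Voxel Pooling built from PyTorch Geometric primitives, i.e. the baseline that Nearest Voxel Pooling improves on. The voxel size and tensor shapes are assumptions for illustration only.

```python
import torch
from torch_geometric.nn import voxel_grid
from torch_geometric.nn.pool.consecutive import consecutive_cluster
from torch_scatter import scatter_mean

# Illustrative sketch of plain Voxel Pooling (the baseline).
pos = torch.rand(1024, 3)    # point positions
feat = torch.rand(1024, 64)  # point features

cluster = voxel_grid(pos, size=0.1)        # assign each point to a voxel
cluster, _ = consecutive_cluster(cluster)  # relabel clusters to 0..C-1
pooled_pos = scatter_mean(pos, cluster, dim=0)    # voxel centroids
pooled_feat = scatter_mean(feat, cluster, dim=0)  # averaged features
# Nearest Voxel Pooling additionally reassigns each point to its
# nearest voxel centroid before this averaging step, which mitigates
# sensor noise near voxel boundaries.
```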


The fusion of 2D-3D multi-modal features through the proposed 2D-3D Fusion stage. Fusing by geometric proximity lets the network exploit the benefits of 2D and 3D networks simultaneously (see the sketch after the figure below).


[Figure: 2D-3D Fusion stage]
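
A hypothetical sketch of proximity-based fusion follows: each 3D geometric feature is concatenated with the 2D texture feature whose back-projected position is nearest. The shapes and the k=1 nearest-neighbour assignment are assumptions for illustration; the paper's exact fusion operator may differ.

```python
import torch
from torch_geometric.nn import knn

# Hypothetical sketch of proximity-based 2D-3D fusion.
pos3d = torch.rand(256, 3)     # positions of 3D geometric features
feat3d = torch.rand(256, 128)  # 3D geometric features
pos2d = torch.rand(512, 3)     # back-projected positions of 2D features
feat2d = torch.rand(512, 128)  # 2D texture features

# For every 3D point, find its nearest 2D feature position.
assign = knn(pos2d, pos3d, k=1)  # assign[0]: 3D indices, assign[1]: 2D
fused = torch.cat([feat3d[assign[0]], feat2d[assign[1]]], dim=1)
```
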
Results

Results on SUN RGB-D Dataset

Mean Accuracy (%):

| Method                          | RGB  | Geometric | Fusion |
| ------------------------------- | ---- | --------- | ------ |
| Multi-modal fusion              | 40.4 | 36.5      | 41.5   |
| Effective RGB-D representations | 44.6 | 42.7      | 53.8   |
| DF²Net                          | 46.3 | 39.2      | 54.6   |
| MapNet                          | -    | -         | 56.2   |
| TrecNet                         | 50.6 | 47.9      | 56.7   |
| Ours                            | 56.4 | 44.1      | 58.6   |

Results on NYU-Depth-V2 Dataset

Mean Accuracy (%):

| Method                          | RGB  | Geometric | Fusion |
| ------------------------------- | ---- | --------- | ------ |
| Effective RGB-D representations | 53.4 | 56.4      | 67.5   |
| DF²Net                          | 61.1 | 54.8      | 65.4   |
| MapNet                          | -    | -         | 67.7   |
| TrecNet                         | 64.8 | 57.7      | 69.2   |
| Ours                            | 67.8 | 59.2      | 75.1   |
Code

We implemented our method using PyTorch and PyTorch Geometric.

Find the source code here.

Acknowledgements
This research was supported by Secretary of Universities and Research of the Generalitat de Catalunya and the European Social Fund via a PhD grant to the first author (FI2018), and developed in the framework of project TEC2016-75976-R, financed by the Ministerio de Economía, Industria y Competitividad and the European Regional Development Fund (ERDF).