For our open-ended computer vision project in my Computational Robotics course, my project partner and I wanted to work on something related to localization. The idea of SLAM in unstructured environments was pretty interesting to us, particularly the step where the pose of the camera is estimated relative to a newly-constructed map of the environment. Implementing a fully-fledged monocular SLAM system would have been too ambitious for this brief two-week project, so we decided to focus on creating a 3D reconstruction from still images. This process is fairly similar to the camera pose estimation step in monocular SLAM algorithms, but not having to deal with challenges such as loop closure or scale let us focus on how to get from 2D images to 3D structure.
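Concretely, getting from 2D to 3D here means matching keypoints between the two images, recovering the relative camera pose from the essential matrix, and triangulating the matched points into a pointcloud. The snippet below is a minimal sketch of that kind of two-view pipeline in OpenCV; the filenames, intrinsic matrix values, and parameters are placeholders rather than the exact ones used in our implementation.

```python
import cv2
import numpy as np

# Load the two input views in grayscale for feature detection.
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

# 3x3 camera intrinsic matrix (placeholder values -- the fountain dataset
# ships with its own calibration, and the Kinect has published intrinsics).
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

# 1. Detect and match keypoints between the two views.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
knn = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 2. Estimate the essential matrix and recover the relative pose
#    (rotation R and translation t, up to scale) of camera 2 w.r.t. camera 1.
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# 3. Triangulate the matches into 3D. Camera 1 is taken as the world
#    origin; camera 2's projection matrix is built from (R, t).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T  # convert from homogeneous coordinates
```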
Project Partner: Everardo Gonzalez
Project Results
Ultimately, we were able to get a 3D reconstruction from two images on both a publicly-available dataset, for which the camera calibration matrices were provided, and our own dataset, which we recorded with an Xbox 360 Kinect. Photos of our results are shown below:
Input Images - standard dataset:
Output pointcloud and computed camera positions - standard dataset:
Unfortunately, we didn’t have a “ground truth” 3D scan of the fountain or known locations of our camera when these photos were taken, which meant we couldn’t compute quantitative metrics to evaluate our results. However, we can qualitatively see that the results look pretty good - the wall behind the fountain appears fairly flat, and notable features like the ornament above the fountain, the fountain centerpiece, and the black and white sign in the bottom right are distinctly visible.
Qualitatively, the relative pose of the camera across the two images also looks good. In OpenCV's convention, the x-axis (red) points to the right, the y-axis (green) points down, and the z-axis (blue) points directly into the screen (this is illustrated in the image below).
Looking at the input images, we expect the orientation of the camera's axes across the two views to be fairly similar, with the z-axes pointed slightly towards each other. Since the photos are upright, we also expect the x-axis to point to the right and the y-axis to point downwards. Looking at our computed camera positions relative to our output pointcloud, they appear to match the expected positions and orientations, giving us confidence that our implementation was successful.
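For reference, one way to turn the recovered pose into the camera frames drawn above: with the OpenCV-style convention x2 = R·x1 + t, the second camera's center in the first camera's frame is C = -Rᵀt, and its axes are the columns of Rᵀ. The helper below is a small sketch of that conversion; the function name and the synthetic example pose are just illustrative.

```python
import numpy as np

def camera_frame_in_world(R, t):
    """Convert an OpenCV-style relative pose (x2 = R @ x1 + t, as returned
    by cv2.recoverPose) into the second camera's center and axis directions
    expressed in the first camera's coordinate frame."""
    center = (-R.T @ t).ravel()  # camera center: C = -R^T t
    axes = R.T                   # columns are the camera's x/y/z axes in world coords
    return center, axes

# Synthetic check: place camera 2 one unit to the right of camera 1,
# yawed 10 degrees so that its z-axis tilts back towards camera 1.
theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = -R @ np.array([[1.0], [0.0], [0.0]])

center, axes = camera_frame_in_world(R, t)
print(center)      # -> [1, 0, 0]
print(axes[:, 2])  # z-axis ~ [-0.17, 0, 0.98]: forward, leaning towards camera 1
```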
Input Images - custom dataset:
Output pointcloud and computed camera positions - custom dataset:
In the left view it’s evident that the camera locations look about as we’d expect from the input images, but it’s difficult to judge the quality of the pointcloud due to its sparseness (likely caused by the relatively low resolution of the Kinect compared to the camera used for the fountain dataset). However, looking at the pointcloud from the top-down view (pictured on the right), the distinct roundness of the globe can be seen.
We also tested our implementation on other views and found that the quality of the results drops significantly when it can’t find good keypoint matches. In particular, our implementation is sensitive to the following (a sketch of the matching step follows the list):
- low or inconsistent lighting
- repetitive patterns
- views that are either too different from or too similar to each other
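All of these failure modes trace back to the keypoint matching stage: if the detector can’t find distinctive, repeatable features, the essential matrix estimate degrades and the reconstruction falls apart. Below is a rough sketch of what that stage looks like with SIFT and Lowe’s ratio test, plus a simple match-count sanity check; the detector choice, ratio, and minimum count are illustrative rather than the exact values from our implementation.

```python
import cv2

def match_keypoints(img1, img2, ratio=0.75, min_matches=50):
    """Match SIFT keypoints between two grayscale views using Lowe's ratio
    test. Few surviving matches is a strong hint that the reconstruction
    will be unreliable (poor lighting, repetitive texture, or views that
    overlap too little or too much)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # For each descriptor in img1, find its two nearest neighbours in img2
    # and keep the match only if the best candidate is clearly better than
    # the runner-up. Repetitive patterns tend to fail this test because
    # both candidates look almost equally good.
    knn = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]

    if len(good) < min_matches:
        print(f"Warning: only {len(good)} good matches -- results may be unreliable")

    # Drawing the matches side by side is a quick way to spot bad image
    # pairs before running the rest of the pipeline.
    vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
    return good, vis
```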
Project Documentation
For more information and to see our detailed write-up, check out our project repo!