Projects:2017s1-100 Face Recognition using 3D Data
Introduction
This project seeks to develop a system that is capable of recognising faces captured using commercial off-the-shelf devices. It will be able to capture depth imagery of faces and align them to a common facial pose, before using them to perform recognition. The project will involve elements of literature survey (both sensor hardware and algorithmic techniques), software development (in Matlab), data collection, and performance comparison with existing approaches.
Objectives
Develop a system that is capable of recognizing faces captured using commercial off-the-shelf devices such as the Xbox Kinect.
- Recovery of 3D data from polarimetric imagery
- Recovery of 3D data from Xbox Kinect and alignment to common pose
- Facial recognition from 3D models
Project Team
Jesse Willsmore
Orbille Piol
Michael Sadler
Supervisors
Dr Brian Ng
Dr David Booth (DST Group)
Sau-Yee Yiu (DST Group)
Philip Stephenson (DST Group)
3D data from Polarimetry
3D data from Xbox Kinect and Pre-processing
This section involves the creation of the software responsible for interfacing the newest Kinect camera (Kinect V2) hardware with MATLAB software on a computer. It included the canonical preprocessing techniques done from the 3D depth point cloud data resulting to a 3D facial image aligned to a frontal view pose for face recognition.
Method & Results
Kinect and Eurecom Database Interface
The acquisition of the image using the Kinect V2 was done first through connecting the Kinect camera with the MATLAB software. The subject was set up to have a distance no further than 1 metre away from the camera. The reasoning behind this is to allow the depth resolution to work around the 1.5mm depth resolution accuracy of a Kinect Camera. The image acquisition toolbox present in Matlab allows the user to acquire three sets of possible data, a 2D RGB image, a grayscale depth map or a RGB-D point cloud. This thesis worked with the RGB-D point cloud to be consistent with the data acquired from the available database. An added bonus is that Matlab has already aligned the coloured image with its subsequent grayscale point cloud and thus preventing any misalignment issues presented by the difference in camera locations. The previous implementation of the Kinect with Matlab involved the aligning of the RGB pixels with the grayscale point cloud. The figure below displays the result of the image acquisition step for the Kinect camera
The Eurecom database was used as it already provides the array for the creation of point clouds with each subject and is the current database used for the facial recognition software. Another reason is due to the point cloud already been processed in a way that the background and any possible outliers from the front of the subject have been separated from the needed point cloud which in this case is the person being captured. The image below shows the three different poses the subjects mad which shows the initial processing done on the database already. The yellow and blue areas from the image of the point cloud represents the outliers, both the background and the frontal outliers. The main area of work is the green shaded area of the point cloud. The interface is currently working with images with no major occlusions, however eyeglasses are an exception. The reasoning behind this is to allow the nose-tip detection and face cropping stages described ahead with relative ease.
Nose Tip Detection
Nose-tip detection for the point cloud acquired from the Kinect data is fully automatic. It is done through the use of the Viola - Jones algorithm. The algorithm uses the already aligned coloured image to detect the face initially. The image is cropped on the detected face and is put through Viola - Jones again to acquire the nose of the face. This reduces the incidence of false detection and improves the accuracy of the nose-tip detection. The point cloud is then cropped depending on the detected nose, and the closest point to the camera is selected as the nose-tip. Although, this way of nose-tip detection may provide only a rough location of the nose-tip, the nose-tip point is only needed for the purpose of a finding a rough centre point of the face for face cropping. The subsequent face cropping algorithm is robust enough that it will work as long as the nose-tip detected is within a few points from the actual nose - tip.Due to the Viola - Jones algorithm not working for tilted heads, the detection for tilted head is done by first rotating the image point cloud until a face is detected. To improve robustness the algorithm also checks if the other necessary features which is needed for the face cropping is detected. As the location of the head has barely changed with the rotation of the image it is assumed that the rotated image detected face is at the same place as the tilted head and that the nose tip is within the same area as the nose point cloud. The figures below summarizes the steps in determining the nose tip point for the Kinect camera both with frontal and tilted heads.
With outliers and the background already removed from the point cloud acquired from the database an automatic nose-tip method can be done albeit a little different with the Kinect camera captured data. One of the major issues is the misalignment of the coloured images present in the database compared to the point cloud data and thus the same method is not viable. Nose-tip detection is done by utilizing the vertical facial symmetry of the average human face, by first finding the closest points to the camera for frontal poses or by finding the smallest or largest point values in X depending on left or right pose and creating a histogram with Y values. This is followed by disregarding points not around the middle section of the head and allowing only points that are within two thirds distance from the centre. The correct nose tip is determined by choosing the closest point to the camera for frontal poses or the leftmost or rightmost point of the face for profile photos within the now enclosed area. The two thirds distance leeway is to account for any additional outlier such as beards or glasses and also for any additional pose differences that has not been taken into account. This method is highlighted by figure underneath.
3D Face Cropping
The point cloud is put through a face cropping algorithm which only includes the important features such as the eyes, nose and the mouth while also removing the hair and ears from the point cloud system. The algorithm is initialized by assuming that the nose tip is at the center of the face. This is also necessary for the canonical preprocessing techniques especially the symmetric filling section 4.3.2. This is done by putting the nose-tip point acquired on the previous step and normalizing the X and Y values of the point cloud to place the nose tip on the origin point, while leaving the depth data untouched. The radius of the cropping is set up so that it will remove the ears, the neck and any excess hair collected from the top of the head leaving only a section of the forehead, eyes nose, mouth and a section of the chin.
The Eurecom database had all of the subjects at a same distance from the camera for all of the photos and for each session. This meant that the radius for the face cropping was consistent enough and the same throughout the whole database. The radius was selected through a trial and error basis, and the optimal radius was found to be along 80 geometric points from the nose-tip detected. The Kinect camera on the other hand did not have this luxury. Instead the Viola Jones algorithm is adjusted to find additional features, not only the nose. When possible, the eyes and the mouth were also located. Using the vertical facial symmetry of an average face and the cropped point clouds specific to the mouth, nose and the eyes, the radius of the face is found even with varying distances from the Kinect camera. The point cloud is also arranged in the order to which the width of the face is along the X axes, and the height of the face on the Y. The results are seen underneath
Facial recognition from 3D models
The proposed method for face recognition utilises sparse representation and is designed to be robust under occlusion and different facial expressions.
Method
A dictionary is built for each subject which is made up of that subject's training samples. These dictionaries can be utilised to identify a test sample by exploiting the assumption, that for each subject, these dictionaries will lie on a linear subspace in order to perform classification.