With the advancement of digital information, particularly digital images, researchers in computer vision have been drawn to explore the field further. Action recognition in particular has become one of the most rapidly growing areas, chiefly because it acts as a core component in many other applications, for instance video surveillance, human-computer interfaces and robotics.
Unlike the human brain, which can unconsciously process and distinguish objects, facial expressions and actions, machines must follow a complex procedure. Part of the human action recognition process is achieved by gathering action-based feature data over the time dimension and performing classification. However, the sheer size of the resulting feature dataset leads to the curse of dimensionality. To handle such large feature datasets adequately, the dimensionality of the feature data is compressed.
Dimensionality reduction in computer vision, particularly in action recognition, is a crucial step that maps a high-dimensional feature dataset into a lower-dimensional, meaningful form. In the best case, the dimensionality of the reduced data corresponds directly to the intrinsic dimensionality of the original data [1]. In other words, after dimensionality reduction the reduced feature dataset should contain only the minimum number of parameters needed to represent the actual data.
This research is motivated by the significance of dimensionality reduction in human action recognition and by the distinct characteristics of the wide variety of dimensionality reduction techniques. Its main focus is to investigate which dimensionality reduction technique is the most effective and robust, and what trade-offs it involves, when applied to human action recognition.
To facilitate this research, a complete and workable human action recognition model is required. Based on the paper and code made available by Piotr Dollár et al. [2], a complete action recognition model using sparse spatio-temporal features is reconstructed. The algorithm of the model used in this research is described briefly in this paper.
1.1 Aim
The core aim of this research is to explore and evaluate the performance of different dimensionality reduction techniques used to reduce the dimension of the descriptors produced during action recognition. The investigation is performed by using the KTH dataset (video sequences) in an empirical evaluation of a selection of dimensionality reduction techniques. If time permits, additional experiments on the same selected techniques will be conducted on the UCF50 dataset using the identical action recognition algorithm.
Additionally, the research aims to produce a workable algorithm and Matlab code to perform action recognition.
1.2 Objectives
In order to achieve the aim of this research, several objectives will be addressed:
Perform a thorough literature review in the field of action recognition
Become familiar with Matlab programming
Comprehend the variety of dimensionality reduction techniques in the up-to-date literature
Propose a feasible action recognition algorithm
Formulate an experimental procedure to evaluate and compare the performance of dimensionality reduction techniques
Conduct experiments and identify limitations/restrictions in the performance evaluation
Analytically evaluate the performance of the different dimensionality reduction techniques used in the proposed action recognition algorithm
Draw conclusions and specify the future research required
1.3 Report Outline
The report is organized as follows:
Section 2: Contains the background material for the research (literature review), reviewing recent work on dimensionality reduction and human action recognition. It includes a detailed review of selected dimensionality reduction methods and of diverse approaches to human action recognition.
Section 3: Outlines the human action recognition algorithm used for this research and its initial specification. It explains in detail the procedure required and identifies the stage that requires dimensionality reduction.
Section 4: Since the research is conducted within a rather short time frame, this section provides a comprehensive research plan, time management and risk assessment for the project.
2 Literature Review
2.1 Introduction to Literature Review
The aim of this section is to outline the literature relevant to the research in this paper. To enhance the understanding of dimensionality reduction in human action recognition, the first subsection reviews the basics of dimensionality reduction and discusses a selection of available dimensionality reduction methods. Subsequently, approaches to action recognition are reviewed and the action recognition model used in this research is briefly introduced.
2.2 Dimensionality Reduction
The curse of dimensionality refers to the complications that arise when working with large amounts of high-dimensional data. In this research, we are only interested in the implementation of dimensionality reduction in action recognition. In the course of action recognition, regardless of the processing method used, the curse of dimensionality still occurs and creates challenges for data analysis. It is well known that when collecting interest points and feature datasets, not all of the data is essential for a machine to distinguish or understand the underlying phenomena.
Dimensionality reduction is essential to shrink the feature dataset, hence decreasing computational processing time and memory requirements. According to [3], after dimensionality reduction we can uncover hidden structure, which aids a better understanding of the dataset and allows the data to be visualized more efficiently. Dimensionality reduction techniques can be divided into two categories: linear and nonlinear.
Among the earliest dimensionality reduction techniques developed are Principal Component Analysis (PCA) and Multi-dimensional Scaling (MDS). The drawback of these two techniques is that they can only handle linear datasets. In recent years, researchers have developed new dimensionality reduction techniques able to process nonlinear data, for instance Isometric Feature Mapping (Isomap) and Sammon mapping.
2.2.1 Linear Dimensionality Reduction technique
As mentioned in [3], linear dimensionality reduction techniques produce a lower-dimensional space that is a linear combination of the original variables.
Principal Component Analysis (PCA)
PCA, also referred to as the Karhunen-Loève transform (KLT), is a classic and extensively used unsupervised dimensionality reduction technique invented by Karl Pearson in 1901. It can significantly reduce the original dimensionality to an uncorrelated feature set with minimal loss of information [4].
To explain PCA mathematically, we discuss it step by step until a reduced-dimensionality feature dataset is obtained.
1. Record the dataset from the experiment. The dataset could be 2-dimensional, 3-dimensional and so on.
2. Subtract the mean value from each dimension. The mean value is the average across that dimension.
3. Calculate the covariance matrix.
4. Compute the eigenvalues and eigenvectors of the covariance matrix.
5. Form the feature vector.
In step 4, upon obtaining the eigenvalues and eigenvectors of the covariance matrix, the eigenvectors are arranged by eigenvalue from highest to lowest. As a result, the eigenvectors are ordered by significance. The most significant eigenvector is also referred to as the principal component.
Then, as many eigenvectors as required are chosen and the less significant components are removed. This elimination causes a loss of information (the amount of information lost is proportional to the discarded eigenvalues).
With the eigenvectors selected, the feature vector is formed [7]:
FeatureVector = (eig1, eig2, eig3, ...)    (1)
For example, take a dataset with dimensionality X. Performing the calculation up to step 4 produces the eigenvectors and eigenvalues. The eigenvectors are then sorted in order of significance and only P eigenvectors are selected (P less than or equal to X), yielding P dimensions.
The final step is to generate the new dataset.
The feature vector is transposed and then multiplied with the output obtained from step 2. From [7],
NewDataset = Row_FeatureVector x Row_Data    (2)
where Row_FeatureVector is the transpose of the feature vector and Row_Data is the transpose of the data after subtraction of each dimension's mean value.
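To make the procedure concrete, a minimal Matlab sketch of the steps above is given here. The variable names are our own and the snippet is illustrative only, not the code used in the experiments.

    % Minimal PCA sketch following the steps above; each row of X is one observation/descriptor.
    % X, p, Y and V are illustrative names only.
    function [Y, V] = pca_sketch(X, p)
        mu = mean(X, 1);                        % step 2: mean of each dimension
        Xc = X - repmat(mu, size(X, 1), 1);     % subtract the mean from every observation
        C  = cov(Xc);                           % step 3: covariance matrix
        [V, E] = eig(C);                        % step 4: eigenvalues and eigenvectors
        [~, order] = sort(diag(E), 'descend');  % arrange by eigenvalue, highest first
        V = V(:, order(1:p));                   % step 5: keep the p most significant eigenvectors
        Y = Xc * V;                             % project the data (equation (2), row convention)
    end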
As stated in [3, 6], the advantages of implementing PCA are noise reduction and the minimization of redundancy and ambiguity with a straightforward computational process. However, because PCA is a linear dimensionality reduction technique, it cannot accurately process nonlinear data.
2.2.2 Nonlinear Dimensionality Reduction technique
Nonlinear dimensionality reduction methods are also referred to as manifold learning. Over the last decade, researchers have shifted their focus to nonlinear dimensionality reduction, and numerous new methods have been proposed, for example Locally Linear Embedding (LLE) and Isomap. Unlike linear techniques, nonlinear dimensionality reduction offers better performance on complex nonlinear data. For real-world data such as digital images and speech, nonlinear dimensionality reduction outperforms linear dimensionality reduction, since real-world data often lies on high-dimensional nonlinear manifolds [5].
Locally Linear Embedding (LLE)
LLE was invented by Roweis and Saul in 2000. The technique is built on the assumption that the neighbourhood of each data point is locally linear [10].
The LLE algorithm is described as follows:
1. Search for the nearest neighbours of each point by computing the distances between the original data points, Xi.
2. Calculate the weights wij such that each point is best reconstructed as a linear combination of its neighbours. To enforce this, minimize the reconstruction error
\varepsilon(W) = \sum_i \| X_i - \sum_j w_{ij} X_j \|^2    (3)
3. Using eigenvector-based optimization [11], compute the low-dimensional coordinates Yi such that each coordinate is best reconstructed by the weights wij, minimizing
\Phi(Y) = \sum_i \| Y_i - \sum_j w_{ij} Y_j \|^2    (4)
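For illustration, a compact Matlab sketch of these three steps is given below, assuming the data matrix X holds one point per column. It is a simplified sketch rather than the implementation used in this research.

    % Minimal LLE sketch; X is D x N (one point per column), k neighbours, d output dimensions.
    function Y = lle_sketch(X, k, d)
        [~, N] = size(X);
        % step 1: nearest neighbours from pairwise squared distances
        X2 = sum(X.^2, 1);
        dist2 = repmat(X2, N, 1) + repmat(X2', 1, N) - 2 * (X' * X);
        [~, idx] = sort(dist2, 2);
        nbrs = idx(:, 2:k+1);                               % skip the point itself
        % step 2: reconstruction weights w_ij minimizing equation (3)
        W = zeros(N, N);
        for i = 1:N
            Z = X(:, nbrs(i, :)) - repmat(X(:, i), 1, k);   % centre the neighbours
            C = Z' * Z;
            C = C + eye(k) * 1e-3 * trace(C);               % regularize when k > D
            w = C \ ones(k, 1);
            W(i, nbrs(i, :)) = w / sum(w);                  % weights sum to one
        end
        % step 3: embedding from the bottom eigenvectors of (I - W)'(I - W), equation (4)
        M = (eye(N) - W)' * (eye(N) - W);
        [V, E] = eig(M);
        [~, order] = sort(diag(E), 'ascend');
        Y = V(:, order(2:d+1))';                            % discard the constant eigenvector
    end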
2.3 Human Action Recognition via sparse spatio-temporal features
Human action recognition is a research field in computer vision that serves as a main component in various domains such as video surveillance, human-computer interfaces and robotics. It has received growing attention since the 1980s [12], and consequently a huge diversity of methods to accomplish action recognition can be found.
Approaches to action recognition include sparse spatio-temporal features [8], local space-time features combined with SVM classification [14], compound features mined from dense spatio-temporal data [15], spatio-temporal salient points (measuring changes in the information content of pixel neighbourhoods in space and time) [16], and the use of global information from each video input to select relevant interest points for a condensed representation [17].
To carry out experiments examining the performance of each approach, several action databases are available online, including the KTH dataset [18], which is widely used in research [13, 17], the Weizmann dataset and the UCF dataset [19].
Regardless of the approach taken, action recognition always comes down to three main procedures: feature extraction, dimensionality reduction and pattern classification [13].
In the context of this research, we select only one of these methods to perform action recognition. Piotr Dollár et al. developed a behaviour recognition system (for humans and rodents) using sparse spatio-temporal features [8]. Unlike the 3D Harris corner detector, which yields too few interest points, the feature extraction proposed by Dollár et al. is designed to detect a richer set of interest points while remaining robust to irrelevant and misleading features [8]. This method is adopted in our action recognition model, and the approach of Dollár et al. to feature detection is discussed in detail in the next section.
For classification, we have chosen Naïve Bayes Nearest Neighbour (NBNN). This method was chosen for its simplicity, as it does not require a training phase.
3 Initial Design
This section outlines the human action recognition algorithm, which constitutes the initial work done in this research to evaluate the performance of the selected dimensionality reduction techniques. The first segment briefly describes the overall algorithm of the model. Subsequent segments discuss in more depth the procedure used to achieve human action recognition and specify the particular stage that requires dimensionality reduction.
3.1 Proposed Human Action Recognition Algorithm
In order to replicate the human action recognition model proposed in [8], the Matlab code for action recognition written for this research is built by adapting the toolbox and code of Piotr Dollár et al. [8].
Figure 3. Proposed Human Action Recognition algorithm used for research
Action recognition in this paper is performed using the KTH dataset, which comprises six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping). The video sequences for each action are performed by 25 persons in 4 different scenarios [9].
As shown in Fig. 3, the general framework for action recognition consists of two main parts. The first part (data collection) identifies interest points in the video sequences, extracts the spatio-temporally windowed pixel values (referred to as cuboids in this paper) [8] and subsequently transforms the cuboids to produce feature descriptors. To reduce the dimensionality of the feature descriptors, we implement several different dimensionality reduction techniques in this research and evaluate the performance of each. Finally, the reduced descriptors are stored for use in classification.
The second part involves testing. The same procedure is applied to the test video sequences and, once the cuboid descriptors are obtained, Naïve Bayes Nearest Neighbour (NBNN) is applied for classification. Afterwards, the accuracy of the identified actions is evaluated with respect to the dimensionality reduction technique used.
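A high-level Matlab outline of this pipeline is sketched below. All helper function names are hypothetical placeholders for the stages of Sections 3.1.1 to 3.1.3 and are not functions from the Dollár et al. toolbox.

    % High-level outline of the pipeline in Fig. 3; all helpers below are hypothetical placeholders.
    trainFeats = cell(1, numel(trainVideos));
    for i = 1:numel(trainVideos)
        V    = loadVideoVolume(trainVideos{i});        % H x W x T grayscale volume (hypothetical loader)
        pts  = detectInterestPoints(V, sigma, tau);    % response function, Section 3.1.1
        cubs = extractCuboids(V, pts);                 % spatio-temporal windows, Section 3.1.2
        desc = cuboidDescriptors(cubs);                % normalized-brightness histograms
        trainFeats{i} = reduceDimensionality(desc);    % PCA, LLE, ... (technique under evaluation)
    end
    % Testing: repeat the same steps on each test sequence, then classify the reduced
    % descriptors with NBNN (Section 3.1.3) and record the recognition accuracy.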
3.1.1 Feature Detection
The feature detector used in this research, reproduced from [8], is characterised by generating strong responses when regions of the video sequence contain motion.
Figure 4. Location of responses taken from KTH jogging video sequences (person 1)
Fig. 4 shows an example of interest point detection on a human action (output generated from the Matlab code written for this research), which is based on separable linear filters. The response function [8] used to perform detection is:
R = (I * g * h_{ev})^2 + (I * g * h_{od})^2    (5)
Here I is the image sequence, * denotes convolution and g(x, y; σ) is the 2-dimensional Gaussian smoothing kernel applied along the spatial dimensions. For the temporal dimension, a pair of 1-dimensional Gabor filters is applied, defined as:
h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega) e^{-t^2/\tau^2}    (6)
h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega) e^{-t^2/\tau^2}    (7)
Based on [8], Dollár et al. propose applying ω = 4/τ in all cases, so that the response function effectively has two parameters, σ and τ.
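As an illustration, the response of equations (5) to (7) can be computed in Matlab roughly as follows. The values chosen for σ and τ are assumptions for the example, and fspecial and imfilter from the Image Processing Toolbox are used for the filtering; this is a sketch, not the detector code used in the experiments.

    % Sketch of the separable-filter response on a grayscale video volume V (H x W x T, double).
    % sigma and tau are assumed values for illustration; omega = 4/tau as in [8].
    sigma = 2;  tau = 3;  omega = 4 / tau;
    t   = -floor(3 * tau) : floor(3 * tau);
    hev = -cos(2 * pi * t * omega) .* exp(-t.^2 / tau^2);        % even temporal Gabor, eq. (6)
    hod = -sin(2 * pi * t * omega) .* exp(-t.^2 / tau^2);        % odd temporal Gabor, eq. (7)
    g   = fspecial('gaussian', 2 * ceil(3 * sigma) + 1, sigma);  % 2-D spatial Gaussian kernel
    Vs  = imfilter(V, g, 'symmetric');                           % spatial smoothing of every frame
    Rev = imfilter(Vs, reshape(hev, 1, 1, []), 'symmetric');     % temporal filtering, even component
    Rod = imfilter(Vs, reshape(hod, 1, 1, []), 'symmetric');     % temporal filtering, odd component
    R   = Rev.^2 + Rod.^2;                                       % response function, eq. (5)
    % interest points are taken at the local maxima of R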
3.1.2 Cuboids Extraction
From the features identified in the previous section, the cuboids (spatio-temporal pixel windows) that best fit the response function, equation (5), are extracted. Subsequently, cuboid descriptors are generated from the extracted cuboids.
Figure 5. Cuboids extracted from KTH jogging video sequences (person 1)
A cuboid descriptor is obtained by transforming the extracted cuboid. As proposed by Dollár et al., several transformation methods can be applied; for this research, we first transform the cuboids into normalized brightness values and then histogram the values in each cuboid. This choice was made because histogramming the values in a cuboid enhances robustness to perturbations while removing spatial and temporal positional information [8].
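A minimal sketch of this descriptor computation is given below; the bin count and value range are assumptions made for illustration and may differ from the settings used in the experiments.

    % Sketch of the flattened-histogram descriptor for one extracted cuboid (a 3-D array 'cuboid').
    % The 32-bin layout over [-3, 3] is an assumption made for illustration.
    cub   = double(cuboid);
    cub   = (cub - mean(cub(:))) / (std(cub(:)) + eps);   % normalize brightness (zero mean, unit variance)
    edges = linspace(-3, 3, 33);                          % 32 bins over roughly +/- 3 standard deviations
    h = histc(cub(:), edges);                             % histogram the normalized values
    descriptor = h(1:end-1);                              % drop the overflow bin
    descriptor = descriptor / sum(descriptor);            % normalize the histogram to sum to one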
Our core focus in this research is to apply several dimensionality reduction techniques to reduce the dimensionality of the computed descriptors. Later, the KTH test video sequences are used for testing, and from the output obtained the dimensionality reduction techniques are evaluated in terms of accuracy, speed and complexity. This work will be covered in the second part of the research.
3.1.3 Pattern Classification
Naïve Bayes Nearest Neighbour (NBNN) is simple yet offers impressive accuracy for the feature-based classification in our research. Although NBNN requires a longer computation period, it does not require a training procedure; the method simply keeps the keypoint descriptors of the training video sequences [20].
\hat{k} = \arg\min_{k} \sum_{i=1}^{n} \| d_i - NN_k(d_i) \|^2    (8)
Each test video sequence is assigned to the class k that minimizes equation (8). In equation (8), d_i denotes a keypoint descriptor extracted from the video sequence and NN_k(d_i) is the nearest neighbour of d_i among the keypoints of class k.
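The decision rule of equation (8) can be sketched in Matlab as follows; pdist2 from the Statistics Toolbox is assumed to be available, and the variable names are illustrative rather than those of our implementation.

    % Minimal NBNN sketch implementing equation (8). descTest is an n x D matrix of descriptors
    % from one test sequence; trainDesc{k} is an m_k x D matrix of training descriptors of class k.
    function label = nbnn_classify(descTest, trainDesc)
        numClasses = numel(trainDesc);
        cost = zeros(1, numClasses);
        for k = 1:numClasses
            D2 = pdist2(descTest, trainDesc{k}).^2;   % squared distances to every class-k descriptor
            cost(k) = sum(min(D2, [], 2));            % sum of nearest-neighbour distances, eq. (8)
        end
        [~, label] = min(cost);                       % the class with the minimum total distance wins
    end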