Artificial Neural Network Sound Recognition Computer Science Essay

Published: November 9, 2015 Words: 5335

Sound recognition is an attempt to identify sounds using different methods. In sound recognition, the goal is to identify a sound irrespective of other sounds being made. Speech, which has formal structure (vowels, consonants, phonemes), and music, which has harmonic structure (notes, rhythm, timbre) [1], are both structured sound. General sounds, on the other hand, typically contain a variety of components, including ones characterized by narrow spectral peaks, such as the chirping of insects. Such sounds are unstructured, variably composed and similar to noise, and thus models are difficult to build for them [2].

Sound recognition has found application in areas such as a speech recognition system implemented as part of Teaching and Learning Using Information Technology (TLIT) [3], speaker identification [4, 5, 6], automatic speaker recognition [7], phoneme recognition [8], word recognition [9], environmental sound recognition [10], voice recognition [11] and emotion recognition [12]. These are indications of how widely sound recognition has been used.

An artificial neural network is an information processing method inspired by the way biological nervous systems (such as the brain and its neurons) process information [13]. Basically, an artificial neural network is a system that receives an input, processes the data, and provides an output. A neural network is a processor that has the disposition for storing knowledge and making it available for use [14]. The main characteristic of neural networks is their ability to learn complex nonlinear input-output relationships through successive training procedures and to adapt themselves to the data. The most widely used families of neural networks for pattern classification tasks [13] are the feed-forward network and the recurrent neural network (RNN). Another popular network is the self-organizing map (SOM), or Kohonen network [15], which is mainly used for data clustering and feature mapping.

The increasing popularity of neural network models to solve pattern recognition problems has been primarily due to their seemingly low dependence on domain-specific knowledge and due to the availability of efficient learning algorithms for practitioners to use [14]. They provide a new suite of nonlinear algorithms for feature extraction and classification. In addition, existing feature extraction and classification algorithms can also be mapped on neural network architectures for efficient (hardware) implementation. An ANN can be applied to pattern recognition or data classification, through a learning process [13]. Neural networks have changed the way we solve "real-world" problems in science, engineering and economics [16].

Literature review

Overview

Sound recognition is a stepwise process involving preprocessing of the sound signal, feature extraction and classification. The first stage is the preprocessing stage, which runs from A/D conversion through to windowing. In the feature extraction stage, methods such as the Mel frequency cepstral coefficient (MFCC), linear predictive coding (LPC) and its derivatives have been used. The final stage of classification and identification has been carried out using Gaussian mixture models, hidden Markov models (HMM) and numerous neural network architectures.

Speech recognition

Three major approaches have been broadly used in speech recognition systems, namely the acoustic-phonetic approach, the pattern recognition approach and the artificial intelligence approach. The acoustic-phonetic approach attempts to decode the speech signal in a sequential manner based on its acoustic-phonetic characteristics. The pattern recognition approach, on the other hand, classifies the speech patterns without explicit feature determination [14] and segmentation. The artificial intelligence approach creates a hybrid of the two previous approaches. In addition, the combination of neural networks and linear dynamic models has been shown to achieve a high level of accuracy in automatic speech recognition systems. A persistent problem in speech recognition is the increase in error rate in the presence of noise, such as in a typical office environment [9].

[Figure: block diagram of the speech signal S(t) passing through Preprocessing, Feature Extraction (speech features) and Classification (feature vector).]

Fig 1: Basic speech recognition system [18]

Preprocessing

Analog to Digital Conversion (A/D)

The input speech signal is transformed into an electrical signal using a microphone. Before A/D conversion, a low-pass filter is used to eliminate the aliasing effect during sampling. A continuous speech signal has a maximum frequency component at about 16 kHz. According to the Nyquist sampling theorem, the sampling rate of the A/D converter should be at least double this frequency; therefore a sampling rate of about 32 kHz should be used [3, 18, 19].

Pre-emphasis

Before the digital speech signal can be used for feature extraction, a process called pre-emphasis is applied. High-frequency formants have lower amplitudes than low-frequency formants, and pre-emphasis aims to reduce this high spectral dynamic range [20]. Pre-emphasis is accomplished by passing the signal through an FIR filter [3]. This stage may also involve de-noising, which is the process of eliminating the noise components while keeping the important information in the speech [7, 4].
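As a minimal sketch of this step, the first-order FIR pre-emphasis filter commonly written as y[n] = x[n] - a*x[n-1] can be applied as follows; the coefficient a = 0.97 is an illustrative value, since the essay does not fix it:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is a typical value, assumed here for illustration."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```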

Frame Blocking

Frame blocking is the segmentation of the speech wave into frames. Because the vocal tract moves mechanically slowly, speech can be treated as a random process with slowly varying properties. Hence, the speech can be divided into frames over which the signal is assumed to be stationary with constant statistical properties. The frames are overlapped to ensure the continuity of the speech signal [3]. However, the number of frames differs between signals because of varying speaking rates, and processing all frames is time consuming [9, 5].
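A simple illustration of frame blocking with overlap, assuming the frame length and step are given in samples (the 25 ms / 10 ms values in the comment are only an example, not taken from the essay):

```python
import numpy as np

def frame_signal(signal, frame_len, frame_step):
    """Split a 1-D signal into overlapping frames.

    Example (illustrative): 25 ms frames with a 10 ms step at 16 kHz give
    frame_len = 400 and frame_step = 160 samples."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    frames = np.zeros((num_frames, frame_len))
    for i in range(num_frames):
        chunk = signal[i * frame_step : i * frame_step + frame_len]
        frames[i, :len(chunk)] = chunk  # the last frame is zero-padded if short
    return frames
```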

Windowing

Frame blocking is followed by windowing, which reduces the energy at the edges and decreases the discontinuities at the edges of each frame, thereby preventing abrupt changes at the endpoints. This is done by tapering the signal to zero, or near zero, at the frame boundaries. The two most common window types are the rectangular window and the Hamming window, and the Hamming window is the one most widely used [3, 18, 5].

W(n) = 1 for 0 ≤ n ≤ N, and 0 otherwise (Rectangular window)    Eqn. 1

W(n) = 0.54 - 0.46*cos(2πn/(N-1)) for 0 ≤ n ≤ N, and 0 otherwise (Hamming window)    Eqn. 2
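A short sketch applying the Hamming window of Eqn. 2 to the frames produced by frame blocking (numpy's np.hamming implements the same 0.54 - 0.46*cos(2πn/(N-1)) form):

```python
import numpy as np

def apply_hamming(frames):
    """Multiply each frame by a Hamming window (Eqn. 2).

    `frames` is the (num_frames, frame_len) array from frame blocking."""
    window = np.hamming(frames.shape[1])
    return frames * window
```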

Feature extraction

Feature extraction plays a very important role in sound identification. Because of irregularities in human speech, speech can be sensibly interpreted using frequency-time representations such as a spectrogram. A spectrogram shows the change in the amplitude spectrum over time and has three axes: the x-axis is time, the y-axis is frequency, and the z-axis is colour intensity, which represents magnitude [21]. Some of the methods used to extract features from sound, speech and music include endpoint detection, linear predictive coding and cepstral analyses [11].

Endpoint Detection

The endpoint detection technique is applied to extract the region of interest from the speech signal; in other words, it removes the silent regions in speech signals. The basic technique of endpoint detection is to find the energy level of the signal. The signal energy level is calculated in frames, where each frame consists of a number of samples, and the frames usually overlap with adjacent frames to produce a smooth energy curve. Accurate endpoint detection is important to reduce the processing load and increase the accuracy of a speech recognition system. There are two well-known endpoint detection algorithms: the first uses signal features based on the energy level, and the second uses signal features based on the rate of zero crossings. The combination of both gives good results, but increases the complexity of the program and the processing time [9, 22]. Endpoint detection has found usage in word recognition [9] and speech recognition [21].
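A hedged sketch of the combined energy / zero-crossing idea described above; the thresholds are illustrative assumptions, since the essay does not give specific values:

```python
import numpy as np

def detect_endpoints(frames, energy_ratio=0.1, zcr_thresh=0.25):
    """Mark frames as speech using short-term energy and zero-crossing rate.

    A frame is kept if its energy is above a fraction of the maximum frame
    energy or its zero-crossing rate is high; both thresholds are assumed
    values for illustration only."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_ratio * energy.max()) | (zcr > zcr_thresh)
```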

Linear Predictive Coding (LPC)

Linear predictive coding is a method of encoding a speech signal in which each sample value is predicted by a linear function of the past values of the signal. It is a compression method that models the process of speech production [9]. The Levinson-Durbin algorithm is used to derive the LPC parameter set from an autocorrelation computation, with the highest autocorrelation lag being the order of the LPC analysis [20]. An improvement on the set of LPC coefficients is the derivation of the set of cepstrum coefficients. The factor that compels the modification of LPC to cepstral coefficients is the sensitivity of LPC to spectral slope and noise [3]. Also, a spectral envelope reconstructed from a truncated set of cepstral coefficients is much smoother than one reconstructed from LPC coefficients [17, 9].

The LPC coefficients of the selected frames are used as the inputs to the neural network [23]. Feature extraction techniques derived from LPC include cepstral coefficients, linear prediction coefficients, reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR), arcus sine coefficients (ARCSIN) and line spectral frequencies (LSF) [24]. LPCC has been used to extract speech features [3, 18, 25], while LPC has been used to extract features of car sound [23] and speaker features [7], and for word recognition [9], environmental sound [10] and phoneme recognition [19].
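The Levinson-Durbin recursion mentioned above can be sketched as follows; the analysis order of 12 is an assumed, typical value:

```python
import numpy as np

def lpc(frame, order=12):
    """LPC analysis of one windowed frame via the Levinson-Durbin recursion.

    Returns the prediction coefficients a[0..order] (with a[0] = 1) and the
    final prediction error; order = 12 is an illustrative choice."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12               # small epsilon guards against silent frames
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err               # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```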

[Figure: block diagram of Frame Blocking, Windowing, Autocorrelation Analysis, LPC Analysis and LPC Parameter Conversion.]

Fig 2: Feature extraction procedure using Linear Prediction Coding [19]

Perceptual Linear Prediction (PLP)

Perceptual linear prediction provides a representation corresponding to a smoothed short-term spectrum that has been compressed and equalized much as in human hearing. It can be considered similar to mel-cepstrum based features. PLP provides reduced resolution at high frequencies, which is characteristic of auditory filter bank based methods, yet provides the orthogonal outputs that typify cepstral analysis. PLP uses linear prediction for spectral smoothing; hence the name perceptual linear prediction [5]. One reason for the better performance of PLP features is that they are cepstral features, and cepstral features are useful because they operate in a domain in which the excitation function and the vocal tract filter function are separable [8].

Mel Frequency Cepstral Coefficient (MFCC)

The audio sequence is loaded and decomposed into successive frames, which are then converted into the spectral domain. The spectra are converted from the frequency domain to the Mel-scale domain and the frequencies are then rearranged into frequency bands called Mel bands. The envelope of the Mel-scale spectrum is described by the MFCCs, which are obtained by applying the discrete cosine transform to the Mel-scale spectrum [26]. Usually only a restricted number of them (for instance the first 13) are retained [1]. The mel-frequency cepstrum is highly effective in audio recognition and in modeling the subjective pitch and frequency content of audio signals. The Mel scale is calculated as

Mel(f) = 2595 * log10(1 + f/700)    Eqn. 3

where Mel(f) is the value on the logarithmic mel scale and f is the actual frequency in Hz [4, 6].

The Mel scale has a constant mel-frequency interval and covers the frequency range 0 Hz - 20050 Hz [4, 6]. The filter banks for MFCC are based on the human auditory system and have been shown to work particularly well for structured sounds, such as speech and music; however, their performance degrades in the presence of noise [10]. MFCC can represent the low-frequency region more accurately than the high-frequency region and hence can capture the formants which lie in the low-frequency range. However, other formants can lie above 1 kHz, and these are not effectively captured by the wider spacing of filters in the higher frequency range [4]. MFCC has been used for feature extraction in speaker identification [4, 6, 25], speaker verification (a voice lock system) [7], environmental sound recognition [10] and emotion recognition [12].
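Assuming the common 2595*log10(1 + f/700) form of Eqn. 3, the mel conversion underlying the MFCC filter bank can be sketched as:

```python
import numpy as np

def hz_to_mel(f):
    """Map frequency in Hz to the mel scale (Eqn. 3, common constants assumed)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```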

Inverted Mel Frequency Cepstral Coefficient (IMFCC)

The inverted Mel scale can be described as a competing filter bank structure corresponding to a hypothetical auditory system that has followed a diametrically opposite path of evolution to the human auditory system. The idea is to capture information that would otherwise be missed by the original MFCC. The new filter bank structure, obtained by inverting the original filter bank, is defined by the following filter response [4]:

Eqn. 4

The IMFCC is able to represent the higher frequency range more finely: in this scale, pitch increases more and more rapidly as the frequency increases. Hence it effectively captures those high-frequency formants missed by the original MFCC. However, it has been observed that IMFCC is somewhat distorted for telephone speech, where the higher frequency information is used. Furthermore, the additional features incur little extra computational burden in the calculation of the feature set [4].

Pitch Extraction

A harmonic-peak-based method has been used to extract pitch from the sound wave. Since harmonic peaks occur at integer multiples of the pitch frequency, the peak frequencies are compared at each time instant to locate the fundamental frequency: the three highest-magnitude peaks are found for each frame and the differences between their frequencies are computed. From these differences an estimate of the fundamental frequency is derived. The peak vector consists of the three largest peaks in each frame. The advantage of this method is that it is very noise-resistant: even as noise increases, the peak frequencies should still be detectable above the noise. It has been used in a speaker identification system [23].
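A rough sketch of the harmonic-peak idea: pick the three largest spectral peaks in a frame and use the differences between their frequencies as estimates of the fundamental. The simple local-maximum peak picking here is an assumption, since the essay does not specify it:

```python
import numpy as np

def pitch_from_peaks(frame, fs, num_peaks=3):
    """Estimate the fundamental frequency of one frame from its largest
    spectral peaks, as described in the harmonic-peak-based method."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Indices of local maxima of the magnitude spectrum.
    peaks = np.where((spectrum[1:-1] > spectrum[:-2]) &
                     (spectrum[1:-1] > spectrum[2:]))[0] + 1
    top = peaks[np.argsort(spectrum[peaks])[-num_peaks:]]
    diffs = np.diff(np.sort(freqs[top]))
    return float(np.median(diffs)) if len(diffs) else 0.0
```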

Matching Pursuit (MP)

Matching pursuit attempts to obtain the minimum number of bases needed to represent a signal, resulting in a sparse and efficient representation. Using a dictionary that consists of a wide variety of bases, MP provides an efficient way of selecting a small basis set that produces meaningful features coupled with a flexible representation. Elements in the dictionary are selected so as to maximize the energy removed from the residual signal at each step, so the MP result depends on the choice of the dictionary. A dictionary is a set of bases whose linear combination produces an approximate representation of the signal. Several dictionaries have been proposed for MP, including frequency dictionaries, time-scale dictionaries and time-frequency dictionaries. One of the key advantages of this representation is that it is potentially invariant to background noise and can capture characteristics where MFCCs tend to fail. It provides an approximate representation and reduces the residual energy with as few atoms as possible. It has found use in the extraction of environmental sound features [10].
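A minimal greedy matching-pursuit loop over a dictionary of unit-norm atoms; the dictionary itself and the number of atoms are placeholders for whichever basis set is chosen:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms=10):
    """Greedy MP: at each step pick the atom that removes the most energy
    from the residual. `dictionary` has shape (num_atoms, signal_len) and
    its rows are assumed to have unit norm."""
    residual = np.asarray(signal, dtype=float).copy()
    chosen, coeffs = [], []
    for _ in range(n_atoms):
        projections = dictionary @ residual
        best = int(np.argmax(np.abs(projections)))
        chosen.append(best)
        coeffs.append(projections[best])
        residual -= projections[best] * dictionary[best]
    return chosen, coeffs, residual
```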

Power Spectral Density (PSD)

The PSD is computed by taking the magnitude-squared result of the Fourier transform; these periodograms are then averaged and scaled. The PSD of a voice contains unique features attributed to an individual, and these are used for recognition. The PSD values are presented in vector form to the pattern matching network. This method has found use in voice recognition [11].
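One way to obtain such averaged, scaled periodograms is Welch's method, available in SciPy; the segment length is an illustrative choice:

```python
from scipy.signal import welch

def psd_features(signal, fs, nperseg=512):
    """Averaged and scaled magnitude-squared spectra (Welch PSD) of the
    signal, returned as a feature vector together with its frequency axis."""
    freqs, pxx = welch(signal, fs=fs, nperseg=nperseg)
    return freqs, pxx
```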

Vector Quantization (VQ)

Vector quantization is the process of taking a large set of feature vectors and producing a smaller set of vectors that represent the centroids of the distribution. The system is optimized by using vector quantization to compress, and subsequently reduce the variability among, the feature vectors derived from the frames. After feature extraction, the similarity between the parameters derived from the collected sound and the reference parameters is computed; VQ is used to compare the parameter matrices [23] and to match features [6]. The essential elements in vector quantization are the distortion measure, the distance measure and the clustering algorithm, where the distance measure represents the distance between an input vector and a codebook vector [3, 5]. VQ has found usage in speech recognition [3], sound recognition [23] and speaker identification [5, 6].
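A small k-means style codebook trainer and distortion measure illustrate the idea; the codebook size, iteration count and Euclidean distance are assumptions, since the essay does not name a specific clustering algorithm:

```python
import numpy as np

def train_codebook(features, codebook_size=16, n_iter=20, seed=0):
    """Compress a large set of feature vectors into a small codebook of
    centroids (requires len(features) >= codebook_size)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), codebook_size,
                                   replace=False)].astype(float)
    for _ in range(n_iter):
        # Euclidean distance from every feature vector to every codeword.
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(codebook_size):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def distortion(features, codebook):
    """Average distance between each input vector and its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```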

Classification algorithms

This stage entails the classification and identification of sound, speech or music based on the features that have been extracted using any of the techniques discussed above. The commonly used methods of classification are the hidden Markov model and neural network algorithms such as the multilayer perceptron, the recurrent neural network and self-organizing maps.

Hidden Markov Model (HMM)

The hidden Markov model is a statistical method that has been used extensively in the field of automatic speech recognition. Given data representing an unknown speech signal, the statistical model of a possible utterance that most likely resembles the data is chosen. Therefore, every possible speech utterance should have a model governing the set of likely acoustic conditions that realize it. Comparison between the HMM models of the candidate words uncovers the speaker's utterance. The use of a discrete Markov model removes the burden of computing continuous probability distributions [3, 26].

However, Markov models treat the states generated from the speech frames as independent of one another, which is not true for speech. HMMs that try to follow piece-wise smooth trajectories are called segmental HMMs. In addition, the HMM tries to model complex phonological rules using phones rather than features, although it has been observed that complex phonological rules can be written more concisely using features, and a feature-based recognizer performs better than a phone-based recognizer in a noisy environment [7, 3]. HMMs have been used for the classification of isolated word prediction, isolated sentence prediction, prediction of relatively different sentences and relatively close sentences [3], for a speaker verification system [7] and for speech recognition [21].

Gaussian Mixture Model (GMM)

A Gaussian mixture model is a type of density model comprising a number of component Gaussian functions. These component functions are combined with different weights to produce a multi-modal density. Gaussian mixture models are a semi-parametric alternative to non-parametric histograms (which can also be used to approximate densities) and offer greater flexibility and precision in modeling the underlying distribution of sub-band coefficients. The Gaussian mixture density is the weighted sum of the component densities, and it has been used for emotion recognition [12].

Neural Networks

Artificial neural networks try to model the way human neurons process information. They often have one or more hidden layers of sigmoid neurons followed by an output layer of linear neurons. Multiple layers of neurons with nonlinear transfer functions allow the network to learn both nonlinear and linear relationships between input and output vectors.

The Multilayer Perceptron (MLP)

Multilayer perceptrons are one of many different types of neural networks. They comprise a number of neurons connected together to form a network. Such a network is able to classify the different aspects of a system's behaviour, recognize what is going on at the moment, diagnose whether this is correct or faulty, predict what it will do next and, if necessary, respond to what it will do next. If an MLP network has n input nodes, one hidden layer of m neurons, and two output neurons, the output of the network is given by

yi = fi( Σk=1..m wki * fk( Σj=1..n wkj * xj ) ),  i = 1, 2    Eqn. 5

where fk, k = 1, 2, ..., m, and fi, i = 1, 2 denote the activation functions of the hidden-layer neurons and the output neurons, respectively; wki and wkj, j = 1, 2, ..., n denote the weights connected to the output neurons and to the hidden-layer neurons, respectively; and xj denotes the input [3, 14].
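A small numeric rendering of Eqn. 5 for an n-m-2 network; the tanh and logistic activations are illustrative stand-ins for fk and fi:

```python
import numpy as np

def mlp_forward(x, W_hidden, W_out):
    """Forward pass of the n-m-2 MLP of Eqn. 5.

    W_hidden has shape (m, n) (weights wkj) and W_out has shape (2, m)
    (weights wki); the activation functions are assumed choices."""
    hidden = np.tanh(W_hidden @ x)                   # fk( sum_j wkj * xj )
    return 1.0 / (1.0 + np.exp(-(W_out @ hidden)))   # fi( sum_k wki * hk )
```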

However, a common problem when using multilayer perceptrons is how to choose the number of neurons in the hidden layer. There have been many suggestions on how to choose this number. An example is:

Eqn. 6

where h is the minimum number of hidden neurons, p is the number of training examples and n is the number of inputs of the network. This equation can be used as a reference for choosing the number of neurons in the hidden layer; the number of neurons in the hidden layer also affects the performance of the system [9]. However, multilayer neural networks are inherently unable to deal with time-varying information, such as the time-varying spectra of speech sounds [27]. The MLP has been used for isolated word prediction, isolated sentence prediction, prediction of relatively different sentences and relatively close sentences [3], as well as word recognition [9].

Recurrent Neural Network (RNN)

In a recurrent neural network, feedback connections are used to pass the output of a neuron in a certain layer back to the previous layer(s). The network described here has three layers: an input layer, a hidden layer and an output layer, and each of the output layer units has a feedback connection to itself. The output of each input layer unit is fed, through connections between the input and hidden layers, to all the hidden layer units; in the same manner, the output of each hidden layer unit is supplied, through connections between the hidden and output layers, to all the output layer units. The output of each output layer unit is fed back to itself. RNNs have the ability to process short-term spectral features yet still respond to long-term temporal events [27]. However, RNNs have limited memory and perform worse once that memory is exceeded [14]. They have been used in (Arabic) speech recognition [27, 25] and phoneme recognition [28].
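A single time step of the network described above could look like this; the weight names and tanh activations are assumptions, and only the output-layer self-feedback mentioned in the text is modelled:

```python
import numpy as np

def rnn_step(x, y_prev, W_in, W_out, w_self):
    """One time step: inputs drive the hidden layer, hidden units drive the
    output layer, and each output unit also receives its own previous value
    through a self-feedback weight (w_self holds one weight per output unit)."""
    hidden = np.tanh(W_in @ x)
    return np.tanh(W_out @ hidden + w_self * y_prev)
```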

Self Organizing Maps (SOMs)

Self-organizing maps have a different structure from feed-forward neural networks. The hidden layer neurons are arranged in a grid pattern and there is no output layer in this type of ANN. Input vectors activate hidden neurons according to how similar their weights are to the input values. SOM training adjusts and rearranges these weights so that similar inputs activate the same or nearby neurons. Thus large amounts of data can be organized into unlabelled categories corresponding to various activation clusters on the hidden layer grid map. The Kohonen learning mechanism is used in training SOMs [11, 18]. However, SOMs are unsuitable for supervised learning [14]; they have been used for voice recognition [11].

Radial Basis Function Network (RBF)

Another approach to classifying speech samples is to use a radial basis function network. This network also consists of three layers: an input layer, a hidden layer and an output layer. The main difference in this type of network is that the hidden layer uses (Gaussian) mapping functions. RBF networks are mostly used for function approximation, but they can also solve classification problems. The input layer is similar to the input layer of a multilayer feedforward network, while the hidden layer consists of neurons with radial basis functions. At the input of each neuron, the distance between the neuron's centre and the input vector is calculated, and the output of the neuron is formed by applying the basis function to this distance. The RBF network output is formed by a weighted sum of the neuron outputs and a unity bias. This type of neural network is practical for large training sets and performs very well for a small spread, but the number of hidden-layer neurons needed grows very quickly with the number of words to be recognized [21, 29]. It has been used in speech recognition [21, 29].

Probabilistic Neural Network (PNN)

The probabilistic neural network is a statistical classifier applied to determine the initial topology of the system; it is also used to recognize silence at a low level. The PNN algorithm represents the likelihood function of a given class as a sum of identical, isotropic Gaussians. In practice, the PNN is often an excellent pattern classifier, outperforming other classifiers including backpropagation-trained networks. However, it is not robust with respect to affine transformations of feature space, and this can lead to poor performance on certain data. The PNN is a pattern classification algorithm that falls into the broad class of nearest-neighbour-like algorithms; it is called a neural network because of its natural mapping onto a two-layer feedforward network. It shows how to classify a new sample with the maximum probability of success given enough prior knowledge, and it employs Parzen window density estimation with a Gaussian windowing function as the estimator. It has found use in phoneme recognition [28].

Training

Training is the process of teaching the network using input-output patterns so that, when it later receives data, it can check whether it has knowledge of the pattern.

Backpropagation

Backpropagation refers to the manner in which the gradient is computed for nonlinear multilayer networks. The backpropagation method is used to train the neural network, iteratively re-presenting the misclassified samples during training [3]. Backpropagation was derived by applying the Widrow-Hoff learning rule to multiple-layer networks with nonlinear differentiable transfer functions [9, 24]. The performance of the algorithm is very sensitive to the proper setting of the learning rate, and a momentum term is used in the backpropagation algorithm to achieve faster global convergence [11, 18]. As with momentum, if the new error exceeds the old error by more than a predefined ratio, the new weights and biases are discarded and the learning rate is decreased; otherwise the new weights are kept, and if the new error is less than the old error, the learning rate is increased [9]. Back-propagated delta rule networks (sometimes known as multilayer perceptrons, MLPs) are a well-known development of the delta rule for single-layer networks (itself a development of the perceptron learning rule). MLPs can learn arbitrary mappings or classifications, and the most common form of neural network is the 3-layer, fully connected, feed-forward MLP [18].

However, if the learning rate is set too high, the algorithm may oscillate and become unstable, while if it is too small, the algorithm will take too long to converge. It is not practical to determine the optimal learning rate before training, because the optimal learning rate changes during the training process. Also, the algorithm becomes slow towards convergence, but the selection of adaptive learning rates solves this problem [14].
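The adaptive rule described above can be written compactly; the numeric constants (1.04, 0.7, 1.05) are values commonly used with this heuristic and are assumptions here, not taken from the essay:

```python
def adapt_learning_rate(old_error, new_error, lr, weights, new_weights,
                        max_ratio=1.04, lr_dec=0.7, lr_inc=1.05):
    """Keep or discard a weight update and adjust the learning rate.

    If the new error exceeds the old error by more than max_ratio, the new
    weights are discarded and the rate is decreased; otherwise they are kept,
    and the rate is increased when the error has decreased."""
    if new_error > old_error * max_ratio:
        return weights, lr * lr_dec       # discard update, slow down
    if new_error < old_error:
        return new_weights, lr * lr_inc   # keep update, speed up
    return new_weights, lr                # keep update, rate unchanged
```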

Back Propagation Through Time (BPTT)

In training the recurrent neural network, backpropagation through time is used as the learning algorithm. This architecture has also been shown to achieve better phoneme recognition accuracies than the MLP when trained with the backpropagation algorithm. The BPTT algorithm is based on converting the network from a feedback system to a purely feedforward system by unfolding the network over time. Thus, if the network is to process a signal that is k time steps long, then k copies of the network are created and the feedback connections are modified so that they become feedforward connections from one copy of the network to the next. The network can then be trained as if it were one large feedforward network, with the modified weights being treated as shared weights [27, 14, 25].

Real-Time Recurrent Learning (RTRL)

The real-time recurrent learning (RTRL) algorithm is based on recursively updating the derivatives of the output and the error. These updates are computed using a sequence of calculations at each iteration, and the weights are updated either after each iteration or after the final iteration of the epoch. RTRL is another training algorithm used for recurrent neural networks. However, its major disadvantage is that it requires an extensive amount of computation at each iteration; the algorithm is also very slow, because RTRL has many weights to compute and the training process therefore proceeds more slowly [27, 14, 25].

Summary

Although the steps of the preprocessing stage are given different names by different authors, they all describe the process of preparing the sound signal for feature extraction. MFCC is perhaps the best known and most popular feature, and a combination of features (MFCC, LPC, LPCC, MP, etc.) may be used to implement a robust parametric representation for speaker identification. The multilayer perceptron has also been used in a variety of sound applications and has performed excellently.

Problem statement

There has been a great deal of work on speech recognition, speaker recognition, phoneme recognition and voice recognition. These pattern recognition tasks have been simulated and validated using various models and techniques, including Gaussian mixture models, hidden Markov models and artificial neural networks. However, research on classifying unstructured sound using neural networks is still very necessary, as it would have wide use, particularly in areas such as fault detection in mechanical devices. Such research would help in designing and implementing a robust architecture that could be used for classifying most classes of unstructured sound.

Aim

The aim of this research is to design and implement a robust system that can detect and classify unstructured sound.

Objectives

The objectives of this research include:

To design and implement a robust filtering system

To process the signal before features are extracted

To design and implement a sound feature extraction system

To design and implement a sound classification and identification system

To test and analyze the overall system

Research methodology

Database

The database used in this system will be built from sounds recorded at different places and also obtained from the internet. Sounds of cars, aeroplanes, ships and trucks will be used for this research.

Collecting Samples

A number of samples of the sounds of cars, aeroplanes, ships and trains will be collected from different places such as highways, airports, railway stations and dockyards. These samples will be taken at times when there is the least possible noise. A mobile phone with very good sound resolution will be used and placed close to the sound source, since the proposed system should sit beside or on the sound source. Most of the sounds will be recorded at a sampling frequency of 44 kHz to ensure that the sound has high quality and that all its components appear when it is converted to the frequency domain.

Preprocessing

Preprocessing includes filtering and scaling of the incoming signal in order to reduce noise and other external effects. Filtering the signal before the recognition task is an important step in removing noise, which may be either low-frequency or high-frequency. After filtering, segmentation, pre-emphasis, frame blocking and windowing follow, in order to prepare the signal for feature extraction. This preprocessing stage will be carried out using the Matlab signal processing toolbox.

Feature Extraction

In order to distinguish the target sound source from other sounds (insect sounds, human voices), parameters must be extracted from the sound signal. Feature extraction consists of choosing the features that will most effectively preserve class separability. These extracted features will serve as input to the neural network that will be tasked with the classification process. In order to achieve robust classification and identification, more than one feature extraction technique will be used in combination, namely MFCC and PLP. The computation of both the MFCC and PLP will be carried out using the Matlab signal processing toolbox. The MFCCs will be derived by the following steps [12] (a code sketch of this pipeline follows the list):

1. Take the Fourier transform of (a windowed excerpt of) a signal.

2. Map the powers of the spectrum obtained above onto the Mel scale, using triangular overlapping windows.

3. Take the logs of the powers at each of the Mel frequencies.

4. Take the discrete cosine transform of the list of Mel log powers, as if it were a signal.

5. The MFCCs are derived as the amplitudes of the resulting spectrum.
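Although the computation is planned in Matlab, the five steps above can be sketched in Python for one windowed frame; the filter bank size (26) and number of coefficients (13) are common but assumed choices, and the mel formula of Eqn. 3 is used with its usual constants:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """MFCCs for one windowed frame, following steps 1-5 above."""
    # 1. Fourier transform and power spectrum of the (windowed) frame.
    power = np.abs(np.fft.rfft(frame)) ** 2
    n_fft = len(frame)
    # 2. Triangular, overlapping filters spaced evenly on the mel scale.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    mel_energies = fbank @ power
    # 3. Logs of the powers in each mel band.
    log_energies = np.log(mel_energies + 1e-10)
    # 4. + 5. DCT of the log mel powers; the first n_coeffs amplitudes are the MFCCs.
    return dct(log_energies, type=2, norm='ortho')[:n_coeffs]
```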

The PLP features will be derived by the following steps [5]:

1. Compute power spectral estimate for the windowed speech signal.

2. Integration of the power spectrum within overlapping critical band filter responses.

3. Make a pre-emphasis of the spectrum.

4. Spectral amplitude is compressed.

5. Inverse DFT is performed to obtain autocorrelation coefficients.

6. Perform spectral smoothing.

7. Convert the autoregressive coefficients to cepstral variables.

Classification and Identification

For the sound recognition, a multilayer perceptron trained with backpropagation will be utilized. A three-layer MLP neural network typically comprises one input layer, one hidden layer and one output layer. In the input layer, each neuron corresponds to a feature, while in the output layer, each neuron corresponds to a predefined pattern. The network will be trained by the backpropagation algorithm with an adaptive learning rate. The first and second layers will use a tan-sigmoid transfer function, while the log-sigmoid function will be chosen for the output layer.
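Under the assumption of two tan-sigmoid layers followed by a log-sigmoid output layer, as stated above, a minimal forward pass of the planned classifier might look like this; the layer sizes and random initialization are placeholders:

```python
import numpy as np

def init_network(n_features, n_hidden, n_classes, seed=0):
    """Random initial weights for the described MLP (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    return {"W1": 0.1 * rng.standard_normal((n_hidden, n_features)),
            "W2": 0.1 * rng.standard_normal((n_hidden, n_hidden)),
            "W3": 0.1 * rng.standard_normal((n_classes, n_hidden))}

def classify(net, features):
    """Forward pass: tan-sigmoid, tan-sigmoid, then log-sigmoid output,
    where each output neuron corresponds to one predefined sound class."""
    h1 = np.tanh(net["W1"] @ features)
    h2 = np.tanh(net["W2"] @ h1)
    return 1.0 / (1.0 + np.exp(-(net["W3"] @ h2)))
```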

Expected outcomes

The expected outcomes of this research include the design and implementation of a robust filter, the design and implementation of a set of preprocessing steps, the design and implementation of a robust sound feature extraction system that will successfully extract features from unstructured sound, the design and implementation of a sound classification and identification system that will successfully classify unstructured sound, and the implementation of the overall system. It is also expected that this research will produce a conference paper and a journal paper.