2.1 Structure of phone and speech recognizers. Any phone/speech recognition system can, in a simplified view, be seen as three main blocks, as in Figure 2.1: feature extraction, acoustic matching (classification) and a decoder.
Figure 2.1: Block diagram of a phone/speech recognizer
Feature extraction is the process of extracting a limited amount of useful information from the speech signal, while discarding redundant and unwanted information. Feature extraction serves several purposes: it reduces the dimensionality of the speech frame, transforms the frame into parameters that are more efficient for classification, suppresses channel and speaker variations, and provides immunity to adverse conditions such as noise.
The acoustic matching block matches parts of the signal with some stored examples of speech units. The decoder finds the best path through the acoustic units (their order), optionally using additional knowledge about the language through language models. Section 2.4 later in this chapter briefly outlines language modeling using statistical techniques.
The language model is usually a phone or word n-gram, as the case may be. Further, speech recognition systems use a lexicon which describes the order in which the phones appear in words. Though language models can in general help enhance the performance of recognition systems, there are cases where language models are not used, for example, language identification using the phonotactic approach [108-110].
2.2 Feature Extraction
The speech signal is divided into overlapping frames. Then, a few parameters are extracted that represent the characteristics of these frames and can efficiently describe the event (phone/subword/word) associated with the frame.
A sequence of features extracted from consecutive frames represents the trajectory of the speech patterns, and the temporal variations along this trajectory are also useful in characterizing the acoustic unit, whether it is a phone, sub-word or word, as the case may be. The trajectory can be imagined to represent the speech generation process, and the rate of variation in the temporal domain is also important in characterizing the sound.
The evolution of feature extraction was inspired by knowledge of speech production and perception. Feature extraction for phone/speech recognition consists of the following steps:
1. Preprocessing - This step is applied optionally on the signal to increase the quality of the recording. Techniques like noise suppression or echo cancellation can be used here. The signal processing commonly used in acoustic decoding of speech is "pre-emphasis" - high-pass filtering which increases the energy of higher frequency components.
2. Segmentation - The acoustic signal is divided into segments/frames which can be regarded as stationary. The typical duration of a segment is 25 ms. To preserve information about the time evolution of the speech signal, segments are taken with some overlap - typically 10 ms. The overlap between adjacent frames also helps avoid abrupt changes in the estimation of the temporal evolution of the signal (a code sketch of pre-emphasis and framing follows this list).
3. Spectrum computation - The short time Fourier power spectrum, or some parameters that contain the spectral information, is computed for each frame.
4. Auditory-like modifications - Modifications inspired by physiological and psychological findings about human perception of loudness and the different sensitivity to different frequencies are performed on the spectrum of each speech frame. Human perception differs across frequency bands, and the higher the perceptual importance of a band, the better the frequency resolution expected for it. Usually, frequency bands are combined to form bins with larger bandwidths (lower resolution) for frequencies that are perceptually less important and smaller bandwidths (higher resolution) for bands that are perceptually more important - usually referred to as critical bands.
5. De-correlation - The de-correlation of the features can help reduce the number of features to be used in the system. Another purpose of this processing step is to enhance the effectiveness of the features.
6. Derivatives - Feature vectors are usually completed by first and second order derivatives of their time trajectories (delta and acceleration coefficients). These coefficients describe the time evolution of the features.
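As a concrete illustration of steps 1 and 2, the following minimal NumPy sketch applies pre-emphasis and cuts the signal into overlapping, Hamming-windowed frames. The 0.97 pre-emphasis coefficient and the helper names are illustrative assumptions made here; the 25 ms frame length and 10 ms shift are the typical values quoted above.

```python
import numpy as np

def preemphasize(signal, coeff=0.97):
    """High-pass filter the signal: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed typical value)."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 200 samples at 8 kHz
    shift_len = int(sample_rate * shift_ms / 1000)   # e.g. 80 samples at 8 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * shift_len : i * shift_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)

# Usage: one second of a placeholder signal sampled at 8 kHz
x = preemphasize(np.random.randn(8000))
frames = frame_signal(x, sample_rate=8000)
print(frames.shape)  # (98, 200)
```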
2.2.1 Mel Frequency Cepstral Coefficients (MFCC)
Individual steps involved in the extraction of MFCC [1] features are illustrated in the block diagram in Figure 2.2 [92]¹. The short-time DFT of the signal is evaluated first, and the resulting spectrum is windowed to emulate the mel filterbank. The log mel filterbank energies are transformed using a DCT for de-correlation and the first few (usually 13) coefficients are selected as MFCC features. The effect of this truncation is to discard the fast variations in the signal spectral envelope. The DCT de-correlates the features so that their covariance is nearly diagonal, suiting HMM-GMM modeling with diagonal covariance matrices.

¹ http://www.icsi.berkeley.edu/~dpwe/respite/multistream/msgmfcc.html

Figure 2.2: Block diagram showing the steps of MFCC computation (short-term DFT, magnitude, mel filter bank, logarithm, and DCT yielding coefficients C1 to C13).
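The following is a minimal NumPy/SciPy sketch of the MFCC pipeline of Figure 2.2, assuming windowed frames such as those produced by the framing sketch above. The 256-point FFT, 23 mel filters and 13 retained coefficients match the sizes suggested by the figure; the helper functions are illustrative, not taken from the thesis.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frames, sample_rate, n_fft=256, n_filters=23, n_ceps=13):
    """frames: (n_frames, frame_len) windowed frames, e.g. from the framing sketch above."""
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # short-term power spectrum
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energies = np.log(power @ fbank.T + 1e-10)           # log mel filterbank energies
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Usage with the frames from the previous sketch
# mfccs = mfcc(frames, sample_rate=8000)   # shape (n_frames, 13)
```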
2.2.2 Critical Band Energies (CBE)
CBE is similar to MFCC discussed earlier, except for the DCT and the subsequent discarding of the last few DCT coefficients. Thus, there is no smoothing of the spectrum, de-correlation or dimensionality reduction in CBE. Though it is quite apparent why MFCC should be the favorite choice for HMM-GMM systems with diagonal covariance matrices, for other architectures like HMM-NN systems the effectiveness of CBE is no less, as will be evident from the experiments reported in Chapter 4.
2.2.3 Linear Predictive Coding (LPC)
In LPC, segments of speech are represented by an autoregressive model [2,3]. LPC models can approximate the spectral envelope of the speech signal; LPC is widely used for low bit rate speech transmission and also in other speech processing applications such as speech synthesis. Usually, a 13th order LPC model is found to be adequate
to represent the speech signal sampled at 8 kHz. However, for acoustic decoding, LPC is not very popular.
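As an illustration, the following sketch estimates a 13th order LPC model for one windowed frame by the autocorrelation method, solving the Toeplitz normal equations directly; the function name and the direct linear solve (rather than the Levinson-Durbin recursion) are assumptions made here for brevity.

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc(frame, order=13):
    """Estimate autoregressive (LPC) coefficients for one windowed speech frame
    by the autocorrelation method: solve R a = r for the predictor coefficients a."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # autocorrelation r[0], r[1], ...
    R = toeplitz(r[:order])                 # (order x order) Toeplitz autocorrelation matrix
    a = np.linalg.solve(R, r[1:order + 1])  # prediction coefficients a_1..a_order
    return a

# Usage: model the spectral envelope of one frame (e.g. from the framing sketch above)
frame = np.hamming(200) * np.random.randn(200)
a = lpc(frame, order=13)
print(a.shape)  # (13,)
```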
2.2.4 Perceptual Linear Prediction (PLP)
It is understood from the previous section on MFCC that certain frequencies are more important than other frequencies for human perception of speech. In [7], Hermansky reviews a class of linear transforms that modify the spectrum of the speech signal prior to its approximation by an autoregressive model (LPC). In PLP, speech is pre-emphasized by an equal loudness curve, and the auditory spectrum is convolved with a simulated critical band masking pattern. Subsequently, the critical band spectrum is re-sampled at 1 Bark intervals, and the resulting spectrum is compressed using a cubic root non-linearity, simulating the intensity loudness power law [7]. A lower order all pole LPC model of such an auditory spectrum is consistent with several acoustic phenomena observed in speech perception. PLP is seen to be very effective for acoustic decoding applications.
2.2.5 Frequency Modulation (FM)
Frequency modulation [4,5] has recently been proposed as a useful feature for spoken language processing applications. Conventionally, amplitude based features have been used as front-ends in these systems. Since these features alone do not seem adequate for spoken language processing, phase based features have received research attention. Original work on the extraction of these features [4,5] was found to be less robust due to their fast variations, and recent work in [6] suggests a smoothed estimation of the FM features using Gabor filters and second order LPC parameters.
The signal is first split into a set of frequency bands using Gabor filters [6] and subsequently the parameters of a two-pole LPC filter are extracted for each band.
2.2.6 Tandem Features
Tandem is a nonlinear data guided feature extraction method [27,31,76,77,96]. A neural network is discriminatively trained using a labeled data-set to estimate posterior probabilities of phones. The distribution of the posteriors is heavily skewed and they are correlated. A nonlinearity is applied to the posterior probability feature vector to make the distribution smooth, and the features are then de-correlated before being used in HMM-GMM systems. Grezl et al. [20-22] proposed an approach for de-correlation and dimensionality reduction using a bottle-necked neural network, showing significant performance improvements over other transformations.
2.2.7 Features looking at longer temporal context
The analysis [8-10] of speech showed that significant information about a phone is spread over a few hundred milliseconds. Phones are not completely separated in time but overlap, due to the fluent transition of the speech production organs from one configuration to another (co-articulation). This suggests that features or models able to capture such a long temporal span are needed in speech recognition. Further support for using such long temporal spans comes from the study of the modulation frequencies of band energies important for speech recognition [11]. The most important frequencies are between 2 and 16 Hz, with a maximum at 4 Hz. The 4 Hz frequency corresponds to a time period of 250 ms, but to capture frequencies of 2 Hz, an interval of half a second is needed.
Delta coefficients
One common technique allowing crossing trajectories to be distinguished is delta features. This technique adds an approximation of the first time derivatives of the basic features (for example MFCCs) to the feature vector. The derivatives represent a rough estimate of the direction of the trajectory in the feature space over time and are estimated as [92-94]:

d_t = \frac{\sum_{i=1}^{N} i \, (c_{t+i} - c_{t-i})}{2 \sum_{i=1}^{N} i^2}    (2.1)

where d_t is the vector of delta coefficients for frame t, computed from the vectors of static coefficients c_{t+i} and c_{t-i}. The usual window length is 5 frames, therefore delta features use a 65 ms long temporal context (4 × 10 ms + 1 × 25 ms).
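A minimal sketch of equation 2.1 is given below. Edge frames are replicated at the utterance boundaries, which is one common convention and an assumption made here; the usage lines show how deltas and delta-deltas are appended to 13-dimensional static features to give a 39-dimensional vector.

```python
import numpy as np

def delta(features, N=2):
    """Delta coefficients per equation 2.1; features is (n_frames, n_dims).
    N=2 gives the usual 5-frame window (2 frames of context on each side)."""
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')  # replicate edge frames
    denom = 2 * sum(i ** 2 for i in range(1, N + 1))
    n_frames = features.shape[0]
    return np.stack([
        sum(i * (padded[t + N + i] - padded[t + N - i]) for i in range(1, N + 1)) / denom
        for t in range(n_frames)
    ])

# Usage: append deltas and delta-deltas to static MFCCs
mfccs = np.random.randn(98, 13)           # placeholder static features
d = delta(mfccs)                          # delta coefficients
dd = delta(d)                             # acceleration coefficients
full = np.hstack([mfccs, d, dd])
print(full.shape)                         # (98, 39)
```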
Delta-delta (acceleration) coefficients

Equation 2.1 can be applied to delta features to derive delta-delta features [92-94]. Delta-delta features introduce an even longer temporal context. If the window also has 5 frames, the temporal context is 9 frames, which is 105 ms (8 × 10 ms + 1 × 25 ms). Delta-delta features can say whether there is a peak or a valley on the investigated part of the trajectory.
Triple-delta coefficients and reduction of dimensionality
Use of triple-delta was seen to be beneficial on some larger databases [95]. These features are attached to the vector of static features, deltas and double-deltas, but the vector is not fed into the classifier directly. Its dimensionality is reduced by a linear transform usually estimated by Linear Discriminant Analysis (LDA) or Heteroscedastic Linear Discriminant Analysis (HLDA) on the training data, or simply by using Principal Component Analysis (PCA). In this work, however, triple deltas are not used due to the moderate size of the database used.
Shifted Delta Cepstra (SDC)
SDC [112] are features widely used in acoustic language identification. These features do not look at the trajectory in one place of the feature space but look at the trajectory from more surrounding places by shifting deltas. This allows even word fractions to be captured directly by the features. This feature is also not used in this work, and is presented here only for completeness.
Blocks of features
Very often, a block of consecutive MFCC or PLP features, appended with delta and double delta coefficients, is used as features. This block can be used directly in a classifier, for example in neural networks, or its dimensionality can be reduced and the features de-correlated by a transform² for GMM.

² described later in this chapter
The TRAPS (TempoRAl PatternS) [29,74] feature extraction described in Section 2.3.4 and used later in this thesis falls in this category too. For TRAPS, features could be any of MFCC, CBE, PLP, or FM, though the original work proposing TRAPS used CBE [74].
2.3 Acoustic Matching
The acoustic matching block assigns scores to acoustic units hypothesized by the decoder. Hidden Markov models (HMMs) [78,83,94,98] are commonly used for this purpose. HMMs introduce an assumption of statistical independence of frames. This implies that the final score (likelihood) of an acoustic unit is given by the product of per-frame likelihoods (or the sum of per-frame log-likelihoods). The per-frame likelihood is modeled by a probability density function, usually by Gaussian mixture models (GMM) [94] or a neural network [23,80].
2.3.1 Hidden Markov Models (HMM)
Hidden Markov models [83] are parametric stochastic models for sequences. They can be used to represent distributions over sequences in which context can be represented by discrete states. An HMM represents a stochastic process generated by an underlying Markov chain composed of a number of states, and a set of observation distributions associated to these states.
The probability of the observation sequence X given the word sequence W and the hidden Markov model M (as acoustic model) can be written as [83]:
p(X|W,M) = \sum_{S} p(X,S|W,M) = \sum_{S} p(X|S,W,M) \, p(S|W,M)    (2.2)

where S is a state sequence and \sum_{S} denotes the summation over all possible state sequences S in W that could have produced X. Assuming statistical independence of the observations X = (x_1, x_2, ..., x_T), p(X|S,W,M) is approximated as:
p(X|S,W,M) = p(x_1, x_2, ..., x_T | S,W,M) \approx \prod_{t=1}^{T} p(x_t | s_t)    (2.3)
where s_t is the state at time t. The second term, p(S|W), is resolved based on the assumption that the current state s_t depends only on the previous state s_{t-1} (first order Markov assumption):

p(S|W) = p(s_1, s_2, ..., s_T | W) \approx p(s_1) \prod_{t=2}^{T} P(s_t | s_{t-1})    (2.4)
Based on equations 2.3 and 2.4, the likelihood p(X|W,M) can be rewritten as [83]:

p(X|W,M) = \sum_{S} p(s_1) \, p(x_1|s_1) \prod_{t=2}^{T} p(s_t|s_{t-1}) \, p(x_t|s_t)    (2.5)
In practice, equation 2.5 is evaluated and maximized using a computationally efficient algorithm, the forward-backward algorithm [3]. It may also be approximated by computing the likelihood of the best state sequence using the Viterbi algorithm [81,82] as follows:
p(X|W,M) = \max_{S} \; p(s_1) \, p(x_1|s_1) \prod_{t=2}^{T} p(s_t|s_{t-1}) \, p(x_t|s_t)    (2.6)
In HMM models, every state is parameterized in terms of two probability distributions, namely the state transition probability a_{ij}:

a_{ij} = p(s_t = j | s_{t-1} = i)    (2.7)

and the emission probability density function:

b_j(x_t) = p(x_t | s_t = j)    (2.8)
Typically, HMMs for speech recognition are built for sub-word units such as phones [94]. The sub-word unit HMMs are connected together to form word HMMs. The lexicon of a speech recognition system contains the transcription of words in terms of sub-word units. Although it is theoretically possible to build acoustic models directly for words, in practice it is difficult to have sufficient training samples (realizations of each word) in a large vocabulary system. Therefore, the practical solution is to train phone models, and connect these models based on the lexicon to create word models.

There are two types of phone models: context-independent (CI) and context-dependent (CD) phone models. In context-independent modeling, each phone model is trained independently of the preceding and succeeding phones; the context of the phone is not considered. Speech has a significant amount of co-articulation, and the neighboring phones, before and after every phone, affect the acoustic characteristics of that phone. This means that the same phone will have different acoustic characteristics when it appears in different contexts. CI modeling therefore cannot take co-articulation effects into account. In order to model co-articulation, context-dependent (CD) phone models are used [94]. Context-dependent models are created from the current CI phone model, typically with one preceding and one succeeding context phone (triphones).

The number of CD phone models is much larger than the number of CI phone models. This may result in insufficient data for training CD models, and new triphones may be encountered that were not seen during training. In order to overcome this problem, a parameter tying technique using data driven phonetic decision tree clustering [97] is used. First, the phones are grouped into phonetically similar clusters, like vowels, consonants and nasals. Each question relates to a phonetic context to the immediate left or right of the current phone. One tree is constructed for each state of each phone to cluster the corresponding states of all associated triphones. The states in each subset are tied to form a single state. The question set and tree topology are chosen to maximize the likelihood of the training data while ensuring there is sufficient data to train a mixture Gaussian distribution for each tied state [98]. Once all such trees have been constructed, an unseen triphone can be synthesized by traversing the phone trees to find the local nodes corresponding to
the unseen triphone context, and using the associated tied states to construct the new triphone.
However, for phone recognition systems CI (monophone) models are preferred for computational considerations.
Decoding
To decode the spoken utterance into a sequence of phones/words, Viterbi [81] decoding as implemented in HTK [94] is used. In Viterbi decoding, we search for the best state sequence among all possible state sequences (with or without using the language models), and the state sequence with the best likelihood is decoded into the sequence of phones/words represented by the utterance.
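For illustration, the sketch below implements log-domain Viterbi decoding over precomputed per-frame state log-likelihoods (equation 2.6). It is a toy illustration of the algorithm, not the HTK implementation referenced above; the state scores and transition matrix in the usage lines are random placeholders.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """Best state sequence per equation 2.6, computed in the log domain.
    log_emis: (T, S) log b_s(x_t); log_trans: (S, S) log a_ij; log_init: (S,) log p(s_1)."""
    T, S = log_emis.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emis[0]
    for t in range(1, T):
        # candidate scores for moving from every state i (rows) to every state j (columns)
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_emis[t]
    # trace back the best path
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))

# Usage with a toy 3-state model and 10 frames of random scores
rng = np.random.default_rng(0)
emis = np.log(rng.dirichlet(np.ones(3), size=10))   # placeholder per-frame state scores
trans = np.log(rng.dirichlet(np.ones(3), size=3))   # rows: transitions out of each state
init = np.log(np.ones(3) / 3)
states, best = viterbi(emis, trans, init)
print(states, best)
```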
2.3.2 Gaussian Mixture Models (GMM)
The explicit modeling of data allows for simple training based on mathematically well founded analytic formulas [83,94] and is very well understood. There are several approaches to train GMMs, the simplest being the maximum likelihood (ML) re-estimation criterion, which iteratively maximizes the likelihood of the data given the model whose parameters are re-estimated. While the simplicity of the ML approach is well appreciated, its inability to discriminate the model likelihood from competing hypotheses is a disadvantage. To address this problem, there are also discriminative techniques [12,13] such as the maximum mutual information (MMI) and minimum phone error (MPE) criteria and other similar strategies. Discriminative training tries to maximize the likelihood of the model corresponding to the data and to minimize the likelihood of the competing hypotheses on the training data. The ML, MMI and MPE formulas use accumulation of statistics, which allows the training to be easily parallelized.
Another advantage of GMM based acoustic modeling is the ease of adaptation [79] to different speakers/genders from speaker/gender independent models. On the other hand, the explicit modeling of probability distributions of the data needs more parameters in the model. The recognition phase is therefore slower in comparison to using neural networks (NN), where a single neural network is used to evaluate the
probabilities. GMMs need to estimate covariance matrices during training. The number of parameters in the covariance matrix (and therefore the amount of training data needed) grows quadratically with the feature vector dimension. The common approach is to use diagonal covariance matrices to minimize the amount of training data required; the model is simpler and faster to evaluate in this case, but the input features must be de-correlated for optimum performance. Some features, like MFCC, meet the nearly diagonal covariance matrix requirement, which makes MFCC a favorite choice for GMM based acoustic modeling.
The formula for computing b_j(x_t) is then

b_j(x_t) = \sum_{i=1}^{M} c_i \, N(x_t; \mu_{ji}, S_{ji})    (2.9)

where M is the number of mixture components, c_i is the weight of the i-th component and N(x; \mu, S) is the output value of a multivariate Gaussian with mean vector \mu and covariance matrix S, that is

N(x; \mu, S) = \frac{1}{\sqrt{(2\pi)^n |S|}} \, e^{-\frac{1}{2} (x-\mu)^T S^{-1} (x-\mu)}    (2.10)
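A minimal sketch of evaluating b_j(x_t) for the diagonal-covariance case discussed in this section is shown below, computed in the log domain for numerical stability; the mixture size and feature dimension in the usage lines are illustrative.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log b_j(x) for one state modelled by a diagonal-covariance GMM (equations 2.9-2.10).
    x: (n,); weights: (M,); means, variances: (M, n)."""
    n = x.shape[0]
    # log N(x; mu_i, diag(var_i)) for every mixture component i
    log_norm = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # log-sum-exp over mixture components
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Usage: a 4-component mixture over 13-dimensional features (illustrative sizes)
rng = np.random.default_rng(1)
w = np.array([0.4, 0.3, 0.2, 0.1])
mu = rng.normal(size=(4, 13))
var = np.ones((4, 13))
print(gmm_log_likelihood(rng.normal(size=13), w, mu, var))
```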
Training
In this work, all models are first trained using the ML criterion and then trained using the MMI criterion for several iterations at the final stage. It was seen that using MMI at the final stage enhanced the performance of the recognizer.
Under the ML criterion, the likelihood F_ML(T) [12,13] is maximized:

F_ML(T) = \sum_{j=1}^{J} \sum_{k=1}^{N_j} \ln p_T(x_{jk} | \omega_j)    (2.11)

and in MMI, the objective function F_MMI(T) [12,13] is to maximize the posterior probability of all training segments being correctly recognized:

F_MMI(T) = \sum_{j=1}^{J} \sum_{k=1}^{N_j} \ln P_T(\omega_j | x_{jk}) = \sum_{j=1}^{J} \sum_{k=1}^{N_j} \ln \frac{p_T(x_{jk} | \omega_j) \, P(\omega_j)}{\sum_{l=1}^{J} p_T(x_{jk} | \omega_l) \, P(\omega_l)}
where j represents the model index, and k represents the sample index.
2.3.3 Neural Networks (NN)
A neural network is a discriminatively trained classifier that separates classes by hyperplanes. Therefore, parameters are not wasted on regions of the feature space that cannot affect the classification. This makes the classifier small and simple; it can run very fast and can therefore easily be ported to low-end devices. Neural networks can process large dimensional feature vectors more easily than GMMs. They can also process correlated features and have the ability to de-correlate features.
One of the simplest neural network structures is the multilayer perceptron, which has been widely adopted for speech recognition [23]. It is a three layer neural network - the first layer copies the inputs, the second (hidden) layer has sigmoidal nonlinearities and the final (third) layer in HMM-NN uses the SoftMax nonlinearity. This final nonlinearity ensures that all output values sum to one so that they can be interpreted as probabilities. The network is trained to optimize the cross-entropy criterion. Such networks were adopted for the NN implementation in this thesis.
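The forward pass of such a network can be sketched as follows; the layer sizes in the usage lines (39 inputs, 500 hidden units, 40 phone classes) are illustrative assumptions, and in practice the weights would come from cross-entropy training rather than random initialization.

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Forward pass of the three-layer MLP described above:
    sigmoid hidden layer followed by a SoftMax output layer of phone posteriors."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))      # sigmoidal hidden activations
    z = h @ W2 + b2                               # output pre-activations
    z = z - np.max(z, axis=-1, keepdims=True)     # stabilized SoftMax
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)  # each row sums to one

# Usage with placeholder weights
rng = np.random.default_rng(2)
W1, b1 = rng.normal(scale=0.1, size=(39, 500)), np.zeros(500)
W2, b2 = rng.normal(scale=0.1, size=(500, 40)), np.zeros(40)
post = mlp_posteriors(rng.normal(size=(10, 39)), W1, b1, W2, b2)
print(post.shape, post.sum(axis=1))  # (10, 40), rows sum to 1
```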
Another kind of NN is the Recurrent Neural Network (RNN) [24]. The recurrent neural network has just two layers - an input layer, which copies the input feature vector, and an output layer. The output layer has not only neurons that represent outputs, but also some neurons that represent hidden states. These states are fed back to the input with a time shift of one frame. This allows, in theory, an infinitely long time context (to the past) to be modeled and better results than MLPs to be reached. Although the RNN can model only one context, some works model both contexts independently and merge the outputs [25].
2.3.4 Capturing phonetic context for acoustic modeling
Learning phonetic context using a wider temporal context than the one obtained with delta and acceleration coefficients is the focus of TRAPS system development. The details of a TRAPS system are shown in Figure 2.3.
Critical band energies are obtained in the conventional way using an FFT followed by windowing [94]. The speech signal is divided into 25 ms long frames with a 10 ms shift. The mel filter-bank is emulated by triangular weighting of the FFT-derived short-term spectrum to obtain short-term critical-band logarithmic spectral densities. A TRAPS feature vector describes a segment of the temporal evolution of such critical band spectral densities within a single critical band. The usual size of a TRAPS feature vector is 101 points [26]. The central point is the actual frame and there are 50 frames in the past and 50 in the future, resulting in a 1 second long time context. Mean and variance normalization can be applied to such a temporal vector. Finally, the vector is weighted by a Hamming window. This vector forms the input to a neural network classifier. The outputs of these band classifiers are posterior probabilities of the sub-word (phone or state) classes which we want to distinguish. Such a classifier is applied to each critical band. The outputs of all the band classifiers are merged using another neural network to generate the posterior probabilities for the current frame.
The multilayer perceptron is one possibility for acoustic matching, but several works (for example [26] and [28]) found that more complicated neural network structures can be beneficial for speech recognition.
Figure 2.3: TRAPS system [85]: 101-point temporal vectors taken from individual critical bands are normalized and passed to per-band posterior probability estimators, whose outputs are merged to produce the final phone posterior probabilities.
Pratibha Jain [30] showed that the information coming from one critical band is not enough for the band classifier and extended the input temporal vectors of the band classifier with vectors from neighboring bands (one from each side). The error rate of the TRAPS system was significantly reduced in this work. This means that instead of using a separate neural network for each critical band, several critical bands are used as input to one neural network.
Chen and Zhu [28,29,103,104] supposed that the mapping to posteriors by the band classifiers is useless and that all the valuable information for the merger is already extracted by the first layer. Therefore the final layer was removed from all the band classifiers after the networks were trained, and they called the new formulation Hidden Activation TRAPS (HATS).
The Tonotopic Multi-layered Perceptron (TMLP) [28] has exactly the same structure as Hidden Activation TRAPS. The difference is in the training: in the case of TMLP, a large (composite) neural network is built and trained while optimizing a single criterion function. It is de facto a four layer network with some constraints applied to the neurons in the second layer (the first layer of neurons). The authors showed an improvement over the conventional TRAPS system but worse results than HATS.
Schwarz et al. [85,93] studied the amount of data needed to train long temporal context based systems and proposed the split temporal context (STC) architecture in an effort to optimize performance for small amounts of training data. STC introduces the assumption that two temporal parts of a phone can be processed independently. In STC, the feature extractors are split into left and right contexts. The left context (LC) captures the temporal spectral variations from the left up to the current frame, while the right context (RC) captures the variations to the right of the current frame, with the current frame shared between the LC and RC features. Subsequently, the phone state posteriors from the LC and RC neural networks are merged [85] using another neural network to generate the phone/state posteriors of the classifier. In this implementation, critical bands are not treated separately. The details of the STC HMM-NN phone recognition system are shown in Figure 2.4.
It is well known that task specific knowledge (temporal phonetic context, for example) can be used to enhance the performance of phone/speech recognition systems. Usually, this is achieved by using phone/word n-grams and a lexicon that describes the order in which the phones appear in words. It would be interesting, however, to explore whether such sources of knowledge could be taken into account to estimate/derive features. Recently, there have been some studies [87-89] with the goal of integrating context and prior knowledge into the posterior estimation. In these studies,
Figure 2.4: Block diagram of the Split Temporal Context system [93]: mel-bank energies are turned into per-band temporal vectors, windowed and DCT-transformed separately for the left and right context parts, mean and variance normalized across the training data set, classified by left- and right-context classifiers, combined by a merger classifier and passed to a Viterbi decoder (e.g. producing "sil ax m h ax sil").
different methods are proposed for estimating the posterior probability of a word hypothesis given all acoustic observations of the utterance. These posteriors are estimated on word graphs or HMMs by the forward-backward algorithm and used for word confidence measurement. However, these studies were restricted to confidence measurement.
Ketabdar et al. [90,91] studied a principled framework for enhancing the estimation of local posteriors (particularly phone posteriors) by integrating long acoustic context, as well as phonetic and lexical knowledge. Two approaches were explored in their work. The first approach uses an HMM to integrate the prior phonetic and lexical knowledge; this knowledge is encoded in the topology of the HMM. The second approach uses a secondary neural network to post-process a temporal context of regular phone posteriors and learn long term intra- and inter-
dependencies between regular phone posteriors estimated initially by the first neural network. These long term dependencies represent phonetic knowledge. The learned phonetic knowledge is integrated into the phone posterior estimation during the inference (forward pass) of the second neural network, resulting in enhanced posteriors. It was observed that the enhanced posteriors have lower entropy than the original posteriors.
Since posterior vector sequences, also referred to as "posteriograms", generated using a neural network converge to a sequence of binary vectors, they are more effective for learning longer term information and act as a filter, smoothing out evidence that does not match the learned phonetic and lexical knowledge [90]. This, in a way, implies that posterior features are more suitable for learning the phonetic and lexical knowledge than using long term contexts at the input of the first neural network deriving the posterior features.
The hierarchical tandem architecture presented in [86] hierarchically combines the phone posteriors with the original features used for deriving them to obtain new temporal features, and reports performance improvements. The improvement in this case may be understood in the context of the work in [90]: the posterior features help capture the phonetic context more efficiently in the features.
Adding task specific knowledge into the features thus helps enhance the performance of phone/speech recognition systems.
2.4 Language Modeling (LM)
The goal of a language model [32] is to estimate the probability of a symbol sequence, \hat{P}(w_1, w_2, ..., w_m), which can be decomposed as a product of conditional probabilities:

\hat{P}(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} \hat{P}(w_i | w_1, ..., w_{i-1})    (2.12)
Limiting the context in equation 2.12 results in:
\hat{P}(w_1, w_2, ..., w_m) \simeq \prod_{i=1}^{m} \hat{P}(w_i | w_{i-n+1}, ..., w_{i-1})    (2.13)
for n ≥ 1. Values of n in the range of 0 to 4 inclusive are typically used, and there are also practical issues of storage space for these estimates to consider. Context lengths of 0, 1 and 2 are referred to as unigram, bigram and trigram respectively. The symbols w are either phones or words, and the models are referred to as phone n-grams and word n-grams respectively.
Estimates of probabilities in n-gram models are commonly based on maximum likelihood estimates - that is, by counting events in context on some given training text:
\hat{P}(w_i | w_{i-n+1}, ..., w_{i-1}) = \frac{C(w_{i-n+1}, ..., w_i)}{C(w_{i-n+1}, ..., w_{i-1})}    (2.14)
where C(.) is the count of a given word sequence in the training text. Data sparsity is one major problem associated with the estimation of n-grams
and smoothing and interpolation techniques are usually used [3,32] to obtain meaningful estimates of higher order n-grams. There are also class based n-grams, where words having grammatically or semantically similar behavior are grouped together. This grouping can be done using rules or in a data driven manner. The details of n-gram estimation techniques are beyond the scope of this work and can be found in [3,32].
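As a small illustration of the maximum likelihood estimate in equation 2.14, the sketch below counts bigrams over a toy phone string (no smoothing is applied, so the data sparsity issue mentioned above is left unaddressed).

```python
from collections import Counter

def bigram_ml(training_phones):
    """Maximum likelihood bigram estimates (equation 2.14) from a phone sequence."""
    unigrams = Counter(training_phones[:-1])                              # C(w1)
    bigrams = Counter(zip(training_phones[:-1], training_phones[1:]))     # C(w1 w2)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Usage on a toy phone string
phones = "sil ax m h ax sil".split()
probs = bigram_ml(phones)
print(probs[("ax", "m")])  # C(ax m) / C(ax) = 1 / 2 = 0.5
```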
Another view of statistical language modeling is grounded in information theory [34,35]. Language is considered as an information source L which emits a sequence of symbols w_i (phones or words as the case may be) from a finite alphabet (vocabulary). The distribution of the next symbol is highly dependent on the identity of the previous ones. The information source has a certain entropy H, which is the average amount of non-redundant information conveyed per symbol by L. According to Shannon's theorem [36], any encoding should use at least H bits per symbol, on average. The quality of the LM can be judged by its cross entropy with regard to the distribution P_T(x) of some hitherto unseen text T:

H(P_T, P_M) = -\sum_{x} P_T(x) \cdot \log P_M(x)    (2.15)
A good measure of the expected benefit provided by A_0 in predicting B is the average mutual information between the two [34,35]:

I(A_0; B) = P(A_0, B) \log \frac{P(B|A_0)}{P(B)} + P(A_0, \bar{B}) \log \frac{P(\bar{B}|A_0)}{P(\bar{B})} + P(\bar{A}_0, B) \log \frac{P(B|\bar{A}_0)}{P(B)} + P(\bar{A}_0, \bar{B}) \log \frac{P(\bar{B}|\bar{A}_0)}{P(\bar{B})}    (2.16)
[37] uses a variant of equation 2.16 to automatically identify co-locational constraints.
In Chapter 6, equation 2.16 is used to compute the mutual information between a phone and a language.
2.5 Transforms for de-correlation and dimensionality reduction of features
In many pattern classification applications, a large number of features can hurt the performance of classifiers, especially for HMM-GMM systems. Given a set of N features, it is preferable that they be de-correlated, so that no feature can be used to predict another. Predictability means redundancy, which needs to be removed for optimum performance since it comes at the expense of increased dimensionality for no benefit. Further, de-correlated features are found to be advantageous when modeling with HMM-GMM with diagonal covariance matrices. In this section, a few transforms, linear and non-linear, useful for speech processing applications are studied.
2.5.1 Principal Component Analysis (PCA)
This technique finds new variables (in decreasing order of importance) that are linear combinations of the original variables and that are uncorrelated. The eigenvalues obtained from PCA indicate the variability in each dimension. This makes it possible to keep as many bases as necessary to retain a certain amount of the variability in the input features, so that the features can be reconstructed as precisely as possible. However, it should be understood that the dimension that has the maximum variance need not
be the one contributing the most to the classification task. PCA may be considered a blind transformation that does not consider the effect of each of the dimensions on the classification task. PCA is an unsupervised method based on a correlation or covariance matrix:

y = A^T x    (2.17)

The transformation matrix A is orthogonal and maximizes the variances of the individual components of y. A can be derived from the total covariance matrix S_T of the measurements x:

S_T = A S A^T    (2.18)

where:

S_T = E[x x^T] - E[x] E[x]^T    (2.19)

and

S = diag(s_1, s_2, ..., s_p)    (2.20)

is a matrix whose diagonal consists of the eigenvalues of S_T, and the columns of A are the eigenvectors of S_T. The most common application of PCA in speech is to de-correlate the features x and to reduce dimensionality. It has been shown that if x is generated by a first order Markov process, then the principal components are cosine bases [16]. Cosine bases are used extensively to de-correlate the logarithm of the mel filter bank energy values when computing MFCC (Section 2.2.1).
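A minimal NumPy sketch of the PCA transform described by equations 2.17-2.20 is given below; the function name and the choice of how many bases to keep are illustrative.

```python
import numpy as np

def pca_transform(X, keep):
    """Estimate the PCA basis from data X (n_samples, n_dims) and project onto
    the 'keep' leading eigenvectors of the total covariance matrix (equations 2.17-2.20)."""
    Xc = X - X.mean(axis=0)
    S_T = np.cov(Xc, rowvar=False)                 # total covariance of the measurements
    eigvals, eigvecs = np.linalg.eigh(S_T)         # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    A = eigvecs[:, order[:keep]]                   # columns = leading eigenvectors
    return Xc @ A, eigvals[order]                  # de-correlated, reduced features

# Usage: reduce 39-dimensional features to 20 de-correlated dimensions
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 39)) @ rng.normal(size=(39, 39))  # correlated placeholder data
Y, variances = pca_transform(X, keep=20)
print(Y.shape)  # (1000, 20)
```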
2.5.2 Linear Discriminant Analysis (LDA)
LDA is a supervised method to find a linear combination of the variables that maximizes the linear separability of the classes. We seek the direction along which the classes are best separated in some sense.
In addition to considering the properties of the input during estimation of the transform, this technique also takes into account the distributions of the classes (states, phones, ...). LDA derives a linear transform whose bases are sorted by their importance for discrimination among classes; it maximizes the ratio of across-class variance to within-class variance.

The transformation assumes that the data is normally distributed. In the transformed space, the dimensions are ordered in terms of their importance for discrimination. The within-class covariance represents how much the samples within a class vary. The between-class covariance matrix is the covariance of the class conditional means; it gives a measure of the separability of the class conditional means and hence of the overlap between classes in the feature space.
A widely used criterion for class separability is defined by:

F = S_w^{-1} S_b    (2.21)

where S_w and S_b are the within-class covariance matrix and the between-class covariance matrix respectively.
The assumptions that the features belonging to each particular class obey a Gaussian distribution and that all the classes share the same covariance matrix considerably limit the optimality of LDA.
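For illustration, the following sketch estimates an LDA projection by taking the leading eigenvectors of S_w^{-1} S_b (equation 2.21); the scatter-matrix formulation and the toy three-class data in the usage lines are assumptions made for the example.

```python
import numpy as np

def lda_transform(X, labels, keep):
    """Estimate an LDA projection from the leading eigenvectors of Sw^{-1} Sb (equation 2.21).
    X: (n_samples, n_dims); labels: (n_samples,)."""
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    n_dims = X.shape[1]
    Sw = np.zeros((n_dims, n_dims))
    Sb = np.zeros((n_dims, n_dims))
    for c in classes:
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)       # within-class scatter
        diff = (mean_c - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)         # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    A = eigvecs[:, order[:keep]].real               # most discriminative directions
    return X @ A

# Usage: project 39-dimensional features from 3 classes onto 2 discriminative dimensions
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=i, size=(200, 39)) for i in range(3)])
labels = np.repeat(np.arange(3), 200)
print(lda_transform(X, labels, keep=2).shape)  # (600, 2)
```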
2.5.3 Heteroscedastic linear discriminant analysis (HLDA)
HLDA, which was first proposed by N. Kumar [17,18], can be viewed as a generalization of LDA. HLDA again assumes that classes obey multivariate Gaussian distributions, but the assumption of the same covariance matrix shared by all classes is relaxed. HLDA assumes that the n-dimensional original feature space can be split into two statistically independent subspaces, with p useful dimensions containing discriminative information, while the remaining (n - p) dimensions are nuisance dimensions, with overlapping distributions of classes. The goal of HLDA is to find the n × p transformation matrix A for the n-dimensional feature vectors.
HLDA derives projections that best de-correlate the features associated with each particular class (maximum likelihood linear transformation for diagonal covariance modeling [17-19,92]). To perform de-correlation and dimensionality reduction, n-dimensional feature vectors are projected onto the first p < n rows,
Figure 2.5: Illustrating the effect of PCA and LDA/HLDA for dimensionality reduction; PCA cannot discriminate the useful dimension from the nuisance dimension.
a_{k=1...p}, of the n × n HLDA transformation matrix A. The original formulation of HLDA [17,18] uses a gradient based algorithm to solve for the A matrix, making it computationally very intensive, while Burget uses an efficient iterative algorithm [19] in [92] to estimate the matrix A, where individual rows are periodically re-estimated using the following formula:

\hat{a}_k = c_k {G^{(k)}}^{-1} \sqrt{\frac{T}{c_k {G^{(k)}}^{-1} c_k^T}}    (2.22)

where c_k is the k-th row vector of the co-factor matrix C = |A| A^{-1} for the current estimate of A and

G^{(k)} = \begin{cases} \sum_{j=1}^{J} \frac{\gamma_j}{a_k \hat{S}^{(j)} a_k^T} \, \hat{S}^{(j)} & k \le p \\ \frac{T}{a_k \hat{S} a_k^T} \, \hat{S} & k > p \end{cases}    (2.23)

where \hat{S} and \hat{S}^{(j)} are estimates of the global covariance matrix and of the covariance matrix of the j-th class, \gamma_j is the soft count of training feature vectors belonging to the j-th class, and T is the total number of training feature vectors.
Figure 2.6: Illustrating the difference in performance of LDA and HLDA in transforming the data (panels: original space; LDA transform, y axis ~ discriminative dimension; HLDA transform, x axis ~ discriminative dimension); the data in the original space was generated using MATLAB.
The assignment of feature vector o(t) to class j is given by the value of the occupation probability \gamma_j(t) from the standard GMM training algorithm. The new HLDA projection, A, is then derived using the occupation probabilities and the estimated class covariance matrices \hat{S}^{(j)}.

Figure 2.5 illustrates the difference between the target-insensitive PCA and the target-sensitive LDA/HLDA. Of the two dimensions, one contributes meaningfully to the classification task, while the other is a nuisance dimension. It may be noted that, considering only the variance of the features, PCA cannot distinguish between the two dimensions, while LDA and HLDA can select the dimension that contributes most to class separability and discard the other (nuisance) dimension. Further, Figure 2.6 compares the effectiveness of HLDA and LDA on an example. It is clear from the transformed spaces that HLDA is indeed more effective in transforming the original data to reduce the feature dimension.
2.5.4 Bottle-necked neural network (BNNN)
In this technique, a neural network is trained on the data to classify the target class for each feature vector, with a bottle-neck [20-22], i.e. a hidden layer with a small number of nodes, dictated by the desired final dimension of the feature. Instead of using the output of the network, the output at the bottle-neck layer is tapped before the non-linearity. This technique is widely used for transforming phone/state posteriors to a meaningful size suitable for speech recognition applications using an HMM-GMM implementation.
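A minimal sketch of tapping bottle-neck outputs is shown below; only the forward pass up to the bottle-neck is kept, the layer sizes are illustrative, and in practice the weights would come from training the full network on phone/state targets rather than from random initialization.

```python
import numpy as np

def bottleneck_features(x, W1, b1, Wb, bb):
    """Forward pass up to the bottle-neck layer: the linear outputs of the small
    hidden layer (taken before its non-linearity) are used as the new features."""
    h = np.tanh(x @ W1 + b1)        # first hidden layer
    return h @ Wb + bb              # bottle-neck linear outputs, i.e. the tapped features

# Usage: 117-dimensional input, 500 hidden units, 30-dimensional bottle-neck
# (sizes are illustrative; the classification layers that follow the bottle-neck
#  during training are omitted because only the tapped outputs are used afterwards)
rng = np.random.default_rng(5)
W1, b1 = rng.normal(scale=0.1, size=(117, 500)), np.zeros(500)
Wb, bb = rng.normal(scale=0.1, size=(500, 30)), np.zeros(30)
print(bottleneck_features(rng.normal(size=(10, 117)), W1, b1, Wb, bb).shape)  # (10, 30)
```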
2.6 Combining acoustic and language models for decoding speech
Natural language can be viewed as a stochastic process [34,35]. Every sentence, document, or other contextual unit of text is treated as a random variable with some probability distribution. Given an acoustic signal A, the goal is to find the linguistic hypothesis L that is most likely to have given rise to it. Thus, we seek the L that maximizes P(L|A). Using Bayes' law:

\arg\max_{L} P(L|A) = \arg\max_{L} \frac{P(A|L) \cdot P(L)}{P(A)} = \arg\max_{L} P(A|L) \cdot P(L)    (2.24)
For a given signal A, P(A| L) is estimated by the acoustic matcher, which compares A to its stored models of all speech units. Providing an estimate for P(L) is the responsibility of the language model. When the language model is not used in the decoding, it assumes a uniform distribution for all symbols, meaning that no prior knowledge is available.
2.7 Summary
This chapter first reviewed the features used for acoustic modeling in phone/speech recognition systems, and briefly described the two popularly used approaches for acoustic modeling:
1. hidden Markov model - Gaussian mixture models
2. hidden Markov model - neural networks
The section on language modeling summarizes the language modeling concepts from two perspectives:
1. maximum likelihood
2. minimum cross entropy
Techniques for dimensionality reduction and de-correlation applicable to acoustic modeling were reviewed in the subsequent section. The final section explained how acoustic and language models are combined to enhance performance.