Speech Technologies for Language Learning and Assessment

Published: November 21, 2015 Words: 3377

Engineers in the field of speech research have long hoped to use their technology to help language learners, and language pedagogues have searched for ways in which the spoken language could be better supported by Computer-Assisted Language Technologies (CALT) materials. This article addresses the application of speech technologies in language education. It begins by exploring the main issues in Automatic Speech Recognition (ASR). After a concise review of whole tutoring systems that use ASR to train and assess language learners and to provide them with appropriate feedback on their performance, it examines three such systems in detail: the Spoken English Test (SET), SpeechRater v1.0, and a speech liveliness evaluation program.

Keywords: speech technology; ASR; individual error detection; pronunciation assessment; perception training; prosody detection; resynthesis method; SET; SpeechRater; speech liveliness

1 Introduction

One of the major focuses of language instruction is to enhance learners' ability to communicate, that is, to enhance their oral communication skills. Language assessment should therefore emphasize the competent use of language in spoken communication. Traditionally, Oral Proficiency Interviews (OPIs) have been viewed as assessments well aligned with this goal. However, an intrinsic limitation of OPIs is that the number of tests that can be administered in a given language is constrained by the number of available trained interviewers (Suzuki and Harada, 2004). The first attempts at using technology for oral language assessment tapped into Simulated Oral Proficiency Interviews (SOPIs), which are tape-mediated assessments. The SOPI is based on the American Council on the Teaching of Foreign Languages' (ACTFL) face-to-face oral proficiency interview. In a SOPI, students record their responses to fifteen tasks on either a tape recorder or a computer, and these responses are then scored by human raters. However, problems of reliability, time, cost, and the number of students that can be tested persist with SOPIs just as with OPIs (Malone, 2007). Clearly, there was a need for systems that could automatically administer and score spoken language tests. Over the past decade, advances in speech technology have enabled the development of such systems.

"Such technologies are the products of the cooperation of several fields such as computer science, signal processing, statistics, second language acquisition, cognitive science and linguistics" (Eskenazi, 2009).

This paper aims to review significant developments in the use of spoken language technologies for language learning and assessment. It begins by describing Automatic Speech Recognition (ASR), giving a brief background and reviewing the main issues of the field: the goal of an ASR system, its application to pronunciation feedback and evaluation, perception training, and prosody detection, as well as the different types of error and the different approaches to detecting variation in speech signals. It then reviews some noteworthy whole language training and assessment systems that tap into ASR. Finally, it explores three such systems in detail: the Spoken English Test (SET), SpeechRater v1.0, and a speech liveliness evaluation program.

2 ASR

In the late 1970s Destombes developed software that could help her deaf daughter learn to speak by displaying pitch and intensity against time. Martony (1968) and Nickerson and Stevens (1972) also contributed to the field by developing automatic speech processing systems, especially for the deaf. Then, in 1988, Flege introduced a system that used visual aids to train speakers to produce correct vowels. Finally, in the early 1980s, speaker-independent automatic speech recognition (ASR), which is now the main language technology used in language learning systems, emerged. Bernstein, Russell, and Franco are amongst the pioneers in developing ASR.

The goal of an ASR system can be seen as transcribing the acoustic signal into a textual representation; this process is mediated by two models, the acoustic model (AM) and the language model (LM), as well as by a pronunciation dictionary. The AM associates probabilities with speech units called phones that represent a given phoneme, while the LM models the prior probabilities of word sequences, called n-grams. For example, a trigram is a sequence of three words for which the probability of the third word occurring in the context of the first and second words is estimated. Finally, a pronunciation dictionary needs to be built in which every word in the recognizer's chosen vocabulary has at least one associated pronunciation in terms of a sequence of phonemes; some words may have a number of alternative pronunciations (Xi et al., 2008).
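To make the division of labour between the language model and the pronunciation dictionary concrete, the following minimal Python sketch estimates trigram probabilities from a toy word sequence and stores alternative pronunciations for a few words. The corpus, vocabulary, and phoneme strings are invented for illustration and are not taken from any of the systems discussed here.

```python
from collections import defaultdict

# Toy training corpus (invented for illustration).
corpus = "the cat sat on the mat the cat ate".split()

# Count trigrams and their bigram histories.
trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2, w3)] += 1
    bigram_counts[(w1, w2)] += 1

def trigram_prob(w1, w2, w3):
    """P(w3 | w1, w2): relative frequency of the trigram given its history."""
    history = bigram_counts.get((w1, w2), 0)
    return trigram_counts[(w1, w2, w3)] / history if history else 0.0

# Pronunciation dictionary: each word maps to one or more phoneme sequences.
lexicon = {
    "the": [["DH", "AH"], ["DH", "IY"]],   # two alternative pronunciations
    "cat": [["K", "AE", "T"]],
    "sat": [["S", "AE", "T"]],
}

print(trigram_prob("the", "cat", "sat"))   # 0.5 in this toy corpus
print(lexicon["the"])
```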

Since human judgment in testing speaking, and more specifically pronunciation, is time consuming and it can be difficult for raters to be consistent, speech recognition is now used for practicing and evaluating second language speech. However, systems using ASR are not perfect at recognizing speech, and two types of error arise. The first, false positives, are pronunciations that are in fact incorrect but that the system accepts as correct. The second, false negatives, are pronunciations that the system flags as incorrect although they are actually correct. Trying to minimize one of these error types tends to maximize the other. Designers nevertheless put more effort into minimizing false negatives because of the greater negative psychological impact they have on learners (Eskenazi, 2009).
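The trade-off between the two error types can be illustrated with a small sketch: sweeping the acceptance threshold of a hypothetical per-phone pronunciation score shows false positives falling while false negatives rise. The scores and labels below are invented, and the thresholding rule is only a simplified stand-in for what a real ASR-based pronunciation system does.

```python
# Hypothetical per-phone pronunciation scores (higher = more native-like)
# paired with gold labels from a human rater (True = correctly pronounced).
# All numbers are invented for illustration.
scored = [(0.92, True), (0.80, True), (0.55, True), (0.60, False),
          (0.40, False), (0.75, False), (0.30, True), (0.85, True)]

def error_rates(threshold):
    """Classify a phone as 'correct' when its score reaches the threshold."""
    fp = sum(1 for s, ok in scored if s >= threshold and not ok)  # accepted but wrong
    fn = sum(1 for s, ok in scored if s < threshold and ok)       # rejected but right
    return fp, fn

for t in (0.3, 0.5, 0.7, 0.9):
    fp, fn = error_rates(t)
    print(f"threshold={t:.1f}  false positives={fp}  false negatives={fn}")
```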

As mentioned above, speech technologies can also be used for pronunciation feedback and evaluation, and they can provide an opportunity for controlled, interactive speaking practice outside the classroom. One such system is the European Community project SPELL for automated assessment and improvement of foreign language pronunciation, which utilizes expert knowledge about systematic pronunciation errors by adult L2 learners in order to diagnose and correct such errors. It can be very effective in diagnosing and correcting known problems of L1 interference, but less effective in detecting rarer, more idiosyncratic pronunciation errors (Ehsani and Knodt, 1998). It is also worth mentioning that, according to a study by Rebecca Hincks (2003), practicing with programs such as Talk to Me from Auralog "is beneficial to those students who began the course with a strong foreign accent but is of limited value for students who began the course with better pronunciation" (2003:3).

There are two main approaches to detecting the variations in the speech signal that are linked to non-native speech: individual error detection and pronunciation assessment. Individual error detection targets individual errors in basic skills, such as the pronunciation of an individual phone, and requires the calculation of a score at a local level (the phoneme for phonetics; the syllable or word for prosodics) for each phone (Eskenazi, 2009). A number of different approaches to the detection of individual errors have been proposed and implemented, including Dynamic Time Warping, template matching, knowledge-based expert systems, neural nets, and Hidden Markov Modeling (HMM). HMM-based modeling applies sophisticated statistical and probabilistic computations to the problem of pattern matching at the sub-word level; it is the most effective method for creating high-quality, speaker-independent recognition engines that can cope with large vocabularies, and the vast majority of today's commercial systems deploy this technique (Ehsani and Knodt, 1998). Pronunciation assessment, on the other hand, refers to determining the overall impression of fluent speech: the speech is judged on its natural flow (use of pauses, rhythm, pitch, etc.) (Eskenazi, 2009).
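Of the individual-error-detection approaches listed above, Dynamic Time Warping is the simplest to show in code. The sketch below computes a classic DTW distance between a learner's feature sequence and a native template; the one-dimensional "feature" tracks are invented, whereas a real system would compare multidimensional spectral features per frame.

```python
import numpy as np

def dtw_distance(learner, template):
    """Classic dynamic time warping distance between two feature sequences."""
    n, m = len(learner), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(learner[i - 1] - template[j - 1])   # local distance
            cost[i, j] = d + min(cost[i - 1, j],        # insertion
                                 cost[i, j - 1],        # deletion
                                 cost[i - 1, j - 1])    # match
    return cost[n, m]

# Toy 1-D "feature" tracks (e.g. one spectral value per frame); invented.
native_template = [1.0, 1.2, 3.0, 3.1, 1.0]
learner_attempt = [1.0, 1.1, 1.2, 2.9, 3.0, 1.1]

print(dtw_distance(learner_attempt, native_template))
```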

One of the other main issues in ASR-based training is perception training. At present, the resynthesis method can be used to modify and improve learners' own voices, or to present their speech in a new voice or in a set of versions for comparison and contrast. The main advantage of this method is that learners perceive corrections better when they hear them in their own voice. ART laboratories produced a commercial language learning product based on this method, and some researchers, such as Hazan et al. (2005), added visual aids, including waveforms and pitch contours, to this kind of product. One of the most recent developments in this regard is talking-head software, which can receive a sound or utterance and, while resynthesizing it, deploy a face to display the position of the articulators. The talking head comes in two versions, with the skin on and with the skin removed, and learners are even able to rotate it and watch the articulation from different perspectives (Eskenazi, 2009).

Prosody detection and correction is another important issue in systems using ASR. Prosody detection can be achieved by comparing a student's speech to that of a native speaker. Sundström (1998) used a speech recognizer to label incoming student speech and then to align it with a teacher's correct pronunciation; after taking pauses into account, the student's speech was modified in duration and F0 and resynthesized using PSOLA. This allows students to practice improving their prosody while hearing a modified version of their own voice. Delmonte (1998, 2000) segmented the incoming speech, aligned it with a native model and its transcription, and gave it a phonetic description. Yamashita et al. (2005) used a multiple regression model to predict proficiency from F0, power, and duration, comparing non-native to native utterances. Notwithstanding these improvements in prosody tutoring, much remains to be done in this field (Eskenazi, 2009).
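A minimal sketch of the kind of prosodic comparison described above might summarize each utterance by its mean F0, F0 spread, and duration, and report the learner's relative deviation from a native model. This is not the method of Sundström, Delmonte, or Yamashita et al.; the F0 tracks and durations are invented for illustration.

```python
import statistics

def prosody_summary(f0_track_hz, duration_s):
    """Crude prosodic profile of one utterance: mean F0, F0 spread, duration."""
    voiced = [f for f in f0_track_hz if f > 0]          # ignore unvoiced frames
    return {
        "mean_f0": statistics.mean(voiced),
        "sd_f0": statistics.stdev(voiced),
        "duration": duration_s,
    }

def compare(student, native):
    """Relative differences a prosody tutor might feed back to the learner."""
    return {k: (student[k] - native[k]) / native[k] for k in student}

# Invented F0 tracks (Hz, 0 = unvoiced frame) and durations for illustration.
native = prosody_summary([220, 230, 0, 250, 210, 0, 190], duration_s=1.4)
student = prosody_summary([200, 205, 0, 208, 202, 0, 199], duration_s=2.1)

print(compare(student, native))   # e.g. flatter pitch, 50% longer utterance
```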

3 An overview of complete tutoring and assessment systems

At present there are systems that can conduct a fairly natural dialogue with the student. In recent versions of these dialogue systems, the system speaks a prompt to the student, who responds freely; the student's response is then matched to the closest of a set of utterances that the system expects to hear at that point in the dialogue, and each correct response leads the dialogue along a different path. Raux and Eskenazi (2004) added a new strategy to such systems: the system takes the closest correct utterance and sends it to a synthesizer with markings showing where it differed from the student's utterance. It thus acts like a human who corrects someone by emphasizing the correction.

A whole tutoring system must present a curriculum of the skills the student is expected to learn, assess progress, model the student, and furnish progress reports (Eskenazi, 2009). Hiller et al. (1993), Rypa and Price (1999), Franco and colleagues (2000), LaRoca et al. (200), and Raux and Kawahara (2002) are amongst the best-known developers of whole tutoring systems.

Video games, on the other hand, because of their competitive nature and the intrinsic motivation they create, have recently been of great interest for language learners, especially children. Johnson and his colleagues (2004) and Mote et al. (2004) created games designed to teach language and culture. With the first application aimed at American soldiers going to Iraq, Johnson developed a series of scenarios in which the student plays a role, communicating with avatars; the graphics were later reused in games to teach vocabulary and other linguistic knowledge, and this work was commercialized in 2008. Chao and colleagues (2007) created a game for learning Chinese through translation, and Wik and colleagues (2007) created the DEAL game, in which an avatar plays roles in order to carry out a dialogue with the student.

There are also a number of commercialized systems using ASR: NativeAccent for teaching pronunciation; Versant for assessing the fluency of non-native speech; Auralog, Saybot, and Rosetta Stone for teaching vocabulary and grammar; Alelo for teaching culture and language, including courses such as "Mission to Iraq"; and Soliloquy for teaching children to read aloud. These are amongst the best-known commercial language tutoring systems that tap into ASR technology (Eskenazi, 2009).

So far, this article has briefly introduced some software that uses ASR technologies. Before examining some of it in depth, a noteworthy question to ask is how such software should be evaluated. Chapelle (2001) suggests six criteria for such an evaluation: language learning potential, learner fit, meaning focus, impact, authenticity, and practicality (Hincks, 2003). For further information on software evaluation, see Chapelle, C. (2001), Computer Applications in Second Language Acquisition.

4 Three whole systems in detail

4.1 Spoken English Test (SET)

One of the most well-known automated spoken tests, developed by Ordinate Corporation, is the Spoken English Test (SET), which was first built on top of a common testing framework. The framework consists of three components: a test delivery system, a computerized scoring system, and a validation process. Seven tasks have been developed within the delivery system to measure facility with the spoken language: Reading, Repeat Sentences, Opposites, Short Answer Questions, Sentence Builds, Open Questions, and Story Telling. Ordinate uses an HMM-based ASR, speech-to-text alignment, and non-linear models to perform automatic scoring. The general approach to validation in the common testing framework highlights three metrics as evidence of a test's quality: high reliability, the ability to show effective separation between samples of native and non-native test takers, and strong correlations with other established measures of oral language proficiency (Suzuki and Harada, 2004).
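Two of the three validation metrics, separation of native from non-native test takers and correlation with established measures, are easy to illustrate. The sketch below computes a Pearson correlation between hypothetical machine and human scores, and a mean gap in standard-deviation units between invented native and non-native score samples; none of the numbers are Ordinate data.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between machine scores and an established human measure."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Invented scores for illustration (not Ordinate data).
machine_scores = [35, 48, 52, 60, 71, 79]
human_scores   = [32, 45, 55, 58, 73, 80]
print(f"correlation with human ratings: {pearson(machine_scores, human_scores):.2f}")

# Separation between native and non-native samples, summarized as a mean gap
# in pooled standard-deviation units (again, invented numbers).
native    = [78, 82, 85, 80, 79]
nonnative = [45, 60, 52, 66, 58]
pooled_sd = statistics.stdev(native + nonnative)
gap = (statistics.mean(native) - statistics.mean(nonnative)) / pooled_sd
print(f"native vs. non-native separation: {gap:.2f} SD")
```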

The test is automatically administered over the telephone and automatically scored by the computerized scoring system. A score report becomes available on Ordinate's website within a few minutes after a test has been completed. The score report consists of one overall score and four subscores: Sentence Mastery, Vocabulary, Fluency, and Pronunciation (Suzuki and Harada, 2004).

These test administration procedures are schematized in Figure 1.

Figure 1. Test administration scheme. Source: Adapted from Suzuki and Harada (2004)

The benefit of this type of test, compared to interviews conducted and scored by humans, is that it can be administered in large volumes and scored rapidly without sacrificing reliability or quality.

4.2 SpeechRater v1.0

SpeechRater v1.0 is an automated scoring system deployed for the Test of English as a Foreign Language (TOEFL) Internet-Based Test (iBT) Speaking Practice Test, which is used by prospective test takers to prepare for the official TOEFL iBT test. The technologies that support the automated evaluation of speaking proficiency are ASR and speech analysis technologies as well as natural language processing tools (Xi et al., 2008). The architecture of an automated speech scoring system is illustrated in Figure 2.

Figure 2. Architecture of the automated speech scoring system: an audio file is passed to the speech recognizer; the recognized words and utterances, together with the input speech signals, feed the feature extraction programs; the resulting scoring features are passed to the scoring model, which produces speaking scores; and the user interface presents a score report and user advisories. Source: Adapted from Xi et al. (2008)

One of the most controversial issues in designing software that can automatically score the speech of non-native speakers is the determination of appropriate scoring features. These features are computed from the output of the recognition engine, which consists of (a) the start and end time of every token, and hence of any silence in between (used for most features); (b) the identity of filler words (for disfluency-related features); and (c) word identity (for content features) (Zechner and Bejar, 2006). In order to determine the most appropriate features, Zechner and Bejar (2006) conducted a study in which they used two different machine learning approaches, Support Vector Machines (SVM) and Classification and Regression Trees (CART), to analyze the features. They concluded that the SVM models are more useful for quantitative analysis, whereas the CART models allow a more transparent summary of the patterns underlying the data.
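As an illustration of how features can be derived from such recognizer output, the sketch below computes a few simplified fluency measures (speaking rate, silence ratio, filler rate) from a list of invented tokens with start and end times. These are stand-ins for the idea only and are not SpeechRater's actual feature set.

```python
# Each token as produced by a recognizer: (word, start_s, end_s, is_filler).
# The tokens below are invented for illustration.
tokens = [("i", 0.00, 0.20, False), ("um", 0.50, 0.80, True),
          ("think", 0.90, 1.30, False), ("that", 1.35, 1.60, False),
          ("uh", 2.20, 2.45, True), ("cities", 2.50, 3.00, False)]

def fluency_features(tokens):
    """Simplified fluency features computed from token timings and identities."""
    total = tokens[-1][2] - tokens[0][1]                       # response duration
    spoken = sum(end - start for _, start, end, _ in tokens)   # time spent speaking
    words = [t for t in tokens if not t[3]]
    return {
        "words_per_second": len(words) / total,
        "silence_ratio": (total - spoken) / total,
        "fillers_per_second": sum(t[3] for t in tokens) / total,
    }

print(fluency_features(tokens))
```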

The scoring rubrics for human grading represent the construct of speaking that is of interest to both the operational TOEFL iBT Speaking test and the TOEFL iBT Speaking practice test; they are illustrated in Figure 3.

Figure 3. TOEFL Scoring Rubrics. Source: Adapted from Xi et al. (2008)

Another issue that should be taken into consideration is the type of task used in automated spoken tests. Some assessments, such as the TOEFL Practice Online Speaking test, focus on spontaneous, high-entropy responses (because the test takers are usually rather proficient), while other tests, such as SET-10, focus mostly on lower-level aspects of language, using tasks such as reading or repetition, where the expected sequence of words is highly predictable. Zechner and Xi (2008) conducted a spoken test with heterogeneous task types, using different features and scoring models for each task. For example, for the opinion task, a high-entropy task type, multiple regression models employing different weights for the features were developed, namely an Equal Weights model, an Expert Weights model, and an Optimal Weights model, while for the picture description task, a medium-high-entropy task type, CART was used to predict the score class to which each response should be assigned. They concluded that task-specific modeling efforts did not seem to be necessary for the two task types investigated (i.e. high-entropy and medium-high-entropy tasks). This does not preclude the possibility, though, that task-specific scoring models are superior for other task types in which the expected content is much more restricted (such as constrained short-answer questions) (Zechner and Xi, 2008).
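The idea behind the three regression-style scoring models can be sketched as weighted sums over a handful of features. The feature names, feature values, and all weights below are invented; in practice the Optimal Weights model's coefficients would be fitted to human scores rather than chosen by hand.

```python
# Hypothetical feature values for one response, already scaled to [0, 1];
# feature names and all weights are invented to illustrate the idea of
# equal-, expert- and optimized-weight linear scoring models.
features = {"fluency": 0.7, "pronunciation": 0.6, "vocabulary": 0.8}

weight_sets = {
    "equal_weights":   {"fluency": 1/3, "pronunciation": 1/3, "vocabulary": 1/3},
    "expert_weights":  {"fluency": 0.5, "pronunciation": 0.3, "vocabulary": 0.2},
    "optimal_weights": {"fluency": 0.42, "pronunciation": 0.25, "vocabulary": 0.33},
}

def linear_score(features, weights, scale=4.0):
    """Weighted sum of features mapped onto a 0-4 speaking score scale."""
    return scale * sum(weights[name] * value for name, value in features.items())

for name, weights in weight_sets.items():
    print(f"{name}: {linear_score(features, weights):.2f}")
```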

Recognizers tend to perform best, with reduced error rates, when they are trained on (or adapted to) the voice characteristics or speaking style of the speaker (Ehsani and Knodt, 1998). For this reason, and because their test includes heterogeneous task types that differ from the TOEFL iBT Practice tasks, Zechner and Xi undertook a series of adaptation and optimization steps with the goal of maximizing word accuracy on the two task types of this speaking test (Test with Heterogeneous Tasks, abbreviated THT). They "first adapted the acoustic model in batch mode with supervised maximum aposteriori (MAP) adaptation using the combined data from both tasks, then the language model, optimized the filler cost parameter and finally conducted unsupervised maximum likelihood linear regression (MLLR) acoustic model adaptation based on individual speakers" (Zechner and Xi, 2008).

4.3 Speech liveliness evaluation program

Another fascinating piece of research involving speech technologies concerns the liveliness of speech and was conducted by Rebecca Hincks (2005). The hypothesis tested is that speakers with high pitch variation as they speak would be perceived as livelier. A metric (termed PVQ), derived from the standard deviation in fundamental frequency, is proposed as a measure of pitch variation. This hypothesis had already been tested and accepted in work by Traunmuller and Eriksson (1995) using the synthesis of a single utterance, but Hincks (2005) tested it on naturally occurring speech. The speech analysis function of the program WaveSurfer was used to process the speech files in this research.
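A minimal sketch of such a pitch-variation measure is given below, computed here as the standard deviation of the voiced F0 values divided by their mean. This normalization is only one plausible reading of a metric "derived from the standard deviation in fundamental frequency"; Hincks's exact definition (windowing, unit of measurement) may differ, and the F0 tracks are invented.

```python
import statistics

def pitch_variation_quotient(f0_track_hz):
    """Normalized pitch variation: SD of voiced F0 divided by its mean.

    One plausible reading of a metric derived from the standard deviation
    in F0; the definition used by Hincks (2005) may differ in detail.
    """
    voiced = [f for f in f0_track_hz if f > 0]      # drop unvoiced frames
    return statistics.stdev(voiced) / statistics.mean(voiced)

# Invented F0 tracks (Hz, 0 = unvoiced frame) for a monotone and a lively speaker.
monotone = [118, 120, 0, 121, 119, 0, 120, 122]
lively   = [110, 150, 0, 190, 120, 0, 220, 95]

print(f"monotone PVQ: {pitch_variation_quotient(monotone):.3f}")
print(f"lively   PVQ: {pitch_variation_quotient(lively):.3f}")
```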

According to manuals on public speaking (e.g. Lamerton, 2001; Grandstaff, 2004), a lively speech is achieved by consciously modifying the three prosodic dimensions of loudness, pitch, and tempo. In this research only pitch and tempo were investigated. This work suggests that a computer could fill the role of a friend or colleague and give automatic, objective, and valuable feedback on speaker prosody; it could also be used as a feedback or evaluation mechanism for public speaking. Since public speaking difficulties are magnified for second language users, such a tool could be of great benefit to them. An appropriate and friendly feedback interface would be an animated face that responds alertly to lively speech but loses attention, perhaps even falling asleep, if the prosody fails to show any characteristics of liveliness. Figure 4 illustrates how an automatic feedback mechanism could consist of two parallel processing operations, one conducted by the recognizer and the other by speech analysis. The level of liveliness could be adapted to the speaking genre: one level would be suitable for an evangelist, while another would clearly be more appropriate to an academic conference presentation. Furthermore, people's perception of what is appropriate and pleasing may be individually and culturally determined to one extent or another (Hincks, 2005).

Figure 4. Schematic design of an automatic feedback mechanism for public speaking. Source: Adapted from Hincks (2005)

In the future a feedback mechanism could also incorporate a camera and software for processing speaker gaze, facial expression and body language.

5 Conclusion

As current technologies have become more accessible, language education has benefited greatly from computer technologies for training and assessment. This article has examined recent developments in language learning and assessment involving systems that use new speech technologies. It has investigated the development and architecture of Automatic Speech Recognition (ASR) engines and then reviewed their pedagogical applications in systems that provide automated language training, automatically score oral proficiency tests, and give valuable feedback on such tests. Systems that automatically score speaking tests are more reliable, efficient, and economical than traditional approaches to language learning and testing. In fact, "systems that incorporate spoken dialogue and game are at the leading edge of the field and they will soon be central, providing not only tutoring, but also test beds for development of new algorithms and strategies" (Eskenazi, 2009).