Emotech: Bridging pedagogy and technology

In this project, I took the principal UX research role to examine an AI-powered multimodal English pronunciation correction software, English class representative, developed by the London-based emotional AI company Emotech. We were primarily interested in a training method that delivers sentence-based drills through three multimodal interfaces, and in whether a computer-animated avatar could benefit stakeholders' pronunciation training compared with the conventional audio-only method.

____________________________________________________________________

Supervisor: Yvonne Rogers
Technical support: Jara Alvarez Masso, Yijun Yu


Related work —

In an effort to improve English as a second language (ESL) learners' pronunciation, many pedagogical methods and perspectives have been examined. Many researchers argue that modern pedagogy is shifting from a teacher-centred education philosophy towards student-centred and digitalised instruction techniques. Advances in technology facilitate the integration of computer-assisted language learning (CALL) software into this pedagogical transition.

Previous studies have extensively examined the effectiveness of multimodal materials in helping L2 learners overcome particular pronunciation obstacles. However, these lab-based studies mostly demonstrate the benefits of multimodal approaches at the syllable and word level, with few examining their effectiveness in more naturalistic and sophisticated settings involving sentence-based training. Even fewer have delivered such training through synthesised speech and a realistic computer-animated talking head. The present research therefore asked whether a realistic computer-animated avatar with synthesised speech would benefit stakeholders in sentence-based drill practice.


Interface —

Figure2.jpg

The software enables real-time synthesis of pronunciation animation materials to instruct participants on sentence-based drills. The pronunciation animation is synchronised with the synthesised speech for the sentence presented below the canvas.

The animation has three variations: an audio-face, an audio-mouth, and an audio-only interface. In the audio-face condition (left), a front view of a photorealistic talking head is rendered according to the presented sentence. In the audio-mouth condition (middle), a skewed view of the avatar's mouth is presented, in which lip and tongue movements act as salient visual elements while the rest of the face is not visible. In the audio-only condition (right), the canvas is filled with a solid colour so that no articulation animation is delivered to participants.

 
Snapshot of the WCMS auto-translating text to video.

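As a rough illustration of the three conditions described above, the sketch below shows how a single drill request might be parameterised. All names (Condition, render_drill, and the field values) are hypothetical assumptions for illustration and are not Emotech's actual API.

    from enum import Enum

    class Condition(Enum):
        AUDIO_FACE = "audio-face"    # front view of the photorealistic talking head
        AUDIO_MOUTH = "audio-mouth"  # skewed close-up of the lips and tongue only
        AUDIO_ONLY = "audio-only"    # solid-colour canvas, no articulation animation

    def render_drill(sentence, condition):
        # Describe what would be synthesised for one sentence-based drill.
        visual = {
            Condition.AUDIO_FACE: "full_face_animation",
            Condition.AUDIO_MOUTH: "mouth_closeup_animation",
            Condition.AUDIO_ONLY: "solid_colour_canvas",
        }[condition]
        return {
            "text": sentence,               # sentence shown below the canvas
            "speech": "synthesised_audio",  # TTS output, time-aligned with the visuals
            "visual": visual,
        }

    print(render_drill("She sells seashells by the seashore.", Condition.AUDIO_MOUTH))

The key design point is that the spoken material is identical across conditions; only the visual channel changes.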

 
1.gif

Step 1. Users first watch the animated articulatory drill and repeat after it.

 
2.gif

Step 2. Their practice is recorded and analysed by the intelligent scoring system, which evaluates phonetic articulation quality and generates a report.

 
3.gif

Step 3. Candidates can review their pronunciation ratings on the feedback page and concentrate on improving the identified weaknesses in later training.
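As a rough illustration of Steps 2 and 3, the sketch below shows how per-phoneme articulation scores could be aggregated into a feedback report. The function names, score ranges, and threshold are illustrative assumptions only; they do not describe the scoring system's real interface.

    from statistics import mean

    def build_report(phoneme_scores, threshold=0.6):
        # Aggregate per-phoneme articulation scores (0-1) into a feedback report.
        weak = sorted(p for p, s in phoneme_scores.items() if s < threshold)
        return {
            "overall": round(mean(phoneme_scores.values()), 2),
            "focus_on": weak,  # phonemes to concentrate on in later training
        }

    # Hypothetical scores one recorded attempt might receive
    attempt = {"ʃ": 0.45, "iː": 0.82, "s": 0.91, "ə": 0.58}
    print(build_report(attempt))   # {'overall': 0.69, 'focus_on': ['ə', 'ʃ']}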

 
 

Methodology —

We recruited 24 recent high school students as research participants and randomly assigned them to one of three interface conditions: audio-face, audio-mouth, and audio-only. A pilot English speaking test was conducted to ensure that performance in each condition would not be biased by participants' prior language experience. We used a between-subjects design to examine the software's effectiveness in correcting Chinese students' pronunciation. During the 10-day intensive training, participants were required to complete 10-20 sentences daily, amounting to 160 sentences over the research period.
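For illustration only, the sketch below mimics this assignment procedure: participants are shuffled evenly into the three conditions and the pilot-test scores are then checked for rough balance across groups. The data, scores, and function names are hypothetical, not the study's actual tooling.

    import random
    from statistics import mean

    CONDITIONS = ["audio-face", "audio-mouth", "audio-only"]

    def assign(participants, seed=42):
        # Shuffle the participant list and deal it evenly across the three conditions.
        pool = list(participants)
        random.Random(seed).shuffle(pool)
        return {c: pool[i::3] for i, c in enumerate(CONDITIONS)}

    # Hypothetical pilot speaking-test scores for the 24 recruited students
    pilot = [{"id": i, "pilot_score": random.randint(40, 90)} for i in range(24)]
    groups = assign(pilot)
    for condition, members in groups.items():
        print(condition, len(members), round(mean(m["pilot_score"] for m in members), 1))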

 

Conclusion —

To summarise the general conclusions drawn from this research: 1️⃣ All three interfaces produced comparable improvements, with the audio-face group showing a modest advantage. Although the multimodal interfaces did not yield statistically significant benefits, the additional visual stimuli could make the synthesised speech sound more accurate to subjects and induce greater improvement for low-proficiency candidates.

2️⃣ Participants praised the system for its reliability and engagement, demonstrating strong pedagogical feasibility.

3️⃣ The results suggest a trade-off between information salience and interface effectiveness in sentence-based drills, reminding researchers to balance the perceptual benefits against the risk of cognitive overload when utilising multimodal materials.