Browsing by Author "Havras, Anna"
- Speech recognition systems and Russian pronunciation variation in the context of VoiceInteraction
Havras, Anna; Moniz, Helena Gorete Silva; Paulo, Sérgio Manuel Gaspar Ferreira
The present thesis describes the work performed during the internship for the master’s degree in Linguistics at VoiceInteraction, an international Artificial Intelligence (AI) company specializing in the development of speech processing technologies. The goal of the internship was to study the phonetic characteristics of the Russian language through four main tasks: description of the phonetic-phonological inventory; validation of transcriptions of broadcast news; validation of a previously created lexicon composed of the ten thousand (10 000) most frequently observed words in a text corpus crawled from Russian reference newspaper websites; and integration of filled pauses into the Automatic Speech Recognizer (ASR). Initially, audio and text broadcast news media were collected from Russian-speaking regions (European Russia, Belarus, and the Caucasus region), featuring different varieties of Russian. The extracted data and the company's existing data were used to train the acoustic, pronunciation, and language models. The audio data was automatically processed in a proprietary platform and then revised by human annotators. Transcriptions produced automatically and reviewed by annotators were analyzed, and the most common errors were extracted to provide feedback to the community of annotators. The validation of transcriptions, along with the annotation of all disfluencies (which had previously been left out), decreased the Word Error Rate (WER) in most cases. In some cases (in European Russian transcriptions), WER increased: the models were not sufficiently effective at identifying the correct words, which is potentially problematic. Audio with overlapped speech, disfluencies, and acoustic events can also affect the WER.
Because the model used to recognize the other varieties of Russian had been trained only on European Russian, WER was high for Belarus and the Caucasus region. The characterization of the Russian phonetic-phonological inventory and the construction of pronunciation rules for internal and external sandhi phenomena were carried out to validate the lexicon: the ten thousand most frequently observed words in a text corpus crawled from Russian reference newspaper websites were revised and modified so that linguistic patterns could be extracted for use in a statistical Grapheme-to-phone (G2P) model. Two evaluations were conducted, one before the modifications to the lexicon and one after. Preliminary results without training the model show no significant difference: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. However, we observed a slight improvement in the recognition of the most frequent words. In the future, we aim to extend the analysis of the lexicon to all 400 000 entries (the total lexicon size), analyze the types of errors produced, decrease the WER, and analyze the acoustic models as well. In this work, we also studied filled pauses, since we believe that research on filled pauses in Russian can improve the recognition system of VoiceInteraction by reducing processing time and increasing quality. Filled pauses are marked in the transcriptions with “%”. In Russian, according to the literature (Ten, 2015; Harlamova, 2008; Bogradonova-Belgarian & Baeva, 2018), these are %a [a], %am [am], %@ [ə], %@m [əm], %e [e], %ɨ [ɨ], %m [m], and %n [n]. In the speech data, two more filled pauses were found, namely %na [na] and %mna [mna], which, as far as we know, have not yet been referenced in the literature. Finally, the work performed during the internship contributed to a European project, Artificial Intelligence and Advanced Data Analysis for Authority Agencies (AIDA).
The main goal of this project is to build a solution capable of automating the processing of the large amounts of data that Law Enforcement Agencies (LEAs) have to analyze in investigations of terrorism and cybercrime, using pioneering machine learning and artificial intelligence methods. VoiceInteraction's main contribution to the project was to apply ASR to Russian religious-related content and validate the resulting transcriptions. All the tasks performed during the thesis were relevant to, and applied within, the scope of the AIDA project. Transcription analysis results from the AIDA project showed a high Out-of-Vocabulary (OOV) rate and a high substitution (SUBS) rate. Since the language model used in this project was adapted for broadcast content, religious-related words were left out. In addition, function words were incorrectly recognized, in most cases because of coarticulation with the previous or the following word.
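The abstract reports WER, substitution (SUBS) rate, and OOV rate without stating how they were computed. The following is a minimal sketch of the conventional definitions, assuming word-level Levenshtein alignment between a reference and a hypothesis transcription. The function names, the toy tokens, and the toy lexicon are illustrative assumptions and are not taken from the thesis or from VoiceInteraction's tooling.

```python
from typing import List, Set, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            dele = (t + 1, s, d + 1, n)      # reference word missing in hypothesis
            t, s, d, n = cost[i][j - 1]
            ins = (t + 1, s, d, n + 1)       # extra word in hypothesis
            cost[i][j] = min(diag, dele, ins)
    _, subs, dels, inss = cost[-1][-1]
    return subs, dels, inss


def wer(ref: List[str], hyp: List[str]) -> float:
    """Word Error Rate: (S + D + I) divided by the number of reference words."""
    subs, dels, inss = align_counts(ref, hyp)
    return (subs + dels + inss) / len(ref)


def oov_rate(ref: List[str], lexicon: Set[str]) -> float:
    """Share of reference tokens that are absent from the lexicon."""
    return sum(1 for w in ref if w not in lexicon) / len(ref)


# Toy data (hypothetical, for illustration only).
reference = "a b c d".split()
hypothesis = "a x c".split()
toy_lexicon = {"a", "c"}

subs, dels, inss = align_counts(reference, hypothesis)
print(subs, dels, inss)                # 1 substitution, 1 deletion, 0 insertions
print(wer(reference, hypothesis))      # 0.5
print(oov_rate(reference, toy_lexicon))  # 0.5

# Difference between the two WER figures quoted in the abstract.
print(round(19.97 - 19.85, 2))         # 0.12
```

Under these definitions the SUBS rate is the substitution count divided by the number of reference words, so a language model that lacks domain vocabulary tends to raise both the OOV rate and the SUBS rate at once.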
