Automatically standardising two multilingual code-switched corpora
In 2018, in collaboration with Christopher Bryant at the Cambridge Computer Lab, I developed a tool to process CanVEC, a corpus of natural code-switched speech that I collected for my PhD. Although multilingualism is now the norm, NLP tools capable of processing more than one language per ‘sentential unit’ are almost non-existent. This limits important applications such as Machine Translation and Information Retrieval, as well as the utility of NLP-based technology in contexts where language users readily employ two or more languages side by side. We developed an algorithm to semi-automatically annotate the corpus with language information and part-of-speech (POS) tags, achieving over 90% accuracy on almost every task.
This work has recently been awarded a grant by the Cambridge Language Sciences Research Incubator Fund to refine the tool further and extend it to Hindi-English. The ultimate goal is to create a standardised code-switching corpus covering multiple language pairs, giving researchers free access to a large body of comparable, consistently transcribed and tagged data for future research. Once in place, this repository would be available not only for linguistic research of different kinds, but also for more complex downstream NLP tasks such as speech recognition of accented speech, parsing, and automated translation.
The pilot stage of this project is expected to be completed by September 2019.
Migrant and refugee language repertoire
With a long-standing interest in migrant language practice, I am also putting together a research programme that systematically investigates how migrant communities foster their language practice. The recent immigration ‘crisis’ in the Western world is in part a struggle to navigate communities that are changing as a result of globalisation, and language, as a marker of identity and culture, has proved a particularly important point of contention in the midst of these changes. As people from different cultures merge, will language boundaries also evaporate? How are linguistic norms re-negotiated and re-defined during this process?
The Migrant and Refugee Language Repertoire project aims to answer these questions and to improve our understanding of why and how migrant communities develop their local, or even hybrid, linguistic strategies in specific contexts. The outcome may not only inform linguistic theories, but also provide tangible, empirical evidence to improve the practice of English teaching and language assessment for migrant and refugee speakers, a particularly challenging area about which little is still known. The situation is further complicated by the fact that these speakers come from diverse cultural, educational and linguistic backgrounds, possibly with little or no literacy skills, and range in age from children to senior citizens. Research output from this project may help inform government language policy for migrants and refugees in the future.
This project is in its infancy, and I welcome expressions of interest and collaboration.