Ongoing Projects


1. K. Sarveswaran

Project Title: A deep syntactic parser for the Tamil language.

K. Sarveswaran is a doctoral student and assistant at the National Languages Processing Centre, University of Moratuwa, and has been attached to the centre since its inception. Sarves is developing a deep syntactic parser for the Tamil language. Despite its large speaker population around the world and its historical time depth, Tamil is an under-researched language that is also under-resourced from the perspective of Natural Language Processing. He therefore believes that developing a parser would be a useful contribution to the field, particularly to the development of machine translation applications. He has published a language processing tool stack for Tamil called ThamizhiLP, which so far consists of a pre-processor, a part-of-speech tagger, a morphological analyser/generator and a dependency parser. In addition, he works jointly with other members of the NLPC team. Sarves has published several research papers, including a journal article. He has also organised an international summer school and workshops for the centre. Apart from these, he collaborates with researchers at the University of Konstanz and the University of Hyderabad on Tamil language resource development, and in this connection he spent a few months at the University of Konstanz in 2018 and 2019.
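ThamizhiLP covers the typical stages of such a pipeline (pre-processing, POS tagging, morphological analysis and dependency parsing). As a rough illustration of what this kind of pipeline produces, here is a minimal sketch using the off-the-shelf Stanza toolkit with its Universal Dependencies Tamil (TTB) model; it is an illustrative stand-in, not the author's own tools.

import stanza

stanza.download("ta")  # fetch the Tamil models on first run
nlp = stanza.Pipeline("ta", processors="tokenize,pos,lemma,depparse")

doc = nlp("தமிழ் ஒரு செம்மொழி ஆகும்.")  # "Tamil is a classical language."
for sentence in doc.sentences:
    for word in sentence.words:
        # surface form, universal POS tag, index of the head word, dependency relation
        print(word.text, word.upos, word.head, word.deprel)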

Supervisors: Prof. Gihan Dias (University of Moratuwa) and Prof. Miriam Butt (University of Konstanz, Germany).

 


2. Anosha Ignatius

Project title: Speech Embedding with Segregation of Paralinguistic Information for the Tamil Language

Project Description: The presence of paralinguistic information such as speaker characteristics, accent, and emotional expression causes performance degradation in speech processing applications where only the linguistic content is needed. This project aims to develop a speech embedding model for the Tamil language that disentangles the underlying paralinguistic information in the speech signal while preserving the linguistic content.
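One common way to realise this kind of disentanglement is to give the encoder separate content and speaker heads and to train the content branch adversarially, so that speaker identity cannot be recovered from it. The following is a minimal PyTorch sketch of that idea under assumed layer sizes and names; it illustrates the general technique, not the project's actual model.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DisentanglingEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, emb_dim=128, n_speakers=100):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.content_head = nn.Linear(hidden, emb_dim)    # linguistic content
        self.speaker_head = nn.Linear(hidden, emb_dim)    # paralinguistic information
        # adversarial classifier: tries to recover the speaker from the content
        # embedding; the reversed gradient pushes that embedding to be speaker-invariant
        self.adv_speaker_clf = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        _, h = self.rnn(feats)
        h = h.squeeze(0)                      # (batch, hidden)
        content = self.content_head(h)
        speaker = self.speaker_head(h)
        adv_logits = self.adv_speaker_clf(GradReverse.apply(content))
        return content, speaker, adv_logits

# Example forward pass: a batch of 4 utterances of 200 frames of 80-dim features.
model = DisentanglingEncoder()
content, speaker, adv_logits = model(torch.randn(4, 200, 80))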

Supervisor: Dr. Uthayasanker Thayasivam

 


3. Chathuri Jayaweera

Project title: Automatic Post Editing (APE) for Sinhala-English Machine Translation

Automatic Post Editing (APE) seeks to automatically refine the output of a black-box machine translation (MT) system by learning from human post-edits (PE). The objectives of this technique are to replace the time- and energy-consuming human post-editing process with a faster automated approach, and to improve MT quality without building new MT systems from scratch. APE systems assume the availability of the source-language input text, the MT output and target-language PE data. In the early stages of APE, only the MT output and the target-language PE were used as inputs, but it was later observed that integrating source-language information with these two inputs conveys useful context and improves APE performance. Several implementations with varying architectures have shown significant results on certain language pairs, such as English-German. This research therefore focuses on utilising currently available APE approaches to improve the quality of Sinhala-English translation.
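The input/output structure described above can be made concrete with a small sketch of how APE training data is usually organised: triplets of source text, raw MT output and human post-edit, with the source and MT output joined into a single encoder input and the post-edit used as the target. The separator token and field names below are illustrative assumptions, not the author's system.

SEP = " <sep> "   # assumed separator token

# Hypothetical Sinhala-English APE triplets; the strings are placeholders.
triplets = [
    {
        "src": "<Sinhala source sentence>",
        "mt":  "<raw English MT output>",
        "pe":  "<human post-edited English>",
    },
]

def to_seq2seq_pair(triplet):
    """Build one (input, target) training pair for a sequence-to-sequence APE model."""
    ape_input = triplet["src"] + SEP + triplet["mt"]   # source + MT output as one input
    ape_target = triplet["pe"]                         # the post-edit is the target
    return ape_input, ape_target

pairs = [to_seq2seq_pair(t) for t in triplets]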

Achievements: A part of this study was submitted and presented at the 6th annual Symposium on Natural Language Processing (SNLP 2020), organised by the National Language Processing Centre of the University of Moratuwa.


Supervisor: Prof. Gihan Dias

 


4. Kallindu Kumarasinghe

Project title: Sinhala Grammar Error Corrector


He is currently working on the Sinhala Grammar Error Corrector project, which will identify grammatical errors in Sinhala sentences and suggest corrections for them. In addition, he is contributing to Sin-Morphy, a Sinhala morphological analyzer.

Supervisor: Prof. Gihan Dias

 


5. Janarthanasarma Baskarakurukkal

Project Title: A neural machine translation (NMT) system for the English-Tamil language pair

He is currently working on a neural machine translation (NMT) system for the English-Tamil language pair. He is working on improving the efficiency of deep learning systems for machine translation involving Tamil, which is a low-resource and morphologically rich language, focusing mainly on techniques such as subword segmentation for NMT.

Bio: Janarthanasarma is currently a postgraduate research student at the University of Moratuwa. He earned his undergraduate degree in Computer Science and Engineering from the same university in 2018. He also has one and a half years of industry experience as a software engineer.
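As an illustration of the subword segmentation mentioned above, the following minimal sketch trains a SentencePiece BPE model on a (hypothetical) Tamil corpus file and segments a sentence into subword units, so that rare inflected word forms are composed of smaller, more frequent pieces; the file name and vocabulary size are assumptions for illustration only.

import sentencepiece as spm

# Train a BPE model on raw Tamil text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="tamil_corpus.txt",    # hypothetical corpus file
    model_prefix="ta_bpe",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="ta_bpe.model")
pieces = sp.encode("இது ஒரு எடுத்துக்காட்டு வாக்கியம்.", out_type=str)
print(pieces)   # a list of subword pieces such as ['▁இது', '▁ஒரு', ...]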


Supervisors: Dr. Uthayasanker Thayasivam, Dr. Surangika Ranathunga
Achievement: A paper was accepted at the ICICT 2021 conference.
 

 


6. Koshiya Epaliyana


Project title: Using back-translation to improve Domain-Specific English-Sinhala Neural Machine Translation


This project uses data selection, iterative back-translation, and filtering together to improve the quality of back-translation in a low-resource setting; with respect to Sinhala-English MT, none of these techniques has been tried out. The objective of this research is to implement an optimal back-translation algorithm for Sinhala-English by improving existing back-translation algorithms with features specific to Sinhala-English, and to use the generated synthetic data to improve Sinhala-English Neural Machine Translation (NMT) over a baseline NMT model built using only domain-specific parallel data.
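The overall procedure can be sketched as a loop. The callables below (train_nmt, translate, select_in_domain, filter_pairs) are hypothetical hooks to be supplied by whichever NMT toolkit is used, so this shows only the shape of iterative back-translation with data selection and filtering, not the author's implementation.

def iterative_back_translation(parallel, mono_target, train_nmt, translate,
                               select_in_domain, filter_pairs, rounds=3):
    """parallel: list of (source, target) pairs; mono_target: target-language sentences."""
    forward = train_nmt(parallel)                              # source -> target model
    backward = train_nmt([(t, s) for (s, t) in parallel])      # target -> source model
    for _ in range(rounds):
        selected = select_in_domain(mono_target)               # data selection
        # back-translate selected target sentences into synthetic source sentences
        synthetic = [(translate(backward, t), t) for t in selected]
        synthetic = filter_pairs(synthetic)                    # drop noisy synthetic pairs
        forward = train_nmt(parallel + synthetic)              # retrain on real + synthetic
        backward = train_nmt([(t, s) for (s, t) in parallel + synthetic])
    return forward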

Supervisors: Dr. Surangika Ranathunga, Prof. Sanath Jayasena

 


7. Anushika Liyanage


Project Title: Automatically inducing a bilingual lexicon for Sinhala-English by generating multiple word embedding models for each language and mapping the generated embeddings into one latent space.

The generated bilingual lexicon will then be utilised in a task-specific downstream system to enhance the results of the overall system.
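One standard way to map two monolingual embedding spaces into a single latent space is the orthogonal Procrustes solution computed from a small seed dictionary, followed by nearest-neighbour lexicon induction. The NumPy sketch below illustrates that technique with random placeholder vectors standing in for real Sinhala and English embeddings; it is an assumed example, not the project's code.

import numpy as np

rng = np.random.default_rng(0)
dim = 50
X_seed = rng.normal(size=(200, dim))   # Sinhala vectors for seed-dictionary words (placeholders)
Y_seed = rng.normal(size=(200, dim))   # vectors of their English translations (placeholders)

# Orthogonal Procrustes: the W minimising ||X_seed @ W - Y_seed||_F with W orthogonal.
U, _, Vt = np.linalg.svd(X_seed.T @ Y_seed)
W = U @ Vt

# Induce new translation pairs: map every Sinhala vector into the shared space
# and take the nearest English vector by cosine similarity.
X_all = rng.normal(size=(1000, dim))   # all Sinhala word vectors (placeholders)
Y_all = rng.normal(size=(5000, dim))   # all English word vectors (placeholders)
mapped = X_all @ W
mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
Y_norm = Y_all / np.linalg.norm(Y_all, axis=1, keepdims=True)
nearest = (mapped @ Y_norm.T).argmax(axis=1)   # best English index for each Sinhala word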


Supervisors: Dr. Surangika Ranathunga, Prof. Sanath Jayasena
 

 


8. Sarubi Thillainathan


Project title: Multilingual Neural Machine Translation for Sinhala-Tamil-English Languages.


Neural Machine Translation (NMT) is a recently emerged paradigm for machine translation that has consistently outperformed Statistical Machine Translation (SMT). However, NMT has shown its effectiveness mainly on high-resource languages and is less effective for low-resource languages such as Sinhala and Tamil. Multilingual Neural Machine Translation can alleviate this data scarcity issue: a multilingual NMT model generalizes better than single-pair translation models for low-resource languages, and studies have shown that in multilingual NMT, strong knowledge transfer is observed from high-resource languages to low-resource languages. To the best of our knowledge, there are no prior studies on multilingual translation for Sri Lankan languages. Thus, this study proposes a multilingual NMT model to improve translation quality and to build a joint model rather than separate models for each language-pair translation.
Keywords: Neural Machine Translation, Multilingual Neural Machine Translation, Multi-task Learning, Transfer Learning, Low-Resource Language.
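A common way to build such a joint model is the target-language-token approach, in which each source sentence is prefixed with a token naming the desired output language so that a single shared model learns all translation directions. The sketch below shows only this data preparation step with placeholder sentences; the tag format and examples are assumptions, not the project's data.

def tag_for_multilingual(pairs, target_lang):
    """Prefix every source sentence with a token naming the desired target language."""
    return [(f"<2{target_lang}> {src}", tgt) for src, tgt in pairs]

# Placeholder sentence pairs for three translation directions.
si_en = [("<Sinhala sentence>", "<English translation>")]
ta_en = [("<Tamil sentence>", "<English translation>")]
en_si = [("<English sentence>", "<Sinhala translation>")]

# One combined training set for a single joint model covering all directions.
training_data = (
    tag_for_multilingual(si_en, "en")
    + tag_for_multilingual(ta_en, "en")
    + tag_for_multilingual(en_si, "si")
)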

Supervisors:  Dr. Surangika Ranathunga, Prof. Sanath Jayasena


 


9. Rameela Azeez


Project title: Named Entity Recognition


Currently working towards implementing a Named Entity Recognition (NER) system that supports Sinhala, English and Tamil using a common Named Entity (NE) tag set, and towards using its output to improve machine translation between these three languages.
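The idea of a common tag set can be illustrated with the usual BIO scheme: the same label inventory is applied to all three languages, so the tagger's output has one consistent format for the downstream MT system. The tag inventory and the tagged example below are illustrative assumptions, not the project's actual tag set.

COMMON_NE_TAGS = ["PER", "LOC", "ORG"]   # assumed shared tag inventory

# One English sentence tagged in BIO format with the common tag set;
# Sinhala and Tamil sentences would carry exactly the same labels.
example = [
    ("Mahinda", "B-PER"),
    ("visited", "O"),
    ("Colombo", "B-LOC"),
    ("yesterday", "O"),
    (".", "O"),
]

# Recognised entities can then be handled consistently during MT,
# e.g. collected before translation so they are copied or transliterated.
entities = [token for token, tag in example if tag != "O"]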

Supervisor: Dr. Surangika Ranathunga


 


10. Aloka Fernando


Project title: Addressing out-of-vocabulary (OOV) problem for Sinhala-English domain specific Neural Machine Translation (NMT)

A Neural Machine Translation (NMT) model is limited to producing reliable translations only for the most frequent words in the training corpus. Vocabulary that is not covered by the NMT model, referred to as out-of-vocabulary (OOV) words, leads to weaker translation outputs. Low-resourced language pairs such as Sinhala-English have a limited parallel corpus, and since Sinhala is a morphologically rich language, the OOV problem becomes severe. The research focuses on two techniques, data augmentation and subword-based encodings, to address the OOV problem.

The data augmentation technique focuses on synthetically inducing a parallel corpus that conforms to syntactic and semantic correctness. The second technique utilises linguistically motivated character n-gram subword-unit encodings in the NMT model. The author therefore uses data augmentation coupled with linguistically motivated subword units to address the OOV problem in NMT for the Sinhala-English language pair.
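As a toy illustration of the data augmentation side only: one very simple flavour swaps a dictionary-aligned word on both sides of an existing sentence pair so that the synthetic pair stays aligned. The words, placeholder Sinhala tokens and swap table below are invented for illustration; the author's technique additionally enforces syntactic and semantic correctness, which a naive swap like this does not guarantee.

# swap table: (source word, target word) -> (replacement source word, replacement target word)
swaps = {("dog", "<si:dog>"): ("cat", "<si:cat>")}

def augment(pairs, swaps):
    """Create synthetic pairs by swapping dictionary-aligned words on both sides."""
    synthetic = []
    for src, tgt in pairs:
        for (s_old, t_old), (s_new, t_new) in swaps.items():
            if s_old in src.split() and t_old in tgt.split():
                synthetic.append((src.replace(s_old, s_new),
                                  tgt.replace(t_old, t_new)))
    return synthetic

parallel = [("the dog runs", "<si:the> <si:dog> <si:runs>")]
print(augment(parallel, swaps))   # [('the cat runs', '<si:the> <si:cat> <si:runs>')]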

Supervisors:  Prof. Gihan Dias, Dr. Surangika Ranathunga