
Lekan
September 15, 2018

Natural Language Processing: The need for a Yoruba Corpus

Natural language powered platforms have one big problem: supporting indigenous African languages. This talk analyses some of the problems with these platforms and the work being done to introduce support for more African languages, especially Yoruba.


Transcript

  1. Who am I? Open Source Contributor; Machine Learning Engineer and Aspiring Machine Learning Researcher; huge fan of Saheed Osupa; advocate for Open Machine Learning (www.openml.org).
  2. NLP Pipeline. STEP 1: Choose a corpus. STEP 2: Choose annotations (labels) to use. STEP 3: Choose or implement an NLP algorithm.
  3. nltk.corpus: Gutenberg, Shakespeare, Inaugural, Reuters. nltk.corpus: Ijapa Ti Roko, Bibeli Yoruba, Odu Ifa, Langbodo. http://www.nltk.org/nltk_data/
  4. Problem Statement: How do we build an extensive and standard corpus for the Yoruba language?
  5. Work Done or To Be Done: What has been done and what is yet to be?
  6. 13

  7. https://corplinguistics.wordpress.com/tag/yoruba/; ASP corpus; Yoruba Wikipedia; LDC Lexicon Database; Kola Tubosun; Google Internationalization Corpus Crawler; Babatunde Obalalu.
  8. What We Have Done

    Usage Words: An average of 2000 words/sentences on usage of Yoruba in different contexts.
    Oriki (Ife, Ibadan and Ede): Manual translations of orikis of my home town (Ede), Ibadan and Ijebu Ode.
    Yoruba Names: Scraped Yoruba names from Kola Tubosun’s project.
    Fuji Music (Saheed Osupa, Pasuma): Collaborated with kiosk disc sellers to get some songs by Saheed Osupa and Pasuma written down and manually translated to English.
    Bible (KJV): A combination of manually translated and scraped Bible verses from the Pentateuch chapters.
    Lexicology (ASP Corpus, LDC Lexical Database): A database containing lexical and morphological usage of the Yoruba language.
  9. (1) Talk to more linguists about the work, ask for advice and generally go more into academia for a solution. (2) Do more for open source: try to bring more interested developers into the work and generally be more open. (3) Look more into existing (new) solutions. (4) More language support.
  10. Ingestion: I. Scheduling. II. Adding new feeds. III. Synchronising feeds, finding duplicates. IV. Parsing different feeds/entries into a standard form. V. Monitoring.
  11. 19

  12. 21

  13. Storage: I. Database choice. II. Data representation, indexing, fetching. III. Connection and configuration. IV. Error tracking and handling. V. Exporting.
  14. Lessons: (1) Building a corpus is not an easy task. (2) Optimise for flexibility and easy iteration. (3) Talk to people. (4) Diversify.
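
The Ingestion and Storage slides above name several steps without showing them: parsing different feeds into a standard form, finding duplicates, choosing a database, and exporting. The following is a minimal sketch of how those steps could fit together, not the speaker's implementation. The `Entry` schema, the feed key names (`text`, `yoruba`, `translation`, `english`), the choice of SQLite, and every function name are assumptions made for illustration. Unicode NFC normalisation is used because Yoruba tone marks and underdots can be encoded in composed or decomposed form, which would otherwise hide duplicates.

```python
import hashlib
import json
import sqlite3
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical standard form for one corpus entry."""
    source: str       # e.g. "bible", "oriki", "names"
    text: str         # Yoruba text, NFC-normalised
    translation: str  # English translation, may be empty


def normalise(text: str) -> str:
    # NFC makes composed and decomposed spellings compare equal;
    # split/join collapses stray whitespace from scraped pages.
    return " ".join(unicodedata.normalize("NFC", text).split())


def parse(source: str, raw: dict) -> Entry:
    # Assumption: feeds use different key names for the same fields.
    text = raw.get("text") or raw.get("yoruba") or ""
    translation = raw.get("translation") or raw.get("english") or ""
    return Entry(source, normalise(text), normalise(translation))


def fingerprint(entry: Entry) -> str:
    # Case-insensitive hash of the Yoruba text, used to spot duplicates.
    return hashlib.sha256(entry.text.casefold().encode("utf-8")).hexdigest()


def store(conn: sqlite3.Connection, entries) -> int:
    """Insert entries, skipping duplicates; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "fingerprint TEXT PRIMARY KEY, source TEXT, text TEXT, translation TEXT)"
    )
    before = conn.total_changes
    for e in entries:
        # The primary key on the fingerprint makes duplicates a no-op.
        conn.execute(
            "INSERT OR IGNORE INTO entries VALUES (?, ?, ?, ?)",
            (fingerprint(e), e.source, e.text, e.translation),
        )
    conn.commit()
    return conn.total_changes - before


def export_jsonl(conn: sqlite3.Connection) -> str:
    """Export all stored entries as JSON Lines, one entry per line."""
    rows = conn.execute(
        "SELECT source, text, translation FROM entries ORDER BY rowid"
    )
    return "\n".join(
        json.dumps({"source": s, "text": t, "translation": tr}, ensure_ascii=False)
        for s, t, tr in rows
    )
```

Keeping the standard form as a small dataclass and the duplicate check as a database constraint leaves room to add new feeds or fields later, in line with the "optimise for flexibility and easy iteration" lesson. A real pipeline would still need the scheduling, monitoring, and error handling that the slides list.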