Natural Language Processing: The Need for a Yoruba Corpus

1 Natural Language Processing: The Need for a Yoruba Corpus
Olamilekan Wahab

2 The Agenda
Me
Motivation
Work Done/To be Done
Things That Can Go Wrong
Lessons
Questions

3 Who am I?
1. Open Source Contributor
2. Machine Learning Engineer and Aspiring Machine Learning Researcher
3. Huge fan of Saheed Osupa
4. Advocate for Open Machine Learning (www.openml.org)

4 Motivation
What led to this project and what is driving it?

5 NLP
Everybody is doing it

6 A little Story
How did I get drawn into this mess?

7 NLP Pipeline
STEP 1: Choose a corpus
STEP 2: Choose annotations (labels) to use
STEP 3: Choose or implement an NLP algorithm
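The three pipeline steps can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the four sentences, the `yo`/`en` language tags, and the `train` and `predict` helpers are all made-up placeholders, and the "algorithm" is a deliberately naive word-count classifier.

```python
from collections import Counter, defaultdict

# STEP 1: choose a corpus (toy placeholder sentences, not from the talk).
corpus = [
    "e kaaro",
    "e kaasan",
    "good morning",
    "good afternoon",
]

# STEP 2: choose annotations (labels); here, one language tag per sentence.
labels = ["yo", "yo", "en", "en"]


# STEP 3: choose or implement an algorithm; here, a naive word-count classifier.
def train(sentences, tags):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for sentence, tag in zip(sentences, tags):
        counts[tag].update(sentence.lower().split())
    return counts


def predict(counts, sentence):
    """Return the label whose training words overlap most with the sentence."""
    words = sentence.lower().split()
    scores = {tag: sum(c[w] for w in words) for tag, c in counts.items()}
    return max(scores, key=scores.get)


model = train(corpus, labels)
print(predict(model, "e kaaro"))
```

The point of the sketch is the dependency order: without a corpus in step 1 there is nothing to annotate or train on, which is the gap the rest of the talk addresses for Yoruba.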

8 nltk.corpus
Gutenberg, Shakespeare, Inaugural, Reuters
Ijapa Ti Roko, Bibeli Yoruba, Odu Ifa, Langbodo
http://www.nltk.org/nltk_data/

9 Problem Statement
How do we build an extensive and standard corpus for the Yoruba Language?

10 Corpus
A source of language use examples.
Annotated, Unified, Electronic, Balanced

11 Work Done or To be Done
What has been done and what is yet to be?

12 HISTORY CULTURE NLP HEALTH 10 2 7 18

14 https://corplinguistics.wordpress.com/tag/yoruba/
ASP corpus
Yoruba Wikipedia
LDC Lexicon Database
Kola Tubosun
Google Internationalization Corpus Crawler
Babatunde Obalalu

15 What We Have Done
Usage Words: An average of 2000 words/sentences on usage of Yoruba in different contexts.
Oriki (Ife, Ibadan and Ede): Manual translations of orikis of my home town (Ede), Ibadan and Ijebu Ode.
Yoruba Names: Scraped Yoruba names from Kola Tubosun’s project.
Fuji Music (Saheed Osupa, Pasuma): Collaborated with kiosk disc sellers to get some songs by Saheed Osupa and Pasuma written down and manually translated to English.
Bible (KJV): A combination of manually translated and scraped Bible verses from the Pentateuch.
ASP Corpus: Lexicology.
LDC Lexical Database: A database containing lexical and morphological usage of the Yoruba language.

16 To be Done
1. Talk to more linguists about the work, ask for advice and generally look more to academia for a solution.
2. Do more for open source. Try to bring more interested developers into the work and generally be more open.
3. Look more into existing (new) solutions.
4. More language support.

17 What Can Go Wrong?
It’s not an undocumented feature. It’s a bug.

18 Ingestion
I. Scheduling
II. Adding new feeds
III. Synchronising feeds, finding duplicates
IV. Parsing different feeds/entries into a standard form
V. Monitoring
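Two of the ingestion problems above, finding duplicates and parsing entries into a standard form, interact for Yoruba text because the same accented word can be encoded in more than one way. The sketch below is one possible approach, not the project's actual code: the `to_standard_form` and `deduplicate` helpers, the record fields, and the feed names are assumptions made for illustration. It normalises text to Unicode NFC before fingerprinting it.

```python
import hashlib
import unicodedata


def to_standard_form(source, raw_text):
    """Parse one raw entry into a standard record (hypothetical schema)."""
    # NFC makes precomposed and combining-accent spellings byte-identical.
    text = unicodedata.normalize("NFC", raw_text)
    # Collapse runs of whitespace and trim the ends.
    text = " ".join(text.split())
    fingerprint = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
    return {"source": source, "text": text, "fingerprint": fingerprint}


def deduplicate(entries):
    """Keep the first entry seen for each fingerprint."""
    seen = set()
    unique = []
    for entry in entries:
        if entry["fingerprint"] not in seen:
            seen.add(entry["fingerprint"])
            unique.append(entry)
    return unique


composed = "Yor\u00f9b\u00e1"      # precomposed accented letters
decomposed = "Yoru\u0300ba\u0301"  # same word written with combining accents

entries = [
    to_standard_form("feed_a", composed),
    to_standard_form("feed_b", "  " + decomposed + " "),
    to_standard_form("feed_b", "Ede"),
]
unique = deduplicate(entries)
print(len(entries), len(unique))
```

Exact-match fingerprints only catch entries that are identical after normalisation; near-duplicates (for example the same verse with different punctuation) would need a fuzzier comparison.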

20 Corpus Batch

22 Storage
I. Database choice
II. Data representation, indexing, fetching
III. Connection and configuration
IV. Error tracking and handling
V. Exporting
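As a rough illustration of the storage concerns listed above, the sketch below uses Python's built-in `sqlite3` module. The talk does not say which database the project uses, so SQLite, the `entries` table, its columns, and the `add_entry` and `export_jsonl` helpers are all assumptions chosen to keep the example self-contained.

```python
import json
import sqlite3

# Connection and configuration: an in-memory database keeps the sketch portable.
conn = sqlite3.connect(":memory:")

# Data representation: one row per corpus entry, with a uniqueness constraint.
conn.execute(
    """
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,
        text TEXT NOT NULL UNIQUE
    )
    """
)
# Indexing: speeds up fetching all entries of one category.
conn.execute("CREATE INDEX idx_entries_category ON entries (category)")


def add_entry(conn, category, text):
    """Insert an entry; return False instead of raising on a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO entries (category, text) VALUES (?, ?)",
                (category, text),
            )
        return True
    except sqlite3.IntegrityError:
        # Error handling: duplicates are reported to the caller, not fatal.
        return False


def export_jsonl(conn):
    """Export every entry as one JSON object per line."""
    rows = conn.execute("SELECT category, text FROM entries ORDER BY id")
    return "\n".join(
        json.dumps({"category": c, "text": t}, ensure_ascii=False)
        for c, t in rows
    )


added = [
    add_entry(conn, "names", "Olamilekan"),
    add_entry(conn, "names", "Olamilekan"),
    add_entry(conn, "oriki", "Ede"),
]
exported = export_jsonl(conn)
print(exported)
```

Passing `ensure_ascii=False` keeps accented Yoruba characters readable in the exported file rather than turning them into escape sequences.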

23 Lessons
What did we learn from all of this?

24 Lessons
1. Building a corpus is not an easy task.
2. Optimise for flexibility and easy iteration.
3. Talk to people.
4. Diversify.

25 Tool set
CKAN, Baleen, Mink, NLTK, YellowBrick

26 Thank you
Twitter: __olamilekan__
Github: olamyy
Email: [email protected]
