Slide 1

Slide 1 text

Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples Kuncahyo Setyo Nugroho PyCon ID - 14 November 2020

Slide 2

Slide 2 text

Hello, call me Cahyo :)
■ Software Engineer, ITC and Data Center, Universitas Widyagama, Malang - Indonesia
■ Master Student, Faculty of Computer Science, Universitas Brawijaya, Malang - Indonesia; Intelligent Systems Laboratory, Affective Computing Research Interest Group (ACRIG)

Slide 3

Slide 3 text

Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples Kuncahyo Setyo Nugroho | Present in PyCon ID 2020

Slide 4

Slide 4 text

Text → Unstructured Data
Today, more than 80% of data is unstructured. To gain better insights or build better algorithms, it is necessary to ‘play’ around with the data to make it clean.

Slide 5

Slide 5 text

How would a machine read this?
Sources: twitter.com, play.google.com, tokopedia.com, shopee.co.id, suratwarga.malangkab.go.id

Slide 6

Slide 6 text

Garbage ‘in’, garbage ‘out’
Garbage Data → Garbage Result

Slide 7

Slide 7 text

Talk Outline
1. Introduction
■ Why this topic?
■ What’s text preprocessing?
■ Text preprocessing pipeline
2. Text Preprocessing
■ Python library
■ Text preprocessing techniques using Python
■ Rule of thumb. Do you need all techniques?

Slide 8

Slide 8 text

Why this Topic?
■ Why Text Preprocessing? Text preprocessing is a severely overlooked topic, and sometimes the wrong preprocessing techniques are applied.
■ Why Bahasa (Indonesia)? Bahasa is one of the ten most widely spoken languages in the world. However, few studies on it have been published in international journals or proceedings.

Slide 9

Slide 9 text

What’s Text Preprocessing?
An approach for cleaning and preparing text data for use in a specific task. A task is a combination of approach and domain: Task = approach + domain.
It is done to reduce the indexing (or data) file size of the text and to improve efficiency and effectiveness.

Slide 10

Slide 10 text

Text Preprocessing Pipeline
A ‘pipeline’ refers to a finite sequence of steps that takes ‘raw’ text as input and returns properly preprocessed ‘clean’ text as output. The text preprocessing pipeline may vary depending on the task.

Slide 11

Slide 11 text

Python Library for Text Preprocessing
1. NLTK (Natural Language Toolkit)
2. Sastrawi
3. SpaCy
4. Flair
5. Pandas
6. Scikit-learn
7. etc.

Slide 12

Slide 12 text

Text Preprocessing Techniques
Lower Case & Remove Whitespace
The simplest and most effective form of text preprocessing. To a machine, Indonesia ≠ INDONESIA ≠ indonesia.
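A minimal sketch of lowercasing and whitespace cleanup using only the standard library. The function name `lowercase_and_strip` is an illustrative choice, not something taken from the talk's code.

```python
import re


def lowercase_and_strip(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    text = text.lower()
    # \s+ matches spaces, tabs and newlines; strip() trims both ends.
    return re.sub(r"\s+", " ", text).strip()


print(lowercase_and_strip("  Selamat   Datang di\tINDONESIA \n"))
```

Lowercasing can hurt tasks where case carries meaning (for example, named entities), so treat it as a default rather than a rule.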

Slide 13

Slide 13 text

Text Preprocessing Techniques
Remove HTML tag
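One way to strip HTML tags, sketched with a regular expression and the standard `html` module. This is an assumption about how the step could be done, not the talk's exact code; a regex is only adequate for simple snippets, and a real HTML parser is safer for full pages.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def remove_html(text: str) -> str:
    """Drop anything that looks like a tag, then decode entities such as &amp;."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


print(remove_html("<p>Selamat <b>datang</b> &amp; terima kasih</p>"))
```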

Slide 14

Slide 14 text

Text Preprocessing Techniques
Remove URLs & Email
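A rough sketch of URL and email removal. Both patterns are deliberately simple assumptions: they cover common cases such as `https://...`, `www....` and `name@domain.tld`, not every valid address, and the sample address below is made up.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def remove_urls_and_emails(text: str) -> str:
    """Replace emails and URLs with a space, then tidy the whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(remove_urls_and_emails(
    "Kunjungi https://github.com/ksnugroho atau email ke cahyo@example.com ya"
))
```

This step should run before punctuation removal; otherwise the `://`, `@` and `.` markers that the patterns rely on are already gone.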

Slide 15

Slide 15 text

Text Preprocessing Techniques
Remove Numbers & Punctuation
Removes numbers and punctuation such as ‘123....?!’, as well as symbols such as ‘@#$’, from the text.
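A minimal sketch of this step using `string.punctuation`, which covers ASCII punctuation and symbols only. Replacing characters with a space (rather than deleting them) is a design choice made here so that words joined by punctuation do not get glued together.

```python
import re
import string

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_numbers_and_punctuation(text: str) -> str:
    """Remove digits and ASCII punctuation/symbols, then tidy whitespace."""
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(remove_numbers_and_punctuation("Harga Rp 15.000,- saja!!! @toko #promo 123....?!"))
```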

Slide 16

Slide 16 text

Text Preprocessing Techniques
Remove Emoji
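Emoji removal can be sketched with a regular expression over Unicode ranges. The ranges below are an assumption covering the most common emoji blocks; they are not exhaustive and not necessarily the ones used in the talk.

```python
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows some emoji
    "]+"
)


def remove_emoji(text: str) -> str:
    """Replace emoji with a space and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", EMOJI_RE.sub(" ", text)).strip()


print(remove_emoji("Mantap \U0001F600\U0001F44D sekali"))
```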

Slide 17

Slide 17 text

Text Preprocessing Techniques
Remove Emoticon
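Emoticons are plain ASCII sequences, so a lookup list works better than Unicode ranges. The short list below is a hypothetical sample for illustration; a real list would be much longer.

```python
import re

# Hypothetical sample; extend as needed.
EMOTICONS = [":-)", ":)", ":-(", ":(", ":-D", ":D", ";-)", ";)", ":-P", ":P", ":'("]

# Longest first, so ":-)" is tried before ":)".
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    """Replace known emoticons with a space and tidy whitespace."""
    return re.sub(r"\s+", " ", EMOTICON_RE.sub(" ", text)).strip()


print(remove_emoticons("Terima kasih :-) sampai jumpa :D"))
```

Run this before lowercasing and punctuation removal: lowercasing turns `:D` into `:d`, and punctuation removal destroys the emoticons before they can be matched.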

Slide 18

Slide 18 text

Text Preprocessing Techniques
Emoji Conversion
https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
For example, 😀 → grinning_face
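Instead of deleting emoji, they can be turned into words so the sentiment survives. The sketch below does not use the emot dictionary linked on the slide; as a stand-in it derives labels from the official Unicode character names via the standard `unicodedata` module, so labels other than `grinning_face` may differ from emot's. The emoji ranges are an assumption and not exhaustive.

```python
import re
import unicodedata

EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]")


def convert_emoji(text: str) -> str:
    """Replace each emoji with a snake_case label built from its Unicode name."""

    def to_label(match: re.Match) -> str:
        name = unicodedata.name(match.group(0), "")  # "" if the code point is unnamed
        label = name.lower().replace(" ", "_").replace("-", "_")
        return f" {label} "

    return re.sub(r"\s+", " ", EMOJI_RE.sub(to_label, text)).strip()


print(convert_emoji("Keren \U0001F600"))
```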

Slide 19

Slide 19 text

Text Preprocessing Techniques
Emoticon Conversion
https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
For example, :-) → happy_face_smiley
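Emoticon conversion is a dictionary lookup. In the sketch below only the `:-)` label comes from the slide; every other entry is a made-up label for illustration and is not copied from the emot dictionary.

```python
import re

# Only ":-)" -> "happy_face_smiley" is from the slide; the rest are assumptions.
EMOTICON_LABELS = {
    ":-)": "happy_face_smiley",
    ":)": "happy_face_smiley",
    ":-(": "sad_face",
    ":(": "sad_face",
    ";-)": "wink",
    ";)": "wink",
}

EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_LABELS, key=len, reverse=True))
)


def convert_emoticons(text: str) -> str:
    """Replace each known emoticon with its text label."""
    converted = EMOTICON_RE.sub(lambda m: f" {EMOTICON_LABELS[m.group(0)]} ", text)
    return re.sub(r"\s+", " ", converted).strip()


print(convert_emoticons("Terima kasih :-)"))
```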

Slide 20

Slide 20 text

Text Preprocessing Techniques
Remove Non-ASCII Characters
https://www.ascii-code.com
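A minimal sketch of non-ASCII removal. The NFKD normalization step is an addition assumed here so that accented letters keep their base letter instead of disappearing entirely; drop it if plain deletion is wanted.

```python
import re
import unicodedata


def remove_non_ascii(text: str) -> str:
    """Decompose accented characters, then drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()


# Input contains an accented letter, curly quotes and an en dash.
print(remove_non_ascii("Caf\u00e9 \u201cMalang\u201d \u2013 enak"))
```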

Slide 21

Slide 21 text

Text Preprocessing Techniques
Slang Word Normalization
Maps slang words to their standard forms. For example, gmn → bagaimana, jwb → jawab, gueh → saya.
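Slang normalization can be sketched as a word-by-word dictionary lookup. The dictionary below contains only the three pairs from the slide; a usable lexicon would need thousands of entries, and where it comes from is left open here.

```python
# The three pairs shown on the slide.
SLANG_DICT = {
    "gmn": "bagaimana",
    "jwb": "jawab",
    "gueh": "saya",
}


def normalize_slang(text: str) -> str:
    """Replace each whitespace-separated word found in the slang dictionary."""
    return " ".join(SLANG_DICT.get(word, word) for word in text.split())


print(normalize_slang("gmn jwb nya"))
```

The lookup is exact-match, so it works best after lowercasing and punctuation removal.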

Slide 22

Slide 22 text

Text Preprocessing Techniques
Stemming
The process of reducing inflected words to their root form. For example, the words ‘mendengarkan’, ‘dengarkan’, and ‘didengarkan’ are all transformed into the word ‘dengar’.
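To show the idea only, here is a toy affix-stripping stemmer. It is not Sastrawi's algorithm and will mis-stem many real words; the affix lists and the minimum root length of 4 are assumptions chosen so the slide's three examples work. For real use, a dedicated Indonesian stemmer such as the one in Sastrawi is the better choice.

```python
# Toy affix lists (assumptions, far from complete).
PREFIXES = ("meng", "meny", "mem", "men", "me", "di", "ber", "ter", "ke", "se")
SUFFIXES = ("kan", "an", "i")
MIN_ROOT_LEN = 4  # avoid stripping short words down to nothing


def naive_stem(word: str) -> str:
    """Strip at most one suffix and one prefix from a single word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_ROOT_LEN:
            word = word[: -len(suffix)]
            break
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_ROOT_LEN:
            word = word[len(prefix):]
            break
    return word


for w in ("mendengarkan", "dengarkan", "didengarkan"):
    print(w, "->", naive_stem(w))
```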

Slide 23

Slide 23 text

Text Preprocessing Techniques
Word & Sentence Tokenize
The process of separating text into pieces (words or sentences) called tokens. Words, numbers, symbols, punctuation marks, and other important entities can be considered tokens.
‘Selamat datang di Pycon ID 2020.’ → ‘Selamat’ ‘datang’ ‘di’ ‘Pycon’ ‘ID’ ‘2020’ ‘.’
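A regex-based sketch of both tokenizers using only the standard library. It reproduces the slide's example, but the patterns are simple assumptions: the sentence splitter, for instance, will wrongly split after abbreviations such as "dr.".

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def sentence_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(word_tokenize("Selamat datang di Pycon ID 2020."))
print(sentence_tokenize("Selamat datang. Apa kabar?"))
```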

Slide 24

Slide 24 text

Text Preprocessing Techniques
Stopwords Removal (NLTK)
Removes low-information (noise) words from the text. Examples of stopwords in Indonesian are ‘yang’, ‘dan’, ‘di’, ‘dari’, etc.
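The mechanics of stopword removal, sketched with a tiny hand-written set containing only the four words from the slide. Libraries such as NLTK and Sastrawi ship much longer Indonesian lists; this sketch swaps in a literal set so it runs without any download.

```python
# Only the four examples from the slide; real lists are far longer.
STOPWORDS = {"yang", "dan", "di", "dari"}


def remove_stopwords(text: str) -> str:
    """Keep only the words that are not in the stopword set."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


print(remove_stopwords("buku yang dibeli dari toko dan pasar"))
```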

Slide 25

Slide 25 text

Text Preprocessing Techniques
Stopwords Removal (Sastrawi)

Slide 26

Slide 26 text

Combine into Text Preprocessing Pipeline
https://github.com/ksnugroho/pycon-id-2020
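Combining the steps amounts to applying an ordered list of functions, each taking and returning a string. The sketch below is a compact stand-in for illustration, not the code in the linked repository; the step names, the step order, and the four-word stopword set are all assumptions.

```python
import re
import string

STOPWORDS = {"yang", "dan", "di", "dari"}  # tiny sample set
NUM_PUNCT_RE = re.compile(r"[\d" + re.escape(string.punctuation) + r"]")


def to_lower(text: str) -> str:
    return text.lower()


def strip_urls(text: str) -> str:
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)


def strip_numbers_punct(text: str) -> str:
    return NUM_PUNCT_RE.sub(" ", text)


def squeeze_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def drop_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# Order matters: URLs must go before punctuation is removed.
PIPELINE = [to_lower, strip_urls, strip_numbers_punct, squeeze_spaces, drop_stopwords]


def preprocess(text: str, steps=PIPELINE) -> str:
    """Run the raw text through each step in order."""
    for step in steps:
        text = step(text)
    return text


print(preprocess("Selamat DATANG di PyCon ID 2020!!! Info: https://example.com"))
```

Because the pipeline is just a list, a task that should keep stopwords or numbers can pass its own `steps` list without changing any of the functions.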

Slide 27

Slide 27 text

Rule of thumb. Do you need all techniques?
There are no definite rules for the steps in text preprocessing. Not all tasks need the same level of text preprocessing. For some tasks, you can get away with the minimum effort.
Must Do:
■ Noise removal
■ Lowercasing (can be task dependent in some cases)
Should Do:
■ Simple normalization
Task Dependent:
■ Advanced normalization
■ Stop-word removal
■ Stemming
■ Text enrichment / augmentation

Slide 28

Slide 28 text

Thank you! See you at PyCon ID 2021 :)
Would love to connect; feel free to reach out. Any questions or discussion?
https://www.linkedin.com/in/ksnugroho
https://github.com/ksnugroho