Slide 24
3. Feature Extraction and Engineering
Bukalapak IFCS – 2019 - Thessaloniki
Given a set of text data, we have to
transform it into numeric values so that
the computer can understand and
learn from it. We used three
distinct feature extraction methods for NLP:
normalized TF-IDF (Term Frequency –
Inverse Document Frequency), FastText,
and Word2Vec.
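As an illustration of turning text into numeric values, here is a minimal sketch of normalized TF-IDF in plain Python. The exact TF, IDF, and normalization formulas used in the talk are not reproduced in this text, so the sketch assumes one common variant: relative term frequency, IDF = log(N / df), and L2 normalization. The function name `normalized_tfidf` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def normalized_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized.

    Assumed variant (not necessarily the one from the talk):
    TF = count / document length, IDF = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # L2 normalization: scale the vector to unit length when possible.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


# Hypothetical sample texts, for illustration only.
docs = ["red shoes sale", "blue shoes", "red hat"]
vectors = normalized_tfidf(docs)
```

With this variant, a term that occurs in every document gets weight 0, and rarer terms such as "sale" outweigh more common ones such as "red" within the same text.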
Related Academic Papers
1. TF-IDF: Allahyari, et al. 2017. A Brief Survey of Text Mining:
Classification, Clustering and Extraction Techniques. In
Proceedings of KDD Bigdas, Halifax, Canada, August 2017, 13
pages.
2. FastText: Joulin, et al. 2017. FastText.zip: Compressing Text
Classification Models. 5th International Conference on Learning
Representations Proceedings.
3. Word2Vec: Mikolov, et al. 2013. Distributed Representations of
Words and Phrases and their Compositionality.
TF-IDF (1)
How frequently each word appears relative to the sentence and to the whole document.
Steps: TF formula, IDF formula, normalization.
FastText (2)
Pre-trained DNN model from Facebook Research.
Word2Vec (3)
Pre-trained DNN model using skip-grams.
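To make the skip-gram idea concrete: the skip-gram variant of Word2Vec learns by predicting the surrounding context words from each center word. The sketch below only shows how such (center, context) training pairs can be formed. It assumes pre-tokenized input and a hypothetical `window` parameter, and it does not cover the network training itself or the pre-trained model used in the talk.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokenized text, for illustration only.
pairs = skipgram_pairs(["blue", "running", "shoes"], window=1)
```

A larger window yields more pairs per word and tends to capture broader topical similarity, while a smaller window focuses on immediate neighbors.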