Kaggle for Data Scientists (Women in AI)

Women in AI Myanmar: WAI Master Class, Data Science Series #2
Kaggle for Data Scientists
Aye Hninn Khine (Ph.D. Candidate, Computer Science)
Zin Tun (Senior Data Scientist)

Contents
• Data Science Concepts
• Kaggle for Data Scientists

What is Data Science?
Data science is a “concept to unify statistics, data analysis, machine learning, and their related models” in order to “understand and analyze actual phenomena” with data.
Image Credit: https://vas3k.com/blog/machine_learning/

What is Data?
• A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data are multiple such measurements.
• Claim: everything is (can be) data!
*CS 109A, Harvard University

Why Data Science? (Bigger Data)
EVERY MINUTE:
• Giphy serves up 4,800,000 gifs
• Netflix users stream 694,444 hours of video
• Instagram users post 277,777 stories
• YouTube users watch 4,500,000 videos
• Twitter users send 511,200 tweets
• Skype users make 231,840 calls
• Airbnb books 1,389 reservations
• Uber users take 9,772 rides
• Tinder users swipe 1,400,000 times
• Google conducts 4,497,420 searches
• Twitch users view 1,000,000 videos
Source: https://www.weforum.org/agenda/2019/07/why-big-data-keeps-getting-bigger?

[Source: Data Flair Training]

Why Data Science? (High Demand)

Why Data Science? (High Salary)

Applications of Data Science

Data Science Adds Value
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
Chart axes: Value, Complexity. Chart labels: Business Intelligence, Data Science.

Do you ever wonder how Netflix recommends you shows?

Ever think about how you enjoyed the show recommended by
Netflix, much more than the one recommended by your friend?

Let’s try breaking the process down
• Data Collection: Where is Netflix getting our data from?
• Exploring Data: What does Netflix even do with this data?
• Deriving Results: What patterns is Netflix looking for?
• Putting the results to use: How are these results used to recommend a new show to you?

Netflix Data
Netflix has data about:
• User profiles: account type, year joined
• User preferences: search history, watch history
• Shows: genre, duration, actors
How can they:
• Make recommendations
• Evaluate if their recommendations are correct
• Improve on the recommendations

Skill sets of a Data Scientist
• Statistics and probability
• Linear algebra
• Numerical analysis
• Proficiency in a programming language (Python or R)
• Research methodology
• Ethics (data privacy)
• Business and domain knowledge

Programming Languages

Skill sets (Ethics)
• Data is the most valuable asset and the most powerful weapon nowadays.
• If you have data, you are the KING who can rule the world.

Data Science Process

Data Science Process (Credit: CS 109A, Harvard University)
Ask an interesting question
• What is the scientific goal?
• What would you do if you had all data?
• What do you want to predict or estimate?
Get the data
• How were the data sampled?
• Which data are relevant?
• Are there any privacy issues?
Explore the data
• Plot the data.
• Are there any anomalies?
• Are there patterns?
Model the data
• Build a model.
• Fit a model.
• Validate a model.
Communicate and visualize the result
• What did we learn?
• Do the results make sense?
• Can we tell a story?

Machine Learning Process
Image Credit: https://elearningindustry.com/machine-learning-process-and-scenarios, Akhil Mittal

Data Collection: Where do data come from?
• Internal sources: already collected by, or part of the overall data collection of, your organization. For example: business-centric data available in the organization's database to record day-to-day operations; scientific or experimental data.
• Existing external sources: available in a ready-to-read format from an outside source, for free or for a fee. For example: public government databases, stock market data, Yelp reviews, [your favorite sport]-reference.
• External sources requiring collection efforts: available from an external source, but acquisition requires special processing. For example: data appearing only in print form, or data on websites.

Data Collection: Ways to gather online data
API (Application Programming Interface)
• Using a prebuilt set of functions developed by a company to access their services. Often pay to use.
• For example: Google Maps API, Facebook API, Twitter API
Web Scraping
• Using software or scripts, or working by hand, to extract data from what is displayed on a page or contained in the HTML file (often in tables).
• For example: BeautifulSoup
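
The web-scraping idea can be sketched in a few lines. This is a minimal illustration, assuming the page's HTML is already in a string; the table content and the names `SAMPLE_HTML`, `TableParser` and `scrape_table` are made up for the example. The slide names BeautifulSoup, but to stay dependency-free this sketch uses Python's standard-library `html.parser` instead.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>3</td></tr>
  <tr><td>Banana</td><td>2</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
```

Every scraped value arrives as text, so numeric columns such as the hypothetical `Price` still need converting before analysis.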

Characteristics of Data: Data Format
• Structured data: databases, CSV, XLSX
• Unstructured data: text, image, acoustic
• Semi-structured data: XML, JSON

Characteristics of Data: Data Objects
A sample student table (rows are data objects, columns are attributes):

StudentID  Name  Date of Birth  Address
1          John  02/01/1997     New York
2          Yumi  12/09/1997     Tokyo

• Datasets are made up of data objects.
• A data object represents an entity.
• E.g. University database: Students, Professors, Courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows are data objects; columns are attributes.

Characteristics of Data: Types of Attributes
• Numeric
• Categorical: Binary, Nominal, Ordinal
• Text, Image, Video, Audio

Characteristics of Data: Types of Attributes
Nominal: categories, states or “names of things”
• Gender = {Male, Female}, Hair_Color = {auburn, black, blond, brown, grey, red, white}
• Marital status, occupation, ID numbers, zip codes
Binary
• Nominal attribute with only two states (0 and 1)
• Symmetric binary: both outcomes equally important. E.g. Gender
• Asymmetric binary: outcomes not equally important. E.g. medical test (positive vs. negative)
Ordinal
• Values have a meaningful order (ranking), but the magnitude between successive values is not known.
• Size = {Small, Medium, Large}, grades, army rankings

Quiz: Characteristics of Data

Emp_ID  Name   Age  Gender  Role               Salary (US$)
1       Amy    25   F       Supervisor         300
2       Jane   27   F       Assistant Manager  500
3       Peter  30   M       Manager            800

Q1. How many data objects are in the employee dataset?
Q2. How many attributes does a data object have?
Q3. Define the data type for each attribute.

Data Quality
Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, ...
• Consistency: some modified but some not, ...
• Timeliness: timely update?
• Reliability: how far can we trust that the data are correct?
• Interpretability: how easily can the data be understood?

Data Quality: Common Issues
• Missing values: how do we fill in?
• Wrong values: how can we detect and correct?
• Messy format: how do we restructure?
• Not usable: the data cannot answer the question posed

Data Quality: Messy Data
The following is a table accounting for the number of produce deliveries over a weekend.
• What are the variables in this dataset? What object or event are we measuring?
• What’s the issue? How do we fix it?

Data Quality: Messy Data
We’re measuring individual deliveries; the variables are Time, Day, Number of Produce.
Problem:
• Each column header represents a single value rather than a variable.
• Row headers are “hiding” the Day variable.
• The values of the variable “Number of Produce” are not recorded in a single column.

Data Quality: Fixing Messy Data
We need to reorganize the information to make explicit the event we’re observing and the variables associated with this event.
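
The reorganization can be sketched as a wide-to-long reshape. The slide's table is not reproduced in this transcript, so the `messy` data below is hypothetical, and `to_long` is an illustrative helper rather than a library function. Each output record is one observed event with explicit Day, Time and Number of Produce variables.

```python
# Hypothetical "messy" table: one row per day, one column per time slot.
# The column headers (8am, 12pm, 4pm) are really values of a Time variable.
messy = [
    {"Day": "Saturday", "8am": 2, "12pm": 5, "4pm": 1},
    {"Day": "Sunday", "8am": 3, "12pm": 0, "4pm": 4},
]


def to_long(rows, id_col, var_name, value_name):
    """Turn every non-id column header into a value of `var_name`."""
    tidy = []
    for row in rows:
        for header, value in row.items():
            if header == id_col:
                continue  # the id column is copied into each output record
            tidy.append({
                id_col: row[id_col],
                var_name: header,
                value_name: value,
            })
    return tidy


tidy = to_long(messy, id_col="Day", var_name="Time",
               value_name="Number of Produce")
for record in tidy:
    print(record)
```

After the reshape, every variable has its own column and every row is a single observation, which is the layout most analysis tools expect.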

Data Preprocessing (Major Tasks)
Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
• Integration of multiple databases, data cubes or files
Data reduction
• Dimensionality reduction
• Feature selection
Data transformation and data discretization
• Normalization

Preprocessing: Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g. faulty instruments, human or computer error, transmission error.
Incomplete
• Lacking attribute values
Noisy
• Contains noise, errors or outliers
• E.g. Salary = -10
Inconsistent
• Contains discrepancies in codes or names
• E.g. Age = 42, Birthday = 03/07/2010
Intentional
• Disguised missing data
• E.g. Jan 1 as everyone’s birthday
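
The four kinds of dirty data above can be turned into simple record checks. This is a rough sketch, not a complete validator: the field names, the `find_issues` helper and the reference date are assumptions made for the example, and the slide's ambiguous date 03/07/2010 is read here as 7 March 2010.

```python
from datetime import date


def age_on(birthday, today):
    """Age in whole years on the date `today`."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1  # birthday has not happened yet this year
    return years


def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []
    salary = record.get("salary")
    age = record.get("age")
    birthday = record.get("birthday")

    if salary is None or age is None or birthday is None:
        issues.append("incomplete: missing attribute value")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")
    if (age is not None and birthday is not None
            and age_on(birthday, today) != age):
        issues.append("inconsistent: age does not match birthday")
    if birthday is not None and (birthday.month, birthday.day) == (1, 1):
        issues.append("possibly disguised missing value: Jan 1 birthday")
    return issues


# The slide's examples; the reference date is an assumption for this sketch.
today = date(2024, 1, 15)
record = {"salary": -10, "age": 42, "birthday": date(2010, 3, 7)}
issues = find_issues(record, today)
print(issues)
```

A Jan 1 birthday is only a hint, since some people really are born on that date, so the check flags it for review rather than treating it as an error.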

Data Cleaning: Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to:
◦ Equipment malfunction
◦ Data that was inconsistent with other recorded data and thus deleted
◦ Data not entered due to misunderstanding
◦ History or changes of the data not being registered
Image: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
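
One common way to handle missing values is to fill them with the mean of the observed values. The slides do not prescribe a method, so treat this as one option among several; the `impute_mean` helper and the income figures are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical customer incomes with two missing entries.
incomes = [30000, None, 50000, 40000, None]
filled = impute_mean(incomes)
print(filled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so the median or a model-based fill may suit skewed data better.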

Data Cleaning: Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Other data problems which require data cleaning

Preprocessing: Data Integration
Combines data from multiple sources into a coherent store
• E.g., combining the customer information dataset and the daily sale items dataset
Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs British units
Redundant data occur often when integrating multiple databases
• Object identification: the same attribute or object may have different names in different databases
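
A small sketch of integration, assuming two hypothetical sources: a customer table and a daily sales table. It shows two of the problems named above: the same attribute under different names (`customer_id` vs `cust_no`) and different scales (pounds vs kilograms). All names and values are invented for illustration.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (by definition)

# Source 1: customer information (hypothetical).
customers = [
    {"customer_id": 1, "name": "Amy"},
    {"customer_id": 2, "name": "Peter"},
]

# Source 2: daily sale items (hypothetical). The key is called cust_no here
# and weights are recorded in pounds.
sales = [
    {"cust_no": 1, "item": "rice", "weight_lb": 10.0},
    {"cust_no": 2, "item": "tea", "weight_lb": 2.0},
    {"cust_no": 1, "item": "salt", "weight_lb": 1.0},
]


def integrate(customers, sales):
    """Join sales to customers and convert weights to kilograms."""
    by_id = {c["customer_id"]: c for c in customers}
    merged = []
    for sale in sales:
        customer = by_id.get(sale["cust_no"])
        if customer is None:
            continue  # sale with no matching customer: skipped in this sketch
        merged.append({
            "customer_id": customer["customer_id"],
            "name": customer["name"],
            "item": sale["item"],
            "weight_kg": round(sale["weight_lb"] * LB_TO_KG, 3),
        })
    return merged


merged = integrate(customers, sales)
for row in merged:
    print(row)
```

Matching entities whose names differ (the Bill Clinton vs William Clinton case) needs an explicit mapping or fuzzy matching, which this sketch leaves out.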

Preprocessing: Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Why data reduction?
• A database/data warehouse may store terabytes of data.
• Complex data analysis may take a very long time to run on the complete dataset.
Data reduction strategies
• Dimensionality reduction: remove unimportant attributes (feature selection)
• Sampling: obtain a small sample to represent the whole dataset

Data Reduction: Feature Selection
Redundant attributes
• Duplicate much or all of the information contained in one or more other attributes
• E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students’ ID is often irrelevant to the task of predicting students’ GPA
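
One simple way to spot redundant numeric attributes is to look for pairs with a very high Pearson correlation. This is a sketch under assumptions: the 0.95 threshold, the helper names and the data (where sales tax is 5% of price) are all illustrative, and the check assumes no column is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose absolute correlation >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical product data; sales_tax is exactly 5% of price.
columns = {
    "price": [100.0, 250.0, 80.0, 400.0],
    "sales_tax": [5.0, 12.5, 4.0, 20.0],
    "rating": [4.0, 3.0, 5.0, 3.5],
}
pairs = redundant_pairs(columns)
print(pairs)
```

Correlation only detects linear redundancy between numeric attributes; spotting irrelevant attributes such as a student ID usually relies on domain knowledge or model-based feature importance.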

Data Reduction: Sampling
• Sampling: obtaining a small sample to represent the whole dataset
• Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
• Key principle: choose a representative subset of the data
◦ Simple random sampling: there is an equal probability of selecting any particular item
◦ Sampling without replacement: once an object is selected, it is removed from the population
◦ Sampling with replacement: a selected object is not removed from the population
◦ Stratified sampling: partition the data set, and draw samples from each partition
Image: https://research-methodology.net/sampling-in-primary-data-collection/
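
The sampling schemes above map closely onto Python's standard `random` module. The sketch below is illustrative: the function names, the 10% fraction and the 80/20 population are assumptions, and a seeded `random.Random` is used only to make runs repeatable.

```python
import random
from collections import defaultdict


def sample_without_replacement(items, k, rng):
    """Simple random sample: each item can be picked at most once."""
    return rng.sample(items, k)


def sample_with_replacement(items, k, rng):
    """Each draw is made from the full population, so repeats can occur."""
    return rng.choices(items, k=k)


def stratified_sample(items, key, fraction, rng):
    """Partition items by key(item), then sample each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))
without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)

# Hypothetical imbalanced population: 80 records in group "A", 20 in "B".
records = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
```

Stratified sampling keeps the 80/20 group proportions in the sample, which a small simple random sample does not guarantee.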

Preprocessing: Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
Methods
• Smoothing: remove noise from data
• Attribute/feature construction: new attributes constructed from the given ones
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a smaller, specified range
◦ Min-max normalization
◦ Z-score normalization

Data Transformation: Encoding Categorical Variables
Drop Categorical Variables
• The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach only works well if the columns do not contain useful information.
Label Encoding
• Assigns each unique value to a different integer.
One-Hot Encoding
• Creates new columns indicating the presence (or absence) of each possible value in the original data.
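
Label encoding and one-hot encoding can be sketched in plain Python. The helper names and the `colors` column are hypothetical; libraries commonly used on Kaggle provide equivalent encoders, but this version keeps the mechanics visible.

```python
def label_encode(values):
    """Map each unique value to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Create one 0/1 column per category (categories sorted by name)."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories]
            for value in values]
    return rows, categories


colors = ["red", "green", "red", "blue"]  # hypothetical categorical column

labels, mapping = label_encode(colors)
one_hot, categories = one_hot_encode(colors)
```

Label encoding implies an order between the integers, which suits ordinal attributes; one-hot encoding avoids that implied order for nominal attributes at the cost of extra columns.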

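Data Transformation: Normalization
Min-max normalization to [new_min_A, new_max_A]:
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = 0.716 \]
Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu}{\sigma} \]
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]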
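
Both normalizations are short enough to check numerically. The sketch below uses the standard min-max and z-score definitions with the slide's income figures; the function names and the choice of $73,600 as the z-score input are assumptions for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Rescale v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mu, sigma):
    """Express v as a number of standard deviations from the mean."""
    return (v - mu) / sigma


# Income range $12,000 to $98,000, normalized to [0.0, 1.0].
scaled = min_max(73_600, 12_000, 98_000)

# Mean 54,000 and standard deviation 16,000.
standardized = z_score(73_600, 54_000, 16_000)

print(round(scaled, 3), round(standardized, 3))
```

Min-max scaling needs the true minimum and maximum and is sensitive to outliers, while the z-score needs only the mean and standard deviation.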

References
• Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques (3rd Edition). University of Illinois and Simon Fraser University.
• CS109A, Introduction to Data Science, Harvard University (https://harvard-iacs.github.io/2019-CS109A/)

Kaggle: A Playground for Data Scientists
https://www.kaggle.com/alexisbcook/getting-started-with-kaggle-competitions
• Create an account at www.kaggle.com
• Data science competitions
• Short courses

Kaggle
What can we do in Kaggle?
• Training on GPUs
• Ranking/Competitions
• Team Merger/Group Competitions
• Notebook Submissions

WiDS 2022 Kaggle (Women in Data Science)
https://www.kaggle.com/c/widsdatathon2022

Ensemble Methods: Kaggle Champion

Bagging
https://www.kaggle.com/satishgunjal/ensemble-learning-bagging-boosting-stacking

Boosting
• Gradient Boosting Machine (GBM)
• Extreme Gradient Boosting Machine (XGBM)
• LightGBM
• CatBoost

Stacking
