Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Exploratory: Data Wrangling - Working with Text Data

Kan Nishida
January 15, 2019

Exploratory: Data Wrangling - Working with Text Data

Kan will be introducing various data wrangling techniques to clean and transform Text data.

Kan Nishida

January 15, 2019
Tweet

More Decks by Kan Nishida

Other Decks in Science

Transcript

  1. Kan Nishida co-founder/CEO Exploratory Summary Beginning of 2016, launched Exploratory,

    Inc. to make Data Science available for everyone. Prior to Exploratory, Kan was a director of development at Oracle leading development teams for building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Instructor
  2. Data Science is not just for Engineers and Statisticians. Exploratory

    makes it possible for Everyone to do Data Science. The Third Wave
  3. First Wave Second Wave Third Wave Proprietary Open Source UI

    & Programming Programming 2016 2000 1976 Monetization Commoditization Democratization Statisticians Data Scientists Smart Waves - Machine Learning / AI Algorithms Experience Tools Open Source UI & Automation Business Users Theme Users Exploratory
  4. Questions Data Science Workflow Communication Data Access Data Wrangling Data

    Visualization Machine Learning / Statistics Exploration
  5. Questions What you can do with Exploratory Communication Data Access

    Data Wrangling Visualization Machine Learning / Statistics Exploratory Data Analysis
  6. Regular Expression [\d,]+(\.\d+)? “Exchange Rate for Bitcoin to USD is

    14,850.00” 14,850.00 Extract numeric values from this text.
  7. • Scrape the Unicorn Data from CB Insight web page

    (https:// www.cbinsights.com/research-unicorn-companies). • Each row represents an Unicorn Company. • Each company record contains 6 columns: Company Name, Investors, Valuation, Date Joined, Country, Industry Data
  8. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  9. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  10. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  11. When we have the same number of NAs for every

    single column, that seems to be suspicious…
  12. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  13. “Valuation ($B)” column has data that looks numeric, but it’s

    registered as Character data type. We want to take the number part of the data and register it as Numeric data type.
  14. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  15. “Date Joined” column has a date data but registered as

    Character data type. We want to convert this to Date data type so that we can understand the trend over the time series data better.
  16. From Column Header Menu 1. Select “Change Data Type” 2.

    Select “Convert to Date / Time” 3. Select “Month, Day, Year”
  17. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  18. We want to make each entry to have its own

    row so that we can count and compare the unicorns by investors easier.
  19. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  20. We want to merge all Google related entries into one

    Google by removing ‘ Capital’ and ‘ Ventures’ from the data.
  21. Add “ Ventures” to the previous function. We can use

    a vertical bar which is an equivalent of ‘OR’ condition.
  22. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  23. • Some entries contain special characters. • These are the

    reserved special characters for ‘new line’, ‘tab’, which can often found in the data that is scraped from webpages. • We want to remove those characters.
  24. str_clean function removes extra spaces - at the beginning /

    at the end, and multiple spaces in the middle of text.
  25. 1. Scrape Data from Web Page. 2. Remove Empty Rows

    3. Convert Character to Numeric 4. Convert Character to Date 5. Separate Text to Multiple Rows 6. Remove Letters from Text Data 7. Clean up by Removing Extra Spaces and Special Characters 8. Create ‘Other’ group
  26. At Viz View: 1. Select “Line” to Type 2. Select

    “Date Joined” to X Axis 3. Select “(Number of Rows)” to Y Axis 4. Select “Select Investors” to Color By
  27. Define “Cumulative Sum” window function. 1. Click the hamburger icon

    next to Y Axis and select “Window Calculation” 2. Select “Cumulative” at Calculation Type 3. Select “Sum” at Summarize Values Using
  28. We can create ‘Other’ group based on the frequency -

    number of rows for each investor. We want to keep the top 10 investors based on their number of unicorns and group other investors under ‘Other’.
  29. From Column Header Menu 1. Select “Create ‘Other Group for”

    2. Select “Least/Most Frequent Values”
  30. Now ‘Other’ is the largest group with the most of

    the unicorns. We want to filter this out.
  31. Create a chart filter to exclude ‘Other’. 1. Select “Select

    Investors” at Column 2. Select “is not equal to” at Operator 3. Select “Other” at Value