Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Exploratory Seminar: Data Wrangling with Date /...

Kan Nishida
January 08, 2019

Exploratory Seminar: Data Wrangling with Date / Time Data & Visualizing It

Kan will be introducing various data wrangling techniques to clean and transform Date and Time data.

Kan Nishida

January 08, 2019
Tweet

More Decks by Kan Nishida

Other Decks in Technology

Transcript

  1. Kan Nishida co-founder/CEO Exploratory Summary Beginning of 2016, launched Exploratory,

    Inc. to make Data Science available for everyone. Prior to Exploratory, Kan was a director of development at Oracle leading development teams for building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Instructor
  2. Data Science is not just for Engineers and Statisticians. Exploratory

    makes it possible for Everyone to do Data Science. The Third Wave
  3. First Wave Second Wave Third Wave Proprietary Open Source UI

    & Programming Programming 2016 2000 1976 Monetization Commoditization Democratization Statisticians Data Scientists Smart Waves - Machine Learning / AI Algorithms Experience Tools Open Source UI & Automation Business Users Theme Users Exploratory
  4. Questions Data Science Workflow Communication Data Access Data Wrangling Data

    Visualization Machine Learning / Statistics Exploration
  5. Questions What you can do with Exploratory Communication Data Access

    Data Wrangling Visualization Machine Learning / Statistics Exploratory Data Analysis
  6. • User Activity Data • Each row represents an user

    access for a fictional online service. • There are 6 columns, timestamp, user id, event type, IP address, OS, and OS version. • Download EDF Data
  7. Questions 1. What is the duration (date range) of this

    data? 2. What is DAU (Daily Active Users) and how it’s been changed over time? 3. Which days of week (e.g. Monday) and hours are more active?
  8. Character vs. Date/Time Date data is recognized as character. Dates

    duration is igonored Sorted as character. e.g. 10 (Oct.) comes after 1 (Jan) Data: Date-unicorn.csv
  9. Character vs. Date/Time Various transformation on date data is available

    Data is sorted as dates. Duration honors date interval.
  10. By making it to Date & Time data type, you

    can do a lot of cool things.
  11. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  12. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  13. Only codes you need to know • Year • Month

    • Day • Hour • Minute • Second
  14. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  15. • 2017-06-24 • June • 24th • 2017 • 175th

    day in 2017 • Saturday Date Attributes • 2017-06-24
  16. • AM • 8 hours • 10 minutes • 10

    seconds Time Attributes • 2017-01-01 08:10:10
  17. Extract Day of Week From Column Header Menu 1. Select

    “Extract” 2. Select “Day of Week - Short Name (Mon)”
  18. Ordinal - Ordered Factor • Month, Day of Week should

    be sorted in the natural order. • R’s factor data type supports ‘Order’ information. • Functions like ‘wday’, ‘month’, take care of it.
  19. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  20. Today 2018 2017 Previous Year This Year 2016 Last 2

    Years 2019 Year to Date Relative Date
  21. Absolute Date • equal to / not equal to •

    is in / is not in • earlier than • later than • between
  22. Today 2018 2017 Year == 2017 Date > 1 year

    ago 2016 Year > 2016 2019 Between 2017-06-01 and 2018-1-30 Absolute Date
  23. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  24. 3 weeks 4 weeks 2 weeks First Date Last Date

    First Date Last Date First Date Last Date Duration
  25. From Column Header Menu 1. Select “Change Data Type” 2.

    Select “Convert to Number” 3. Select “Days” 2. Convert the lifetime to numeric data type (in days)
  26. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  27. # round round_date(start_time, unit="week") # ceil ceiling_date(start_time, unit="week") # floor

    floor_date(start_time, unit="week") Round vs. Ceiling vs. Floor
  28. To round date to week.. From Column Header Menu 1.

    Select “Round” 2. Select “Round Date ” 3. Select “Week”
  29. 1. Convert Character to Date / Time 2. Extract Date

    / Time Attributes 3. Filter with Date / Time 4. Duration 5. Round Date / Time 6. Timezone Common Tasks
  30. • We have Temperature Data of London and Tokyo •

    Each row represents a temperature for a certain date/time in year 2016. There are 17,498 temperature data of London and 19,489 temperature data of Tokyo • Each temperature record has date/time, longitude, latitude, temperature, etc • Filename: Date-London-temp.csv and Date-Tokyo-temp.csv Timezone - Data
  31. For London, 2:00pm is the peak of Average temperature →

    It sounds reasonable. For Tokyo, 5:00am is the peak of Average temperature → ??? When you compare hourly temperature data between London and Tokyo Data: Date-London-temp.csv, Date-Tokyo-temp.csv
  32. • From the hourly temperature data of Tokyo, I want

    to know what time is the most hot in the day, but the time indicated by the date / time data is different from the actual time in Tokyo • We would like to compare average hourly temperatures of two cities with different time zones Problem
  33. UTC (Coordinated Universal Time) • It is the base point

    for all other time zones in the world • POSIXct is basically based on the UTC • UTC and GMT (Greenwich Mean Time) are almost identical. (→ That is why the hourly temperature data for London is displayed correctly on the previous chart.)
  34. with_tz # Append Timezone information with_tz(ymd_hms("2015-10-01 02:20:34”)) → "2015-09-30 19:20:34

    PDT" Default value of with_tz is local machine’s timezone. In this example, PDT (Pacific Daylight Time)
  35. with_tz(ymd_hms("2015-10-01 02:20:34”)) → "2015-09-30 19:20:34 PDT" with_tz(ymd_hms("2015-10-01 02:20:34"), tz =

    "Asia/Tokyo") → "2015-10-01 11:20:34 JST" with_tz By specifying timezone information, You can convert date/time to any timezone
  36. January 15th (Tuesday), 2019 • Data Wrangling: Working with Text

    Data Planned • Analytics 101 - When to use which algorithms? • Data Wrangling: Introduction to Regular Expression https://exploratory.io/online-seminar