
A closer look: Exploratory Data Analysis with Spark and IntelliJ IDEA


DataConLA, August 17, 2019
Links:
Zeppelin Notebook: https://github.com/MKhalusova/DataConDemo
Sign up for plugin early preview: http://bit.ly/BigDataToolsPreview

MKhalusova

August 17, 2019

Transcript

  1. A closer look: Exploratory Data Analysis with Spark and IntelliJ IDEA
     Maria Khalusova, JetBrains, @mariakhalusova
  2. About me
     • Python
     • Jupyter
     • PyCharm
     • pandas, NumPy
     • matplotlib, seaborn, etc.
     • scikit-learn, PyTorch, …
     • Scala
     • Zeppelin
     • IntelliJ IDEA
     • Spark
  3. EDA: we all do it
     Scope may vary:
     - IDA (initial data analysis): assessing data structure, missing values, descriptive statistics, distribution, etc.
     - Detecting outliers, exploring correlations, discovering patterns, etc.
     Techniques may vary:
     - Non-visual
     - Visualizations: scatterplots, histograms, barplots, etc.
  4. Sampling
     Pros: fits in memory, we can use pandas
     Cons: sampling techniques are hard, analysis is less accurate
  5. Spark it is then
     Ticks all the boxes (all the data, interactive, experimental)
     Various APIs: Python, Java, Scala
  6. Interactivity => Built-in visualization capabilities => As a code editor => Zeppelin
     http://zeppelin.apache.org
  7.

  8. The setup for the EDA (aka Exciting Detective Adventure):
     • Data: IntelliJ IDEA repo as a .txt file (1.16 GB)
     • Spark
     • Zeppelin
     • IntelliJ IDEA + BigDataTools + Scala plugin
  9. Links:
     • Zeppelin Notebook: https://github.com/MKhalusova/DataConDemo
     • Sign up for plugin early preview: http://bit.ly/BigDataToolsPreview
     @mariakhalusova
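
The non-visual checks listed on slide 3 (missing values, descriptive statistics, outliers) and the sampling trade-off from slide 4 can be illustrated with a small sketch. This is not code from the talk or the linked notebook: it uses only the Python standard library rather than Spark or pandas, the data is synthetic, `describe` is a hypothetical helper, and the 1.5 × IQR rule is just one common outlier convention assumed here.

```python
import random
import statistics


def describe(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Quartiles drive the (assumed) 1.5 * IQR outlier rule.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "outliers": sum(1 for v in present if v < low or v > high),
    }


# Synthetic, skewed "full" dataset (an assumption for illustration only).
rng = random.Random(42)
full = [rng.lognormvariate(3, 1) for _ in range(100_000)]
# Blank out roughly 1% of the values to simulate missing data.
full = [None if rng.random() < 0.01 else v for v in full]

# Simple random sample: small enough for in-memory tools, but only an estimate.
sample = rng.sample(full, 1_000)

full_stats = describe(full)
sample_stats = describe(sample)

# Print both summaries side by side to compare sample against full data.
for key in full_stats:
    print(f"{key:>8}  full={full_stats[key]:>12.2f}  sample={sample_stats[key]:>12.2f}")
```

Comparing the two columns of output shows how far a 1,000-row sample drifts from the full data on a skewed column. That drift is the "less accurate" cost noted on the sampling slide, and avoiding it is the argument for running the same summaries over all the data.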