PyLadies Dublin August Meetup: PySpark 101: Tips and Tricks

We have Olga Minguett talking about "Text Classification using HuggingFace Transformers" and Sahana Hegde talking about "PySpark 101: Tips and Tricks".

Big thanks to Optum for partnering with us and for having Olga and Sahana give their talks.

👉 Event Page: https://www.meetup.com/PyLadiesDublin/events/279318562/

🎤 TALKS
=========
TALK 1: Text Classification using HuggingFace Transformers
----------------------------------------------------------------------------------
An introduction to HuggingFace Transformers in two sections. Theory: what it is, how to use it, and the datasets and tasks you can work with. Practice: a text classification example using HuggingFace Transformers.

ABOUT OLGA: Olga Minguett is a Master’s student in Artificial Intelligence with an interest in AI in Healthcare. She currently works as a data scientist for a technology and healthcare services company that is part of UnitedHealth Group. https://Linkedin.com/in/olgaminguett

TALK 2: PySpark 101: Tips and Tricks
--------------------------------------------------
In this session, I'd like to share a few tips and tricks that I've learnt over the years while using PySpark in my day-to-day activities by showing code snippets. These elements will help you create more efficient code that leads to better/faster results.

ABOUT SAHANA: I am a Data Scientist working with UnitedHealth Group during office hours, and I'm a passionate cook and yoga enthusiast outside. I love to travel in my free time and use my phone's lens to capture beautiful moments. https://www.linkedin.com/in/sahana-hegde

❤️ A BIG THANK YOU
====================
I'd like to thank everyone who has been attending and watching our videos. We appreciate your support, as it took a lot of work to set up. If you are curious, you can read Vicky's post about it: https://dev.to/pyladiesdub/live-streaming-from-zoom-meet-via-obs-to-youtube-2l3h. Any feedback would help make this process smoother and easier to manage. 🥰

📢 CALL FOR SPEAKERS for 2021 (from Sep onwards)
=========================================
Interested in speaking at our upcoming meetups? Please submit your talk details to: https://pyladiesdublin.typeform.com/to/VvW3iME6

If you know of speakers you would like us to invite, let us know too. Being a virtual event means we can invite speakers from further afield than Ireland. 😊

🤔 QUESTIONS
==============
Email [email protected].

PyLadies Dublin

August 17, 2021
Transcript

  1. PySpark Tips and Tricks
    Sahana Hegde
    17 August, 2021

  2. Sahana Hegde (not ‘Hedge’)
    © 2021 Optum, Inc. All rights reserved. Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 2
    Home
    • Native: Mangalore, India
    2014
    • Graduated from BE in CS
    • Worked as an OMS Dev in Retail domain
    2017
    • Came to Ireland with the hope of starting afresh
    • Graduated from MSc in CSNL
    • Interned with SAP
    2018
    • Joined Optum in Oct 2018 as a Data Science Grad
    Present
    • Currently a Data Scientist
    • Got married in 2020 amidst COVID-19
    • Travelled to make memories
    Interests
    • Yoga • Singing • Cooking • Art • Travel

  3. Architecture
    • SparkContext spins up an independent process
    • Cluster Manager assigns work to the workers (‘n’ workers)
    • Task executes a unit of work
    • Benefit from in-memory computation powered by caching data
    • Aggregation results are either sent back to the driver or saved to disk

  4. Concepts and properties of Spark
    • Immutable
    • Fault Tolerant
    • Lazy Evaluation
    • Data Locality
    • Predicate Pushdown
    • Catalyst Optimiser

  5. RDDs, Datasets and DataFrames
    RDDs
    • Data import and low-level coding
    • Application programming
    DataFrames
    • Table-based functionality
    • SQL-style query access
    Datasets
    • More intense application programming in Java
    • Require class declaration and definition

  6. Data manipulations
    Initiating PySpark session
    Reading data from a file
    Missing values

  7. Data manipulations
    One-way frequencies
    Sorting and filtering
    Unique/distinct values and counts
    Distinct occurrences

  8. Filtering
    All titles with ‘Mrs’
    Names ending with ‘s’
    Names containing ‘ove’
    Identify columns with prefix or suffix

  9. Column manipulation
    Creating a new column with mean age
    Using when condition to create a new variable

  10. Window function
    Select the subset of columns
    Define partition function
    Apply partition
    Find the required value

  11. Other useful functions
    Gather all occurrences as a list
    Random sampling with and without replacement
    sample(withReplacement, fraction, seed=None)
    Cache: Save data temporarily in memory
    Persist: Save data at any chosen storage level
