PyLadies Dublin August Meetup: PySpark 101: Tips and Tricks

We have Olga Minguett talking about "Text Classification using HuggingFace Transformers" and Sahana Hegde talking about "PySpark 101: Tips and Tricks".

Big thanks to Optum for partnering with us and for having Olga and Sahana give their talks.

👉 Event Page: https://www.meetup.com/PyLadiesDublin/events/279318562/

🎤 TALKS
=========
TALK 1: Text Classification using HuggingFace Transformers
----------------------------------------------------------------------------------
An introduction to HuggingFace Transformers in two parts. Theory: what it is, how to use it, and the datasets and tasks you can work with. Practice: a text classification example using HuggingFace Transformers.

ABOUT OLGA: Olga Minguett is studying for a Master’s in Artificial Intelligence, with an interest in AI in healthcare. She currently works as a data scientist for a technology and healthcare services company that is part of UnitedHealth Group. https://Linkedin.com/in/olgaminguett

TALK 2: PySpark 101: Tips and Tricks
--------------------------------------------------
In this session, I'd like to share, through code snippets, a few tips and tricks that I've learnt over the years while using PySpark in my day-to-day work. These will help you write more efficient code that leads to better/faster results.

ABOUT SAHANA: I am a Data Scientist working with UnitedHealth Group during office hours, and I'm a passionate cook and yoga enthusiast outside. I love to travel in my free time and use my phone's lens to capture beautiful moments. https://www.linkedin.com/in/sahana-hegde

❤️ A BIG THANK YOU
====================
I'd like to thank all those who have been attending and watching our videos. We appreciate your support, as it took a lot of work to set up. If you are curious, you can read Vicky's post about it: https://dev.to/pyladiesdub/live-streaming-from-zoom-meet-via-obs-to-youtube-2l3h. Any feedback would help make this process smoother and easier to manage. 🥰

📢 CALL FOR SPEAKERS for 2021 (from Sep onwards)
=========================================
Interested in speaking at our upcoming meetups? Please submit your talk details to: https://pyladiesdublin.typeform.com/to/VvW3iME6

If you know of speakers you would like us to invite, let us know too. Being a virtual event makes it easier to invite speakers from further afield than Ireland. 😊

🤔 QUESTIONS
==============
Email dublin@pyladies.com.

PyLadies Dublin

August 17, 2021

Transcript

  1. PySpark Tips and Tricks
    Sahana Hegde
    17 August, 2021

  2. Sahana Hegde (not ‘Hedge’)
    © 2021 Optum, Inc. All rights reserved. Confidential property of Optum. Do not distribute or reproduce without express permission from Optum.
    Timeline: Home • 2014 • 2017 • 2018 • Present • Interests
    • Native: Mangalore, India
    • Graduated from BE in CS
    • Worked as an OMS Dev in Retail domain
    • Came to Ireland with the hope of starting afresh
    • Graduated from MSc in CSNL
    • Interned with SAP
    • Joined Optum in Oct 2018 as a Data Science Grad
    • Currently a Data Scientist
    • Got married in 2020 amidst COVID-19
    • Travelled to make memories
    • Yoga
    • Singing
    • Cooking
    • Art
    • Travel
  3. Architecture
    • SparkContext spins an independent process
    • Cluster Manager assigns work to the workers (‘n’ workers)
    • Task executes a unit of work
    • Benefit from in-memory computation powered by caching data
    • Aggregation results are either sent back to the driver or saved onto the disk
  4. Concepts and properties of Spark
    • Immutable
    • Fault Tolerant
    • Lazy Evaluation
    • Data Locality
    • Predicate Pushdown
    • Catalyst Optimiser
  5. RDDs, Datasets and DataFrames
    RDDs
    • Data Import and low-level coding
    • Application programming
    DataFrames
    • Table-based functionality
    • SQL-style query access
    Datasets
    • More intense application programming in Java
    • Require Class declaration and definition
  6. Data manipulations
    • Initiating PySpark session
    • Reading data from a file
    • Missing values
  7. Data manipulations
    • One-way frequencies
    • Sorting and filtering
    • Unique/distinct values and counts
    • Distinct occurrences
  8. Filtering
    • All titles with ‘Mrs’
    • Names ending with ‘s’
    • Names containing ‘ove’
    • Names containing ‘ove’
    • Identify columns with prefix or suffix
  9. Column manipulation
    • Creating a new column with mean age
    • Using when condition to create a new variable
  10. Window function
    • Select the subset of columns
    • Define partition function
    • Apply partition
    • Find the required value
  11. Other useful functions
    • Gather all occurrences as a list
    • Random sampling with and without replacement
    • sample(withReplacement, fraction, seed=None)
    • Cache: save data temporarily in memory
    • Persist: persist at any storage level
  12. Thank you
    sahana.hegde@optum.com