Slide 1

Slide 1 text

PySpark Tips and Tricks
Sahana Hegde
17 August, 2021

Slide 2

Slide 2 text

Sahana Hegde (not ‘Hedge’)

Timeline (Home, 2014, 2017, 2018, Present)
• Native: Mangalore, India
• Graduated with a BE in CS
• Worked as an OMS Dev in the retail domain
• Came to Ireland with the hope of starting afresh
• Graduated with an MSc in CSNL
• Interned with SAP
• Joined Optum in Oct 2018 as a Data Science Grad
• Currently a Data Scientist
• Got married in 2020 amidst COVID-19
• Travelled to make memories

Interests
• Yoga
• Singing
• Cooking
• Art
• Travel

© 2021 Optum, Inc. All rights reserved. Confidential property of Optum. Do not distribute or reproduce without express permission from Optum.

Slide 3

Slide 3 text

Architecture
• The SparkContext, in the driver program, coordinates each application as an independent set of processes
• The cluster manager allocates resources across the ‘n’ workers, whose executors run the work
• A task is a unit of work sent to one executor
• Caching data enables fast in-memory computation
• Aggregation results are either sent back to the driver or saved to disk

Slide 4

Slide 4 text

Concepts and properties of Spark
• Immutable
• Fault Tolerant
• Lazy Evaluation
• Data Locality
• Predicate Pushdown
• Catalyst Optimiser

Slide 5

Slide 5 text

RDDs, Datasets and DataFrames

RDDs
• Data import and low-level coding
• Application programming

DataFrames
• Table-based functionality
• SQL-style query access

Datasets
• More intense application programming in Java
• Require class declaration and definition

Slide 6

Slide 6 text

Data manipulations
• Initiating PySpark session
• Reading data from a file
• Missing values

Slide 7

Slide 7 text

Data manipulations
• One-way frequencies
• Sorting and filtering
• Unique/distinct values and counts
• Distinct occurrences

Slide 8

Slide 8 text

Filtering
• All titles with ‘Mrs’
• Names ending with ‘s’
• Names containing ‘ove’
• Identify columns with prefix or suffix

Slide 9

Slide 9 text

Column manipulation
• Creating a new column with mean age
• Using when condition to create a new variable

Slide 10

Slide 10 text

Window function
• Select the subset of columns
• Define partition function
• Apply partition
• Find the required value

Slide 11

Slide 11 text

Other useful functions
• Gather all occurrences as a list
• Random sampling with and without replacement: sample(withReplacement, fraction, seed=None)
• Cache: save data temporarily in memory
• Persist: store data at a chosen storage level

Slide 12

Slide 12 text

Thank you