Slide 1

INTRODUCTION TO DATAFRAMES IN SPARK
Jyotiska NK, DataWeave
@jyotiska

Slide 2

TALK AGENDA
• Overview
• Creating DataFrames
• Playing with different data formats and sources
• DataFrames Operations
• Integrating with Pandas DF
• Demo
• Q&A

Slide 3

OVERVIEW
• Distributed datasets organized into named columns (similar to tables in relational DBs or pandas DataFrames).
• Can be constructed from existing RDDs or external data sources.
• Can scale from small datasets to TBs/PBs on multi-node Spark clusters.
• APIs available in Python, Java, Scala and R.
• Bytecode generation and optimization using the Catalyst optimizer.
• Simpler DSL for complex, data-heavy operations.
• Faster runtime performance than vanilla RDDs.

Slide 4

CREATING DATAFRAMES
• From existing RDDs
• JSON datasets
• Hive tables
• Relational DBs - PostgreSQL, MySQL etc.
• External data sources - Amazon S3, CSV, Avro, ElasticSearch, Cassandra etc.
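The relational-DB path is the one source in the list above that the later slides do not demonstrate, so here is a minimal sketch of a JDBC read from PostgreSQL. All connection details (host, database, table, credentials) are hypothetical placeholders, and the actual read is shown as a comment because it needs a live SQLContext plus the PostgreSQL JDBC driver on the classpath.

```python
# Sketch: loading a PostgreSQL table as a DataFrame over JDBC.
# Every value below is a hypothetical placeholder, not a working endpoint.
jdbc_options = {
    "url": "jdbc:postgresql://localhost:5432/mydb",
    "dbtable": "cars",
    "user": "spark_user",
    "password": "secret",
}

# With a running SQLContext and the JDBC driver available, the read would be:
# df = sqlContext.read.format("jdbc").options(**jdbc_options).load()
print(jdbc_options["url"])
```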

Slide 5

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true').load('cars.csv')
df.select('year', 'model').write.format('com.databricks.spark.csv') \
        .save('newcars.csv')

Slide 6

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Slide 7

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Slide 8

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
results = sqlContext.sql("FROM src SELECT key, value").collect()

Slide 9

SUPPORTED DATATYPES
• Numeric Types - ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType
• StringType
• BinaryType
• BooleanType
• Datetime Types - TimestampType, DateType
• Complex Types - ArrayType, MapType, StructType

Slide 10

DATAFRAMES OPERATIONS
• Agg
• Describe
• Drop
• Filter
• GroupBy
• Join
• Replace
• Select
• Sort

Slide 11

AGGREGATION

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).

>>> df.agg({"age": "max"}).collect()
[Row(max(age)=5)]
>>> from pyspark.sql import functions as F
>>> df.agg(F.min(df.age)).collect()
[Row(min(age)=2)]

Slide 12

DESCRIBE

Computes statistics for numeric columns. This includes count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

>>> df.describe(['age', 'name']).show()
+-------+---+-----+
|summary|age| name|
+-------+---+-----+
|  count|  2|    2|
|   mean|3.5| null|
| stddev|1.5| null|
|    min|  2|Alice|
|    max|  5|  Bob|
+-------+---+-----+

Slide 13

DROP

Returns a new DataFrame that drops the specified column.

>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]

Slide 14

FILTER

Filters rows using the given condition.

>>> df.filter(df.age > 3).collect()
[Row(age=5, name=u'Bob')]
>>> df.where(df.age == 2).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter("age > 3").collect()
[Row(age=5, name=u'Bob')]
>>> df.where("age = 2").collect()
[Row(age=2, name=u'Alice')]

Slide 15

GROUPBY

Groups the DataFrame using the specified columns, to run aggregations on them.

>>> df.groupBy().avg().collect()
[Row(avg(age)=3.5)]
>>> df.groupBy('name').agg({'age': 'mean'}).collect()
[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
>>> df.groupBy(df.name).avg().collect()
[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
>>> df.groupBy(['name', df.age]).count().collect()
[Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]

Slide 16

JOIN

Joins with another DataFrame using the given join expression.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name=u'Bob', height=85)]
>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name=u'Bob', age=5)]

Slide 17

REPLACE

Returns a new DataFrame replacing a value with another value.

>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

Slide 18

SELECT

Projects a set of expressions and returns a new DataFrame.

>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

Slide 19

SORT

Returns a new DataFrame sorted by the specified column(s).

>>> df.sort(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import asc, desc
>>> df.sort(asc("age")).collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]

Slide 20

INTEGRATING WITH PANDAS
• df.toPandas() - Returns the contents of this DataFrame as a pandas.DataFrame.
  WARNING - Since the pandas DataFrame is held entirely in memory, this will cause problems if the dataset is too large.
• df.dtypes - Returns all column names and their data types as a list.
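To make the memory warning concrete, here is a sketch of what df.toPandas() would hand back for the small two-row DataFrame used throughout these slides, built directly with pandas (the contents are assumed from the earlier examples). Once converted, the data lives entirely in the driver's memory, so checking the footprint is worthwhile before converting anything large.

```python
import pandas as pd

# Hypothetical result of df.toPandas() for the two-row DataFrame from the
# earlier slides (names and ages assumed for illustration).
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [2, 5]})

# The whole result sits in local memory; measure its footprint in bytes.
bytes_used = int(pdf.memory_usage(deep=True).sum())
print(pdf.dtypes)
print(bytes_used)
```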

Slide 21

DEMO

Slide 22

Q&A