Slide 1

INTRODUCTION TO DATAFRAMES IN SPARK
Jyotiska NK, DataWeave
@jyotiska

Slide 2

TALK AGENDA
• Overview
• Creating DataFrames
• Playing with different data formats and sources
• DataFrames Operations
• Integrating with Pandas DF
• Demo
• Q&A

Slide 3

OVERVIEW
• Distributed datasets organized into named columns (similar to tables in relational DBs or pandas DataFrames).
• Can be constructed from existing RDDs or external data sources.
• Can scale from small datasets to TBs/PBs on multi-node Spark clusters.
• APIs available in Python, Java, Scala and R.
• Bytecode generation and optimization using the Catalyst optimizer.
• Simpler DSL for complex, data-heavy operations.
• Faster runtime performance than vanilla RDDs.

Slide 4

CREATING DATAFRAMES
• From existing RDDs
• JSON datasets
• Hive tables
• Relational DBs - PostgreSQL, MySQL etc.
• External data sources - Amazon S3, CSV, Avro, ElasticSearch, Cassandra etc.
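The relational-DB path is the one source in the list above that the later slides do not demonstrate, so here is a minimal sketch of a JDBC read from PostgreSQL. All connection details (host, database, table, credentials) are hypothetical placeholders, and the actual read is shown as a comment because it needs a live SQLContext plus the PostgreSQL JDBC driver on the classpath.

```python
# Sketch: loading a PostgreSQL table as a DataFrame over JDBC.
# Every value below is a hypothetical placeholder, not a working endpoint.
jdbc_options = {
    "url": "jdbc:postgresql://localhost:5432/mydb",
    "dbtable": "cars",
    "user": "spark_user",
    "password": "secret",
}

# With a running SQLContext and the JDBC driver available, the read would be:
# df = sqlContext.read.format("jdbc").options(**jdbc_options).load()
print(jdbc_options["url"])
```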

Slide 5

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true').load('cars.csv')
df.select('year', 'model').write.format('com.databricks.spark.csv') \
        .save('newcars.csv')

Slide 6

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Slide 7

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Slide 8

CREATING DATAFRAMES (CONTD.)

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
results = sqlContext.sql("FROM src SELECT key, value").collect()

Slide 9

SUPPORTED DATATYPES
• Numeric Types - ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType
• StringType
• BinaryType
• BooleanType
• Datetime Types - TimestampType, DateType
• Complex Types - ArrayType, MapType, StructType

Slide 10

DATAFRAMES OPERATIONS
• Agg
• Describe
• Drop
• Filter
• GroupBy
• Join
• Replace
• Select
• Sort

Slide 11

AGGREGATION

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).

>>> df.agg({"age": "max"}).collect()
[Row(max(age)=5)]
>>> from pyspark.sql import functions as F
>>> df.agg(F.min(df.age)).collect()
[Row(min(age)=2)]

Slide 12

DESCRIBE

Computes statistics for numeric columns. This includes count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

>>> df.describe(['age', 'name']).show()
+-------+---+-----+
|summary|age| name|
+-------+---+-----+
|  count|  2|    2|
|   mean|3.5| null|
| stddev|1.5| null|
|    min|  2|Alice|
|    max|  5|  Bob|
+-------+---+-----+

Slide 13

DROP

Returns a new DataFrame that drops the specified column.

>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]

Slide 14

FILTER

Filters rows using the given condition.

>>> df.filter(df.age > 3).collect()
[Row(age=5, name=u'Bob')]
>>> df.where(df.age == 2).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter("age > 3").collect()
[Row(age=5, name=u'Bob')]
>>> df.where("age = 2").collect()
[Row(age=2, name=u'Alice')]

Slide 15

GROUPBY

Groups the DataFrame using the specified columns, to run aggregations on them.

>>> df.groupBy().avg().collect()
[Row(avg(age)=3.5)]
>>> df.groupBy('name').agg({'age': 'mean'}).collect()
[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
>>> df.groupBy(df.name).avg().collect()
[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
>>> df.groupBy(['name', df.age]).count().collect()
[Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]

Slide 16

JOIN

Joins with another DataFrame using the given join expression.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name=u'Bob', height=85)]
>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name=u'Bob', age=5)]

Slide 17

REPLACE

Returns a new DataFrame replacing a value with another value.

>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

Slide 18

SELECT

Projects a set of expressions and returns a new DataFrame.

>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

Slide 19

SORT

Returns a new DataFrame sorted by the specified column(s).

>>> df.sort(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import asc, desc
>>> df.sort(asc("age")).collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]

Slide 20

INTEGRATING WITH PANDAS
• df.toPandas() - Returns the contents of this DataFrame as a pandas.DataFrame.
  WARNING - Since the pandas DataFrame is held entirely in memory, this will cause problems if the dataset is too large.
• df.dtypes - Returns all column names and their data types as a list.
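To make the memory warning concrete, here is a sketch of what df.toPandas() would hand back for the small two-row DataFrame used throughout these slides, built directly with pandas (the contents are assumed from the earlier examples). Once converted, the data lives entirely in the driver's memory, so checking the footprint is worthwhile before converting anything large.

```python
import pandas as pd

# Hypothetical result of df.toPandas() for the two-row DataFrame from the
# earlier slides (names and ages assumed for illustration).
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [2, 5]})

# The whole result sits in local memory; measure its footprint in bytes.
bytes_used = int(pdf.memory_usage(deep=True).sum())
print(pdf.dtypes)
print(bytes_used)
```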

Slide 21

DEMO

Slide 22

Q&A