
Introduction to DataFrames in Spark


Jyotiska NK

July 18, 2015

Transcript

1. TALK AGENDA • Overview • Creating DataFrames • Playing with different data formats and sources • DataFrame Operations • Integrating with Pandas DF • Demo • Q&A
2. OVERVIEW • Distributed datasets organized into named columns (similar to tables in relational DBs or pandas DataFrames in Python). • Can be constructed from existing RDDs or from external data sources. • Scale from small datasets to TBs/PBs on multi-node Spark clusters. • APIs available in Python, Java, Scala and R. • Bytecode generation and optimization via the Catalyst optimizer. • A simpler DSL for complex, data-heavy operations (see the sketch below). • Faster runtime performance than vanilla RDDs.
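To make the DSL point concrete, here is a minimal sketch (not from the original deck) that builds a toy DataFrame from in-memory data, assuming an already-running SparkContext sc:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Toy data standing in for a real dataset (hypothetical values).
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])

# One readable chain instead of hand-rolled map/filter/reduceByKey over an RDD.
df.filter(df.age > 2).groupBy('name').count().show()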
3. CREATING DATAFRAMES • From existing RDDs • JSON datasets • Hive tables • Relational DBs - PostgreSQL, MySQL etc. (JDBC sketch below) • External data sources - Amazon S3, CSV, Avro, ElasticSearch, Cassandra etc.
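The following slides cover RDDs, JSON and Hive; for the relational-DB case, a hedged sketch using Spark's generic JDBC source (the URL, table name and credentials below are placeholders, and the PostgreSQL JDBC driver must be on the classpath):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Placeholder connection details - substitute your own.
df = sqlContext.read.format('jdbc').options(
    url='jdbc:postgresql://localhost:5432/mydb',
    dbtable='public.cars',
    user='spark',
    password='secret'
).load()
df.printSchema()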
4. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import SQLContext

   sqlContext = SQLContext(sc)
   df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('cars.csv')
   df.select('year', 'model').write.format('com.databricks.spark.csv').save('newcars.csv')
5. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import SQLContext

   sqlContext = SQLContext(sc)
   df = sqlContext.read.json("examples/src/main/resources/people.json")
   df.show()
   df.printSchema()
   df.registerTempTable("people")
   teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
6. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import SQLContext, Row

   sqlContext = SQLContext(sc)
   lines = sc.textFile("examples/src/main/resources/people.txt")
   parts = lines.map(lambda l: l.split(","))
   people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
   schemaPeople = sqlContext.createDataFrame(people)
   schemaPeople.registerTempTable("people")
   teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
7. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import HiveContext

   sqlContext = HiveContext(sc)
   sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
   sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
   # Queries can be expressed in HiveQL.
   results = sqlContext.sql("FROM src SELECT key, value").collect()
8. SUPPORTED DATATYPES • Numeric Types - ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType • StringType • BinaryType • BooleanType • Datetime Types - TimestampType, DateType • Complex Types - ArrayType, MapType, StructType (explicit-schema sketch below)
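These types live in pyspark.sql.types and can be assembled into an explicit schema rather than relying on inference; a minimal sketch, reusing the toy data from the overview:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sqlContext = SQLContext(sc)

# Build the schema explicitly; the third argument marks the column nullable.
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], schema)
df.printSchema()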
9. AGGREGATION Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
   >>> df.agg({"age": "max"}).collect()
   [Row(max(age)=5)]
   >>> from pyspark.sql import functions as F
   >>> df.agg(F.min(df.age)).collect()
   [Row(min(age)=2)]
10. DESCRIBE Computes statistics for numeric columns: count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
    >>> df.describe(['age', 'name']).show()
    +-------+---+-----+
    |summary|age| name|
    +-------+---+-----+
    |  count|  2|    2|
    |   mean|3.5| null|
    | stddev|1.5| null|
    |    min|  2|Alice|
    |    max|  5|  Bob|
    +-------+---+-----+
11. DROP Returns a new DataFrame that drops the specified column.
    >>> df.drop('age').collect()
    [Row(name=u'Alice'), Row(name=u'Bob')]
    >>> df.drop(df.age).collect()
    [Row(name=u'Alice'), Row(name=u'Bob')]
    >>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
    [Row(age=5, height=85, name=u'Bob')]
    >>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
    [Row(age=5, name=u'Bob', height=85)]
12. FILTER Filters rows using the given condition.
    >>> df.filter(df.age > 3).collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where(df.age == 2).collect()
    [Row(age=2, name=u'Alice')]
    >>> df.filter("age > 3").collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where("age = 2").collect()
    [Row(age=2, name=u'Alice')]
13. GROUPBY Groups the DataFrame using the specified columns, to run aggregations on them.
    >>> df.groupBy().avg().collect()
    [Row(avg(age)=3.5)]
    >>> df.groupBy('name').agg({'age': 'mean'}).collect()
    [Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
    >>> df.groupBy(df.name).avg().collect()
    [Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
    >>> df.groupBy(['name', df.age]).count().collect()
    [Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]
14. JOIN Joins with another DataFrame using the given join expression.
    >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
    [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
    >>> cond = [df.name == df3.name, df.age == df3.age]
    >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
    [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
    >>> df.join(df2, 'name').select(df.name, df2.height).collect()
    [Row(name=u'Bob', height=85)]
    >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
    [Row(name=u'Bob', age=5)]
15. REPLACE Returns a new DataFrame replacing a value with another value.
    >>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
    +----+------+----+
    | age|height|name|
    +----+------+----+
    |  10|    80|   A|
    |   5|  null|   B|
    |null|  null| Tom|
    |null|  null|null|
    +----+------+----+
16. SELECT Projects a set of expressions and returns a new DataFrame.
    >>> df.select('*').collect()
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    >>> df.select('name', 'age').collect()
    [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
    >>> df.select(df.name, (df.age + 10).alias('age')).collect()
    [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
17. SORT Returns a new DataFrame sorted by the specified column(s).
    >>> df.sort(df.age.desc()).collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
    >>> df.sort("age", ascending=False).collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
    >>> df.orderBy(df.age.desc()).collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
    >>> from pyspark.sql.functions import asc, desc
    >>> df.sort(asc("age")).collect()
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    >>> df.orderBy(desc("age"), "name").collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
18. INTEGRATING WITH PANDAS • df.toPandas() - Returns the contents of this DataFrame as a pandas.DataFrame. WARNING - since the pandas DataFrame lives entirely in driver memory, this will cause problems if the dataset is too large. • df.dtypes - Returns all column names and their data types as a list. (Sketch below.)
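A minimal sketch of the hand-off, reusing the toy DataFrame from the overview; the limit() guard is one way to respect the memory warning above (the cutoff of 1000 is an arbitrary placeholder):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])

print(df.dtypes)               # [('name', 'string'), ('age', 'bigint')]

# Cap the rows collected to the driver before converting -
# toPandas() materializes everything in local memory.
pdf = df.limit(1000).toPandas()
print(pdf.describe())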
  19. Q&A