
Introduction to DataFrames in Spark

Jyotiska NK

July 18, 2015

Transcript

  1. TALK AGENDA
     • Overview
     • Creating DataFrames
     • Playing with different data formats and sources
     • DataFrame Operations
     • Integrating with Pandas DF
     • Demo
     • Q&A

  2. OVERVIEW
     • Distributed datasets loaded into named columns (similar to tables in
       relational DBs or Pandas DataFrames in Python).
     • Can be constructed from existing RDDs or external data sources.
     • Can scale from small datasets to TBs/PBs on multi-node Spark clusters.
     • APIs available in Python, Java, Scala and R.
     • Bytecode generation and optimization using the Catalyst optimizer.
     • Simpler DSL to perform complex and data-heavy operations (see the
       sketch after this slide).
     • Faster runtime performance than vanilla RDDs.

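     To make the last two bullets concrete, here is a minimal sketch (not
     from the deck; it assumes a hypothetical rdd of (dept, salary) pairs
     and a df with the same columns) contrasting the RDD API with the
     DataFrame DSL for a per-group average:

         # Assumed input: rdd holds (dept, salary) pairs; names are hypothetical.
         # RDD API: explicit (sum, count) bookkeeping to compute a per-key average.
         avg_rdd = (rdd.map(lambda kv: (kv[0], (kv[1], 1)))
                       .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                       .mapValues(lambda s: s[0] / float(s[1])))

         # DataFrame DSL: one line, and Catalyst plans the execution.
         avg_df = df.groupBy('dept').avg('salary')
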
  3. CREATING DATAFRAMES
     • From existing RDDs
     • JSON datasets
     • Hive tables
     • Relational DBs - PostgreSQL, MySQL etc. (see the JDBC sketch after
       this slide)
     • External data sources - Amazon S3, CSV, Avro, ElasticSearch,
       Cassandra etc.

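     The next four slides cover the CSV, JSON, RDD and Hive cases; for the
     relational-DB case, a minimal sketch using the JDBC data source might
     look like this (the URL, table name and credentials are placeholders,
     and the PostgreSQL JDBC driver is assumed to be on the classpath):

         # All connection details below are placeholders, not from the deck.
         df = sqlContext.read.format('jdbc').options(
             url='jdbc:postgresql://localhost:5432/mydb',
             dbtable='users',
             user='spark',
             password='secret'
         ).load()
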
  4. CREATING DATAFRAMES (CONTD.)

     from pyspark.sql import SQLContext

     sqlContext = SQLContext(sc)
     # Read a CSV file via the spark-csv package, treating the first line
     # as a header.
     df = sqlContext.read.format('com.databricks.spark.csv') \
         .options(header='true').load('cars.csv')
     # Project two columns and write them back out as CSV.
     df.select('year', 'model').write \
         .format('com.databricks.spark.csv').save('newcars.csv')

  5. CREATING DATAFRAMES (CONTD.)

     from pyspark.sql import SQLContext

     sqlContext = SQLContext(sc)
     df = sqlContext.read.json("examples/src/main/resources/people.json")
     df.show()
     df.printSchema()
     # Register the DataFrame as a temp table so it can be queried with SQL.
     df.registerTempTable("people")
     teenagers = sqlContext.sql(
         "SELECT name FROM people WHERE age >= 13 AND age <= 19")

  6. CREATING DATAFRAMES (CONTD.)

     from pyspark.sql import SQLContext, Row

     sqlContext = SQLContext(sc)
     lines = sc.textFile("examples/src/main/resources/people.txt")
     parts = lines.map(lambda l: l.split(","))
     # Turn each line into a Row; the schema is inferred from the Rows.
     people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
     schemaPeople = sqlContext.createDataFrame(people)
     schemaPeople.registerTempTable("people")
     teenagers = sqlContext.sql(
         "SELECT name FROM people WHERE age >= 13 AND age <= 19")

  7. CREATING DATAFRAMES (CONTD.)

     from pyspark.sql import HiveContext

     sqlContext = HiveContext(sc)
     sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
     sqlContext.sql("LOAD DATA LOCAL INPATH "
                    "'examples/src/main/resources/kv1.txt' INTO TABLE src")
     # Queries can be expressed in HiveQL.
     results = sqlContext.sql("FROM src SELECT key, value").collect()

  8. SUPPORTED DATATYPES
     • Numeric Types - ByteType, ShortType, IntegerType, LongType,
       FloatType, DoubleType, DecimalType
     • StringType
     • BinaryType
     • BooleanType
     • Datetime Types - TimestampType, DateType
     • Complex Types - ArrayType, MapType, StructType (see the schema
       sketch after this slide)

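     A short sketch (not from the deck) showing how several of these types
     combine into an explicit schema, which can be passed to createDataFrame
     instead of relying on inference; the field names are hypothetical:

         from pyspark.sql.types import (StructType, StructField, StringType,
                                        IntegerType, DoubleType, ArrayType)

         # Hypothetical schema: a required string plus optional typed fields.
         schema = StructType([
             StructField('name', StringType(), nullable=False),
             StructField('age', IntegerType(), nullable=True),
             StructField('score', DoubleType(), nullable=True),
             StructField('tags', ArrayType(StringType()), nullable=True),
         ])
         df = sqlContext.createDataFrame([('Alice', 2, 9.5, ['a'])], schema)
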
  9. AGGREGATION
     Aggregate on the entire DataFrame without groups (shorthand for
     df.groupBy().agg()).

     >>> df.agg({"age": "max"}).collect()
     [Row(max(age)=5)]
     >>> from pyspark.sql import functions as F
     >>> df.agg(F.min(df.age)).collect()
     [Row(min(age)=2)]

  10. DESCRIBE
      Computes statistics for numeric columns: count, mean, stddev, min, and
      max. If no columns are given, this function computes statistics for
      all numerical columns.

      >>> df.describe(['age', 'name']).show()
      +-------+---+-----+
      |summary|age| name|
      +-------+---+-----+
      |  count|  2|    2|
      |   mean|3.5| null|
      | stddev|1.5| null|
      |    min|  2|Alice|
      |    max|  5|  Bob|
      +-------+---+-----+

  11. DROP
      Returns a new DataFrame that drops the specified column.

      >>> df.drop('age').collect()
      [Row(name=u'Alice'), Row(name=u'Bob')]
      >>> df.drop(df.age).collect()
      [Row(name=u'Alice'), Row(name=u'Bob')]
      >>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
      [Row(age=5, height=85, name=u'Bob')]
      >>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
      [Row(age=5, name=u'Bob', height=85)]

  12. FILTER
      Filters rows using the given condition.

      >>> df.filter(df.age > 3).collect()
      [Row(age=5, name=u'Bob')]
      >>> df.where(df.age == 2).collect()
      [Row(age=2, name=u'Alice')]
      >>> df.filter("age > 3").collect()
      [Row(age=5, name=u'Bob')]
      >>> df.where("age = 2").collect()
      [Row(age=2, name=u'Alice')]

  13. GROUPBY
      Groups the DataFrame using the specified columns so that aggregations
      can be run on them.

      >>> df.groupBy().avg().collect()
      [Row(avg(age)=3.5)]
      >>> df.groupBy('name').agg({'age': 'mean'}).collect()
      [Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
      >>> df.groupBy(df.name).avg().collect()
      [Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
      >>> df.groupBy(['name', df.age]).count().collect()
      [Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]

  14. JOIN
      Joins with another DataFrame using the given join expression.

      >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
      [Row(name=None, height=80), Row(name=u'Alice', height=None),
       Row(name=u'Bob', height=85)]
      >>> cond = [df.name == df3.name, df.age == df3.age]
      >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
      [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
      >>> df.join(df2, 'name').select(df.name, df2.height).collect()
      [Row(name=u'Bob', height=85)]
      >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
      [Row(name=u'Bob', age=5)]

  15. REPLACE
      Returns a new DataFrame replacing one value with another.

      >>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
      +----+------+----+
      | age|height|name|
      +----+------+----+
      |  10|    80|   A|
      |   5|  null|   B|
      |null|  null| Tom|
      |null|  null|null|
      +----+------+----+

  16. SELECT
      Projects a set of expressions and returns a new DataFrame.

      >>> df.select('*').collect()
      [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
      >>> df.select('name', 'age').collect()
      [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
      >>> df.select(df.name, (df.age + 10).alias('age')).collect()
      [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

  17. SORT
      Returns a new DataFrame sorted by the specified column(s).

      >>> from pyspark.sql.functions import asc, desc
      >>> df.sort(df.age.desc()).collect()
      [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
      >>> df.sort("age", ascending=False).collect()
      [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
      >>> df.orderBy(df.age.desc()).collect()
      [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
      >>> df.sort(asc("age")).collect()
      [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
      >>> df.orderBy(desc("age"), "name").collect()
      [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]

  18. INTEGRATING WITH PANDAS
      • df.toPandas() - Returns the contents of this DataFrame as a
        pandas.DataFrame.
        WARNING - Since a Pandas DataFrame is held entirely in memory on the
        driver, this will cause problems if the dataset is too large.
      • df.dtypes - Returns all column names and their data types as a list
        (a property, not a method). (A round-trip sketch follows this slide.)

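      A minimal round-trip sketch under the warning above (not from the
      deck): aggregate in Spark first so only a small result reaches the
      driver.

          # Aggregate before converting, so only the small result is collected.
          pdf = df.groupBy('name').count().toPandas()  # pandas.DataFrame on the driver
          sdf = sqlContext.createDataFrame(pdf)        # back to a distributed DataFrame
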
  19. Q&A