
Introduction to DataFrames in Spark


Jyotiska NK

July 18, 2015

Transcript

1. TALK AGENDA • Overview • Creating DataFrames • Playing with different data formats and sources • DataFrame Operations • Integrating with Pandas DF • Demo • Q&A
2. OVERVIEW • Distributed datasets organized into named columns (similar to tables in relational DBs or pandas DataFrames in Python). • Can be constructed from existing RDDs or from external data sources. • Scale from small datasets to TBs/PBs on multi-node Spark clusters. • APIs available in Python, Java, Scala and R. • Bytecode generation and optimization via the Catalyst optimizer. • A simpler DSL for complex, data-heavy operations (see the sketch below). • Faster runtime performance than vanilla RDDs.
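To make the DSL point concrete, here is a minimal sketch (not from the original deck) that builds a toy DataFrame from in-memory data, assuming an already-running SparkContext sc:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Toy data standing in for a real dataset (hypothetical values).
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])

# One readable chain instead of hand-rolled map/filter/reduceByKey over an RDD.
df.filter(df.age > 2).groupBy('name').count().show()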
3. CREATING DATAFRAMES • From existing RDDs • JSON datasets • Hive tables • Relational DBs - PostgreSQL, MySQL etc. (JDBC sketch below) • External data sources - Amazon S3, CSV, Avro, ElasticSearch, Cassandra etc.
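The following slides cover RDDs, JSON and Hive; for the relational-DB case, a hedged sketch using Spark's generic JDBC source (the URL, table name and credentials below are placeholders, and the PostgreSQL JDBC driver must be on the classpath):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Placeholder connection details - substitute your own.
df = sqlContext.read.format('jdbc').options(
    url='jdbc:postgresql://localhost:5432/mydb',
    dbtable='public.cars',
    user='spark',
    password='secret'
).load()
df.printSchema()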
4. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import SQLContext

   sqlContext = SQLContext(sc)
   df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('cars.csv')
   df.select('year', 'model').write.format('com.databricks.spark.csv').save('newcars.csv')
5. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import SQLContext

   sqlContext = SQLContext(sc)
   df = sqlContext.read.json("examples/src/main/resources/people.json")
   df.show()
   df.printSchema()
   df.registerTempTable("people")
   teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
6. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import SQLContext, Row

   sqlContext = SQLContext(sc)
   lines = sc.textFile("examples/src/main/resources/people.txt")
   parts = lines.map(lambda l: l.split(","))
   people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
   schemaPeople = sqlContext.createDataFrame(people)
   schemaPeople.registerTempTable("people")
   teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
7. CREATING DATAFRAMES (CONTD.)
   from pyspark.sql import HiveContext

   sqlContext = HiveContext(sc)
   sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
   sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
   # Queries can be expressed in HiveQL.
   results = sqlContext.sql("FROM src SELECT key, value").collect()
8. SUPPORTED DATATYPES • Numeric Types - ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType • StringType • BinaryType • BooleanType • Datetime Types - TimestampType, DateType • Complex Types - ArrayType, MapType, StructType (explicit-schema sketch below)
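These types live in pyspark.sql.types and can be assembled into an explicit schema rather than relying on inference; a minimal sketch, reusing the toy data from the overview:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sqlContext = SQLContext(sc)

# Build the schema explicitly; the third argument marks the column nullable.
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], schema)
df.printSchema()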
9. AGGREGATION Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
   >>> df.agg({"age": "max"}).collect()
   [Row(max(age)=5)]
   >>> from pyspark.sql import functions as F
   >>> df.agg(F.min(df.age)).collect()
   [Row(min(age)=2)]
10. DESCRIBE Computes statistics for numeric columns: count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
    >>> df.describe(['age', 'name']).show()
    +-------+---+-----+
    |summary|age| name|
    +-------+---+-----+
    |  count|  2|    2|
    |   mean|3.5| null|
    | stddev|1.5| null|
    |    min|  2|Alice|
    |    max|  5|  Bob|
    +-------+---+-----+
11. DROP Returns a new DataFrame that drops the specified column.
    >>> df.drop('age').collect()
    [Row(name=u'Alice'), Row(name=u'Bob')]
    >>> df.drop(df.age).collect()
    [Row(name=u'Alice'), Row(name=u'Bob')]
    >>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
    [Row(age=5, height=85, name=u'Bob')]
    >>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
    [Row(age=5, name=u'Bob', height=85)]
12. FILTER Filters rows using the given condition.
    >>> df.filter(df.age > 3).collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where(df.age == 2).collect()
    [Row(age=2, name=u'Alice')]
    >>> df.filter("age > 3").collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where("age = 2").collect()
    [Row(age=2, name=u'Alice')]
13. GROUPBY Groups the DataFrame using the specified columns, to run aggregations on them.
    >>> df.groupBy().avg().collect()
    [Row(avg(age)=3.5)]
    >>> df.groupBy('name').agg({'age': 'mean'}).collect()
    [Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
    >>> df.groupBy(df.name).avg().collect()
    [Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
    >>> df.groupBy(['name', df.age]).count().collect()
    [Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]
14. JOIN Joins with another DataFrame using the given join expression.
    >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
    [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
    >>> cond = [df.name == df3.name, df.age == df3.age]
    >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
    [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
    >>> df.join(df2, 'name').select(df.name, df2.height).collect()
    [Row(name=u'Bob', height=85)]
    >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
    [Row(name=u'Bob', age=5)]
15. REPLACE Returns a new DataFrame replacing a value with another value.
    >>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
    +----+------+----+
    | age|height|name|
    +----+------+----+
    |  10|    80|   A|
    |   5|  null|   B|
    |null|  null| Tom|
    |null|  null|null|
    +----+------+----+
16. SELECT Projects a set of expressions and returns a new DataFrame.
    >>> df.select('*').collect()
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    >>> df.select('name', 'age').collect()
    [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
    >>> df.select(df.name, (df.age + 10).alias('age')).collect()
    [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
17. SORT Returns a new DataFrame sorted by the specified column(s).
    >>> df.sort(df.age.desc()).collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
    >>> df.sort("age", ascending=False).collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
    >>> df.orderBy(df.age.desc()).collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
    >>> from pyspark.sql.functions import asc, desc
    >>> df.sort(asc("age")).collect()
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    >>> df.orderBy(desc("age"), "name").collect()
    [Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
18. INTEGRATING WITH PANDAS • df.toPandas() - Returns the contents of this DataFrame as a pandas.DataFrame. WARNING - since the pandas DataFrame lives entirely in driver memory, this will cause problems if the dataset is too large. • df.dtypes - Returns all column names and their data types as a list. (Sketch below.)
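A minimal sketch of the hand-off, reusing the toy DataFrame from the overview; the limit() guard is one way to respect the memory warning above (the cutoff of 1000 is an arbitrary placeholder):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])

print(df.dtypes)               # [('name', 'string'), ('age', 'bigint')]

# Cap the rows collected to the driver before converting -
# toPandas() materializes everything in local memory.
pdf = df.limit(1000).toPandas()
print(pdf.describe())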
  19. Q&A