relational DBs or Python DataFrames). • Can be constructed from existing RDDs or external data sources. • Can scale from small datasets to TBs/PBs on multi-node Spark clusters. • APIs available in Python, Java, Scala and R. • Bytecode generation and optimization using Catalyst Optimizer. • Simpler DSL to perform complex and data heavy operations. • Faster runtime performance than vanilla RDDs.
SQLContext(sc) df = sqlContext.read.json("examples/src/main/resources/people.json") df.show() df.printSchema() df.registerTempTable("people") teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
= SQLContext(sc) lines = sc.textFile("examples/src/main/resources/people.txt") parts = lines.map(lambda l: l.split(",")) people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) schemaPeople = sqlContext.createDataFrame(people) schemaPeople.registerTempTable(“people") teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/ kv1.txt' INTO TABLE src") # Queries can be expressed in HiveQL. results = sqlContext.sql("FROM src SELECT key, value").collect()
df.groupBy.agg()). >>> df.agg({"age": “max"}).collect() [Row(max(age)=5)] >>> from pyspark.sql import functions as F >>> df.agg(F.min(df.age)).collect() [Row(min(age)=2)]
this DataFrame as Pandas pandas.DataFrame * WARNING -‐ Since Pandas DataFrame stays in memory, this will cause problems if the dataset is too large. • df.dtypes() - Returns all column names and their data types as a list.