superChing
September 18, 2015

Transcript

  1. Agenda
     1. DataFrame: What, Why, API, WordCount hands-on demo
     2. ML Pipeline: What, Why, API, hands-on demo
  2. What is DataFrame
     • A distributed tabular data structure over semi-structured data: column names, types, properties.
     • Relational queries and declarative transformations.
     • Partitioned across the cluster like an RDD; every partition holds rows of the same schema (Name | Age | Sex).
     • DataFrame = RDD + Schema + Domain-Specific Language (a sketch follows below)
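A minimal sketch of "DataFrame = RDD + Schema", assuming a running SparkContext `sc` and SQLContext `sqlContext` (Spark 1.3+ API):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    rdd = sc.parallelize([('Alice', 30, 'F'), ('Bob', 25, 'M')])
    schema = StructType([
        StructField('Name', StringType(), True),
        StructField('Age', IntegerType(), True),
        StructField('Sex', StringType(), True),
    ])
    df = sqlContext.createDataFrame(rdd, schema)  # attach a schema to the RDD
    df.printSchema()                              # column names and types are now known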
  3. Why use DataFrame
     • Similar to single-node tools (Python's pandas, R's dplyr).
     • Same capability as SQL, with consistent syntax.
     • More built-in functions (good news for Python users).
     • Spark automatically defines and optimizes the underlying RDD computation for you (see the explain() sketch below).
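To see the optimization at work, `explain()` prints the plan Catalyst produced; a one-line sketch assuming a DataFrame `df` with 'country' and 'age' columns:

    df.groupBy('country').avg('age').explain()  # shows the optimized physical plan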
  4. API Overview
     • SQLContext, HiveContext
     • DataFrame
     • Column
     • sql.functions
     • sql.types
     • Row
     • DataFrameNaFunctions
     • DataFrameStatFunctions (1.4)
     • GroupedData
     • Window (1.4)
     • DataFrameReader (1.4)
     • DataFrameWriter (1.4)
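A quick tour of a few of these entry points, as a sketch (Spark 1.4-era API; file paths and column names are illustrative):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)               # entry point for DataFrames
    df = sqlContext.read.json('people.json')  # DataFrameReader (1.4)
    df.na.drop()                              # DataFrameNaFunctions, via df.na
    df.stat.freqItems(['age'])                # DataFrameStatFunctions (1.4), via df.stat
    df.write.parquet('people.parquet')        # DataFrameWriter (1.4)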
  5. Share DataFrames via the Metastore
     • Look up tables: hiveContext.tables()
     • Read a table into a DataFrame: hiveContext.table("name")
     • Save to the metastore (persistent DataFrame): df.saveAsTable("name")
     • Register an in-memory DataFrame: df.registerTempTable("name")
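The four operations side by side, as a sketch (assumes a HiveContext `hiveContext` and a DataFrame `df`; table names are illustrative):

    hiveContext.tables().show()            # look up tables in the metastore
    people = hiveContext.table('people')   # read a table into a DataFrame
    df.saveAsTable('people_persistent')    # persistent: survives the session
    df.registerTempTable('people_tmp')     # in-memory: this session only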
  6. DataFrame, Pandas, RDD
     DataFrame <--> Pandas:
       spark_df = context.createDataFrame(pandas_df)
       pandas_df = spark_df.toPandas()
     DataFrame <--> RDD:
       people_df = people_rdd.toDF()
       people_rdd = people_df.rdd   # .rdd is a property, not a method
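A round-trip sketch between the three representations (assumes `sqlContext` exists):

    import pandas as pd

    pandas_df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})
    spark_df = sqlContext.createDataFrame(pandas_df)  # pandas -> Spark
    back_to_pandas = spark_df.toPandas()              # Spark -> pandas
    people_rdd = spark_df.rdd                         # Spark -> RDD of Rows
    people_df = people_rdd.toDF()                     # RDD -> Spark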
  7. Projection & Filter
     Pandas DataFrame style: df[df['age']>18][['name','age']]
     SQL:                    SELECT name, age FROM table WHERE age > 18
     Spark DataFrame style:  df.filter(df['age']>18).select('name','age')
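The DSL and SQL styles return the same result, as this sketch shows (assumes `sqlContext` and a DataFrame `df` with 'name' and 'age'; the table name is illustrative):

    df.filter(df['age'] > 18).select('name', 'age').show()               # DataFrame DSL
    df.registerTempTable('people')
    sqlContext.sql('SELECT name, age FROM people WHERE age > 18').show() # plain SQL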
  8. Aggregation
     Pandas / Spark DataFrame style:
       from pyspark.sql.functions import avg, max
       df.groupBy('country').agg(avg('age'), max('age'))
     SQL: SELECT country, AVG(age), MAX(age) FROM table GROUP BY country
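In practice you usually also alias() the aggregates so the output columns get readable names; a sketch assuming the same `df`:

    from pyspark.sql.functions import avg, max

    df.groupBy('country') \
      .agg(avg('age').alias('avg_age'), max('age').alias('max_age')) \
      .show()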
  9. User Defined Function (UDF)
     A Spark SQL UDF wraps your Python function; you must specify the output SQL type.
       from pyspark.sql.functions import udf
       from pyspark.sql.types import StringType, ArrayType
       split = udf(lambda x: x.split(), ArrayType(StringType()))
       df.select(split(df['sentence']))
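The output type must be declared because Spark cannot infer SQL types from a Python function. Another small sketch, this time returning an integer (assumes `df` has a 'sentence' column):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    n_words_udf = udf(lambda s: len(s.split()), IntegerType())
    df.select(n_words_udf(df['sentence']).alias('n_words')).show()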
  10. Word Count by DataFrame
      from pyspark.sql.functions import udf, explode
      from pyspark.sql.types import StringType, ArrayType

      df = cxt.createDataFrame([('spark spark fast fast fast',)], ['sentence'])
      # +--------------------------+
      # |                  sentence|
      # +--------------------------+
      # |spark spark fast fast fast|
      # +--------------------------+

      # UDF split: split(sentence) -> ArrayBuffer(spark, spark, ...)
      split_udf = udf(lambda x: x.split(), ArrayType(StringType()))

      # explode: one row per word (spark, spark, fast, fast, fast), then groupBy
      df.withColumn('word', explode(split_udf(df['sentence']))) \
        .groupBy('word').count() \
        .show()
      # +-----+-----+
      # | word|count|
      # +-----+-----+
      # |spark|    2|
      # | fast|    3|
      # +-----+-----+
  11. What is ML
      ML:
      • high-level pipeline abstraction
      • based on Spark DataFrame
      • ML = scikit-learn pipeline + pandas dataframe
      MLlib:
      • low-level implementation
      • based on RDD
  12. Why ML
      • Pipeline: a clear, unified interface for complex machine learning workflows
        ◦ the same tool (DataFrame) for data wrangling and machine learning
        ◦ easy inspection of any intermediate features
      • Model Tuning over the whole Feature -> Learning Algorithm -> Model chain (a tuning sketch follows below)
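A hedged sketch of model tuning with the pipeline built on the API slide below: grid search over regularization with cross-validation (the names `pipeline`, `stage3`, and `training_df` come from that slide; the grid values are illustrative):

    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    grid = ParamGridBuilder().addGrid(stage3.regParam, [0.01, 0.1, 1.0]).build()
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    cv_model = cv.fit(training_df)  # fits the pipeline once per fold and grid point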
  13. Terms
      • Transformer: can transform a DataFrame.
      • Estimator: can be fitted on a DataFrame.
      • Pipeline: is itself an Estimator, and chains Transformers and Estimators together.
      Concrete instances of each are sketched below.
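A sketch with concrete instances of the terms (assumes `df` has 'text' and 'label' columns):

    from pyspark.ml.feature import Tokenizer, StringIndexer

    tokenizer = Tokenizer(inputCol='text', outputCol='words')
    tokens_df = tokenizer.transform(df)        # Transformer: transform only, no fitting

    indexer = StringIndexer(inputCol='label', outputCol='label_idx')
    indexer_model = indexer.fit(df)            # Estimator: fit returns a Model...
    indexed_df = indexer_model.transform(df)   # ...which is itself a Transformer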
  14. Pipeline fitting
      Rule: for each component in order, fit it on the current DataFrame (if it is an Estimator), then transform, and pass the resulting DataFrame (DF1 -> DF2 -> DF3) on to the next component.
  15. Pipeline API
      from pyspark.ml.feature import OneHotEncoder, VectorAssembler
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml import Pipeline

      stage1 = OneHotEncoder(inputCol="category", outputCol="code")
      stage2 = VectorAssembler(inputCols=["code", "count"], outputCol="features")  # note: inputCols
      stage3 = LogisticRegression(featuresCol="features", labelCol="spam")
      pipeline = Pipeline(stages=[stage1, stage2, stage3])

      model = pipeline.fit(training_df)            # Estimators get fitted here
      prediction_df = model.transform(testing_df)  # the fitted pipeline is a Transformer
      # OneHotEncoder and VectorAssembler are Transformers; LogisticRegression is an Estimator.
      # Input columns: "category", "count".
  16. Q&A

  17. Reference
      • All Databricks talks and blogs about DataFrame and ML Pipeline up to 2015/9/20
      • Spark Programming Guide
      • http://0x0fff.com/spark-dataframes-are-faster-arent-they/
      • http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
      • https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
      • https://software.intel.com/en-us/blogs/2015/05/01/restudy-schemardd-in-sparksql
      • http://www.infoobjects.com/journey-of-schema-in-big-data-world/
      • https://issues.apache.org/jira/browse/SPARK-3530
  18. Q&A

  19. Why use DataFrame (Cont.)
      RDD:
        data.map(lambda x: (x[0], [int(x[1]), 1])) \
            .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
            .map(lambda x: [x[0], x[1][0] / x[1][1]])
      DataFrame:
        data.groupBy("country").avg("age")
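A self-contained version of the comparison, assuming `sc` and `sqlContext` and (country, age) string pairs as in the snippet:

    data = sc.parallelize([('TW', '30'), ('TW', '20'), ('JP', '40')])

    # RDD style: build [sum, count] per country by hand, then divide
    rdd_avg = data.map(lambda x: (x[0], [int(x[1]), 1])) \
                  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
                  .map(lambda x: [x[0], x[1][0] / float(x[1][1])])

    # DataFrame style: declare the intent and let Spark plan the computation
    df = data.map(lambda x: (x[0], int(x[1]))).toDF(['country', 'age'])
    df_avg = df.groupBy('country').avg('age')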