
Transcript

  1. DataFrame & ML PipeLine
    林煒清 / Wayne Lin
    slides at: https://goo.gl/txohc2


  2. About the Speaker
    林煒清 / Wayne Lin
    [email protected]
    Software engineer
    Experience: Hadoop, Spark
    Interests: anything that can make computers smarter


  3. Agenda
    DataFrame
    What
    Why
    API
    WordCount
    hands-on demo 1
    ML PipeLine
    What
    Why
    API
    hands-on demo 2


  4. What is DataFrame
    ● A distributed tabular data structure on semi-structured data
    ● Schema: column names, types, properties
    ● Relational queries, declarative transformations
    [diagram: a DataFrame of Name | Age | Sex rows, stored as RDD partitions]
    ● DataFrame = RDD + Schema + Domain-Specific Language
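    That equation in code (a minimal sketch with hypothetical data; assumes the sc and sqlContext provided by the PySpark shell):

    from pyspark.sql import Row
    # an RDD of Rows with named fields
    people_rdd = sc.parallelize([Row(name='Ann', age=23, sex='F'),
                                 Row(name='Bob', age=31, sex='M')])
    # attaching a schema (here inferred from the Rows) turns the RDD into a DataFrame
    people_df = sqlContext.createDataFrame(people_rdd)
    people_df.printSchema()   # age: long, name: string, sex: string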


  5. Why use DataFrame
    ● similar to single-node tools (Python's pandas, R's dplyr)
    ● same capabilities as SQL, with consistent syntax
    ● more built-in functions (good news for Python users)
    ● Spark automatically defines and optimizes the RDD computation for you
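    To watch that optimization happen, explain() prints the plan Catalyst builds (a sketch, reusing a hypothetical df with name/age columns):

    # filter + projection are fused into one optimized plan;
    # the True flag also prints the logical and optimized logical plans
    df.filter(df['age'] > 18).select('name', 'age').explain(True)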


  6. Why use DataFrame
    « future of Spark »


  7. API Overview
    ● SQLContext, HiveContext
    ● DataFrame
    ● Column
    ● sql.functions
    ● sql.types
    ● Row
    ● DataFrameNaFunctions
    ● DataFrameStatFunctions (1.4)
    ● GroupedData
    ● Window (1.4)
    ● DataFrameReader (1.4)
    ● DataFrameWriter (1.4)


  8. Share DataFrames via Metastore
    Look up tables:                      hiveContext.tables()
    Read a table into a DataFrame:       hiveContext.table("name")
    Save a DataFrame to the metastore
    (persistent DataFrame):              df.saveAsTable("name")
    Register an in-memory DataFrame
    (temporary, this session only):      df.registerTempTable("name")
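    The calls above chained together (a sketch; the table names are hypothetical):

    df.saveAsTable("people")                 # persist via the Hive metastore
    hiveContext.tables().show()              # list the tables the metastore knows
    people_df = hiveContext.table("people")  # read it back as a DataFrame
    df.registerTempTable("people_tmp")       # in-memory, this session only
    adults = hiveContext.sql("SELECT * FROM people_tmp WHERE age > 18")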


  9. DataFrame , Pandas , RDD
    DataFrame <--> Pandas
    spark_df = context.createDataFrame(pandas_df)
    pandas_df = spark_df.toPandas()
    DataFrame <--> RDD
    people_df = people_rdd.toDF()
    people_rdd = people_df.rdd


  10. Operators
    Select, Where, GroupBy, Join, Union,
    In, When, Over, Between, Like, ...
    UDF (User-Defined Function)
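    A few of these operators in action (a sketch; df and other_df are hypothetical):

    from pyspark.sql.functions import when
    df.filter(df['age'].between(18, 65))                          # Between
    df.select(when(df['age'] > 18, 'adult').otherwise('minor'))   # When
    df.filter(df['name'].like('A%'))                              # Like
    df.join(other_df, df['name'] == other_df['name'])             # Join
    df.unionAll(other_df)                                         # Union (same schema)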


  11. Projection & Filter
    SQL
    SELECT name, age
    FROM table
    WHERE age > 18
    Pandas DataFrame style
    df[df['age']>18][['name','age']]
    Spark DataFrame style
    df.filter(df['age']>18).select('name','age')


  12. Aggregation
    Pandas / Spark DataFrame style
    from pyspark.sql.functions import avg, max
    df.groupBy('country').agg(avg('age'), max('age'))
    SQL
    SELECT country, AVG(age), MAX(age)
    FROM table
    GROUP BY country


  13. User Defined Function (UDF)
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType, ArrayType
    # udf() wraps your Python function into a SparkSQL UDF;
    # the second argument declares the SQL type of the output
    split = udf(lambda x: x.split(), ArrayType(StringType()))
    df.select(split(df['sentence']))

  14. Word Count by DataFrame
    from pyspark.sql.functions import udf, explode
    from pyspark.sql.types import StringType, ArrayType

    df = cxt.createDataFrame([('spark spark fast fast fast',)], ['sentence'])
    +--------------------------+
    |                  sentence|
    +--------------------------+
    |spark spark fast fast fast|
    +--------------------------+

    # UDF split: tokenize each sentence into an array of words
    split_udf = udf(lambda x: x.split(), ArrayType(StringType()))
    df.select(split_udf(df['sentence']))
    +-----------------------------+
    |              split(sentence)|
    +-----------------------------+
    |ArrayBuffer(spark, spark, ...|
    +-----------------------------+

    # explode: one output row per element of the array
    df.withColumn('word', explode(split_udf(df['sentence'])))
    +-----+
    | word|
    +-----+
    |spark|
    |spark|
    | fast|
    | fast|
    | fast|
    +-----+

    # groupBy + count, then show the result
    df.withColumn('word', explode(split_udf(df['sentence']))) \
      .groupBy('word').count() \
      .show()
    +-----+-----+
    | word|count|
    +-----+-----+
    |spark|    2|
    | fast|    3|
    +-----+-----+

  15. Thanks
    And thanks to the many experts of the TW Spark Group for helping set up the system environment.
    http://161.202.33.19:8000/


  16. What is ML
    ML:
    ● high-level pipeline abstraction
    ● based on Spark DataFrame
    ● ML = scikit-learn pipeline + pandas dataframe
    MLlib:
    ● low-level implementation
    ● based on RDD
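    The split is visible in the imports (a sketch; both packages ship with Spark):

    # spark.ml: pipeline API over DataFrames
    from pyspark.ml.feature import Tokenizer
    from pyspark.ml.classification import LogisticRegression
    # spark.mllib: lower-level API over RDDs of LabeledPoint
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD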


  17. Spark ML
    Typical Machine Learning workflow (training phase):
    pre-processing → feature extraction → training → tuning → evaluation
    all wrapped up as a Pipeline


  18. Why ML
    ● Pipeline: clear and unified interface for complex
      machine learning workflows
      ○ same tool (DataFrame) for data wrangling and machine learning
      ○ easy inspection of any intermediate features
    ● Model Tuning
    [diagram: Feature → Learning Algorithm → Model tuning]

  19. Terms
    Pipeline is an Estimator and chains
    Transformers and Estimators together.
    A Transformer can transform a DataFrame.
    An Estimator can be fitted on a DataFrame.
    [diagram: Transformer → transform; Estimator → fit; Pipeline → fit]

  20. Pipeline Fitting
    rule: for each component, fit then transform
    [diagram: DF1 → Transformer → DF2 → Transformer → DF3 → Estimator;
     each fitted stage transforms the DataFrame that feeds the next one]

  21. Pipeline API
    from pyspark.ml.feature import *
    from pyspark.ml.classification import *
    from pyspark.ml import *
    stage1 = OneHotEncoder(inputCol="category", outputCol="code")
    stage2 = VectorAssembler(inputCols=["code","count"], outputCol="features")
    stage3 = LogisticRegression(featuresCol="features", labelCol="spam")
    pipeline = Pipeline(stages=[stage1, stage2, stage3])
    model = pipeline.fit(training_df)
    prediction_df = model.transform(testing_df)
    [diagram: "category" → OneHotEncoder → "code"; "code" + "count" → VectorAssembler
     → "features" → LogisticRegression; OneHotEncoder and VectorAssembler are
     Transformers, LogisticRegression is an Estimator]


  22. References
    All Databricks' talks and blog posts about DataFrame and ML pipeline up to 2015/9/20
    Spark Programming Guide
    http://0x0fff.com/spark-dataframes-are-faster-arent-they/
    http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
    https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
    https://software.intel.com/en-us/blogs/2015/05/01/restudy-schemardd-in-sparksql
    http://www.infoobjects.com/journey-of-schema-in-big-data-world/
    https://issues.apache.org/jira/browse/SPARK-3530


  23. Scala Q&A
    RDD
    SparkS
    QL
    DataF
    rame
    MLlib
    Stream
    ing
    SQL
    ML
    AL
    S
    Kafka


  24. Why use DataFrame (cont.)
    RDD
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]])
    DataFrame
    data.groupBy("country").avg("age")


  25. Differences from Pandas
    These Pandas idioms have no direct Spark DataFrame equivalent:
    Indexing:              df.loc[[4,3,2],'A']
    Mutation:              df['col'] = 3
    Grouped-data indexing: df.groupBy('key')['A','B']

  26. DataFrame Read/Write
    df = context.read \
        .format("json") \
        .option("samplingRatio", "0.1") \
        .load("/path/data.json")
    df.write \
        .format("parquet") \
        .mode("append") \
        .partitionBy("year") \
        .save("/path/data.parquet")
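    Shortcut methods also exist for common formats (a sketch; equivalent to the generic form above):

    df = context.read.json("/path/data.json")
    df.write.parquet("/path/data.parquet")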


  27. Pipeline Fitting (cont.)
    Fitting a Pipeline (Transformers + Estimators) on a training DataFrame produces a
    Model (Transformers + fitted Models); applying that Model to data yields predictions.
    [diagram: Pipeline + training data → fit → Model; data → Model → prediction]