superChing
September 18, 2015

Transcript

  1. Agenda
     1. DataFrame: What, Why, API, WordCount hands-on demo
     2. ML Pipeline: What, Why, API, hands-on demo
  2. What is DataFrame
     • A distributed tabular data structure over semi-structured data: column names, types, properties.
     • Relational queries and declarative transformations.
     • Partitioned across the cluster like an RDD; every partition holds rows of the same schema (Name | Age | Sex).
     • DataFrame = RDD + Schema + Domain-Specific Language (a sketch follows below)
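A minimal sketch of "DataFrame = RDD + Schema", assuming a running SparkContext `sc` and SQLContext `sqlContext` (Spark 1.3+ API):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    rdd = sc.parallelize([('Alice', 30, 'F'), ('Bob', 25, 'M')])
    schema = StructType([
        StructField('Name', StringType(), True),
        StructField('Age', IntegerType(), True),
        StructField('Sex', StringType(), True),
    ])
    df = sqlContext.createDataFrame(rdd, schema)  # attach a schema to the RDD
    df.printSchema()                              # column names and types are now known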
  3. Why use DataFrame
     • Similar to single-node tools (Python's pandas, R's dplyr).
     • Same capability as SQL, with consistent syntax.
     • More built-in functions (good news for Python users).
     • Spark automatically defines and optimizes the underlying RDD computation for you (see the explain() sketch below).
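To see the optimization at work, `explain()` prints the plan Catalyst produced; a one-line sketch assuming a DataFrame `df` with 'country' and 'age' columns:

    df.groupBy('country').avg('age').explain()  # shows the optimized physical plan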
  4. API Overview
     • SQLContext, HiveContext
     • DataFrame
     • Column
     • sql.functions
     • sql.types
     • Row
     • DataFrameNaFunctions
     • DataFrameStatFunctions (1.4)
     • GroupedData
     • Window (1.4)
     • DataFrameReader (1.4)
     • DataFrameWriter (1.4)
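A quick tour of a few of these entry points, as a sketch (Spark 1.4-era API; file paths and column names are illustrative):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)               # entry point for DataFrames
    df = sqlContext.read.json('people.json')  # DataFrameReader (1.4)
    df.na.drop()                              # DataFrameNaFunctions, via df.na
    df.stat.freqItems(['age'])                # DataFrameStatFunctions (1.4), via df.stat
    df.write.parquet('people.parquet')        # DataFrameWriter (1.4)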
  5. Share DataFrames via the Metastore
     • Look up tables: hiveContext.tables()
     • Read a table into a DataFrame: hiveContext.table("name")
     • Save to the metastore (persistent DataFrame): df.saveAsTable("name")
     • Register an in-memory DataFrame: df.registerTempTable("name")
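The four operations side by side, as a sketch (assumes a HiveContext `hiveContext` and a DataFrame `df`; table names are illustrative):

    hiveContext.tables().show()            # look up tables in the metastore
    people = hiveContext.table('people')   # read a table into a DataFrame
    df.saveAsTable('people_persistent')    # persistent: survives the session
    df.registerTempTable('people_tmp')     # in-memory: this session only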
  6. DataFrame, Pandas, RDD
     DataFrame <--> Pandas:
       spark_df = context.createDataFrame(pandas_df)
       pandas_df = spark_df.toPandas()
     DataFrame <--> RDD:
       people_df = people_rdd.toDF()
       people_rdd = people_df.rdd   # .rdd is a property, not a method
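A round-trip sketch between the three representations (assumes `sqlContext` exists):

    import pandas as pd

    pandas_df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})
    spark_df = sqlContext.createDataFrame(pandas_df)  # pandas -> Spark
    back_to_pandas = spark_df.toPandas()              # Spark -> pandas
    people_rdd = spark_df.rdd                         # Spark -> RDD of Rows
    people_df = people_rdd.toDF()                     # RDD -> Spark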
  7. Projection & Filter
     Pandas DataFrame style: df[df['age']>18][['name','age']]
     SQL:                    SELECT name, age FROM table WHERE age > 18
     Spark DataFrame style:  df.filter(df['age']>18).select('name','age')
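The DSL and SQL styles return the same result, as this sketch shows (assumes `sqlContext` and a DataFrame `df` with 'name' and 'age'; the table name is illustrative):

    df.filter(df['age'] > 18).select('name', 'age').show()               # DataFrame DSL
    df.registerTempTable('people')
    sqlContext.sql('SELECT name, age FROM people WHERE age > 18').show() # plain SQL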
  8. Aggregation
     Pandas / Spark DataFrame style:
       from pyspark.sql.functions import avg, max
       df.groupBy('country').agg(avg('age'), max('age'))
     SQL: SELECT country, AVG(age), MAX(age) FROM table GROUP BY country
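In practice you usually also alias() the aggregates so the output columns get readable names; a sketch assuming the same `df`:

    from pyspark.sql.functions import avg, max

    df.groupBy('country') \
      .agg(avg('age').alias('avg_age'), max('age').alias('max_age')) \
      .show()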
  9. User Defined Function (UDF)
     A Spark SQL UDF wraps your Python function; you must specify the output SQL type.
       from pyspark.sql.functions import udf
       from pyspark.sql.types import StringType, ArrayType
       split = udf(lambda x: x.split(), ArrayType(StringType()))
       df.select(split(df['sentence']))
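The output type must be declared because Spark cannot infer SQL types from a Python function. Another small sketch, this time returning an integer (assumes `df` has a 'sentence' column):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    n_words_udf = udf(lambda s: len(s.split()), IntegerType())
    df.select(n_words_udf(df['sentence']).alias('n_words')).show()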
  10. Word Count by DataFrame
      from pyspark.sql.functions import udf, explode
      from pyspark.sql.types import StringType, ArrayType

      df = cxt.createDataFrame([('spark spark fast fast fast',)], ['sentence'])
      # +--------------------------+
      # |                  sentence|
      # +--------------------------+
      # |spark spark fast fast fast|
      # +--------------------------+

      # UDF split: split(sentence) -> ArrayBuffer(spark, spark, ...)
      split_udf = udf(lambda x: x.split(), ArrayType(StringType()))

      # explode: one row per word (spark, spark, fast, fast, fast), then groupBy
      df.withColumn('word', explode(split_udf(df['sentence']))) \
        .groupBy('word').count() \
        .show()
      # +-----+-----+
      # | word|count|
      # +-----+-----+
      # |spark|    2|
      # | fast|    3|
      # +-----+-----+
  11. What is ML
      ML:
      • high-level pipeline abstraction
      • based on Spark DataFrame
      • ML = scikit-learn pipeline + pandas dataframe
      MLlib:
      • low-level implementation
      • based on RDD
  12. Why ML
      • Pipeline: a clear, unified interface for complex machine learning workflows
        ◦ the same tool (DataFrame) for data wrangling and machine learning
        ◦ easy inspection of any intermediate features
      • Model Tuning over the whole Feature -> Learning Algorithm -> Model chain (a tuning sketch follows below)
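A hedged sketch of model tuning with the pipeline built on the API slide below: grid search over regularization with cross-validation (the names `pipeline`, `stage3`, and `training_df` come from that slide; the grid values are illustrative):

    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    grid = ParamGridBuilder().addGrid(stage3.regParam, [0.01, 0.1, 1.0]).build()
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    cv_model = cv.fit(training_df)  # fits the pipeline once per fold and grid point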
  13. Terms
      • Transformer: can transform a DataFrame.
      • Estimator: can be fitted on a DataFrame.
      • Pipeline: is itself an Estimator, and chains Transformers and Estimators together.
      Concrete instances of each are sketched below.
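A sketch with concrete instances of the terms (assumes `df` has 'text' and 'label' columns):

    from pyspark.ml.feature import Tokenizer, StringIndexer

    tokenizer = Tokenizer(inputCol='text', outputCol='words')
    tokens_df = tokenizer.transform(df)        # Transformer: transform only, no fitting

    indexer = StringIndexer(inputCol='label', outputCol='label_idx')
    indexer_model = indexer.fit(df)            # Estimator: fit returns a Model...
    indexed_df = indexer_model.transform(df)   # ...which is itself a Transformer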
  14. Pipeline fitting
      Rule: for each component in order, fit it on the current DataFrame (if it is an Estimator), then transform, and pass the resulting DataFrame (DF1 -> DF2 -> DF3) on to the next component.
  15. Pipeline API
      from pyspark.ml.feature import OneHotEncoder, VectorAssembler
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml import Pipeline

      stage1 = OneHotEncoder(inputCol="category", outputCol="code")
      stage2 = VectorAssembler(inputCols=["code", "count"], outputCol="features")  # note: inputCols
      stage3 = LogisticRegression(featuresCol="features", labelCol="spam")
      pipeline = Pipeline(stages=[stage1, stage2, stage3])

      model = pipeline.fit(training_df)            # Estimators get fitted here
      prediction_df = model.transform(testing_df)  # the fitted pipeline is a Transformer
      # OneHotEncoder and VectorAssembler are Transformers; LogisticRegression is an Estimator.
      # Input columns: "category", "count".
  16. Q&A

  17. Reference
      • All Databricks talks and blogs about DataFrame and ML Pipeline up to 2015/9/20
      • Spark Programming Guide
      • http://0x0fff.com/spark-dataframes-are-faster-arent-they/
      • http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
      • https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
      • https://software.intel.com/en-us/blogs/2015/05/01/restudy-schemardd-in-sparksql
      • http://www.infoobjects.com/journey-of-schema-in-big-data-world/
      • https://issues.apache.org/jira/browse/SPARK-3530
  18. Q&A

  19. Why use DataFrame (Cont.)
      RDD:
        data.map(lambda x: (x[0], [int(x[1]), 1])) \
            .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
            .map(lambda x: [x[0], x[1][0] / x[1][1]])
      DataFrame:
        data.groupBy("country").avg("age")
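A self-contained version of the comparison, assuming `sc` and `sqlContext` and (country, age) string pairs as in the snippet:

    data = sc.parallelize([('TW', '30'), ('TW', '20'), ('JP', '40')])

    # RDD style: build [sum, count] per country by hand, then divide
    rdd_avg = data.map(lambda x: (x[0], [int(x[1]), 1])) \
                  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
                  .map(lambda x: [x[0], x[1][0] / float(x[1][1])])

    # DataFrame style: declare the intent and let Spark plan the computation
    df = data.map(lambda x: (x[0], int(x[1]))).toDF(['country', 'age'])
    df_avg = df.groupBy('country').avg('age')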