Semi-structured data: names, types, properties
Relational queries, declarative transformations
[Figure: a DataFrame with columns Name | Age | Sex, whose rows are stored across partition1 and partition2 of the underlying RDD]
• DataFrame = RDD + Schema + Domain-Specific Language
(compare R's dplyr)
• Same expressive power as SQL, with consistent syntax.
• More built-in functions (good news for Python users).
• Spark automatically defines and optimizes the RDD computation for you.
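User Defined Function (UDF)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
split = udf(lambda x: x.split(), ArrayType(StringType()))
df.select(split(df['sentence']))
• lambda x: x.split() is your Python function.
• ArrayType(StringType()) specifies the output SQL type.
• The value returned by udf(...) is a SparkSQL UDF.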
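The UDF pattern on this slide can be sketched end to end as follows. This is a minimal illustration, not the presenter's code: the helper name `tokenize`, the sample rows, the `words` alias, and the local session settings are all assumptions. The `None` check is also an addition, since Spark passes null column values to a Python UDF as `None`.

```python
def tokenize(sentence):
    """Split a sentence on whitespace; pass None through so null rows do not raise."""
    return sentence.split() if sentence is not None else None


def demo():
    # pyspark is imported lazily so tokenize() stays usable without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # Assumed local session for illustration only.
    spark = SparkSession.builder.master("local[1]").appName("udf-sketch").getOrCreate()

    # Hypothetical sample data; the column name 'sentence' follows the slide.
    df = spark.createDataFrame([("hello spark sql",), (None,)], ["sentence"])

    # Wrap the Python function and declare the SQL type of its output.
    split = udf(tokenize, ArrayType(StringType()))

    # Apply the UDF to a column like any built-in column function.
    df.select(split(df["sentence"]).alias("words")).show(truncate=False)

    spark.stop()
```

Calling `demo()` in an environment with Spark installed prints the tokenized column. Keeping the Python function separate from the `udf(...)` wrapper lets it be checked without starting a Spark session.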
workflow
◦ Same tool (DataFrame) for data wrangling and machine learning.
◦ Easy inspection of any intermediate features.
Why ML
[Diagram labels: Feature, Learning Algorithm, Model, Model Tuning]
• Model Tuning