
Spark Committer Night meetup @ NYC

Reynold Xin
October 15, 2014

Transcript

  1. What is Apache Spark?
     Fast and general engine for big data processing
     Generalizes the MapReduce model to support low-latency and complex analytics
     Most active open source project in big data
  2. About Databricks
     Founded by the creators of Spark in 2013
     Drives open source Spark development, and offers a cloud service (Databricks Cloud)
     Largest organization contributing to Spark
     >  Over 1000 patches in the past year
  3. Community Growth
     [Chart: Contributors / Month to Spark, 2010–2014]
  4. Community Growth
     [Chart: Contributors / Month to Spark, 2010–2014]
     2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
  5. What’s New in Spark?
     Petabyte sort record
     Application integration
     >  Tableau, Trifacta, Talend, ElasticSearch, …
     Ongoing development for Spark 1.2
     >  Python streaming, new MLlib API, YARN scaling, …
     Spark certification
  6. Short Talks
     Petabyte sort (Reynold Xin)
     Spark 1.2 development (Patrick Wendell)
     Machine learning pipelines (Joseph Bradley)
  7. Common Misconception About Spark
     “Spark is in-memory. It doesn’t work with BIG DATA.”
     “It is too expensive to buy a cluster with enough memory to fit our data.”
  8. Spark Project Goals
     Works well with GBs, TBs, or PBs of data
     Works well with different storage media (RAM, HDDs, SSDs)
     Works well from low-latency streaming jobs to long batch jobs
  9. Sorting 100 TB and 1 PB
     Participated in this year’s Sort Benchmark
     Sorted 100 TB following benchmark rules
     >  Data only on disk
     Sorted 1 PB on our own to push scalability (no official benchmark exists)
  10. What made this possible?
     Sort-based shuffle (SPARK-2045)
     Netty native network transport (SPARK-2468)
     External shuffle service (SPARK-3796)
     Power of the Cloud (Amazon EC2)
     ……
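
     These improvements are exposed through ordinary Spark configuration. The snippet below is a minimal sketch of how one might turn them on for a Spark 1.1/1.2 cluster; the property names and values are assumptions based on the configuration documented for those releases, not something shown in the deck.

     from pyspark import SparkConf, SparkContext

     # Minimal sketch; assumed Spark 1.1/1.2 configuration keys.
     conf = (SparkConf()
             .setAppName("SortShuffleDemo")
             # Sort-based shuffle (SPARK-2045); becomes the default in 1.2
             .set("spark.shuffle.manager", "sort")
             # Netty-based block transport (SPARK-2468), new in 1.2
             .set("spark.shuffle.blockTransferService", "netty")
             # External shuffle service (SPARK-3796); requires the service
             # to be running on each worker node
             .set("spark.shuffle.service.enabled", "true"))

     sc = SparkContext(conf=conf)
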
  11. Spark Releases
     Release every 3 months: 1.0, 1.1, 1.2
     Patch releases when necessary: 1.0.2
     Spark 1.1: released September 11
     Spark 1.2: release early December
  12. Roadmap
     Spark 1.1 and 1.2 have similar themes
     Spark core: usability, stability, and performance
     MLlib/SQL/Streaming: expanded feature set and performance
     Around 40% of mailing list traffic is about these libraries.
  13. Spark Core
     1.1
     >  “Sort based” shuffle implementation
     >  Optimized broadcasts
     >  Disk spilling improvements
     1.2
     >  Re-written network module
     >  Scala 2.11 support
     >  Better debugging tools
  14. Spark SQL
     1.1
     >  JDBC server for multi-tenant access and BI tools
     >  Native JSON support
     >  Native Parquet support and optimizations
     1.2
     >  Support for Hive 0.13
     >  External data sources API (Avro, Cassandra, etc.)
     >  Native ORC support
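
     As a concrete illustration of the native JSON support, here is a minimal PySpark sketch of loading JSON records and querying them through Spark SQL. The file path and field names are hypothetical, and the snippet assumes the Spark 1.1-era SQLContext API (jsonFile, registerTempTable).

     from pyspark import SparkContext
     from pyspark.sql import SQLContext

     sc = SparkContext(appName="JsonSQLExample")
     sqlContext = SQLContext(sc)

     # Load JSON records; the schema is inferred automatically.
     # (Hypothetical path and field names, for illustration only.)
     people = sqlContext.jsonFile("hdfs:///data/people.json")
     people.registerTempTable("people")

     teenagers = sqlContext.sql(
         "SELECT name FROM people WHERE age >= 13 AND age <= 19")
     for row in teenagers.collect():
         print(row)
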
  15. Spark Streaming
     1.1
     >  Amazon Kinesis support
     >  Support for polling Flume streams
     >  Streaming + ML: streaming linear regression
     1.2
     >  Python streaming API
     >  Write-ahead log for full H/A operation
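
     The write-ahead log is switched on through configuration rather than code changes. The sketch below shows roughly how that could look; the exact property name is an assumption based on the Spark 1.2 streaming configuration and does not come from the deck. A fault-tolerant checkpoint directory is also needed, since the log is stored alongside checkpoint data.

     from pyspark import SparkConf, SparkContext
     from pyspark.streaming import StreamingContext

     # Assumed Spark 1.2 configuration key for the receiver write-ahead log.
     conf = (SparkConf()
             .setAppName("ReliableReceiverDemo")
             .set("spark.streaming.receiver.writeAheadLog.enable", "true"))

     sc = SparkContext(conf=conf)
     ssc = StreamingContext(sc, 1)

     # The write-ahead log lives with the checkpoint data, so a
     # fault-tolerant checkpoint directory (e.g. on HDFS) is required.
     ssc.checkpoint("hdfs:///checkpoints/reliable-receiver")
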
  16. Python API for Streaming
     import sys
     from pyspark import SparkContext
     from pyspark.streaming import StreamingContext

     if __name__ == "__main__":
         # Read a text stream from the host/port given on the command line.
         sc = SparkContext(appName="StreamingWordCount")
         ssc = StreamingContext(sc, 1)
         lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
         counts = lines.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
         counts.pprint()
         ssc.start()
         ssc.awaitTermination()

     github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
     Ken Takagiwa (github/giwa), Davies Liu (github/davies)
  17. MLlib and GraphX
     1.1
     >  Algorithms (SVD, multiclass decision trees, …)
     >  Feature extraction utilities (word2vec, tf-idf)
     >  Statistics library
     1.2
     >  Pipeline-based interface to all algorithms
     >  Many new algorithms, including Random Forests
     >  Stable API for GraphX
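
     To make the 1.1 additions concrete, here is a minimal sketch of training a multiclass decision tree with MLlib’s Python API. The toy data and parameter values are made up for illustration, and the call signature is assumed from the MLlib decision tree API of that era.

     from pyspark import SparkContext
     from pyspark.mllib.regression import LabeledPoint
     from pyspark.mllib.tree import DecisionTree

     sc = SparkContext(appName="MulticlassDecisionTree")

     # Toy three-class dataset: label followed by a two-element feature vector.
     data = sc.parallelize([
         LabeledPoint(0.0, [0.0, 0.0]),
         LabeledPoint(1.0, [1.0, 1.0]),
         LabeledPoint(2.0, [2.0, 2.0]),
     ])

     # Train a multiclass decision tree (assumed 1.1-era API).
     model = DecisionTree.trainClassifier(
         data, numClasses=3, categoricalFeaturesInfo={},
         impurity="gini", maxDepth=3)

     print(model.predict([1.0, 1.0]))
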
  18. ML Pipelines
     [Diagram: a typical ML workflow — training data and test data each flow through feature extraction into model training and model testing, with other training data feeding other model training]
     Typical ML workflow is complex.
  19. ML Pipelines
     Typical ML workflow is complex.
     Pipelines under development
     •  Easy workflow construction
     •  Standardized interface for model tuning
     •  Testing & failing early
     •  Uniform API for all pipeline components
  20. Datasets: Further Integration with Spark SQL
     ML pipelines require Datasets
     •  Handle many data types (features)
     •  Keep metadata about features
     •  Select subsets of features for different parts of pipeline
     •  Join groups of features
     ML Dataset = SchemaRDD
     Under development
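
     Since an ML Dataset here is just a SchemaRDD, feature columns can be selected and combined with ordinary Spark SQL. The sketch below illustrates the idea using the column names from the pipeline example on the next slide; it assumes the Spark 1.1-era SQLContext API (inferSchema, registerTempTable), and the records are made up.

     from pyspark import SparkContext
     from pyspark.sql import SQLContext

     sc = SparkContext(appName="MLDatasetSketch")
     sqlContext = SQLContext(sc)

     # Hypothetical training records with mixed feature types.
     rows = sc.parallelize([
         {"userGender": "F", "targetGender": "M", "userCountryIndex": 3, "label": 1.0},
         {"userGender": "M", "targetGender": "M", "userCountryIndex": 7, "label": 0.0},
     ])

     # An ML Dataset is just a SchemaRDD: infer a schema, register it as a table.
     training = sqlContext.inferSchema(rows)
     training.registerTempTable("training")

     # Select only the feature columns a given pipeline stage needs.
     subset = sqlContext.sql(
         "SELECT userGender, targetGender, label FROM training")
     subset.collect()
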
  21. ML Pipeline Example
     training = sqlContext.sql("SELECT ... FROM ... ").cache()

     interactor = Interactor()
     fvAssembler = FeatureVectorAssembler()
     treeClassifier = DecisionTreeClassifier()

     paramMap = ParamMap()
         .put(interactor.features, {"genderMatch": ["userGender", "targetGender"]})
         .put(fvAssembler.features, {"features": ["genderMatch", "userCountryIndex", ...]})
         .put(treeClassifier.maxDepth, 4)

     pipeline = Pipeline.create(interactor, fvAssembler, treeClassifier)
     model = pipeline.fit(training, paramMap)
  22. Spark Certification
     Apache Spark developer certificate program
     •  http://www.oreilly.com/go/sparkcert
     •  Defined by Spark experts @Databricks
     •  Assessed by O’Reilly Media