Mining Repositories with Apache Spark

At the beginning of every research effort, researchers in empirical software
engineering have to go through the processes of extracting data from raw
data sources and transforming them to what their tools expect as inputs.
This step is time-consuming and error-prone, while the produced artifacts
(code, intermediate datasets) are usually not of scientific value. In
recent years, Apache Spark has emerged as a solid foundation for data
science and has taken the big data analytics domain by storm. We believe
that the primitives exposed by Apache Spark can help software engineering
researchers create and share reproducible, high-performance data analysis
pipelines.

In our presentation, given as an ICSE 2018 technical briefing, we discuss how researchers can profit from Apache Spark through a hands-on case study.

Georgios Gousios

May 29, 2018

Transcript

  1. Mining repositories with Apache Spark Georgios Gousios 18 June 2018

  2. The tutorial

    Welcome! In this tutorial, we will go through the following 3 topics:
    - Enough FP to become dangerous
    - Internals of Apache Spark
    - Using Apache Spark for common MSR data tasks
  3. Programming languages

    The de facto languages of Big Data and data science are:
    - Scala: mostly used for data-intensive systems
    - Python: mostly used for data analytics tasks
    Other languages include:
    - Java: the “assembly” of big data systems; the language in which most big data infrastructure is written.
    - R: the statistician’s tool of choice. Great selection of libraries for serious data analytics, great plotting tools.
    In our tutorial, we will be using Scala and Python.
  4. Data processing with Functional Programming

  5. Types of data

    - Unstructured: Data whose format is not known
      - Raw text documents
      - HTML pages
    - Semi-Structured: Data with a known format
      - Pre-parsed data to standard formats: JSON, CSV, XML
    - Structured: Data with known formats, linked together in graphs or tables
  6. Sequences / Lists

    Sequences or Lists or Arrays represent consecutive items in memory.
    In Python:
        a = [1, 2, 3, 4]
    In Scala:
        val l = List(1,2,3,4)
    Basic properties:
    - Size is bounded by memory
    - Items can be accessed by an index: a[1] or l(3)
    - Items can only be inserted at the end (append)
  7. Sets

    Sets store values, without any particular order, and no repeated values.
        scala> val s = Set(1,2,3,4,4)
        s: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)
    Basic properties:
    - Size is bounded by memory
    - Can be queried for containment
    - Set operations: union, intersection, difference, subset
  8. Maps or Dictionaries

    A Map (or Dictionary, or Associative Array) is a collection of (k,v) pairs in which each k appears only once. Some languages have built-in support for Dictionaries:
        a = {'a' : 1, 'b' : 2}
    Basic properties:
    - One key always corresponds to one value.
    - Accessing a value given a key is very fast (≈ O(1))
  9. Nested data types: Graphs

    A graph data structure consists of a finite set of vertices or nodes, together with a set of unordered pairs of these vertices for an undirected graph or a set of ordered pairs for a directed graph. These pairs are the edges.
    - Nodes can contain attributes
    - Edges can contain weights and directions
    Graphs are usually represented as Map[Node, List[Edge]], where
        case class Node(id: Int, attributes: Map[A, B])
        case class Edge(a: Node, b: Node, directed: Option[Boolean],
                        weight: Option[Double])
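The map-based representation above can be sketched in plain Python as an adjacency map. This is only an illustration: the `Edge` record, the sample graph and the `neighbours` helper are hypothetical names that do not appear in the slides.

```python
from collections import namedtuple

# Hypothetical edge record: id of the target node plus a weight.
Edge = namedtuple("Edge", ["target", "weight"])

# Adjacency map: node id -> list of outgoing edges (a directed graph).
graph = {
    1: [Edge(2, 0.5), Edge(3, 1.0)],
    2: [Edge(3, 2.0)],
    3: [],
}

def neighbours(g, node):
    """Return the ids of all nodes directly reachable from `node`."""
    return [e.target for e in g.get(node, [])]
```

Looking up the neighbours of a node is then a single map access followed by a list traversal.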
  10. Nested data types: Trees

    Ordered graphs without loops.
        a = {"id": "5542101946",
             "type": "PushEvent",
             "actor": { "id": 801183, "login": "tvansteenburgh" },
             "repo": { "id": 42362423, "name": "juju-solutions/review-queue" }}
    If we parse the above JSON in almost any language, we get a series of nested maps:
        Map(id -> 5542101946, type -> PushEvent, actor -> Map(id -> 801183.0, login -> tvansteenburgh), repo -> Map(id -> 4.2362423E7, name -> juju-solutions/review-queu
  11. Relations

    An n-tuple is a sequence of n elements, whose types are known.
        val record = Tuple4[Int, String, String, Int](1, "Georgios", "Mekelweg", 4)
    A relation is a Set of n-tuples (d1, d2, ..., dn).
    Relations are very important for data processing, as they form the theoretical framework (Relational Algebra) for relational (SQL) databases.
    Typical operations on relations are insert, remove and join. Join allows us to compute new relations by joining existing ones on common fields.
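To illustrate how a join computes a new relation from two existing ones, here is a minimal plain-Python sketch. The sample tuples (beyond the name and street used on the slide) and the `join` helper are made up for illustration, and the nested-loop strategy is the simplest possible one, not what a database would actually use.

```python
# Two relations, each modelled as a set of tuples (illustrative data).
persons = {(1, "Georgios"), (2, "Maria")}                   # (id, name)
addresses = {(10, 1, "Mekelweg", 4), (11, 2, "Main", 7)}    # (id, person_id, street, number)

def join(left, right, left_key, right_key):
    """Naive nested-loop join on one field position of each relation."""
    return {l + r for l in left for r in right if l[left_key] == r[right_key]}

# Join persons.id (position 0) with addresses.person_id (position 1).
joined = join(persons, addresses, 0, 1)
```

Each tuple in `joined` is the concatenation of a person tuple and the address tuple that shares its id.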
  12. Key/Value pairs

    A key/value pair (or KV) is a special type of a Map, where a key k does not have to appear once. Key/Value pairs are usually implemented as a Map whose keys are of a sortable type K (e.g. Int) and values are a Set of elements of type V.
        val kv = Map[K, Set[V]]()
    Another way to represent a K/V pair is a List of n-tuples (d1, d2, ..., dn).
    K and V are flexible: that’s why the Key/Value abstraction is key to NoSQL databases, including MongoDB, DynamoDB, Redis etc. Those databases sacrifice, among others, type
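Both representations of key/value data described above can be sketched in a few lines of Python. The sample keys and values are invented for illustration.

```python
from collections import defaultdict

# Representation 1: a map from each key to the set of its values.
kv = defaultdict(set)
for key, value in [("lang", "scala"), ("lang", "python"), ("tool", "spark")]:
    kv[key].add(value)

# Representation 2: a flat list of (key, value) tuples,
# sorted here only to make the order deterministic.
pairs = sorted((k, v) for k, vs in kv.items() for v in vs)
```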
  13. Functional programming

    Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data (Wikipedia).
    Functional programming characteristics:
    - Absence of side-effects: A function, given an argument, always returns the same results irrespective of and without modifying its environment.
    - Higher-order functions: Functions can take functions as arguments to parametrise their behavior
    - Laziness: The art of waiting to compute till you can wait no more
  14. Function signatures

        foo(x : [A], y : B) → C
    - foo: function name
    - x and y: Names of function arguments
    - [A] and B: Types of function arguments
    - →: Denotes the return type
    - C: Type of the returned result
    - [A]: Denotes that type A can be traversed
    We read this as: Function foo takes as arguments an array/list
  15. Side effects

    A function has a side effect if it modifies some state outside its scope or has an observable interaction with its calling functions or the outside world besides returning a value.
        max = -1
        def ge(a, b):
            global max
            if a >= b:
                max = a ## <- Side effect!
                return True
            else:
                max = b
                return False
    As a general rule, any function that returns nothing (void or Unit) does a side effect!
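For contrast with the side-effecting example above, a side-effect-free variant could return the larger value alongside the comparison result instead of writing to a global. This is one possible rewrite, not taken from the slides; `ge_pure` is a hypothetical name.

```python
def ge_pure(a, b):
    """Return (a >= b, larger value) without touching any global state."""
    return (a >= b, a if a >= b else b)

# The caller receives everything it needs through the return value.
is_ge, largest = ge_pure(3, 5)
```

Because the function depends only on its arguments, calling it twice with the same inputs always yields the same result.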
  16. Examples of side effects

    - Setting a field on an object: OO is not FP!
    - Modifying a data structure in place: In FP, data structures are always immutable.
    - Throwing an exception or halting with an error: In FP, we use types that encapsulate and propagate erroneous behaviour
    - Printing to the console or reading user input, reading persistent
  17. Higher-Order functions

    A higher-order function is a function that can take a function as an argument or return a function.
        class Array[A] {
          // Return elements that satisfy f
          def filter(f: A => Boolean) : Array[A]
        }
    In the context of BDP, higher-order functions capture common idioms of processing data as enumerated elements, e.g. going over all elements, selectively removing elements and aggregating them.
  18. Important higher-order functions

    map(xs: [A], f: A => B) : [B]
    Applies f to all elements and returns a new list.
    flatMap(xs: [A], f: A => [B]) : [B]
    Like map, but flattens the result to a single list.
    fold(xs: [A], f: (B, A) => B, init: B) : B
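The three signatures above can be sketched in plain Python as follows. The function names `map_`, `flat_map` and `fold` are chosen here to avoid clashing with Python built-ins; they are illustrative, not part of any library.

```python
from functools import reduce
from itertools import chain

def map_(xs, f):
    """Apply f to every element and return a new list."""
    return [f(x) for x in xs]

def flat_map(xs, f):
    """Like map_, but f returns a list and the results are flattened."""
    return list(chain.from_iterable(f(x) for x in xs))

def fold(xs, f, init):
    """Combine all elements into one value, starting from init."""
    return reduce(f, xs, init)

lines = ["a b", "c"]
words = flat_map(lines, lambda l: l.split(" "))   # split lines into words
lengths = map_(words, len)                        # length of each word
total = fold(lengths, lambda acc, x: acc + x, 0)  # sum of all lengths
```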
  19. Aux higher-order functions

    groupBy(xs: [A], f: A => K): Map[K, [A]]
    Partitions xs into a map of traversable collections according to some discriminator function.
    filter(xs: [A], f: A => Boolean) : [A]
    Takes a function that returns a boolean and returns all elements that satisfy it
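A minimal plain-Python sketch of these two functions, assuming lists as the traversable collection; `group_by` and `filter_` are illustrative names.

```python
def group_by(xs, f):
    """Partition xs into a dict keyed by the discriminator f(x)."""
    groups = {}
    for x in xs:
        groups.setdefault(f(x), []).append(x)
    return groups

def filter_(xs, f):
    """Keep only the elements for which the predicate f holds."""
    return [x for x in xs if f(x)]

by_initial = group_by(["foo", "bar", "baz"], lambda w: w[0])
long_words = filter_(["a", "the", "odyssey"], lambda w: len(w) > 1)
```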
  20. foldL and foldR

    foldL(xs: [A], f: (B, A) => B, init: B) : B
    foldR(xs: [A], f: (A, B) => B, init: B) : B
    Both take almost the same arguments and return the same results. What is the difference in their evaluation?
    How does foldL work?
        print reduce(reduce_pp, range(1,10), 0)
        ## (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)
    How does foldR work?
        print reduceR(reduce_pp, range(1,10), 0)
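The difference in evaluation order can be made visible with a small plain-Python sketch. `fold_l` and `fold_r` are illustrative implementations written for this example (they are not the `reduce` and `reduceR` helpers used on the slide); the folding function builds a string so that the nesting of the calls shows up in the result.

```python
def fold_l(xs, f, init):
    acc = init
    for x in xs:              # left to right: f(f(f(init, x1), x2), x3)
        acc = f(acc, x)
    return acc

def fold_r(xs, f, init):
    acc = init
    for x in reversed(xs):    # right to left: f(x1, f(x2, f(x3, init)))
        acc = f(x, acc)
    return acc

left = fold_l([1, 2, 3], lambda acc, x: "(%s + %s)" % (acc, x), "0")
right = fold_r([1, 2, 3], lambda x, acc: "(%s + %s)" % (x, acc), "0")
# left  is "(((0 + 1) + 2) + 3)"
# right is "(1 + (2 + (3 + 0)))"
```

For addition both nestings evaluate to the same number; for a non-associative function such as subtraction they do not.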
  21. Laziness

    Laziness is an evaluation strategy which delays the evaluation of an expression until its value is needed. It enables:
    - Separating a pipeline construction from its evaluation
    - Not requiring to read datasets in memory: we can process them in lazy-loaded batches
    - Generating infinite collections
    - Optimising execution plans
        def primes(limit):
            sieve = [False]*limit
            for j in xrange(2, limit):
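Python generators give a small, self-contained illustration of these points: the pipeline below is built over an infinite collection, and nothing is computed until results are requested. The names `naturals` and `evens_squared` are made up for this sketch.

```python
from itertools import count, islice

def naturals():
    """An infinite collection: 1, 2, 3, ... (nothing is computed yet)."""
    return count(1)

# Constructing the pipeline evaluates nothing.
evens_squared = (n * n for n in naturals() if n % 2 == 0)

# Asking for results forces evaluation, and only as far as needed.
first_three = list(islice(evens_squared, 3))
```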
  22. Apache Spark

  23. What is Spark?

    Spark is an open source cluster computing framework. It:
    - automates distribution of data and computations on a cluster of computers
    - provides a fault-tolerant abstraction to distributed datasets
    - is based on functional programming primitives
    - provides two abstractions to data, list-like (RDDs) and table-like (Datasets)
  24. Resilient Distributed Datasets (RDDs)

    RDDs are the core abstraction that Spark uses. RDDs make datasets distributed over a cluster of machines look like a Scala collection.
    RDDs:
    - are immutable
    - reside (mostly) in memory
    - are transparently distributed
    - feature all FP programming primitives and, in addition, more to minimize shuffling
    In practice, RDD[A] works like Scala’s List[A], with some gotchas
  25. Counting words with Spark

        val rdd = sc.textFile("./datasets/odyssey.mb.txt")
        rdd.
          flatMap(l => l.split(" ")).          // Split file in words
          map(x => (x, 1)).                    // Create key,1 pairs
          reduceByKey((acc, x) => acc + x).    // Count occurrences of same pairs
          sortBy(x => x._2, false).            // Sort by number of occurrences
          take(50).                            // Take the first 50 results
          foreach(println)
    The same code works on one computer or on a cluster of 100s of computers.
  26. How to create an RDD?

    RDDs can only be created in the following 3 ways
    1. Reading data from external sources
        val rdd1 = sc.textFile("hdfs://...")
        val rdd2 = sc.textFile("file://odyssey.txt")
        val rdd3 = sc.textFile("s3://...")
    2. Convert a local memory dataset to a distributed one
        val xs: Range = Range(1, 10000)
        val rdd: RDD[Int] = sc.parallelize(xs)
    3. Transform an existing RDD
        rdd.map(x => x.toString) //returns an RDD[String]
  27. RDDs are lazy!

    There are two types of operations we can do on an RDD:
    - Transformation: Applying a function that returns a new RDD. They are lazy.
    - Action: Request the computation of a result. They are eager.
        // This just sets up the pipeline
        val result = rdd.
          flatMap(l => l.split(" ")).
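The split between lazy transformations and eager actions can be modelled in a few lines of plain Python. The `LazySeq` class below is a toy stand-in written for this illustration; it is not Spark code and mirrors only the idea, not the RDD API.

```python
class LazySeq:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self.source, self.ops = source, ops

    def map(self, f):                      # transformation: returns a new LazySeq
        return LazySeq(self.source, self.ops + (("map", f),))

    def filter(self, f):                   # transformation: returns a new LazySeq
        return LazySeq(self.source, self.ops + (("filter", f),))

    def collect(self):                     # action: evaluates the whole pipeline
        items = iter(self.source)
        for kind, f in self.ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)

calls = []

def traced_upper(w):
    calls.append(w)                        # record every evaluation
    return w.upper()

# This just sets up the pipeline; traced_upper has not run yet.
pipeline = LazySeq(["sing", "o", "muse"]).map(traced_upper).filter(lambda w: len(w) > 1)
calls_before_action = len(calls)

# The action triggers the actual computation.
result = pipeline.collect()
```

The `calls` list stays empty until `collect()` is invoked, which is the behaviour the slide describes for transformations and actions.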
  28. Examples of RDD transformations

        val odyssey = sc.textFile("datasets/odyssey.mb.txt").
          flatMap(_.split(" "))
    All uses of articles in the Odyssey
        odyssey.map(_.toLowerCase).
          filter(Seq("a", "the").contains(_))
    Q: How can we find uses of all regular verbs in past tense?
        odyssey.filter(x => x.endsWith("ed"))
    Q: How can we remove all punctuation marks?
        odyssey.map(x => x.replaceAll("\\p{Punct}|\\d", ""))
  29. Common actions on RDD[A]

    - collect: Return all elements of an RDD
        RDD.collect() : Array[A]
    - take: Return the first n elements of the RDD
        RDD.take(n) : Array[A]
    - reduce, fold: Combine all elements to a single result of the same type.
        RDD.reduce(f : (A, A) → A) : A
    - aggregate: Aggregate the elements of each partition, and then the r
  30. Examples of RDD actions

        val odyssey = sc.textFile("datasets/odyssey.mb.txt").flatMap(_.split(
    How many words are there?
        odyssey.map(x => 1).reduce((a,b) => a + b)
    How can we sort the RDD?
        odyssey.sortBy(x => x)
    How can we sample data from the RDD?
        val Array(train, test) = odyssey.randomSplit(Array(0.8, 0.2))
  31. Pair RDDs

    RDDs can represent any complex data type, if it can be iterated. Spark treats RDDs of the type RDD[(K,V)] as special, named PairRDDs, as they can be both iterated and indexed.
    Operations such as join are only defined on Pair RDDs.
    We can create Pair RDDs by applying an indexing function or by grouping records:
        val rdd = sc.parallelize(List("foo", "bar", "baz")) // RDD[String]
        val pairRDD = rdd.map(x => (x.charAt(0), x)) // RDD[(Char, String)]
        pairRDD.collect // Array((f,foo), (b,bar), (b,baz))
        val pairRDD2 = rdd.groupBy(x => x.charAt(0)) // RDD[(Char, Iterable(
        pairRDD2.collect //Array((b,CompactBuffer(bar, baz)), (f,CompactBuffer(foo)))
  32. Transformations on Pair RDDs

    The following functions are only available on RDD[(K,V)]
    - reduceByKey: Merge the values for each key using an associative and commutative reduce function
        reduceByKey(f : (V, V) → V) : RDD[(K, V)]
    - aggregateByKey: Aggregate the values of each key, using given com and a neutral “zero value”
        aggrByKey(zero : U)(f : (U, V) → U, g : (U, U) → U) :
    - join: Return an RDD containing all pairs of elements with matching keys
        join(b : RDD[(K, W)]) : RDD[(K, (V, W))]
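The semantics of reduceByKey and join can be modelled on ordinary Python lists of (key, value) pairs. This is a single-machine sketch meant only to show what the operations compute; the helper names and sample data are illustrative, and none of Spark's distribution or shuffling is represented.

```python
def reduce_by_key(pairs, f):
    """Merge the values of each key with the binary function f."""
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

def join(left, right):
    """Inner join of two lists of (k, v) pairs, yielding (k, (v, w))."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y)
joined = join([(1, "Georgios")], [(1, "Mekelweg"), (2, "Elsewhere")])
```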
  33. Pair RDD example: aggregateByKey

    How can we count the number of occurrences of part of speech elements?
        object PartOfSpeach {
          sealed trait EnumVal
          case object Verb extends EnumVal
          case object Noun extends EnumVal
          case object Article extends EnumVal
          case object Other extends EnumVal
          val partsOfSpeach = Seq(Verb, Noun, Article, Other)
        }
        def partOfSpeach(w: word): PartOfSpeach = ...
        odyssey.groupBy(partOfSpeach).
          aggregateByKey(0)((acc, x) => acc + 1,
                            (x, y) => x + y)
    D: What type conversions take place here?
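The two functions passed to aggregateByKey play different roles: the first folds values into an accumulator within a partition, the second merges accumulators across partitions. The plain-Python model below makes the two steps explicit; the partitions are simulated as nested lists, and the helper name and sample words are assumptions for illustration.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Model of aggregateByKey over a list of partitions of (k, v) pairs."""
    partials = []
    for part in partitions:                 # step 1: fold each partition locally
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    result = {}
    for acc in partials:                    # step 2: merge per-partition results
        for k, u in acc.items():
            result[k] = comb_op(result[k], u) if k in result else u
    return result

# Two simulated partitions of (part of speech, word) pairs.
partitions = [[("verb", "sing"), ("noun", "muse")], [("verb", "tell")]]
counts = aggregate_by_key(partitions, 0,
                          lambda acc, _: acc + 1,   # count one more value
                          lambda x, y: x + y)       # add partial counts
```

Note that the accumulator type (an integer count) differs from the value type (a word), which is what distinguishes this operation from reduceByKey.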
  34. Pair RDD example: join

        case class Person(id: Int, name: String)
        case class Addr(id: Int, person_id: Int,
                        address: String, number: Int)
        val pers = sc.textFile("pers.csv") // id, name
        val addr = sc.textFile("addr.csv") // id, person_id, street, num
        val ps = pers.map(_.split(",")).map(r => Person(r(0).toInt, r(1)))
        val as = addr.map(_.split(",")).map(r => Addr(r(0).toInt, r(1).toInt,
                                                      r(2), r(3).toInt))
    Q: What are the types of ps and as? How can we join them?
        val pairPs = ps.keyBy(_.id)
        val pairAs = as.keyBy(_.person_id)
        val addrForPers = pairAs.join(pairPs) // RDD[(Int, (Addr, Person))]
  35. Spark SQL

    In Spark SQL, we trade some of the freedom provided by the RDD API to enable:
    - declarativity, in the form of SQL
    - automatic optimizations, similar to ones provided by databases
      - execution plan optimizations
      - data movement/partitioning optimizations
    The price we have to pay is to bring our data to a (semi-)tabular format and describe its schema. Then, we let
  36. Spark SQL basics

    SparkSQL is a library built on top of Spark RDDs. It provides two main abstractions:
    - Datasets, collections of strongly-typed objects. Scala/Java only!
    - Dataframes, essentially a Dataset[Row], where Row ≈ Array[Object]. Equivalent to R or Pandas Dataframes
    - SQL syntax
    It can directly connect and use structured data sources (e.g. SQL databases) and can import CSV, JSON, Parquet,
  37. Creating Data Frames and Datasets

    1. From RDDs containing tuples, e.g. RDD[(String, Int, String)]
        import spark.implicits._
        val df = rdd.toDF("name", "id", "address")
    2. From RDDs with known complex types, e.g. RDD[Person]
        val df = persons.toDF() // Column names/types are inferred!
  38. 3. From RDDs, with manual schema definition

        val schema = StructType(Array(
          StructField("level", StringType, nullable = true),
          StructField("date", DateType, nullable = true),
          StructField("client_id", IntegerType, nullable = true),
          StructField("stage", StringType, nullable = true),
          StructField("msg", StringType, nullable = true)
        ))
        val rowRdd = sc.textFile("ghtorrent-log.txt").
          map(_.split("#")).
          map(r => Row(r(0), new Date(r(1)), r(2).toInt, r(3), r(4)))
        val logDF = spark.createDataFrame(rowRdd, schema)
    4. By reading (semi-)structured data files
        // Scala
        val df = spark.read.json("examples/src/main/resources/people.json")
        # Python
        df = sqlContext.read.csv("/datasets/pullreqs.csv", sep=",", header=True,
                                 inferSchema=True)
  39. Spark cluster architecture

  40. Using Spark for structured data

  41. Spark as an efficient Pandas/R backend

    While R and Python are handy with small CSV files, they can be very slow when the number of CSV file lines reaches 10^6.
    Spark offers a very versatile structured data framework and can act as an efficient backend for:
    - Interactive exploration
    - Ad-hoc querying and joining structured data
    - Machine learning applications
  42. Our running example: Pull Requests!

    The dataset we are using is from Gousios and Zaidman, 2014.
        from pyspark.sql.types import *
        from pyspark.sql.functions import *
        from pyspark.sql import SQLContext
        sqlContext = SQLContext(sc)
        df = sqlContext.read.csv("hdfs://athens:8020/pullreqs.csv",
                                 sep=",", header=True, inferSchema=True).\
                                 cache()
        sqlContext.registerDataFrameAsTable(df, "pullreqs")
  43. Running SQL queries

    Listing projects
        sqlContext.sql("select distinct(project_name) from pullreqs").show(10
    Check how PRs are merged
        sqlContext.sql("""select merged_using, count(*) as occurences
                          from pullreqs
                          group by merged_using
                          order by occurences desc""").show()
        +--------------------+----------+
        |        merged_using|occurences|
        +--------------------+----------+
        |              github|    364528|
        |   commits_in_master|    342339|
        |             unknown|    138566|
        |  merged_in_comments|     29273|
        |commit_sha_in_com...|     23234|
        |     fixes_in_commit|     18125|
  44. Nested SQL queries

    Queries can be complicated. Here we use a nested query to get the projects per programming language.
        sqlContext.sql("""select lang, count(*) as num_projects
                          from (
                            select distinct(project_name), lang
                            from pullreqs
                          ) as project_langs
                          group by lang""").show()
        +----------+------------+
        |      lang|num_projects|
        +----------+------------+
        |javascript|        1726|
        |    python|        1518|
        |      ruby|        1086|
        |      java|        1075|
        |     scala|         138|
  45. Joining across data sources

    Suppose we would like to get some more info about the PR mergers from GHTorrent.
        users = sqlContext.read.format("jdbc").options(
            url='jdbc:mysql://munich/ghtorrent?&serverTimezone=UTC',
            driver='com.mysql.jdbc.Driver',
            dbtable='users',
            user='ght', password='ght',
            partitionColumn = "id",
            numPartitions = "56",
            lowerBound = "0", upperBound = "40341639").\
            load().cache()
        sqlContext.registerDataFrameAsTable(users, "users")
    What is important here is partitioning: this will split the MySQL table into numPartitions partitions and allow for parallel processing of the data. If the table is small,
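To make the partitioning idea concrete, the following plain-Python sketch splits the id range from the options above into numPartitions contiguous ranges, each of which could be fetched by a separate query. It is a simplified model of range partitioning, not Spark's actual JDBC implementation, and the `partition_ranges` helper is hypothetical.

```python
def partition_ranges(lower, upper, num_partitions):
    """Split [lower, upper) into contiguous half-open ranges of near-equal width."""
    stride = (upper - lower) // num_partitions
    bounds = [lower + i * stride for i in range(num_partitions)] + [upper]
    return list(zip(bounds[:-1], bounds[1:]))

# Same bounds and partition count as in the JDBC options on the slide.
ranges = partition_ranges(0, 40341639, 56)
```

Each (start, end) pair corresponds to one partition of the partitionColumn; the last range absorbs the remainder left over by the integer stride.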
  46. Joining data sources

        sqlContext.sql("""select distinct(u.id), u.login, u.country_code
                          from users u join pullreqs pr on u.login = pr.merger
                          where country_code != 'null'""").show(10)
        +-------+----------------+------------+
        |     id|           login|country_code|
        +-------+----------------+------------+
        |2870788|Bernardstanislas|          fr|
        |1136167|CamDavidsonPilon|          ca|
        |  35273|      DataTables|          gb|
        |2955396|         Drecomm|          nl|
        |2468606|      Gaurang033|          in|
        |2436711|         JahlomP|          gh|
        |8855272|     JonnyWong16|          ca|
        | 624345|          M2Ys4U|          gb|
        |1185808|         PierreZ|          fr|
        +-------+----------------+------------+
    This returns the expected results, even though the data resides in 2 (very) different sources.
  47. Exporting to Pandas/R

    Spark only offers basic statistics; fortunately, we can easily export data to Pandas/R.
        import pandas as pd
        pandas_df = sqlContext.sql(
            """select project_name, count(*) as num_prs
               from pullreqs
               group by project_name""").toPandas()
        pandas_df.describe()

                   num_prs
        count  5543.000000
        mean    165.265199
        std     407.276860
        min       1.000000
  48. Machine learning with Spark

    Spark has a very nice Machine learning library. The general idea is that we need to bring our data in a format that MLlib understands and then we can fit and evaluate several ready-made algorithms.
    The reshaping process consists of:
    - Converting factors to use OneHot encoding
    - Converting booleans to integers
    - Creating training and testing datasets
    The transformations are always done on DataFrames, in a pipeline fashion. We also need to specify an evaluation function.
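The first two reshaping steps can be illustrated without Spark. The sketch below one-hot encodes a categorical column and converts booleans to integers in plain Python; the `one_hot` helper and the sample values are assumptions for illustration, and Spark's own encoders may differ in details such as category ordering.

```python
def one_hot(values):
    """Encode each categorical value as a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return categories, vectors

# A factor (categorical) column and a boolean column, as plain lists.
categories, encoded = one_hot(["ruby", "java", "ruby"])
merged_int = [int(b) for b in [True, False, True]]
```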
  49. Data preparation examples

    - One Hot encoding for factors
    - Defining a transformation pipeline
    - Creating train and test datasets
        # Convert feature columns to numeric vectors
        onehot = VectorAssembler(inputCols=feature_cols, outputCol='features'
        pipeline = Pipeline(stages=[onehot])
        allData = pipeline.fit(df).transform(df).cache()
        (train, test) = allData.randomSplit([0.9, 0.1], seed=42)
  50. Our evaluation function

    We just compare classifiers based on AUC
        from pyspark.ml.evaluation import BinaryClassificationEvaluator
        ## Calculate and return the AUC metric
        def evaluate(testData, predictions):
            evaluator = BinaryClassificationEvaluator(labelCol="merged_int", rawPredictionCol="rawPr
            print "AUC: %f" % evaluator.evaluate(predictions)
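As background on the metric itself: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it directly from that definition on made-up labels and scores; it is not how Spark's evaluator is implemented, and it assumes both classes are present.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes at least one positive
    and one negative label.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
score = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```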
  51. Random Forests and Gradient Boosting

        from pyspark.ml.classification import RandomForestClassifier as RF
        from pyspark.ml.evaluation import BinaryClassificationEvaluator
        rf = RF(labelCol='merged_int', featuresCol='features',
                numTrees=100, maxDepth=5)
        rfModel = rf.fit(train)
        evaluate(test, rfModel.transform(test))
    AUC is 0.780482
        from pyspark.ml.classification import GBTClassifier
        gbt = GBTClassifier(maxIter=10, maxDepth=5,
                            labelCol="merged_int", seed=42)
        gbtModel = gbt.fit(train)
        evaluate(test, gbtModel.transform(test))
    AUC is 0.792181
  52. Mining repositories with Spark

  53. The source{d} MSR stack

    source{d} is a startup that develops tools for doing research on Big Code:
    - The Public Git Archive is a dataset containing all GitHub repos with more than 50 stars
    - Engine is a Spark plugin that enables access to multiple git repos
  54. Downloading data: The Public Git Archive

    The Public Git Archive is a dataset containing all GitHub repos with more than 50 stars. It comes with a command-line tool, pga, which allows users to selectively download repos in the custom siva format, which is suitable for use on HDFS.
        $ pga list -u incubator |wc -l
        1251
        $ pga get -u incubator # Retrieve data
        $ hadoop fs -put incubator / # Put data to HDFS
  55. Using the source{d} engine

        import tech.sourced.engine._
        val engine = Engine(spark, "hdfs://athens:8020/incubator/latest/*/",
        val repos = engine.getRepositories
        val commits = engine.getRepositories.getReferences.getAllReferenceCom
        repos.createOrReplaceTempView("repos")
        commits.createOrReplaceTempView("commits")
  56. Running arbitrary queries

    Seeing how many repos we have
        spark.sql("select count(*) from repos").show(10, false)
    Commits per repo
        spark.sql("select repository_id, count(*) from commits group by repository_id order by count(*) desc").show(10, false)
  57. References / License

    If you are interested in Big Data Processing, you might want to have a look at my course at TU Delft: Big Data Processing.
    This work is (c) 2017 - onwards by Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.