Slide 1

Slide 1 text

Mining repositories with Apache Spark Georgios Gousios 18 June 2018 1

Slide 2

Slide 2 text

The tutorial Welcome! In this tutorial, we will go through the following 3 topics: Enough FP to become dangerous Internals of Apache Spark Using Apache Spark for common MSR data tasks 2

Slide 3

Slide 3 text

Programming languages The de facto languages of Big Data and data science are Scala Mostly used for data intensive systems Python Mostly used for data analytics tasks Other languages include Java The “assembly” of big data systems; the language that most big data infrastructure is written in. R The statistician’s tool of choice. Great selection of libraries for serious data analytics, great plotting tools. In our tutorial, we will be using Scala and Python. 3

Slide 4

Slide 4 text

Data processing with Functional Programming 4 . 1

Slide 5

Slide 5 text

Types of data Unstructured: Data whose format is not known Raw text documents HTML pages Semi-Structured: Data with a known format. Pre-parsed data to standard formats: JSON, CSV, XML Structured: Data with known formats, linked together in graphs or tables 4 . 2

Slide 6

Slide 6 text

Sequences / Lists Sequences or Lists or Arrays represent consecutive items in memory In Python: a = [1, 2, 3, 4] In Scala: val l = List(1,2,3,4) Basic properties: Size is bounded by memory Items can be accessed by an index: a[1] or l(3) Items can only be inserted at the end (append) 4 . 3

Slide 7

Slide 7 text

Sets Sets store values, without any particular order, and no repeated values. Basic properties: Size is bounded by memory Can be queried for containment Set operations: union, intersection, difference, subset scala> val s = Set(1,2,3,4,4) s: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4) 4 . 4
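To make the listed set operations concrete, here is a short Scala sketch (the values are made up for illustration):
val a = Set(1, 2, 3)
val b = Set(3, 4)
a.contains(3)           // true: containment query
a union b               // Set(1, 2, 3, 4)
a intersect b           // Set(3)
a diff b                // Set(1, 2)
Set(1, 2).subsetOf(a)   // true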

Slide 8

Slide 8 text

Maps or Dictionaries Maps or Dictionaries or Associative Arrays are collections of (k,v) pairs in which each k appears only once. Some languages have built-in support for Dictionaries, e.g. in Python: a = {'a' : 1, 'b' : 2} Basic properties: One key always corresponds to one value. Accessing a value given a key is very fast (≈ O(1)) 4 . 5
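The same structure in Scala; a minimal sketch showing lookup and the fact that immutable maps are updated by copying (values invented for illustration):
val m = Map("a" -> 1, "b" -> 2)
m("a")              // 1: (near) constant-time lookup
m.get("c")          // None: safe lookup returns an Option
m + ("c" -> 3)      // Map(a -> 1, b -> 2, c -> 3): a new map, m is unchanged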

Slide 9

Slide 9 text

Nested data types: Graphs A graph data structure consists of a finite set of vertices or nodes, together with a set of unordered pairs of these vertices (the edges) for an undirected graph or a set of ordered pairs for a directed graph. Nodes can contain attributes Edges can contain weights and directions Graphs are usually represented as Map[Node, List[Edge]], where
case class Node(id: Int, attributes: Map[A, B])
case class Edge(a: Node, b: Node, directed: Option[Boolean], weight: Option[Double])
4 . 6

Slide 10

Slide 10 text

Nested data types: Trees Ordered graphs without loops
a = {"id": "5542101946", "type": "PushEvent",
     "actor": { "id": 801183, "login": "tvansteenburgh" },
     "repo": { "id": 42362423, "name": "juju-solutions/review-queue" }}
If we parse the above JSON in almost any language, we get a series of nested maps
Map(id -> 5542101946, type -> PushEvent,
    actor -> Map(id -> 801183.0, login -> tvansteenburgh),
    repo -> Map(id -> 4.2362423E7, name -> juju-solutions/review-queue))
4 . 7

Slide 11

Slide 11 text

Relations An n-tuple is a sequence of n elements (d1, d2, ..., dn), whose types are known. val record = Tuple4[Int, String, String, Int](1, "Georgios", "Mekelweg", 4) A relation is a Set of n-tuples. Relations are very important for data processing, as they form the theoretical framework (Relational Algebra) for relational (SQL) databases. Typical operations on relations are insert, remove and join. Join allows us to compute new relations by joining existing ones on common fields. 4 . 8
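As a small sketch of a join on plain Scala sets of tuples (the records are invented for illustration; real joins are done by the database or, later in this tutorial, by Spark):
val people  = Set((1, "Georgios"), (2, "Alice"))        // (id, name)
val address = Set((1, "Mekelweg 4"), (2, "Main St 1"))  // (person_id, address)
val joined = for {
  (id, name)  <- people
  (pid, addr) <- address
  if id == pid
} yield (id, name, addr)  // Set((1,Georgios,Mekelweg 4), (2,Alice,Main St 1))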

Slide 12

Slide 12 text

Key/Value pairs A key/value pair (or KV) is a special type of a Map, where a key k does not have to appear only once. Key/Value pairs are usually implemented as a Map whose keys are of a sortable type K (e.g. Int) and values are a Set of elements of type V: val kv = Map[K, Set[V]]() Another way to represent a K/V pair is a List of n-tuples (d1, d2, ..., dn). K and V are flexible: that’s why the Key/Value abstraction is key to NoSQL databases, including MongoDB, DynamoDB, Redis etc. Those databases sacrifice, among others, type safety. 4 . 9
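A tiny Scala sketch of the Map[K, Set[V]] encoding (keys and values are made up):
val kv = Map(
  1 -> Set("opened", "merged"),
  2 -> Set("opened")
)
kv(1)                              // Set(opened, merged): one key, many values
kv.getOrElse(3, Set.empty[String]) // missing key: empty set instead of an error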

Slide 13

Slide 13 text

Functional programming Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data (Wikipedia). Functional programming characteristics: Absence of side-effects: A function, given an argument, always returns the same results irrespective of and without modifying its environment. Higher-order functions: Functions can take functions as arguments to parametrise their behavior Laziness: The art of waiting to compute till you can wait no more 4 . 10
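A minimal Scala sketch of the first characteristic (names and numbers invented for illustration):
def addVat(price: Double, rate: Double): Double = price * (1 + rate)  // pure: same inputs, same output

var surcharge = 0.0
def addVatImpure(price: Double): Double = {
  surcharge = surcharge + 0.01   // side effect: mutates state outside the function
  price * 1.21 + surcharge       // the same call can now return different results
}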

Slide 14

Slide 14 text

Function signatures
foo(x: [A], y: B) → C
foo: function name
x and y: names of function arguments
[A] and B: types of function arguments
→: denotes the return type
C: type of the returned result
[A]: denotes that type A can be traversed
We read this as: Function foo takes as arguments an array/list of elements of type A and an argument of type B, and returns a result of type C
4 . 11
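For instance, the signature filter(xs: [A], f: A => Boolean): [A] used later in this deck corresponds, roughly, to the following Scala declaration (a sketch, not the standard library definition):
def filter[A](xs: List[A], f: A => Boolean): List[A] = xs.filter(f)

filter(List(1, 2, 3, 4), (x: Int) => x % 2 == 0)  // List(2, 4)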

Slide 15

Slide 15 text

Side effects A function has a side effect if it modifies some state outside its scope or has an observable interaction with its calling functions or the outside world besides returning a value. As a general rule, any function that returns nothing (void or Unit) has a side effect!
max = -1
def ge(a, b):
    global max
    if a >= b:
        max = a   ## <- Side effect!
        return True
    else:
        max = b
        return False
4 . 12
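A side-effect-free alternative in Scala (a sketch: the caller gets the maximum as part of the return value instead of through a global):
def ge(a: Int, b: Int): (Boolean, Int) =
  if (a >= b) (true, a) else (false, b)

val (aIsGreater, max) = ge(3, 5)  // (false, 5): no external state is touched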

Slide 16

Slide 16 text

Examples of side effects Setting a field on an object: OO is not FP! Modifying a data structure in place: In FP, data structures are always immutable. Throwing an exception or halting with an error: In FP, we use types that encapsulate and propagate erroneous behaviour Printing to the console or reading user input, reading from or writing to persistent storage: interacting with the outside world is a side effect 4 . 13

Slide 17

Slide 17 text

Higher-Order functions A higher-order function is a function that can take a function as an argument or return a function. In the context of BDP, higher-order functions capture common idioms of processing data as enumerated elements, e.g. going over all elements, selectively removing elements and aggregating them.
class Array[A] {
  // Return elements that satisfy f
  def filter(f: A => Boolean): Array[A]
}
4 . 14
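A short usage sketch: filter's behaviour is parametrised by the function we pass to it (sample data invented):
val words = Array("the", "odyssey", "of", "homer")
words.filter(w => w.length > 3)   // Array(odyssey, homer)
words.filter(_.startsWith("o"))   // Array(odyssey, of)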

Slide 18

Slide 18 text

Important higher-order functions map(xs: [A], f: A => B) : [B] Applies f to all elements and returns a new list. flatMap(xs: [A], f: A => [B]) : [B] Like map, but flattens the result to a single list. fold(xs: [A], f: (B, A) => B, init: B) : B Combines all elements into a single result of type B, starting from init and applying f to each element in turn. 4 . 15
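The three of them on a small Scala list (a sketch with made-up data):
val xs = List(1, 2, 3)
xs.map(x => x * 2)                // List(2, 4, 6)
xs.flatMap(x => List(x, x * 10))  // List(1, 10, 2, 20, 3, 30)
xs.fold(0)((acc, x) => acc + x)   // 6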

Slide 19

Slide 19 text

Aux higher-order functions groupBy(xs: [A], f: A => K): Map[K, [A]] Partitions xs into a map of traversable collections according to some discriminator function. filter(xs: [A], f: A => Boolean) : [A] Takes a function that returns a boolean and returns all elements that satisfy it 4 . 16
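Again on a small Scala list (sample data invented):
val words = List("ant", "bee", "bat")
words.groupBy(w => w.head)          // Map(a -> List(ant), b -> List(bee, bat))
words.filter(w => w.contains("a"))  // List(ant, bat)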

Slide 20

Slide 20 text

foldL and foldR foldL(xs: [A], f: (B, A) => B, init: B) : B foldR(xs: [A], f: (A, B) => B, init: B) : B Both take almost the same arguments and return the same results. What is the difference in their evaluation? How does foldL work? How does foldR work?
print reduce(reduce_pp, range(1,10), 0)
## (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)
print reduceR(reduce_pp, range(1,10), 0)
## (1 + (2 + (3 + (4 + (5 + (6 + (7 + (8 + (9 + 0)))))))))
4 . 17
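The same experiment in Scala; building a string makes the association order visible (a small sketch):
val xs = (1 to 9).toList
xs.foldLeft("0")((acc, x) => s"($acc + $x)")
// (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)
xs.foldRight("0")((x, acc) => s"($x + $acc)")
// (1 + (2 + (3 + (4 + (5 + (6 + (7 + (8 + (9 + 0)))))))))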

Slide 21

Slide 21 text

Laziness Laziness is an evaluation strategy which delays the evaluation of an expression until its value is needed. Laziness allows for: Separating a pipeline construction from its evaluation Not having to read whole datasets in memory: we can process them in lazily-loaded batches Generating infinite collections Optimising execution plans
def primes(limit):
    sieve = [False]*limit
    for j in xrange(2, limit):
        if not sieve[j]:
            yield j   ## lazily produced, one prime at a time
            for i in xrange(j*j, limit, j): sieve[i] = True
4 . 18
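A Scala sketch of the same idea using Stream, Scala's lazily evaluated list:
val naturals = Stream.from(1)            // an infinite collection
val evens = naturals.filter(_ % 2 == 0)  // elements are evaluated only on demand
evens.take(5).toList                     // List(2, 4, 6, 8, 10)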

Slide 22

Slide 22 text

Apache Spark 5 . 1

Slide 23

Slide 23 text

What is Spark? Spark is an open source cluster computing framework. It automates distribution of data and computations on a cluster of computers, provides a fault-tolerant abstraction to distributed datasets, is based on functional programming primitives and provides two abstractions to data, list-like (RDDs) and table-like (Datasets) 5 . 2

Slide 24

Slide 24 text

Resilient Distributed Datasets (RDDs) RDDs are the core abstraction that Spark uses. RDDs make datasets distributed over a cluster of machines look like a Scala collection. RDDs: are immutable reside (mostly) in memory are transparently distributed feature all FP programming primitives, plus additional ones to minimize shuffling In practice, RDD[A] works like Scala’s List[A], with some gotchas 5 . 3

Slide 25

Slide 25 text

Counting words with Spark The same code works on one computer or on a cluster of 100s of computers.
val rdd = sc.textFile("./datasets/odyssey.mb.txt")
rdd.
  flatMap(l => l.split(" ")).        // Split file in words
  map(x => (x, 1)).                  // Create (key, 1) pairs
  reduceByKey((acc, x) => acc + x).  // Count occurrences of same pairs
  sortBy(x => x._2, false).          // Sort by number of occurrences
  take(50).                          // Take the first 50 results
  foreach(println)
5 . 4

Slide 26

Slide 26 text

How to create an RDD? RDDs can only be created in the following 3 ways 1. Reading data from external sources 2. Converting a local in-memory dataset to a distributed one 3. Transforming an existing RDD
val rdd1 = sc.textFile("hdfs://...")
val rdd2 = sc.textFile("file://odyssey.txt")
val rdd3 = sc.textFile("s3://...")

val xs: Range = Range(1, 10000)
val rdd: RDD[Int] = sc.parallelize(xs)

rdd.map(x => x.toString) // returns an RDD[String]
5 . 5

Slide 27

Slide 27 text

RDDs are lazy! There are two types of operations we can do on an RDD: Transformation: Applying a function that returns a new RDD. They are lazy. Action: Request the computation of a result. They are eager.
// This just sets up the pipeline
val result = rdd.
  flatMap(l => l.split(" ")).
  map(x => (x, 1))
// Nothing is computed until we call an action, e.g. result.count()
5 . 6

Slide 28

Slide 28 text

Examples of RDD transformations All uses of articles in the Odyssey Q: How can we find uses of all regular verbs in past tense? Q: How can we remove all punctuation marks? val odyssey = sc.textFile("datasets/odyssey.mb.txt"). flatMap(_.split(" ")) odyssey.map(_.toLowerCase). filter(Seq("a", "the").contains(_)) odyssey.filter(x => x.endsWith("ed")) odyssey.map(x => x.replaceAll("\\p{Punct}|\\d", "")) 5 . 7

Slide 29

Slide 29 text

Common actions on RDD[A] collect: Return all elements of an RDD take: Return the first n elements of the RDD reduce, fold: Combine all elements to a single result of the same type. aggregate: Aggregate the elements of each partition, and then the results of all partitions. RDD.collect(): Array[A] RDD.take(n): Array[A] RDD.reduce(f: (A, A) => A): A 5 . 8
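A sketch of aggregate on the Odyssey RDD used in the surrounding slides (assuming sc and the dataset path from those examples): it computes the total length and the count of all words in one pass.
val odyssey = sc.textFile("datasets/odyssey.mb.txt").flatMap(_.split(" "))
val (totalLen, count) = odyssey.aggregate((0, 0))(
  (acc, w) => (acc._1 + w.length, acc._2 + 1),  // fold words into a per-partition (length, count) pair
  (a, b)   => (a._1 + b._1, a._2 + b._2)        // merge the pairs of different partitions
)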

Slide 30

Slide 30 text

Examples of RDD actions How many words are there? How can we sort the RDD? How can we sample data from the RDD?
val odyssey = sc.textFile("datasets/odyssey.mb.txt").flatMap(_.split(" "))
odyssey.map(x => 1).reduce((a,b) => a + b)
odyssey.sortBy(x => x)
val Array(train, test) = odyssey.randomSplit(Array(0.8, 0.2))
5 . 9

Slide 31

Slide 31 text

Pair RDDs RDDs can represent any complex data type, if it can be iterated. Spark treats RDDs of the type RDD[(K,V)] as special, named PairRDDs, as they can be both iterated and indexed. Operations such as join are only defined on Pair RDDs. We can create Pair RDDs by applying an indexing function or by grouping records:
val rdd = sc.parallelize(List("foo", "bar", "baz")) // RDD[String]
val pairRDD = rdd.map(x => (x.charAt(0), x))        // RDD[(Char, String)]
pairRDD.collect // Array((f,foo), (b,bar), (b,baz))

val pairRDD2 = rdd.groupBy(x => x.charAt(0))        // RDD[(Char, Iterable[String])]
pairRDD2.collect // Array((b,CompactBuffer(bar, baz)), (f,CompactBuffer(foo)))
5 . 10

Slide 32

Slide 32 text

Transformations on Pair RDDs The following functions are only available on RDD[(K,V)] reduceByKey: Merge the values for each key using an associative and commutative reduce function aggregateByKey: Aggregate the values of each key, using given combine functions and a neutral “zero value” join: Return an RDD containing all pairs of elements with matching keys reduceByKey(f: (V, V) => V): RDD[(K, V)] aggregateByKey(zero: U)(f: (U, V) => U, g: (U, U) => U): RDD[(K, U)] join(b: RDD[(K, W)]): RDD[(K, (V, W))] 5 . 11
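A small sketch of reduceByKey and join on made-up Pair RDDs (assuming sc is in scope as in the other examples):
val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
counts.reduceByKey(_ + _).collect()  // Array((a,2), (b,1))

val labels = sc.parallelize(Seq(("a", "article"), ("b", "letter")))
counts.join(labels).collect()        // one pair per matching key combination, e.g. (a,(1,article))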

Slide 33

Slide 33 text

Pair RDD example: aggregateByKey How can we count the number of occurrences of part of speech elements? D: What type conversions take place here?
object PartOfSpeach {
  sealed trait EnumVal
  case object Verb extends EnumVal
  case object Noun extends EnumVal
  case object Article extends EnumVal
  case object Other extends EnumVal
  val partsOfSpeach = Seq(Verb, Noun, Article, Other)
}

def partOfSpeach(w: String): PartOfSpeach.EnumVal = ...

odyssey.keyBy(partOfSpeach).
  aggregateByKey(0)((acc, x) => acc + 1, (x, y) => x + y)
5 . 12

Slide 34

Slide 34 text

Pair RDD example: join Q: What are the types of ps and as? How can we join them?
case class Person(id: Int, name: String)
case class Addr(id: Int, person_id: Int, address: String, number: Int)

val pers = sc.textFile("pers.csv") // id, name
val addr = sc.textFile("addr.csv") // id, person_id, street, num

val ps = pers.map(_.split(",")).map(r => Person(r(0).toInt, r(1)))
val as = addr.map(_.split(",")).map(r => Addr(r(0).toInt, r(1).toInt, r(2), r(3).toInt))

val pairPs = ps.keyBy(_.id)
val pairAs = as.keyBy(_.person_id)

val addrForPers = pairAs.join(pairPs) // RDD[(Int, (Addr, Person))]
5 . 13

Slide 35

Slide 35 text

Spark SQL In Spark SQL, we trade some of the freedom provided by the RDD API to enable: declarativity, in the form of SQL automatic optimizations, similar to ones provided by databases execution plan optimizations data movement/partitioning optimizations The price we have to pay is to bring our data to a (semi-)tabular format and describe its schema. Then, we let Spark optimize the execution for us. 5 . 14

Slide 36

Slide 36 text

Spark SQL basics SparkSQL is a library built on top of Spark RDDs. It provides two main abstractions: Datasets, collections of strongly-typed objects. Scala/Java only! Dataframes, essentially a Dataset[Row], where Row ≈ Array[Object]. Equivalent to R or Pandas Dataframes SQL syntax It can directly connect to and use structured data sources (e.g. SQL databases) and can import CSV, JSON, Parquet etc. 5 . 15

Slide 37

Slide 37 text

Creating Data Frames and Datasets 1. From RDDs containing tuples, e.g. RDD[(String, Int, String)] 2. From RDDs with known complex types, e.g. RDD[Person] import spark.implicits._ val df = rdd.toDF("name", "id", "address") val df = persons.toDF() // Column names/types are inferred! 5 . 16

Slide 38

Slide 38 text

3. From RDDs, with manual schema definition 4. By reading (semi-)structured data files
val schema = StructType(Array(
  StructField("level", StringType, nullable = true),
  StructField("date", DateType, nullable = true),
  StructField("client_id", IntegerType, nullable = true),
  StructField("stage", StringType, nullable = true),
  StructField("msg", StringType, nullable = true)))

val rowRdd = sc.textFile("ghtorrent-log.txt").
  map(_.split("#")).
  map(r => Row(r(0), new Date(r(1)), r(2).toInt, r(3), r(4)))

val logDF = spark.createDataFrame(rowRdd, schema)

val df = spark.read.json("examples/src/main/resources/people.json")

df = sqlContext.read.csv("/datasets/pullreqs.csv", sep=",", header=True, inferSchema=True)
5 . 17

Slide 39

Slide 39 text

Spark cluster architecture 5 . 18

Slide 40

Slide 40 text

Using Spark for structured data 6 . 1

Slide 41

Slide 41 text

Spark as an efficient Pandas/R backend While R and Python are handy with small CSV files, they can be very slow when the number of CSV file lines reaches ~10^6. Spark offers a very versatile structured data framework and can act as an efficient backend for: Interactive exploration Ad-hoc querying and joining structured data Machine learning applications 6 . 2

Slide 42

Slide 42 text

Our running example: Pull Requests! The dataset we are using is from Gousios and Zaidman, 2014.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

df = sqlContext.read.csv("hdfs://athens:8020/pullreqs.csv",
                         sep=",", header=True, inferSchema=True).\
    cache()
sqlContext.registerDataFrameAsTable(df, "pullreqs")
6 . 3

Slide 43

Slide 43 text

Running SQL queries Listing projects Check how PRs are merged
sqlContext.sql("select distinct(project_name) from pullreqs").show(10)

sqlContext.sql("""select merged_using, count(*) as occurences
                  from pullreqs group by merged_using
                  order by occurences desc""").show()
+--------------------+----------+
|        merged_using|occurences|
+--------------------+----------+
|              github|    364528|
|   commits_in_master|    342339|
|             unknown|    138566|
|  merged_in_comments|     29273|
|commit_sha_in_com...|     23234|
|     fixes_in_commit|     18125|
6 . 4

Slide 44

Slide 44 text

Nested SQL queries Queries can be complicated. Here we use a nested query to get the projects per programming language.
sqlContext.sql("""select lang, count(*) as num_projects from (
                    select distinct(project_name), lang from pullreqs
                  ) as project_langs group by lang""").show()
+----------+------------+
|      lang|num_projects|
+----------+------------+
|javascript|        1726|
|    python|        1518|
|      ruby|        1086|
|      java|        1075|
|     scala|         138|
6 . 5

Slide 45

Slide 45 text

Joining across data sources Suppose we would like to get some more info about the PR mergers from GHTorrent. What is important here is partitioning: this will split the MySQL table in numPartitions partitions and allow for parallel processing of the data. If the table is small, partitioning is not necessary.
users = sqlContext.read.format("jdbc").options(
    url='jdbc:mysql://munich/ghtorrent?&serverTimezone=UTC',
    driver='com.mysql.jdbc.Driver',
    dbtable='users',
    user='ght', password='ght',
    partitionColumn="id", numPartitions="56",
    lowerBound="0", upperBound="40341639").\
    load().cache()
sqlContext.registerDataFrameAsTable(users, "users")
6 . 6

Slide 46

Slide 46 text

Joining data sources This returns the expected results, even though the data resides in 2 (very) different sources.
sqlContext.sql("""select distinct(u.id), u.login, u.country_code
                  from users u join pullreqs pr on u.login = pr.merger
                  where country_code != 'null'""").show(10)
+-------+----------------+------------+
|     id|           login|country_code|
+-------+----------------+------------+
|2870788|Bernardstanislas|          fr|
|1136167|CamDavidsonPilon|          ca|
|  35273|      DataTables|          gb|
|2955396|         Drecomm|          nl|
|2468606|      Gaurang033|          in|
|2436711|         JahlomP|          gh|
|8855272|     JonnyWong16|          ca|
| 624345|          M2Ys4U|          gb|
|1185808|         PierreZ|          fr|
+-------+----------------+------------+
6 . 7

Slide 47

Slide 47 text

Exporting to Pandas/R Spark only offers basic statistics; fortunately, we can easily export data to Pandas/R.
import pandas as pd

pandas_df = sqlContext.sql(
    """select project_name, count(*) as num_prs
       from pullreqs group by project_name""").toPandas()
pandas_df.describe()
---
           num_prs
count  5543.000000
mean    165.265199
std     407.276860
min       1.000000
6 . 8

Slide 48

Slide 48 text

Machine learning with Spark Spark has a very nice machine learning library (MLlib). The general idea is that we need to bring our data in a format that MLlib understands and then we can fit and evaluate several ready-made algorithms. The reshaping process consists of: Converting factors to use OneHot encoding Converting booleans to integers Creating training and testing datasets The transformations are always done on DataFrames, in a pipeline fashion. We also need to specify an evaluation function. 6 . 9

Slide 49

Slide 49 text

Data preparation examples One Hot encoding for factors Defining a transformation pipeline Creating train and test datasets
# Convert feature columns to a numeric vector
onehot = VectorAssembler(inputCols=feature_cols, outputCol='features')

pipeline = Pipeline(stages=[onehot])
allData = pipeline.fit(df).transform(df).cache()

(train, test) = allData.randomSplit([0.9, 0.1], seed=42)
6 . 10

Slide 50

Slide 50 text

Our evaluation function We just compare classifiers based on AUC
from pyspark.ml.evaluation import BinaryClassificationEvaluator

## Calculate and return the AUC metric
def evaluate(testData, predictions):
    evaluator = BinaryClassificationEvaluator(labelCol="merged_int", rawPredictionCol="rawPrediction")
    print "AUC: %f" % evaluator.evaluate(predictions)
6 . 11

Slide 51

Slide 51 text

Random Forests and Gradient Boosting
from pyspark.ml.classification import RandomForestClassifier as RF
from pyspark.ml.evaluation import BinaryClassificationEvaluator

rf = RF(labelCol='merged_int', featuresCol='features', numTrees=100, maxDepth=5)
rfModel = rf.fit(train)
evaluate(test, rfModel.transform(test))
AUC is 0.780482

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=10, maxDepth=5, labelCol="merged_int", seed=42)
gbtModel = gbt.fit(train)
evaluate(test, gbtModel.transform(test))
AUC is 0.792181
6 . 12

Slide 52

Slide 52 text

Mining repositories with Spark 7 . 1

Slide 53

Slide 53 text

The source{d} MSR stack source{d} is a start-up that develops tools for doing research on Big Code: The Public Git Archive is a dataset containing all GitHub repos with more than 50 stars. The Engine is a Spark plugin that enables access to multiple git repos. 7 . 2

Slide 54

Slide 54 text

Downloading data: The Public Git Archive The Public Git Archive is a dataset containing all GitHub repos with more than 50 stars. It comes with a cmd line tool, pga, which allows users to selectively download repos in the custom siva format, which is suitable for use on HDFS.
$ pga list -u incubator |wc -l
1251
$ pga get -u incubator       # Retrieve data
$ hadoop fs -put incubator / # Put data to HDFS
7 . 3

Slide 55

Slide 55 text

Using the source{d} engine
import tech.sourced.engine._

val engine = Engine(spark, "hdfs://athens:8020/incubator/latest/*/", "siva")
val repos = engine.getRepositories
val commits = engine.getRepositories.getReferences.getAllReferenceCommits

repos.createOrReplaceTempView("repos")
commits.createOrReplaceTempView("commits")
7 . 4

Slide 56

Slide 56 text

Running arbitrary queries Seeing how many repos we have Commits per repo spark.sql("select count(*) from repos").show(10, false) spark.sql("select repository_id, count(*) from commits group by repository_id order by count(*) desc").show(10, false) 7 . 5

Slide 57

Slide 57 text

References / License If you are interested in Big Data Processing, you might want to have a look at my Big Data Processing course at TU Delft. This work is (c) 2017 - onwards by Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. 7 . 6