Mining Repositories with Apache Spark

At the beginning of every research effort, researchers in empirical software
engineering have to go through the processes of extracting data from raw
data sources and transforming them to what their tools expect as inputs.
This step is time consuming and error prone, while the produced artifacts
(code, intermediate datasets) are usually not of scientific value. In
recent years, Apache Spark has emerged as a solid foundation for data
science and has taken the big data analytics domain by storm. We believe
that the primitives exposed by Apache Spark can help software engineering
researchers create and share reproducible, high-performance data analysis
pipelines.

In our presentation, given as an ICSE 2018 technical briefing, we discuss how researchers can profit from Apache Spark through a hands-on case study.

Georgios Gousios

May 29, 2018

Transcript

  1. Mining repositories with
    Apache Spark
    Georgios Gousios
    18 June 2018

  2. The tutorial
    Welcome! In this tutorial, we will go through the following 3
    topics:
    Enough FP to become dangerous
    Internals of Apache Spark
    Using Apache Spark for common MSR data tasks

  3. Programming languages
    The de facto languages of Big Data and data science are:
    Scala: Mostly used for data-intensive systems
    Python: Mostly used for data analytics tasks
    Other languages include:
    Java: The “assembly” of big data systems; the language
    that most big data infrastructure is written in.
    R: The statistician’s tool of choice. Great selection of
    libraries for serious data analytics, great plotting tools.
    In our tutorial, we will be using Scala and Python.

  4. Data processing with
    Functional Programming

  5. Types of data
    Unstructured: Data whose format is not known
    Raw text documents
    HTML pages
    Semi-Structured: Data with a known format.
    Pre-parsed data to standard formats: JSON, CSV, XML
    Structured: Data with known formats, linked together in
    graphs or tables

  6. Sequences / Lists
    Sequences or Lists or Arrays represent consecutive items in
    memory
    In Python:
        a = [1, 2, 3, 4]
    In Scala:
        val l = List(1, 2, 3, 4)
    Basic properties:
    Size is bounded by memory
    Items can be accessed by an index: a[1] or l(3)
    Items can only be inserted at the end (append)

  7. Sets
    Sets store values, without any particular order, and no
    repeated values.
    Basic properties:
    Size is bounded by memory
    Can be queried for containment
    Set operations: union, intersection, difference, subset
    scala> val s = Set(1,2,3,4,4)
    s: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)
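As a small illustration of the properties listed above, the same operations can be tried with Python's built-in set type. This is only a sketch; the variable names and sample values are made up for the example.

```python
# Duplicates are dropped on construction, as in the Scala example above.
s = {1, 2, 3, 4, 4}
t = {3, 4, 5}

contains_three = 3 in s      # containment query
union = s | t                # elements in either set
intersection = s & t         # elements in both sets
difference = s - t           # elements in s but not in t
is_subset = {1, 2} <= s      # subset test

print(sorted(s), contains_three, sorted(union),
      sorted(intersection), sorted(difference), is_subset)
```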

  8. Maps or Dictionaries
    A Map (or Dictionary, or Associative Array) is a collection of
    (k,v) pairs in which each k appears only once.
    Some languages have built-in support for Dictionaries:
        a = {'a' : 1, 'b' : 2}
    Basic properties:
    One key always corresponds to one value.
    Accessing a value given a key is very fast (≈ O(1))

  9. Nested data types: Graphs
    A graph data structure consists of a finite set of vertices or
    nodes, together with a set of unordered pairs of these
    vertices for an undirected graph or a set of ordered pairs for
    a directed graph.
    Nodes can contain attributes
    Edges can contain weights and directions
    Graphs are usually represented as
    Map[Node, List[Edge]], where
        case class Node(id: Int, attributes: Map[A, B])
        case class Edge(a: Node, b: Node, directed: Option[Boolean],
                        weight: Option[Double])

  10. Nested data types: Trees
    Ordered graphs without loops
        a = {"id": "5542101946", "type": "PushEvent",
             "actor": {
               "id": 801183,
               "login": "tvansteenburgh"
             },
             "repo": {
               "id": 42362423,
               "name": "juju-solutions/review-queue"
             }}
    If we parse the above JSON in almost any language, we get a
    series of nested maps
        Map(id -> 5542101946,
            type -> PushEvent,
            actor -> Map(id -> 801183.0, login -> tvansteenburgh),
            repo -> Map(id -> 4.2362423E7, name -> juju-solutions/review-queu

  11. Relations
    An n-tuple is a sequence of n elements, whose types are
    known.
        val record: Tuple4[Int, String, String, Int] =
          (1, "Georgios", "Mekelweg", 4)
    A relation is a Set of n-tuples $(d_1, d_2, \ldots, d_n)$.
    Relations are very important for data processing, as they
    form the theoretical framework (Relational Algebra) for
    relational (SQL) databases.
    Typical operations on relations are insert, remove and join.
    Join allows us to compute new relations by joining existing
    ones on common fields.

  12. Key/Value pairs
    A key/value pair (or KV) is a special type of Map, where a
    key k may appear more than once.
    Key/Value pairs are usually implemented as a Map whose
    keys are of a sortable type K (e.g. Int) and values are a Set
    of elements of type V.
        val kv = Map[K, Set[V]]()
    Another way to represent a K/V pair is a List of n-tuples
    $(d_1, d_2, \ldots, d_n)$.
    K and V are flexible: that’s why the Key/Value abstraction is
    key to NoSQL databases, including MongoDB, DynamoDB,
    Redis etc. Those databases sacrifice, among others, type

  13. Functional programming
    Functional programming is a programming paradigm that
    treats computation as the evaluation of mathematical
    functions and avoids changing-state and mutable data
    (Wikipedia).
    Functional programming characteristics:
    Absence of side-effects: A function, given an argument,
    always returns the same results irrespective of and
    without modifying its environment.
    Higher-order functions: Functions can take functions as
    arguments to parametrise their behavior
    Laziness: The art of waiting to compute until you can wait
    no more

  14. Function signatures
    foo(x : [A], y : B) → C
    foo: function name
    x and y: Names of function arguments
    [A] and B: Types of function arguments.
    →: Denotes the return type
    C: Type of the returned result
    [A]: Denotes that type A can be traversed
    We read this as: Function foo takes as arguments an array/list
    of type A and an argument of type B and returns a result of type C.

  15. Side effects
    A function has a side effect if it modifies some state outside
    its scope or has an observable interaction with its calling
    functions or the outside world besides returning a value.
    As a general rule, any function that returns nothing (void or
    Unit) has a side effect!
        max = -1
        def ge(a, b):
            global max
            if a >= b:
                max = a  ## <- Side effect!
                return True
            else:
                max = b  ## <- Side effect!
                return False
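For contrast with ge above, here is a minimal sketch of a side-effect-free variant. The name ge_pure is hypothetical and not part of the original slides; instead of writing to a global, the function returns everything the caller needs.

```python
def ge_pure(a, b):
    """Return (a >= b, larger of the two) without touching global state."""
    return (a >= b, a if a >= b else b)

# The caller decides what to do with the result; nothing else changes.
is_ge, largest = ge_pure(3, 7)
print(is_ge, largest)
```

Calling ge_pure twice with the same arguments always yields the same tuple, which is the property the definition above asks for.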

  16. Examples of side effects
    Setting a field on an object: OO is not FP!
    Modifying a data structure in place: In FP, data structures
    are always persistent.
    Throwing an exception or halting with an error: In FP, we
    use types that encapsulate and propagate erroneous
    behaviour
    Printing to the console or reading user input, reading

  17. Higher-Order functions
    A higher-order function is a function that can take a function
    as an argument or return a function.
    In the context of big data processing (BDP), higher-order
    functions capture common idioms of processing data as
    enumerated elements, e.g. going over all elements,
    selectively removing elements and aggregating them.
        class Array[A] {
          // Return elements that satisfy f
          def filter(f: A => Boolean) : Array[A]
        }

  18. Important higher-order functions
    map(xs: [A], f: A => B) : [B]
    Applies f to all elements and returns a new list.
    flatMap(xs: [A], f: A => [B]) : [B]
    Like map, but flattens the result to a single list.
    fold(xs: [A], f: (B, A) => B, init: B) : B
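The three signatures above can be sketched in plain Python as follows. The helper names map_, flat_map and fold, and the sample lines, are assumptions made for this illustration; they are not taken from the slides.

```python
from functools import reduce
from itertools import chain

def map_(xs, f):
    """Apply f to every element and return a new list."""
    return [f(x) for x in xs]

def flat_map(xs, f):
    """Like map_, but f returns a list and the results are concatenated."""
    return list(chain.from_iterable(f(x) for x in xs))

def fold(xs, f, init):
    """Combine all elements into one value, starting from init."""
    return reduce(f, xs, init)

lines = ["sing to me", "of the man"]
lengths = map_(lines, len)                                  # one result per line
words = flat_map(lines, lambda l: l.split(" "))             # one flat list of words
total_chars = fold(words, lambda acc, w: acc + len(w), 0)   # [str] folded to an int
print(lengths, words, total_chars)
```

Note that fold changes the type here: the input is a list of strings (A) and the result is an integer (B).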

  19. Aux higher-order functions
    groupBy(xs: [A], f: A => K): Map[K, [A]]
    Partitions xs into a map of traversable collections
    according to some discriminator function.
    filter(xs: [A], f: A => Boolean) : [A]
    Takes a function that returns a boolean and returns all
    elements that satisfy it
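A minimal Python sketch of the two functions above, under the assumption that a plain dict of lists is an acceptable stand-in for Map[K, [A]]. The names group_by and filter_ are hypothetical.

```python
from collections import defaultdict

def group_by(xs, f):
    """Partition xs into a dict keyed by the discriminator f(x)."""
    groups = defaultdict(list)
    for x in xs:
        groups[f(x)].append(x)
    return dict(groups)

def filter_(xs, f):
    """Keep only the elements for which the predicate f returns True."""
    return [x for x in xs if f(x)]

words = ["foo", "bar", "baz", "quux"]
by_initial = group_by(words, lambda w: w[0])
three_letters = filter_(words, lambda w: len(w) == 3)
print(by_initial, three_letters)
```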

  20. foldL and foldR
    foldL(xs: [A], f: (B, A) => B, init: B) : B
    foldR(xs: [A], f: (A, B) => B, init: B) : B
    Both take almost the same arguments and return the same
    results. What is the difference in their evaluation?
    How does foldL work?
        print(reduce(reduce_pp, range(1,10), 0))
        ## (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)
    How does foldR work?
        print(reduceR(reduce_pp, range(1,10), 0))
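To make the difference in evaluation order visible, here is a hedged sketch of both folds in Python. The slide does not show how reduce_pp or reduceR are defined, so fold_left, fold_right and show below are stand-ins written for this example; show simply builds a bracketed string instead of adding numbers.

```python
def fold_left(xs, f, init):
    """Walk the list from the left; the accumulator is f's first argument."""
    acc = init
    for x in xs:
        acc = f(acc, x)
    return acc

def fold_right(xs, f, init):
    """Walk the list from the right; the accumulator is f's second argument."""
    acc = init
    for x in reversed(xs):
        acc = f(x, acc)
    return acc

def show(a, b):
    """Record the order of application as a bracketed string."""
    return "(%s + %s)" % (a, b)

xs = list(range(1, 4))
left = fold_left(xs, show, 0)
right = fold_right(xs, show, 0)
print(left)   # (((0 + 1) + 2) + 3)
print(right)  # (1 + (2 + (3 + 0)))
```

For an associative operation such as addition both folds give the same value; the bracketing only matters for operations such as subtraction or string building.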

  21. Laziness
    Laziness is an evaluation strategy which delays the
    evaluation of an expression until its value is needed.
    It enables:
    Separating a pipeline construction from its evaluation
    Not having to read whole datasets into memory: we can
    process them in lazy-loaded batches
    Generating infinite collections
    Optimising execution plans
        def primes(limit):
            sieve = [False]*limit
            for j in xrange(2, limit):
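The points about laziness can be sketched with Python generators: building the pipeline computes nothing, the source is infinite, and work happens only when results are requested. The names naturals, square and evaluated are assumptions made for this example.

```python
from itertools import count, islice

def naturals():
    """An infinite collection: 1, 2, 3, ..."""
    return count(1)

evaluated = []  # records which inputs were actually processed

def square(x):
    evaluated.append(x)
    return x * x

# Constructing the pipeline does not call square at all.
pipeline = (square(n) for n in naturals() if n % 2 == 0)
computed_before = list(evaluated)

# Only now are values pulled through the pipeline, and only three of them.
first_three = list(islice(pipeline, 3))
print(computed_before, first_three, evaluated)
```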

  22. Apache Spark

  23. What is Spark?
    Spark is an open source cluster computing framework.
    automates distribution of data and computations on a
    cluster of computers
    provides a fault-tolerant abstraction to distributed
    datasets
    is based on functional programming primitives
    provides two abstractions to data, list-like (RDDs) and
    table-like (Datasets)

  24. Resilient Distributed Datasets (RDDs)
    RDDs are the core abstraction that Spark uses.
    RDDs make datasets distributed over a cluster of machines
    look like a Scala collection. RDDs:
    are immutable
    reside (mostly) in memory
    are transparently distributed
    feature all FP primitives
    in addition, offer more primitives to minimize shuffling
    In practice, RDD[A] works like Scala’s List[A], with some
    gotchas

  25. Counting words with Spark
    The same code works on one computer or on a cluster of 100s
    of computers.
        val rdd = sc.textFile("./datasets/odyssey.mb.txt")
        rdd.
          flatMap(l => l.split(" ")).          // Split file in words
          map(x => (x, 1)).                    // Create key,1 pairs
          reduceByKey((acc, x) => acc + x).    // Count occurrences of same pairs
          sortBy(x => x._2, false).            // Sort by number of occurrences
          take(50).                            // Take the first 50 results
          foreach(println)
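The individual steps of the word count can be followed on a single machine with plain Python. This is only an emulation of what each step computes, not Spark code; the three sample lines are made up for the example.

```python
from collections import Counter

lines = ["tell me o muse", "of that ingenious hero", "tell me too"]

# flatMap: split every line and flatten into one list of words
words = [w for l in lines for w in l.split(" ")]

# map: create (word, 1) pairs
pairs = [(w, 1) for w in words]

# reduceByKey: sum the values that share a key
counts = Counter()
for word, one in pairs:
    counts[word] += one

# sortBy + take: most frequent first, keep the first two
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```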

  26. How to create an RDD?
    RDDs can only be created in the following 3 ways
    1. Reading data from external sources
        val rdd1 = sc.textFile("hdfs://...")
        val rdd2 = sc.textFile("file://odyssey.txt")
        val rdd3 = sc.textFile("s3://...")
    2. Convert a local memory dataset to a distributed one
        val xs: Range = Range(1, 10000)
        val rdd: RDD[Int] = sc.parallelize(xs)
    3. Transform an existing RDD
        rdd.map(x => x.toString) //returns an RDD[String]

  27. RDDs are lazy!
    There are two types of operations we can do on an RDD:
    Transformation: Applying a function that returns a new
    RDD. They are lazy.
    Action: Request the computation of a result. They are
    eager.
        // This just sets up the pipeline
        val result = rdd.
          flatMap(l => l.split(" ")).
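The split between lazy transformations and eager actions can be illustrated with a toy class. LazyList below is a hypothetical stand-in written for this example; it is not how Spark implements RDDs, it only mimics the observable behaviour: map and filter record work, while collect and take trigger it.

```python
from itertools import islice

class LazyList:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Transformations: return a new LazyList, compute nothing.
    def map(self, f):
        return LazyList(self._source, self._ops + (("map", f),))

    def filter(self, f):
        return LazyList(self._source, self._ops + (("filter", f),))

    def _run(self):
        items = iter(self._source)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    # Actions: evaluate the recorded pipeline.
    def collect(self):
        return list(self._run())

    def take(self, n):
        return list(islice(self._run(), n))

calls = []  # records which words were actually processed

def traced_upper(w):
    calls.append(w)
    return w.upper()

pipeline = (LazyList(["sing", "to", "me", "muse"])
            .filter(lambda w: len(w) > 2)
            .map(traced_upper))
calls_before_action = list(calls)
first = pipeline.take(1)
print(calls_before_action, first, calls)
```

Because take(1) only pulls one element through the pipeline, traced_upper is called once, even though two words would pass the filter.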

  28. Examples of RDD transformations
        val odyssey = sc.textFile("datasets/odyssey.mb.txt").
          flatMap(_.split(" "))
    All uses of articles in the Odyssey
        odyssey.map(_.toLowerCase).
          filter(Seq("a", "the").contains(_))
    Q: How can we find uses of all regular verbs in past tense?
        odyssey.filter(x => x.endsWith("ed"))
    Q: How can we remove all punctuation marks?
        odyssey.map(x => x.replaceAll("\\p{Punct}|\\d", ""))

  29. Common actions on RDD[A]
    collect: Return all elements of an RDD
        RDD.collect() : Array[A]
    take: Return the first n elements of the RDD
        RDD.take(n) : Array[A]
    reduce, fold: Combine all elements to a single result
    of the same type.
        RDD.reduce(f : (A, A) → A) : A
    aggregate: Aggregate the elements of each partition, and then the r

  30. Examples of RDD actions
        val odyssey = sc.textFile("datasets/odyssey.mb.txt").flatMap(_.split(" "))
    How many words are there?
        odyssey.map(x => 1).reduce((a,b) => a + b)
    How can we sort the RDD?
        odyssey.sortBy(x => x)
    How can we sample data from the RDD?
        // randomSplit returns an Array of RDDs, so we match on Array
        val Array(train, test) = odyssey.randomSplit(Array(0.8, 0.2))

  31. Pair RDDs
    RDDs can represent any complex data type, if it can be
    iterated. Spark treats RDDs of the type RDD[(K,V)] as
    special, named PairRDDs, as they can be both iterated and
    indexed.
    Operations such as join are only defined on Pair RDDs.
    We can create Pair RDDs by applying an indexing function or
    by grouping records:
        val rdd = sc.parallelize(List("foo", "bar", "baz")) // RDD[String]
        val pairRDD = rdd.map(x => (x.charAt(0), x)) // RDD[(Char, String)]
        pairRDD.collect
        // Array((f,foo), (b,bar), (b,baz))
        val pairRDD2 = rdd.groupBy(x => x.charAt(0)) // RDD[(Char, Iterable[String])]
        pairRDD2.collect
        // Array((b,CompactBuffer(bar, baz)), (f,CompactBuffer(foo)))

  32. Transformations on Pair RDDs
    The following functions are only available on RDD[(K,V)]
    reduceByKey: Merge the values for each key using an
    associative and commutative reduce function
        reduceByKey(f : (V, V) → V) : RDD[(K, V)]
    aggregateByKey: Aggregate the values of each key, using given combine
    functions and a neutral “zero value”
        aggrByKey(zero : U)(f : (U, V) → U, g : (U, U) → U) : RDD[(K, U)]
    join: Return an RDD containing all pairs of elements
    with matching keys
        join(b : RDD[(K, W)]) : RDD[(K, (V, W))]

  33. Pair RDD example: aggregateByKey
    How can we count the number of occurrences of part of
    speech elements?
        object PartOfSpeach {
          sealed trait EnumVal
          case object Verb extends EnumVal
          case object Noun extends EnumVal
          case object Article extends EnumVal
          case object Other extends EnumVal
          val partsOfSpeach = Seq(Verb, Noun, Article, Other)
        }
        def partOfSpeach(w: String): PartOfSpeach.EnumVal = ...
        // keyBy yields one (part of speech, word) pair per word
        odyssey.keyBy(partOfSpeach).
          aggregateByKey(0)((acc, x) => acc + 1,
                            (x, y) => x + y)
    D: What type conversions take place here?
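The two functions passed to aggregateByKey play different roles: the first folds values into an accumulator inside each partition, the second merges accumulators across partitions. The plain-Python sketch below emulates that behaviour on two hand-made partitions; aggregate_by_key and the sample data are assumptions for this example, not Spark's implementation.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Emulate aggregateByKey over a list of partitions of (key, value) pairs."""
    # Step 1: within each partition, fold values into a per-key accumulator.
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    # Step 2: merge the per-partition accumulators key by key.
    result = {}
    for acc in partials:
        for k, u in acc.items():
            result[k] = comb_op(result[k], u) if k in result else u
    return result

partitions = [
    [("Noun", "muse"), ("Verb", "tell"), ("Noun", "hero")],
    [("Noun", "city"), ("Other", "of")],
]
counts = aggregate_by_key(partitions, 0,
                          lambda acc, _word: acc + 1,   # (U, V) -> U
                          lambda x, y: x + y)           # (U, U) -> U
print(counts)
```

The values are strings (V) while the accumulator is an integer (U), which is why two separate functions are needed.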

  34. Pair RDD example: join
        case class Person(id: Int, name: String)
        case class Addr(id: Int, person_id: Int,
                        address: String, number: Int)
        val pers = sc.textFile("pers.csv") // id, name
        val addr = sc.textFile("addr.csv") // id, person_id, street, num
        val ps = pers.map(_.split(",")).map(r => Person(r(0).toInt, r(1)))
        val as = addr.map(_.split(",")).map(r => Addr(r(0).toInt, r(1).toInt,
                                                      r(2), r(3).toInt))
    Q: What are the types of ps and as? How can we join them?
        val pairPs = ps.keyBy(_.id)
        val pairAs = as.keyBy(_.person_id)
        val addrForPers = pairAs.join(pairPs) // RDD[(Int, (Addr, Person))]
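What the join on keyed pairs returns can be emulated in plain Python. The function join and the sample records below are made up for this sketch; it mirrors an inner join on the key, which is what the result type RDD[(Int, (Addr, Person))] suggests.

```python
from collections import defaultdict

def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for k, w in right:
        index[k].append(w)
    # Keys missing on either side produce no output rows.
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]

# (id, name) pairs, like ps.keyBy(_.id)
persons = [(1, "Georgios"), (2, "Ada")]
# (person_id, address) pairs, like as.keyBy(_.person_id)
addresses = [(1, "Mekelweg 4"), (3, "Nowhere 1")]

joined = join(addresses, persons)
print(joined)
```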

  35. Spark SQL
    In Spark SQL, we trade some of the freedom provided by the
    RDD API to enable:
    declarativity, in the form of SQL
    automatic optimizations, similar to ones provided by
    databases
    execution plan optimizations
    data movement/partitioning optimizations
    The price we have to pay is to bring our data to a
    (semi-)tabular format and describe its schema. Then, we let

  36. Spark SQL basics
    SparkSQL is a library built on top of Spark RDDs. It provides
    two main abstractions:
    Datasets, collections of strongly-typed objects. Scala/Java
    only!
    Dataframes, essentially a Dataset[Row], where Row ≈
    Array[Object]. Equivalent to R or Pandas Dataframes
    SQL syntax
    It can directly connect and use structured data sources
    (e.g. SQL databases) and can import CSV, JSON, Parquet,


  37. Creating Data Frames and Datasets
    1. From RDDs containing tuples, e.g.
    RDD[(String, Int, String)]
        import spark.implicits._
        val df = rdd.toDF("name", "id", "address")
    2. From RDDs with known complex types, e.g.
    RDD[Person]
        val df = persons.toDF() // Column names/types are inferred!

  38. 3. From RDDs, with manual schema definition
        val schema = StructType(Array(
          StructField("level", StringType, nullable = true),
          StructField("date", DateType, nullable = true),
          StructField("client_id", IntegerType, nullable = true),
          StructField("stage", StringType, nullable = true),
          StructField("msg", StringType, nullable = true)
        ))
        val rowRdd = sc.textFile("ghtorrent-log.txt").
          map(_.split("#")).
          map(r => Row(r(0), new Date(r(1)), r(2).toInt,
                       r(3), r(4)))
        val logDF = spark.createDataFrame(rowRdd, schema)
    4. By reading (semi-)structured data files
    In Scala:
        val df = spark.read.json("examples/src/main/resources/people.json")
    In Python:
        df = sqlContext.read.csv("/datasets/pullreqs.csv", sep=",",
                                 header=True, inferSchema=True)

  39. Spark cluster architecture

  40. Using Spark for structured
    data

  41. Spark as an efficient Pandas/R
    backend
    While R and Python are handy with small CSV files, they can
    be very slow when the number of CSV file lines reaches $10^6$.
    Spark offers a very versatile structured data framework and
    can act as an efficient backend for:
    Interactive exploration
    Ad-hoc querying and joining structured data
    Machine learning applications

  42. Our running example: Pull Requests!
    The dataset we are using is from Gousios and Zaidman, 2014
        from pyspark.sql.types import *
        from pyspark.sql.functions import *
        from pyspark.sql import SQLContext
        sqlContext = SQLContext(sc)
        df = sqlContext.read.csv("hdfs://athens:8020/pullreqs.csv",
                                 sep=",", header=True, inferSchema=True).\
             cache()
        sqlContext.registerDataFrameAsTable(df, "pullreqs")

  43. Running SQL queries
    Listing projects
        sqlContext.sql("select distinct(project_name) from pullreqs").show(10)
    Check how PRs are merged
        sqlContext.sql("""select merged_using, count(*) as occurences
                          from pullreqs
                          group by merged_using
                          order by occurences desc""").show()
        +--------------------+----------+
        |        merged_using|occurences|
        +--------------------+----------+
        |              github|    364528|
        |   commits_in_master|    342339|
        |             unknown|    138566|
        |  merged_in_comments|     29273|
        |commit_sha_in_com...|     23234|
        |     fixes_in_commit|     18125|

  44. Nested SQL queries
    Queries can be complicated. Here we use a nested query to
    get the projects per programming language.
        sqlContext.sql("""select lang, count(*) as num_projects
                          from (
                            select distinct(project_name), lang
                            from pullreqs
                          ) as project_langs
                          group by lang""").show()
        +----------+------------+
        |      lang|num_projects|
        +----------+------------+
        |javascript|        1726|
        |    python|        1518|
        |      ruby|        1086|
        |      java|        1075|
        |     scala|         138|

  45. Joining across data sources
    Suppose we would like to get some more info about the PR
    mergers from GHTorrent.
        users = sqlContext.read.format("jdbc").options(
            url='jdbc:mysql://munich/ghtorrent?&serverTimezone=UTC',
            driver='com.mysql.jdbc.Driver',
            dbtable='users',
            user='ght', password='ght',
            partitionColumn = "id",
            numPartitions = "56",
            lowerBound = "0", upperBound = "40341639").\
            load().cache()
        sqlContext.registerDataFrameAsTable(users, "users")
    What is important here is partitioning: this will split the
    MySQL table in numPartitions partitions and allow for
    parallel processing of the data. If the table is small,

  46. Joining data sources
        sqlContext.sql("""select distinct(u.id), u.login, u.country_code
                          from users u join pullreqs pr on u.login = pr.merger
                          where country_code != 'null'""").show(10)
        +-------+----------------+------------+
        |     id|           login|country_code|
        +-------+----------------+------------+
        |2870788|Bernardstanislas|          fr|
        |1136167|CamDavidsonPilon|          ca|
        |  35273|      DataTables|          gb|
        |2955396|         Drecomm|          nl|
        |2468606|      Gaurang033|          in|
        |2436711|         JahlomP|          gh|
        |8855272|     JonnyWong16|          ca|
        | 624345|          M2Ys4U|          gb|
        |1185808|         PierreZ|          fr|
        +-------+----------------+------------+
    This returns the expected results, even though the data
    resides in 2 (very) different sources.

  47. Exporting to Pandas/R
    Spark only offers basic statistics; fortunately, we can easily
    export data to Pandas/R.
        import pandas as pd
        pandas_df = sqlContext.sql(
            """select project_name, count(*) as num_prs
               from pullreqs
               group by project_name""").toPandas()
        pandas_df.describe()
        ---
                   num_prs
        count  5543.000000
        mean    165.265199
        std     407.276860
        min       1.000000

  48. Machine learning with Spark
    Spark has a very nice Machine learning library. The general
    idea is that we need to bring our data in a format that MLlib
    understands and then we can fit and evaluate several
    ready-made algorithms.
    The reshaping process consists of:
    Converting factors to use OneHot encoding
    Converting booleans to integers
    Creating training and testing datasets
    The transformations are always done on DataFrames, in a
    pipeline fashion.
    We also need to specify an evaluation function.

  49. Data preparation examples
    One Hot encoding for factors
    Defining a transformation pipeline
        # Convert feature columns to a numeric vector
        onehot = VectorAssembler(inputCols=feature_cols, outputCol='features')
        pipeline = Pipeline(stages=[onehot])
        allData = pipeline.fit(df).transform(df).cache()
    Creating train and test datasets
        (train, test) = allData.randomSplit([0.9, 0.1], seed=42)

  50. Our evaluation function
    We just compare classifiers based on AUC
        from pyspark.ml.evaluation import BinaryClassificationEvaluator
        ## Calculate and print the AUC metric
        def evaluate(testData, predictions):
            evaluator = BinaryClassificationEvaluator(labelCol="merged_int",
                                                      rawPredictionCol="rawPr
            print("AUC: %f" % evaluator.evaluate(predictions))

  51. Random Forests and Gradient Boosting
        from pyspark.ml.classification import RandomForestClassifier as RF
        from pyspark.ml.evaluation import BinaryClassificationEvaluator
        rf = RF(labelCol='merged_int', featuresCol='features',
                numTrees=100, maxDepth=5)
        rfModel = rf.fit(train)
        evaluate(test, rfModel.transform(test))
    AUC is 0.780482
        from pyspark.ml.classification import GBTClassifier
        gbt = GBTClassifier(maxIter=10, maxDepth=5,
                            labelCol="merged_int", seed=42)
        gbtModel = gbt.fit(train)
        evaluate(test, gbtModel.transform(test))
    AUC is 0.792181

  52. Mining repositories with
    Spark

  53. The source{d} MSR stack
    source{d} is a startup that develops tools for doing research
    on Big Code:
    The Public Git Archive is a dataset containing all GitHub
    repos with more than 50 stars
    Engine is a Spark plugin that enables access to multiple
    git repos

  54. Downloading data: The Public Git
    Archive
    The Public Git Archive is a dataset containing all GitHub
    repos with more than 50 stars. It comes with a command line tool,
    pga, which allows users to selectively download repos in the
    custom siva format, which is suitable for use on HDFS.
        $ pga list -u incubator |wc -l
        1251
        $ pga get -u incubator # Retrieve data
        $ hadoop fs -put incubator / # Put data to HDFS

  55. Using the source{d} engine
    import tech.sourced.engine._
    val engine = Engine(spark, "hdfs://athens:8020/incubator/latest/*/",
    val repos = engine.getRepositories
    val commits = engine.getRepositories.getReferences.getAllReferenceCom
    repos.createOrReplaceTempView("repos")
    commits.createOrReplaceTempView("commits")

  56. Running arbitrary queries
    Seeing how many repos we have
        spark.sql("select count(*) from repos").show(10, false)
    Commits per repo
        spark.sql("""select repository_id, count(*)
                     from commits
                     group by repository_id
                     order by count(*) desc""").show(10, false)

  57. References / License
    If you are interested in Big Data Processing, you might want
    to have a look at my Big Data Processing course at TU Delft.
    This work is (c) 2017 - onwards by Georgios Gousios and
    licensed under the Creative Commons
    Attribution-NonCommercial-ShareAlike 4.0 International license.