ICSE22: An Exploratory Study of Deep Learning Supply Chain

Xin Tan
April 10, 2022

These are the slides for the paper “An Exploratory Study of Deep Learning Supply Chain”, which was accepted at ICSE 2022.

Transcript

  1. An Exploratory Study of Deep Learning Supply Chain
    Xin Tan¹, Kai Gao², Minghui Zhou², and Li Zhang¹
    ¹Beihang University
    ²Peking University

  2. Deep Learning (DL) is so Popular and In Demand Nowadays.
    Why DL? The great power of DL.
    Figure: number of searches for “deep learning” from 2014 to 2021 (source: Google Trends)

  3. What Are DL Supply Chains (SCs)? How Important Are They?
    Software dependencies: import relationships
    Downstream projects
    Characteristics
    • multi-layer
    • a deep learning framework as the core
    • substantial downstream projects as the periphery
    • gradually evolving
    Importance
    • dynamically reflect the overall picture of the ecosystem formed from the DL framework
    • compare the differences between DL frameworks
    • provide a reference for emerging frameworks

  4. DL Supply Chains Remain a “Black Box”
    • the large number of projects involved in the supply chains
    • the large number of intricate dependencies involved in the supply chains
    • a lack of effective modeling methods

  5. Research Questions
    Questions
    RQ1 (Structure): What is the structure of the DL SCs, how does it evolve, and how can vulnerabilities in the structure be identified?
    RQ2 (Domain Distribution): What domains do the DL SCs cover, and do the two SCs differ?
    RQ3 (Evolutionary Factors): What factors are related to the number of downstream projects in the DL SCs?
    Aims
    • Identify key and risky projects in the DL SCs
    • Reflect software status in the DL SCs and guide developer selection
    • Push software into the supply chain efficiently

  6. Primary Issue: How to Construct DL SCs?
    Input: the name of the project
    Output: all directly and indirectly dependent projects (i.e., downstream projects) of this project, and their relationships
    Goal: to identify which projects in a layer of the SC have released packages, and the information about those packages
    Dependent software packages defined in Synthesizer.py
    World of Code (WoC): collects almost all the Git projects on the Internet and provides an API that can efficiently retrieve technical dependencies.
    Libraries.io: the Libraries.io dataset tracks over 2.7M unique open source packages.
    Project name
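
The input/output description above can be sketched as a breadth-first walk that starts at the framework and assigns each downstream project to a layer. This is only an illustration, not the authors' pipeline: the `dependents_of` dictionary and all project names are hypothetical stand-ins for the dependency data that the study retrieves from WoC and Libraries.io.

```python
from collections import deque


def build_supply_chain(root, dependents_of):
    """Walk from the framework to all direct and indirect downstream projects.

    `dependents_of` is a hypothetical mapping {project: projects that import it}.
    Returns (layer, edges): layer[p] is 1 for the root, 2 for its direct
    dependents, and so on; edges lists (upstream, downstream) pairs.
    """
    layer = {root: 1}
    edges = []
    queue = deque([root])
    while queue:
        upstream = queue.popleft()
        for downstream in dependents_of.get(upstream, ()):
            edges.append((upstream, downstream))
            if downstream not in layer:  # first visit fixes the layer
                layer[downstream] = layer[upstream] + 1
                queue.append(downstream)
    return layer, edges


# Toy data, invented for illustration only.
toy = {
    "framework": ["lib-a", "app-1"],
    "lib-a": ["lib-b", "app-2"],
    "lib-b": ["app-3"],
}
layer, edges = build_supply_chain("framework", toy)
```

Because a project's layer is fixed on its first visit, each project lands in the shallowest layer from which it can be reached.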

  7. SC Construction
    Figures: Study Cases; Approach to Construct DL SCs

  8. RQ1: Structure: Basic Information
    1. Substantial projects in the DL SCs.
    2. Each SC has five or six layers.
    3. Low proportion of packages (TensorFlow: 0.29%, PyTorch: 0.51%).
    4. More than 30% of packages have no downstream projects.
    Basic Information of the DL SCs

  9. RQ1: Structure: Basic Information
    Project Dependencies in the DL SC
    More than 90% of projects are in the 2nd layer, i.e., they depend directly on TensorFlow or PyTorch.

  10. RQ1: Structure: Basic Information
    Project Dependencies in the DL SC
    Except for the first layer, the number of projects in each layer decreases as the layer number increases, which suggests that projects deeper in a DL SC find it more difficult to attract downstream projects.

  11. RQ1: Structure: Basic Information
    Project Dependencies in the DL SC
    For most layers, projects seem more likely to import only packages published by the projects immediately above them. This is generally because such a package has integrated the functions of the packages it directly depends on.

  12. RQ1: Structure: Evolution
    1. Both the TensorFlow and PyTorch SCs show an overall growth trend.
    2. The growth rates of both SCs peak in May 2019.
    3. The growth trend of packages is similar to the growth trend of projects.

  13. RQ1: Structure: Evolution
    1. The cumulative changes of downstream projects in TensorFlow SC and PyTorch SC are similar.
    2. Many of the initial downstream projects come from early adopters. After this initial rise in popularity, the growth in downstream projects tends to stabilize for half of the projects.

  14. RQ1: Structure: Vulnerabilities
    Most Impactful Projects
    The most critical projects have a cascading impact on their downstream projects. We consider both direct and transitive dependencies when calculating in-degree.

  15. RQ1: Structure: Vulnerabilities
    Most Vulnerable Projects
    The more packages a project depends on, the more vulnerable it may be. Therefore, we calculate the out-degree distribution of the projects, taking their transitive dependencies into account.
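
The in-degree and out-degree measures on slides 14 and 15 both count transitive as well as direct relationships. A minimal sketch of that idea follows. It assumes a hypothetical `depends_on` dictionary with invented project names, and it treats in-degree as the number of projects that reach a project through dependency edges and out-degree as the number of projects it reaches. The paper's exact computation may differ.

```python
from collections import Counter


def reachable(start, neighbours):
    """Return every node reachable from `start` (excluding `start` itself)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Toy data, invented for illustration: project -> projects it imports.
depends_on = {
    "lib-a": ["framework"],
    "lib-b": ["lib-a"],
    "app-1": ["framework"],
    "app-2": ["lib-a"],
    "app-3": ["lib-a", "lib-b"],
}

# Invert the edges to get project -> projects that import it.
dependents_of = {}
for project, deps in depends_on.items():
    for dep in deps:
        dependents_of.setdefault(dep, []).append(project)

projects = set(depends_on) | set(dependents_of)
# Transitive in-degree: how many projects are affected if this one breaks.
in_degree = {p: len(reachable(p, dependents_of)) for p in projects}
# Transitive out-degree: how many projects this one relies on.
out_degree = {p: len(reachable(p, depends_on)) for p in projects}
out_degree_distribution = Counter(out_degree.values())
```

In this toy graph the framework has the largest transitive in-degree and `app-3` the largest transitive out-degree, which mirrors the "most impactful" and "most vulnerable" readings on the two slides.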

  16. RQ1: Structure: Vulnerabilities
    Projects Most Likely to Be a Single Point of Failure
    By identifying the nodes that have high betweenness centrality but low degree centrality, we can find the projects that are most likely to act as a single point of failure if they are removed.
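
To make the "high betweenness, low degree" criterion concrete, the sketch below computes unnormalized betweenness centrality with Brandes' algorithm on a small graph. Several things here are assumptions for illustration: the graph is treated as undirected and unweighted, the node names are invented, and the ranking rule (highest betweenness first, ties broken by lowest degree) is one simple way to apply the criterion, not necessarily the one used in the paper.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    `graph` maps each node to the set of its neighbours.
    Returns unnormalized betweenness scores.
    """
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        paths = dict.fromkeys(graph, 0)   # number of shortest paths
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):         # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted once from either endpoint.
    return {v: s / 2 for v, s in score.items()}


# Toy graph: two triangles joined only through "bridge".
edge_list = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a1", "bridge"), ("bridge", "b1"),
]
graph = {}
for u, v in edge_list:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

bc = betweenness(graph)
degree = {v: len(nbrs) for v, nbrs in graph.items()}
# Highest betweenness first; among ties, lowest degree first.
ranked = sorted(graph, key=lambda v: (-bc[v], degree[v]))
```

Here `bridge` has only two neighbours, yet every shortest path between the two triangles passes through it, so it ranks first. That is the pattern the slide associates with a single point of failure.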

  17. RQ2: Domain Distribution: Packages
    Method
    We analyze the packages with at least ten downstream projects.
    We apply thematic analysis to their descriptions.
    Figures: Package Distribution in TensorFlow SC; Package Distribution in PyTorch SC
    The non-domain-related (NDR) packages account for more than half of the packages. They include frameworks, models, wrappers, and tutorials, and provide a rich variety of support tools.
  18. RQ2: Domain Distribution: Packages
    Method
    We analyze the packages with at least ten downstream projects.
    We apply thematic analysis to their descriptions.
    Figures: Package Distribution in TensorFlow SC; Package Distribution in PyTorch SC
    Domain-related (DR) packages cover a wide range of fields, including hot areas such as CV, NLP, and RL, as well as interdisciplinary fields, e.g., biology.

  19. RQ2: Domain Distribution: Packages
    Some Differences
    Figures: Package Distribution in TensorFlow SC; Package Distribution in PyTorch SC
    The PyTorch SC seems to contain more framework packages in specific domains, while the TensorFlow SC seems to contain more general supporting tools.

  20. RQ2: Domain Distribution: Packages
    Some Differences
    Figures: Package Distribution in TensorFlow SC; Package Distribution in PyTorch SC
    The PyTorch SC seems to have a higher proportion of general frameworks, while the TensorFlow SC seems to pay more attention to general wrappers.

  21. RQ2: Domain Distribution: Projects
    We apply LDA to the projects' README files.
    1. In both SCs, the projects span a rich set of application categories, including CV, NLP, and RL.
    2. The TensorFlow SC covers areas such as self-driving and robot control, which are closely related to industry.

  22. RQ2: Domain Distribution: Projects
    Figures: Evolution of the Project Types Distribution in TensorFlow SC; Evolution of the Project Types Distribution in PyTorch SC
    Some Observations
    • The distribution of project types fluctuates over time. CV accounts for a large proportion, about 30% to 40%.
    • The proportion of Research in the PyTorch SC is much higher than in the TensorFlow SC.

  23. RQ3: Evolutionary Factors
    • Goal: examine whether the number of downstream projects is related to certain factors
    • Projects: the projects with more than 10 downstream projects
    • Response variable: #downstream projects
    • Predictor variables:
        • Project characteristics: #commits, #authors, age, #stars
        • Package domain: package_domain (DR/NDR), sub_domain
        • Supply chain characteristics: SC (TensorFlow/PyTorch), layer number, #dependencies
    • Model: Generalized Additive Models (GAM)
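
For readers unfamiliar with GAMs, the general form of such a model can be written as below. This is the textbook formulation, not the exact specification used in the paper. The link function $g$, and the split between smooth terms $f_j$ (for numeric predictors such as #commits or #authors) and parametric terms $\beta_k z_k$ (for categorical predictors such as package_domain), are assumptions made for illustration. Here $y$ stands for #downstream projects.

```latex
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{k} \beta_k z_k
```

Because each $f_j$ is an arbitrary smooth function and not a fixed slope, a model of this form can capture nonlinear relationships between a predictor and the response.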

  24. RQ3: Evolutionary Factors
    1. The influence of #authors on #downstream_projects is nonlinear.
    2. #dependencies is negatively related to #downstream_projects.
    3. Domain-related packages tend to attract more downstream projects.

  25. Implications
    DL community maintainers
    • Making the structure visible can help to formulate targeted development strategies and ensure the safety of the SCs.
    • The differences we reveal can help maintainers better understand what they need to optimize to remain competitive.
    • Serve as examples for new frameworks to lead their own ecosystems.
    DL practitioners
    • Choose a suitable DL framework more easily.
    • Can provide guidance for package developers in the DL SCs.
    Researchers
    • Build on our findings and ask new questions in order to gain a deeper understanding of software SCs and perform relevant research.

  26. Summary
    We try to open the "black box" of DL SCs.
    Structure
    ✓ Basic Structure
    ✓ Evolution Characteristics
    ✓ Vulnerabilities
    Domain Distribution
    ✓ Packages Domain
    ✓ Projects Domain
    Evolutionary Factors
    ✓ Consider the following factors:
        1. Project characteristics
        2. Package domain
        3. SC characteristics
  27. Thanks!
