Modeling Social Data, Lecture 10: Networks

Jake Hofman

April 05, 2019

Transcript

  1. Networks
    APAM E4990
    Modeling Social Data
    Jake Hofman
    Columbia University
    April 5, 2019
    Jake Hofman (Columbia University) Networks April 5, 2019 1 / 16

  2. History

  3. ∼1930s: Relationships as networks
    Moreno (1933)
    http://bit.ly/sociograms

  4. ∼1960s: Random graph theory
    $p > \frac{(1 + \epsilon) \ln n}{n}$
    Erdős & Rényi (1959)
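
The threshold above can be checked numerically. The following is a minimal sketch, not part of the original slides: it samples G(n, p) random graphs with p set just above and just below ln(n)/n and reports how often the result is connected. The function names, the choice n = 200, the number of trials, and the seed are all illustrative assumptions.

```python
import math
import random


def gnp_edges(n, p, rng):
    """Sample G(n, p): each possible edge appears independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def is_connected(n, edges):
    """Return True if every node is reachable from node 0."""
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    seen = {0}
    stack = [0]
    while stack:
        node = stack.pop()
        for nbr in neighbors[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return len(seen) == n


def fraction_connected(n, epsilon, trials=20, seed=0):
    """Fraction of sampled graphs that are connected at p = (1 + epsilon) ln(n) / n."""
    rng = random.Random(seed)
    p = (1 + epsilon) * math.log(n) / n
    hits = sum(is_connected(n, gnp_edges(n, p, rng)) for _ in range(trials))
    return hits / trials


above = fraction_connected(200, epsilon=1.0)   # p is twice the threshold
below = fraction_connected(200, epsilon=-0.5)  # p is half the threshold
print(above, below)
```

With these settings one would expect nearly all graphs above the threshold to be connected and nearly none below it, since below the threshold isolated nodes are very likely to remain.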

  5. ∼1970s: Clustering, weak ties
    Granovetter (1973)

  9. ∼1970s: Cumulative advantage
    ... have never been cited, about 10 percent have been cited once, about 9 percent
    twice, and so on, the percentages slowly decreasing, so that half of all papers
    will be cited eventually five times or more, and a quarter of all papers, ten ...
    ... would prove so distinctive that they could be picked automatically by means of
    citation-index-production procedures and published as a single U.S. (or World)
    Journal of Really Important Papers.
    [Figure labels: 100 old papers in field; 91 references; 40 papers not cited in year;
    50 papers cited once; 10 cited more than once; 10 miscellaneous from outside field.]
    Fig. 3. Idealized representation of the balance of papers and citations for a given
    "almost closed" field in a single year. It is assumed that the field consists of 100
    papers whose numbers have been growing exponentially at the normal rate. If we
    assume that each of the seven new papers contains about 13 references to journal
    papers and that about 11 percent of these 91 cited papers (or ten papers) are outside
    the field, we find that 50 of the old papers are connected by one citation each to the
    new papers (these links are not shown) and that 40 of the old papers are not cited
    at all during the year. The seven new papers, then, are linked to ten of the old ones
    by the complex network shown here.
    ... relation, if one exists, is very small. Certainly, there is no strong tendency
    for review papers to be cited unusually often. If my conjecture is valid, it is
    worth noting that, since 10 percent of all papers contain no bibliographic
    references and another, presumably almost independent, 10 percent of all papers
    are never cited, it follows that there is a lower bound of 1 percent of all
    papers on the number of papers that are totally disconnected in a pure citation
    network and could be found only by topical indexing or similar methods; this is
    a very small class, and probably a most unimportant one.
    The balance of references and citations in a single year indicates one very
    important attribute of the network (see Fig. 3). Although most papers produced
    in the year contain a near-average number of bibliographic references, half of
    these are references to about half of all the papers that have been published in
    previous years. The other half of the references tie these new papers to a quite
    small group of earlier ones, and generate a rather tight pattern of multiple
    relationships. Thus each group of new papers is "knitted" to a small, select part
    of the existing scientific literature but connected rather weakly and randomly to
    a much greater part. Since only a small part of the earlier literature is knitted
    together by the new year's crop of papers, we may look upon this small part as a
    sort of growing tip or epidermal layer, an active research front. I believe it is
    the existence of a research front, in this sense, that distinguishes the sciences
    from the rest of scholarship, and, because of it, I propose that one of the major
    tasks of statistical analysis is to determine the mechanism that enables science
    to cumulate so much faster than nonscience that it produces a literature crisis.
    An analysis of the distribution of publication dates of all papers cited in a
    single year (Fig. 4) sheds further light on the existence of such a research
    front. Taking [from Garfield (2)] data for 1961, the most numerous count ...
    Science, vol. 149, p. 512
    de Solla Price (1965, 1976)

  10. ∼1970s: Cumulative advantage
    ... for n = 29,655 we have m = 0.53.
    [Figure: distribution plotted on logarithmic axes, with ticks at 1, 10, 100.]
    Fig. 1. Number of papers with (a) exactly and (b) at least n citations in ¼-, 1-, and 5-year indexes.
    Information Science, September-October 1976
    de Solla Price (1965, 1976)

  11. ∼1990s: Small-world networks
    Watts & Strogatz (1998)

  12. ∼1990s: Empirical structure and dynamics of networks
    Newman, Barabási, Watts (2006)

  13. ∼2000s: Homophily, contagion, and all that
    Figure 1: Community structure of political blogs (expanded set), shown using the GUESS
    visualization and analysis tool [2]. The colors reflect political orientation, red for
    conservative, and blue for liberal. Orange links go from liberal to conservative, and
    purple ones from conservative to liberal. The size of each blog reflects the number of
    other blogs that link to it.
    Because of bloggers' ability to identify and frame breaking news, many mainstream media
    sources keep a close eye on the best known political blogs. A number of mainstream
    news sources have started to discuss and even to host blogs.
    ... neighborhoods of Atrios, a popular liberal blog, and Instapundit, a popular
    conservative blog. He found the Instapundit neighborhood to include many more blogs than
    the Atrios one, and observed no overlap in the URLs cited ...
    Adamic & Glance (2005)

  14. Types of networks

  15. Types of networks
    Networks are a useful abstraction for many different types of data
    • Social networks (e.g., Facebook)
    • Information networks (e.g., the Web)
    • Activity networks (e.g., email)
    • Biological networks (e.g., protein interactions)
    • Geographical networks (e.g., roads)

  16. Representations
    There are many different levels of abstraction for representing
    networks (e.g., directed, weighted, metadata, etc.)
    [Figure panels: (a) a graph on 4 nodes (A, B, C, D); (b) a directed graph on 4 nodes (A, B, C, D).]
    Figure 2.1: Two graphs: (a) an undirected graph, and (b) a directed graph.
    ... will be undirected unless noted otherwise.
    Graphs as Models of Networks. Graphs are useful because they serve as mathematical
    models of network structures. With this in mind, it is useful before going further to replace
    the toy examples in Figure 2.1 with a real example. Figure 2.2 depicts the network structure ...

  17. Representations
    There are many different levels of abstraction for representing
    networks (e.g., directed, weighted, metadata, etc.)
    2.2. PATHS AND CONNECTIVITY 33

  18. Representations
    There are many different levels of abstraction for representing
    networks (e.g., directed, weighted, metadata, etc.)
    Relational Topic Models for Document Networks
    [Figure: a citation network of numbered documents, with example documents shown as title and abstract excerpt:]
    • Irrelevant features and the subset selection problem: "We address the problem of
      finding a subset of features that allows a supervised induction algorithm to induce
      small high-accuracy concepts..."
    • Learning with many irrelevant features: "In many domains, an appropriate inductive
      bias is the MIN-FEATURES bias, which prefers consistent hypotheses definable over as
      few features as possible..."
    • Evaluation and selection of biases in machine learning: "In this introduction, we
      define the term bias as it is used in machine learning systems. We motivate the
      importance of automated methods for evaluating..."
    • Utilizing prior concepts for learning: "The inductive learning problem consists of
      learning a concept given examples and nonexamples of the concept. To perform this
      learning task, inductive learning algorithms bias their learning method..."
    • Improving tactical plans with genetic algorithms: "The problem of learning decision
      rules for sequential tasks is addressed, focusing on the problem of learning tactical
      plans from a simple flight simulator where a plane must avoid a missile..."
    • An evolutionary approach to learning in robots: "Evolutionary learning methods have
      been found to be useful in several areas in the development of intelligent robots. In
      the approach described here, evolutionary..."
    • Using a genetic algorithm to learn strategies for collision avoidance and local
      navigation: "Navigation through obstacles such as mine fields is an important
      capability for autonomous underwater vehicles. One way to produce robust behavior..."
    Figure 1: Example data appropriate for the relational topic model. Each document is represented as a bag of words and
    linked to other documents via citation. The RTM defines a joint distribution over the words in each document and the
    citation links between them.
    The RTM is based on latent Dirichlet allocation (LDA) (Blei et al. 2003). LDA is a
    generative probabilistic model that uses a set of "topics," distributions over a fixed vocab- ...
    Figure 2 illustrates the graphical model for this process for a single pair of documents.
    The full model, which is difficult to illustrate, contains the observed words from all D ...

  19. Which network?
    [Figure panels: All Friends; Maintained Relationships; One-way Communication; Mutual Communication]
    Figure 3.8: Four different views of a Facebook user's network neighborhood, showing the
    structure of links corresponding respectively to all declared friendships, maintained
    relationships, one-way communication, and reciprocal (i.e. mutual) communication.
    (Image from [281].)
    Notice that these three categories are not mutually exclusive; indeed, the links classified
    as reciprocal communication always belong to the set of links classified as one-way
    communication.

  20. Which network?
    Figure 20.12: The pattern of e-mail communication among 436 employees of Hewlett
    Packard Research Lab is superimposed on the official organizational hierarchy, showing
    how network links span different social foci [6].
    (Image from http://www-personal.umich.edu/~ladamic/img/hplabsemailhierarchy.jpg)
    Social Foci and Social Distance. When we first discussed the Watts-Strogatz model in ...

  21. Which network?
    Figure 1: Topology of the largest components over various choices of threshold conditions for (a) a dataset
    based on email server logs at a US university, and (b) the Enron email corpus. Significant changes in topology
    are observed as the thresholding condition of the network is varied.
    ... where alternative definitions are considered [15, 17], the purpose is exclusively to
    serve as a robustness check on the findings; thus the scope of possibilities is typically
    limited to within some range of the original choice of threshold. Most closely related to
    the current work are two recent studies using mobile phone data [27, 9]. In [27], the
    authors systematically deleted edges as a function of call frequency in order to
    investigate the connectivity of the network, and its impact ...
    The emails contain encrypted IDs of the sender and recipient(s) of each email and the
    timestamp, but do not contain the content. The dataset also features several (anonymized)
    personal attributes, including status, gender, age, departmental affiliation, number of
    years in the community, dorm and home zipcode information for the students, as well as
    course affiliations for the students at each semester.
    In order to focus on a population of users who use emails ...
    WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

  22. Data structures
    Edge list
    [ [0,1], [0,6], [0,8], [1,4], [1,6],
    [1,9], [2,4], [2,6], [3,4], [3,5],
    [3,8], [4,5], [4,9], [7,8], [7,9] ]
    Simple for storage, but difficult to compute with
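
To illustrate why an edge list is simple to store but awkward to compute with, here is a minimal sketch using the pairs from the slide, assuming each pair is an undirected edge. The helper names `has_edge` and `neighbors` are hypothetical; the point is that both questions require scanning every pair.

```python
# Edge list from the slide: each pair [u, v] is treated as one undirected edge.
edges = [[0, 1], [0, 6], [0, 8], [1, 4], [1, 6],
         [1, 9], [2, 4], [2, 6], [3, 4], [3, 5],
         [3, 8], [4, 5], [4, 9], [7, 8], [7, 9]]


def has_edge(edges, u, v):
    """Scan every pair; the cost grows with the number of edges."""
    return any({a, b} == {u, v} for a, b in edges)


def neighbors(edges, u):
    """Finding one node's neighbors also needs a full scan of the list."""
    return sorted(b if a == u else a for a, b in edges if u in (a, b))


num_nodes = len({node for edge in edges for node in edge})
print(num_nodes, len(edges))  # 10 15
print(has_edge(edges, 4, 1))  # True
print(neighbors(edges, 4))    # [1, 2, 3, 5, 9]
```

Storage is just one pair per edge, but every query above touches the whole list, which motivates the indexed structures on the next slides.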

  23. Data structures
    Adjacency matrix
    Quick to check edges, good for
    linear algebra, often sparse
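
As a rough sketch of these properties, the code below builds an adjacency matrix from the same 15 undirected edges shown on the edge-list slide. It uses plain nested lists and a hand-written `matmul` helper (an illustrative assumption to keep the example dependency-free; a numerical library would normally do this). Edge checks become a single lookup, and matrix powers yield degrees and triangle counts.

```python
edges = [[0, 1], [0, 6], [0, 8], [1, 4], [1, 6],
         [1, 9], [2, 4], [2, 6], [3, 4], [3, 5],
         [3, 8], [4, 5], [4, 9], [7, 8], [7, 9]]

n = 1 + max(max(edge) for edge in edges)

# A[u][v] == 1 exactly when u and v share an edge; symmetric because undirected.
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = 1
    A[v][u] = 1


def matmul(X, Y):
    """Multiply two square matrices stored as nested lists."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


A2 = matmul(A, A)   # A2[i][j] counts common neighbors of i and j
A3 = matmul(A2, A)  # A3[i][i] counts closed walks of length 3 from i

print(A[0][1], A[0][2])                            # 1 0
degrees = [A2[i][i] for i in range(n)]             # diagonal of A^2 gives degrees
triangles = sum(A3[i][i] for i in range(n)) // 6   # each triangle is counted 6 times
fill = sum(map(sum, A)) / (n * n)                  # fraction of nonzero entries
print(degrees, triangles, fill)
```

In this toy graph 30 percent of the entries are nonzero; in large real networks the fraction is typically far smaller, which is why sparse storage is usually preferred.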

  24. Data structures
    Adjacency list
    Good for graph traversal
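
A minimal sketch of an adjacency list, again built from the 15 undirected edges on the edge-list slide. The `two_hop` helper is a hypothetical example of a short traversal: it collects everyone within two steps of a node by following neighbor lists directly.

```python
from collections import defaultdict

edges = [[0, 1], [0, 6], [0, 8], [1, 4], [1, 6],
         [1, 9], [2, 4], [2, 6], [3, 4], [3, 5],
         [3, 8], [4, 5], [4, 9], [7, 8], [7, 9]]

# Map each node to the list of its neighbors.
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)


def two_hop(adj, source):
    """Nodes reachable from source in one or two steps, excluding source itself."""
    reached = set(adj[source])
    for nbr in adj[source]:
        reached.update(adj[nbr])
    reached.discard(source)
    return sorted(reached)


print(adj[4])           # [1, 2, 3, 5, 9]
print(two_hop(adj, 0))  # [1, 2, 3, 4, 6, 7, 8, 9]
```

Each step of the traversal only touches the neighbors of the current node, rather than the whole edge list or a full matrix row.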

  25. Describing networks

  26. Descriptive statistics
    • Degree: How many connections does a node have?
    • Path length: What’s the shortest path between two nodes?
    • Clustering: How many friends of friends are also friends?
    • Components: How many disconnected parts does the network
    have?

  27. Algorithms for descriptive statistics
    • Degree: How many connections does a node have?
    → Degree distributions
    • Path length: What’s the shortest path between two nodes?
    → Breadth first search
    • Clustering: How many friends of friends are also friends?
    → Triangle counting
    • Components: How many disconnected parts does the network
    have?
    → Connected components
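
The four statistics above can be sketched in a few lines each. This is one possible implementation under stated assumptions, not the lecture's own code: the graph is the 15-edge undirected example from the data structures slides, stored as a dict of neighbor sets, and all function names are illustrative.

```python
from collections import Counter, deque

edges = [[0, 1], [0, 6], [0, 8], [1, 4], [1, 6],
         [1, 9], [2, 4], [2, 6], [3, 4], [3, 5],
         [3, 8], [4, 5], [4, 9], [7, 8], [7, 9]]

# Adjacency sets make both traversal and neighbor intersection cheap.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def degree_distribution(adj):
    """Degree: map each degree value to the number of nodes with that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())


def bfs_distances(adj, source):
    """Path length: breadth-first search gives shortest hop counts from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def count_triangles(adj):
    """Clustering: count each triangle once by requiring u < v < w."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if v > u:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


def connected_components(adj):
    """Components: repeatedly search from any node not yet assigned to a component."""
    seen = set()
    components = []
    for start in adj:
        if start not in seen:
            component = set(bfs_distances(adj, start))
            seen |= component
            components.append(component)
    return components


triangles = count_triangles(adj)
# Connected triples centered on each node: choose 2 of its neighbors.
triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
clustering = 3 * triangles / triples

print(sorted(degree_distribution(adj).items()))
print(bfs_distances(adj, 0)[5])
print(triangles, clustering)
print(len(connected_components(adj)))
```

The clustering value computed here is the global ratio of closed triples to all connected triples; other definitions, such as averaging a per-node coefficient, give different numbers on the same graph.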