Exactpro
November 27, 2021

TMPA-2021: Link graph and data-driven graphs as complex networks: comparative study

Vasilii Gromov, HSE

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro

Transcript

  1. Link graph and data-driven graphs as complex networks: comparative study

    Vasilii A. Gromov, The School of Data Analysis and Artificial Intelligence, HSE University, Moscow 109028, Russian Federation.
  2. Motivation

    Complex networks prove to be an efficient tool to study and describe most real-world complex systems. Usually, a vertex of a complex network has a large amount of data associated with it. It is therefore possible to construct both data-driven graphs and link graphs. One approach assumes that each graph is associated with its own dynamics, roughly corresponding to disease spreading and information spreading in social systems. It is important to compare the features of the two kinds of graphs. A large number of problems involve community detection in linked data, and many algorithms are designed to solve this problem. The overwhelming majority of such algorithms operate on either an explicit or implicit assumption that it is possible to handle link and data-driven graphs in a unified way. In turn, this implies that both graphs share qualitatively similar structures.
  3. The present paper compares complex network characteristics for various graphs, both link and data-driven:

    • ε-ball neighbourhood graph (ε-ball)
    • Gabriel graph (GG)
    • influence graph (IG)
    • nearest neighbourhood graph (NNG)
    • relative neighbourhood graph (RNG)
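The slides name the data-driven graph types without defining them. The following is a minimal sketch of four of them under their common textbook definitions, which are an assumption here and may differ in detail from the constructions used in the study. Points are plain coordinate tuples, edges are index pairs `(i, j)` with `i < j`, and all function names are illustrative. The influence graph is left out because the deck does not say which variant is meant.

```python
from itertools import combinations
from math import dist


def epsilon_ball_graph(points, eps):
    """Connect every pair of points whose distance does not exceed eps."""
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) <= eps}


def nearest_neighbour_graph(points, k=1):
    """Connect each point to its k nearest neighbours (edges kept undirected)."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def gabriel_graph(points):
    """Keep (i, j) if no third point lies strictly inside the ball with diameter ij."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        d_ij_sq = dist(points[i], points[j]) ** 2
        if all(dist(points[i], points[r]) ** 2 + dist(points[j], points[r]) ** 2 >= d_ij_sq
               for r in range(len(points)) if r not in (i, j)):
            edges.add((i, j))
    return edges


def relative_neighbourhood_graph(points):
    """Keep (i, j) if no third point is closer to both i and j than they are to each other."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])
        if all(max(dist(points[i], points[r]), dist(points[j], points[r])) >= d_ij
               for r in range(len(points)) if r not in (i, j)):
            edges.add((i, j))
    return edges
```

These brute-force versions are cubic in the number of points for GG and RNG, so they only suit small samples. Under these definitions every RNG edge is also a GG edge.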
  4. • A link graph usually shows most features of complex networks.
    • A data-driven graph usually retains only a part of them.

    Possible explanations:
    • The particular method used to construct the data-driven graph may be inapplicable to data with a complex network structure.
    • The respective features may not be of fundamental importance for the description of complex networks.

    Furthermore, it seems that if the features of graphs of both types coincide, then the graphs reflect the true data structure of the information object. The similarity between results for graphs of both types defines, in a sense, a measure of the observability of the data structure.
  5. Related works

    • Qiao et al.
      ◦ review the majority of state-of-the-art data-driven graph construction techniques;
      ◦ classify all possible approaches into parameter-free techniques, single-parameter techniques, and techniques that require flexible parameter selection schemes.
    • Aronov et al.
      ◦ consider witness proximity graphs, which are data-driven graphs with the structure determined by two separate sets of points.
    • Aiello et al.
      ◦ estimate a range for the power exponent of the degree distribution such that a giant component necessarily exists for any complex network whose degree distribution corresponds to this range.
    • Ding et al.
      ◦ discuss an evolving RNG of a traffic network interpreted as a complex network.
    • Clauset et al.
      ◦ employ the Kolmogorov-Smirnov statistic to estimate both a power-law exponent and a lower cutoff.
  6. Datasets

    2. The second dataset:
    • belongs to the Stanford University complex networks collection
    • size: 548552 vertices (goods and comments on them) and 987942 edges
    • information about goods that are available on the Amazon site
    • information about a vertex (good): a sales rank, its internal popularity rating; a list of goods frequently bought with it; detailed information about the category it belongs to; users’ comments on the good
    • all frequent words of the English language were removed from the comments before the analysis
  7. Datasets

    1. Six Degrees of Francis Bacon (SDFB):
    • belongs to the Folger Shakespeare digital library
    • size: 15801 vertices (persons) and 171408 edges (acquaintances)
    • information about circles of friends for Frenchmen of the XVII and XVIII centuries
    • two persons are considered acquainted if letters between them have survived, or if one has mentioned the other in some historical document
    • information about a vertex (person): name, sex, title (if any), trade, dates of birth and death, and the documents in which he or she has been mentioned
  8. Datasets

    3. The third dataset:
    • belongs to the Aminer collection
    • size: 112416 vertices (users) and 308927 tweets
    • information about Twitter users’ subscriptions and actions

    For all presented values, the normalized p-value does not exceed 0.001 for power-law distributions and 0.1 for normal distributions. A goodness-of-fit (GoF) statistic that exceeds 0.1 suggests a power-law distribution.
  9. Link graphs characteristics. SDFB

    The SDFB link graph appears to be weakly disassortative. Its assortativity coefficient equals -0.08 (if the Pearson correlation coefficient is used), -0.08 (Spearman), and -0.06 (Kendall).
  10. Link graphs characteristics. Amazon dataset

    The two other link graphs appear to be assortative. For the Amazon link graph, the respective values are 0.04 (Pearson), 0.07 (Spearman), and 0.05 (Kendall).
  11. Link graphs characteristics. Twitter dataset

    The two other link graphs appear to be assortative. For the Twitter link graph, the respective values are 0.365 (Pearson), 0.447 (Spearman), and 0.412 (Kendall).
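The assortativity coefficients quoted for the link graphs are correlations between the degrees found at the two ends of an edge. A minimal sketch of the Pearson variant for an undirected graph follows. The function names are illustrative, and the exact estimator used in the study is not given in the slides. Swapping `pearson` for a rank correlation would give the Spearman or Kendall variants.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # All end-point degrees are equal (regular graph): coefficient undefined.
        raise ValueError("assortativity is undefined when all degrees coincide")
    return cov / sqrt(var_x * var_y)


def degree_assortativity(edges):
    """Correlate the degrees at both ends of every undirected edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        # Each undirected edge is counted in both directions to keep symmetry.
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    return pearson(xs, ys)
```

A star graph, where a single hub connects only to leaves, is the extreme disassortative case and yields exactly -1. Values near zero, such as those of the SDFB and Amazon link graphs, indicate only a weak degree correlation.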
  12. Link graphs characteristics

    The distributions were tested for normality, and the hypotheses were rejected. The data appear to be consistent with power-law distributions. This allows one to draw the conclusion that all three graphs feature the small-world property.

    Dataset   α      xmin   Goodness-of-fit statistic   Graph diameter   Community sizes, power-law exponent
    SDFB      2.8    15     0.42                        9                2.11
    Amazon    3.18   75     0.88                        12               2.97
    Twitter   3.34   19     0.62                        11               2.28
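The exponent α and the cutoff xmin in such a table are typically obtained by maximum likelihood combined with a Kolmogorov-Smirnov distance, in the spirit of the Clauset et al. approach mentioned in the related works. Below is a minimal sketch under a continuous power-law assumption with a fixed xmin. The function names are illustrative, the study's actual fitting code is not shown in the slides, and the KS distance computed here is not the same quantity as the goodness-of-fit statistic reported on the slide.

```python
from math import log


def fit_alpha(samples, xmin):
    """Maximum-likelihood exponent of a continuous power law on x >= xmin."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(log(x / xmin) for x in tail)


def ks_distance(samples, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    tail = sorted(x for x in samples if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical step just before and just after x.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


def synthetic_power_law(n, xmin, alpha):
    """Deterministic sample: inverse-transform at evenly spaced quantiles."""
    return [xmin * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0))
            for i in range(n)]
```

Feeding `synthetic_power_law(2000, 15.0, 2.8)` into `fit_alpha` with `xmin=15.0` recovers an exponent close to 2.8, and the resulting KS distance is small. To estimate xmin as well, one would repeat the fit over candidate cutoffs and keep the one with the smallest KS distance.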
  13. Data-driven graphs characteristics. SDFB

    Graph \ correlation coefficient   Pearson   Spearman   Kendall
    ε-ball                            0.98      0.96       0.84
    GG                                0.36      0.37       0.27
    IG                                0.59      0.47       0.4
    NNG                               0.12      0.12       0.1
    RNG                               0.19      0.12       0.17
  14. Data-driven graphs characteristics. Amazon dataset

    Graph \ correlation coefficient   Pearson   Spearman   Kendall
    ε-ball                            0.9       0.93       0.78
    GG                                0.41      0.44       0.38
    IG                                0.4       0.46       0.34
    NNG                               0.13      0.12       0.09
    RNG                               0.2       0.19       0.16
  15. Data-driven graphs characteristics. Twitter dataset

    Graph \ correlation coefficient   Pearson   Spearman   Kendall
    ε-ball                            0.95      0.95       0.87
    GG                                0.35      0.32       0.3
    IG                                0.41      0.48       0.36
    NNG                               0.13      0.12       0.09
    RNG                               0.17      0.18       0.14
  16. Data-driven graphs characteristics

    The assortativity coefficient changes drastically when the link graphs are replaced by the data-driven ones; sometimes, assortativity alters to disassortativity. It is worth stressing that, for a given type of data-driven graph, the assortativity coefficients are rather close to each other across the datasets. It seems that their values are determined not by the dataset, but rather by the type of data-driven graph.
  17. Data-driven graphs characteristics

    The degree distributions are tested for normality and for consistency with a power law.

    Follow a power law:
    • ε-ball
    • GG

    Follow a normal distribution:
    • IG
    • NNG
    • RNG
  18. Data-driven graphs characteristics

    Moreover, even for the graphs that keep following a power law, the exponent changes from values that yield a finite mathematical expectation to values that yield an infinite one:
    • SDFB dataset: from 2.8 to 1.49 and 1.4;
    • Amazon dataset: from 3.71 to 1.72 and 1.6;
    • Twitter dataset: from 3.34 to 1.53 and 1.36.

    Besides that, for the ε-ball neighbourhood graph, the exponents depend on ε.
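The border between finite and infinite expectations can be made explicit with a short derivation. It assumes an idealised continuous power law on x ≥ x_min with exponent α > 1, which is a simplification of the discrete degree distributions discussed in the slides.

```latex
% Assumed model: continuous power law with lower cutoff x_min and exponent alpha > 1
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
\qquad x \ge x_{\min}

\mathbb{E}[X]
  = \int_{x_{\min}}^{\infty} x \, p(x) \, dx
  = (\alpha - 1) \, x_{\min}^{\alpha - 1} \int_{x_{\min}}^{\infty} x^{1 - \alpha} \, dx
  = \begin{cases}
      \dfrac{\alpha - 1}{\alpha - 2} \, x_{\min}, & \alpha > 2, \\[2ex]
      \infty, & 1 < \alpha \le 2.
    \end{cases}
```

Under this model, an exponent of 2.8 gives a finite mean of 2.25 x_min, whereas exponents such as 1.49 or 1.4 fall below the border at α = 2, where the integral diverges.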
  19. Data-driven graphs characteristics

    • All PDFs for:
      ◦ data-driven graphs show hunches, typical for most real-world power laws;
      ◦ link graphs do not show the hunches in the double logarithmic scale.
    • The small-world property appears to be inherent, to some extent, to all data-driven graphs.
    • All the data-driven graphs exhibit relatively large values of average clustering coefficients. Interestingly, the estimates for average clustering coefficients of link and data-driven graphs demonstrate a satisfactory agreement.
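The small-world property is usually judged from two quantities: a large average clustering coefficient and short path lengths. A minimal sketch of both for an undirected graph given as an edge list follows. The function names are illustrative, and the slides do not state how the study computed these values.

```python
from collections import deque
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(adj):
    """Return (average shortest path length, diameter) over reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter
```

Running one breadth-first search per node is quadratic in the worst case, so for graphs of the size described in the datasets one would more likely sample source nodes than scan them all.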
  20. Conclusion (1/2)

    • Most link and data-driven graphs have giant components and power-law distributions of community sizes.
    • The power exponents of the respective distributions for link and data-driven graphs are in good agreement.
    • Data-driven graphs show the small-world property and relatively large values of clustering coefficients, provided the same holds true for the respective link graph.
    • The ε-ball graph and the GG have a power-law degree distribution; the other types of data-driven graphs have a normal one.
    • The value of the power exponent crosses the border between exponents that yield finite mathematical expectations and those that yield infinite ones.
  21. Conclusion (2/2)

    • The assortativity coefficient is essentially corrupted when one moves from a link graph to data-driven graphs.
    • Sometimes, assortativity alters to disassortativity. It seems that values of the assortativity coefficients are determined not by the dataset, but rather by the type of data-driven graph.

    Among all data-driven graphs considered, the GG seems to retain most properties of complex networks.