Slide 1

Slide 1 text

Link graph and data-driven graphs as complex networks: comparative study
Vasilii A. Gromov
[email protected]
The School of Data Analysis and Artificial Intelligence, HSE University, Moscow 109028, Russian Federation

Slide 2

Slide 2 text

Motivation
● Complex networks prove to be an efficient tool for studying and describing most real-world complex systems.
● Usually, a vertex of a complex network has a large amount of data associated with it.
● It is therefore possible to construct both data-driven graphs and link graphs.
● One approach assumes that each graph is associated with its own dynamics, roughly corresponding to disease spreading and information spreading in social systems.
● It is important to compare the features of the two kinds of graphs.
● A large number of problems involve community detection in linked data, and many algorithms are designed to solve this problem.
● The overwhelming majority of such algorithms assume, explicitly or implicitly, that link and data-driven graphs can be handled in a unified way. In turn, this implies that both graphs share qualitatively similar structures.

Slide 3

Slide 3 text

The present paper compares complex network characteristics for various graphs, both link and data-driven:
● ε-ball neighbourhood graph (ε-ball)
● Gabriel graph (GG)
● influence graph (IG)
● nearest neighbourhood graph (NNG)
● relative neighbourhood graph (RNG)
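As an illustration of how such data-driven graphs can be built from points, the sketch below constructs four of the five graphs with their standard textbook definitions. It is a minimal example, not the construction used in the study: the function names, the Euclidean metric, and the sample points are assumptions, and the influence graph is omitted because the slides do not define it.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance (no square roots, exact for integer coordinates)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def eps_ball_graph(points, eps):
    """Connect i and j when their distance does not exceed eps."""
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if sq_dist(points[i], points[j]) <= eps ** 2}


def gabriel_graph(points):
    """Connect i and j when no third point lies strictly inside the ball with diameter ij."""
    n = len(points)
    return {(i, j) for i, j in combinations(range(n), 2)
            if all(sq_dist(points[i], points[k]) + sq_dist(points[j], points[k])
                   >= sq_dist(points[i], points[j])
                   for k in range(n) if k != i and k != j)}


def relative_neighbourhood_graph(points):
    """Connect i and j when no third point is closer to both of them than they are to each other."""
    n = len(points)
    return {(i, j) for i, j in combinations(range(n), 2)
            if all(max(sq_dist(points[i], points[k]), sq_dist(points[j], points[k]))
                   >= sq_dist(points[i], points[j])
                   for k in range(n) if k != i and k != j)}


def nearest_neighbour_graph(points):
    """Connect every point to its single nearest neighbour (stored as undirected edges)."""
    n = len(points)
    edges = set()
    for i in range(n):
        j = min((k for k in range(n) if k != i),
                key=lambda k: sq_dist(points[i], points[k]))
        edges.add((min(i, j), max(i, j)))
    return edges


# Hypothetical sample points, chosen only for illustration.
points = [(0, 0), (4, 1), (1, 3), (5, 5)]
ball = eps_ball_graph(points, eps=4)
gg = gabriel_graph(points)
rng = relative_neighbourhood_graph(points)
nng = nearest_neighbour_graph(points)
```

For points in general position the three parameter-free graphs are nested (NNG within RNG within GG), which the sample above reproduces. The brute-force check over all third points is cubic in the number of points, so a real dataset would need a spatial index or a Delaunay-based construction.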

Slide 4

Slide 4 text

● A link graph usually shows most features of complex networks.
● A data-driven graph usually retains only a part of them.
Possible explanations:
● A given method of constructing a data-driven graph may be inapplicable to data with a complex network structure.
● The respective features may not be of fundamental importance for the description of complex networks.
Furthermore, it seems that if the features of graphs of both types coincide, then the graphs reflect the true data structure of the information object. The similarity between results for graphs of both types defines, in a sense, a measure of the observability of the data structure.

Slide 5

Slide 5 text

Related works
● Qiao et al.
○ review the majority of state-of-the-art data-driven graph construction techniques
○ classify all possible approaches into parameter-free techniques, single-parameter techniques, and techniques that require flexible parameter selection schemes
● Aronov et al.
○ consider witness proximity graphs, which are data-driven graphs whose structure is determined by two separate sets of points
● Aiello et al.
○ estimate a range for the power exponent of the degree distribution such that a giant component necessarily exists for any complex network whose degree distribution falls in this range
● Ding et al.
○ discuss an evolving RNG of a traffic network interpreted as a complex network
● Clauset et al.
○ employ the Kolmogorov-Smirnov statistic to estimate both a power-law exponent and a lower cutoff

Slide 6

Slide 6 text

Notation

Slide 7

Slide 7 text

Notation. Data-driven graphs

Slide 8

Slide 8 text

Datasets
2. The second dataset:
● belongs to the Stanford University complex networks collection
● size: 548552 vertices (goods and comments on them) and 987942 edges
● information about goods that are available on the Amazon site
● information about a vertex (good): a sales rank, its internal popularity rating; a list of goods frequently bought with it; detailed information about the category it belongs to; users’ comments on the good
● all frequent words of the English language were removed from the comments before the analysis

Slide 9

Slide 9 text

Datasets
1. Six Degrees of Francis Bacon (SDFB):
● belongs to the Folger Shakespeare digital library
● size: 15801 vertices (persons) and 171408 edges (acquaintances)
● information about circles of friends for Frenchmen of the XVII and XVIII centuries
● two persons are considered acquainted if letters between them have survived, or if one has mentioned the other in some historical document
● information about a vertex (person): name, sex, title (if any), trade, dates of birth and death, and the documents in which he or she has been mentioned

Slide 10

Slide 10 text

Datasets
3. The third dataset:
● belongs to the Aminer collection
● size: 112416 vertices (users) and 308927 tweets
● information about Twitter users’ subscriptions and actions
For all presented values, the normalized p-value does not exceed 0.001 for power-law distributions and 0.1 for normal distributions. A goodness-of-fit (GoF) statistic that exceeds 0.1 suggests a power-law distribution.

Slide 11

Slide 11 text

Link graphs characteristics. SDFB
The SDFB link graph appears to be weakly disassortative. Its assortativity coefficient equals -0.08 (if the Pearson correlation coefficient is used), -0.08 (Spearman), and -0.06 (Kendall).
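The three variants of the assortativity coefficient quoted above can be read as three correlation measures applied to the degrees found at the two ends of every edge. The sketch below illustrates that reading in plain Python. It is an assumption about how the coefficients are defined, not the code used in the study, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def degree_pairs(edges):
    """Degrees at both ends of every edge, with each edge counted in both directions."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation; NaN when one variable is constant (e.g. a regular graph)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else float("nan")


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_b(xs, ys):
    """Kendall tau-b with tie correction; quadratic in the number of pairs."""
    concordant = discordant = untied_x = untied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        untied_x += dx != 0
        untied_y += dy != 0
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    if untied_x == 0 or untied_y == 0:
        return float("nan")
    return (concordant - discordant) / sqrt(untied_x * untied_y)


# A star graph is the textbook example of a perfectly disassortative network.
star = [(0, 1), (0, 2), (0, 3)]
xs, ys = degree_pairs(star)
coefficients = (pearson(xs, ys), spearman(xs, ys), kendall_tau_b(xs, ys))
```

Negative values indicate that high-degree vertices tend to attach to low-degree ones, and positive values indicate the opposite. The quadratic Kendall loop is only practical for small graphs; graphs of the size used here would need an O(n log n) implementation.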

Slide 12

Slide 12 text

Link graphs characteristics. Amazon dataset
The two other graphs appear to be assortative. For the Amazon link graph, the respective values are 0.04, 0.07, and 0.05.

Slide 13

Slide 13 text

Link graphs characteristics. Twitter dataset
The two other graphs appear to be assortative. For the Twitter link graph, the respective values are 0.365, 0.447, and 0.412.

Slide 14

Slide 14 text

Link graphs characteristics
The distributions were tested for normality, and the hypotheses were rejected. The data appear to be consistent with power-law distributions. This allows one to conclude that all three graphs have the small-world property.

Dataset   α      xmin   Goodness-of-fit statistic   Graph diameter   Community-size power-law exponent
SDFB      2.8    15     0.42                        9                2.11
Amazon    3.18   75     0.88                        12               2.97
Twitter   3.34   19     0.62                        11               2.28
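The exponent α and the lower cutoff xmin in such a table are commonly estimated in the manner of Clauset et al.: a maximum-likelihood estimate of α for the tail x ≥ xmin, combined with the Kolmogorov-Smirnov distance between the empirical and fitted distributions. The sketch below illustrates the idea under stated assumptions. It uses the continuous power-law approximation, the function names are hypothetical, and the reported goodness-of-fit statistic is not necessarily the distance computed here.

```python
from math import log


def fit_alpha(data, xmin):
    """Maximum-likelihood exponent of a continuous power law on the tail x >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(log(x / xmin) for x in tail)


def ks_distance(data, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted power-law CDF."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


# Synthetic, deterministic sample drawn by inverse-transform sampling on a
# regular grid; the parameter values are illustrative assumptions.
n, true_alpha, xmin = 2000, 2.8, 1.0
sample = [xmin * (1.0 - (k + 0.5) / n) ** (-1.0 / (true_alpha - 1.0)) for k in range(n)]

alpha_hat = fit_alpha(sample, xmin)
ks = ks_distance(sample, xmin, alpha_hat)
```

In the full procedure one would repeat the fit for every candidate xmin and keep the value that minimises the Kolmogorov-Smirnov distance. Degree data are discrete, so a discrete estimator would be the more careful choice for real degree sequences.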

Slide 15

Slide 15 text

Data-driven graphs characteristics. SDFB

Graph \ correlation coefficient   Pearson   Spearman   Kendall
ε-ball                            0.98      0.96       0.84
GG                                0.36      0.37       0.27
IG                                0.59      0.47       0.4
NNG                               0.12      0.12       0.1
RNG                               0.19      0.12       0.17

Slide 16

Slide 16 text

Data-driven graphs characteristics. Amazon dataset

Graph \ correlation coefficient   Pearson   Spearman   Kendall
ε-ball                            0.9       0.93       0.78
GG                                0.41      0.44       0.38
IG                                0.4       0.46       0.34
NNG                               0.13      0.12       0.09
RNG                               0.2       0.19       0.16

Slide 17

Slide 17 text

Data-driven graphs characteristics. Twitter dataset

Graph \ correlation coefficient   Pearson   Spearman   Kendall
ε-ball                            0.95      0.95       0.87
GG                                0.35      0.32       0.3
IG                                0.41      0.48       0.36
NNG                               0.13      0.12       0.09
RNG                               0.17      0.18       0.14

Slide 18

Slide 18 text

Data-driven graphs characteristics
The assortativity coefficient changes drastically when the link graphs are replaced by the data-driven ones; sometimes, assortativity turns into disassortativity. It is worth stressing that, for a given type of data-driven graph, the assortativity coefficients are rather close to each other across the datasets. It seems that their values are determined not by the dataset, but rather by the type of the data-driven graph.

Slide 19

Slide 19 text

Data-driven graphs characteristics
The degree distributions are tested for normality and for consistency with a power law.
Follow a power law:
● ε-ball
● GG
Follow a normal distribution:
● IG
● NNG
● RNG

Slide 20

Slide 20 text

Data-driven graphs characteristics
Moreover, even for the graphs that keep following a power law, the exponent changes from values that yield finite mathematical expectations to values that yield infinite ones:
● SDFB dataset: from 2.8 to 1.49 and 1.4;
● Amazon dataset: from 3.71 to 1.72 and 1.6;
● Twitter dataset: from 3.34 to 1.53 and 1.36.
Besides that, for the ε-ball neighbourhood graph, the exponents depend on ε.
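The border between finite and infinite expectations can be made explicit with a short derivation. The continuous power-law density below is an assumption made for illustration; the discrete case has the same threshold.

```latex
\[
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
\qquad x \ge x_{\min},\ \alpha > 1,
\]
\[
\mathbb{E}[x] = \int_{x_{\min}}^{\infty} x \, p(x) \, dx =
\begin{cases}
\dfrac{\alpha - 1}{\alpha - 2} \, x_{\min}, & \alpha > 2, \\[2ex]
\infty, & 1 < \alpha \le 2.
\end{cases}
\]
```

Under this density, an exponent of 2.8 gives a finite mean of 2.25 x_min, whereas every exponent between 1 and 2, such as 1.49 or 1.4, gives a divergent mean.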

Slide 21

Slide 21 text

Data-driven graphs characteristics
● All PDFs for:
○ data-driven graphs show hunches, typical for most real-world power laws;
○ link graphs do not show the hunches in the double logarithmic scale.
● The small-world property appears to be inherent, to some extent, to all data-driven graphs.
● All the data-driven graphs exhibit relatively large values of average clustering coefficients.
Interestingly, the estimates for average clustering coefficients of link and data-driven graphs demonstrate a satisfactory agreement.
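For reference, the average clustering coefficient mentioned above can be computed as in the sketch below. It assumes the usual local definition (the fraction of a vertex's neighbour pairs that are themselves connected, averaged over all vertices, with vertices of degree below two counted as zero). The slides do not state which variant was used, and the function name is hypothetical.

```python
from itertools import combinations


def average_clustering(edges):
    """Mean local clustering coefficient of an undirected graph given as an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    total = 0.0
    for neighbours in adjacency.values():
        k = len(neighbours)
        if k < 2:
            continue  # fewer than two neighbours: local coefficient taken as 0
        linked = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adjacency)


# A triangle with one pendant vertex attached to it (illustrative example).
example = [(0, 1), (1, 2), (0, 2), (2, 3)]
clustering = average_clustering(example)
```

In the example, two vertices have coefficient 1, the vertex carrying the pendant edge has 1/3, and the pendant vertex has 0, so the average is 7/12.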

Slide 22

Slide 22 text

Conclusion (1/2)
● Most link and data-driven graphs have giant components and power-law distributions of community sizes.
● The power exponents of the respective distributions for link and data-driven graphs are in good agreement.
● Data-driven graphs have the small-world property and relatively large values of clustering coefficients, provided the same holds true for the respective link graph.
● The ε-ball and the GG have a power-law degree distribution; the other types of data-driven graphs have a normal one.
● The value of the power exponent crosses the border between exponents that yield finite mathematical expectations and those that yield infinite ones.

Slide 23

Slide 23 text

Conclusion (2/2)
● The assortativity coefficient is essentially corrupted when one moves from a link graph to data-driven graphs.
● Sometimes, assortativity turns into disassortativity. It seems that the values of the assortativity coefficients are determined not by the dataset, but rather by the type of the data-driven graph.
Among all data-driven graphs considered, the GG seems to retain most properties of complex networks.