
DynaGraph: Dynamic Graph Neural Networks at Scale


Anand Iyer

June 12, 2022


Transcript

  1. DynaGraph: Dynamic Graph Neural Networks at Scale
     Mingyu Guan*, Anand Iyer▴, Taesoo Kim*
     *Georgia Institute of Technology  ▴Microsoft Research
     GRADES-NDA 2022
  2. Graph Neural Networks (GNNs)
     • The recent past has seen increasing interest in GNNs.
     • Node embeddings are generated by combining graph structure and feature information.
     • Most GNN models fit into the message passing paradigm.
     [Figure: a graph with nodes A-E passed through a GNN, mapping the initial features/embeddings of each node to output features/embeddings.]
  3. Message Passing Paradigm
     [Figure: node D with neighbors B and C; the current neighbor states and the current node state h_v^{k-1}.]
  4. Message Passing Paradigm
     [Figure: same setup; neighbors B and C send messages to D (messages from neighbors).]
  5. Message Passing Paradigm
     [Figure: D aggregates and reduces the received messages into m_v^k.]
  6. Message Passing Paradigm
     [Figure: D updates its state using m_v^k, producing the next node state h_v^k.]
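To make the paradigm concrete, here is a minimal, dependency-free Python sketch of one message-passing step; the sum aggregation and mean-style update are illustrative placeholders, not DynaGraph code.

    # One message-passing layer over an adjacency-list graph.
    # h: dict node -> current state (list of floats); neighbors: dict node -> list of neighbor nodes.
    def message_passing_step(h, neighbors):
        new_h = {}
        for v, nbrs in neighbors.items():
            # Messages from neighbors: here simply each neighbor's current state.
            messages = [h[u] for u in nbrs]
            # Aggregate and reduce received messages (element-wise sum).
            m_v = [sum(vals) for vals in zip(*messages)] if messages else [0.0] * len(h[v])
            # Update: combine the previous state with the aggregated message (element-wise mean).
            new_h[v] = [(a + b) / 2.0 for a, b in zip(h[v], m_v)]
        return new_h

    # Example: the B-C-D neighborhood from the slides.
    h = {"B": [1.0, 0.0], "C": [0.0, 1.0], "D": [0.5, 0.5]}
    neighbors = {"B": ["D"], "C": ["D"], "D": ["B", "C"]}
    h = message_passing_step(h, neighbors)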
  7. Dynamic GNNs
     • Most existing GNN frameworks assume that the input graph is static.
     • Real-world graphs are often dynamic in nature.
     • Representation: a time series of snapshots of the graph.
     • Common approach: combine GNNs and RNNs (see the sketch below).
       ○ GNNs encode spatial information (graph structure).
       ○ RNNs encode temporal information.
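A hedged PyTorch sketch of this GNN + RNN combination; the mean-aggregation graph convolution, the GRUCell pairing, and all layer sizes are illustrative assumptions, not the DCRNN/GCRN formulations.

    import torch
    import torch.nn as nn

    class SimpleGraphConv(nn.Module):
        """Aggregate neighbor features with a normalized adjacency, then apply a linear transform."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim)

        def forward(self, adj, x):
            # adj: [N, N] row-normalized adjacency; x: [N, in_dim]
            return torch.relu(self.lin(adj @ x))

    class GraphRNN(nn.Module):
        """Per-snapshot GNN for spatial encoding + GRU cell for temporal encoding."""
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.gnn = SimpleGraphConv(in_dim, hid_dim)
            self.rnn = nn.GRUCell(hid_dim, hid_dim)

        def forward(self, snapshots, features):
            # snapshots: list of [N, N] adjacencies; features: list of [N, in_dim] node features
            h = torch.zeros(features[0].shape[0], self.rnn.hidden_size)
            for adj, x in zip(snapshots, features):
                spatial = self.gnn(adj, x)   # encode graph structure at time t
                h = self.rnn(spatial, h)     # carry temporal state across snapshots
            return h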
  8. [Figure: LSTM and GRU cells expanded into their gates (i, f, c, o for the LSTM; r, z, h for the GRU); each gate computes W_x·x_t + W_h·h_{t-1} followed by an activation.]
  9. [Figure: the same LSTM/GRU gate diagram; the W_x·x_t inputs are highlighted as time-independent.]
  10. [Figure: the same diagram; the W_h·h_{t-1} inputs are additionally highlighted as time-dependent.]
  11. [Figure: repeated build of the previous slide, contrasting the time-independent and time-dependent terms.]
  12. GraphLSTM / GraphGRU
      [Figure: the same gate structure, with every dense multiplication replaced by a graph convolution: gate i computes G_conv(x_t, W_xi) and G_conv(h_{t-1}, W_hi), and likewise for gates f, c, o (GraphLSTM) and r, z, h (GraphGRU). The G_conv(x_t, ·) terms are time-independent; the G_conv(h_{t-1}, ·) terms are time-dependent.]
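A minimal sketch of the GraphGRU idea in PyTorch, assuming a simple aggregate-then-transform graph convolution; the gate equations follow a standard GRU and are illustrative, not the exact GCRN/DCRNN convolutions.

    import torch
    import torch.nn as nn

    class GraphGRUCell(nn.Module):
        """GRU cell whose per-gate transforms are graph convolutions instead of dense layers."""
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            # One input-side and one hidden-side transform per gate (r, z, h).
            self.w_x = nn.ModuleDict({g: nn.Linear(in_dim, hid_dim) for g in "rzh"})
            self.w_h = nn.ModuleDict({g: nn.Linear(hid_dim, hid_dim) for g in "rzh"})

        def g_conv(self, adj, x, lin):
            # Graph convolution: aggregate neighbor features, then transform.
            return lin(adj @ x)

        def forward(self, adj, x_t, h_prev):
            r = torch.sigmoid(self.g_conv(adj, x_t, self.w_x["r"]) + self.g_conv(adj, h_prev, self.w_h["r"]))
            z = torch.sigmoid(self.g_conv(adj, x_t, self.w_x["z"]) + self.g_conv(adj, h_prev, self.w_h["z"]))
            h_tilde = torch.tanh(self.g_conv(adj, x_t, self.w_x["h"]) + self.g_conv(adj, r * h_prev, self.w_h["h"]))
            return (1 - z) * h_prev + z * h_tilde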
  13. Challenge #1: Redundant Neighborhood Aggregation
      • Two categories of graph convolutions:
        ○ Time-independent graph convolutions depend on the current representations of nodes.
        ○ Time-dependent graph convolutions depend on the previous hidden states.
      • Redundancy: graph convolutions in the same category perform the same neighborhood aggregation, as the short derivation below illustrates.
      [Figure: GraphLSTM gates i, f, c, o, each computing G_conv(x_t, W_x·) and G_conv(h_{t-1}, W_h·).]
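As a hedged illustration, assume the graph convolution has the common aggregate-then-transform form G_conv(x, W) = Â·x·W, where Â is the normalized adjacency matrix (this specific form is an assumption for exposition). The redundancy is then visible directly in the gate terms:

    \begin{aligned}
    G_{conv}(x_t, W_{xi}) &= (\hat{A} x_t)\, W_{xi}, \quad
    G_{conv}(x_t, W_{xf}) = (\hat{A} x_t)\, W_{xf}, \quad \dots \\
    G_{conv}(h_{t-1}, W_{hi}) &= (\hat{A} h_{t-1})\, W_{hi}, \quad
    G_{conv}(h_{t-1}, W_{hf}) = (\hat{A} h_{t-1})\, W_{hf}, \quad \dots
    \end{aligned}

Every time-independent gate shares the aggregation \hat{A} x_t and every time-dependent gate shares \hat{A} h_{t-1}, so each aggregation only needs to be computed once per snapshot.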
  14. Challenge #2: Inefficient Distributed Training
      • No existing static-GNN training system (e.g., DGL) supports distributed dynamic GNN training efficiently.
      • Static GNN training:
        ○ Partition both the graph structure and the node features across machines.
        ○ Use data parallelism to train the static GNN.
      • Can we partition each snapshot individually?
        § Partitioning and maintaining a large number of snapshots can be expensive.
        § The graph structure and the node features may vary from snapshot to snapshot.
  15. Cached Message Passing
      Typical message passing paradigm of a GNN:
        m_{u→v}^k = M^k(h_v^{k-1}, h_u^{k-1}, e_{u→v}^{k-1})
        m_v^k = Σ_{u ∈ N(v)} m_{u→v}^k
        h_v^k = U^k(h_v^{k-1}, m_v^k)
      [Figure: GraphLSTM gates i, f, c, o, each split into a shared message-passing part (m_x^t, m_h^{t-1}) and a per-gate weight (W_x·, W_h·); the time-independent and time-dependent parts are marked.]
  16. Cached Message Passing
      Typical message passing paradigm of a GNN (as above). The message-passing results can be reused for all graph convolutions in the same category (a sketch follows below).
      [Figure: the same GraphLSTM diagram, with the shared message-passing results m_x^t and m_h^{t-1} feeding every gate in their category.]
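A hedged Python sketch of this reuse for a GraphGRU-style cell, assuming the aggregate-then-transform convolution from earlier; applying the reset gate after the aggregation is a simplification for illustration.

    import torch
    import torch.nn as nn

    class CachedGraphGRUCell(nn.Module):
        """Computes each neighborhood aggregation once per category and reuses it for every gate."""
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.w_x = nn.ModuleDict({g: nn.Linear(in_dim, hid_dim) for g in "rzh"})
            self.w_h = nn.ModuleDict({g: nn.Linear(hid_dim, hid_dim) for g in "rzh"})

        def forward(self, adj, x_t, h_prev):
            # Message passing is performed once per category...
            m_x = adj @ x_t      # time-independent aggregation, shared by gates r, z, h
            m_h = adj @ h_prev   # time-dependent aggregation, shared by gates r, z, h
            # ...so only cheap per-gate transforms remain.
            r = torch.sigmoid(self.w_x["r"](m_x) + self.w_h["r"](m_h))
            z = torch.sigmoid(self.w_x["z"](m_x) + self.w_h["z"](m_h))
            h_tilde = torch.tanh(self.w_x["h"](m_x) + r * self.w_h["h"](m_h))
            return (1 - z) * h_prev + z * h_tilde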
  17. Cached Message Passing
      • Dynamic graphs are often trained using sequence-to-sequence models in a sliding-window fashion.
      [Figure: a two-layer encoder-decoder of GraphRNN cells unrolled over t = 1…4, with hidden states H passed between layers and teacher states (ground truth) fed to the decoder; this window forms sequence 1.]
  18. Cached Message Passing
      • Dynamic graphs are often trained using sequence-to-sequence models in a sliding-window fashion.
      [Figure: the window slides by one snapshot to t = 2…5, forming sequence 2.]
  19. Cached Message Passing
      • Dynamic graphs are often trained using sequence-to-sequence models in a sliding-window fashion.
      • Neighborhood aggregation has already been performed for the overlapping snapshots in previous sequence(s)!
      [Figure: the same sliding window (t = 2…5), with the overlap with sequence 1 highlighted.]
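A small, hedged sketch of the sliding-window sequence construction; the window lengths and the encoder/decoder split are illustrative choices.

    def sliding_windows(snapshots, encoder_len=3, decoder_len=1):
        """Yield overlapping (encoder_inputs, decoder_targets) windows over a snapshot series."""
        window = encoder_len + decoder_len
        for start in range(len(snapshots) - window + 1):
            enc = snapshots[start : start + encoder_len]
            dec = snapshots[start + encoder_len : start + window]
            yield enc, dec

    # Consecutive windows overlap in all but one snapshot, so their neighborhood
    # aggregations can be cached and reused across sequences.
    seqs = list(sliding_windows([1, 2, 3, 4, 5]))  # [([1, 2, 3], [4]), ([2, 3, 4], [5])]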
  20. Cached Message Passing
      [Figure: a GraphLSTM cell next to a cache store that holds the message-passing results m_x and m_h for snapshots t-n … t.]
  21. Cached Message Passing
      [Figure: for a new snapshot, message passing runs on x_t and h_{t-1} to produce m_x^t and m_h^{t-1}.]
  22. Cached Message Passing
      [Figure: the freshly computed m_x^t and m_h^{t-1} are PUT into the cache store under snapshot t.]
  23. Cached Message Passing
      [Figure: the other gates of the cell GET the cached m_x^t and m_h^{t-1} instead of re-aggregating.]
  24. Cached Message Passing
      [Figure: in later sequences, the gates GET the cached results for overlapping snapshots, and message passing runs only for the newly arrived snapshot. A cache-store sketch follows below.]
  25. Distributed DGNN Training
      [Figure: the snapshots t = 1 … n and their input features are partitioned across machines M_1 … M_k; each machine runs layers 1 … K of the model over its partition, and a sliding window moves over the snapshot series.]
  26. DynaGraph API
      cache()            Caches caller function outputs; does nothing if already cached.
      msg_pass()         Computes intermediate message-passing results.
      update()           Computes output representations from intermediate message-passing results.
      integrate()        Integrates a GNN into a GraphRNN to create a dynamic GNN.
      stack_seq_model()  Stacks dynamic GNN layers into an encoder-decoder structure.
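A hedged sketch of how these calls might compose; the module name, signatures, and keyword arguments are assumptions for illustration, since the slide gives only the function names and their roles.

    # Hypothetical composition of the DynaGraph API calls listed above.
    import dynagraph as dg  # assumed module name

    def gnn_layer(graph, feats, weights):
        # msg_pass() computes the intermediate message-passing results once;
        # cache() memoizes them so repeated calls on the same snapshot are free.
        msgs = dg.cache(dg.msg_pass)(graph, feats)
        # update() turns the cached intermediate results into output representations.
        return dg.update(msgs, weights)

    # integrate() wraps the GNN in a GraphRNN cell; stack_seq_model() stacks such
    # cells into an encoder-decoder (sequence-to-sequence) model.
    graph_rnn = dg.integrate(gnn_layer, rnn="gru")
    model = dg.stack_seq_model(graph_rnn, num_layers=2, encoder_len=3, decoder_len=1)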
  27. Implementation & Evaluation
      • Implemented on Deep Graph Library (DGL) v0.7.
      • Evaluated on 8 machines, each with 2 NVIDIA Tesla V100 GPUs.
      • Datasets:
        § METR-LA: 207 nodes per snapshot, |F| = 2, |S| = 34K
        § PEMS-BAY: 325 nodes per snapshot, |F| = 2, |S| = 52K
        § METR-LA-Large: 0.4M nodes per snapshot, |F| = 128, |S| = 34K
        § PEMS-BAY-Large: 0.7M nodes per snapshot, |F| = 128, |S| = 52K
      • Several dynamic GNN architectures:
        § GCRN-GRU, GCRN-LSTM [ICONIP '18]
        § DCRNN [ICLR '18]
  28. DynaGraph Single-Machine Performance
      [Figure: average epoch time (s) of DGL vs. DynaGraph for DCRNN, GCRN-GRU, and GCRN-LSTM on METR-LA and PEMS-BAY; DynaGraph achieves up to a 2.31x speedup.]
  29. DynaGraph Distributed Performance
      [Figure: average epoch time (s) of DGL vs. DynaGraph for DCRNN, GCRN-GRU, and GCRN-LSTM on METR-LA-Large and PEMS-BAY-Large; DynaGraph achieves up to a 2.23x speedup.]
  30. DynaGraph Scaling
      [Figure: throughput (snapshots/sec) of DGL vs. DynaGraph for GCRN-GRU and GCRN-LSTM as the cluster grows from 2 machines (4 GPUs) to 4 (8) and 8 (16).]
  31. Summary
      • Supporting dynamic graphs is increasingly important for enabling many GNN applications.
      • Existing GNN systems mainly focus on static graphs and static GNNs.
      • Dynamic GNN architectures combine GNN techniques with temporal embedding techniques such as RNNs.
      • DynaGraph enables dynamic GNN training at scale:
        ○ several techniques to reuse intermediate results,
        ○ efficient distributed training,
        ○ and it outperforms state-of-the-art solutions.
      Thank you! Contact: [email protected]