Who owns what? Graph theory application @ PyCon MY 2015

Finding a company's indirect ownership can be tricky. See how we use graph theory and linear algebra to solve a real-world problem.

Shulhi Sapli

August 23, 2015

Transcript

  1. 5.

    Company         Shareholder     Share (%)
    Foo Sdn. Bhd    A               50
    Foo Sdn. Bhd    Bar Sdn. Bhd    50
    Bar Sdn. Bhd    B               50
    Bar Sdn. Bhd    C               50
  2. 6.

    [Diagram: ownership graph with nodes Foo, Bar, A, B, C and four edges labelled 50%. Legend: effective ownership, own-a relationship.]
  3. 7.

    [Diagram: ownership graph with nodes Foo, Bar, A, B, C, four edges labelled 50% and two effective-ownership edges labelled 25%.]
  4. 8.

    [Diagram: ownership graph with nodes Foo, Bar, A, B, C, four edges labelled 50% and two effective-ownership edges labelled 25%.]
  5. 10.

    Graph traversal
    1. Simple for linear relationship. Remember our first example? That’s linear.
    [Diagram: nodes Foo, Bar, A, B, C with four edges labelled 50%.]
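The traversal for the linear case can be sketched as follows. This is a minimal illustration, not the speaker's code: the `HOLDINGS` dictionary and the function name are assumptions, and the stakes are the 50% values from the first example.

```python
# Direct holdings: shareholder -> {company: fraction owned}.
# Values mirror the first example (every stake is 50%).
HOLDINGS = {
    "A": {"Foo": 0.5},
    "Bar": {"Foo": 0.5},
    "B": {"Bar": 0.5},
    "C": {"Bar": 0.5},
}


def effective_ownership(owner, target, holdings):
    """Sum the products of shares along every path owner -> target.

    Only valid when the ownership graph has no cycles.
    """
    total = 0.0
    for company, share in holdings.get(owner, {}).items():
        if company == target:
            total += share
        # Follow the chain further: `company` may itself hold `target`.
        total += share * effective_ownership(company, target, holdings)
    return total


print(effective_ownership("B", "Foo", HOLDINGS))  # 0.25
```

B reaches Foo only through Bar, so its effective stake is 0.5 x 0.5 = 0.25, matching the 25% shown on the earlier slides.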
  6. 11.
  7. 13.
  8. 14.

    Correct answer:
    A -> Bar = 0.20481928
    C -> Foo = 0.19879518
    B -> Foo = 0.19879518
  9. 15.

    Graph traversal
    3. A cyclic relationship needs to be converted into a geometric series formula in order to be calculated correctly.
    [Diagram: nodes Foo, Bar, A, B, C with edges labelled 33%, 34%, 50%, 33%, 50%.]
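A small sketch of the geometric series idea. The transcript does not say which edge carries which label, so the assignment below is an assumption: A and Bar each own 50% of Foo, Foo owns 34% of Bar, and B and C each own 33% of Bar. That assignment reproduces the "correct answer" figures quoted on slide 14. All names are illustrative.

```python
# Assumed assignment of the slide's edge labels (not stated in the transcript).
A_IN_FOO = 0.50    # A owns 50% of Foo
BAR_IN_FOO = 0.50  # Bar owns 50% of Foo
FOO_IN_BAR = 0.34  # Foo owns 34% of Bar
C_IN_BAR = 0.33    # C owns 33% of Bar

# One trip around the Foo -> Bar -> Foo cycle scales a stake by this ratio.
r = FOO_IN_BAR * BAR_IN_FOO


def truncated(first_term, ratio, laps):
    """Walk the cycle `laps` times, adding each lap's contribution."""
    return sum(first_term * ratio**k for k in range(laps))


def closed_form(first_term, ratio):
    """Geometric series limit: a + a*r + a*r^2 + ... = a / (1 - r), |r| < 1."""
    return first_term / (1.0 - ratio)


a_to_bar = closed_form(A_IN_FOO * FOO_IN_BAR, r)
c_to_foo = closed_form(C_IN_BAR * BAR_IN_FOO, r)
print(round(a_to_bar, 8), round(c_to_foo, 8))  # 0.20481928 0.19879518
```

A naive traversal would have to loop around the cycle forever (or stop early and under-count); the closed form gives the limit in one step.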
  10. 16.

    Graph traversal
    4. Lots of cycles in real data :\
    5. Lots of tracking to be done.
  11. 17.

    Graph traversal
    5. From our runs, it took more than a week to calculate all the results.
  12. 19.

    Matrix: Revolutions
    1. Model the companies/shareholders relationship as an adjacency matrix.
    2. Feed it into the given equation.
    3. Calculate the inverse.
    4. Profit!
  13. 20.
  14. 21.

    Equation
    where I is the identity matrix and A is the adjacency matrix corresponding to the relationship between companies.
  15. 22.

    Adjacency Matrix

    A =
             A     C     B     Foo   Bar
      A      0     0     0     0.5   0
      C      0     0     0     0     0.5
      B      0     0     0     0     0.5
      Foo    0     0     0     0     0
      Bar    0     0     0     0.5   0
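A hedged NumPy sketch of the matrix steps applied to this adjacency matrix. The equation from the talk is not reproduced in the transcript, so the formula used here is an assumption: the identity A + A^2 + A^3 + ... = (I - A)^-1 - I, which sums share products over paths of every length. It may differ from the exact form in the talk or in the referenced paper. The names `names`, `idx` and `effective` are illustrative.

```python
import numpy as np

# Row = shareholder, column = company, in the order used on the slide.
names = ["A", "C", "B", "Foo", "Bar"]
A = np.array([
    [0, 0, 0, 0.5, 0],    # A owns 50% of Foo
    [0, 0, 0, 0, 0.5],    # C owns 50% of Bar
    [0, 0, 0, 0, 0.5],    # B owns 50% of Bar
    [0, 0, 0, 0, 0],      # Foo owns nothing
    [0, 0, 0, 0.5, 0],    # Bar owns 50% of Foo
])

I = np.eye(len(names))
# Assumed formula: direct + indirect stakes = A + A^2 + ... = (I - A)^-1 - I.
effective = np.linalg.inv(I - A) - I

idx = {name: i for i, name in enumerate(names)}
print(effective[idx["B"], idx["Foo"]])  # 0.25, up to floating-point rounding
```

One inversion yields every shareholder-to-company stake at once, and cycles need no special handling as long as I - A is invertible.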
  16. 23.
  17. 26.

    CPU Utilization
    1. OpenBLAS - Multicore Numpy
    • Numpy - Makes use of BLAS
    • BLAS - Low level linear algebra
  18. 27.
  19. 28.

    CPU Utilization
    2. Python multi-thread/process does not work for our use case.
    • Multi-thread - no true parallelization.
    • The bottleneck is CPU bound, not I/O bound.
    • Multi-process - a single process already consumes lots of CPU. Lots of context switching.
  20. 29.

    Memory usage
    1. Iterative vs recursive algorithm
    • Stack frame
    • No support for tail-call optimization
    2. del keyword in Python. Manual management of object reference count.
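To illustrate both points, here is a sketch of a path-product traversal that keeps its own stack instead of recursing, followed by a `del`. The function name and the synthetic `chain` data are hypothetical; the talk does not show its implementation.

```python
import sys


def effective_ownership_iterative(owner, target, holdings):
    """Path-product sum using an explicit stack instead of recursion.

    Assumes an acyclic ownership graph.
    """
    total = 0.0
    stack = [(owner, 1.0)]  # (node, product of shares from owner to node)
    while stack:
        node, weight = stack.pop()
        for company, share in holdings.get(node, {}).items():
            if company == target:
                total += weight * share
            else:
                stack.append((company, weight * share))
    return total


# A hypothetical chain far deeper than the default recursion limit.
depth = sys.getrecursionlimit() * 5
chain = {f"n{i}": {f"n{i + 1}": 1.0} for i in range(depth)}
result = effective_ownership_iterative("n0", f"n{depth}", chain)
print(result)  # 1.0

# `del` removes this name's reference; if it was the last one,
# CPython can reclaim the dict immediately.
del chain
```

A recursive version would need one Python stack frame per hop and would hit the recursion limit on a chain this long, since Python does not optimize tail calls.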
  21. 31.
  22. 32.

    Numpy quirks
    File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/decomp_svd.py", line 103, in svd
        raise LinAlgError("SVD did not converge")
    numpy.linalg.linalg.LinAlgError: SVD did not converge
    • SVD does not converge.
    • The Moore-Penrose pseudo-inverse makes use of SVD. By definition, you can always find an SVD.
    • Numpy has a low iteration limit hard-coded into its source code.
    • It will raise "SVD did not converge" if it fails to converge within this iteration limit.
    • Refer to the file dlapack_lite.c
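One defensive pattern, sketched under the assumption that the caller would rather get `None` than crash mid-run. The helper name `try_pinv` is hypothetical; `np.linalg.pinv` and `np.linalg.LinAlgError` are the real NumPy names.

```python
import numpy as np


def try_pinv(matrix):
    """Return the pseudo-inverse, or None if NumPy raises LinAlgError."""
    try:
        return np.linalg.pinv(matrix)
    except np.linalg.LinAlgError as exc:
        # This is the exception type shown in the traceback on the slide.
        print(f"pseudo-inverse failed: {exc}")
        return None


singular = np.array([[1.0, 2.0], [2.0, 4.0]])  # rank 1, so no true inverse
pinv = try_pinv(singular)
```

For a well-behaved input like `singular` the pseudo-inverse is returned normally; the `except` branch only matters for the matrices that trip the iteration limit.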
  23. 33.

    Data structures
    • Know when to use Set vs List
    • Lookup: Set O(1) vs List O(n)
    • Numpy matrices format - sparse vs dense
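A quick way to see the lookup difference is to time a membership test that misses. The ids below are made-up integers, and the timings vary by machine, so no specific numbers are claimed.

```python
import timeit

ids = list(range(100_000))      # hypothetical company ids
ids_set = set(ids)
missing = -1                    # worst case for the list: scans every element

list_time = timeit.timeit(lambda: missing in ids, number=200)
set_time = timeit.timeit(lambda: missing in ids_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The list has to compare against all 100,000 elements per lookup, while the set does a single hash probe on average.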
  24. 34.

    Algorithm
    1. Re-frame the problem
    2. Matrix inverse is always hard and CPU intensive
    • If we can’t invent an algorithm that does the calculation in O(1), try to limit the n
    • Because matrix inversion becomes slower as n becomes larger
  25. 35.

    Algorithm - Limiting the n
    • Reducing memory usage
    • Reducing CPU utilization
    A = n x n matrix, where n depends on the total number of companies/shareholders.
    Assuming n is 50,000: 50,000 x 50,000 x 8 bytes = 20 GB (160 Gb, i.e. gigabits) of memory just to hold the data.
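The arithmetic is easy to check. This sketch assumes 8-byte (float64) entries and a fully dense matrix; the function name is illustrative.

```python
def dense_matrix_bytes(n, bytes_per_value=8):
    """Memory needed for an n x n dense matrix of float64 values."""
    return n * n * bytes_per_value


size = dense_matrix_bytes(50_000)
print(size / 1e9, "GB")        # 20.0 GB
print(size * 8 / 1e9, "Gbit")  # 160.0 Gbit
```

Because the cost grows with n squared, halving n cuts the dense storage to a quarter.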
  26. 37.
  27. 38.

    Algorithm - Limiting the n
    [Diagram: graph split into annotated regions.]
    • Regions with lots of cyclic nodes: matrix approach
    • Elsewhere: use the normal approach for each connected component, since linear multiplication between nodes is trivial in CPU cost
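One way to limit n is to split the ownership graph into independent pieces first, so that each piece gets its own much smaller matrix (or a plain traversal if it has no cycles). The sketch below groups nodes into weakly connected components; whether the talk split the data exactly this way is an assumption, and the edge list is made up.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for owner, company in edges:
        neighbours[owner].add(company)
        neighbours[company].add(owner)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for other in neighbours[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(component)
    return components


# Hypothetical data: two unrelated ownership groups.
edges = [("A", "Foo"), ("Bar", "Foo"), ("B", "Bar"), ("X", "Baz")]
groups = connected_components(edges)
print(sorted(sorted(g) for g in groups))
# [['A', 'B', 'Bar', 'Foo'], ['Baz', 'X']]
```

Here one 6 x 6 problem becomes a 4 x 4 and a 2 x 2 problem; on real data the saving comes from never building the full n x n matrix.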
  28. 39.

    Result
    • Memory usage: ~< 30Gb vs > 120Gb for our early trials & errors
    • Calculation run time: ~< 3 hours vs > 1 week for the graph traversal approach
  29. 40.

    Reference
    Dr. Ivan Keglević, Matrix Approach to the Calculation of Indirect Quotas, http://web.math.pmf.unizg.hr/applmath99/245-251.pdf