Who owns what? Graph theory application @ PyCon MY 2015

Finding companies' indirect ownership can be tricky. See how we use graph theory and linear algebra to solve a real-world problem.

Shulhi Sapli

August 23, 2015

Transcript

  1. Company         Shareholder     Share (%)
     Foo Sdn. Bhd    A               50
     Foo Sdn. Bhd    Bar Sdn. Bhd    50
     Bar Sdn. Bhd    B               50
     Bar Sdn. Bhd    C               50
  2. [Diagram: ownership graph of A, B, C, Foo and Bar with four edges labelled 50% each. Legend: effective ownership, own-a relationship.]
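  3. [Diagram: the same graph with four own-a edges labelled 50% and two effective-ownership edges labelled 25%. B and C each own 50% of Bar, which owns 50% of Foo, so each effectively owns 0.5 x 0.5 = 25% of Foo.]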
  4. [Diagram repeated from the previous slide: four own-a edges labelled 50% and two effective-ownership edges labelled 25%.]
  5. Graph traversal
     1. Simple for linear relationships. Remember our first example? That’s linear.
     [Diagram: A, B, C, Foo and Bar with four edges labelled 50%.]
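A minimal sketch of what traversal looks like for the linear case, using the shares from the first example. The `DIRECT` dictionary and the function name are hypothetical, chosen only for illustration; the talk does not show the actual implementation. The idea is to multiply shares along each path and sum over all paths.

```python
# Direct ownership from the first example: owner -> {company: share}.
# Hypothetical data layout, not taken from the talk.
DIRECT = {
    "A": {"Foo": 0.5},
    "Bar": {"Foo": 0.5},
    "B": {"Bar": 0.5},
    "C": {"Bar": 0.5},
}


def effective_ownership(owner, target, direct=DIRECT):
    """Sum the product of shares along every path from owner to target.

    Only valid when the ownership graph has no cycles; a cycle would make
    this recursion loop until Python raises RecursionError.
    """
    total = 0.0
    for company, share in direct.get(owner, {}).items():
        if company == target:
            total += share
        # Keep walking: `company` may itself own part of `target`.
        total += share * effective_ownership(company, target, direct)
    return total


print(effective_ownership("B", "Foo"))  # 0.25
```

This only works because the example has no cycles, which is exactly the limitation the following slides run into.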
  6. Correct answer:
     A -> Bar = 0.20481928
     C -> Foo = 0.19879518
     B -> Foo = 0.19879518
  7. Graph traversal
     3. A cyclic relationship needs to be converted into a geometric series formula in order to be calculated correctly.
     [Diagram: A, B, C, Foo and Bar with edges labelled 33%, 34%, 50%, 33% and 50%.]
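The geometric series step can be sketched as follows. The transcript does not say which percentage belongs to which edge, so the assignment below is an assumption: A owns 50% of Foo, Bar owns 50% of Foo, Foo owns 34% of Bar, and B and C each own 33% of Bar. Under that assumption the closed form reproduces the values given on the "Correct answer" slide.

```python
# Assumed edge weights (inferred from the percentages, not stated in the talk).
A_FOO, BAR_FOO, FOO_BAR, B_BAR, C_BAR = 0.50, 0.50, 0.34, 0.33, 0.33

# One full trip around the Foo -> Bar -> Foo cycle multiplies a stake by r.
r = FOO_BAR * BAR_FOO

# Geometric series 1 + r + r^2 + ... = 1 / (1 - r), valid because |r| < 1.
cycle_factor = 1.0 / (1.0 - r)

a_to_bar = A_FOO * FOO_BAR * cycle_factor
b_to_foo = B_BAR * BAR_FOO * cycle_factor
c_to_foo = C_BAR * BAR_FOO * cycle_factor

# Summing the first terms of the series by hand converges to the same value.
partial_a_to_bar = sum(A_FOO * FOO_BAR * r**k for k in range(50))

print(round(a_to_bar, 8), round(b_to_foo, 8), round(c_to_foo, 8))
```

Each extra lap around the cycle adds a smaller contribution, which is why a plain traversal never terminates while the closed form does.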
  8. Graph traversal
     4. Lots of cycles in real data :\
     5. Lots of tracking to be done.
  9. Graph traversal
     5. In our runs, it took more than a week to calculate all the results.
  10. Matrix: Revolutions
      1. Model the companies/shareholders’ relationships as an adjacency matrix.
      2. Feed it into the given equation.
      3. Calculate the inverse.
      4. Profit!
  11. Equation
      [Equation: image not captured in this transcript]
      where I is the identity matrix and A is the adjacency matrix corresponding to the relationships between companies.
  12. Adjacency Matrix
      A =
               A    C    B    Foo  Bar
          A    0    0    0    0.5  0
          C    0    0    0    0    0.5
          B    0    0    0    0    0.5
          Foo  0    0    0    0    0
          Bar  0    0    0    0.5  0
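A sketch of the matrix approach applied to the adjacency matrix above. The equation itself did not survive in the transcript, so the formula used here is an assumption: total (direct plus indirect) ownership is taken as the series A + A^2 + A^3 + ..., whose closed form is (I - A)^-1 - I. Rows are owners and columns are the companies owned; the helper `owns` is hypothetical.

```python
import numpy as np

names = ["A", "C", "B", "Foo", "Bar"]

# Rows are owners, columns are the companies they own (values from the slide).
A = np.array([
    [0, 0, 0, 0.5, 0],    # A owns 50% of Foo
    [0, 0, 0, 0, 0.5],    # C owns 50% of Bar
    [0, 0, 0, 0, 0.5],    # B owns 50% of Bar
    [0, 0, 0, 0, 0],      # Foo owns nothing
    [0, 0, 0, 0.5, 0],    # Bar owns 50% of Foo
])
I = np.eye(len(names))

# Assumed formula (not confirmed by the transcript):
# A + A^2 + A^3 + ... = (I - A)^-1 - I
total = np.linalg.inv(I - A) - I


def owns(owner, company):
    """Look up total ownership of `company` by `owner`."""
    return total[names.index(owner), names.index(company)]


print(owns("B", "Foo"), owns("C", "Foo"), owns("A", "Foo"))
```

Under this assumption B and C each come out at 25% of Foo, matching the effective ownership shown in the earlier diagram, and the same code handles cycles without any special casing.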
  13. CPU Utilization
      1. OpenBLAS - Multicore Numpy
      • Numpy - Makes use of BLAS
      • BLAS - Low-level linear algebra
  14. CPU Utilization
      2. Python multi-thread/process does not work for our use case.
      • Multi-thread - no true parallelization.
      • The bottleneck is CPU-bound, not I/O-bound.
      • Multi-process - a single process already consumes lots of CPU. Lots of context switching.
  15. Memory usage
      1. Iterative vs recursive algorithm
      • Stack frame
      • No support for tail-call optimization
      2. del keyword in Python. Manual management of object reference count.
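To illustrate the iterative versus recursive point, here is a sketch of a traversal that keeps its own stack instead of relying on Python call frames. The function name and the `direct` data layout are hypothetical. It assumes an acyclic ownership graph, and the long chain is made-up data that is deeper than Python's default recursion limit (usually 1000).

```python
def effective_ownership_iterative(owner, target, direct):
    """Path-product sum on an acyclic ownership graph, without recursion."""
    total = 0.0
    stack = [(owner, 1.0)]  # (current node, product of shares so far)
    while stack:
        node, weight = stack.pop()
        for company, share in direct.get(node, {}).items():
            if company == target:
                total += weight * share
            # Continue below `company` with the accumulated product.
            stack.append((company, weight * share))
    return total


# Made-up chain: c0 owns 100% of c1, c1 owns 100% of c2, and so on.
DEPTH = 5000
chain = {f"c{i}": {f"c{i + 1}": 1.0} for i in range(DEPTH)}

print(effective_ownership_iterative("c0", f"c{DEPTH}", chain))  # 1.0
```

A recursive version of the same walk would need one stack frame per level of the chain, which is where the memory cost comes from.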
  16. Numpy quirks
      File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/decomp_svd.py", line 103, in svd
          raise LinAlgError("SVD did not converge")
      numpy.linalg.linalg.LinAlgError: SVD did not converge
      • SVD does not converge.
      • The Moore-Penrose pseudo-inverse makes use of SVD. By definition, you can always find an SVD.
      • Numpy has a low iteration limit hard-coded into its source code.
      • It will raise "SVD did not converge" if it fails to converge within this iteration limit.
      • Refer to the file dlapack_lite.c
  17. Data structures
      • Know when to use Set vs List
      • Lookup: Set O(1) vs List O(n)
      • Numpy matrices format - sparse vs dense
  18. Algorithm
      1. Re-frame the problem
      2. Matrix inversion is always hard and CPU intensive
      • If we can’t invent an algorithm that does the calculation in O(1), try to limit the n
      • Because matrix inversion becomes slower as n becomes larger
  19. Algorithm - Limiting the n
      • Reducing memory usage
      • Reducing CPU utilization
      A = n x n matrix, where n depends on the total number of companies/shareholders.
      Assuming n is 50,000: 50,000 x 50,000 x 8 bytes = 20 GB (160 Gbit) of memory just to hold the data.
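The size estimate above can be checked with a few lines. This is only arithmetic on a dense n x n matrix of 8-byte floats; the helper name is hypothetical. The result is 20 gigabytes, which is the same quantity as 160 gigabits.

```python
import numpy as np


def dense_matrix_bytes(n, dtype=np.float64):
    """Bytes needed to hold a dense n x n matrix of the given dtype."""
    return n * n * np.dtype(dtype).itemsize


n = 50_000
size = dense_matrix_bytes(n)

print(size / 1e9)      # 20.0 (gigabytes)
print(size * 8 / 1e9)  # 160.0 (gigabits)
```

Because the cost grows with n squared, halving n cuts the memory needed to a quarter, which is the motivation for limiting n.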
  20. Algorithm - Limiting the n
      [Diagram: graph with bracketed groups of nodes]
      • Groups with lots of cyclic nodes: matrix approach
      • Use the normal approach for each connected component, since linear multiplication between nodes is trivial in CPU cost
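One way to limit n, sketched under assumptions: split the edge list into connected components and build one small matrix per component instead of a single matrix over every company. The talk does not show its implementation, so the union-find grouping, the function names, the edge data and the formula (I - A)^-1 - I are all illustrative assumptions. For simplicity this sketch applies the matrix to every component, whereas the slide reserves it for the cyclic parts.

```python
import numpy as np


def connected_components(edges):
    """Group nodes into weakly connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for owner, company, _share in edges:
        root_owner, root_company = find(owner), find(company)
        parent[root_owner] = root_company

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return [sorted(group) for group in groups.values()]


def total_ownership(edges):
    """Solve one small matrix per component instead of one huge matrix."""
    result = {}
    for nodes in connected_components(edges):
        index = {name: i for i, name in enumerate(nodes)}
        size = len(nodes)
        A = np.zeros((size, size))
        for owner, company, share in edges:
            if owner in index:  # the edge belongs to this component
                A[index[owner], index[company]] = share
        # Assumed formula: A + A^2 + ... = (I - A)^-1 - I
        total = np.linalg.inv(np.eye(size) - A) - np.eye(size)
        for owner in nodes:
            for company in nodes:
                value = total[index[owner], index[company]]
                if value > 1e-12:
                    result[(owner, company)] = float(value)
    return result


# Hypothetical data: the first example plus an unrelated pair X -> Y.
EDGES = [
    ("A", "Foo", 0.5), ("Bar", "Foo", 0.5),
    ("B", "Bar", 0.5), ("C", "Bar", 0.5),
    ("X", "Y", 0.8),
]
ownership = total_ownership(EDGES)
print(len(connected_components(EDGES)))  # 2
```

Here the largest matrix is 5 x 5 rather than 7 x 7; on real data the saving depends on how the ownership graph happens to split.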
  21. Result
      Memory usage: ~< 30Gb vs > 120Gb for our early trials & errors
      Run time: ~< 3 hours vs > 1 week for the graph traversal approach
  22. Reference
      Dr. Ivan Keglević, Matrix Approach to the Calculation of Indirect Quotas, http://web.math.pmf.unizg.hr/applmath99/245-251.pdf