
Parallel Numerical Verification of the σ_odd problem

A quick statement of the σ_odd problem (and of its variant, the ς_odd problem) with an algorithm to check it. Benchmarks of parallel implementations with multi-threading, Open MPI and OpenCL.

Transcript

  1. Université Libre de Bruxelles, Computer Science Department, INFO-Y100 (4004940ENR) Parallel systems, Project

    Parallel Numerical Verification of the σ_odd problem (presentation)
    Olivier Pirson, [email protected], orcid.org/0000-0001-6296-9659
    December 15, 2017 (last modifications: September 11, 2019)
    https://speakerdeck.com/opimedia/parallel-numerical-verification-of-the-s-odd-problem
  2. Outline

    1. The problem
    2. Computation: simple algorithm, better algorithm
    3. Parallel implementations: multi-threads, message-passing (Open MPI), GPU (OpenCL)
    4. Results: speedup, efficiency, overhead, benchmarks tables
  3. The σ_odd and ς_odd functions

    σ(n) = sum of all divisors of n (sigma)
    σ_odd(n) = sum of the odd divisors of n (sigma odd)

    All divisors of 18: {1, 2, 3, 6, 9, 18}. Only odd divisors: {1, 3, 9}, so σ_odd(18) = 13.
    All divisors of 19: {1, 19}. Only odd divisors: {1, 19}, so σ_odd(19) = 20.

    ς_odd(n) = σ_odd(n) divided by 2 until the result is odd (varsigma odd)
    ς_odd(18) = 13, ς_odd(19) = 5

    n          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    σ(n)       1  3  4  7  6 12  8 15 13 18 12 28 14 24 24 31 18 39 20 42
    σ_odd(n)   1  1  4  1  6  4  8  1 13  6 12  4 14  8 24  1 18 13 20  6
    ς_odd(n)   1  1  1  1  3  1  1  1 13  3  3  1  7  1  3  1  9 13  5  3
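The two definitions above can be written directly as code. This is a minimal sketch by naive divisor enumeration (the function names are chosen here for illustration, and this is not the method used for the large-scale verification); it only serves to reproduce the table.

```python
def sigma_odd(n: int) -> int:
    """Sum of the odd divisors of n, by naive enumeration of odd candidates."""
    return sum(d for d in range(1, n + 1, 2) if n % d == 0)


def odd_part(n: int) -> int:
    """Divide n by 2 until the result is odd."""
    while n % 2 == 0:
        n //= 2
    return n


def varsigma_odd(n: int) -> int:
    """sigma_odd(n) divided by 2 until the result is odd."""
    return odd_part(sigma_odd(n))


if __name__ == "__main__":
    for n in (18, 19):
        print(n, sigma_odd(n), varsigma_odd(n))  # prints "18 13 13" then "19 20 5"
```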
  4. The σ_odd problem: an iteration problem

    We iterate the ς_odd (or equivalently σ_odd) function and we observe that we always reach 1.

    [Figure: graph of the iteration on the first odd numbers. Numbers in orange are square numbers.]

    For every odd square number n ≠ 1: ς_odd(n) = σ_odd(n) > n.
    But we observe that for almost all other odd numbers n: ς_odd(n) < n.

    Note that even numbers are not interesting for this problem, because σ_odd(2n) = σ_odd(n) and ς_odd(2n) = ς_odd(n).
  5. The σ_odd problem: an iteration problem

    [Figure: graph of the iteration on the odd numbers from 1 to 1001. The point in the middle of the picture is the number 1.]
  6. The σ_odd problem is a conjecture

    Does the iteration always reach 1?
    The σ_odd problem is the conjecture that it does, whatever the starting number (integer ≥ 1).

    Successfully checked for each n up to 1.1 × 10^11 ≃ 1.6 × 2^36 with the programs developed for this work. The previously known result was 2^30.

    Moreover, n ≤ 10^11 ⟹ ς_odd^15(n) = 1 (that is, at most 15 iterations of ς_odd are needed in this range).
  7. Outline (repeated)
  8. Numerical verification by the simple direct algorithm

    For each odd number:

    Algorithm 1: first_check_varsigma_odd(first_n, last_n)

        define first_check_varsigma_odd(first_n, last_n):
            for n = first_n to last_n step 2
                lower_n, length = first_iterate_varsigma_odd_until_lower(n)
                if length > 1 then
                    print n, lower_n, length
  9. Numerical verification by the simple direct algorithm

    Simply iterate ς_odd until a smaller number is reached:

    Algorithm 2: first_iterate_varsigma_odd_until_lower(start_n)

        define first_iterate_varsigma_odd_until_lower(start_n):
            n = start_n
            length = 0
            do
                length = length + 1
                n = ς_odd(n)
                if n > MAX_POSSIBLE_N then
                    print "! Impossible to check", start_n, length, n
                    exit
            while n > start_n

            if n = start_n then
                print "! Found not trivial cycle", start_n, length
                exit

            return n, length
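Algorithms 1 and 2 can be sketched together as follows. This is an illustrative version only: it raises exceptions where the pseudocode prints and exits, it returns a list instead of printing, the value given to `MAX_POSSIBLE_N` is an assumption (the slides do not state it), and ς_odd is computed by naive enumeration.

```python
MAX_POSSIBLE_N = 2**64 - 1  # assumed limit; the actual value is not given in the slides


def varsigma_odd(n):
    """Sum of the odd divisors of n, divided by 2 until odd (naive enumeration)."""
    s = sum(d for d in range(1, n + 1, 2) if n % d == 0)
    while s % 2 == 0:
        s //= 2
    return s


def iterate_until_lower(start_n):
    """Iterate varsigma_odd from start_n >= 3 until a value below start_n is reached.

    Returns (lower value reached, number of iterations)."""
    n = start_n
    length = 0
    while True:
        length += 1
        n = varsigma_odd(n)
        if n > MAX_POSSIBLE_N:
            raise OverflowError(f"impossible to check {start_n} (step {length}: {n})")
        if n <= start_n:
            break
    if n == start_n:
        raise RuntimeError(f"non-trivial cycle found from {start_n} (length {length})")
    return n, length


def first_check(first_n, last_n):
    """Return (n, lower_n, length) for each odd n that needs more than one step."""
    exceptions = []
    for n in range(first_n, last_n + 1, 2):
        lower_n, length = iterate_until_lower(n)
        if length > 1:
            exceptions.append((n, lower_n, length))
    return exceptions
```

For instance, `iterate_until_lower(9)` follows 9 → 13 → 7 and returns `(7, 2)`.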
  10. Computation of σ_odd(n)

    Assume n odd: $n = p_1^{\alpha_1} \times p_2^{\alpha_2} \times p_3^{\alpha_3} \times \cdots \times p_k^{\alpha_k}$ with $p_i$ distinct prime numbers.

    $$\sigma_{\mathrm{odd}}(n) = \frac{p_1^{\alpha_1+1}-1}{p_1-1} \times \frac{p_2^{\alpha_2+1}-1}{p_2-1} \times \frac{p_3^{\alpha_3+1}-1}{p_3-1} \times \cdots \times \frac{p_k^{\alpha_k+1}-1}{p_k-1}$$

    Thus, to verify the conjecture we must factorize (other ways are less efficient).
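The product formula above translates into a short routine. The sketch below uses plain trial division over odd candidates (an assumption made for brevity; the slides later mention prime number tables) and accumulates each factor as the sum 1 + p + ... + p^α, which equals (p^(α+1) − 1)/(p − 1).

```python
def sigma_odd_by_factorization(n):
    """sigma_odd(n) for n >= 1, from the prime factorization of the odd part of n."""
    while n % 2 == 0:  # powers of 2 do not change the sum of odd divisors
        n //= 2
    result = 1
    p = 3
    while p * p <= n:
        if n % p == 0:
            power = 1  # current power of p
            term = 1   # 1 + p + ... + p**alpha
            while n % p == 0:
                n //= p
                power *= p
                term += power
            result *= term
        p += 2
    if n > 1:  # the remaining cofactor is a prime with exponent 1
        result *= n + 1
    return result


if __name__ == "__main__":
    print(sigma_odd_by_factorization(45))  # 45 = 3^2 * 5, prints 78 (= 13 * 6)
```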
  11. Use properties to avoid a lot of computations

    For each n, we want to check that there exists k such that σ_odd^k(n) = 1.

    It is equivalent to check that there exists k such that ς_odd^k(n) < n (for n > 1, by induction on n). That reduces the path to be computed.

    Only odd numbers must be checked (50%). Other numbers can be avoided (≃ 33% remain).

    Almost all numbers reach a smaller number in only one step! Exceptions identified before computation: square numbers. The other exceptions (called bad numbers) are very rare. So, instead of iterating, we compute only one step and keep the exceptions, which are checked separately (very fast).

    ς_odd(ab) ≤ ς_odd(a) ς_odd(b) → shortcut in the factorization, which is the heaviest work (using previously known bad numbers or a general upper bound).
  12. Transformed problem

    With these properties, we no longer need to compute the complete iteration of σ_odd (and thus the complete factorization) for each number. The algorithm becomes both improved and simpler (relative to other possible optimizations): compute only one (possibly partial) iteration of ς_odd, and only for some numbers.

    “The cheapest, fastest and most reliable components of a computer system are those that aren’t there.” (Gordon Bell)
  13. Transformed problem (progs/src/sequential/sequential/sequential.hpp)

    Algorithm 3: sequential_check_gentle_varsigma_odd(first_n, last_n)

        // Preconditions: 3 ≤ first_n odd ≤ last_n ≤ MAX_POSSIBLE_N
        define sequential_check_gentle_varsigma_odd(first_n, last_n):
            bad_table = ∅
            for n = first_n to last_n step 2
                if not (3, 7, 31 or 127 \\ n) then
                    if not (n is square number) then
                        if not sequential_is_varsigma_odd_lower(n, bad_table, first_n) then
                            bad_table = bad_table ∪ {n}
                            print n
            return bad_table
        // Postcondition:
        // If all numbers < first_n respect the conjecture
        // and all square numbers ≤ last_n respect the conjecture
        // and all odd bad numbers ≤ last_n respect the conjecture
        // then all numbers ≤ last_n respect the conjecture.
        // Print all odd bad numbers between first_n and last_n (included)
        // and return the set.

    d \ n means that d is a divisor of n.
    d \\ n means that d is a divisor of n, but d² is not.
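The test on 3, 7, 31 and 127 in Algorithm 3 can be explored with a small experiment. The reasoning given here is an interpretation, not taken from the slides: each of these primes p has p + 1 equal to a power of 2, so when p divides an odd n exactly once, ς_odd(n) = ς_odd(n / p), and n falls onto the path of the smaller number n / p. The sketch below checks this identity and estimates the share of integers that still have to be examined, which seems consistent with the "≃ 33%" quoted earlier. All function names are hypothetical.

```python
MERSENNE_PRIMES = (3, 7, 31, 127)  # for each p, p + 1 is a power of 2


def varsigma_odd(n):
    """Sum of the odd divisors of n, divided by 2 until odd (naive enumeration)."""
    s = sum(d for d in range(1, n + 1, 2) if n % d == 0)
    while s % 2 == 0:
        s //= 2
    return s


def exactly_divides(d, n):
    """d \\\\ n in the slides' notation: d divides n but d*d does not."""
    return n % d == 0 and n % (d * d) != 0


def must_check(n):
    """True if the odd number n is not skipped by the test on 3, 7, 31 and 127."""
    return not any(exactly_divides(p, n) for p in MERSENNE_PRIMES)


def kept_fraction(limit):
    """Share of the integers 1..limit that are odd and not skipped."""
    kept = sum(1 for n in range(1, limit + 1, 2) if must_check(n))
    return kept / limit


if __name__ == "__main__":
    print(varsigma_odd(3 * 25), varsigma_odd(25))  # both values are 31
    print(round(kept_fraction(100_000), 3))
```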
  14. Transformed problem

    Computes (possibly partially) ς_odd(n) by the factorization of n and returns True if and only if ς_odd(n) < n. Odd(m) denotes m divided by 2 until the result is odd.

    Algorithm 4: sequential_is_varsigma_odd_lower(n, bad_table, bad_first_n)

        // Preconditions: 3 ≤ n odd ≤ MAX_POSSIBLE_N
        //                bad_table contains all odd bad numbers
        //                between bad_first_n (included) and n (excluded)
        define sequential_is_varsigma_odd_lower(n, bad_table, bad_first_n):
            n_divided = n
            varsigma_odd = 1
            for p odd prime ≤ ⌊√n_divided⌋
                α = 0
                while p \ n_divided
                    n_divided = n_divided / p
                    α = α + 1

                if α > 0 then  // p^α is a factor of n
                    varsigma_odd = varsigma_odd * Odd((p^α − 1)/(p − 1) + p^α)
                    if (varsigma_odd
                        * sequential_sigma_odd_upper_bound(n_divided,
                                                           bad_table, bad_first_n)) < n then
                        return True

            if n_divided > 1 then  // n_divided is prime
                varsigma_odd = varsigma_odd * Odd(n_divided + 1)

            return (varsigma_odd < n)
  15. Factorization shortcut

    When we find a prime factor, it may be possible to shortcut the complete factorization. For example, with a first prime factor $p_1$ of n:

    $$n = p_1^{\alpha_1} n'$$

    $$\sigma_{\mathrm{odd}}(n) = \frac{p_1^{\alpha_1+1}-1}{p_1-1} \times \sigma_{\mathrm{odd}}(n')$$

    $$\sigma_{\mathrm{odd}}(n) \le \frac{p_1^{\alpha_1+1}-1}{p_1-1} \times \left(\text{upper bound of } \sigma_{\mathrm{odd}}(n')\right) < n\ ?$$

    If yes, then stop.

    Upper bound always true: $\sigma_{\mathrm{odd}}(n') \le 2 n' \sqrt[8]{n'}$

    It is the same for the ς_odd function, with some additional division(s) by 2.

    And if n′ is gentle (odd but neither square nor bad): ς_odd(n′) < n′ (so the shortcut is “often” possible).
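The shortcut can be sketched for ς_odd as follows. Several assumptions are made: the general bound is read as σ_odd(n′) ≤ 2 · n′ · n′^(1/8), no table of bad numbers is used, and trial division runs over odd candidates instead of a prime table. The flag `use_shortcut` is added only so that both variants can be compared.

```python
def odd_part(n):
    """Divide n by 2 until the result is odd."""
    while n % 2 == 0:
        n //= 2
    return n


def sigma_odd_upper_bound(n):
    """General upper bound of sigma_odd(n), as read from the slide (assumption)."""
    return 2 * n * n ** 0.125


def is_varsigma_odd_lower(n, use_shortcut=True):
    """True if and only if varsigma_odd(n) < n, for odd n >= 3."""
    n_divided = n
    varsigma = 1  # varsigma_odd of the part of n already factorized
    p = 3
    while p * p <= n_divided:
        if n_divided % p == 0:
            power = 1
            term = 1  # 1 + p + ... + p**alpha
            while n_divided % p == 0:
                n_divided //= p
                power *= p
                term += power
            varsigma *= odd_part(term)
            # varsigma_odd(n) <= varsigma * sigma_odd(n_divided) <= varsigma * bound
            if use_shortcut and varsigma * sigma_odd_upper_bound(n_divided) < n:
                return True
        p += 2
    if n_divided > 1:  # the remaining cofactor is prime
        varsigma *= odd_part(n_divided + 1)
    return varsigma < n
```

With the shortcut, a number such as 7 × 11 × 13 = 1001 is accepted as soon as the factor 7 is found, since Odd(8) = 1 and the bound on the cofactor 143 is already below 1001.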
  16. Stop and restart

    Note that the program can be stopped (or executed up to some last_n) and restarted from the last value checked. In fact, different ranges of numbers can be computed separately (at the same time or not).

    If all required numbers up to N are checked (with odd square numbers and bad numbers checked too, for example in the naive way, which is fast for these rare numbers), then the conclusion is: for all n ≤ N, the iteration of σ_odd (and ς_odd) from n reaches 1, which is what we wanted to achieve.
  17. Outline (repeated)
  18. Performance with one thread/process

    First, a comparison between the sequential implementation, the three multi-threading implementations and the two message-passing implementations, each with only one thread/process, checking the numbers between 1 and 20,000,001 on a personal computer with 4 cores and 2 threads per core.

    [Figure: running time in seconds (axis from 6 to 7) for each implementation. 0: sequential; one thread (1: one by one, 2: by range, 3: dynamic); one MPI process (4: one by one, 5: dynamic).]
  19. Multi-threads (thread of C++11)

    3 different implementations (progs/src/threads/threads/threads.hpp):

    One by one
      Each slave independently computes one number and sends a boolean to the master. The master also computes one number, and waits for everybody. And so forth with the next numbers. A silly implementation, just to try. Very inefficient: the barrier is a big limitation because each number has a different factorization time.

    By range
      Like one by one, but each slave receives a range of numbers (given by its extremities), computes it and returns the (very small) set of bad numbers found. The master computes a smaller range, and waits for everybody. And so forth with the next numbers. Much better because the computation is better balanced, the factorization time being averaged over each range.

    “Dynamic”
      Like by range, but the master does not wait: it gives a new range as soon as a slave is free, and also computes the rest of the time. Very good occupation of each thread (see graph in the following slides).

    All threads share the same prime number tables.
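The idea of the “dynamic” distribution can be sketched with a thread pool. This is only an illustration under several assumptions: it uses Python's `concurrent.futures` rather than C++11 threads, the pool hands the next pending range to whichever worker becomes free (there is no computing master), ς_odd is computed by full trial division without shortcut or bad-number table, and all names are hypothetical. In standard CPython the global interpreter lock prevents a real speedup for this CPU-bound work, so the sketch shows the scheduling only.

```python
from concurrent.futures import ThreadPoolExecutor
from math import isqrt


def odd_part(n):
    while n % 2 == 0:
        n //= 2
    return n


def varsigma_odd(n):
    """varsigma_odd(n) for odd n >= 1, by trial division over odd candidates."""
    result = 1
    p = 3
    while p * p <= n:
        if n % p == 0:
            power, term = 1, 1
            while n % p == 0:
                n //= p
                power *= p
                term += power  # 1 + p + ... + p**alpha
            result *= odd_part(term)
        p += 2
    if n > 1:
        result *= odd_part(n + 1)
    return result


def check_range(first_n, last_n):
    """Bad numbers among the odd n of [first_n, last_n] (first_n odd)."""
    bad = []
    for n in range(first_n, last_n + 1, 2):
        if any(n % p == 0 and n % (p * p) != 0 for p in (3, 7, 31, 127)):
            continue  # skipped, as in Algorithm 3
        if isqrt(n) ** 2 == n:
            continue  # square numbers are checked separately
        if varsigma_odd(n) >= n:
            bad.append(n)
    return bad


def split_ranges(first_n, last_n, size):
    """Consecutive ranges (by their extremities); first_n odd and size even."""
    return [(start, min(start + size - 2, last_n))
            for start in range(first_n, last_n + 1, size)]


def check_dynamic(first_n, last_n, nb_threads=4, size=1000):
    """Distribute the ranges over a pool; a free worker takes the next range."""
    ranges = split_ranges(first_n, last_n, size)
    with ThreadPoolExecutor(max_workers=nb_threads) as pool:
        results = list(pool.map(lambda r: check_range(*r), ranges))
    return [n for bad in results for n in bad]
```

Because every range is independent, the same splitting also fits the stop-and-restart idea: any subset of ranges can be computed at another time or on another machine.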
  20. Multi-threads: one by one

    [Figure: running time in seconds (axis from 0 to 80) versus number of threads (1 to 8).]
  21. Multi-threads: by range

    [Figure: running time in seconds (axis from 0 to 7) versus number of threads (1 to 8).]
  22. Multi-threads: “dynamic”

    [Figure: running time in seconds (axis from 0 to 7) versus number of threads (1 to 8).]
  23. Message-passing (Open MPI)

    2 implementations (progs/src/mpi/mpi/mpi.hpp):

    One by one
      One element, barrier. Very inefficient; just to try.

    “Dynamic”
      By range and does not wait.

    Same algorithms as for multi-threading, but information is exchanged by messages. (That could be between different machines, but these results were computed on only one computer.) The impact is small if the range is large compared to the small amount of information exchanged.

    Messages from the master to each slave: the unique number or the extremities of the range, and the new (rare) bad numbers found by other processes.

    Messages from each slave to the master: a boolean or an array of the new (rare) bad numbers found.

    Main differences with multi-threading: exchanges between processes, and each process has its own prime number table.
  24. Message-passing: “dynamic”

    [Figure: running time in seconds (axis from 0 to 7) versus number of processes (1 to 6).]
  25. GPU (OpenCL)

    Only one implementation (progs/src/opencl/opencl/opencl.hpp):

    By list of numbers
      The CPU selects a list of numbers to be checked and sends them to the GPU. The GPU completely computes ς_odd(n) for each n received (without using a list of bad numbers and without the factorization shortcut). Then the GPU returns the corresponding list of booleans to the CPU. And so forth.

    Instead of computing ς_odd(n) directly during the factorization, this implementation first collects all the prime factors of n. That makes the parallel work easier.

    The important improvement of the algorithm (the factorization shortcut) was also removed, because it did not give better results, due to the more complex branching.
  26. GPU (OpenCL): explanations of bad results

    The computation is massively parallel (if the list of numbers is big), but the efficiency is limited by the fact that the factorization process differs for each number. By the nature of computing this problem by factorization, the algorithm is more or less a random succession of conditional branches, and GPU-style parallel computation loses a lot of power on that.

    The bigger the list of numbers, the closer the computation is to ideally parallel. But the bigger the list, the more the computation of each number disturbs the progress of the others. Moreover, all the numbers that are quickly factorized wait for the end of the others.

    Also, GPUs give the best of their power on floating-point computations, and this problem is an integer problem.

    A completely different approach could be better.
  27. GPU (OpenCL): old GPU used during tests

    The poor performance of the OpenCL implementation is also due to the old GPU used: an NVIDIA Quadro FX 1800 graphics card with 768 MiB. This GPU has no cache for the global memory, and the main loop iterates on prime numbers stored in this global memory.

    A more modern GPU could use the native OpenCL function ctz (instead of a loop).

    Nevertheless, with the maximum list of numbers possible for this GPU, the OpenCL implementation has a small (disappointing) gain of performance compared to the sequential implementation.
  28. GPU (OpenCL): by list of numbers

    [Figure: running time in seconds (axis from 0 to 100) versus size of the list of numbers (100 to 100000).]
  29. Outline (repeated)
  30. Parallel Numerical Verification of the σodd problem The problem Computation

    Results

    The results were produced on a computer with only 4 cores, which explains why the gains drop off from 5 threads or processes onward.

    The Open MPI results are a little strange: for some parameters they are better than the sequential implementation, as if running the sequential program through mpirun made it faster. In theory the overhead of the MPI implementation should be larger than that of the multi-thread implementation, because of the communication between processes (although the tests were run on a single computer). The MPI implementation is almost identical to the multi-thread version and all computation results are identical, so it should be correct.

    Two possible explanations: the GCC compiler required with Open MPI may optimize this code better than the clang compiler used for the sequential and multi-thread versions, or the measurements may be slightly imprecise.

    The two best implementations (the “dynamic” algorithm with threads and with Open MPI) are both pretty close to the ideal.
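    One way to reduce the measurement imprecision mentioned above would be to repeat each run and keep the median time. The following is only a sketch of that idea using `std::chrono::steady_clock`; `median_seconds` is a hypothetical helper, and nothing here is taken from how the benchmarks were actually measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper (not from the project sources):
// runs `work` `repetitions` times and returns the median wall-clock
// duration in seconds (the upper middle value when the count is even).
template <typename Work>
double median_seconds(Work work, unsigned int repetitions) {
  assert(repetitions > 0);

  std::vector<double> durations;
  durations.reserve(repetitions);

  for (unsigned int i = 0; i < repetitions; ++i) {
    // steady_clock is monotonic, so it is not affected by clock adjustments.
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    durations.push_back(std::chrono::duration<double>(stop - start).count());
  }

  std::sort(durations.begin(), durations.end());
  return durations[durations.size() / 2];
}
```

    The median is less sensitive than the mean to an occasional slow run caused by other activity on the machine.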
  31. Speedup

    [Figure: speedup (0 to 5) as a function of the number of threads/processes (0 to 8). Curves: identity, sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic.]
  32. Speedup with OpenCL

    [Figure: speedup (0 to 5) as a function of the number of threads/processes or of the size of the list of numbers (logarithmic scale, 1 to 1×10^6). Curves: identity, sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic, OpenCL.]
  33. Efficiency

    [Figure: efficiency (0 to 1.2) as a function of the number of threads/processes (0 to 8). Curves: sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic.]
  34. Efficiency with OpenCL

    [Figure: efficiency (0 to 1) as a function of the number of threads/processes or of the size of the list of numbers (logarithmic scale). Curves: sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic, OpenCL.]
  35. Overhead

    [Figure: overhead (0 to 5000) as a function of the number of threads/processes (0 to 8). Curves: sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic.]
  36. Overhead, up to 4 cores only

    [Figure: overhead as a function of the number of threads/processes (0 to 4). Curves: sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic.]
  37. Overhead with OpenCL

    [Figure: overhead (logarithmic scale, 1 to 1×10^10) as a function of the number of threads/processes or of the size of the list of numbers (logarithmic scale, 1 to 1×10^6). Curves: sequential, threads/one by one, threads/by range, threads/dynamic, MPI/one by one, MPI/dynamic, OpenCL.]
  38. Benchmarks table: sequential, threads & message passing

    Technology  Algorithm   # threads/process  Time in s  Speedup  Efficiency    Overhead
    sequential  -                           1      6.853    1.000     1.00000       0.000
    threads     one by one                  1      6.873    0.997     0.99720      19.213
    threads     one by one                  2     57.101    0.120     0.06001  107348.026
    threads     one by one                  3     54.349    0.126     0.04203  156192.179
    threads     one by one                  4     54.395    0.126     0.03150  210726.330
    threads     one by one                  5     53.782    0.127     0.02549  262057.964
    threads     one by one                  6     72.986    0.094     0.01565  431064.690
    threads     one by one                  7     79.897    0.086     0.01225  552426.255
    threads     one by one                  8     81.665    0.084     0.01049  646469.398
    threads     by range                    1      6.858    0.999     0.99931       4.764
    threads     by range                    2      3.980    1.722     0.86105    1105.961
    threads     by range                    3      2.674    2.563     0.85420    1169.809
    threads     by range                    4      1.935    3.542     0.88541     886.991
    threads     by range                    5      2.224    3.081     0.61622    4268.383
    threads     by range                    6      1.900    3.608     0.60132    4543.861
    threads     by range                    7      1.641    4.176     0.59653    4635.448
    threads     by range                    8      1.452    4.722     0.59019    4758.860
    threads     dynamic                     1      6.862    0.999     0.99879       8.274
    threads     dynamic                     2      3.652    1.876     0.93823     451.194
    threads     dynamic                     3      2.432    2.818     0.93918     443.806
    threads     dynamic                     4      1.820    3.765     0.94116     428.429
    threads     dynamic                     5      1.676    4.090     0.81804    1524.452
    threads     dynamic                     6      1.541    4.447     0.74122    2392.667
    threads     dynamic                     7      1.427    4.804     0.68625    3133.355
    threads     dynamic                     8      1.328    5.161     0.64514    3769.762
    MPI         one by one                  1      6.385    1.073     1.07329    -467.966
    MPI         one by one                  2     13.981    0.490     0.24509   21109.499
    MPI         one by one                  3     14.496    0.473     0.15760   36633.994
    MPI         one by one                  4     14.819    0.462     0.11562   52422.147
    MPI         one by one                  5     17.613    0.389     0.07782   81212.792
    MPI         one by one                  6     17.994    0.381     0.06348  101108.177
    MPI         dynamic                     1      6.350    1.079     1.07924    -503.202
    MPI         dynamic                     2      3.373    2.032     1.01581    -106.693
    MPI         dynamic                     3      2.253    3.042     1.01410     -95.266
    MPI         dynamic                     4      1.677    4.088     1.02196    -147.274
    MPI         dynamic                     5      1.560    4.393     0.87862     946.749
    MPI         dynamic                     6      1.440    4.760     0.79339    1784.713
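    The columns of this table appear consistent with the usual definitions: speedup = T_1 / T_p, efficiency = speedup / p, and overhead = p × T_p − T_1, with the overhead expressed in milliseconds. This is inferred from the numbers (for example 2 × 3.980 s − 6.853 s ≈ 1.107 s against a reported overhead of 1105.961), not stated on the slide. A minimal C++ sketch under that assumption follows; `Metrics` and `compute_metrics` are hypothetical names, not taken from the project sources.

```cpp
#include <cassert>

// Hypothetical result type (not from the project sources).
struct Metrics {
  double speedup;      // T_1 / T_p
  double efficiency;   // speedup / p
  double overhead_ms;  // p * T_p - T_1, in milliseconds (assumed unit)
};

// sequential_s: time of the sequential run in seconds (T_1)
// parallel_s:   time of the run with p threads/processes in seconds (T_p)
// p:            number of threads/processes
Metrics compute_metrics(double sequential_s, double parallel_s, unsigned int p) {
  assert(sequential_s > 0.0 && parallel_s > 0.0 && p > 0);

  const double speedup = sequential_s / parallel_s;
  const double efficiency = speedup / p;
  const double overhead_ms = (p * parallel_s - sequential_s) * 1000.0;

  return Metrics{speedup, efficiency, overhead_ms};
}
```

    For instance, `compute_metrics(6.853, 1.820, 4)` gives a speedup of about 3.765, in line with the threads/dynamic row for 4 threads. The efficiency and overhead it returns differ slightly from the table, presumably because the times shown are rounded to the millisecond.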
  39. Benchmarks table: OpenCL

    Size list  Time in s  Speedup  Efficiency        Overhead
           64    119.918    0.057     0.00089     7667879.541
          128     65.360    0.105     0.00082     8359216.723
          256     36.547    0.188     0.00073     9349189.863
          512     18.983    0.361     0.00071     9712571.781
         1024     10.184    0.673     0.00066    10422012.896
         2048      9.033    0.759     0.00037    18493143.642
         4096      8.203    0.835     0.00020    33593316.851
         8192      7.407    0.925     0.00011    60668647.350
        16384      6.490    1.056     0.00006   106333100.097
        32768      5.589    1.226     0.00004   183148027.592
        65536      5.141    1.333     0.00002   336897704.640
       131072      5.208    1.316     0.00001   682584188.174
       262144      4.885    1.403     0.00001  1280678502.121
  40. Outline

    1 The problem
    2 Computation: Simple algorithm, Better algorithm
    3 Parallel implementations: Multi-threads, Message-passing (Open MPI), GPU (OpenCL)
    4 Results: Speedup, Efficiency, Overhead, Benchmarks tables
  41. The end

    All results, documents, C++/OpenCL and LaTeX sources, and references are available on Bitbucket: https://bitbucket.org/OPiMedia/parallel-sigma_odd-problem