for (i = 0; i < 1024; i += block_size) {
  for (j = 0; j < 1024; j += block_size) {
    for (k = 0; k < 1024; k += block_size) {
      for (ii = i; ii < i + block_size; ++ii) {
        for (jj = j; jj < j + block_size; ++jj) {
          for (kk = k; kk < k + block_size; ++kk) {

elapsed time = 0.172988 sec, 12.414061 GFLOPS

About 35x faster than the naive implementation! (The compile options differ, though, so this is not a direct comparison.)

Compile options: -O3 -mavx2 -std=c11 -Wall -Wextra
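The blocked loop nest above is shown without its innermost body. As a rough sketch of what a complete cache-blocked multiply could look like, the version below assumes row-major `double` matrices and a block size that divides the matrix dimension evenly. The names `matmul_blocked`, `gflops`, `a`, `b`, `c`, and `n` are hypothetical and do not come from the original, and the element type is a guess. The `gflops` helper assumes the usual count of 2n^3 floating-point operations, which is consistent with the reported figure: 2 * 1024^3 / 0.172988 s is about 12.41 GFLOPS.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: c += a * b for row-major n x n matrices.
 * Assumptions (not from the original): double elements, c zeroed by the
 * caller, and block_size dividing n evenly. */
void matmul_blocked(const double *a, const double *b, double *c,
                    int n, int block_size)
{
    assert(block_size > 0 && n % block_size == 0);

    /* Outer three loops walk over blocks (tiles) of the matrices. */
    for (int i = 0; i < n; i += block_size) {
        for (int j = 0; j < n; j += block_size) {
            for (int k = 0; k < n; k += block_size) {
                /* Inner three loops multiply one pair of tiles. */
                for (int ii = i; ii < i + block_size; ++ii) {
                    for (int jj = j; jj < j + block_size; ++jj) {
                        double sum = c[(size_t)ii * n + jj];
                        for (int kk = k; kk < k + block_size; ++kk) {
                            sum += a[(size_t)ii * n + kk] * b[(size_t)kk * n + jj];
                        }
                        c[(size_t)ii * n + jj] = sum;
                    }
                }
            }
        }
    }
}

/* Throughput in GFLOPS, assuming 2 * n^3 floating-point operations. */
double gflops(int n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}
```

To reproduce a measurement like the one above, one would time a call such as `matmul_blocked(a, b, c, 1024, block_size)` and pass the elapsed seconds to `gflops(1024, elapsed)`. The best `block_size` depends on the cache sizes of the target machine and is worth tuning experimentally.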
Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65–76. DOI:https://doi.org/10.1145/1498765.1498785
• Software Engineering Advice from Building Large-Scale Distributed Systems http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
• コンピュータアーキテクチャ技術入門 (Introduction to Computer Architecture Technology) https://gihyo.jp/book/2014/978-4-7741-6426-7
• パタヘネ5版 (Patterson & Hennessy, 5th edition, Japanese translation) https://www.amazon.co.jp/dp/4822298426/
• スーパーコンピュータ (岩波講座 計算科学 別巻) (Supercomputers; Iwanami Lecture Series on Computational Science, supplementary volume) https://www.amazon.co.jp/dp/4000113070/
• High Performance Computing: Modern Systems and Practices https://www.amazon.co.jp/dp/B077NZ4SW3/