Why_should_I_learn_OS_and_CA.pdf

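Why Learn Computer Architecture and Operating Systems?
SITCON 2022, William Mou

Coach, I just want to write code
Across most university computer science departments, apart from the first-year programming course, far more time goes into studying how computers work. Students who are still beginners have no idea why they are learning this material. For a student who dreams of writing code and building software, what do computer architecture and operating systems actually mean? Are they really just subjects to memorize along the way in exchange for GPA?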

A rigorous experiment states its environment first
OS: Ubuntu 20.04
CPU: AMD EPYC 7742 64-Core Processor * 2
L1d cache size: 2 MiB (32 KiB / core)
L1i cache size: 2 MiB (32 KiB / core)
L2 cache size: 32 MiB (512 KiB / core)
L3 cache size: 256 MiB (4 MiB / core)
Capabilities: sse sse2 ht ssse3 fma cx16 sse4_1 sse4_2 avx avx2
Memory: DIMM DDR4 Synchronous Registered (Buffered) 3200 MHz (0.3 ns)
Compiler: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

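Your time is not my time
real 0m0.184s
user 0m0.120s
sys 0m0.064s
Wall time: the human's point of view. Press the stopwatch when the program starts and release it when the program ends.
CPU time: the program's point of view. How long the program actually ran on the CPU.
When it comes to timing a C++ program, which methods come to mind?
• The Linux time command
• time() from <time.h>
• clock() from <time.h>
• high_resolution_clock from <chrono>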

What do you still remember after the exam?
Computer Organization and Architecture is the study of the internal working, structuring, and implementation of a computer system. Architecture in a computer system, the same as anywhere else, refers to the externally visible attributes of the system. Organization of a computer system is the way of practical implementation that results in the realization of the architectural specifications of a computer system.
More Information: 2022 Fahule Company

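Study with Three Cases
• Branch Prediction: wasting time sorting an array, yet the program runs faster?
• Cache Locality: unpacking loops! Does iterating over i first or j first really make this much difference?
• Vector Processing: SIMD, getting the full value out of your CPU!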

Case 1: Branch Prediction
Sorted array vs. unsorted array
In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively.
Wasting time sorting an array, yet the program runs faster?

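Preface
When writing a program, estimating time complexity is a very important step for making sure the program runs fast enough to meet its requirements. Sorting an array, for example, is an expensive operation, and using it less usually cuts execution time substantially. Yet a curious question appeared on Stack Overflow: "Why does adding one line of sorting to the code make the whole program faster?"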

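Find the numbers in an array that are at least 128 and add them up.
1. Create an array and fill it with random numbers
2. Sort the array with std::sort
3. Use an if to test the condition and, if it holds, add the number to the sum
Step 2, the sort, has nothing to do with the program's logic, so let's see how keeping or removing it affects performance.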

The effect of sorting first
Unit: seconds            Without sorting   With sorting
Loop time                20.779            8.089
Whole-program time       20.783            8.096
Compiled with: g++ -o case1.out case1.cpp

The effect of sorting first
(Bar chart: without sorting vs. with sorting, axis from 0 to 22 seconds.)
About 2.5 times faster!

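Case 1: Branch Prediction
Picture the 1800s, before modern communication. You stand at the lever that switches the railway track, and you have no idea whether the next train wants to go left or right. So every time, you have to ask the driver to slow down and tell you the destination, and only then can you set the switch.
This is clearly inefficient: whatever its destination, every train has to slow down to talk to you.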

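Case 1: Branch Prediction
A better idea: let's guess the next train's destination!
• If you guess right, the train passes straight through.
• If you guess wrong, the train has to stop, talk to you, and change direction.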

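Case 1: Branch Prediction
So if you always guess right, trains never need to stop, and a great deal of time is saved. If you often guess wrong, the delays pile up. That is the basic idea behind branch prediction!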

CPU Pipeline
The real reason behind branch prediction
What really happens:
(Figure: pipeline diagram, axes Time and Order.)

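(Figure: laundry pipeline diagram, axes Time and Order.)
What if the schedule changes, so that we can only decide whose clothes go into the next basket after the current load has finished washing?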

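Let's look at the code
(Pipeline diagram, axes Time and Order. Each row shows the stages one instruction passes through.)
8    MOV EAX …      F D X M W
16   CMP EAX 127    F D X M W
24   JLE for        F D NOP NOP NOP
32   MOV …          F NOP NOP NOP NOP
40   MOV …          F D X M W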

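What the CPU faces…
• If the CPU guesses right, it keeps executing the program without interruption.
• If the CPU guesses wrong, it has to spend extra time rolling back and starting over.
The CPU bets…
A modern CPU is the one holding the lever. Whenever the program contains an if statement, it is forced into a two-way game of chance.
cmp: first compare the two operands to see which is larger.
jle: if the left operand is ≤ 127, jump past line 25.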

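The answer revealed
How does the CPU guess the correct branch?
Ans: learn from the past to understand the present, using history as a mirror.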

Performance analysis: $ perf stat
Case1-without-sort.out: 9839M branches, 1615M branch-misses, time: 20.83 seconds
Case1.out: 9836M branches, 0.53M branch-misses, time: 8.13 seconds
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. Perf is based on the perf_events interface exported by recent versions of the Linux kernel.

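Problem recap
1. Create an array and fill it with random numbers
2. Sort the array with `std::sort`
3. Use an if to test the condition and, if it holds, add the number to the sum
Step 2, the sort, has nothing to do with the program's logic, so let's see how keeping or removing it affects performance.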

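Case 1: Branch Prediction
To avoid the extra cost of branch mispredictions, adding one line of sorting before the for loop really does improve performance substantially.
Case 1 shows that knowledge of computer architecture has a very real effect on the performance of high-level languages. A single simple line of code can improve your program's performance, just as a better algorithm can.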

Case 2: Cache Locality
i outside vs. j outside
Frequently used cases need to be faster: programs often spend most of their time in a few core functions, and these functions in turn consist mostly of loops. So these loops should be designed to have good locality.
Unpacking loops! Does iterating over i first or j first really make this much difference?

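Data commentary: memory speed is becoming a bottleneck of the overall performance of a computer.
The performance gap between the processor (CPU) and memory (DRAM) keeps widening. This is a big problem for today's computers: memory speed is becoming the bottleneck of overall performance. Using memory well is becoming an increasingly important goal.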

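Locality
Temporal locality: the same data object is likely to be reused by the CPU many times during program execution. Once a data object has been brought into the cache on its first miss, many subsequent hits on that object can be expected. Because the cache is faster than the next lower level of storage, such as main memory, those later hits are served faster than the original miss.
Spatial locality: if a data object is referenced once, its neighboring data objects are likely to be referenced in the near future too. A memory block usually contains several data objects. Thanks to spatial locality, the cost of copying a block after a miss is amortized over the later references to other objects in that block.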

Locality Information
• L1 Data cache = 32 KB, 64 B/line, 8-way, write-back, ECC. Two 256-bit loads and one 256-bit store per cycle.
• L1 Instruction cache = 32 KB, 64 B/line, 8-way. 32 bytes fetch / cycle.
Latency
• L1 Cache Latency = 4 cycles
• L2 Cache Latency = 12 cycles
• L3 Cache Latency = 38 cycles
• RAM Latency = 38 cycles + 66 ns

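The nearer, the more often used
What is used often must be fast
If a program is built out of many loops, nearby memory is usually the memory used most often.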

i outside: miss rate = 4/16
a[i][j]   j=0            j=1           j=2            j=3
i=0       w[0] (miss)    w[1] (hit)    w[2] (hit)     w[3] (hit)
i=1       w[4] (miss)    w[5] (hit)    w[6] (hit)     w[7] (hit)
i=2       w[8] (miss)    w[9] (hit)    w[10] (hit)    w[11] (hit)
i=3       w[12] (miss)   w[13] (hit)   w[14] (hit)    w[15] (hit)

j outside: miss rate = 16/16
a[i][j]   j=0           j=1           j=2            j=3
i=0       w[0] (miss)   w[4] (miss)   w[8] (miss)    w[12] (miss)
i=1       w[1] (miss)   w[5] (miss)   w[9] (miss)    w[13] (miss)
i=2       w[2] (miss)   w[6] (miss)   w[10] (miss)   w[14] (miss)
i=3       w[3] (miss)   w[7] (miss)   w[11] (miss)   w[15] (miss)

Is an array really one-dimensional?
Memory (64 bits per element): a[0,0] a[0,1] a[0,2] a[0,3] a[1,0] a[1,1] a[1,2] a[1,3] a[2,0] a[2,1] …
a[i][j]   j=0            j=1           j=2            j=3
i=0       w[0] (miss)    w[1] (hit)    w[2] (hit)     w[3] (hit)
i=1       w[4] (miss)    w[5] (hit)    w[6] (hit)     w[7] (hit)
i=2       w[8] (miss)    w[9] (hit)    w[10] (hit)    w[11] (hit)
i=3       w[12] (miss)   w[13] (hit)   w[14] (hit)    w[15] (hit)
L1d cache size: 2 MiB (32 KiB / core)
64 bits = 8 bytes
4 numbers * 8 bytes = 32 bytes

Can't you just swap i and j?
(Figure: the same memory layout shown three times, 64 bits per element: a[0,0] a[0,1] a[0,2] a[0,3] a[1,0] a[1,1] a[1,2] a[1,3] a[2,0] a[2,1] …)

Case 3: Vector Processing
With SIMD vs. without SIMD
Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA.
SIMD: getting the full value out of your CPU!

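Computing the square roots of 6.4M numbers
1. Create an array and fill it with numbers
2. Use _mm_sqrt_ps() to take the square root of 4 floats at once
3. Use _mm_store_ps() to store 4 floats at once
One float occupies 4 bytes = 32 bits, so one 128-bit register holds 4 floats.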

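The effect of using the SSE instruction set
Unit: seconds     With SSE   Without SSE
Function time     0.0448     0.1903
One 128-bit register can operate on 4 floats, which gives a speedup of slightly more than 4x.
(Bar chart: without SSE vs. with SSE, axis from 0 to 0.2 seconds.)
4 times faster!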

Longer registers
The registers of a 64-bit CPU are 64 bits wide.
But there are longer ones…
How can you see them?
(gdb) info all-registers


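Speeding up with the compiler
g++ -S -O3 -msse2 sse2-none.cpp
For the normal (non-intrinsic) code, the compiler can also make use of some of the xmm registers for you, but the performance is not as good as the hand-written version.
Problem recap
Unit: seconds     With SSE   Without SSE   Automatic (-O3)
Function time     0.0448     0.1903        0.1041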

Be a SIMD-friendly engineer
Wrap the arrays in a struct, or wrap the structs in an array?
SOA: Structure of Arrays
AOS: Array of Structures
AOSOA: Array of Structures of Arrays

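More Resources
https://hackmd.io/@owlfox/csapptw/
Two CMU professors, Randal E. Bryant and David R. O’Hallaron, skillfully blend core knowledge from many different disciplines, including program design and performance optimization, digital circuit fundamentals, instruction set architecture, assembly language, storage devices, linkers and loaders, processes, and virtual memory, and present it all from a program developer's point of view. That is why the book is titled “A Programmer’s Perspective”.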

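More Resources
https://ocw.nthu.edu.tw/ocw/index.php?page=course&cid=231
Parallel Programming, National Tsing Hua University, 2018 Fall Semester
This course introduces the basic concepts of parallel computing and computer system architecture, and teaches the programming models designed for different parallel computing environments: Pthread and OpenMP for multi-core systems, MPI for cluster computing, CUDA for GPUs, and the MapReduce framework for distributed systems.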

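Conclusion
• Branch Prediction: wasting time sorting an array, yet the program runs faster?
• Cache Locality: unpacking loops! Does iterating over i first or j first really make this much difference?
• Vector Processing: SIMD, getting the full value out of your CPU!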

Thank You!
Telegram: @WilliamMou
Twitter: @willliam_mou
Facebook: 牟展祐
2022 William Mou, SITCON
