220903_GFS

Google File System (SOSP '03)
raf

Why does S3...
- not allow in-place updates?
- have no mv (only rm after cp)?
- have a request limit?
  "AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)"

MapReduce workers perform sequential reads and writes.

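Radically different points to design distributed file system
- Component failures are the norm, not the exception. Monitoring, error detection, and fault recovery are essential.
- Files have become huge. I/O patterns and block sizes have to be reconsidered.
- Most files are mutated only by appending; in-place updates are rare. Most reads are sequential. The system should therefore optimize appends rather than overwrites.
- Designing the application and the file system together benefits the system as a whole.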

Design Assumptions
- Hardware fails often.
- The system stores large files (100MB or more). Small files are supported but do not need to be optimized for.
- A large streaming read fetches roughly 100KB-1MB at a time. Successive operations from the same client read sequentially.
- A small random read fetches less than 10KB.
- The workload consists of large, sequential writes. Write sizes are similar to read sizes, and a file is rarely modified once written.
- Concurrent appends from multiple clients must be efficient. When hundreds of producers append at the same time, the atomicity overhead must be small.
- High bandwidth matters far more than low latency.

Interface
- POSIX-like (familiar operations, although GFS does not implement the POSIX API itself): create, delete, open, close, read, write
- Not in POSIX:
  - snapshot: copy a file or directory tree at low cost, using copy-on-write
  - record-append: append concurrently and atomically

Architecture
- Chunk: a fixed-size block of a file
- chunkserver: stores the data
- single master: manages filesystem metadata
- client: caches chunk locations

Metadata
- file namespace (directory structure)
  - /foo/bar
- file to chunk mapping
  - [0, 64MB) of /foo/bar is chunk 2ef0
- chunk replica locations
  - chunk 2ef0 is on chunkservers A, B, C
All metadata is kept in memory.
The namespace and the file-to-chunk mapping are made durable by recording mutations in the operation log, which is also replicated.
Chunk locations are not persisted; the master collects them from the chunkservers at master startup and whenever a chunkserver joins.
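The three kinds of metadata can be pictured as two in-memory maps plus a lookup that turns a byte offset into a chunk. This is a minimal sketch, not GFS code: the class name `MasterMetadata` and its fields are hypothetical, and only the example values (/foo/bar, chunk 2ef0, chunkservers A, B, C) come from the slide.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, as on the Chunk Size slide


@dataclass
class MasterMetadata:
    # file path -> ordered list of chunk handles (made durable via the log)
    file_chunks: dict = field(default_factory=dict)
    # chunk handle -> chunkserver names (not persisted, reported by chunkservers)
    chunk_locations: dict = field(default_factory=dict)

    def lookup(self, path: str, offset: int):
        """Translate (path, byte offset) into (chunk handle, replica locations)."""
        index = offset // CHUNK_SIZE  # which chunk of the file holds this offset
        handle = self.file_chunks[path][index]
        return handle, self.chunk_locations.get(handle, [])


meta = MasterMetadata()
meta.file_chunks["/foo/bar"] = ["2ef0"]
meta.chunk_locations["2ef0"] = ["A", "B", "C"]
print(meta.lookup("/foo/bar", 10 * 1024 * 1024))  # ('2ef0', ['A', 'B', 'C'])
```

A client would cache the returned handle and locations, so later reads of the same chunk do not need to contact the master.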

Chunk Size
64MB is far larger than the block size of typical filesystems.
- It reduces the number of network requests, especially for sequential reads and writes on large files.
- The larger the chunk, the fewer metadata entries there are, so the master can hold all metadata in memory.
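A quick back-of-the-envelope calculation shows why the chunk size decides whether metadata fits in memory. The 64 bytes per entry is an assumption here (the GFS paper reports less than 64 bytes of metadata per chunk), and the 4KB block size and 1 PiB of data are illustrative values that do not appear in the slides.

```python
CHUNK_SIZE = 64 * 2**20     # 64MB chunk, from the slide
FS_BLOCK_SIZE = 4 * 2**10   # 4KB block, assumed for comparison
BYTES_PER_ENTRY = 64        # assumed metadata cost per chunk or block


def metadata_bytes(total_data: int, unit_size: int) -> int:
    """Metadata needed to track total_data bytes split into unit_size pieces."""
    units = -(-total_data // unit_size)  # ceiling division
    return units * BYTES_PER_ENTRY


one_pib = 2**50
print(metadata_bytes(one_pib, CHUNK_SIZE) / 2**30)     # 1.0  (GiB)
print(metadata_bytes(one_pib, FS_BLOCK_SIZE) / 2**40)  # 16.0 (TiB)
```

Under these assumptions, tracking 1 PiB takes about 1 GiB of metadata with 64MB chunks, against about 16 TiB with 4KB blocks.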

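Operation Log
- A log of metadata mutations appended in order; it also defines the order of concurrent operations.
- Similar to write-ahead logging: the master executes an operation only after the log record has been stored on the replicas.
- To keep recovery time short, the log is compacted periodically (the master checkpoints its state and replays only the log written after the checkpoint).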
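The log-first rule can be sketched as follows. Every name here (`LogReplica`, `Master.mutate`, `Master.recover`) is hypothetical, records are simplified to `(op, path)` tuples, and checkpointing is left out; the only point is that the in-memory state changes after the replicas have stored the record, so replaying the log reproduces the state.

```python
class LogReplica:
    """Stand-in for a local disk or remote machine that stores log records."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return True  # acknowledge that the record is stored


class Master:
    def __init__(self, replicas):
        self.replicas = replicas
        self.namespace = {}  # in-memory state: path -> list of chunk handles

    def apply(self, record):
        op, path = record
        if op == "create":
            self.namespace[path] = []
        elif op == "delete":
            self.namespace.pop(path, None)

    def mutate(self, record):
        # 1. Log first: every replica must acknowledge the record.
        if not all(r.append(record) for r in self.replicas):
            raise RuntimeError("log replication failed; mutation not applied")
        # 2. Only then change the in-memory state.
        self.apply(record)

    @classmethod
    def recover(cls, replicas):
        """Rebuild the in-memory state by replaying one replica's log in order."""
        master = cls(replicas)
        for record in replicas[0].records:
            master.apply(record)
        return master


replicas = [LogReplica(), LogReplica()]
master = Master(replicas)
master.mutate(("create", "/foo/bar"))
master.mutate(("create", "/tmp/x"))
master.mutate(("delete", "/tmp/x"))
recovered = Master.recover(replicas)
print(recovered.namespace == master.namespace)  # True
```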

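Write a New File
1. The client asks the master which replicas will perform the write.
2. The client pushes the data to the replicas, in any order. (Data Flow)
3. Once every replica has received the data, the client sends a write request to the primary replica.
4. The primary replica decides the write order.
5. The primary forwards the write order to the secondary replicas, which apply the mutation.
6. The primary replica receives ACKs from the secondaries.
7. The primary reports completion to the client.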
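Steps 2 to 7 can be walked through with a small in-process simulation. The classes `Replica` and `Primary`, the 16-byte chunk, and the `data_id` key are all hypothetical; step 1 (asking the master for replicas) is assumed to have already happened, and networking, leases, and error handling are omitted.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}            # data_id -> bytes, staged but not yet written
        self.chunk = bytearray(16)  # tiny chunk for illustration

    def push(self, data_id, data):
        # Step 2: data arrives and is only buffered.
        self.buffer[data_id] = data

    def apply(self, serial, data_id, offset):
        # Step 5: perform the mutation in the order chosen by the primary.
        # A real replica would apply mutations strictly in serial order;
        # this sketch issues one write at a time, so serial is informational.
        data = self.buffer.pop(data_id)
        self.chunk[offset:offset + len(data)] = data
        return True  # ACK


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id, offset):
        serial = self.next_serial        # step 4: primary picks the order
        self.next_serial += 1
        self.apply(serial, data_id, offset)
        acks = [s.apply(serial, data_id, offset) for s in self.secondaries]  # steps 5-6
        return all(acks)                 # step 7: reply to the client


b, c = Replica("B"), Replica("C")
a = Primary("A", [b, c])
for r in (c, a, b):                      # step 2: any order works
    r.push("d1", b"gfs!")
ok = a.write("d1", offset=4)             # step 3: write request goes to the primary
print(ok, a.chunk == b.chunk == c.chunk)  # True True
```

Separating the data push (step 2) from the write request (step 3) is what lets the data travel along whatever path is cheapest while the primary alone decides ordering.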

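Data Flow
- Data is passed from one chunkserver to the next in a chain.
  - This saves network bandwidth at the sender, since its outbound bandwidth is not split across several recipients.
- The client sends the data only to the closest replica.
- Pipelining reduces the overall latency.
- Record-append data arrives concurrently, but the primary decides the append order.
- If the data would not fit in the current chunk, the chunk is padded and the client is told to retry.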
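The padding rule for record append can be sketched as below. This is an illustration under simplifying assumptions: `Chunk` and `record_append` are hypothetical names, the chunk is 16 bytes instead of 64MB, and the client's retry on the next chunk is folded into the same function rather than being a separate request.

```python
CHUNK_SIZE = 16  # tiny chunk for illustration; the slides use 64MB


class Chunk:
    def __init__(self):
        self.data = bytearray()


def record_append(chunks, record: bytes):
    """Append record atomically; return (chunk index, offset) where it landed."""
    if len(record) > CHUNK_SIZE:
        raise ValueError("record larger than a chunk")
    last = chunks[-1]
    if len(last.data) + len(record) > CHUNK_SIZE:
        # Record does not fit: pad the rest of the chunk, then retry on a new one.
        last.data.extend(b"\0" * (CHUNK_SIZE - len(last.data)))
        chunks.append(Chunk())
        last = chunks[-1]
    offset = len(last.data)
    last.data.extend(record)
    return len(chunks) - 1, offset


chunks = [Chunk()]
print(record_append(chunks, b"a" * 10))  # (0, 0)
print(record_append(chunks, b"b" * 10))  # (1, 0)
print(len(chunks[0].data))               # 16
```

Because a record never straddles two chunks, the primary can pick the offset itself and the append stays atomic even with many concurrent producers.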

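Consistency Model
- consistent: all N replicas hold the same data, so a client sees the same bytes whichever replica it reads.
- defined: the value is "meaningful", that is, the region is consistent and reflects one mutation in its entirety.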

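Consistency Model - Write
client1: write(62, 4)
client2: write(62, 4)
Result: consistent but undefined. All replicas end up with the same bytes, but the region may contain mixed fragments from both writes.
(Figure: three replicas, each holding chunk1 and chunk2.)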

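Consistency Model - Record Append
- If an append fails, for example because of a network partition, the affected region is simply left inconsistent.
- The client retries the request, so the record is appended at least once.
- Readers identify and skip the failed regions using record checksums and unique record IDs. (Chunk version numbers serve a different purpose: the master uses them to detect stale replicas.)
(Figure: Replica2 (Primary), Replica1, Replica3; request, failure, retry.)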
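On the reader side, the GFS paper describes applications coping with padding, partial records, and retry duplicates by means of checksums and unique record identifiers. The sketch below assumes a made-up record layout (a dict with `id`, `payload`, `crc`, and `None` standing for padding); `encode` and `read_valid` are hypothetical helpers, not a GFS API.

```python
import zlib


def encode(record_id: int, payload: bytes) -> dict:
    """Writer side: attach a unique ID and a checksum to each record."""
    return {"id": record_id, "payload": payload, "crc": zlib.crc32(payload)}


def read_valid(region):
    """Reader side: keep each intact record once, in the order it appears."""
    seen, out = set(), []
    for rec in region:
        if rec is None:
            continue  # padding
        if zlib.crc32(rec["payload"]) != rec["crc"]:
            continue  # fragment left behind by a failed append
        if rec["id"] in seen:
            continue  # duplicate produced by a retry
        seen.add(rec["id"])
        out.append(rec["payload"])
    return out


region = [
    encode(1, b"a"),
    None,                                                       # padding
    {"id": 2, "payload": b"par", "crc": zlib.crc32(b"partial")},  # failed append
    encode(2, b"partial"),                                      # successful retry
    encode(2, b"partial"),                                      # duplicate
]
print(read_valid(region))  # [b'a', b'partial']
```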

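Master: Namespace management
- GFS has no per-directory data structure; the namespace (/foo/bar) is represented as a lookup table.
- Each record in the lookup table has a read/write lock.
- For a path /d1/d2/…/dn/leaf:
  - read locks are taken on /d1, /d1/d2, …, /d1/…/dn, in that order (a fixed order avoids deadlock)
  - a read or write lock is taken on /d1/d2/…/dn/leaf
- Multiple files can be created in the same directory concurrently without contention.
  - With inodes, by contrast, the directory would be a point of contention.
- The read lock on a directory prevents it from being deleted, renamed, or snapshotted.
- The write lock on a file prevents another file from being created under the same name.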
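The locking rule can be made concrete by computing the lock set for an operation and checking two sets for conflicts. `locks_for` and `conflicts` are hypothetical helpers and the paths are illustrative; the sketch only computes which locks are needed and does not implement real locks or blocking.

```python
def locks_for(path: str, leaf_mode: str):
    """Return the ordered (name, mode) locks for an operation on path."""
    parts = path.strip("/").split("/")
    # Read locks on every ancestor: /d1, /d1/d2, ..., /d1/.../dn
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "read") for p in prefixes] + [(path, leaf_mode)]


def conflicts(a, b) -> bool:
    """Two lock sets conflict if they share a name and at least one side writes it."""
    held = dict(a)
    return any(name in held and "write" in (mode, held[name]) for name, mode in b)


create_a = locks_for("/home/user/a", "write")
create_b = locks_for("/home/user/b", "write")
snapshot = locks_for("/home/user", "write")  # snapshot write-locks the directory

print(create_a)                       # [('/home', 'read'), ('/home/user', 'read'), ('/home/user/a', 'write')]
print(conflicts(create_a, create_b))  # False
print(conflicts(snapshot, create_a))  # True
```

The two creations only share read locks on their ancestors, so they proceed in parallel, while the snapshot's write lock on /home/user collides with the read lock that every creation inside it must hold.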

Master Operations
- Namespace operations
- Replica placement, creation, re-replication, rebalancing
- Garbage collection (deletion flag)
- Shadow master (read-only)
  - Used by applications whose files are not mutated often and that can tolerate stale results.
  - It applies the operation log written by the master, so it lags behind the master.

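Summary
- Designed for a workload dominated by appends and an environment where failures are very frequent.
- Optimized for sequential reads and writes.
- The single master acts as a control plane and stays out of the data path.
- Application (library) and filesystem API are designed together → weaker consistency model → maximum append performance.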

Buzzvil
