Slide 1

Team webapp7 https://lb.webapp.ku.whs.in.th

Slide 2

Why HBase?
● MapReduce is slowwwwwww
● It takes 40 minutes to import the data into HBase using MapReduce
● Too many small files are a weak point of Hadoop's MapReduce

Slide 3

HTable structure

Key | timestamp     | p:f         | p:p
----|---------------|-------------|----
a   | 1464292227911 | rmlav10.txt | 613
e   | 1464292227911 | 12370-8.txt | 387
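
A minimal sketch of writing one such row through the HBase Java client API; the table name "search" is an assumption, while the column family p and the qualifiers f (file) and p (paragraph) come from the slide:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutRow {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("search"))) { // table name: assumption
            Put put = new Put(Bytes.toBytes("a"));                        // row key: the indexed term
            put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("f"),
                          Bytes.toBytes("rmlav10.txt"));                  // p:f = source file
            put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("p"),
                          Bytes.toBytes("613"));                          // p:p = paragraph number
            table.put(put);
        }
    }
}
```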

Slide 4

The problem with HBase
● HBase + Hadoop consumes too much memory!
● Solution: a 1GB swap file (a sketch of the usual recipe follows)
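
The standard Linux recipe for such a swap file looks like this (a sketch; the path /swapfile is an assumption):

```sh
dd if=/dev/zero of=/swapfile bs=1M count=1024   # create a 1GB file
chmod 600 /swapfile                             # swap files must not be world-readable
mkswap /swapfile                                # format it as swap space
swapon /swapfile                                # enable it
```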

Slide 5

HBase architecture (diagram)
● momo – Load Balancer (nginx)
● konoha – Web Server (Rapidoid, Java)
● ene – Web Server (Rapidoid, Java)
● ayano – Zookeeper Master, Hadoop Namenode
● kido – Zookeeper Secondary Master, Hadoop Datanode
● seto – Zookeeper Quorum, Hadoop Datanode

Slide 6

Benchmark

Slide 7

Can we go faster?

Slide 8

The "fast" implementation https://fast.lb.webapp.ku.whs.in.th 'n reckless

Slide 9

Benchmark
● Run on the load balancer
● wrk benchmark
○ 30s test
○ 500 connections
● HBase and MapFile have some timeouts

Query: "Corrected EDITIONS of our eBooks replace the old file and take over the old filename and etext number. The replaced older file is renamed. EBooks posted prior to November 2003, with eBook numbers BELOW #10000, are filed in directories based on their release date. If you want to download any of these eBooks directly, rather than using the regular search system you may utilize the following addresses and just download by the etext year."

Result (chart): 2.87x
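
A hedged reconstruction of the wrk invocation these parameters imply; the thread count and the URL/query-parameter format are assumptions, and $QUERY stands for the URL-encoded query above:

```sh
wrk -t4 -c500 -d30s "https://lb.webapp.ku.whs.in.th/?q=$QUERY"
```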

Slide 10

Fast search data structure

Slide 11

HashMap!

Slide 12

The size problem (diagram: a hash table with slots 1–5 holding Key/Value pairs; keys are hashed with xxHash)

Slide 13

The size problem (diagram): the raw key is an entire paragraph ("The Common Edition is the result of a decade long study of the New Testament and numerous English translations in the modern church. The goals for this edition are:"), and the value is the JSON { "file": "Introduction_and_Copyright.txt", "paragraph": 2 }

Slide 14

"When you see something big, make it small enough" — JL ไมไดกลาวไว 1 2 3 4 5 1234561561865 { "file": "Introduction_and_Copyright.txt", "paragraph": 2 }

Slide 15

The size problem (diagram): the JSON value is replaced by a packed struct { int file_id; int paragraph_id; }, with a separate table mapping file_id 0001 to Introduction_and_Copyright.txt
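
A minimal sketch of that packed value, two ints in 8 bytes instead of a JSON string; the field order follows the struct on the slide:

```java
import java.nio.ByteBuffer;

public class PackedValue {
    // struct { int file_id; int paragraph_id; } packed into 8 bytes
    static byte[] pack(int fileId, int paragraphId) {
        return ByteBuffer.allocate(8).putInt(fileId).putInt(paragraphId).array();
    }

    static int fileId(byte[] value)      { return ByteBuffer.wrap(value).getInt(0); }
    static int paragraphId(byte[] value) { return ByteBuffer.wrap(value).getInt(4); }
}
```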

Slide 16

Database size

Slide 17

HashMap implementation
● Initially wrote a custom memory-mapped HashMap in C
● Due to a memory limitation of mmap, the database size was limited to the available memory
● Changed to LMDB from OpenLDAP (via a Java binding; see the sketch below)
● The LMDB database size limit is set to 1GB
● May not scale out really well – risk of hash collisions
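
The slides do not name the Java binding; as one possible rendering, here is a minimal sketch using the lmdbjava binding with the 1GB map size from the slide (the database path and name are assumptions):

```java
import java.io.File;
import java.nio.ByteBuffer;
import org.lmdbjava.Dbi;
import org.lmdbjava.DbiFlags;
import org.lmdbjava.Env;
import org.lmdbjava.Txn;

public class LmdbSketch {
    public static void main(String[] args) {
        Env<ByteBuffer> env = Env.create()
                .setMapSize(1_073_741_824L)        // the 1GB limit from the slide
                .setMaxDbs(1)
                .open(new File("/tmp/searchdb"));  // path: assumption
        Dbi<ByteBuffer> db = env.openDbi("search", DbiFlags.MDB_CREATE);

        ByteBuffer key = ByteBuffer.allocateDirect(8);
        key.putLong(1234561561865L).flip();        // hashed key from the earlier slides
        ByteBuffer val = ByteBuffer.allocateDirect(8);
        val.putInt(1).putInt(2).flip();            // packed { file_id, paragraph_id }
        db.put(key, val);                          // implicit write transaction

        try (Txn<ByteBuffer> txn = env.txnRead()) {
            ByteBuffer found = db.get(txn, key);   // null if the key is absent
            System.out.println(found != null);
        }
        env.close();
    }
}
```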

Slide 18

รบรอยชนะรอย ยังไมดีพอ ชนะโดยไมตองรบ แมเพียงครั้งเดียว ยอมถือวาเลิศจบแดน – Sun Tzu

Slide 19

Bloom filter (diagram): "Hello world!!" is hashed by xxhash and murmur3 into slots 3 and 4

Slide 20

Bloom filter (diagram, next step): the bits at slots 3 and 4 are set

Slide 21

Bloom filter (diagram): a lookup hashed by xxhash and murmur3 to slots 1 and 3, answered "Yes!"

Slide 22

Bloom filter (diagram): a lookup hashed by xxhash and murmur3 to slots 3 and 4, answered "Noooo"

Slide 23

Bloom filter
● A 200kB Bloom filter is used = 1,638,400 slots!
● Using a custom Java implementation, every bit is used! (see the sketch below)
● The Bloom filter is preloaded into the server process's memory when it starts
● Planned to send the Bloom filter to run on the client side
○ Different data type limits
○ Complex client-side code
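
A minimal sketch of a two-hash Bloom filter at the size given above (200kB = 1,638,400 bits); the slides pair xxhash with murmur3, and the seeded FNV-style mixers below are dependency-free stand-ins:

```java
import java.util.BitSet;

public class BloomFilter {
    private static final int BITS = 1_638_400;   // 200kB * 8 bits, as on the slide
    private final BitSet bits = new BitSet(BITS);

    // Stand-in hash (FNV-1a style); the slides use xxhash and murmur3 instead.
    private static int hash(String s, int seed) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 16777619;
        }
        return Math.floorMod(h, BITS);
    }

    public void add(String key) {
        bits.set(hash(key, 0x811C9DC5));
        bits.set(hash(key, 0x01000193));
    }

    // May return a false positive, never a false negative.
    public boolean mightContain(String key) {
        return bits.get(hash(key, 0x811C9DC5)) && bits.get(hash(key, 0x01000193));
    }
}
```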

Slide 24

Did the Bloom filter help?
● https://en.wikipedia.org/wiki/Opinion_polling_for_the_2015_United_Kingdom_general_election
○ Content-Length: 289,085
● Romano Lavo-Lil (rmlav10.txt)
○ Size: 288,195
● The HBase implementation also uses the same Bloom filter

Slide 25

LMDB architecture (diagram)
● momo – Load Balancer
● konoha – Web Server
● ene – Web Server

Slide 26

HDFS Edition
● http://hdfs.lb.webapp.ku.whs.in.th (No HTTPS)
● Uses a MapReduce job to generate a MapFile (a lookup sketch follows)
● Just written this morning
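
A rough sketch of reading the generated MapFile back with the Hadoop API; the HDFS path and the Text key/value types are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MapFile.Reader reader =
                new MapFile.Reader(new Path("/search/index"), conf); // path: assumption
        try {
            Text value = new Text();
            // get() binary-searches the sorted index and returns null if the key is absent.
            if (reader.get(new Text("a"), value) != null) {
                System.out.println(value);
            }
        } finally {
            reader.close();
        }
    }
}
```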

Slide 27

Thank you https://github.com/whsatku/lmdbsearch