Web app project

012219351 project presentation - Sentence search in Hadoop

More Decks by Manatsawin Hanmongkolchai

Transcript

  1. Why HBase?
     • MapReduce is slow
     • Importing the data into HBase with MapReduce takes 40 minutes
     • Too many small files are a weak point of Hadoop's MapReduce
  2. The problem with HBase
     • HBase + Hadoop consume too much memory!
     • Solution: 1GB swap file
  3. HBase architecture
     • Load Balancer: momo
     • Web Server: konoha
     • Web Server: ene
     • Zookeeper Master, Hadoop Namenode: ayano
     • Zookeeper Secondary Master, Hadoop Datanode: kido
     • Zookeeper Quorum, Hadoop Datanode: seto
     • Software shown in the diagram: nginx, Rapidoid (Java), Rapidoid (Java)
  4. Benchmark
     • Run on the load balancer
     • wrk benchmark
       ◦ 30s test
       ◦ 500 connections
     • HBase and MapFile have some timeouts
     • Query: "Corrected EDITIONS of our eBooks replace the old file and take over the old filename and etext number. The replaced older file is renamed. EBooks posted prior to November 2003, with eBook numbers BELOW #10000, are filed in directories based on their release date. If you want to download any of these eBooks directly, rather than using the regular search system you may utilize the following addresses and just download by the etext year."
     • Chart label: 2.87x
  5. The size problem
     [Diagram: slots 1 2 3 4 5 holding Key/Value pairs; Key → xxHash → 1]
  6. The size problem
     [Diagram: slots 1 2 3 4 5]
     • Key: "The Common Edition is the result of a decade long study of the New Testament and numerous English translations in the modern church. The goals for this edition are:"
     • Value: { "file": "Introduction_and_Copyright.txt", "paragraph": 2 }
  7. "When you see something big, make it small enough" (JL never actually said this)
     [Diagram: slots 1 2 3 4 5]
     • Key: 1234561561865
     • Value: { "file": "Introduction_and_Copyright.txt", "paragraph": 2 }
  8. The size problem
     [Diagram: slots 1 2 3 4 5]
     • Key: 1234561561865
     • Value: 0001 0010
     • File table: 1 → Introduction_and_Copyright.txt
     • Value layout: struct { int file_id; int paragraph_id; }
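
The shrinking steps on the "size problem" slides can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, FNV-1a is used as a stand-in 64-bit hash because the JDK ships no xxHash (the slides use xxHash), and the 8-byte value uses `ByteBuffer`'s default big-endian order rather than whatever layout the original C struct had.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not the project's actual code.
class CompactEntry {
    // Stand-in 64-bit hash (FNV-1a). The slides use xxHash instead;
    // any 64-bit hash turns a long paragraph into a fixed-size key.
    static long hash64(String text) {
        long h = 0xcbf29ce484222325L;           // FNV-1a 64-bit offset basis
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                // FNV-1a 64-bit prime
        }
        return h;
    }

    // Pack the two ints of "struct { int file_id; int paragraph_id; }"
    // into 8 bytes (big-endian, ByteBuffer's default order).
    static byte[] packValue(int fileId, int paragraphId) {
        return ByteBuffer.allocate(8).putInt(fileId).putInt(paragraphId).array();
    }

    // Reverse of packValue: returns {fileId, paragraphId}.
    static int[] unpackValue(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

With this layout each entry costs 8 bytes of key plus 8 bytes of value, and the file name is stored once in a separate id-to-name table instead of being repeated in every value.
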
  9. HashMap implementation
     • Initially wrote a custom memory-mapped HashMap in C
     • Because of mmap's memory limitation, the database size was limited to the available memory
     • Switched to LMDB from OpenLDAP (via a Java binding)
     • The LMDB database size limit is set to 1GB
     • May not scale out well: risk of hash collisions
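
To put the hash-collision risk in perspective, the usual birthday approximation can be computed directly. The sketch below is an assumption-laden illustration: the class name is hypothetical, the key counts are made up, and the 64-bit hash width is inferred rather than stated in the slides.

```java
// Hypothetical helper: estimates the chance that at least two of n keys
// collide when hashed uniformly into 2^hashBits values.
class CollisionEstimate {
    static double collisionProbability(long n, int hashBits) {
        double space = Math.pow(2.0, hashBits);
        // Birthday approximation: p ≈ 1 - exp(-n(n-1) / (2 * space))
        double exponent = -((double) n * (n - 1)) / (2.0 * space);
        // expm1 keeps precision when the probability is very small
        return -Math.expm1(exponent);
    }
}
```

For example, `CollisionEstimate.collisionProbability(1_000_000L, 64)` gives the estimated risk for one million paragraphs under a 64-bit hash. The estimate grows roughly with the square of the number of keys, which is why the risk matters more as the corpus scales.
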
  10. Bloom filter
     • A 200kB Bloom filter is used = 1,638,400 slots!
     • With a custom Java implementation, every bit is used!
     • The Bloom filter is preloaded into the server process's memory at startup
     • Planned to send the Bloom filter to run on the client side
       ◦ Different data type limits
       ◦ Complex client-side code
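
A bit-packed Bloom filter of the size on this slide might look like the sketch below. It is not the project's implementation: the class name, the number of hash functions (3), and the FNV-1a-based double hashing are all assumptions chosen for illustration. Only the 200kB size and the one-bit-per-slot idea come from the slide.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a bit-packed Bloom filter.
class SimpleBloomFilter {
    static final int SIZE_BYTES = 200 * 1024;    // 200kB
    static final int NUM_BITS = SIZE_BYTES * 8;  // 1,638,400 slots
    static final int NUM_HASHES = 3;             // assumption: not given in the slides

    private final byte[] bits = new byte[SIZE_BYTES];

    // Stand-in 64-bit hash (FNV-1a); the real filter may hash differently.
    private static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Derive the i-th slot from two 32-bit halves of one hash (double hashing).
    private static int slot(long h, int i) {
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;           // force odd so steps differ
        return Math.floorMod(h1 + i * h2, NUM_BITS);
    }

    void add(String item) {
        long h = fnv1a64(item.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < NUM_HASHES; i++) {
            int s = slot(h, i);
            bits[s >>> 3] |= 1 << (s & 7);       // set one bit inside its byte
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String item) {
        long h = fnv1a64(item.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < NUM_HASHES; i++) {
            int s = slot(h, i);
            if ((bits[s >>> 3] & (1 << (s & 7))) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

Storing eight slots per byte is what lets 200kB hold 1,638,400 slots; a `boolean[]` of the same length would typically take about eight times as much memory.
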
  11. Did the Bloom filter help?
     • https://en.wikipedia.org/wiki/Opinion_polling_for_the_2015_United_Kingdom_general_election
       ◦ Content length: 289,085
     • Romano Lavo-Lil (rmlav10.txt)
       ◦ Size: 288,195
     • The HBase implementation also uses the same Bloom filter