

N. Peeters
February 13, 2013

Amazon Web Services User Group - High Performance Data Processing at infohubble

Presentation given by Nicolas Peeters at the Amazon Web Services User Group in Amsterdam (13 Feb 2013).


Transcript

  1. PUSHING THE BOUNDARIES
    High Performance Data Processing at infohubble
    Amazon Web Services User Group, Amsterdam, 13 Feb 2013, by Nicolas Peeters
  2. BASE DATA vs RICH DATA
    BASE DATA
    • Name
    • Address
    • Phone/Fax
    • Geo
    • URL
    RICH DATA
    • Opening Hours
    • Taxonomy
    • Menu, Reservation
    • Classification
    • Price Levels
  3. CHALLENGES
    • A lot of sources
    • Entities are disconnected
    • Everything is (kind of) duplicated
    • Disambiguation
    • Mashup (for data)
    • SEO trickery
    • Legal (Feist v. Rural)
  4. starbucks
    starbucks coffee
    starbucks coffee company
    starbucks coffee co
    starbuck's corporation
    starbuck's coffee
    starbucks coffee co.
    starbucks coffee co inc
    starbuck`s ...
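
One way to collapse spelling variants like the ones on this slide into a single comparison key is a small normalization pass. The following is a minimal sketch under stated assumptions, not infohubble's actual method: the function name `normalize_name` and the `SUFFIXES` list are hypothetical, and a real system would need a curated, locale-aware list.

```python
import re

# Hypothetical list of trailing words to ignore when comparing names.
SUFFIXES = {"co", "inc", "company", "corporation", "coffee"}


def normalize_name(raw: str) -> str:
    """Reduce a business-name variant to a crude comparison key."""
    text = raw.lower()
    # Fold possessive markers written with ', ` or ’ (e.g. "starbuck's").
    text = re.sub(r"['`’]s\b", "s", text)
    # Replace any remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # Strip suffix words from the end only, so the leading brand survives.
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


variants = [
    "starbucks", "starbucks coffee", "starbucks coffee company",
    "starbucks coffee co", "starbuck's corporation", "starbuck's coffee",
    "starbucks coffee co.", "starbucks coffee co inc", "starbuck`s",
]
keys = {normalize_name(v) for v in variants}
```

Treating "coffee" as a droppable suffix is domain-specific and would be wrong for many other businesses, which is part of why disambiguation is listed as a challenge.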
  5. ADDRESS
    Nicolas Peeters Gustav Malherplein 82 1082MA
    infohubble c/o Nicolas Peeters Verd. 1 82 G. Mahlerplein 1082MA Amsterdam Nederland
    Nicolas Peeters Gustav Mahlerplein, nr. 82 1082 MA Amsterdam
    Nicolas Peeters infohubble Sevilla Building Gustav Mahlerplein 1082MA Amsterdam
  6. ADDRESS PARSER
    ★ Grammar Based Parser (NLP, language/locale detection)
    ★ Normalization
    ★ Extraction from raw text
    ★ Multi-country
    ★ Unique capability
  7. OPENING HOURS (small sample)
    • 12-15 Uhr, tgl. ab 11 Uhr, Küche 12-15 + 18-23.30 Uhr
    • Mo - So 12:00 - 00:00 Sonntag Ruhetag
    • Mo-Fr 12-15 Uhr, Mo-Sa ab 10 Uhr, So ab 11 Uhr
    • Son- Don 12 -00 sam und Frei. 12- ?
    • ab Mai tgl. ab 19 Uhr, ab September tgl. ab 18 Uhr
    • Mo.-Sa. 11-3 Uhr
    • Mo - Fr 09:00 - 00:00 (open end) Sa + So 10:00 - 00:00 (open end)
    • Mo – Do 11.30 – 1.00 Uhr
    • L-J 9:00 23:30 V-S 9:00 1:00 D 9:0 23:30
    • Mon - Fri unless specified Open: 0900 (Tue 0930) - 1700 (Thu 1800) (Sat.): 1000 - 1500
    • Sun 12:00 ż 23.00
    • Mon-Sun 12N-12M Fri-Sat -1am Sun -11.30pm
    • M-F 8:00AM-4:30PM LUNCH 12:30P-1:30P
    • Winter Hours Daily 10-8 Saturday 10-7 Sunday 12-5
    • M-F: 8:00am Sat.: - - Sun.: closed
    • Daily til 1am, 2am on Fri/Sat
    • Mon-Fri (6am-8pm) Sat-Sun (7am-8pm)
    • Opening times: 10am - 1am (weekends 2am)
  8. OPENING HOURS
    ★ Prefix match in text (locale-aware)
    ★ Parsing (custom grammar)
    ★ Meet minimum requirements (hours / days)
    ★ Analyze the matched structure to ensure there is enough raw information
    ★ If multiple opening hours (OHs) are detected, check their equivalence
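
The parse, minimum-requirements and equivalence steps can be illustrated with a toy example. This is only a sketch, assuming a single English pattern ("Mon-Fri (6am-8pm)"); the real parser is described as a locale-aware custom grammar, and the names `parse_opening_hours` and `meets_minimum` are hypothetical.

```python
import re

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# A day range such as "Mon-Fri" followed by a time range such as "(6am-8pm)".
RULE = re.compile(
    r"(?P<d1>[a-z]{3})\s*-\s*(?P<d2>[a-z]{3})\s*"
    r"\(?\s*(?P<h1>\d{1,2})(?P<m1>am|pm)\s*-\s*"
    r"(?P<h2>\d{1,2})(?P<m2>am|pm)\s*\)?",
    re.IGNORECASE,
)


def to_24h(hour: int, meridiem: str) -> int:
    """Convert a 12-hour clock value to 24-hour (12am -> 0, 12pm -> 12)."""
    hour %= 12
    return hour + 12 if meridiem.lower() == "pm" else hour


def parse_opening_hours(text: str) -> dict:
    """Map each covered weekday to an (open_hour, close_hour) tuple."""
    result = {}
    for m in RULE.finditer(text):
        d1, d2 = m["d1"].lower(), m["d2"].lower()
        if d1 not in DAYS or d2 not in DAYS:
            continue  # not a weekday pair we recognise
        start, end = DAYS.index(d1), DAYS.index(d2)
        if start > end:
            continue  # wrap-around ranges (e.g. Sat-Mon) are not handled here
        hours = (to_24h(int(m["h1"]), m["m1"]), to_24h(int(m["h2"]), m["m2"]))
        for day in DAYS[start:end + 1]:
            result[day] = hours
    return result


def meets_minimum(parsed: dict) -> bool:
    """Minimum requirement in this sketch: at least one day with hours."""
    return len(parsed) >= 1


a = parse_opening_hours("Mon-Fri (6am-8pm) Sat-Sun (7am-8pm)")
b = parse_opening_hours("Mon-Fri 6am-8pm Sat-Sun 7am-8pm")
equivalent = a == b  # two raw strings describing the same opening hours
```

Comparing the parsed structures rather than the raw strings is what makes the equivalence check possible when the same business is listed with differently formatted hours.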
  9. URL FINDER
    • Start with the name "Berlin Pension 58 Mitte"
    • Brute-force generate the candidate URLs: http://www.pension-58-berlin-mitte.de, http://www.pension58berlinmitte.de, http://www.berlin-58-pension.de, http://www.berlin-pension-58.de, http://www.58-pension-berlin.de, http://www.pension-berlin-58.de, http://www.58-berlin-pension.de, http://www.pension-58-berlin.de, http://www.berlin58pension.de, http://www.berlinpension58.de, http://www.pensionberlin58.de, http://www.pension58berlin.de, http://www.58berlinpension.de, http://www.58pensionberlin.de, http://www.pension-58-berlin-mitte.com ...
    • Verify that a candidate belongs to the business by matching its base data
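
The brute-force generation step might look roughly like the sketch below. It is an illustration, not the actual URL finder: the function name `candidate_urls`, the `min_tokens` cutoff and the TLD list are assumptions, and it produces a superset of the examples on the slide.

```python
from itertools import permutations


def candidate_urls(name, tlds=("de", "com"), min_tokens=3):
    """Generate candidate homepage URLs from the words of a business name."""
    tokens = name.lower().split()
    urls = set()
    # Try every ordering of every sufficiently large subset of the words,
    # joined either with hyphens or with nothing at all.
    for size in range(min_tokens, len(tokens) + 1):
        for combo in permutations(tokens, size):
            for sep in ("-", ""):
                for tld in tlds:
                    urls.add(f"http://www.{sep.join(combo)}.{tld}")
    return sorted(urls)


candidates = candidate_urls("Berlin Pension 58 Mitte")
# Each candidate would then be fetched and checked against the base data
# (name, address, phone) before being accepted; that step is omitted here.
```

The candidate set grows factorially with the number of words, so the verification step is where most of the network and CPU cost would land.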
  10. BIG DATA?
    • How big is it? CommonCrawl.org (45TB), 5B pages
    • Size for crawl data: ~100GB per country
    • Speed matters
    • CPU/hour costs
  11. WE

  12. APPLICATION PROFILES
    ★ Grid (PF) (high cpu)
    ★ API (iPhone API and search, low usage)
    ★ Batch Database processing (DB)
    ★ Data management apps (low usage)
    ★ Scraper (high network traffic and disk usage)
    ★ Build infra (monitoring, SCM, build server...)
  13. GRID COMPUTING
    • GridGain
    • Master-node model
    • Shared design pattern for our applications
    • Dynamic node discovery by S3 (multicast doesn't work on EC2)
    • URL deployment
  14. CONSOLE
    ★ Console is useful. Speed and UI need to be improved.
    ★ Mobile apps need some love (esp. iOS)
  15. EC2
    • We primarily operate in eu-west
    • Base AMI is Ubuntu Linux 11.10 (Oneiric), preloaded with a few basic packages.
    • Instance Types
      • t1.micro for admin
      • m1.large is standard
      • c1.xlarge for grid
  16. SPOT INSTANCES
    • Using spot instances to cut the costs
    • Now standard procedure
    • Be prepared and design for failure
  17. DO THE MATH
    • 60.5 GiB of memory
    • 88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
    • 3370 GB of instance storage

    • 7.5 GiB memory
    • 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
    • 850 GB instance storage
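
The arithmetic the slide invites can be worked through directly from the two spec lists. The sketch below computes only the ratios between them; the slide gives no prices, so none are assumed, and the dictionary keys are illustrative names.

```python
# Specs copied from the slide: first list ("big") and second list ("small").
big = {"memory_gib": 60.5, "ecu": 88, "storage_gb": 3370}
small = {"memory_gib": 7.5, "ecu": 4, "storage_gb": 850}

# How many of the small instances one big instance is worth, per resource.
ratios = {key: big[key] / small[key] for key in big}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

For a CPU-bound grid workload the compute ratio is the interesting one: the larger instance offers 22 times the EC2 Compute Units but only about 8 times the memory and about 4 times the storage.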
  18. S3
    • Deployment
    • Templates and configuration
    • Discovery of the grid (used in GG, ElasticSearch and Hadoop)
  19. AUTOSCALING
    Lets you scale the capacity of a group of EC2 instances up or down automatically
    • Checks whether the number of running servers matches your criteria; if not, it starts or stops some
    • If an alarm is triggered (or cleared), it can automatically add extra nodes (or terminate the excess ones)
    • It scales according to the defined policy
    • AS is ideal for failover or for handling burst (spike) load (TechCrunch effect)
  20. AUTOSCALING
    ✓ Define which configuration (AMI, SG, PKI) you want to run
    ✓ Define the lower and upper bounds on the number of instances you want to spawn
    ✓ Define alarms (CPU, I/O, network latency...) used as triggers to scale up (or down)
    ✓ Associate each alarm with a policy (e.g. if you get the CPU alarm, just start a new instance)
  21. AUTOSCALING
    In our case:
    ★ We use it to ensure HA of our APIs
    ★ We use it in combination with ELB
    ★ If a node goes down, the autoscaling policy spawns a new one and adds it to the ELB
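
The behaviour described on the autoscaling slides (bounds, alarm-driven policies, replacing a failed node) can be modelled in a few lines. This is a local simulation under stated assumptions, not a call into the AWS API: the class `ScalingGroup`, the function `on_alarm` and the +1/-1 adjustments are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingGroup:
    min_size: int   # lower bound of instances
    max_size: int   # upper bound of instances
    desired: int    # current desired capacity

    def apply_policy(self, adjustment: int) -> int:
        """Shift desired capacity by `adjustment`, clamped to the bounds."""
        target = self.desired + adjustment
        self.desired = max(self.min_size, min(self.max_size, target))
        return self.desired

    def reconcile(self, running: int) -> int:
        """Instances to start (positive) or stop (negative) right now."""
        return self.desired - running


def on_alarm(group: ScalingGroup, alarm_state: str) -> int:
    # Hypothetical policy: add a node when the CPU alarm fires,
    # remove one when it clears, do nothing for any other state.
    adjustment = {"ALARM": 1, "OK": -1}.get(alarm_state, 0)
    return group.apply_policy(adjustment)


group = ScalingGroup(min_size=2, max_size=4, desired=2)
after_spike = [on_alarm(group, "ALARM") for _ in range(3)]  # capped at max
replacements = group.reconcile(running=3)  # one node died: start one more
```

The high-availability use case from slide 21 corresponds to `reconcile`: no alarm is involved, the group simply notices that fewer instances are running than desired.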
  22. • A name server managed by Amazon
    • AWS comes with a clever way to resolve names: the hostname prod-scriptdriver-i6-master.infohubble.net resolves to the local IP address when queried from inside AWS, and to the public one when queried from outside
    • Very handy for Security Group configuration
  23. • Since nodes change IP upon every restart, we need a way to uniquely reference an instance
    • Custom script baked into the AMI
    • It extracts its own name from user-data
    • ... and updates its own hostname (updates the CNAME entry) using the Boto API at boot time
    • It deletes the Route 53 entry at shutdown
  24. RDS
    • Main features are snapshots and auto-upgrades
    • Failover, multi-AZ, scaling (MySQL cluster is hard)
    • Not super useful in our case
    • Pro-tip™: If you terminate an RDS instance, all its automated snapshots are deleted...
  25. EMR
    Hosted Hadoop service
    • Built on top of EC2 and S3
    • Good for batch processing
    • Way easier to get started with Hadoop than setting it up manually
    • Makes sense for very large data sets
  26. EMR
    Our use case: Andromeda
    Prototype to extract and index URLs from the Common Crawl dataset (45TB of data)
    • Map task: gets the file and crunches the web pages to extract URLs from links/anchors
    • Reduce task: collects them and bulk-inserts them into CouchDB
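
The map and reduce steps can be sketched locally without Hadoop. This is an illustration only, not the Andromeda code: the function names are hypothetical, the document shape (`_id` set to the URL) is an assumption, and the payload is merely built in the `{"docs": [...]}` form that CouchDB's `_bulk_docs` endpoint accepts, not sent anywhere.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.urls.append(value)


def map_page(html: str) -> list:
    """Map step: one web page in, the list of URLs it links to out."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.urls


def reduce_urls(url_lists) -> str:
    """Reduce step: de-duplicate URLs and build one bulk-insert payload."""
    unique = sorted({url for urls in url_lists for url in urls})
    return json.dumps({"docs": [{"_id": url} for url in unique]})


pages = [
    '<a href="http://a.example/">A</a> <a href="/relative">skip</a>',
    '<a href="http://a.example/">A</a> <a href="https://b.example/x">B</a>',
]
payload = reduce_urls(map_page(page) for page in pages)
```

Batching the insert in the reduce step keeps the number of database round trips proportional to the number of reducers rather than the number of URLs.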
  27. IAM
    • Developer users for the console
    • System users (backups, deployment, Route 53)
    • Policy Generator is crazy
  28. CHEF
    Where Chef is hard:
    ★ Booting fast
    ★ Creating node, spot instances, attach/detach disks
    ★ Custom settings: bags are nice but don't make your life easier
    ★ Idempotent recipes
    ★ Our limited understanding of Ruby
    ★ Limited set of "Resources"
  29. CUSTOM SCRIPTS
    • Boto library (Python)
    • AWS Java API
    • Tokens in templates are replaced with values supplied in user-data
      user-data: _HOSTNAME=prod-finders-node5.infohubble.net&_BUCKET=infohubble-grid-v1.1231&_PORT=47500&_MAXJOBS=35
    • Next steps?
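
The token replacement could work along the lines of the sketch below. It assumes the user-data string on the slide is a single query-string-style line (the spaces in the transcript are treated as line-wrap artifacts) and that templates contain the bare token names; the template text and the function names are hypothetical.

```python
from urllib.parse import parse_qsl


def parse_user_data(user_data: str) -> dict:
    """Split "_KEY=value&_KEY2=value2" user-data into a token dictionary."""
    return dict(parse_qsl(user_data, keep_blank_values=True))


def render(template: str, tokens: dict) -> str:
    """Replace every token name found in the template with its value."""
    # Longest names first, so a short token never clobbers part of a longer one.
    for key in sorted(tokens, key=len, reverse=True):
        template = template.replace(key, tokens[key])
    return template


user_data = (
    "_HOSTNAME=prod-finders-node5.infohubble.net"
    "&_BUCKET=infohubble-grid-v1.1231&_PORT=47500&_MAXJOBS=35"
)
tokens = parse_user_data(user_data)

# Hypothetical configuration template baked into the AMI.
template = "host=_HOSTNAME\nport=_PORT\nmax.jobs=_MAXJOBS\n"
config = render(template, tokens)
```

Keeping all per-node settings in user-data means one AMI can serve every role, with the boot script specializing it at start-up.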
  30. BACKUP
    • Every node is responsible for its own backup + attached volumes
    • crontab triggers a script to "snapshot itself" (via EC2 API)
    • Managed snapshot rotation (last 3 days)
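
The rotation rule can be expressed as a small pure function. This sketch interprets "last 3 days" as an age cutoff, which is an assumption; the snapshot IDs are made up, and the actual EC2 calls to list and delete snapshots are left out.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, keep_days=3):
    """Return the IDs of snapshots older than `keep_days`.

    `snapshots` is an iterable of (snapshot_id, created_at) pairs.
    """
    cutoff = now - timedelta(days=keep_days)
    return [snap_id for snap_id, created_at in snapshots if created_at < cutoff]


now = datetime(2013, 2, 13, 12, 0)
snapshots = [
    ("snap-old", now - timedelta(days=5)),
    ("snap-recent", now - timedelta(days=2)),
    ("snap-today", now - timedelta(hours=1)),
]
expired = snapshots_to_delete(snapshots, now)
```

Passing `now` in explicitly keeps the rule easy to test; the cron-driven script would supply the current time.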
  31. COOL UTILITIES (Open Source)
    $ ec2: List servers and log in with SSH
    $ ec2 -c "command": Perform remote actions
  32. LOGSTASH
    • Logstash agent running on each node.
    • Plugins
      • input: file, syslog, log4j
      • filter: parse/modify (grok patterns), anonymize, CSV, XML
      • output: ElasticSearch, *queue, Redis, Riak, http,...
    • Index and aggregate (ElasticSearch)
    • Search and display via Kibana (optional)
  33. MONITORING SNS
    • Custom scripts to check specific disk sizes
    • SNS notifications to send alerts to the right groups of people (PagerDuty!)
  34. ISSUES WITH AWS
    ★ Reserved Instances
    ★ Not always elastic enough
    ★ Policy Generator
    ★ Security Group admin
    ★ RDS
  35. ISSUES WITH AWS
    ★ Very high CPU ☛ AWS kills the server
    ★ "Stolen CPU"
    ★ EBS speed is variable
    ★ Snapshots taking forever
  36. CLOSING THOUGHTS
    ★ "Large datacenter" with small team
    ★ Size does not matter, efficiency does.
    ★ Automate everything!
    ★ Learning curve
    ★ Buy peace of mind