Amazon Web Services User Group - High Performance Data Processing at infohubble

PUSHING THE BOUNDARIES
High Performance Data Processing at infohubble
Amazon Web Services User Group, Amsterdam, 13 Feb 2013
by Nicolas Peeters

ABOUT ME
Nicolas Peeters, Head of Development
@peetersn [email protected] peetersn

MISSION
Specialized in content aggregation, delivering enriched, high-quality business data

CONTENT AGGREGATION ENRICHED & QUALITY DATA

STARTING POINT
entity: ORG, PHYSICAL, WEB PRESENCE

BASE DATA
• Name
• Address
• Phone/Fax
• Geo
• URL
vs
RICH DATA
• Opening Hours
• Taxonomy
• Menu, Reservation
• Classification
• Price Levels

DATA = HARD

CHALLENGES
• A lot of sources
• Entities are disconnected
• Everything is (kind of) duplicated
• Disambiguation
• Mashup (for data)
• SEO trickery
• Legal (Feist v. Rural)

WHO CARES?
• Search engines (local)
• Mobile apps
• Yellow Pages/Directories
• Mapping/Navigation
• (Mobile) Phone operators
• Business Intelligence

PRODUCING DATA

• starbucks
• starbucks coffee
• starbucks coffee company
• starbucks coffee co
• starbuck's corporation
• starbuck's coffee
• starbucks coffee co.
• starbucks coffee co inc
• starbuck`s
• ...
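
A minimal sketch of how name variants like these could be folded into one comparable key. The suffix list and the `normalize_name` helper are assumptions made for illustration, not the production logic, and the sketch is ASCII-only for brevity.

```python
import re

# Hypothetical suffix list; a production list would be maintained per country.
LEGAL_SUFFIXES = {"co", "inc", "corp", "corporation", "company", "ltd"}


def normalize_name(raw: str) -> str:
    """Reduce a raw business name to a comparable key."""
    text = raw.lower()
    # Fold possessives: "starbuck's" and "starbuck`s" both become "starbucks".
    text = re.sub(r"['`]s\b", "s", text)
    # Turn any remaining punctuation into spaces.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # Drop legal-form tokens such as "co" or "inc".
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


variants = ["starbucks coffee co.", "starbucks coffee co inc", "starbuck's coffee"]
keys = {normalize_name(v) for v in variants}  # all three collapse to one key
```

Keys such as "starbucks" and "starbucks coffee" still differ after this step, so a real pipeline would need a further matching stage on top.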

ADDRESS
• Nicolas Peeters Gustav Malherplein 82 1082MA
• infohubble c/o Nicolas Peeters Verd. 1 82 G. Mahlerplein 1082MA Amsterdam Nederland
• Nicolas Peeters Gustav Mahlerplein, nr. 82 1082 MA Amsterdam
• Nicolas Peeters infohubble Sevilla Building Gustav Mahlerplein 1082MA Amsterdam

ADDRESS PARSER
★ Grammar Based Parser (NLP, language/locale detection)
★ Normalization
★ Extraction from raw text
★ Multi-country
★ Unique capability

GEOCODES
• Google Maps API
• Address normalization
• Pretty

OPENING HOURS (small sample)
• 12-15 Uhr, tgl. ab 11 Uhr, Küche 12-15 + 18-23.30 Uhr
• Mo - So 12:00 - 00:00 Sonntag Ruhetag
• Mo-Fr 12-15 Uhr, Mo-Sa ab 10 Uhr, So ab 11 Uhr
• Son- Don 12 -00 sam und Frei. 12- ?
• ab Mai tgl. ab 19 Uhr, ab September tgl. ab 18 Uhr
• Mo.-Sa. 11-3 Uhr
• Mo - Fr 09:00 - 00:00 (open end) Sa + So 10:00 - 00:00 (open end)
• Mo – Do 11.30 – 1.00 Uhr
• L-J 9:00 23:30 V-S 9:00 1:00 D 9:0 23:30
• Mon - Fri unless specified Open: 0900 (Tue 0930) - 1700 (Thu 1800) (Sat.): 1000 - 1500
• Sun 12:00 ż 23.00
• Mon-Sun 12N-12M Fri-Sat -1am Sun -11.30pm
• M-F 8:00AM-4:30PM LUNCH 12:30P-1:30P
• Winter Hours Daily 10-8 Saturday 10-7 Sunday 12-5
• M-F: 8:00am Sat.: - - Sun.: closed
• Daily til 1am, 2am on Fri/Sat
• Mon-Fri (6am-8pm) Sat-Sun (7am-8pm)
• Opening times: 10am - 1am (weekends 2am)

OPENING HOURS
★ Prefix match in text (locale-aware)
★ Parsing (custom grammar)
★ Must meet minimum requirements (hours / days)
★ The matched structure is analyzed to ensure it contains enough raw information
★ If multiple opening-hours strings are detected, check that they are equivalent
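
To make the parse-then-compare idea concrete, here is a small sketch. It swaps the custom grammar for a single regular expression that only understands "day - day hour - hour" strings, and the `DAYS` table, `parse_opening_hours` and `equivalent` are hypothetical names introduced for the example.

```python
import re

# Hypothetical abbreviation table covering a few German and English forms.
DAYS = {
    "mo": 0, "mon": 0, "di": 1, "tue": 1, "mi": 2, "wed": 2, "do": 3, "thu": 3,
    "fr": 4, "fri": 4, "sa": 5, "sat": 5, "so": 6, "sun": 6,
}

# <day> - <day> <hour>[:.<min>] - <hour>[:.<min>]
PATTERN = re.compile(
    r"\b(?P<d1>[A-Za-z]{2,3})\.?\s*[-–]\s*(?P<d2>[A-Za-z]{2,3})\.?\s+"
    r"(?P<h1>\d{1,2})(?:[:.](?P<m1>\d{2}))?\s*[-–]\s*"
    r"(?P<h2>\d{1,2})(?:[:.](?P<m2>\d{2}))?"
)


def parse_opening_hours(text):
    """Return (weekday, open_minute, close_minute) tuples; empty if nothing usable."""
    entries = []
    for m in PATTERN.finditer(text):
        d1, d2 = DAYS.get(m["d1"].lower()), DAYS.get(m["d2"].lower())
        if d1 is None or d2 is None:
            continue  # not a known day abbreviation
        opens = int(m["h1"]) * 60 + int(m["m1"] or 0)
        closes = int(m["h2"]) * 60 + int(m["m2"] or 0)
        # A range such as Sa-Mo wraps around the end of the week.
        days = range(d1, d2 + 1) if d1 <= d2 else [*range(d1, 7), *range(0, d2 + 1)]
        entries.extend((day, opens, closes) for day in days)
    return entries


def equivalent(a, b):
    """Two raw strings are equivalent if they parse to the same non-empty set."""
    parsed_a, parsed_b = set(parse_opening_hours(a)), set(parse_opening_hours(b))
    return bool(parsed_a) and parsed_a == parsed_b
```

An empty result doubles as a crude "minimum requirements" check: a string like "Daily til 1am" yields no days and hours and would be rejected by this sketch.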

URL FINDER
• Start with the name "Berlin Pension 58 Mitte"
• Brute-force generate the candidate URLs: http://www.pension-58-berlin-mitte.de, http://www.pension58berlinmitte.de, http://www.berlin-58-pension.de, http://www.berlin-pension-58.de, http://www.58-pension-berlin.de, http://www.pensionberlin-58.de, http://www.58-berlin-pension.de, http://www.pension-58-berlin.de, http://www.berlin58pension.de, http://www.berlinpension58.de, http://www.pensionberlin58.de, http://www.pension58berlin.de, http://www.58berlinpension.de, http://www.58pensionberlin.de, http://www.pension-58-berlin-mitte.com ...
• Verify that a candidate belongs to the business by matching its base data
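
The brute-force step can be sketched as permutations of the name tokens joined with a separator. The function name, the default TLDs and the `min_tokens` cut-off are assumptions for illustration; mixed forms such as "pensionberlin-58" from the list above are not generated by this simplified version.

```python
from itertools import permutations


def candidate_urls(name, tlds=("de", "com"), separators=("-", ""), min_tokens=3):
    """Generate candidate URLs from every ordering of the name tokens."""
    tokens = name.lower().split()
    seen = {}  # dict keeps insertion order and removes duplicates
    for size in range(min(min_tokens, len(tokens)), len(tokens) + 1):
        for ordering in permutations(tokens, size):
            for sep in separators:
                for tld in tlds:
                    seen[f"http://www.{sep.join(ordering)}.{tld}"] = None
    return list(seen)


urls = candidate_urls("Berlin Pension 58 Mitte")
```

Each candidate would then be fetched and checked against the base data (name, address, phone) before it is accepted.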

SCRAPER

VERIFICATION

BIG DATA? MEDIUM DATA BIG CRUNCHING™

BIG DATA?
• How big is it? CommonCrawl.org (45TB), 5B pages
• Size for crawl data: ~100GB per country
• Speed matters
• CPU/Hour Costs

GOOGLE EARTH

page intentionally left blank

HARDWARE

APPLICATION PROFILES
★ Grid (PF) (high CPU)
★ API (iPhone API and search, low usage)
★ Batch Database processing (DB)
★ Data management apps (low usage)
★ Scraper (high network traffic and disk usage)
★ Build infra (monitoring, SCM, build server...)

GRID COMPUTING
• GridGain
• Master-node model
• Shared design pattern for our applications
• Dynamic node discovery via S3 (multicast doesn't work on EC2)
• URL deployment
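
Because multicast is unavailable, each node can announce itself by writing a small object to a shared bucket and find its peers by listing the bucket. The sketch below illustrates that idea only: `InMemoryBucket` is a stand-in for S3 so the example runs without AWS credentials, and the key layout is an assumption, not GridGain's actual format.

```python
class InMemoryBucket:
    """Stand-in for an S3 bucket (put / delete / list by prefix)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body=b""):
        self._objects[key] = body

    def delete(self, key):
        self._objects.pop(key, None)

    def list_keys(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


PREFIX = "nodes/"  # hypothetical key prefix


def register(bucket, ip, port):
    """Announce this node: the key itself carries the address."""
    bucket.put(f"{PREFIX}{ip}#{port}")


def unregister(bucket, ip, port):
    bucket.delete(f"{PREFIX}{ip}#{port}")


def discover(bucket):
    """List all announced peers as (ip, port) tuples."""
    peers = []
    for key in bucket.list_keys(PREFIX):
        ip, port = key[len(PREFIX):].split("#")
        peers.append((ip, int(port)))
    return peers
```

A node that dies without unregistering leaves a stale key behind, so a real setup also needs some form of cleanup or liveness check.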

DATABASES JasDB

AWS SERVICES WE USE

CONSOLE
★ Console is useful. Speed and UI need to be improved.
★ Mobile apps need some love (esp. iOS)

EC2
• We primarily operate in eu-west
• Base AMI is Ubuntu Linux 11.10 (Oneiric), preloaded with a few basic packages
• Instance types
  • t1.micro for admin
  • m1.large is standard
  • c1.xlarge for grid

PRICING
On Demand > Reserved > Spot

SPOT INSTANCES
• Using spot instances to cut the costs
• Now standard procedure
• Be prepared and design for failure

DO THE MATH
• 60.5 GiB of memory
• 88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
• 3370 GB of instance storage
vs
• 7.5 GiB of memory
• 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
• 850 GB of instance storage
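
The arithmetic behind the comparison, using only the figures on the slide. The `InstanceSpec` type and the names `big` and `small` are labels chosen for this sketch; hourly prices are not given in the deck, so the cost helper takes the price as a parameter instead of assuming one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    memory_gib: float
    compute_units: int
    storage_gb: int


big = InstanceSpec(memory_gib=60.5, compute_units=88, storage_gb=3370)
small = InstanceSpec(memory_gib=7.5, compute_units=4, storage_gb=850)


def ratios(a, b):
    """How many times larger a is than b, per resource."""
    return {
        "memory": a.memory_gib / b.memory_gib,
        "compute": a.compute_units / b.compute_units,
        "storage": a.storage_gb / b.storage_gb,
    }


def cost_per_compute_unit(hourly_price, spec):
    """Price per EC2 Compute Unit per hour; the price is supplied by the caller."""
    return hourly_price / spec.compute_units


r = ratios(big, small)  # compute: 22x, memory: about 8x, storage: about 4x
```

For CPU-bound work the compute ratio is the one that matters: the large instance is worth 22 of the small ones, so it wins whenever its hourly price is less than 22 times higher.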

VOLUMES • Ephemeral storage • EBS • Provisioned IOPS

S3
• Deployment
• Templates and configuration
• Discovery of the grid (used in GG, ElasticSearch and Hadoop)

ELB
• Trivial to implement
• Define group of instances
• Health Check

AUTOSCALING
Lets you scale the capacity of a group of EC2 instances up or down automatically
• Checks whether the number of running servers matches your criteria; if not, it starts or stops some
• If an alarm is triggered (or cleared), it can automatically add extra nodes (or terminate the excess ones)
• It scales according to the defined policy
• AS is ideal for failover or to handle burst load spikes (TechCrunch effect)

AUTOSCALING
✓ Define which configuration (AMI, SG, PKI) you want to run
✓ Define the lower and upper bound of instances you want to spawn
✓ Define alarms (CPU, I/O, network latency...) used as triggers to scale up (or down)
✓ Associate each alarm with a policy (e.g. if you get the CPU alarm, just start a new instance)
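
The decision an autoscaling group makes can be reduced to a few lines. This is a toy model of the behaviour described above, not the AWS API; the function name and the "high"/"low" alarm labels are made up for the sketch.

```python
def desired_capacity(running, lower, upper, alarm=None, step=1):
    """Return how many instances the group should have.

    alarm: "high" (e.g. CPU alarm triggered) scales up by `step`,
           "low" scales down by `step`, None leaves the count alone.
    The result is always clamped to the [lower, upper] bounds, which is
    also what replaces a node that died.
    """
    if alarm == "high":
        running += step
    elif alarm == "low":
        running -= step
    return max(lower, min(upper, running))
```

With bounds of 2 and 6, a CPU alarm on 2 running instances gives 3, the same alarm at 6 stays at 6, and a group that dropped to 1 after a failure is brought back to 2.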

AUTOSCALING
In our case:
★ We use it to ensure HA of our APIs
★ We use it in combination with ELB
★ If a node goes down, the autoscaling policy spawns a new one and adds it to the ELB

ROUTE 53

• A name server managed by Amazon
• AWS comes with a clever way to resolve names: the hostname prod-scriptdriver-i6-master.infohubble.net resolves to the local IP address when queried from inside AWS, and to the public one when queried from outside
• Very handy for Security Group configuration

• Since nodes change IP upon every restart, we need a way to uniquely reference an instance
• Custom script baked into the AMI
• It extracts its own name from user-data
• ... and updates its own hostname (updates the CNAME entry) using the Boto API at boot time
• It deletes the Route 53 entry at shutdown
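
A sketch of the record such a boot script would submit. The dictionary follows the change-batch shape accepted by boto3's `change_resource_record_sets`; the original script used the older Boto library, whose calls differ. The helper name, the TTL and the example target are assumptions, and pointing the CNAME at the instance's public DNS name is assumed here because that is what produces the inside/outside resolution described earlier.

```python
def cname_change_batch(action, hostname, target, ttl=60):
    """Build a Route 53 change batch for a single CNAME record."""
    return {
        "Changes": [
            {
                "Action": action,  # "UPSERT" at boot, "DELETE" at shutdown
                "ResourceRecordSet": {
                    "Name": hostname,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ]
    }


# Placeholder target: the instance's public DNS name as reported by EC2.
boot = cname_change_batch(
    "UPSERT",
    "prod-scriptdriver-i6-master.infohubble.net",
    "ec2-203-0-113-10.eu-west-1.compute.amazonaws.com",
)
```

A DELETE must repeat the same name, type, TTL and value as the existing record, so the shutdown hook has to rebuild the batch with identical values.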

CLOUDWATCH super useful zone!

RDS
• Main features are snapshots and auto-upgrades
• Failover, multi-AZ, scaling (MySQL cluster is hard)
• Not super useful in our case
• Pro-tip™: If you terminate an RDS instance, its automated snapshots are deleted with it (manual snapshots are kept)...

EMR
Hosted Hadoop service
• Built on top of EC2 and S3
• Good for batch processing
• Way easier to get started with Hadoop than setting it up manually
• Makes sense for very large data sets

EMR
Our use case: Andromeda
Prototype to extract and index URLs from the common-crawl dataset (45TB of data)
• Map task: gets the file and crunches the web pages to extract URLs from links/anchors
• Reduce task: collects them and makes a bulk insert into CouchDB
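
The map and reduce steps can be sketched in plain Python without Hadoop. Grouping by host and the document shape are assumptions for illustration; the real job's keys and the way it reads common-crawl files are not described in the deck.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def map_page(page_url, html):
    """Map step: emit (host, url) pairs for every http(s) link on the page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    for link in parser.links:
        parts = urlsplit(link)
        if parts.scheme in ("http", "https") and parts.netloc:
            yield parts.netloc.lower(), link


def reduce_links(pairs):
    """Reduce step: one document per host, ready for a bulk insert."""
    grouped = {}
    for host, link in pairs:
        grouped.setdefault(host, set()).add(link)
    return [{"_id": host, "urls": sorted(urls)} for host, urls in sorted(grouped.items())]
```

The list returned by `reduce_links` is shaped so that it could be wrapped as `{"docs": [...]}` and posted to CouchDB's `_bulk_docs` endpoint.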

IAM
• Developer users for the console
• System users (backups, deployment, Route 53)
• Policy Generator is crazy

AUTOMATION AND CUSTOM-STUFF

CHEF
Where Chef is hard:
★ Booting fast
★ Creating nodes, spot instances, attaching/detaching disks
★ Custom settings: bags are nice but don't make your life easier
★ Idempotent recipes
★ Our limited understanding of Ruby
★ Limited set of "Resources"

CUSTOM SCRIPTS
• Boto library (Python)
• AWS Java API
• Tokens in templates are replaced with values supplied in user-data
user-data: _HOSTNAME=prod-finders-node5.infohubble.net&_BUCKET=infohubble-grid-v1.1231&_PORT=47500&_MAXJOBS=35
• Next steps?
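
A minimal sketch of the token substitution, assuming the user-data string is query-string encoded as in the example above. The template text and the helper names are hypothetical.

```python
from urllib.parse import parse_qsl

USER_DATA = (
    "_HOSTNAME=prod-finders-node5.infohubble.net"
    "&_BUCKET=infohubble-grid-v1.1231&_PORT=47500&_MAXJOBS=35"
)


def parse_user_data(user_data):
    """Split 'KEY=value&KEY=value' user-data into a dict of tokens."""
    return dict(parse_qsl(user_data))


def render(template, tokens):
    """Replace each token in the template with its value.

    Longer token names are replaced first so that one token name that is a
    prefix of another cannot clobber it.
    """
    for name in sorted(tokens, key=len, reverse=True):
        template = template.replace(name, tokens[name])
    return template


# Hypothetical configuration template baked into the AMI.
TEMPLATE = "discovery.bucket=_BUCKET\ndiscovery.port=_PORT\nmax.jobs=_MAXJOBS\n"
config = render(TEMPLATE, parse_user_data(USER_DATA))
```

On a real instance the user-data string would be read from the instance metadata service at boot instead of being hard-coded.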

BUILD ENVIRONMENT

ADD NODES

BACKUP
• Every node is responsible for its own backup + attached volumes
• crontab triggers a script to "snapshot itself" (via EC2 API)
• Managed snapshot rotation (last 3 days)
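
The rotation rule can be sketched independently of the EC2 API: given the snapshots a node owns, pick those older than the retention window. The snapshot dictionaries and the function name are assumptions; a real script would obtain the list from EC2 and delete the returned IDs.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, now, keep_days=3):
    """Return the IDs of snapshots created before the retention cutoff."""
    cutoff = now - timedelta(days=keep_days)
    return [s["id"] for s in snapshots if s["created"] < cutoff]


now = datetime(2013, 2, 13, 12, 0, tzinfo=timezone.utc)
snapshots = [
    {"id": "snap-old", "created": now - timedelta(days=5)},
    {"id": "snap-edge", "created": now - timedelta(days=3)},
    {"id": "snap-new", "created": now - timedelta(hours=6)},
]
expired = snapshots_to_delete(snapshots, now)
```

Passing `now` in explicitly keeps the rule easy to test; the cron job would call it with the current UTC time.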

"Automating now will save your money down the road."

COOL UTILITIES
$ ec2: List servers and log in with SSH
$ ec2 -c "command": Perform remote actions
Open Source

Open Source
https://github.com/sirio7g/ec2
https://github.com/infohubble/ec2

LOGGING +

LOGSTASH
• Logstash agent running on each node
• Plugins
  • input: file, syslog, log4j
  • filter: parse/modify (grok patterns), anonymize, CSV, XML
  • output: ElasticSearch, *queue, Redis, Riak, http, ...
• Index and aggregate (ElasticSearch)
• Search and display via Kibana (optional)

KIBANA

MONITORING
• Application-specific (Metrics library from Coda Hale)
• Standard CloudWatch (CPU, memory and network latency)

MONITORING SNS
• Custom scripts to check specific disk sizes
• SNS notifications to send alerts to the right groups of people (PagerDuty!)
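
A sketch of what such a disk check might look like. The threshold, the message format and the `disk_alert` name are assumptions; the function only builds the alert text, and publishing it (for example with an SNS publish call to a topic) is left to the caller.

```python
import shutil


def disk_alert(path, max_used_fraction=0.9, usage_fn=shutil.disk_usage):
    """Return an alert message if `path` is fuller than the threshold, else None.

    usage_fn must return (total, used, free) in bytes; it is injectable so the
    check can be exercised without touching a real disk.
    """
    total, used, _free = usage_fn(path)
    fraction = used / total
    if fraction <= max_used_fraction:
        return None
    return f"Disk {path} is {fraction:.0%} full"


message = disk_alert("/")  # None unless the root volume is over 90% used
```

Run from cron, the script would send `message` to the notification topic whenever it is not None.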

METERING

ISSUES WITH AWS
★ Reserved Instances
★ Not always elastic enough
★ Policy Generator
★ Security Group admin
★ RDS

ISSUES WITH AWS
★ Very high CPU ☛ AWS kills the server
★ "Stolen CPU"
★ EBS speed is variable
★ Snapshots taking forever

CLOSING THOUGHTS
★ "Large datacenter" with small team
★ Size does not matter, efficiency does.
★ Automate everything!
★ Learning curve
★ Buy peace of mind

THANKS!

@peetersn [email protected]
