
You know, for search. Querying 24 Billion Records in 900ms.

Who doesn't love building highly available, scalable systems that hold multiple terabytes of data? We recently had the pleasure of cracking a few tough nuts along the way, and we'd love to share with the community what we learned designing, building and operating a 120-node, 6 TB Elasticsearch (and Hadoop) cluster:

Jodok Batlogg

June 07, 2012

Transcript

  1. 5 x m2.2xlarge + c1.xlarge, EBS - bash - find - zcat - curl.
     ES as document store - 5 instances - weekly indexes - 2 replicas - EBS volume.
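     A minimal sketch of that bash/find/zcat/curl ingestion style, assuming a
     hypothetical weekly index name (tweets-YYYY.WW), dumps that are already in
     bulk (action + source) format, and current Elasticsearch APIs; the deck
     itself shows no commands:

       #!/usr/bin/env bash
       INDEX="tweets-$(date +%G.%V)"   # weekly index; name pattern assumed

       # Create this week's index with 2 replicas, as described on the slide.
       curl -s -XPUT "localhost:9200/$INDEX" \
            -H 'Content-Type: application/json' \
            -d '{"settings": {"number_of_replicas": 2}}'

       # find the gzipped dumps, zcat them, and stream them into the bulk API.
       find /data/dumps -name '*.json.gz' | while read -r f; do
         zcat "$f" | curl -s -XPOST "localhost:9200/$INDEX/_bulk" \
              -H 'Content-Type: application/x-ndjson' --data-binary @-
       done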
  2. HDFS / MAPRED / ES
     • Map/Reduce to push to Elasticsearch
     • via NFS to HDFS storage
     • no dedicated nodes
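     The deck shows no job code; the original was presumably Java Map/Reduce. As
     a rough, hedged illustration of "Map/Reduce pushes to Elasticsearch", here
     is a Hadoop-Streaming-style reducer in bash (host name, batch size and the
     bulk-formatted input are all assumptions):

       #!/usr/bin/env bash
       # Reducer: read newline-delimited bulk actions on stdin and ship them to
       # Elasticsearch in even-sized batches, so the action/source line pairs
       # of the bulk format are never split across requests. Requires bash 4+.
       BATCH=1000
       while mapfile -t -n "$BATCH" lines && ((${#lines[@]})); do
         printf '%s\n' "${lines[@]}" |
           curl -s -XPOST "http://es-node:9200/_bulk" \
                -H 'Content-Type: application/x-ndjson' --data-binary @-
       done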
  3. [Architecture diagram: "Hadoop Storage - Index Driver". A Namenode,
     Jobtracker and Secondary NN; six Datanodes with 2-4 Tasktracker slots and
     6x 500 GB of HDFS storage each; Hive; plus extra Tasktrackers on spot
     instances.]
  4. Adding S3 / External Tables to Hive:

     CREATE EXTERNAL TABLE $tmp_table_name (size BIGINT, path STRING)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
     STORED AS
       INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
       OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
     LOCATION 's3n://...';

     SET ...

     FROM (
       SELECT TRANSFORM (size, path)
       USING './current.tar.gz/bin/importer transform ${max_lines}'
       AS (crawl_ts INT, screen_name STRING, ... num_tweets INT)
       ROW FORMAT DELIMITED
         FIELDS TERMINATED BY '\001'
         COLLECTION ITEMS TERMINATED BY '\002'
         MAP KEYS TERMINATED BY '\003'
         LINES TERMINATED BY '\n'
       FROM $tmp_table_name
     ) f
     INSERT OVERWRITE TABLE crawls PARTITION (crawl_day='${day}')
     SELECT crawl_ts, ... user_json, tweets, ...
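     The deck does not show how this script is invoked; one plausible wrapper,
     assuming the $-placeholders are filled in by the shell before hive -f runs
     (file names and values here are hypothetical):

       export day=2012-06-01 max_lines=10000 tmp_table_name=crawl_import_tmp
       envsubst < import_crawls.hql.tpl > import_crawls.hql
       hive -f import_crawls.hql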
  5. packages:
       - puppet

     # Send pre-generated ssh private keys to the server
     ssh_keys:
       rsa_private: |
         ${SSH_RSA_PRIVATE_KEY}
       rsa_public: ${SSH_RSA_PUBLIC_KEY}
       dsa_private: |
         ${SSH_DSA_PRIVATE_KEY}
       dsa_public: ${SSH_DSA_PUBLIC_KEY}

     # set up mount points / remove default mount points
     mounts:
       - [ swap, null ]
       - [ ephemeral0, null ]

     # Additional YUM Repositories
     repo_additions:
       - source: "lovely-public"
         name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
         filename: lovely-public.repo
         enabled: 1
         gpgcheck: 1
         key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
         baseurl: "https://yum.lovelysystems.com/public/release"

     runcmd:
       - [ hostname, "${HOST}" ]
       - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
       - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
       - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4) ${HOST} ${HOST_NAME}" >> /etc/hosts ]
       - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely" ]
       - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
       - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
       - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
       ${PUPPET_PRIVATE_KEY}
       - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
       ${PUPPET_PUBLIC_KEY}
       - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
       ${PUPPET_CERT}
       - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]
       - [ sh, -c, echo " server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
       - [ sh, -c, echo " certname = ${HOST}" >> /etc/puppet/puppet.conf ]
       - [ /etc/init.d/puppet, start ]
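     For context, one way such a user-data template can be rendered and used to
     boot a node; envsubst and today's AWS CLI are assumptions, since the deck
     does not show the templating or launch step:

       # Fill in ${HOST}, key material etc., then launch with the result.
       envsubst < cloud-init.yml.tpl > user-data.yml
       aws ec2 run-instances --image-id ami-12345678 --instance-type m1.large \
           --count 1 --user-data file://user-data.yml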
  6. - IO
     - ES Memory
     - ES Backup
     - ES Replicas
     - Load while indexing
     - AWS Limits
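     Several of these pain points surface first in the stats APIs. A quick
     sketch for watching them, assuming current Elasticsearch endpoints (the
     0.19-era paths of the cluster in this deck differed):

       # Cluster health: yellow/red signals unassigned replicas.
       curl -s "localhost:9200/_cluster/health?pretty"

       # Per-node JVM heap and filesystem stats, to spot memory and IO pressure.
       curl -s "localhost:9200/_nodes/stats/jvm,fs?pretty"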
  7. • Shard allocation
     • Avoid rebalancing (Discovery Timeout)
     • Uncached Facets: https://github.com/lovelysystems/elasticsearch-ls-plugins
     • LUCENE-2205: "Rework of the TermInfosReader class to remove the Terms[],
       TermInfos[], and the index pointer long[] and create a more memory
       efficient data structure."
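     For the allocation and rebalancing points, the equivalent knobs today are
     dynamic cluster settings; the setting names below are the current ones,
     not the 0.19-era flags this cluster would have used, and the values are
     illustrative:

       curl -s -XPUT "localhost:9200/_cluster/settings" \
            -H 'Content-Type: application/json' -d '{
         "transient": {
           "cluster.routing.rebalance.enable": "none",
           "cluster.routing.allocation.node_concurrent_recoveries": 2
         }
       }'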
  8. • 6 ES master nodes (c1.xlarge)
     • 40 ES nodes per zone (m1.large, 8 EBS volumes)
     • 6-node Hadoop cluster + spot instances
     • 3 AP server / MC (c1.xlarge)
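     40 data nodes per zone across three zones accounts for most of the 120
     nodes from the intro. A hedged sketch of how such a topology is usually
     expressed in configuration; the zone attribute name and the legacy
     node.master/node.data flags are assumptions, not taken from the deck:

       # Dedicated master, no data (legacy-style elasticsearch.yml settings).
       printf '%s\n' 'node.master: true' 'node.data: false' \
                     'node.attr.zone: us-east-1a' \
             >> /etc/elasticsearch/elasticsearch.yml

       # Spread replicas across the zones (dynamic setting).
       curl -s -XPUT "localhost:9200/_cluster/settings" \
            -H 'Content-Type: application/json' -d '{
         "persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}
       }'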
  9. Cutting the cost:
     • Reduce the amount of data: use a Hadoop/MapRed transform to eliminate
       spam, irrelevant languages, ...
     • No more time-based indexes
     • Dedicated hardware
     • SSD disks
     • Share hardware for ES and Hadoop
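     A minimal sketch of that data-reduction step using Hadoop Streaming; the
     tab-separated layout, field positions and language whitelist are
     assumptions for illustration only:

       #!/usr/bin/env bash
       # filter.sh: keep English/German rows, drop rows flagged as spam
       # (tab-separated input; field positions are assumed).
       awk -F'\t' '$3 ~ /^(en|de)$/ && $4 != "spam"'

       # Map-only job: a pure filter needs no reducers.
       hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
         -D mapreduce.job.reduces=0 \
         -files filter.sh \
         -input /crawls/raw -output /crawls/filtered \
         -mapper ./filter.sh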