Web Content Analytics at Scale with Parse.ly

Elastic Co
February 17, 2016

Using Elasticsearch, Parse.ly built a time series backend for its real-time content analytics product. This talk covers time-based indices, hot/warm/cold tiers, doc values, index aliases/versioning, and other techniques used to run a multi-terabyte Elasticsearch cluster that serves time series queries at scale.



Transcript

  1. Andrew Montalenti
    CTO, Parse.ly
    @amontalenti
    Web Content Analytics
    at Scale with Parse.ly


  2. content analytics
    fuels real-time decision-making
    for people who run the largest sites


  11. 175 TB of compressed customer data.
    Growing 20 TB+ per month.


  12. before MongoDB,
    before Cassandra,
    before “NoSQL”,
    there was Lucene.


  13. 2013:
    "Can one use Solr as a
    Time Series Engine?"


  14. 2014:
    "Are Elasticsearch aggregations
    a dream query layer?"


  15. 2015:
    "Can we push ES
    to take 10K+ writes/sec and
    store 10TB+ of customer data?"


  16. (turns out, answer to each question
    is "yes", but lots of caveats!)


  17. fielddata: don't do it!

    doc_values All The Things!
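A minimal sketch of what "doc_values All The Things" looks like in a circa-1.x mapping, where doc_values (on-disk columnar storage) had to be enabled per field instead of heap-resident fielddata. Field names borrow from the rollup documents later in the deck; the checker function is purely illustrative. (Elasticsearch 2.0 later made doc_values the default for `not_analyzed` fields.)

```python
# Sketch of a 1.x-era mapping where every field we sort or aggregate on
# is backed by doc_values instead of fielddata.
mapping = {
    "rollup": {
        "properties": {
            "url": {"type": "string", "index": "not_analyzed", "doc_values": True},
            "ts": {"type": "date", "doc_values": True},
            "metrics": {
                "properties": {
                    "$all/page_views": {"type": "long", "doc_values": True},
                }
            },
        }
    }
}

def all_doc_values(props):
    """Return True if every leaf field in a mapping carries doc_values."""
    for field in props.values():
        if "properties" in field:
            if not all_doc_values(field["properties"]):
                return False
        elif not field.get("doc_values"):
            return False
    return True

print(all_doc_values(mapping["rollup"]["properties"]))  # True
```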


  18. _source: don't do it!
    especially if your schema has
    high-cardinality multi-value fields.
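Disabling `_source` is a one-line mapping setting; a sketch (field name taken from the rollup documents in this deck) — the trade-off is that you lose reindex-from-source and GETs of the original document body:

```python
# Sketch: turn off _source storage so a huge, high-cardinality
# multi-value field like "visitors" is never stored verbatim --
# it is only indexed (and/or kept as doc_values).
mapping = {
    "rollup": {
        "_source": {"enabled": False},
        "properties": {
            "visitors": {"type": "string", "index": "not_analyzed"},
        },
    }
}
```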


  19. "logstash-style" raw records
    are nice, but...

    to operate with good query latency,
    you need rollups, and these are tricky.


  20. {
      "url": "http://arstechnica.com/12345",
      "ts": "2015-01-02T00:00:00.000Z",
      "visitors": ["3f3f", "3f3g", ...millions],

      "metrics": {
        "$all/page_views": 6200000,
        "desktop/page_views": 4200000,
        "mobile/page_views": 2000000,
        "$all/engaged_secs": 275000000,
        "new/engaged_secs": 250000000,
        "returning/engaged_secs": 25000000
      },

      "metas": {
        "title": "Obama gives speech",
        "authors": ["Mike Journo"],
        "section": "Politics",
        "pub_date": "2015-01-02T08:00:00.000Z"
      }
    }
    callouts: url/ts = partition and time bucket; visitors = high-cardinality metric;
    metrics = numeric metrics; metas = metadata
    1day rollup (1 per day)


  21. {
      "url": "http://arstechnica.com/12345",
      "ts": "2015-01-02T08:05:00.000Z",
      "visitors": ["3f3f", "3f3g", ...hundreds],

      "metrics": {
        "$all/page_views": 62,
        "desktop/page_views": 42,
        "mobile/page_views": 20,
        "$all/engaged_secs": 275,
        "new/engaged_secs": 250,
        "returning/engaged_secs": 25
      },

      "metas": {
        "title": "Obama gives speech",
        "authors": ["Mike Journo"],
        "section": "Politics",
        "pub_date": "2015-01-02T08:00:00.000Z"
      }
    }
    callouts: url/ts = partition and time bucket; visitors = high-cardinality metric;
    metrics = numeric metrics; metas = metadata
    5min rollup (288 per day)


  22. {
      "url": "http://arstechnica.com/12345",
      "ts": "2015-01-02T08:05:00.123Z",
      "visitors": ["3f3f3"],

      "metrics": {
        "$all/page_views": 1,
        "desktop/page_views": 1,
        "mobile/page_views": 0,
        "$all/engaged_secs": 0,
        "new/engaged_secs": 0,
        "returning/engaged_secs": 0
      },

      "metas": {
        "title": "Obama gives speech",
        "authors": ["Mike Journo"],
        "section": "Politics",
        "pub_date": "2015-01-02T08:00:00.000Z"
      }
    }
    callouts: url/ts = partition and time bucket; visitors = high-cardinality metric;
    metrics = numeric metrics; metas = metadata
    raw event (millions per day)
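One way to see why rollups are tricky: collapsing raw events into a 5min bucket means summing the numeric metrics but set-unioning the visitor IDs, which must stay deduplicated. A pure-Python sketch of that collapse (the function and its bucketing are illustrative, not Parse.ly's actual pipeline code):

```python
from collections import defaultdict

def rollup_5min(raw_events):
    """Collapse raw events into one 5min rollup per (url, bucket).

    Numeric metrics are additive, but visitor IDs must be unioned as a
    set -- summing per-event visitor counts would overcount repeats.
    """
    buckets = {}
    for ev in raw_events:
        # Truncate an ISO timestamp like "2015-01-02T08:07:..." down to
        # its 5-minute bucket, e.g. "2015-01-02T08:05".
        hh, mm = int(ev["ts"][11:13]), int(ev["ts"][14:16])
        bucket = "%s%02d:%02d" % (ev["ts"][:11], hh, mm - mm % 5)
        key = (ev["url"], bucket)
        if key not in buckets:
            buckets[key] = {"url": ev["url"], "ts": bucket,
                            "visitors": set(), "metrics": defaultdict(int)}
        doc = buckets[key]
        doc["visitors"].update(ev["visitors"])
        for metric, value in ev["metrics"].items():
            doc["metrics"][metric] += value
    return list(buckets.values())
```

Two events from the same visitor in the same bucket yield one rollup document with two page views but a visitor set of size one.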


  23. url-raw by hour
    url-5min by day
    url-1day by month
    document grouping
    in time-based indices
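This grouping maps straight onto index names (the `v1_shared-…` scheme appears later in the deck). A sketch of the naming function, with the version/namespace defaults as assumptions:

```python
from datetime import datetime

# Each rollup granularity gets its own time-bucket suffix: raw records
# are grouped per hour, 5min rollups per day, 1day rollups per month.
FORMATS = {"raw": "%Y.%m.%dT%H", "5min": "%Y.%m.%d", "1day": "%Y.%m"}

def index_name(granularity, ts, version="v1", namespace="shared"):
    """Build a time-based index name like v1_shared-1day-2015.01."""
    return "%s_%s-%s-%s" % (version, namespace, granularity,
                            ts.strftime(FORMATS[granularity]))

print(index_name("1day", datetime(2015, 1, 2)))      # v1_shared-1day-2015.01
print(index_name("raw", datetime(2015, 3, 15, 12)))  # v1_shared-raw-2015.03.15T12
```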


  24. top_things(...)
    thing_details(...)
    site_timeline(...)
    initial data access layer
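A hypothetical sketch of what a `top_things(...)` call might build underneath: a time-range-filtered terms aggregation that sums one metric per URL. Parameter and field names here are assumptions for illustration, not Parse.ly's actual API:

```python
# Hypothetical: the query body a top_things(...) call could send to
# Elasticsearch -- top N URLs by a summed metric over a date range.
def top_things(metric="$all/page_views", limit=10,
               start="2015-01-01", end="2015-01-31"):
    return {
        "size": 0,  # aggregations only, no hits
        "query": {"range": {"ts": {"gte": start, "lt": end}}},
        "aggs": {
            "top_urls": {
                "terms": {"field": "url", "size": limit,
                          "order": {"total": "desc"}},
                "aggs": {"total": {"sum": {"field": "metrics.%s" % metric}}},
            }
        },
    }
```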


  25. historical analytics
    for spotting long-term opportunities,
    trends, and insights in heaps of data


  26. Parse.ly "Batch Layer" Topologies
    with Spark and Amazon S3
    Parse.ly "Speed Layer" Topologies
    with Storm & Kafka
    Parse.ly Dashboards and APIs
    with Elasticsearch & Cassandra
    Parse.ly Raw Data Pipeline
    with Amazon Kinesis & S3 Access
    mage building blocks


  27. rebuild the world!


  29. Mid-2015:
    "This sort of works, but seems that
    we need more hardware... and
    what's up with response times?"


  30. "You need to give big
    customers their own indices."
    - Otis

    "You need to use node-shard
    allocation for hot/cold tiers."
    - Radu


  31. Time-based indices
    Index versioning
    Customer namespaces
    Node-shard allocation
    all together now!


  32. Time-Based
    • v1_shared-1day-2015.01
    • v1_shared-1day-2015.02


  33. Versioning
    • v1_shared-1day-2015.01
    • v2_shared-1day-2015.01
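The version prefix exists so you can reindex under a new mapping, then atomically repoint readers. A sketch of the `_aliases` request body that swaps January from v1 to v2; the unversioned alias name is an assumption about how queries address the index:

```python
# Sketch: after reindexing 2015.01 under the v2 mapping, swap the alias
# in one atomic _aliases call so readers never see a gap.
alias_swap = {
    "actions": [
        {"remove": {"index": "v1_shared-1day-2015.01",
                    "alias": "shared-1day-2015.01"}},
        {"add": {"index": "v2_shared-1day-2015.01",
                 "alias": "shared-1day-2015.01"}},
    ]
}
```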


  34. Namespaces
    • v1_shared-1day-2015.01
    • v1_condenast-1day-2015.01


  35. Node-Shard Allocation
    • v1_shared-1day-2015.01 => cold (mem, rust)
    • v1_shared-5min-2015.02.01 => warm (mem, ssd)
    • v1_shared-5min-2015.03.15 => hot (mem, cpu)
    • v1_shared-raw-2015.03.15T12 => raw (cpu)
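Tiering like this is done with shard allocation filtering: each node is tagged with a custom attribute in `elasticsearch.yml` (the attribute name `tier` here is an assumption), e.g. `node.tier: hot`, and each index requires a matching tier. Demoting an aging index to spinning disks is then a single settings update. A sketch:

```python
# Sketch: build the index-level setting that pins an index's shards to
# nodes tagged with a given tier attribute (attribute name "tier" is an
# assumed convention, set as node.tier in elasticsearch.yml).
def tier_settings(tier):
    return {"index.routing.allocation.require.tier": tier}

# Demote January's 1day rollups to the cold tier (memory + rust):
print(tier_settings("cold"))  # {'index.routing.allocation.require.tier': 'cold'}
```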


  36. • Cluster: 40 nodes, 500+ indices, 7,000+ shards
    • Tiers: 4 client, 3 master, 9 raw, 9 hot, 12 warm, 3 cold
    • Instances: 1TB+ of RAM, 500+ CPU cores
    • Disks: 12+ TB data, >50% in SSDs, rest in rust
    • Writes: 10K+ writes per second
    • Reads: 100's of aggregations per second


  39. Late-2015:
    "This is shipped! It works!
    ... but, some issues remain."


  40. OOMs = bugs
    timeouts = lies
    queries = hogs
    bugs, lies, and hogs


  41. In the worst case, a bad
    query takes longer than its
    timeout, hogs the cluster,
    and hits an OOM bug.
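"Timeouts = lies" because an Elasticsearch request-level timeout is best-effort: the response merely reports `"timed_out": true` with partial results while shards may keep burning CPU. A defensive sketch of a client-side guard (`search_fn` stands in for an ES client call; the error handling is illustrative):

```python
# Sketch: treat a server-side timeout as advisory and fail loudly on
# partial results rather than silently serving them.
def guarded_search(search_fn, body, timeout_s=5):
    body = dict(body, timeout="%ds" % timeout_s)  # server-side, best-effort
    resp = search_fn(body)
    if resp.get("timed_out"):
        raise RuntimeError("partial results: query exceeded %ds" % timeout_s)
    return resp
```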


  42. better resiliency
    store compression
    task management
    aggregation paging
    query profiling
    excited about future


  43. Questions? Tweet to @amontalenti!


  44. links
    • Lucene: The Good Parts
    • Mage: The Magical Time Series Backend
    • Pythonic Analytics with Elasticsearch
    • Visit us: http://parse.ly
    • Join us: http://parse.ly/jobs


  45. appendix


  46. building mage
    streaming time series engine
    for our next 1,000 customers


  51. pykafka
    ingest raw event data at high speed


  52. [diagram: one Python process (state + code) per core across three
    servers, running balanced pykafka consumers that feed a
    pykafka.producer]

    consumer = ... # balanced consumer
    while True:
        msg = consumer.consume()
        msg = json.loads(msg)
        urlparse(msg["url"])


  54. scale-out functions over a stream of inputs
    in order to generate a stream of outputs


  55. [diagram: one Python process (state + code) per core across three
    servers, wired together over Storm's multi-lang JSON protocol and
    fed by a pykafka.producer]

    class UrlParser(Topology):
        url_spout = UrlSpout.spec(p=1)
        url_bolt = UrlBolt.spec(p=4,
                                input=url_spout)


  57. pyspark
    scale-out batch functions over static dataset
    to perform transformations and actions


  58. [diagram: pyspark driver (pyspark.SparkContext) dispatching work to
    one Python process (state + code) per core across three servers,
    via cloudpickle, py4j, and binary pipes]

    sc = SparkContext()
    file_rdd = sc.textFile(files)
    file_rdd.map(urlparse).take(1)


  60. lesson learned
    log-oriented "lambda architecture" works well,
    but it costs time and money!


  61. multi-process, not multi-thread
    multi-node, not multi-core
    message passing, not shared memory
    !
    heaps of data and streams of data
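The "message passing, not shared memory" pattern above, shrunk to one machine as a minimal sketch: worker processes share no state and only exchange messages over queues (assumes a fork-capable platform; the lowercase "parsing" step is a stand-in for real work):

```python
# Minimal multi-process, message-passing pipeline: workers pull
# messages from an inbox queue and push results to an outbox queue.
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # iter(inbox.get, None): consume until the None shutdown signal.
    for msg in iter(inbox.get, None):
        outbox.put(msg["url"].lower())  # stand-in for real parsing

def run_pipeline(urls, n_workers=2):
    inbox, outbox = Queue(), Queue()
    procs = [Process(target=worker, args=(inbox, outbox))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for url in urls:
        inbox.put({"url": url})
    results = sorted(outbox.get() for _ in urls)
    for _ in procs:
        inbox.put(None)  # one shutdown signal per worker
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(["HTTP://A", "HTTP://B"]))  # ['http://a', 'http://b']
```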
