Slide 1

Slide 1 text

Traffic Control and Elasticsearch @Cox
Steve Malenfant, Engineer, Cox Communications Inc.
smalenfant@apache.org
February 28th, 2018

Slide 2

Slide 2 text

Cox Communications
A broadband communications and entertainment company

Slide 3

Slide 3 text

Enabling new services

Slide 4

Slide 4 text

All of the Olympic streaming coverage on all devices

Slide 5

Slide 5 text

CDN Team and Deployment overview
• Team
  • 6 engineers, split evenly between Dev and Ops
• CDN Deployment
  • Hundreds of physical servers in 13 datacenters
  • Hundreds of containers (LXC and Docker)
  • Apache Traffic Server and Traffic Control
  • Elasticsearch for access logs

Slide 6

Slide 6 text

Apache Traffic Control
http://trafficcontrol.apache.org/
• Set of components that can be used to build, monitor, configure, and provision a large-scale content delivery network (CDN)
• Traffic Router: DNS/HTTP client steering to the closest/best available cache
• Traffic Monitor: implements the CDN health protocol and presents cache states to Traffic Router
• Traffic Stats: acquires CDN-wide statistics and stores them in InfluxDB
• Traffic Ops: an API-driven configuration management and configuration file generation system
• Traffic Portal: a client-facing UI used to manage and operate a CDN

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Traffic Portal
Management interface

Slide 9

Slide 9 text

We have a logging platform - 2015
• Supported by the enterprise
• Minimal amount of access logs (<100GB)
• Functional
• Search and visualize logs
• Provides reports and dashboards
• Triggers alarms

Slide 10

Slide 10 text

We have a logging problem - 2016
• Explosion in CDN usage
• Electronic Program Guides
• Images and poster art
• Increasing IP Video Delivery
• Experiencing slowdown in reports
• Retention time reduction
• Filtering events, losing visibility
• Getting too expensive…

Slide 11

Slide 11 text

Elastic proof of concept – early 2016
Environments: Production, Lab
Pipeline stages, in data-flow order:
Edge/Mid Traffic Server Logs
Filebeat
 - File Input
 - Filters
 - Add Tags
 - Beat Output
Logstash Indexers (N+1)
 - Beat Input
 - Filtering (KV)
 - Redis TCP Localhost Output
Redis (2)
 - TCP Input/Output
 - Buffers events
 - Processed data
Logstash Indexers (N+1)
 - Redis Input
 - Elasticsearch Output
Elasticsearch (Data 3..N)
 - HTTP API
 - Clustered
 - Replicated

Slide 12

Slide 12 text

Our current logging pipeline
Environment: Production
Edge/Mid Traffic Server Logs
Filebeat
 - File Input
 - Filters
 - Add Tags
 - Kafka Output
Traffic Router Access Logs
Filebeat
 - File Input
 - Add Tags
 - Kafka Output
Kafka (3)
 - Multiple Topics
 - Retention Policies
 - Replicas
Logstash Indexers (N+1)
 - Kafka Input
 - KV Filter
 - Elasticsearch Output
Kafka Stream Aggregator (N+1)
 - Downsample data
 - Elasticsearch Output
Elasticsearch (Data 3..N)
 - HTTP API
 - Clustered
 - Replicated
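The "Downsample data" step of the stream aggregator is not detailed on the slide. As a rough sketch of the idea only, the following plain Python function rolls individual access-log events up into fixed time buckets. It works on an in-memory list rather than a Kafka topic, and the field names (timestamp, shn, pssc, b) and the 60-second bucket size are assumptions made for illustration, not the aggregator actually deployed.

```python
from collections import defaultdict


def downsample(events, bucket_seconds=60):
    """Roll per-request events up into one summary row per time bucket.

    Each event is a dict with (assumed) keys: timestamp (epoch seconds),
    shn (requested host), pssc (response status) and b (bytes sent).
    """
    buckets = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for event in events:
        # Floor the timestamp to the start of its bucket.
        start = int(event["timestamp"] // bucket_seconds) * bucket_seconds
        key = (start, event["shn"], event["pssc"])
        buckets[key]["requests"] += 1
        buckets[key]["bytes"] += event["b"]
    # One output document per (bucket, host, status) combination.
    return [
        {"timestamp": start, "shn": shn, "pssc": pssc, **totals}
        for (start, shn, pssc), totals in sorted(buckets.items())
    ]
```

Each returned row could then be indexed as a single Elasticsearch document, which is what keeps the downsampled index far smaller than the raw one.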

Slide 13

Slide 13 text

Elasticsearch Deployment
• Ansible playbooks with Docker
• Elasticsearch
• Logstash
• Curator

    - name: Create elasticsearch container
      docker_container:
        name: "{{ inventory_hostname }}"
        image: "docker.elastic.co/elasticsearch/elasticsearch:6.1.2"
        env:
          xpack.monitoring.enabled: "{{ xpack_monitor_enabled | lower }}"
          bootstrap.memory_lock: "true"
          path.data: "{{ item['es_disks'] | join(',') }}"
        restart_policy: unless-stopped
        cpu_set: "{{ item.cpu_set | default([]) }}"
        …
        ports:
          - "9200:9200"
          - "9300:9300"

Slide 14

Slide 14 text

Elastic Nodes deployment
Early on…
• 5 Physical Servers (192GB)
• Single node
• 6 x 1.8TB 10k SAS (RAID0)
Then…
• 10 Physical Servers (192GB)
• 3 nodes per server
• 3 x 2TB SSDs
Now…
• 10 Physical Servers (192GB)
• 1 node (96GB tmpfs) - HOT
• 1 node (SSDs) - WARM

Slide 15

Slide 15 text

Increasing indexing performance
• Using Hot/Warm architecture
• tmpfs (RAM)
• NVMe (10 µs writes/500K IOPS)
• Hourly indices
• Curator to move indices to SSD nodes
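To make the hourly-index and hot-to-warm move more concrete, here is a small Python sketch of the selection logic a Curator-style job could apply. It is not the Curator configuration used at Cox. The index prefix, the six-hour hot window, and the `box_type` node attribute are assumptions; only the `index.routing.allocation.require.<attribute>` setting form is standard Elasticsearch shard allocation filtering.

```python
from datetime import datetime, timedelta, timezone

INDEX_PREFIX = "cdn-access"  # hypothetical index prefix
HOT_HOURS = 6                # hypothetical time an index stays on hot nodes

# Settings body that asks Elasticsearch to relocate an index to nodes
# started with the (assumed) custom attribute box_type=warm.
WARM_SETTINGS = {"index.routing.allocation.require.box_type": "warm"}


def hourly_index_name(when, prefix=INDEX_PREFIX):
    """Name of the hourly index that holds events for a timezone-aware time."""
    return f"{prefix}-{when.astimezone(timezone.utc):%Y.%m.%d.%H}"


def index_hour(name, prefix=INDEX_PREFIX):
    """Parse the UTC start hour back out of an hourly index name."""
    stamp = name[len(prefix) + 1:]
    return datetime.strptime(stamp, "%Y.%m.%d.%H").replace(tzinfo=timezone.utc)


def due_for_warm(names, now, hot_hours=HOT_HOURS):
    """Indices whose start hour is older than the hot window."""
    cutoff = now - timedelta(hours=hot_hours)
    return [name for name in names if index_hour(name) < cutoff]
```

A scheduled job would apply WARM_SETTINGS to each index returned by due_for_warm through the index settings API, which frees the RAM-backed hot nodes for new hourly indices.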

Slide 16

Slide 16 text

Applying to production
• This was across 10 physical hosts (older E5-26xx)
• Limited by Logstash (6 instances)

Slide 17

Slide 17 text

Improving pipeline efficiency
JSON instead of RAW KV Logs
• Lower Logstash indexers CPU usage
• Better filtering capabilities?
• Enable other consumers
• Increases Filebeat CPU usage at Edge
Sample raw KV log line:
1519769996.088 chi=174.79.69.70 phn=edge01.rd.at.cox.net shn=example.com url=http://cdn.example.ott.cox.net/test cqhm=GET cqhv=HTTP/1.1 pssc=200 ttms=0 b=52505 sssc=200 sscl=52505 cfsc=FIN pfsc=FIN crc=TCP_MISS phr=DIRECT uas="NING/1.0" range="-"
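The difference between raw KV and JSON logs can be illustrated with a short Python sketch that parses the sample line from this slide into a JSON document. This is roughly the work the Logstash KV filter does today and that JSON logging would make unnecessary downstream. The function name, the "timestamp" key, and the choice of which fields to cast to integers are assumptions for illustration.

```python
import json
import shlex

# Fields from the sample line that look numeric (an assumption).
INT_FIELDS = {"pssc", "ttms", "b", "sssc", "sscl"}

SAMPLE = (
    '1519769996.088 chi=174.79.69.70 phn=edge01.rd.at.cox.net shn=example.com '
    'url=http://cdn.example.ott.cox.net/test cqhm=GET cqhv=HTTP/1.1 pssc=200 '
    'ttms=0 b=52505 sssc=200 sscl=52505 cfsc=FIN pfsc=FIN crc=TCP_MISS '
    'phr=DIRECT uas="NING/1.0" range="-"'
)


def kv_line_to_json(line):
    """Convert one timestamp-prefixed key=value log line to a JSON string."""
    # shlex.split honours double quotes, so uas="NING/1.0" stays one token.
    tokens = shlex.split(line)
    event = {"timestamp": float(tokens[0])}
    for token in tokens[1:]:
        # Split on the first "=" only, so URLs with query strings survive.
        key, _, value = token.partition("=")
        event[key] = int(value) if key in INT_FIELDS else value
    return json.dumps(event)


if __name__ == "__main__":
    print(kv_line_to_json(SAMPLE))
```

With JSON emitted at the source, consumers other than Logstash can read the Kafka topic directly, at the cost of doing the formatting on the edge.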

Slide 18

Slide 18 text

Dashboards – Examples

Slide 19

Slide 19 text

Dashboards – Examples

Slide 20

Slide 20 text

Thank You!

Slide 21

Slide 21 text

Streaming performance
Translog – async vs ?