Rambling about InfluxDB, ntop and Kubernetes

Slide 1

Slide 1 text

InfluxDB and ntop on Kubernetes
Gianluca Arbezzano, SRE at InfluxData

Slide 2

Slide 2 text

Gianluca Arbezzano
Site Reliability Engineer @InfluxData
● https://gianarb.it
● @gianarb
What I like:
● I make dirty hacks that look awesome
● I grow my vegetables
● Travel for fun and work

Slide 3

Slide 3 text

@gianarb - gianluca@influxdb.com

Slide 4

Slide 4 text


Slide 5

Slide 5 text

What is going on right now in production?

Slide 6

Slide 6 text

1. You know! Your team knows and uses Docker for local development and testing.
2. Kubernetes! Everyone talks about Kubernetes.
3. Hire! You don't know why, but you hired a DevOps engineer who kind of knows k8s.
4. Excitement! You are moving everything and everyone to Kubernetes.

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Inspired by a true story

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

You need a good book
1. Short
2. Driven by experience
3. Practical
4. Easy

Slide 11

Slide 11 text


Slide 12

Slide 12 text


Slide 13

Slide 13 text

Kubernetes Resources
● Pod
● DaemonSet
● StatefulSet
● Service
● Deployment
● Persistent Volumes
● ConfigMap
● Secret

Slide 14

Slide 14 text

InfluxDB StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: monitoring
  name: influxdb
  labels:
    component: influxdb
spec:
  serviceName: influxdb
  selector:
    matchLabels:
      component: influxdb
  replicas: 1
  template:
    metadata:
      name: influxdb
      labels:
        # The pod labels must include the selector label above,
        # otherwise the API server rejects the StatefulSet.
        component: influxdb
        app: influxdb
    spec:
      containers:
      - name: influxdb
        image: quay.io/influxdb/influxdb:nightly
        volumeMounts:
        - name: data
          mountPath: /var/lib/influxdb
        ports:
        - containerPort: 8086
          name: server
  volumeClaimTemplates:
  - metadata:
      namespace: monitoring
      name: data
    spec:
      storageClassName: ebs-1
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: "250Gi"

Slide 15

Slide 15 text

InfluxDB Service

apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: influxdb
  labels:
    component: influxdb
    app: influxdb
spec:
  clusterIP: None
  ports:
  - port: 8086
    name: server
  selector:
    component: influxdb
  publishNotReadyAddresses: true

Slide 16

Slide 16 text

Chronograf Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: monitoring
  name: chronograf
  labels:
    app: chronograf
spec:
  selector:
    matchLabels:
      app: chronograf
  replicas: 1
  template:
    metadata:
      name: chronograf
      labels:
        app: chronograf
    spec:
      containers:
      - name: chronograf
        image: chronograf:1.7.4
        ports:
        - containerPort: 8888
          name: server

Slide 17

Slide 17 text

Chronograf Service

apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: chronograf
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: server
  selector:
    app: chronograf

Slide 18

Slide 18 text

Telegraf ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-config-ntopng
  namespace: monitoring
data:
  telegraf.conf: |
    [[outputs.influxdb]]
      urls = ["$MONITOR_HOST"]
      database = "ntopng"
      retention_policy = "weekly"
      write_consistency = "any"
      timeout = "5s"
      username = "$MONITOR_USERNAME"
      password = "$MONITOR_PASSWORD"
    [[inputs.influxdb_listener]]
      service_address = ":8086"
    [global_tags]
      env = "$ENV"
      hostname = "$HOSTNAME"

Slide 19

Slide 19 text

ntopng ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: ntopng-runtimeprefs
  namespace: monitoring
data:
  runtimeprefs.json: |+
    {"ntopng.prefs.influx_username":"00000800E4B6B0FFA66628E2","ntopng.prefs.influx_dbname":"00066E746F706E670800C3BD27A9CC635D8C","ntopng.prefs.influx_password":"00000800E4B6B0FFA66628E2","ntopng.prefs.influx_retention":"00C0070800CDE0A41D2AB244AA","ntopng.prefs.influx_auth_enabled":"00C0000800216D4BC751498930","ntopng.prefs.ts_post_data_url":"0015687474703A2F2F6C6F63616C686F73743A383038360800BC9FDD56721F5346","ntopng.prefs.timeseries_driver":"0008696E666C7578646208002B6EA2EF661124EC"}
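The hex strings in runtimeprefs.json are hard to read on a slide. They appear to follow the Redis DUMP layout: a type byte, a length-prefixed value, a 2-byte RDB version, and an 8-byte checksum. That layout is an assumption inferred from the values shown here, not something the slide states. The sketch below uses a hypothetical helper named decode_pref, handles only the two encodings that occur on this slide (short strings and 8-bit integers), and does not verify the checksum.

```python
def decode_pref(hex_payload):
    """Decode one runtimeprefs.json value, assuming Redis DUMP layout.

    Assumed layout: 0x00 (string type), encoded value,
    2-byte RDB version, 8-byte CRC64 (the CRC is not checked here).
    """
    raw = bytes.fromhex(hex_payload)
    if len(raw) < 12 or raw[0] != 0x00:
        raise ValueError("expected a Redis string payload")
    body = raw[1:-10]  # drop the type byte and the 10-byte trailer
    marker = body[0]
    if marker >> 6 == 0:
        # Top two bits 00: the low 6 bits are the string length.
        length = marker & 0x3F
        return body[1:1 + length].decode("utf-8")
    if marker == 0xC0:
        # Special encoding: the value is an 8-bit signed integer.
        return int.from_bytes(body[1:2], "little", signed=True)
    raise ValueError("encoding not handled by this sketch")


# Values copied from the slide (line-wrap spaces removed).
prefs = {
    "timeseries_driver": "0008696E666C7578646208002B6EA2EF661124EC",
    "influx_dbname": "00066E746F706E670800C3BD27A9CC635D8C",
    "influx_retention": "00C0070800CDE0A41D2AB244AA",
}
for key, payload in prefs.items():
    print(key, "=", repr(decode_pref(payload)))
# timeseries_driver = 'influxdb'
# influx_dbname = 'ntopng'
# influx_retention = 7
```

Read this way, ts_post_data_url decodes to http://localhost:8086, which is the address the Telegraf influxdb_listener in the same pod listens on.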

Slide 20

Slide 20 text

ntopng DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ntopng
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: ntopng
  template:
    metadata:
      labels:
        name: ntopng
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: ntopng
      - name: telegraf

Slide 21

Slide 21 text

ntopng DaemonSet (containers)

- name: ntopng
  image: ntop/ntopng:stable
  args: ["-U", "root"]
  resources:
    limits:
      memory: 750Mi
    requests:
      memory: 250Mi
  env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  volumeMounts:
  - name: config
    mountPath: /var/tmp/ntopng/runtimeprefs.json
    subPath: runtimeprefs.json
- name: telegraf
  image: docker.io/library/telegraf:1.9
  volumeMounts:
  - name: telegraf
    mountPath: /etc/telegraf
  env:
  - name: MONITOR_HOST
    valueFrom:
      secretKeyRef:
        name: telegraf
        key: monitor_host
  - name: MONITOR_USERNAME
    valueFrom:
      secretKeyRef:
        name: telegraf
        key: monitor_username
  - name: HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

Slide 22

Slide 22 text

ntopng DaemonSet
ntop identifies hosts by IP address; in our case, those are container IPs. I also want to correlate by hostname and environment, so Telegraf sits in front of InfluxDB as a proxy and adds those tags to every point.
[global_tags]
  env = "$ENV"
  hostname = "$HOSTNAME"
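To make the effect of [global_tags] concrete, here is a minimal sketch of what happens to a single InfluxDB line-protocol point on its way through the proxy. It is an illustration only, not Telegraf's implementation: add_global_tags is a hypothetical helper, the sample measurement and field names are made up, and tag values are assumed to need no escaping.

```python
def add_global_tags(line, tags):
    """Append extra tags to the tag set of one line-protocol point."""
    # The "measurement,tag=value" section ends at the first space that is
    # not escaped with a backslash.
    i = 0
    while i < len(line):
        if line[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if line[i] == " ":
            break
        i += 1
    head, rest = line[:i], line[i:]
    extra = "".join(",{}={}".format(k, v) for k, v in sorted(tags.items()))
    return head + extra + rest


# Hypothetical point as ntopng might post it, keyed only by IP.
point = "host:traffic,ifid=0,host=10.2.3.4 bytes_sent=10i,bytes_rcvd=20i 1546300800000000000"
print(add_global_tags(point, {"env": "prod", "hostname": "node-1"}))
# host:traffic,ifid=0,host=10.2.3.4,env=prod,hostname=node-1 bytes_sent=10i,bytes_rcvd=20i 1546300800000000000
```

The fields and timestamp are untouched; only the tag set grows, which is what lets a later query group the points by hostname or env instead of by IP.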

Slide 23

Slide 23 text

Now you can query your data from Chronograf

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

You can correlate network metrics with other signals

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

© 2018 InfluxData. All rights reserved. @gianarb - gianluca@influxdb.com
github.com/kubernetes-incubator/metrics-server
Through the Metrics API you can get the amount of resources currently used by a given node or pod. The API does not store metric values, so you cannot, for example, ask how many resources a given node was using 10 minutes ago.
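Because the Metrics API only returns the current reading, keeping history means writing each reading somewhere yourself. The sketch below turns one node reading into an InfluxDB line-protocol point. It assumes the item has the metadata / timestamp / usage shape of a metrics.k8s.io NodeMetrics object, with integer quantities such as "250m" or "1024Ki". The measurement name node_usage, the field names, and the helper names are hypothetical.

```python
import string
from datetime import datetime, timezone

# Suffix tables for the quantity formats handled by this sketch.
CPU_NANOCORES = {"n": 1, "u": 10**3, "m": 10**6, "": 10**9}
MEMORY_BYTES = {"": 1, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_int(quantity, units):
    """Convert a quantity such as '250m' or '1024Ki' using a suffix table."""
    digits = quantity.rstrip(string.ascii_letters)
    suffix = quantity[len(digits):]
    if not digits.isdigit() or suffix not in units:
        raise ValueError("unsupported quantity: {!r}".format(quantity))
    return int(digits) * units[suffix]


def node_metrics_to_line(item):
    """Render one node reading as a line-protocol point with its timestamp."""
    ts = datetime.strptime(item["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    ts_ns = int(ts.replace(tzinfo=timezone.utc).timestamp()) * 10**9
    cpu = to_int(item["usage"]["cpu"], CPU_NANOCORES)
    mem = to_int(item["usage"]["memory"], MEMORY_BYTES)
    return "node_usage,node={} cpu_nanocores={}i,memory_bytes={}i {}".format(
        item["metadata"]["name"], cpu, mem, ts_ns)


# Hypothetical reading, shaped like one entry of the API response.
sample = {
    "metadata": {"name": "node-1"},
    "timestamp": "2019-01-01T00:00:00Z",
    "window": "30s",
    "usage": {"cpu": "250m", "memory": "1024Ki"},
}
print(node_metrics_to_line(sample))
# node_usage,node=node-1 cpu_nanocores=250000000i,memory_bytes=1048576i 1546300800000000000
```

Once points like this are stored, the "10 minutes ago" question becomes an ordinary time-range query against the database.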

Slide 29

Slide 29 text

gianarb.it ~ @gianarb

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
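The text format on this slide is line oriented: a metric name, optional labels in braces, a value, and an optional millisecond timestamp. As a rough illustration, the sketch below parses one sample line with regular expressions. It is not a complete parser: it skips comment lines instead of interpreting HELP and TYPE, and the names Sample and parse_sample are hypothetical.

```python
import re
from typing import Dict, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')
_ESCAPES = {"n": "\n", "\\": "\\", '"': '"'}


def _unescape(value):
    # Label values escape backslash, double quote and newline.
    return re.sub(r"\\(.)",
                  lambda m: _ESCAPES.get(m.group(1), m.group(0)), value)


def parse_sample(line):
    """Parse one sample line; return None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE.match(line)
    if match is None:
        raise ValueError("not a sample line: {!r}".format(line))
    name, raw_labels, value, ts = match.groups()
    labels = {k: _unescape(v) for k, v in _LABEL.findall(raw_labels or "")}
    return Sample(name, labels, float(value), int(ts) if ts else None)


print(parse_sample('http_requests_total{method="post",code="200"} 1027 1395066363000'))
print(parse_sample("metric_without_timestamp_and_labels 12.47"))
```

Python's float() accepts "+Inf" directly, so the "weird metric" line and the le="+Inf" histogram bucket need no special handling.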

Slide 30

Slide 30 text

ntop can be the right tool, but we need:
● Better configuration via file and/or environment variables
● It should be "Kubernetes aware", because IPs are not the right way to identify containers: they change too often

Slide 31

Slide 31 text

~ @gianarb - https://gianarb.it ~
Monitor your monitoring infrastructure

Slide 32

Slide 32 text

The monitoring infrastructure should notify you when you are down.

Slide 33

Slide 33 text

The monitoring infrastructure should be as far as possible from the environment you are monitoring.

Slide 34

Slide 34 text

● Different cloud provider
● Different datacenter
● Different company
● Different region
● Different teams
● As different as you can handle
