- The people who have issues similar to ours
- The people who are thinking about multi-tenant architecture
- The people who can make decisions about architecture

Theme
- Config Management in Multi-Tenant Kubernetes
- Operation of Fluentd in Multi-Tenant Kubernetes
Pain Points
- Too many sidecar containers across all of the Pods
- All developers must maintain Fluentd regardless of their level of knowledge
- Lack of monitoring and of attention to performance, reliability and durability
- Hard to schedule Pods efficiently because there are too many containers
- Logging quality depends on each team
- Need to send audit logs, but monitoring is not good enough
logfile "HHSFHBUPS/PEFT Elasticsearch Forwarder Aggregator Aggregator Aggregator /PEF# /PWB /FVUSPO ,FZTUPOF stdout stdout stdout logfile Forwarder Forwarders - Collect logs and send them to aggregators - Deployed as Daemonset It means that a node has only one Fluentd container
logfile "HHSFHBUPS/PEFT Elasticsearch Forwarder Aggregator Aggregator Aggregator /PEF# /PWB /FVUSPO ,FZTUPOF stdout stdout stdout logfile Forwarder Pros - Developers don’t need to maintain Fluentd - Fluentd can buffer logs while the destinations are down - Easy to scale aggregators - Monitored by SRE Team so developers don’t need to do that - Ensured durability, reliability and performance by SRE Team
- One team's config must not affect other teams' configs
- A config must not be able to bring the Fluentd process down
- Applying a config must not require manual operation by developers
[Diagram: Developers and SRE create configs; the Config Operator turns them into the aggregator Fluentd config on the Fluentd nodes]

Config Operator
- Automatically validates configs written in the "LoggingPipeline" CRD
- Automatically compiles a valid config into the Fluentd config stored in a ConfigMap
- Automatically notifies Fluentd to reload the new config
- Automatically blocks a config if it is invalid
All developers need to do is specify the log source and destination in the CRD.
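The CRD schema itself is not shown here, but conceptually a LoggingPipeline resource could look roughly like this sketch; the API group, version and field names are assumptions.

```yaml
# Hypothetical LoggingPipeline resource: the developer only specifies the source
# and destination; everything else is filled in by the Config Operator.
apiVersion: logging.example.com/v1alpha1   # assumed group/version
kind: LoggingPipeline
metadata:
  name: nova-logs
  namespace: team-a
spec:
  source:
    type: tail
    path: /var/log/nova/*.log              # assumed field names
  destination:
    type: elasticsearch
    host: elasticsearch.example.com
    index: team-a-nova
```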
- Automatically fills in important parameters
- Automatically wraps each config in a label so that it cannot affect other teams' configs
- Automatically sets a dedicated buffer directory per config to ensure durability
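Conceptually, the config the operator generates for one pipeline could look like the sketch below: the whole pipeline is wrapped in a Fluentd `<label>` so events from one team never enter another team's pipeline, and each pipeline gets its own file buffer directory. The label name, paths and output parameters are illustrative assumptions.

```
# Sketch of generated output for one LoggingPipeline (names assumed).
# Events for this pipeline are routed to the @team_a_nova_logs label only.
<label @team_a_nova_logs>
  <match **>
    @type elasticsearch
    host elasticsearch.example.com
    index_name team-a-nova
    <buffer>
      @type file
      # dedicated buffer directory per pipeline for durability
      path /var/log/fluentd-buffer/team-a/nova-logs
    </buffer>
  </match>
</label>
```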
got very high periodically, even though there were plenty of aggregator instances
- The Fluentd event thread was hanging up
- There were not that many connections between the aggregator and the forwarders
- This means that aggregation, processing, and writing the buffer were the heavy parts
- I/O, however, was not hanging up
The log chunk size may be too large. Let's make the chunk size smaller!
Resolved: after lowering the chunk size, the problem went away.
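The fix corresponds to tuning the buffer settings on the Fluentd output, since the buffer chunk size controls how much work each flush does. A sketch of the kind of change, with purely illustrative values rather than the actual production numbers:

```
# Buffer tuning sketch (values assumed).
<buffer>
  @type file
  path /var/log/fluentd-buffer/es
  chunk_limit_size 8m      # lowered so each chunk stays cheap to process and flush
  flush_thread_count 4     # more flush threads keep the event loop from blocking
  flush_interval 5s
</buffer>
```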
[Diagram: Forwarders on each node and Aggregators on the Aggregator Nodes expose metrics; Prometheus queries them periodically]

- The Prometheus inside the cluster scrapes metrics from the Fluentd containers
- That Prometheus is monitored by a Prometheus outside the cluster
- The outer Prometheus writes the metrics to the VictoriaMetrics TSDB
- VMAlert periodically queries VictoriaMetrics with pre-defined rules
- VMAlert fires alerts to AlertManager when a rule matches
- AlertManager sends notifications to destinations such as Slack and PagerDuty
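The handoff from the outer Prometheus to VictoriaMetrics is the standard remote_write integration; a minimal sketch, assuming a single-node VictoriaMetrics endpoint:

```yaml
# Out-of-cluster Prometheus: write everything it collects to VictoriaMetrics.
remote_write:
  - url: http://victoriametrics.example.com:8428/api/v1/write   # assumed endpoint
```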
Monitored items
- Pod restart count
- Whether no logs are being sent to the destination
- Whether the log inflow speed stays below the log processing speed
- Disk usage for buffering and the amount of buffered bytes
- Number of Fluentd errors and slow flushes
- Number of Fluentd Config Operator errors
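Expressed as VMAlert/Prometheus-style alerting rules, a couple of these checks might look like the sketch below; the metric names come from fluent-plugin-prometheus and kube-state-metrics, and the thresholds, durations and namespace are illustrative assumptions.

```yaml
groups:
  - name: fluentd
    rules:
      - alert: FluentdNoLogsShipped
        # nothing has reached the destination output for 10 minutes
        expr: sum(rate(fluentd_output_status_emit_records[5m])) == 0
        for: 10m
      - alert: FluentdBufferGrowing
        # buffered bytes keep piling up: inflow is faster than processing
        expr: sum(fluentd_output_status_buffer_total_bytes) > 1e9
        for: 15m
      - alert: FluentdSlowFlush
        expr: increase(fluentd_output_status_slow_flush_count[5m]) > 0
      - alert: FluentdPodRestarts
        expr: increase(kube_pod_container_status_restarts_total{namespace="logging"}[10m]) > 0
```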
Operator
- Freed developers from having to maintain Fluentd
- All developers need to do is manage their own logging config
- Reduced the number of containers in a cluster by about 172
- Improved the reliability, durability and performance of logging
- Found previously undetected logging errors through monitoring
[Diagram: the same architecture, with Forwarders on each node and Aggregators on the Aggregator Nodes]

- Remove the direct dependency between forwarders and aggregators to improve scalability
- Enable developers to send logs from outside the cluster