Designing the future of agent-server communication in RUDDER

Rudder
February 04, 2020

🎥 https://www.youtube.com/watch?v=l-ztfw_OIow
🧑 Alexis Mousset
📅 Configuration Management Camp 2020

RUDDER is currently used to manage more than 10k machines from the same central server, but our agent-server communication (using HTTP for inventory collection, syslog for reporting and a custom protocol for policy updates) was limiting us in terms of security, performance and extensibility.

With RUDDER 6, we have introduced a new communication infrastructure to match present and future challenges, with consistent security, better performance and improved continuity through immediate action triggers, while staying compatible with our fully asynchronous, pull-based workflow.

The talk will focus on the design choices we made, from the use of Rust for our new server component to the network and message protocols we use. It will also highlight the reasons and constraints behind them, including keeping operational overhead minimal and ensuring a smooth transition with no breaking changes.

Transcript

  1. Designing the future of agent-server
    communication in RUDDER
    Alexis Mousset
    February 4, 2020
    CfgMgmtCamp Ghent 2020

  2. Context
    Needs and requirements
    Design choices
    Implementation
    Perspectives

  3. Context

  4. RUDDER 2

  5. Agent-server communication
    • HTTP PUT for inventories
    • Custom protocol in TLS for policy copy
    • Standard syslog for reporting

  6. R: @@osquery_installation_and_configuration
    @@result_success
    @@32377fd7-02fd-43d0-aab7-28460a91347b
    @@de4435a0-b5db-401a-aba1-a685d8a937e1
    @@0
    @@File from HTTP server
    @@/etc/pki/RPM-GPG-KEY-osquery
    @@2020-01-31 22:16:00+00:00
    ##2f4e4400-3206-4fb3-ae7f-16d1192e38ac
    @#File /etc/pki/RPM-GPG-KEY-osquery is correct

  7. Reporting
    R: @@Technique@@Type@@RuleId@@DirectiveId
    @@VersionId@@Component@@Key@@ExecutionTimeStamp
    ##NodeId@#HumanReadableMessage
    • Produced by the policies (a lot of work actually!)
    • One log in syslog for each component
    • Special cases for the first and last report
    • The webapp only processes them when last report is
    received
    • Parsed using awk for rudder agent run output
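
The report layout above can be split with a single regular expression. This is a minimal sketch, assuming each report arrives as one line with eight "@@" fields, then a "##" node id and a "@#" message, as in the example report shown earlier. The Python field names are labels chosen for this sketch, and this is not the awk parser RUDDER uses.

```python
import re
from typing import NamedTuple, Optional


class Report(NamedTuple):
    # Field names are illustrative labels for the positions in the report line.
    technique: str
    report_type: str
    rule_id: str
    directive_id: str
    version_id: str
    component: str
    key: str
    timestamp: str
    node_id: str
    message: str


# "R: ", eight "@@"-prefixed fields, "##" + node id, "@#" + free-text message.
REPORT_RE = re.compile(r"^R: " + r"@@(.*?)" * 8 + r"##(.*?)@#(.*)$", re.DOTALL)


def parse_report(line: str) -> Optional[Report]:
    """Return the parsed report, or None when the line is not a report."""
    match = REPORT_RE.match(line)
    return Report(*match.groups()) if match else None
```

Lines that do not match (plain agent logs, for instance) yield None, so the caller decides what to do with them.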

  8. Needs and requirements

  9. Reporting - Current limitations
    • Reporting security
      • plain text
      • no authentication (easy to fake)
    • Missing information in reports
      • Differences between expected and current state
      • Information about what has been repaired exactly
    • Syslog itself
      • Requires a specific port (514)
      • Requires root access (to configure local syslog daemon)
      • Can interact with user’s syslog configuration
      • Hard to debug (few logs from the syslog daemon by default)
      • Poor performance (for database insertion)

  10. Constraints
    • Smooth transition
      • Keep compatibility with both reporting modes for several versions
      • Allow switching at any time (same data model)
    • Keep It Simple
      • Debuggable
      • Low operational overhead
      • Use well-known technologies
    • Security
      • State-of-the-art security for reporting protocol
      • Allow future homogenization
      • Focus on security for the implementation itself

  11. Design choices

  12. Report → Run log
    • The stream of reports is useless
    • Better transmitted as a single run log
    • Store information by run (in a simple file)
    • Easier to manage and allows lots of improvements
    • Compression (works well!)
    • Database transaction by run
    • A run log is identified by: a node id, a config id and a
    date+time
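
The run log idea can be sketched as follows: one identifier made of the three parts listed above, and one compressed file per run. The class name, the file naming scheme and the helper functions are hypothetical and only illustrate the principle; they are not RUDDER's actual on-disk format.

```python
import gzip
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class RunLogId:
    """A run log is identified by a node id, a config id and a date+time."""
    node_id: str
    config_id: str
    run_time: datetime

    def file_name(self) -> str:
        # Hypothetical naming scheme, chosen only for this sketch.
        stamp = self.run_time.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        return f"{stamp}@{self.node_id}@{self.config_id}.log.gz"


def pack_run_log(reports: List[str]) -> bytes:
    """Store all reports of one run as a single gzip-compressed blob."""
    return gzip.compress("\n".join(reports).encode("utf-8"))


def unpack_run_log(blob: bytes) -> List[str]:
    """Recover the individual report lines of one run."""
    return gzip.decompress(blob).decode("utf-8").split("\n")
```

Because reports within a run share most of their fields, a whole run compresses far better than individual messages would, and the server can insert the unpacked run in one database transaction.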

  13. Improve run log
    • We are missing a lot of valuable information
    • "Hidden" in agent logs (rudder agent run -i)
    • Need to ssh to understand anything
    • We want to capture and contextualize them

  14. RUDDER 10

  15. Improve run log
    The problem is that we have two (isolated) information
    streams:
    • Reports from inside of the policies
    • Agent logs (errors, executed commands, various outputs,
    etc)

  16. Improve run log
    • Agents are usually not designed for error management at scale, and expect human interaction.
    • Nothing built-in for automated outcome analysis (what failed and what has been done)
      • no structured errors in the policies, only access to a state (error/ok/repaired) from inside the policy
      • no business knowledge in the logs

  17. Use (=parse) stdout

  18. Improve run log
    • Capture full agent output in info mode
    • Parse it on the relay/server
    • Associate simple logs with following contextualized log
    • Works for log from the technique editor or modern
    techniques
    • Not that good for legacy techniques: do everything then
    report
    • Specific insertion/purging configuration for non-results
    logs (=simple logs)
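
The association step can be sketched as a single pass over the captured output: plain log lines are buffered and attached to the next report that follows them. This is a minimal sketch, assuming reports are the lines starting with "R: @@" and everything else is a simple log; the function name is hypothetical and the real relayd parser is not shown in the deck.

```python
from typing import List, Tuple


def attach_logs(lines: List[str]) -> Tuple[List[Tuple[str, List[str]]], List[str]]:
    """Pair each report with the simple logs that preceded it.

    Returns (grouped, leftover), where grouped is a list of
    (report_line, preceding_log_lines) and leftover holds trailing
    log lines that no report followed.
    """
    grouped: List[Tuple[str, List[str]]] = []
    pending: List[str] = []
    for line in lines:
        if line.startswith("R: @@"):
            # A contextualized report closes the current group of simple logs.
            grouped.append((line, pending))
            pending = []
        else:
            pending.append(line)
    return grouped, pending
```

This also shows why the approach suits legacy techniques less well: when a technique does everything first and reports at the end, all of its logs end up attached to the first report of the batch.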

  19. RUDDER 14

  20. Reports authentication
    • We want to authenticate reporting
    • We want to stay asynchronous
    • End to end validation (check signature on root server)
    • We need a signature (like we do for inventories)
    • Prefer a standard
    • We have a hierarchical node structure

  21. S/MIME

  22. HTTP
    Use HTTP as it’s:
    • Already used for inventories (and Windows policy
    downloads)
    • Well-known
    • Easy operation and debugging (curl, etc.)
    • Fast and powerful enough (even more with HTTP/2)
    • Use simple file PUT (like inventories)
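
A simple file PUT needs nothing beyond a standard HTTP client. The sketch below uses only the Python standard library; the URL passed in by the caller, the content type and both function names are assumptions made for illustration, since the deck does not give the actual endpoint.

```python
import urllib.request


def build_put_request(url: str, body: bytes) -> urllib.request.Request:
    """Build an HTTP PUT request carrying the file content as its body."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Assumed content type for a compressed, signed file.
        headers={"Content-Type": "application/octet-stream"},
    )


def upload(url: str, body: bytes, timeout: float = 10.0) -> int:
    """Send the PUT request and return the HTTP status code."""
    with urllib.request.urlopen(build_put_request(url, body), timeout=timeout) as response:
        return response.status
```

The same request can be reproduced by hand with curl, which is part of what makes this transport easy to operate and debug.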

  23. Implementation

  24. Agent
    • Use the existing rudder agent run wrapper
    • Collect output, sign and compress
    • Send to the server
    • Retry in case of failure
    • Allows back-filling compliance data
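
The agent-side steps above can be sketched as two small functions: one that signs then compresses the collected output, and one that retries the upload. The signing and sending functions are passed in as placeholders, because the real wrapper's S/MIME signing and HTTP calls are not shown in the deck; all names and default values here are hypothetical.

```python
import gzip
import time
from typing import Callable


def prepare_run_log(output: bytes, sign: Callable[[bytes], bytes]) -> bytes:
    """Sign the collected agent output, then compress the signed result."""
    return gzip.compress(sign(output))


def send_with_retry(
    send: Callable[[bytes], bool],
    payload: bytes,
    attempts: int = 3,
    delay: float = 5.0,
) -> bool:
    """Call send(payload) until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if send(payload):
            return True
        if attempt + 1 < attempts:
            # Wait before the next attempt; no sleep after the last one.
            time.sleep(delay)
    return False
```

If the payload is kept on disk when every attempt fails, a later run can send it again, which is how compliance data can be back-filled after an outage.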

  25. relayd
    • A new daemon that runs on all policy servers (root +
    relay)
    • Reminder: A root server is also a relay
    • Replaces relay python API
    • Layer between the webapp and the nodes
    • Stateless (except for history)

  26. Relay API
    • Based on what existed since 3.2 (implemented in Python)
    • Now versioned and documented
    • Only listens locally
    • Some endpoints behind httpd reverse proxy
    • https://docs.rudder.io/api/relay
    • Still missing a full stats/monitoring API (prometheus?)

  27. Relay API
    • /system/{status, reload, info}
    • /policies/{node-id}/rules
    • /remote-run/nodes/{id}
    • /shared-files/{target-id}/...
    • /shared-folder/{path}

  28. relayd
    • Config files: /opt/rudder/etc/relay
    • A new rudder-relayd service
    • Logs to journald
    • Part of the rudder-server-relay package

  29. Service hardening
    User=rudder-relayd
    ProtectSystem=strict
    ReadWritePaths=/var/rudder/reports /var/rudder/inventories /var/rudder/shared-files
    PrivateTmp=True

  30. SELinux
    A dedicated SELinux context
    • Write access to work directories
    • Read access to configuration and data files
    • Connect to HTTP and postgresql ports
    • Listen on port 3030
    • Run the 'rudder remote run' command with sudo

  31. HTTP security
    • Enforce TLS1.2+ everywhere (except syslog!)
    • Option to check certificates in all HTTP requests
    • Not directly linked
    • Allows authentication both ways
    • For now, requires an existing PKI (and proper DNS setup)

  32. Rust
    • We wanted:
      • Reliability and security
      • Maintainability (<3 strong typing)
      • Low footprint (to allow "embedded" relayd)
      • Easy packaging and deployment
    • Chose Rust!

  33. To sum up
    We added a daemon between root server and nodes to:
    • Forward reports and inventories (inotify-based)
    • Check, parse and store reports on root server
    • Provide the file sharing API
    • Provide the policy and shared-files download API (for
    Windows)
    • Only requires httpd and the agent to synchronize data files

  34. Perspectives

  35. Future
    • Encryption option (using S/MIME)
    • Use S/MIME for inventories too
    • Allow policy updates over HTTPS for full HTTP
    communication
    • Diffs for all non-compliance or errors (including files!)
    • Connected mode for reactivity and continuity
    • Flexible RUDDER server distribution (container, roles,
    cloud, etc)
    • New (virtual) agents (maybe managed from relayd)
    • Check server certificate by default

  36. Feedback?

  37. Thank you!