Slide 1

Slide 1 text

Drone CI: Delivering Continuous Testing for large open source projects
Patrick Jahns @ DevOps Franken

Slide 2

Slide 2 text

Patrick Jahns
Senior Solutions Architect @ ownCloud
@patrick_jahns

Slide 3

Slide 3 text

ownCloud is the open platform for productivity and security in digital collaboration

Slide 4

Slide 4 text

Delivering CI/CD at ownCloud
• Hosted on GitHub and consists of ~80 separate GitHub repositories
• Built on top of a web application stack (PHP, Apache, JavaScript, CSS, HTML)
• Over 14000 unit tests and 2200 acceptance (UI / API) tests
• A pull request for “core” runs 15 hours of test time (feedback < 30 mins)
• Every night we run over 180 hours of tests
• Various infrastructure components
  – Relational databases (MySQL, MariaDB, PostgreSQL, OracleDB)
  – Memory caches (Memcached, Redis)
  – Storage providers (FileSystem, NFS, SMB, Swift, S3, OneDrive, Dropbox, etc.)
  – Identity / authentication providers (LDAP, Active Directory, Shibboleth / SAML)
  – Other infrastructure components (ClamAV, Elasticsearch, Collabora, etc.)
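The figures on this slide imply a lower bound on parallelism: 15 hours of test time with feedback in under 30 minutes needs at least 30 jobs running side by side. A minimal sketch of that arithmetic, assuming the work splits perfectly evenly and ignoring queueing and startup overhead (the function name is illustrative and not from the talk):

```python
import math

def min_parallel_jobs(total_test_hours, feedback_minutes):
    """Lower bound on concurrent jobs needed to finish within the feedback window.

    Assumes the work splits perfectly evenly and ignores queueing,
    image pulls, and service startup, so a real setup needs more headroom.
    """
    total_minutes = total_test_hours * 60
    return math.ceil(total_minutes / feedback_minutes)

# Figures from the slide: 15 hours of tests per pull request, feedback < 30 mins.
print(min_parallel_jobs(15, 30))  # 30
```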

Slide 5

Slide 5 text

Where it all started…

Slide 6

Slide 6 text

Where it all started...
Old infrastructure / setup
• Travis CI
  – DAV (Litmus, CardDAV, CalDAV) tests
  – PHP syntax checks
  – Selenium testing (arrived beginning of 2017)
• Jenkins
  – Unit tests with different PHP and database versions
  – Storage-specific tests (Swift, Ceph, Samba, S3)
  – Integration tests
  – Upgrade tests
  – Smashbox tests

Slide 7

Slide 7 text

Where it all started...
• CI environment not reproducible locally, e.g. “works for me™”
• Test suites encountered regular timeouts
• Feedback / results of test runs sometimes only after days
• No real plugin system, not extensible
• Travis wasn't able to provide extended build power on our open-source repositories (only possible on private repositories)*
* Changed in summer 2018

Slide 8

Slide 8 text

Where it all started...
• Difficult to keep up to date
  – Plugin updates result in changes to config format
  – Only managed via web UI
• Secrets are managed via web UI or hacky API scripts
• Frequently ran out of disk space
• Wasn't cleaning up containers properly
• Containers (services) required a lot of bash magic
• Test results took hours to complete: very slow feedback cycle
• Static number of executors

Slide 9

Slide 9 text

Leaving the dark Ages…

Slide 10

Slide 10 text

Drone CI
Your friendly neighborhood CI system

Slide 11

Slide 11 text

Drone CI
Your friendly neighborhood CI system

Slide 12

Slide 12 text

Drone CI
Your friendly neighborhood CI system
• Container-native CI/CD platform (everything runs within containers)
• Easy to install & maintain (docker pull drone/drone)
• Isolated builds
• Multi-arch (amd64, arm64, …)
• Multi-machine builds (fan-out & fan-in)
• Simple YAML configuration
• Integrates with several VCS providers (GitHub, Gitea, GitLab, Bitbucket …)
• Rich set of official plugins (any container can be a plugin)
• Execute locally with “drone exec”
• Open source (https://github.com/drone)
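The fan-out & fan-in idea can be illustrated independently of Drone: independent test suites run in parallel, and a final step proceeds only once all of them have reported. The following is a rough Python sketch of that pattern, not Drone's implementation; `run_suite` and the suite names are hypothetical stand-ins for containerized jobs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(name):
    """Stand-in for one test suite running on its own machine (hypothetical)."""
    # A real job would start containers and run the suite; here every suite passes.
    return name, True

def fan_out_fan_in(suites):
    """Run all suites in parallel (fan-out), then combine the results (fan-in)."""
    with ThreadPoolExecutor(max_workers=max(1, len(suites))) as pool:
        results = dict(pool.map(run_suite, suites))
    # Fan-in: the final step only proceeds if every branch succeeded.
    return all(results.values())

print(fan_out_fan_in(["unit-mysql", "unit-postgres", "acceptance-api"]))  # True
```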

Slide 13

Slide 13 text

Drone CI
Your friendly neighborhood CI system
[Diagram: Server and Agent; the Agent runs STEP1 (git clone), STEP2 (make), and STEP3 (publish), with SERVICES and a WORKSPACE]

Slide 14

Slide 14 text

Drone CI
Let's migrate to Drone
• Provision drone-server & drone-agents via Ansible
• Provide Docker containers for infrastructure components (PHP / databases / storages)
• Gradual migration of “owncloud/core” from Jenkins / Travis to Drone
  – Basic linting and unit testing
  – Gradually migrated integration / acceptance tests and UI tests
• Expand Drone to app repositories
  – Required “plugin” to install and configure ownCloud => https://github.com/owncloud-ci
  – Built further custom plugins, e.g. recorder

Slide 15

Slide 15 text

Drone CI
Recap: where are we at now?
• Too many systems to maintain => Need to maintain 3 systems: Jenkins, Drone, Travis
• Secrets management => Drone provided us with API & UI
• Frequently ran out of disk space => Docker isn't great at cleaning up after itself
• Static number of executors / timeouts => No time restriction, but amount of parallel jobs limited
• Containers required a lot of bash magic => Container native
• CI environment not reproducible locally => Containers & drone exec
• No plugin system / limited extensibility => Any container can be a plugin

Slide 16

Slide 16 text

Entering the golden age…

Slide 17

Slide 17 text

Entering the golden age
Final infrastructure
• Dropped Travis and Jenkins entirely
• Scaling Drone agents on demand
• Number of test suites still increasing
• Entirely version-controlled and easily manageable infrastructure
  – Terraform
  – Ansible
  – Hetzner Cloud
  – Autoscaler

Slide 18

Slide 18 text

Entering the golden age
Welcome “Autoscaler”
[Diagram: Server, Autoscaler, and four Agents]

Slide 19

Slide 19 text

Entering the golden age
Welcome to “Autoscaler”
• Support for various infrastructure providers (AWS, Azure, Packet, OpenStack, Hetzner Cloud …)
• Simple service connected to the Drone server
• Hooked into the Drone CLI, e.g. “drone server create”
• Checks the Drone queue in a loop
• Launches servers based on a cloud-init config
• Starts the Drone agent via a remote Docker connection (secured by TLS)
• Unregisters the Drone agent if it is not needed anymore
• Destroys the server instance after a minimal amount of time
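The loop described on this slide can be sketched as a single planning step: look at the queue, decide how many servers to create, and pick idle servers that are old enough to destroy. This is a simplified Python illustration and not the Autoscaler's actual code; `Agent`, `plan_scaling`, and all default values (`capacity`, `min_age`, `max_agents`) are assumptions made for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    created_at: float  # seconds, on whatever clock the caller uses
    busy: bool = False

def plan_scaling(pending, agents, now, capacity=2, min_age=1800.0, max_agents=10):
    """Plan one loop iteration: (servers to create, names of servers to destroy).

    pending  -- builds waiting in the queue
    capacity -- builds one agent can run concurrently (assumed value)
    min_age  -- minimal lifetime in seconds before an idle server is destroyed
    """
    idle = [a for a in agents if not a.busy]
    wanted = math.ceil(pending / capacity)           # agents needed for the queue
    headroom = max(max_agents - len(agents), 0)      # never exceed the pool limit
    to_create = min(max(wanted - len(idle), 0), headroom)
    to_destroy = []
    if pending == 0:
        # Only idle servers past their minimal lifetime are torn down.
        to_destroy = [a.name for a in idle if now - a.created_at >= min_age]
    return to_create, to_destroy

agents = [Agent("agent-1", 0.0, busy=True), Agent("agent-2", 0.0)]
print(plan_scaling(5, agents, now=100.0))   # (2, [])
print(plan_scaling(0, agents, now=4000.0))  # (0, ['agent-2'])
```

Keeping the decision a pure function of the queue length and the agent list makes it easy to test without touching a cloud provider; a real service would wrap it in a polling loop and call the provider's API.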

Slide 20

Slide 20 text

Entering the golden age
[Chart: cumulated runtime, time to finish, and time to finish (including queue wait), plotted from November to mid-June; y-axis from 0 to 35000]

Slide 21

Slide 21 text

Entering the golden age

Slide 22

Slide 22 text

Evolution of our CI

Slide 23

Slide 23 text

Evolution of our CI
What else is missing?
• Native integration of build scheduling in YAML configuration
• Split feedback on pull requests per test suite
• Archiving build data (we got lots of build logs)
• Downstream cross-repository checks
• Upstream cross-repository checks
• Ability to restart one branch of a fan-in/fan-out scenario
• Triggers
  – Invoke builds from different sources
  – Split pipelines into different configurations

Slide 24

Slide 24 text

Evolution of our CI
Lessons learned
• Just adding ${FAVORITE_CI} as a tool to your company doesn't guarantee success
  – Promote the tool in your team / get the team on board
  – Gradual adoption helped to gain traction within teams
• Everything works great until you move to production
  – Load can and will kill your application at some point … and also your CI system
• Smaller infrastructure components vs. monolithic infrastructure
  – Docker default network limitations
  – General resource limits, e.g. disks, IOPS, CPU, memory
• Technical fallacies
  – Everything that can be unreachable will be unreachable (this is also true for your SaaS repository provider)
  – Database compression is really not a good idea for write-heavy loads
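The slides do not say how unreachable dependencies were handled. One common mitigation, shown here purely as an assumption-laden sketch, is to retry network-bound steps with exponential backoff; `with_retries` and `flaky_clone` are hypothetical names.

```python
import time

def with_retries(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between tries

calls = {"n": 0}

def flaky_clone():
    """Hypothetical step that fails twice before the host becomes reachable."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("repository host unreachable")
    return "cloned"

# The sleep function is swapped out so the example finishes instantly.
print(with_retries(flaky_clone, sleep=lambda seconds: None))  # cloned
```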

Slide 25

Slide 25 text

Thank You