Slide 1

Slide 1 text

Spring Batch on Kubernetes
Efficient batch processing at scale
Mahmoud Ben Hassine
March 2020
Copyright © 2020 VMware, Inc. or its affiliates.

Slide 2

Slide 2 text

About me
● Principal Software Engineer at VMware
● Spring Batch Co-Lead
● Open source enthusiast
@benas @b_e_n_a_s

Slide 3

Slide 3 text

What about you?
● Any Spring Batch users?
● Any Spring Boot users?
● Any Kubernetes users?

Slide 4

Slide 4 text

Agenda
● Spring Batch 101
● Kubernetes Jobs 101
● Spring Batch on Kubernetes, a perfect match!
● Demo
● Q+A

Slide 5

Slide 5 text

Spring Batch 101

Slide 6

Slide 6 text

What is batch processing?
“Batch processing … is defined as the processing of a finite amount of data without interaction or interruption.”
Michael Minella, The Definitive Guide to Spring Batch

Slide 7

Slide 7 text

Batch domain language (1/2)

Slide 8

Slide 8 text

Batch domain language (2/2)
Once successfully completed, a job instance cannot be (re)started again.
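As a rough illustration of that rule, here is a minimal plain-Java sketch. It assumes a job instance is identified by the job name plus its identifying parameters, and it uses an in-memory map where Spring Batch would use its database-backed job repository. The class, enum, and method names are hypothetical and are not Spring Batch APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a job repository (assumption: the real state lives in a database).
class JobInstanceGuardSketch {

    enum Status { COMPLETED, FAILED }

    // Key: job name plus identifying parameters, e.g. "endOfDayJob|date=2020-03-01".
    private final Map<String, Status> lastStatusByInstance = new HashMap<>();

    // Runs the work unless this job instance has already completed successfully.
    Status launch(String jobName, String identifyingParams, Runnable work) {
        String instanceKey = jobName + "|" + identifyingParams;
        if (lastStatusByInstance.get(instanceKey) == Status.COMPLETED) {
            throw new IllegalStateException("Job instance already complete: " + instanceKey);
        }
        Status status;
        try {
            work.run();
            status = Status.COMPLETED;
        } catch (RuntimeException e) {
            status = Status.FAILED; // a failed instance may be launched again
        }
        lastStatusByInstance.put(instanceKey, status);
        return status;
    }
}
```

A failed instance can be launched again with the same parameters, while a completed one is rejected; launching with different identifying parameters creates a new instance.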

Slide 9

Slide 9 text

Batch domain model

Slide 10

Slide 10 text

Chunk-oriented processing
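The diagram for this slide did not survive in the text. As a hedged sketch of the general idea, the loop below reads items one at a time, processes each, and writes them in fixed-size chunks. It is plain Java with hypothetical names, not Spring Batch's own reader, processor, or writer interfaces, and it leaves out transactions entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java illustration of chunk-oriented processing; not Spring Batch's actual classes.
class ChunkLoopSketch {

    // Reads items one at a time, processes each, and writes them in chunks of chunkSize.
    // Returns the number of chunks written.
    static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        int chunks = 0;
        List<O> chunk = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == chunkSize) {
                writer.accept(new ArrayList<>(chunk)); // in a real job this would be one transaction
                chunk.clear();
                chunks++;
            }
        }
        if (!chunk.isEmpty()) { // last, possibly partial chunk
            writer.accept(new ArrayList<>(chunk));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        Function<Integer, String> processor = i -> "item-" + i;
        Consumer<List<String>> writer = written::add;
        int chunks = run(List.of(1, 2, 3, 4, 5).iterator(), processor, writer, 2);
        System.out.println(chunks + " chunks: " + written);
    }
}
```

Writing per chunk rather than per item is what makes a chunk boundary a natural commit and restart point.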

Slide 11

Slide 11 text

Core Features
Robustness
● Repeat/Retry/Skip/Restart
● Transaction management
● Chunk-oriented processing
Scalability
● Multi-threaded steps
● Parallel steps
● Remote chunking/partitioning
Flexibility
● XML/Java config styles
● Declarative I/O
● Rich library of item readers/writers
And based on Spring Framework!

Slide 12

Slide 12 text

Use cases
● ETL processing
● Generation of statements/reports
● Data analysis
● Data science
● Business intelligence

Slide 13

Slide 13 text

History of Spring Batch
v1.0 (Mar 28, 2008)
• Initial APIs
• Item-oriented processing
• XML configuration
• Java 1.4
• Spring Framework 2.5
v2.0 (Apr 11, 2009)
• Step scope
• Chunk-oriented processing
• Remote chunking/partitioning
• Java 5
• Spring Framework 3
v2.2 (Jun 05, 2013)
• Java configuration
• Spring Data support
• Non-identifying Job params
• AMQP support
• SQLFire support
v3.0 (May 22, 2014)
• Job scope
• JSR-352 support
• SQLite support
• Spring Batch Integration
• Spring Boot support
v4.0 (Dec 1, 2017)
• Builders for readers
• Builders for writers
• Java 8
• Spring Framework 5

Slide 14

Slide 14 text

Why would you need a framework like Spring Batch? https://www.deseret.com/2000/8/29/19526136/oops-walgreen-accidentally-bills-credit-card-customers-twice

Slide 15

Slide 15 text

Why would you need a framework like Spring Batch? https://www.smh.com.au/national/anzs-45m-bank-bungle-20060519-gdnksi.html

Slide 16

Slide 16 text

Why would you need a framework like Spring Batch? https://www.computerweekly.com/news/2240042675/NatWest-in-double-debit-error

Slide 17

Slide 17 text

Kubernetes Jobs 101

Slide 18

Slide 18 text

Kubernetes Jobs
● A Job creates one or more Pods and ensures that a specified number of them successfully terminate.
● The Job object will start a new Pod if the first Pod fails or is deleted (for example due to a node hardware failure or a node reboot).
● You can also use a Job to run multiple Pods in parallel.
● Deleting a Job will clean up the Pods it created.
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion

Slide 19

Slide 19 text

Kubernetes Jobs - example

Slide 20

Slide 20 text

Kubernetes CronJobs
● A CronJob creates Jobs on a time-based schedule (Unix-like cron).
● There are certain circumstances where two jobs might be created, or no job might be created [..] Therefore, jobs should be idempotent.
● The CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error.
● The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/

Slide 21

Slide 21 text

Kubernetes CronJobs - example

Slide 22

Slide 22 text

Why would you need Kubernetes (for your batch jobs)?
● It is not hype anymore: Kubernetes is awesome! Give it a try, even if you are not Google or Netflix (see the reasonable migration path suggested in the next slides)
● Ability to run batch jobs on any node in the cluster with a single command
● Ability to query the entire cluster for running jobs with a single command
● Ability to automatically run jobs to completion (in case of node/pod failure)
● Efficient resource management (k8s plays Tetris with your cluster)
● Scalability

Slide 23

Slide 23 text

Spring Batch on Kubernetes
A perfect match!

Slide 24

Slide 24 text

Cloud-friendly batch jobs, how?
● Spring Batch jobs maintain their state in an external database, and as such, they are already 12-factor processes [1] (and can easily be made fully 12-factor: log to standard output, configure from the environment, etc.)
● On restart after a failure, skip the steps that completed successfully in the previous run (cost efficient)
● Retry failed items in case of transient errors (like a call to a web service that might be temporarily down or being rescheduled in a cloud environment)
● Restart from the last save point within the same step thanks to the chunk-oriented processing model (cost efficient)
● Safe against duplicate job executions (due to human error, k8s pod rescheduling, or the CronJob limitation whereby the same job might run twice)
[1]: https://12factor.net/processes
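The "restart from the last save point" bullet can be sketched roughly as follows. This is a plain-Java illustration under stated assumptions: a map stands in for the external job database, the key "step1.next" and the simulated failure are invented for the example, and none of the names are Spring Batch APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of restarting from the last save point; the map stands in for the external job database.
class RestartSketch {

    // Processes items in chunks and records the index of the next unprocessed item after each chunk.
    // failAtIndex < 0 means no simulated failure.
    static void runStep(List<String> items, int chunkSize, Map<String, Integer> store,
                        List<String> output, int failAtIndex) {
        int start = store.getOrDefault("step1.next", 0); // resume where the previous run stopped
        for (int from = start; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            List<String> chunk = new ArrayList<>();
            for (int i = from; i < to; i++) {
                if (i == failAtIndex) {
                    throw new IllegalStateException("simulated failure at item " + i);
                }
                chunk.add(items.get(i).toUpperCase());
            }
            output.addAll(chunk);        // "write" the whole chunk
            store.put("step1.next", to); // save point; a real job would commit it with the chunk
        }
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d", "e");
        Map<String, Integer> store = new HashMap<>();
        List<String> output = new ArrayList<>();
        try {
            runStep(items, 2, store, output, 3); // first run fails inside the second chunk
        } catch (IllegalStateException e) {
            System.out.println("first run failed, save point = " + store.get("step1.next"));
        }
        runStep(items, 2, store, output, -1);    // second run resumes at the save point
        System.out.println(output);
    }
}
```

Because the failed chunk is never written and the save point only advances after a successful write, the second run redoes at most one chunk instead of the whole step.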

Slide 25

Slide 25 text

Containerised batch jobs, why?
● Separate logs
● Independent life cycle (bugs/features, deployment, etc.)
● Separate parameters / exit codes!
● Restartability (in case of failure, only restart the failed job)
● Testability
● Scalability
● Resource usage efficiency (optimized resource limits => better pod scheduling)

Slide 26

Slide 26 text

BigBang-less migration plan
● Keep the database outside Kubernetes [1], migrate only stateless batch jobs: gradual, hybrid migration path
● Traditional job => bootify it => 12 factorize it => dockerize it => kubernetize it
● Use Kubernetes namespaces for testing/deploying jobs in staging/production [2]
● Update CronJobs live from CI/CD with “kubectl set image” or its REST API equivalent [3]
[1]: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-mapping-external-services
[2]: https://kubernetes.io/blog/2015/08/using-kubernetes-namespaces-to-manage/
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#patch-cronjob-v1beta1-batch

Slide 27

Slide 27 text

What you should know before the migration
● Job/Container exit code is very important!
● Understand graceful/abrupt shutdown implications
● Choose the right job pattern [1] (depending on the volume of job instances)
● Choose the right restart/concurrency policies
● Understand CronJob limitations [2]
[1]: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-patterns
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-job-limitations
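To illustrate why the exit code matters: Kubernetes decides whether a container succeeded from its process exit code, where 0 means success and anything else means failure. The sketch below maps a final job status to an exit code. The status names and the specific non-zero values are assumptions made for illustration, not a mapping defined by Spring Batch or Kubernetes.

```java
// Hypothetical status-to-exit-code mapping; status names and non-zero values are illustrative.
class ExitCodeSketch {

    enum JobStatus { COMPLETED, FAILED, STOPPED }

    // Exit code 0 signals success to Kubernetes; any non-zero code marks the container as failed.
    static int toExitCode(JobStatus status) {
        switch (status) {
            case COMPLETED: return 0;
            case STOPPED:   return 2;
            default:        return 1;
        }
    }

    public static void main(String[] args) {
        // Assumption: the final status arrives as the first argument; a real job would compute it.
        JobStatus status = args.length > 0 ? JobStatus.valueOf(args[0]) : JobStatus.COMPLETED;
        System.exit(toExitCode(status));
    }
}
```

If a failed batch job still exits with 0, Kubernetes records the Pod as succeeded and never retries it, so the mapping has to be deliberate.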

Slide 28

Slide 28 text

Job deployment and administration with Spring Cloud Data Flow
(Figure: Spring Batch + Spring Boot + Spring Cloud Data Flow)

Slide 29

Slide 29 text

DEMO

Slide 30

Slide 30 text

Q+A

Slide 31

Slide 31 text

Thank you!
© 2020 Spring. A VMware-backed project.
Code: https://github.com/benas/spring-batch-lab/tree/master/talks
Slides: https://speakerdeck.com/benas/spring-batch-kubernetes
Spring Batch home: https://projects.spring.io/spring-batch/
Kubernetes home: https://kubernetes.io
GitHub: @benas
Twitter: @b_e_n_a_s