
Out with the Monolith - a new data ingestion system in the Heliophysics Archives

Short presentation describing the monolithic data ingestion system we previously had and the new ingestion system created using:
Apache Camel
Spring Boot
Spring Data JPA

Jonathan Cook

November 10, 2017

Transcript

  1. ESA UNCLASSIFIED - Releasable to the Public

    Out with the Monolith: A new data ingestion system
    Jonathan Cook – Software Engineer, ESDC Heliophysics Team
  2. P2SA New Ingestion Application: Background and Motivation

    Simple data ingestion use case:
    • A scientist has added a new, super important field to his data files, which we need to ingest into our database.
    • Technically, this translates into adding one new column to the observation table in our database and populating it during ingest.
    • In the case of P2SA, our database has only 8 tables.
    • This should be a simple, testable change, right?

    How long should this type of change take to code, test and deploy?
    • A couple of hours?
  3. SSA/P2SA Status: Existing Import and Ingest System

    This is not fetching data remotely.
  4. P2SA New Ingestion Application: Existing Import and Ingest System (cont.)

    In my experience, this type of simple change can take around 1-2 days and, of course, when we finish, the scientist says "Can you do the same again" for...

    Drawbacks:
    • We have to update and deploy 8 modules, the same number of modules as database tables!
    • It is error-prone: along the way it is easy to make a mistake or forget to do something.
    • Unit tests are difficult to write (although possible), as we need many mocks.
    • Running and debugging locally is very complicated (RMI registry).
    • The technology is older in general, and performance is not great.

    When something is complicated, we generally tend to avoid doing it.
  5. P2SA New Ingestion Application: Created a New Ingestion Application

    Using mature and robust frameworks:
    • Spring Boot - https://projects.spring.io/spring-boot/
    • Apache Camel - http://camel.apache.org/
    • Spring Data JPA - https://projects.spring.io/spring-data-jpa/

    Spring Boot:
    • Makes it very easy to create stand-alone, production-grade applications that you can "just run".
    • Prefers convention over configuration.
    • Comes with many features, such as metrics, health checks and externalized configuration, all for free.
  6. P2SA New Ingestion Application: Created a New Ingestion Application (cont.)

    Apache Camel:
    • Concrete implementation of the widely used Enterprise Integration Patterns (EIP): http://www.enterpriseintegrationpatterns.com/patterns/messaging/toc.html
    • Provides a way of constructing routes (data flows) to wire processing and transports together.
    • Connectivity to a great variety of transports and APIs.
    • Highly configurable.

    Simple Java example:
    from("file:/p2sa/stage/lyra/data").to("file:/p2sa/final-repository");
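The one-line route above can be wrapped in a RouteBuilder and run from a plain CamelContext. This is a minimal sketch, not the P2SA code: the class name `StageToRepositoryRoute`, the route id, the log message and the 10-second run time are assumptions made for illustration, and Apache Camel (camel-core) must be on the classpath. In a Spring Boot application that uses Camel's Spring Boot starter, a RouteBuilder registered as a Spring bean is picked up automatically, so the `main` method would not be needed.

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

// Hypothetical route class: moves every file found in a staging
// directory to a final repository directory.
class StageToRepositoryRoute extends RouteBuilder {

    private final String stageUri;
    private final String repositoryUri;

    StageToRepositoryRoute(String stageUri, String repositoryUri) {
        this.stageUri = stageUri;
        this.repositoryUri = repositoryUri;
    }

    @Override
    public void configure() {
        from(stageUri)                                  // poll the staging directory
            .routeId("stage-to-repository")             // assumed id, useful in logs
            .log("Ingesting ${header.CamelFileName}")   // file name set by the file component
            .to(repositoryUri);                         // write the file to the repository
    }

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new StageToRepositoryRoute(
                "file:/p2sa/stage/lyra/data",
                "file:/p2sa/final-repository"));
        context.start();
        Thread.sleep(10_000);   // let the file consumer poll for a while
        context.stop();
    }
}
```

By default Camel's file consumer moves each processed file into a `.camel` subdirectory of the source folder. Options such as `delete=true` or `noop=true` on the `from` URI change that behaviour.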
  7. P2SA New Ingestion Application: Created a New Ingestion Application (cont.)

    Spring Data JPA:
    • A much simplified version of DAOs, minus all the implementations.
    • Uses the concept of CRUD repositories.

    Interesting for other existing archives:
    • Changed the connection pooling to use HikariCP (https://github.com/brettwooldridge/HikariCP), which seems to be considered the de facto connection pool to use now over C3P0.
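To make "DAOs minus all the implementations" concrete, here is a minimal sketch of a CRUD repository. The entity `Observation`, its fields and the finder `findByInstrument` are hypothetical names, not the P2SA schema, and the `javax.persistence` imports assume a 2017-era Spring Boot (newer versions use `jakarta.persistence`). Spring Data supplies the repository implementation at runtime, so a new column is mostly a matter of adding a mapped field and populating it during ingest.

```java
import java.util.List;

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.springframework.data.repository.CrudRepository;

// Hypothetical entity mapped to an observation table.
@Entity
class Observation {

    @Id
    @GeneratedValue
    private Long id;

    private String instrument;
    private String fileName;
    // A new column would be added here as one more field.

    protected Observation() {
        // no-argument constructor required by JPA
    }

    Observation(String instrument, String fileName) {
        this.instrument = instrument;
        this.fileName = fileName;
    }

    Long getId() { return id; }
    String getInstrument() { return instrument; }
    String getFileName() { return fileName; }
}

// No implementation class is written by hand: the save, find and delete
// operations come from CrudRepository, and the query for findByInstrument
// is derived from the method name.
interface ObservationRepository extends CrudRepository<Observation, Long> {
    List<Observation> findByInstrument(String instrument);
}
```

An ingest step would then receive an `ObservationRepository` by injection and call `save(new Observation(...))` for each parsed file.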
  8. P2SA New Ingestion Application: Conclusion and Results

    • LOC reduced from 9706 to 1.8k, and from 8 modules to 1.
    • Writing tests is easy; see @DataJpaTest for using in-memory databases from tests. 82% unit test coverage.
    • Running and debugging locally is easy (override local props).
    • Performance is much better.

    Virtual machine, 4GB RAM:

    Input                      Old system: time to ingest (excluding import)    New system: time to ingest
    2000 FITS files            52 min                                           1 min 8 s
    10000 FITS files (21GB)    4 h 35 min                                       5 min 11 s
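The speed-up implied by these timings can be worked out directly. The sketch below only does the arithmetic on the figures from the slide; the class and method names are made up for illustration.

```java
import java.time.Duration;

class IngestSpeedup {

    // Ratio of old to new ingest time; a value above 1 means the new system is faster.
    static double speedup(Duration oldTime, Duration newTime) {
        return (double) oldTime.getSeconds() / newTime.getSeconds();
    }

    // Average ingest rate in files per second.
    static double filesPerSecond(int files, Duration time) {
        return (double) files / time.getSeconds();
    }

    public static void main(String[] args) {
        // Timings as reported on the slide.
        Duration old2k = Duration.ofMinutes(52);
        Duration new2k = Duration.ofMinutes(1).plusSeconds(8);
        Duration old10k = Duration.ofHours(4).plusMinutes(35);
        Duration new10k = Duration.ofMinutes(5).plusSeconds(11);

        System.out.println("2000 files:  speed-up " + speedup(old2k, new2k)
                + ", new rate " + filesPerSecond(2000, new2k) + " files/s");
        System.out.println("10000 files: speed-up " + speedup(old10k, new10k)
                + ", new rate " + filesPerSecond(10000, new10k) + " files/s");
    }
}
```

This gives a speed-up of roughly 46 times for the 2000-file run and roughly 53 times for the 10000-file run, with the new system ingesting about 30 files per second in both cases.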
  9. P2SA New Ingestion Application: Conclusion and Results (cont.)
  10. P2SA New Ingestion Application: Final Points

    • Definitely worth considering this approach for new archives or small existing archives.
    • Lots of potential to quickly develop other useful services with these frameworks.
    • Happy to show or explain more anytime :-)