Composable Data Pipelines for not-so-Big Data

A case study of how some of Clojure's features helped us at numberz solve the problems we faced in onboarding our enterprise customers' data onto our platform.

Akaash Patnaik

November 10, 2019

Transcript

  1. Composable Data Pipelines for not-so-Big Data

  2. Composable Data Pipelines for not-so-Big Data Or: Why you may not need that shiny, new data processing framework

  3. Composable Data Pipelines for not-so-Big Data Or: Why you may not need that shiny, new data processing framework (And not because your Data isn’t Big enough. I'm sure it is.)

  4. Composable Data Pipelines for not-so-Big Data Or: Why you may not need that shiny, new data processing framework (And not because your Data isn’t Big enough. I'm sure it is.) (But because sometimes you're not working with ALL of that data, and you may have simpler tools at your disposal.)

  5. Hi! I’m Akaash. Web Developer. Underscore.js user -turned- Functional Programming fanboy -turned- Clojure convert. @worldpiece akaash.xyz

  6. About this talk… • Experience report of a team using Clojure in Production for the first time. • Demonstrate how plain ol’ Clojure and its features provide reasonably powerful alternatives to distributed computation frameworks.

  7. Outline • Context - numberz & what it does • Business problem • Iterations - improving the solution • Challenges faced • Conclusion

  8. Context • numberz is a fin-tech startup • Our product helps enterprises automate & streamline their payments collection processes • We need financial data from ERPs to be pulled into our system

  9. Problem • Direct integrations with ERPs aren’t a feasible option (ERP inflexibility, licensing issues, etc.) • Need to rely on a hodgepodge of pre-existing canned reports (Excel, CSV) to get the data we need • Reports are large-ish: ~50 MB in size • Custom, complicated, compute-intensive transformations needed to get data into our format

  10. Decisions, decisions… What we wanted • Support for reading/writing large spreadsheets (Excel, CSV) • Expressive data transformations • Support for processing reasonably large data-sets, preferably in-memory What we didn’t want • Maintenance overhead of a distributed computing platform • Incidental complexity *cough*type-systems*cough* • Proprietary ETL solutions with high licensing costs

  11. Enter Clojure… • Java interop gave us robust options for processing large Excel sheets • Lazy sequences, transducers helped with handling large payloads in-memory • Dynamic typing helped us avoid proliferation of unnecessary type definitions • Declarative data transformations

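To illustrate the in-memory point (a minimal sketch, not the actual numberz code): a transducer pipeline parses and filters report rows one at a time, so a large file never needs to be fully realized in memory.

```clojure
(require '[clojure.string :as str])

;; Illustrative transducer pipeline: parse each CSV line into fields,
;; then drop malformed rows. Each step sees one row at a time.
(def parse-row (map #(str/split % #",")))
(def valid-row? (filter #(= 3 (count %))))
(def xform (comp parse-row valid-row?))

;; In production this would be a line-seq over a file reader;
;; here a small vector of strings stands in for a canned report.
(def rows ["a,1,2" "b,3,4" "malformed"])

(into [] xform rows)
;; => [["a" "1" "2"] ["b" "3" "4"]]
```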
  12. Iteration 1 Custom Transformation Scripts Pros: • Simple enough to get us going • Transformations were represented as pipelines • Threading macros kept the code clean, readable

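A sketch of the script-style pipelines from Iteration 1 (the invoice fields and helper functions are hypothetical): the `->>` threading macro lets the transformation read as a top-to-bottom pipeline.

```clojure
;; Hypothetical row shape and helpers, for illustration only.
(defn normalize-amount [row]
  (update row :amount #(Double/parseDouble %)))

(defn overdue? [row]
  (neg? (:days-to-due row)))

(def rows
  [{:invoice "INV-1" :amount "120.50" :days-to-due -3}
   {:invoice "INV-2" :amount "80.00"  :days-to-due 10}])

;; The pipeline: parse amounts, keep overdue invoices, project the id.
(def overdue-invoices
  (->> rows
       (map normalize-amount)
       (filter overdue?)
       (map :invoice)))

overdue-invoices
;; => ("INV-1")
```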
  13. Iteration 1 Custom Transformation Scripts Cons: • Expensive - required end-to-end involvement of devs in customer onboarding • Lack of standardisation was a potential risk when incorporating fixes/enhancements at a later time • The above was especially true for “multi-branch” pipelines • Clojure familiarity was low, hence only a handful of team members were able to contribute

  14. Iteration 2 DAG-Based Transformations • Structurally standardised the transformations by representing them as Directed Acyclic Graphs (DAGs) • Similar to workflow engines, e.g. Airflow, Oozie • Each node in the graph represented a simple transformation • Composing multiple nodes together, in the right sequence, allowed for complex, multi-branch flows

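A minimal sketch of the idea (the node names and map shape are hypothetical, not the numberz DSL): a DAG as a map of nodes, each with its dependencies and a function of its parents' outputs, evaluated by resolving dependencies first. Two nodes branching off `:raw` and re-joining in `:combined` shows the multi-branch case.

```clojure
;; Each node: a vector of parent node names and a function that
;; receives the parents' outputs as a seq.
(def dag
  {:raw      {:deps [] :fn (fn [_] [1 2 3 4])}
   :doubled  {:deps [:raw] :fn (fn [[xs]] (map #(* 2 %) xs))}
   :evens    {:deps [:raw] :fn (fn [[xs]] (filter even? xs))}
   :combined {:deps [:doubled :evens] :fn (fn [[a b]] (concat a b))}})

(defn run-node [dag results node]
  (if (contains? results node)
    results
    (let [{deps :deps f :fn} (dag node)
          ;; Resolve all parents before evaluating this node.
          results (reduce (partial run-node dag) results deps)]
      (assoc results node (f (map results deps))))))

(defn run-dag [dag node]
  (get (run-node dag {} node) node))

(run-dag dag :combined)
;; => (2 4 6 8 2 4)
```

A topological walk like this also makes each node's output cacheable, since `results` memoizes every node already computed.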
  15. Iteration 2 DAG-Based Transformations • Choice of Clojure proved to be an astute one • We wrote our own DSL to define nodes and DAGs, declaratively • Also introduced clojure.spec for validation and generative testing

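For example, clojure.spec can validate a node definition before it enters a DAG (a sketch; the spec and key names here are illustrative, not the numberz DSL):

```clojure
(require '[clojure.spec.alpha :as s])

;; Specs for a hypothetical node definition.
(s/def ::name keyword?)
(s/def ::deps (s/coll-of keyword? :kind vector?))
(s/def ::fn fn?)
(s/def ::node (s/keys :req-un [::name ::deps ::fn]))

(s/valid? ::node {:name :doubled :deps [:raw] :fn identity})
;; => true

;; Fails: :name is not a keyword and :fn is missing.
(s/valid? ::node {:name "doubled" :deps [:raw]})
;; => false
```

`s/explain` on the failing case reports exactly which keys violated which spec, which is what made spec failures actionable during onboarding changes.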
  16. Iteration 2 DAG-Based Transformations

  17. Iteration 2 DAG-Based Transformations Pros: • Helped reduce amount of code to be written • Moved us towards an even more declarative expression of transformation • clojure.spec gave greater confidence when making changes to transformers Cons: • Less readable than using threading macros • Higher memory footprint than custom scripts, but still works on a single machine

  18. Future Enhancements • Serialize the DAG & node definitions and persist to database • Move to declarative, config-driven expression of individual nodes in the DAG • Generate DAG documentation, diagrams from definitions

  19. Challenges Faced • Are we called Clojurians or Clojurists? • Clojure learning curve • Stacktraces pretty complicated for beginners, especially when working with lazy sequences and macros • Low convergence within the community on de facto libraries/frameworks for common tasks

  20. Conclusions • We chose to trade off the scalability of a distributed computation platform for the simplicity and predictability of a single-machine setup. • A lot of Clojure’s features fit very well with our problem space. • We’d imagine this would be the case for any ETL-like use case on data-sets that aren’t large enough to warrant Big-Data solutions.

  21. – Joseph ‘Proposition Joe’ Stewart, The Wire “Keep it boring, String. You keep it dead f***king boring.”

  22. Thank you!