Composable Data Pipelines for not-so-Big Data

A case study of how some of Clojure's features helped us at numberz solve the problems we faced in onboarding our enterprise customers' data onto our platform.

Akaash Patnaik

November 10, 2019

Transcript

  1. Composable Data Pipelines for not-so-Big Data

  2. Composable Data Pipelines for not-so-Big Data Or: Why you may not need that shiny, new data processing framework

  3. Composable Data Pipelines for not-so-Big Data Or: Why you may not need that shiny, new data processing framework (And not because your Data isn’t Big enough. I'm sure it is.)

  4. Composable Data Pipelines for not-so-Big Data Or: Why you may not need that shiny, new data processing framework (And not because your Data isn’t Big enough. I'm sure it is.) (But because sometimes you're not working with ALL of that data, and you may have simpler tools at your disposal.)

  5. Hi! I’m Akaash. Web Developer. Underscore.js user -turned- Functional Programming fanboy -turned- Clojure convert. @worldpiece akaash.xyz

  6. About this talk… • Experience report of a team using Clojure in Production for the first time. • Demonstrate how plain ol’ Clojure and its features provide reasonably powerful alternatives to distributed computation frameworks.

  7. Outline • Context - numberz & what it does • Business problem • Iterations - improving the solution • Challenges faced • Conclusion

  8. Context • numberz is a fin-tech startup • Our product helps enterprises automate & streamline their payments collection processes • We need financial data from ERPs to be pulled into our system

  9. Problem • Direct integrations with ERPs aren’t a feasible option (ERP inflexibility, licensing issues, etc.) • Need to rely on a hodgepodge of pre-existing canned reports (Excel, CSV) to get the data we need • Reports are large-ish: ~50 MB in size • Custom, complicated, compute-intensive transformations needed to get data into our format

  10. Decisions, decisions… What we wanted • Support for reading/writing large spreadsheets (Excel, CSV) • Expressive data transformations • Support for processing reasonably large data-sets, preferably in-memory What we didn’t want • Maintenance overhead of a distributed computing platform • Incidental complexity *cough*type-systems*cough* • Proprietary ETL solutions with high licensing costs

  11. Enter Clojure… • Java interop gave us robust options for processing large Excel sheets • Lazy sequences, transducers helped with handling large payloads in-memory • Dynamic typing helped us avoid proliferation of unnecessary type definitions • Declarative data transformations

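To illustrate the in-memory point (a minimal sketch, not the actual numberz code): a transducer pipeline parses and filters report rows one at a time, so a large file never needs to be fully realized in memory.

```clojure
(require '[clojure.string :as str])

;; Illustrative transducer pipeline: parse each CSV line into fields,
;; then drop malformed rows. Each step sees one row at a time.
(def parse-row (map #(str/split % #",")))
(def valid-row? (filter #(= 3 (count %))))
(def xform (comp parse-row valid-row?))

;; In production this would be a line-seq over a file reader;
;; here a small vector of strings stands in for a canned report.
(def rows ["a,1,2" "b,3,4" "malformed"])

(into [] xform rows)
;; => [["a" "1" "2"] ["b" "3" "4"]]
```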
  12. Iteration 1 Custom Transformation Scripts Pros: • Simple enough to get us going • Transformations were represented as pipelines • Threading macros kept the code clean, readable

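A sketch of the script-style pipelines from Iteration 1 (the invoice fields and helper functions are hypothetical): the `->>` threading macro lets the transformation read as a top-to-bottom pipeline.

```clojure
;; Hypothetical row shape and helpers, for illustration only.
(defn normalize-amount [row]
  (update row :amount #(Double/parseDouble %)))

(defn overdue? [row]
  (neg? (:days-to-due row)))

(def rows
  [{:invoice "INV-1" :amount "120.50" :days-to-due -3}
   {:invoice "INV-2" :amount "80.00"  :days-to-due 10}])

;; The pipeline: parse amounts, keep overdue invoices, project the id.
(def overdue-invoices
  (->> rows
       (map normalize-amount)
       (filter overdue?)
       (map :invoice)))

overdue-invoices
;; => ("INV-1")
```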
  13. Iteration 1 Custom Transformation Scripts Cons: • Expensive - required end-to-end involvement of devs in customer onboarding • Lack of standardisation was a potential risk when incorporating fixes/enhancements at a later time • The above was especially true for “multi-branch” pipelines • Clojure familiarity was low, hence only a handful of team members were able to contribute

  14. Iteration 2 DAG-Based Transformations • Structurally standardised the transformations by representing them as Directed Acyclic Graphs (DAGs) • Similar to workflow engines, e.g. Airflow, Oozie • Each node in the graph represented a simple transformation • Composing multiple nodes together, in the right sequence, allowed for complex, multi-branch flows

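A minimal sketch of the idea (the node names and map shape are hypothetical, not the numberz DSL): a DAG as a map of nodes, each with its dependencies and a function of its parents' outputs, evaluated by resolving dependencies first. Two nodes branching off `:raw` and re-joining in `:combined` shows the multi-branch case.

```clojure
;; Each node: a vector of parent node names and a function that
;; receives the parents' outputs as a seq.
(def dag
  {:raw      {:deps [] :fn (fn [_] [1 2 3 4])}
   :doubled  {:deps [:raw] :fn (fn [[xs]] (map #(* 2 %) xs))}
   :evens    {:deps [:raw] :fn (fn [[xs]] (filter even? xs))}
   :combined {:deps [:doubled :evens] :fn (fn [[a b]] (concat a b))}})

(defn run-node [dag results node]
  (if (contains? results node)
    results
    (let [{deps :deps f :fn} (dag node)
          ;; Resolve all parents before evaluating this node.
          results (reduce (partial run-node dag) results deps)]
      (assoc results node (f (map results deps))))))

(defn run-dag [dag node]
  (get (run-node dag {} node) node))

(run-dag dag :combined)
;; => (2 4 6 8 2 4)
```

A topological walk like this also makes each node's output cacheable, since `results` memoizes every node already computed.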
  15. Iteration 2 DAG-Based Transformations • Choice of Clojure proved to be an astute one • We wrote our own DSL to define nodes and DAGs, declaratively • Also introduced clojure.spec for validation and generative testing

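For example, clojure.spec can validate a node definition before it enters a DAG (a sketch; the spec and key names here are illustrative, not the numberz DSL):

```clojure
(require '[clojure.spec.alpha :as s])

;; Specs for a hypothetical node definition.
(s/def ::name keyword?)
(s/def ::deps (s/coll-of keyword? :kind vector?))
(s/def ::fn fn?)
(s/def ::node (s/keys :req-un [::name ::deps ::fn]))

(s/valid? ::node {:name :doubled :deps [:raw] :fn identity})
;; => true

;; Fails: :name is not a keyword and :fn is missing.
(s/valid? ::node {:name "doubled" :deps [:raw]})
;; => false
```

`s/explain` on the failing case reports exactly which keys violated which spec, which is what made spec failures actionable during onboarding changes.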
  16. Iteration 2 DAG-Based Transformations

  17. Iteration 2 DAG-Based Transformations Pros: • Helped reduce amount of code to be written • Moved us towards an even more declarative expression of transformation • clojure.spec gave greater confidence when making changes to transformers Cons: • Less readable than using threading macros • Higher memory footprint than custom scripts, but still works on a single machine

  18. Future Enhancements • Serialize the DAG & node definitions and persist to database • Move to declarative, config-driven expression of individual nodes in the DAG • Generate DAG documentation, diagrams from definitions

  19. Challenges Faced • Are we called Clojurians or Clojurists? • Clojure learning curve • Stacktraces pretty complicated for beginners, especially when working with lazy sequences and macros • Low convergence within the community on de facto libraries/frameworks for common tasks

  20. Conclusions • We chose to trade off the scalability of a distributed computation platform for the simplicity and predictability of a single-machine setup. • A lot of Clojure’s features fit very well with our problem space. • We’d imagine this would be the case for any ETL-like use case on data-sets that aren’t large enough to warrant Big-Data solutions.

  21. – Joseph ‘Proposition Joe’ Stewart, The Wire “Keep it boring, String. You keep it dead f***king boring.”

  22. Thank you!