- Building big data infrastructure is no easy task.
- Leveraging data for decision making requires a mix of multiple skills:
  - Systems engineering
  - Distributed computing
  - Statistics
  - Machine learning
Solutions …
- Build data platforms as a service.
- Build robust and consistent APIs to bring big data to the masses.
- Leverage fluent APIs for fast data science.
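As an illustration of what a fluent API can look like, here is a minimal sketch in plain Python. The `Dataset` class, its methods, and the sample `events` are hypothetical and exist only for this example; they are not taken from any particular platform. Each method returns a new `Dataset`, so an analysis reads as one chained expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical immutable wrapper around a sequence of records."""
    rows: tuple

    @staticmethod
    def of(rows):
        # Copy the input so later changes to it do not affect the dataset.
        return Dataset(tuple(rows))

    def filter(self, pred):
        # Keep only the records for which pred(record) is true.
        return Dataset(tuple(r for r in self.rows if pred(r)))

    def map(self, fn):
        # Apply fn to every record.
        return Dataset(tuple(fn(r) for r in self.rows))

    def group_count(self, key):
        # Terminal step: count records per key and return a plain dict.
        counts = {}
        for r in self.rows:
            k = key(r)
            counts[k] = counts.get(k, 0) + 1
        return counts


# Made-up sample data for the example.
events = [
    {"user": "a", "action": "click"},
    {"user": "b", "action": "view"},
    {"user": "a", "action": "click"},
    {"user": "c", "action": "click"},
]

# The whole analysis is a single chain of small, composable steps.
clicks_per_user = (
    Dataset.of(events)
    .filter(lambda e: e["action"] == "click")
    .map(lambda e: e["user"])
    .group_count(lambda user: user)
)
```

Returning a new object from every step keeps intermediate results immutable, which is one common way to make such chains safe to reuse and easy to reason about.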
Data Sources
- High-throughput distributed messaging platform
- Publish-subscribe model
- Modelled as a distributed, replicated log
- Persists messages to disk
- Categorizes messages into topics
- Retains messages for a configurable, potentially long, period of time
- Allows stream replay in case of failure
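The log model described above can be sketched with a toy, single-process stand-in. This is only an illustration under simplifying assumptions: the `TopicLog` class and its methods are hypothetical, everything lives in memory, and distribution, replication, and disk persistence are left out. It shows how topics, offsets, time-based retention, and replay fit together.

```python
import time
from collections import defaultdict


class TopicLog:
    """Hypothetical in-memory stand-in for a topic-based, append-only log."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._topics = defaultdict(list)      # topic -> [(offset, timestamp, message)]
        self._next_offset = defaultdict(int)  # topic -> next offset to assign

    def publish(self, topic, message, now=None):
        # Append the message to the end of the topic and return its offset.
        now = time.time() if now is None else now
        offset = self._next_offset[topic]
        self._topics[topic].append((offset, now, message))
        self._next_offset[topic] = offset + 1
        return offset

    def read(self, topic, from_offset=0):
        # Subscribers pull messages starting at any offset they choose,
        # which is what makes replay after a failure possible.
        return [(o, m) for o, _, m in self._topics[topic] if o >= from_offset]

    def expire(self, now=None):
        # Drop messages older than the retention window.
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        for topic in list(self._topics):
            self._topics[topic] = [e for e in self._topics[topic] if e[1] >= cutoff]


log = TopicLog(retention_seconds=60)
log.publish("clicks", "c1", now=0)
log.publish("clicks", "c2", now=10)
log.publish("views", "v1", now=20)
log.publish("clicks", "c3", now=50)

# A consumer that failed after handling offset 0 replays from offset 1.
replayed = log.read("clicks", from_offset=1)

# At time 65 the cutoff is 5, so only the message published at time 0 expires.
log.expire(now=65)
remaining = log.read("clicks")
```

Because consumers track their own offsets, the log itself stays a simple append-only structure, and a restarted consumer can resume from its last known position as long as the messages are still within the retention window.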
Text, images, etc. -> Feature extraction -> Predictive model
New data -> Prediction

X = vect.fit_transform(input)
clf.fit(X, y)

# Reuse the fitted feature extractor on new data; do not refit it.
X_new = vect.transform(new_input)
y_new = clf.predict(X_new)
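The training and prediction steps on this slide can be sketched end to end in plain Python. The `BagOfWords` and `CentroidClassifier` classes below are hypothetical, simplified stand-ins for the `vect` and `clf` objects; they only mirror the `fit` / `transform` / `predict` naming and are not any library's implementation. The point to notice is that new data goes through `transform` only, so it is mapped into the feature space learned during training.

```python
class BagOfWords:
    """Hypothetical word-count feature extractor."""

    def fit(self, docs):
        # Learn the vocabulary from the training documents only.
        vocab = sorted({w for d in docs for w in d.lower().split()})
        self.vocab_ = {w: i for i, w in enumerate(vocab)}
        return self

    def transform(self, docs):
        # Map documents to count vectors using the learned vocabulary;
        # words never seen during fit are ignored.
        rows = []
        for d in docs:
            row = [0] * len(self.vocab_)
            for w in d.lower().split():
                i = self.vocab_.get(w)
                if i is not None:
                    row[i] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


class CentroidClassifier:
    """Hypothetical nearest-centroid classifier."""

    def fit(self, X, y):
        # Average the feature vectors of each class.
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {l: [v / counts[l] for v in s] for l, s in sums.items()}
        return self

    def predict(self, X):
        # Assign each row to the class with the closest centroid.
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids_, key=lambda l: dist(row, self.centroids_[l]))
                for row in X]


# Made-up training data for the example.
train_docs = ["good great film", "great acting", "bad boring film", "boring plot"]
y = ["pos", "pos", "neg", "neg"]

vect, clf = BagOfWords(), CentroidClassifier()
X = vect.fit_transform(train_docs)   # learn vocabulary and extract features
clf.fit(X, y)                        # train the model

X_new = vect.transform(["great film", "boring acting plot"])  # transform only
y_new = clf.predict(X_new)
```

Calling `fit_transform` on the new data instead would rebuild the vocabulary from that data alone, so the columns of `X_new` would no longer line up with the features the model was trained on.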
- Data locality and data gravity
- Support the full workflow
- Verticalization of platforms
- Scalability
- Collaboration and interoperability
- Black boxing of implementations