beginning of the 2000s
All things around search and data processing
Free and Open Source contributor and enthusiast
Wannabe speaker
#Dadops practitioner
Can talk faster than you think!
Long distance runner
TV series aficionado
common roles has extended to Data Engineer, Data Scientist, BI and Data Analyst.
Roles tend to be blurry, with no exact boundaries between them. Depending on whether you ask How? or Why?, you are one or the other.
With the advent of “Big Data”, the title of Data Engineer is used in the industry for the specialized Software Engineers who take care of the infrastructure required by data-intensive applications.
Data Engineers usually deal with things such as databases of several kinds (relational and NoSQL), Hadoop, Spark, Flink, Elasticsearch, Redis, Kafka, Java, Scala and others.
often than not dealing with (big) data, from acquisition over cleaning, conversion, disambiguation, de-duplication, and also developing & deploying solutions.
The role of a Data Scientist without doubt requires a solid mathematical education, while the Data Engineer needs a solid understanding of several topics such as distributed systems, databases, operating systems and software engineering, without forgetting human interaction.
https://medium.com/@mhausenblas/got-a-data-in-your-job-title-9b4c8919973b
Data Engineer | Data Scientist
background with touches of systems | Strong background in mathematics
Databases, NoSQL, Big Data, Batch Processing, Streams, Distributed Systems, Scalability, HA, ... | Statistics, Models, Machine Learning, Predictions, Inferences, ...
Data Ingestion, Cleansing, ETL, Warehousing, Modeling, ... | Data analysis, Machine Learning, Research, Business
Basic understanding of Scientist tasks is a desired skill | Communication with Engineers is crucial
Architecture and Infrastructure | Data analysis
Everyone is living a happy life
• The rise of multiple applications
• The service-oriented architecture
• Introduction of an Analytics Team (BI / Data Science / …)
• Data needs to be prepared and integrated for analytics
• Integration with external data providers
• Data import, normalization, validation at scale
• [Semantic] Data Integration
• Company mergers
• Unification of similar or heterogeneous datasets
to have this set of data-centric roles
Infrastructure specialist (databases, brokers, virtualization and cloud)
The data technicians
Data Processing
Data Modeling and Administration
Data Discovery
Data Architecture
The data quality control
How do you organize these specialities? There is usually no perfect solution for everybody, however…
Specific teams are created ad hoc, highly specialized.
Processing: Specific team with a broader scope, embedded in specific product teams.
Modeling and Administration: Part of specific product teams.
Discovery: Specific team, highly specialized.
Architecture: Part of the global architecture team, as domain experts.
Quality Control: Global chapter, part of the product teams.
without any sign of decay.
• SQL is not going away
• Mass adoption of NoSQL solutions
• Schema on write vs Schema on read
• Data Integration and Standardization
• Atomicity, Consistency, Isolation and Durability (the ACID properties)
• The CAP theorem
• Ever-increasing need to process data
• Batch processing, Stream processing, Hybrid approaches
• Interaction with external systems (SOAP, REST, JSON, XML, ...)
• Machine learning is everywhere
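To make the "Schema on write vs Schema on read" bullet concrete, here is a minimal Python sketch. The schema, the two store classes and the field names are all hypothetical and exist only for illustration; real systems (a relational database on one side, a data lake on the other) are of course far richer.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"user_id": int, "country": str}


def validate(record):
    """Return the record if it matches SCHEMA, otherwise raise ValueError."""
    if not isinstance(record, dict) or set(record) != set(SCHEMA):
        raise ValueError("fields do not match the schema")
    for name, expected in SCHEMA.items():
        if not isinstance(record[name], expected):
            raise ValueError(f"{name} should be of type {expected.__name__}")
    return record


class SchemaOnWriteStore:
    """Schema on write: malformed records are rejected when they arrive."""

    def __init__(self):
        self._rows = []

    def write(self, record):
        self._rows.append(validate(record))

    def read(self):
        return list(self._rows)


class SchemaOnReadStore:
    """Schema on read: any raw payload is stored, the schema is applied later."""

    def __init__(self):
        self._raw = []

    def write(self, payload):
        self._raw.append(payload)

    def read(self):
        rows = []
        for payload in self._raw:
            try:
                rows.append(validate(json.loads(payload)))
            except ValueError:
                # Invalid JSON or a schema mismatch: skip the payload at read time.
                continue
        return rows
```

The trade-off this tries to show: the first store pays the validation cost up front and fails loudly at ingestion, while the second accepts everything cheaply and pushes the interpretation (and the surprises) to whoever reads the data.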
generic)
Have a common format, at least (for example: Avro).
Schema information is propagated continuously.
Apply data governance principles by introducing explicit data dependencies.
Data is managed by the entity with the most experience of it.
Data is either pushed or streamed into the next step in the flow.
Model flow orchestration explicitly.
Facilitate error handling with immutability.
Set up with versatility in mind.
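A few of these principles (schema information travelling with the data, an explicitly modelled flow, and error handling made easier by immutability) can be sketched in a handful of lines of Python. Everything named here is an assumption made for illustration: the `Record` envelope, the `user.v1` / `user.v2` schema identifiers and the `normalize_country` step are hypothetical, and a production pipeline would more likely rely on a serialization format such as Avro and a dedicated orchestrator.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    """Immutable envelope: the payload always travels with its schema id."""
    schema: str     # hypothetical schema identifier, e.g. "user.v1"
    payload: tuple  # (key, value) pairs kept in a tuple so they cannot be mutated


def normalize_country(record):
    """Return a NEW record with an upper-cased country; the input is untouched."""
    data = dict(record.payload)
    data["country"] = data["country"].upper()
    return replace(record, schema="user.v2", payload=tuple(sorted(data.items())))


def run_flow(records, steps):
    """Apply the steps in order; inputs that fail are kept intact for replay."""
    done, failed = [], []
    for original in records:
        current = original
        try:
            for step in steps:
                current = step(current)
            done.append(current)
        except Exception as error:
            # Because records are immutable, the original can be retried as-is.
            failed.append((original, repr(error)))
    return done, failed
```

A possible usage: `run_flow([Record("user.v1", (("country", "de"), ("user_id", 1)))], [normalize_country])` returns the transformed record in the first list, while a record without a `country` field would end up, unmodified, in the second list together with the error that stopped it.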
you recruit and build a good team:
Build an enviable team; smart engineers want to work with people they can learn from.
Remove the walls. Data Engineers and Data Scientists should work in close collaboration.
Have an environment where creativity is free to flow.
of engineers:
• BI, Warehouse, Data lake, Ingestion, strong focus on data processing and analytics.
• Tools: Specialized in a certain tool (Spark, Hadoop, …)
• Architecture: End-to-end thinker, going from data processing to how teams use it.
• Infrastructure: Focused on setting up databases and tools.
[https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/]
[2] Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald. Magic Quadrant for Operational Database Management Systems. Gartner, October 5, 2016.
[3] Thomas W. Oestreich, Andrew White. Must-Have Roles for Data and Analytics, 2017. Gartner, 2017.
[4] Jay Kreps. Questioning the Lambda Architecture. O’Reilly Media, 2014. [https://www.oreilly.com/ideas/questioning-the-lambda-architecture]