Slide 1

Slide 1 text

Data Engineering without borders
Pere Urbon Bayes, Data Wrangler
pere.urbon @ { gmail.com, acm.org, springernature.com }

Slide 2

Slide 2 text

About me
• Architecting and building data-centric applications since the early 2000s
• All things around search and data processing
• Free and Open Source contributor and enthusiast
• Wannabe speaker
• #Dadops practitioner
• Can talk faster than you think!
• Long distance runner
• TV series aficionado

Slide 3

Slide 3 text

Working at Springer Nature @ Berlin
We are looking for:
• Lead QA
• Java/Scala Developers
• IT Agile Business Analyst
• Frontend Developers
• Infrastructure & Cloud Engineer
• Platform Engineer

Slide 4

Slide 4 text

Topics of today
• Data engineering, what is this?
• The data-centric organization
• From the past to the present day
• Facts and skills for Data Engineers
• Hiring and developing in-house talent

Slide 5

Slide 5 text

Data Engineering?

Slide 6

Slide 6 text

Data Engineering, What is this?
• Nothing new under the sun: the common roles have extended to include Data Engineer, Data Scientist, BI and Data Analyst.
• Roles tend to be blurry, with no exact boundaries between them. Asking "How?" or "Why?" decides whether you are one or the other.
• With the advent of “Big Data”, the term data engineer is used in the industry for the specialized Software Engineers who take care of the infrastructure required by data intensive applications.
• Data Engineers usually deal with databases of several kinds (relational and NoSQL), Hadoop, Spark, Flink, Elasticsearch, Redis, Kafka, Java, Scala and others.

Slide 7

Slide 7 text

Data Engineering, What is this?
Data Engineers find themselves more often than not dealing with (big) data, from acquisition through cleaning, conversion, disambiguation and de-duplication, and also developing and deploying solutions.
The role of a Data Scientist without doubt requires a solid mathematical education.
https://medium.com/@mhausenblas/got-a-data-in-your-job-title-9b4c8919973b
The Data Engineer, in turn, needs a solid understanding of several topics such as distributed systems, databases, operating systems and software engineering, without forgetting human interactions.
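The cleaning, conversion and de-duplication steps named on this slide can be sketched in a few lines. This is only an illustrative sketch, not something from the talk: the `Contact` record, the sample rows and the rule "one record per email address" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class Contact:
    """Hypothetical target record for this example."""
    name: str
    email: str


def clean(raw: dict) -> Optional[Contact]:
    """Cleaning and conversion: normalize whitespace and case, drop unusable rows."""
    name = " ".join(str(raw.get("name", "")).split())
    email = str(raw.get("email", "")).strip().lower()
    if not name or "@" not in email:
        return None  # row cannot be repaired, so it is dropped
    return Contact(name=name, email=email)


def deduplicate(contacts: Iterable[Contact]) -> Iterator[Contact]:
    """De-duplication: keep the first record seen for each email address."""
    seen = set()
    for contact in contacts:
        if contact.email not in seen:
            seen.add(contact.email)
            yield contact


def run_pipeline(rows: Iterable[dict]) -> List[Contact]:
    """Acquisition is assumed done; rows are cleaned first, then de-duplicated."""
    cleaned = (c for c in map(clean, rows) if c is not None)
    return list(deduplicate(cleaned))


# Made-up sample input, only for illustration.
RAW_ROWS = [
    {"name": "  Ada   Lovelace ", "email": "ADA@example.org "},
    {"name": "Ada Lovelace", "email": "ada@example.org"},
    {"name": "", "email": "nobody@example.org"},
    {"name": "Alan Turing", "email": "not-an-email"},
    {"name": "Grace Hopper", "email": "grace@example.org"},
]

if __name__ == "__main__":
    for contact in run_pipeline(RAW_ROWS):
        print(contact)
```

Cleaning runs before de-duplication on purpose: the two "Ada Lovelace" rows only become recognizable as duplicates once case and whitespace are normalized.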

Slide 8

Slide 8 text

Data Engineering, What is this?
Data Engineer | Data Scientist
Software background with touches of systems | Strong background in mathematics
Databases, NoSQL, Big Data, Batch Processing, Streams, Distributed Systems, Scalability, HA, ... | Statistics, Models, Machine Learning, Predictions, Inferences, ...
Data Ingestion, Cleansing, ETL, Warehousing, Modeling, ... | Data analysis, Machine Learning, Research, Business
Basic understanding of Scientist tasks is a desired skill | Communication with Engineers is crucial
Architecture and Infrastructure | Data analysis

Slide 9

Slide 9 text

The data organization

Slide 10

Slide 10 text

It all started with...

Slide 11

Slide 11 text

...to be continued with...

Slide 12

Slide 12 text

The era of the monolith application

Slide 13

Slide 13 text

Everything changed in the 2000s
The advent of NoSQL and Big Data

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

The rise of data democracy

Slide 16

Slide 16 text

On the human side of things….

Slide 17

Slide 17 text

The data anarchism

Slide 18

Slide 18 text

Democracy should be governed

Slide 19

Slide 19 text

From Anarchism to Governance
• The Single Application principle
• Everyone is living a happy life
• The rise of multiple applications
• The service oriented architecture
• Introduction of an Analytics Team (BI / Data Science / …)
• Data needs to be prepared and integrated for analytics
• Integration with external data providers
• Data import, normalization, validation at scale
• [Semantic] Data Integration
• Company mergers
• Unification of similar or heterogeneous datasets

Slide 20

Slide 20 text

A framework for Data Governance
ISO 8000-150 as a framework for data governance

Slide 21

Slide 21 text

Squads, Chapters, Tribes, ... for Data
These ideas lead to the following set of data-centric roles:
• Infrastructure specialists (databases, brokers, virtualization and cloud)
• The data technicians
• Data Processing
• Data Modeling and Administration
• Data Discovery
• Data Architecture
• The data quality control
How do you organize these specialities? There is usually no perfect solution for everybody, however...

Slide 22

Slide 22 text

Squads, Chapters, Tribes, ... for Data
Role | Organization
Infrastructure specialists | Specific teams are created ad hoc, highly specialized.
Processing | Specific team, with broader scope, embedded in specific product teams.
Modeling and Administration | Part of specific product teams.
Discovery | Specific team, highly specialized.
Architecture | Part of the global architecture team as domain experts.
Quality Control | Global chapter, part of product team.

Slide 23

Slide 23 text

Have all responsibilities extended?
• Relational databases remain strong and show no sign of decline.
• SQL is not going away
• Mass adoption of NoSQL solutions.
• Schema on write vs Schema on read
• Data Integration and Standardization
• Atomicity, Consistency, Isolation and Durability (the ACID properties)
• The CAP theorem
• Ever increasing need to process data
• Batch processing, Stream Processing, Hybrid approaches
• Interaction with external systems (SOAP, REST, JSON, XML, ...)
• Machine learning is everywhere
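The "schema on write vs schema on read" item can be made concrete with two toy in-memory stores. This is a minimal sketch under assumptions: the `SCHEMA` fields, the class names and the JSON payloads are hypothetical and do not come from the talk or from any specific product.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "title": str, "year": int}


class SchemaError(ValueError):
    """Raised when a record does not match SCHEMA."""


def validate(record):
    """Return the record unchanged if it matches SCHEMA, else raise SchemaError."""
    if not isinstance(record, dict):
        raise SchemaError("record must be an object")
    for field, expected in SCHEMA.items():
        if field not in record:
            raise SchemaError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise SchemaError(f"{field} should be {expected.__name__}")
    return record


class SchemaOnWriteStore:
    """Validates at ingestion: bad records never enter the store."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(validate(record))  # raises before anything is stored

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Stores raw payloads as-is: the schema is applied only when reading."""

    def __init__(self):
        self.raw = []

    def write(self, payload):
        self.raw.append(payload)  # nothing is checked here

    def read(self):
        rows = []
        for payload in self.raw:
            try:
                rows.append(validate(json.loads(payload)))
            except (json.JSONDecodeError, SchemaError):
                continue  # skipped for this read; the raw payload stays stored
        return rows
```

The trade-off shows in where the failure surfaces: the first store pushes the cost to producers, who get an error at write time, while the second keeps ingestion cheap and leaves every reader to deal with malformed data.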

Slide 24

Slide 24 text

Have all responsibilities extended?
• From Data Warehouse to Data Lake
• Schema on write vs schema on read
• Relational and bare metal vs NoSQL and cloud
• ...

Slide 25

Slide 25 text

Architecture patterns and tips

Slide 26

Slide 26 text

Architecture patterns and tips

Slide 27

Slide 27 text

Architecture patterns and tips
• Dedicated data ingestion platform (dedicated but generic)
• Agree on at least one common format (for example, Avro)
• Schema information is propagated continuously
• Apply data governance principles by introducing explicit data dependencies
• Data is managed by the entity most experienced with it
• Data is either pushed or streamed into the next step in the flow
• Model flow orchestration explicitly
• Facilitate error handling with immutability
• Set up with versatility in mind
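Three of the tips above (explicit data dependencies, explicit flow orchestration, and immutability for error handling) can be illustrated together. The sketch below is an assumption-laden toy, not the architecture from the talk: the step names, the `FLOW` table and the sample rows are invented, and the only library used is the standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter
from types import MappingProxyType


def ingest(_inputs):
    # Made-up raw rows; a tuple is used so that steps cannot append to it.
    return (
        {"user": "a", "amount": "10"},
        {"user": "b", "amount": "oops"},
        {"user": "a", "amount": "5"},
    )


def parse(inputs):
    """Builds new rows instead of modifying the ingested ones."""
    parsed, errors = [], []
    for row in inputs["ingest"]:
        try:
            parsed.append({**row, "amount": int(row["amount"])})
        except ValueError:
            errors.append(row)  # original row kept untouched for later replay
    return {"ok": tuple(parsed), "errors": tuple(errors)}


def totals(inputs):
    result = {}
    for row in inputs["parse"]["ok"]:
        result[row["user"]] = result.get(row["user"], 0) + row["amount"]
    return result


# The flow is data: each step declares the steps whose output it depends on.
FLOW = {
    "ingest": (ingest, ()),
    "parse": (parse, ("ingest",)),
    "totals": (totals, ("parse",)),
}


def run_flow(flow):
    """Runs every step after its declared dependencies and returns all outputs."""
    graph = {name: set(deps) for name, (_, deps) in flow.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        step, deps = flow[name]
        # A step receives a read-only mapping of its declared dependencies only.
        inputs = MappingProxyType({dep: outputs[dep] for dep in deps})
        outputs[name] = step(inputs)
    return outputs


if __name__ == "__main__":
    print(run_flow(FLOW)["totals"])
```

Because the ingested rows are never modified, the row that failed parsing is still available in its original form under `outputs["parse"]["errors"]`, so it can be inspected or replayed after a fix without re-ingesting anything.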

Slide 28

Slide 28 text

Facts and Skills for Data Engineers

Slide 29

Slide 29 text

Number of data engineers over time

Slide 30

Slide 30 text

Prior role for data engineers

Slide 31

Slide 31 text

Worldwide location

Slide 32

Slide 32 text

Distribution by industry

Slide 33

Slide 33 text

Distribution per company

Slide 34

Slide 34 text

Top Skills

Slide 35

Slide 35 text

Skills based on company size

Slide 36

Slide 36 text

Data Engineers vs Data Scientists (Skills)

Slide 37

Slide 37 text

Hiring and in-house talent

Slide 38

Slide 38 text

Hiring and Retaining Data Engineers
Some factors that will help you recruit and build a good team:
• Build an enviable team: smart engineers want to work with people they can learn from.
• Remove the walls: Data Engineers and Data Scientists should work in close collaboration.
• Have an environment where creativity is free to flow.

Slide 39

Slide 39 text

Hiring and Retaining Data Engineers
Focus on the right kind of engineers:
• BI, Warehouse, Data lake, Ingestion: strong focus on data processing and analytics.
• Tools: Specialized in a certain tool (Spark, Hadoop, …)
• Architecture: End-to-end thinker, going from data processing to how teams use it.
• Infrastructure: Focused on setting up databases and tools.

Slide 40

Slide 40 text

References
[1] Asim Jalis. The State of Data Engineering. stitchdata.com. [https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/]
[2] Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald. Magic Quadrant for Operational Database Management Systems. Gartner, October 5, 2016.
[3] Thomas W. Oestreich, Andrew White. Must-Have Roles for Data and Analytics, 2017. Gartner, 2017.
[4] Jay Kreps. Questioning the Lambda Architecture. O'Reilly Media, 2014. [https://www.oreilly.com/ideas/questioning-the-lambda-architecture]

Slide 41

Slide 41 text

Thanks, questions?
