
Data Engineering without borders

A presentation on the facts of data engineering, given at a GOTO Night in Berlin.


Pere Urbón

July 04, 2017



  1. About me
    • Architecting and building data-centric applications since the early 2000s
    • All things around search and data processing
    • Free and open source contributor and enthusiast
    • Wannabe speaker
    • #Dadops practitioner
    • Can talk faster than you think!
    • Long-distance runner
    • TV series aficionado
  2. Working at Springer Nature @ Berlin
    We are looking for:
    • Lead QA
    • Java/Scala Developers
    • IT Agile Business Analyst
    • Frontend Developers
    • Infrastructure & Cloud Engineer
    • Platform Engineer
  3. Topics of today
    • Data engineering, what is this?
    • The data-centric organization
    • From the past to the current days
    • Facts and skills for Data Engineers
    • Hiring and developing in-house talent
  4. Data Engineering, what is this?
    • Nothing new under the sun: the common roles have extended to Data Engineer, Data Scientist, BI and Data Analyst.
    • Roles tend to be blurry, with no exact boundaries between them.
    • Asking "How?" or "Why?": you are one or the other.
    • With the advent of “Big Data”, the industry uses the term Data Engineer for the specialized Software Engineers who take care of the infrastructure required by data-intensive applications.
    • Data Engineers usually deal with things such as databases of several kinds (relational and NoSQL), Hadoop, Spark, Flink, Elasticsearch, Redis, Kafka, Java, Scala and others.
  5. Data Engineering, what is this?
    • Data Engineers find themselves, more often than not, dealing with (big) data, from acquisition through cleaning, conversion, disambiguation and de-duplication, and also developing and deploying solutions.
    • The role of a Data Scientist without doubt requires a grounding in mathematics.
      https://medium.com/@mhausenblas/got-a-data-in-your-job-title-9b4c8919973b
    • The Data Engineer, in turn, needs a solid understanding of several topics such as distributed systems, databases, operating systems and software engineering, without forgetting human interaction.
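The acquisition-to-de-duplication chain on this slide can be pictured with a minimal sketch. It is not taken from the talk: the `Contact` record, the field names and the "same e-mail means same record" rule are assumptions chosen only to make the steps concrete.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Contact:
    # Hypothetical record type, used only for this illustration.
    name: str
    email: str


def clean(raw: dict) -> dict:
    """Cleaning: strip whitespace and drop empty or non-string fields."""
    return {k: v.strip() for k, v in raw.items() if isinstance(v, str) and v.strip()}


def convert(record: dict) -> Optional[Contact]:
    """Conversion: map a cleaned dict to a typed record, or None if incomplete."""
    if "name" not in record or "email" not in record:
        return None
    return Contact(name=record["name"], email=record["email"].lower())


def deduplicate(contacts: Iterable[Contact]) -> Iterator[Contact]:
    """De-duplication: keep the first record seen for each e-mail address."""
    seen = set()
    for contact in contacts:
        if contact.email not in seen:
            seen.add(contact.email)
            yield contact


def run_pipeline(raw_records: Iterable[dict]) -> list:
    """Chain the steps: clean, convert, drop incomplete records, de-duplicate."""
    converted = (convert(clean(raw)) for raw in raw_records)
    return list(deduplicate(c for c in converted if c is not None))


raw_input = [
    {"name": " Ada ", "email": "ADA@example.org"},
    {"name": "Ada", "email": "ada@example.org "},
    {"name": "", "email": "nobody@example.org"},
]
result = run_pipeline(raw_input)
print(result)
```

At real scale the same steps would typically run on one of the engines named earlier (Spark, Flink and so on); the shape of the flow stays the same.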
  6. Data Engineering, what is this?
    Data Engineer:
    • Software background with touches of systems
    • Databases, NoSQL, Big Data, Batch Processing, Streams, Distributed Systems, Scalability, HA, ...
    • Data Ingestion, Cleansing, ETL, Warehousing, Modeling, ...
    • Basic understanding of Scientist tasks is a desired skill
    • Architecture and Infrastructure
    Data Scientist:
    • Strong background in mathematics
    • Statistics, Models, Machine Learning, Predictions, Inferences, ...
    • Data analysis, Machine Learning, Research, Business
    • Communication with Engineers is crucial
    • Data analysis
  7. From Anarchism to Governance
    • The Single Application principle
    • Everyone is living a happy life
    • The rise of multiple applications
    • The service-oriented architecture
    • Introduction of an Analytics Team (BI / Data Science / …)
    • Data needs to be prepared and integrated for analytics
    • Integration with external data providers
    • Data import, normalization, validation at scale
    • [Semantic] Data Integration
    • Company mergers
    • Unification of similar or heterogeneous datasets
  8. Squads, Chapters, Tribes, … for Data
    These ideas lead to the following set of data-centric roles:
    • Infrastructure specialist (database, brokers, virtualization and cloud)
    • The data technicians
    • Data Processing
    • Data Modeling and Administration
    • Data Discovery
    • Data Architecture
    • The data quality control
    How do you organize these specialities? There is usually no perfect solution for everybody, however…
  9. Squads, Chapters, Tribes, … for Data
    Role: Organization
    • Infrastructure specialists: Specific teams are created ad hoc, highly specialized.
    • Processing: Specific team with a broader scope, embedded in specific product teams.
    • Modeling and Administration: Part of specific product teams.
    • Discovery: Specific team, highly specialized.
    • Architecture: Part of the global architecture team as domain experts.
    • Quality Control: Global chapter, part of the product team.
  10. Have all responsibilities extended?
    • Relational databases remain strong and show no sign of decay.
    • SQL is not going away
    • Mass adoption of NoSQL solutions
    • Schema on write vs schema on read
    • Data Integration and Standardization
    • Atomicity, Consistency, Isolation and Durability (the ACID properties)
    • The CAP theorem
    • Ever-increasing need to process data
    • Batch processing, stream processing, hybrid approaches
    • Interaction with external systems (SOAP, REST, JSON, XML, ...)
    • Machine learning is everywhere
  11. Have all responsibilities extended?
    • From Data Warehouse to Data Lake
    • Schema on write vs schema on read
    • Relational and bare metal vs NoSQL and cloud
    • …
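The "schema on write vs schema on read" contrast can be illustrated with a small sketch. The class names, the two-field schema and the choice to skip unreadable payloads are assumptions made for the illustration, not something the slides prescribe: a warehouse-style store rejects bad records at write time, while a lake-style store accepts anything and applies the schema only when reading.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"id": int, "title": str}


class SchemaError(ValueError):
    """Raised when a record does not match REQUIRED_FIELDS."""


def validate(record) -> dict:
    """Return the record unchanged if it matches the schema, else raise."""
    if not isinstance(record, dict):
        raise SchemaError("record must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise SchemaError(f"field {field!r} must be {expected_type.__name__}")
    return record


class WarehouseTable:
    """Schema on write: records are validated before they are stored."""

    def __init__(self):
        self.rows = []

    def write(self, record: dict) -> None:
        self.rows.append(validate(record))


class DataLake:
    """Schema on read: raw payloads are stored as-is and interpreted on read."""

    def __init__(self):
        self.blobs = []

    def write(self, payload: str) -> None:
        self.blobs.append(payload)  # no checks at write time

    def read(self) -> list:
        valid = []
        for blob in self.blobs:
            try:
                valid.append(validate(json.loads(blob)))
            except (json.JSONDecodeError, SchemaError):
                continue  # in this sketch, unreadable payloads are skipped
        return valid


lake = DataLake()
lake.write('{"id": 2, "title": "b"}')
lake.write('{"id": "not-a-number"}')
lake.write("not json at all")
print(lake.read())
```

The trade-off is where the cost of bad data lands: the warehouse pushes it onto producers at write time, the lake defers it to every consumer at read time.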
  12. Architecture patterns and tips
    • Dedicated data ingestion platform (dedicated but generic)
    • Have a common format, at least (for example: Avro)
    • Schema information is propagated continuously
    • Apply data governance principles by introducing explicit data dependencies
    • Data is managed by its most experienced entity
    • Data is either pushed or streamed into the next step in the flow
    • Model flow orchestration explicitly
    • Facilitate error handling with immutability
    • Set up with versatility in mind
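Three of these tips (schema information travelling with the data, an explicitly modelled flow, and immutability to ease error handling) can be sketched together. Everything named here is hypothetical: the `Envelope` type, the `schema` label standing in for a real Avro schema, and the two example steps are assumptions for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Callable, Mapping, Tuple


@dataclass(frozen=True)
class Envelope:
    """Immutable message: the schema label travels with the payload."""
    schema: str        # hypothetical stand-in for a real (e.g. Avro) schema reference
    payload: Mapping


Step = Callable[[Envelope], Envelope]


def normalize_title(env: Envelope) -> Envelope:
    """Return a NEW envelope with a normalized title; the input is untouched."""
    title = env.payload["title"].strip().title()
    return replace(env, payload={**env.payload, "title": title})


def add_word_count(env: Envelope) -> Envelope:
    """Return a NEW envelope enriched with the number of words in the title."""
    words = len(env.payload["title"].split())
    return replace(env, payload={**env.payload, "words": words})


# The flow is modelled explicitly as data, not hidden inside the steps.
FLOW: Tuple[Step, ...] = (normalize_title, add_word_count)


def run_flow(envelopes, flow: Tuple[Step, ...] = FLOW):
    """Run every envelope through the flow; keep failed originals for replay."""
    done, failed = [], []
    for original in envelopes:
        current = original
        try:
            for step in flow:
                current = step(current)
            done.append(current)
        except (KeyError, TypeError, AttributeError) as error:
            # Steps never mutate their input, so the original can be
            # replayed as-is once the problem is fixed.
            failed.append((original, repr(error)))
    return done, failed


inputs = [
    Envelope("article.v1", {"title": " data lakes "}),
    Envelope("article.v1", {"name": "no title here"}),
]
done, failed = run_flow(inputs)
print(done)
print(failed)
```

Because a failed envelope is kept exactly as it arrived, error handling reduces to fixing the step and replaying the failed list through the same flow.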
  13. Hiring and Retaining Data Engineers
    Some factors that will help you recruit and build a good team:
    • Build an enviable team; smart engineers want to work with people they can learn from.
    • Remove the walls. Data Engineers and Data Scientists should work in close collaboration.
    • Have an environment where creativity is free to flow.
  14. Hiring and Retaining Data Engineers
    Focus on the right kind of engineers:
    • BI, Warehouse, Data Lake, Ingestion: strong focus on data processing and analytics.
    • Tools: Specialized in a certain tool (Spark, Hadoop, …)
    • Architecture: End-to-end thinker, going from data processing to how teams use it.
    • Infrastructure: Focused on setting up databases and tools.
  15. References
    [1] Asim Jalis. The State of Data Engineering. stitchdata.com. [https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/]
    [2] Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald. Magic Quadrant for Operational Database Management Systems. Gartner, October 5, 2016.
    [3] Thomas W. Oestreich, Andrew White. Must-Have Roles for Data and Analytics, 2017. Gartner, 2017.
    [4] Jay Kreps. Questioning the Lambda Architecture. O’Reilly Media, 2014. [https://www.oreilly.com/ideas/questioning-the-lambda-architecture]