DATA MANAGEMENT with RUBY

Data is one of the most valuable assets for today's businesses. However, most of them struggle to manage it within their applications.

There are dozens of concepts and variables around data management, which can get confusing. That's why, before exploring some of those key definitions, let’s bust some myths around data management:

"A data management plan or a strategy is only for big companies." – Wrong! The sooner you become aware of the importance of having organized, manageable, and integrated data, the better positioned your company will be. Startups and small teams can increase their productivity and improve their customer interactions by managing data properly.

"Managing data is for mathematicians and scientists" – Definitely not. Though everyone is looking to hire Data Engineers and Data Scientists, the hands-on work of integrating data processes into applications often comes down to coding, backend architecture, and implementing services.

“You need to learn Python to tackle data management” - Not necessarily. There are many easy-to-use Ruby-based solutions that allow you to collect, store, maintain, process, and integrate your data.

In this talk I will cover what Data Management is, its goals, best practices, and Ruby-based solutions for it.

Key Takeaways:

1️⃣ We will define what Data Management is and review the three basic stages it encompasses: collection, storing, and processing.
2️⃣ We will review real-life examples of Enterprise Data Management Strategies and learn how to build data processes aligned with your business objectives.
3️⃣ You will learn best practices and techniques for managing data from the Ruby language point of view, including libraries, frameworks, and ready-to-go boilerplates.
4️⃣ Finally, you will get a chance to look under the hood of working samples of Data Management concepts in a real production project.

Sergey Sergyenko

September 22, 2022

Transcript

  1. whoami: Sergey Sergyenko
    • Ruby News - Ruby Weekly News Digest https://ruby.news
    • BRUG - Belarus Ruby User Group https://theBRUG.t.me
  2. Data Management is not Database Management, is not Data Governance, and is not ETL.
    Data Engineer is not Database Administrator, is not Data Analyst, and is not Data Scientist.
  3. Data Management: creation and implementation of architectures, policies, and procedures that manage the full data lifecycle needs of an organization. It includes:
    • Data preparation
    • Data pipelines
    • Data extract, transform, load (ETL/ELT)
    • Data warehouses
    • Data governance
    • Data architecture
    • Data security
  4. Ruby Engineer in 2012
    Was responsible for:
    • Frameworks
    • Backend
    • Frontend
    • Infrastructure
    • Testing
    • Data
  5. Ruby Engineer in 2022 doesn’t want to deal with Frameworks, Frontend, Infrastructure, Testing, Data
    • Rails Engineer (doesn’t always know how Ruby works) 🙈
    • Frontend Engineer ¯\_(ツ)_/¯
    • DevOps (knows Ruby, but doesn’t know how to use it) 🧐
    • QA Automation Engineer (dreams of becoming a Ruby Engineer one day)
    • Data Engineer (knows how to write parsers and scrapers, can read YAML and CSV) 🐙
    • Ruby Data Engineer (the good old Ruby Engineer, who can do all the things)
  6. Ruby Data Engineer 💎 in the Real World 🌎
    Create and manage ETL processes
    • Importing 30+ million records once a month
    • Normalizing data from dozens of sources
    Data(base) management
    • Preparing and correcting data as needed for use in the app
    • Creating and managing (materialized) views
    • Researching how to design and configure databases and tables
    General backend-focused software development
    • Models, controllers, APIs
  7. When will I ever care about it?
    • Seeding a large database or creating test data
    • Clearing out unneeded data
    • Importing large amounts of data - Database Migration
    • Data normalization w/ intermediate tables
    • Compliance, Security and Data Protection
  8. Do I need to learn ETL?
    80% of working with data is cleaning and 20% is modeling. Ruby is great for scripting, data transformation and manipulation. Rails is more maintainable, testable and works across databases.
    ETL has a lot of great solutions (Airflow, Kiba (Ruby), NiFi, SSIS, Golden Gate, n8n), which require a dedicated specialist who needs to dive into them and write a ton of code to integrate those tools with the application's business logic. In most cases the ETL solution will be based on a different programming language than the one used in the app development.
    Does Ruby have enough to deal with it?
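As a small illustration of the question on this slide, here is a minimal extract/transform/load sketch using only Ruby's `csv` library. It is not taken from the talk: the sample data, the helper names (`extract_rows`, `transform_row`, `load_rows`) and the in-memory `store` standing in for a database are all assumptions made for the example.

```ruby
require "csv"

# Hypothetical raw export: inconsistent casing, stray whitespace, one duplicate.
RAW_EXPORT = <<~DATA
  name,email,phone
  Smith Anderson , SMITH@Example.com ,512.151.9159
  smith anderson,smith@example.com,(512) 151-9159
  Jane Doe,jane@example.com,512 000 1111
DATA

# Extract: parse the CSV text into an array of plain hashes keyed by header.
def extract_rows(csv_text)
  CSV.parse(csv_text, headers: true).map(&:to_h)
end

# Transform: normalize each field so that equal records compare equal.
def transform_row(row)
  {
    name:  row["name"].strip.split.map(&:capitalize).join(" "),
    email: row["email"].strip.downcase,
    phone: row["phone"].gsub(/\D/, "") # keep digits only
  }
end

# Load: drop duplicates by email, then write in batches.
# `store` is an Array here; in a real app each batch would be one bulk INSERT.
def load_rows(rows, store:, batch_size: 1000)
  rows.uniq { |r| r[:email] }.each_slice(batch_size) do |batch|
    store.concat(batch)
  end
  store
end

store = load_rows(extract_rows(RAW_EXPORT).map { |r| transform_row(r) }, store: [])
store.each { |record| puts record }
```

For imports in the tens of millions of rows, the same three steps would typically stream the file (for example with `CSV.foreach`) instead of parsing it into memory at once.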
  9. And we do it day by day - General Tips
    • Avoid N+1 queries w/ joins or includes
    • Avoid passing results of one query to another w/ merge
    • Add indexes where appropriate
    • Sort and filter in the database where possible
    • Figure out how destroy_all vs delete_all work
    • Don’t be greedy about keeping data - delete is gold
    • Avoid Data Dictated Development
  10. Use Case - Healthcare App - HIPAA Compliance
    A HIPAA-compliant medical application is not allowed to manage PII data on premises without the patients' consent.
    PII - Personally Identifiable Information
    Levels of Data Sensitivity:
    • Public: data that poses no security threat when presented in public.
    • Internal: data that is used only within the organization.
    • Sensitive: data that belongs to users in an organization and is highly confidential.
    • Restricted: data that only a few members of an organization have access to, such as highly classified business information.
  11. Challenges - Healthcare App
    • Inserting data into the database is a nightmare: it’s not normalized and there is no way to check the uniqueness of data, so we might enter the same data twice, thrice, four times and more
    • In order to comply with data storing requirements you have to use “compliant” hosting, which is a narrow and tight choice - we use Health####Blocks
    • No high-level interpretation of business data (accountancy and billing), no BI tools and visualization - you have to move data out and obfuscate it all the time
  12. Data Obfuscation - good enough solution
    • Protection of personal data - Without data obfuscation, a third party or even the testing or development team that uses the data can clearly see the person’s details without any alteration.
    • Compliance - Medical, Finance, Governmental, etc.: HIPAA, GDPR, PCI
  13. Benefits of anonymized production database
    • Analyzing production databases for BI, Machine Learning, and troubleshooting while respecting GDPR
    • Stress tests / integration tests
    • Performance optimization for slow queries often needs a number of records/cardinality similar to production
    • Simulating database migration - some schema migrations lock tables, which causes trouble during execution (a small DB will complete the migration faster)
    • Better feature development flow - using data similar to the production database makes for a better development experience
    • Making data unreadable or unusable if a data breach occurs
  14. Data Encryption - DB Obfuscation techniques
    Original:
      Name: Smith Anderson
      Email: [email protected]
      Phone Number: +506 512.151.9159
    Encrypted:
      Name: 4a/p8kuIYwvNpygSaN7/KdW6d5XMG7NStayM7lU9N7dSeUkYN5f/WZzUiRuoZ+Nl
      Email: prEaMM0M6935JExKTYxT+6Sh2hq3kJk+W7PPWf6/F3OJEcb/5jEzOKxZeTp+Un00
      Phone Number: Wm/y4Xw8BysAAho/qDo1BxMQRTScCLRy+yqF1jh4osD8lxceBnaOEWsgtQD0K+91
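The slide does not say which cipher or library produced the values above. As a hedged sketch of field-level encryption, the following uses AES-256-GCM from Ruby's standard `openssl` library; the helper names `encrypt_field` and `decrypt_field`, and the choice to pack IV, auth tag and ciphertext into one Base64 string, are assumptions made for this example.

```ruby
require "openssl"

IV_BYTES  = 12 # default IV length for AES-GCM in OpenSSL
TAG_BYTES = 16 # default GCM authentication tag length

# Encrypts one field and returns Base64(iv + auth_tag + ciphertext).
def encrypt_field(plaintext, key)
  cipher = OpenSSL::Cipher.new("aes-256-gcm")
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv # fresh random IV for every value
  ciphertext = cipher.update(plaintext) + cipher.final
  [iv + cipher.auth_tag + ciphertext].pack("m0") # strict Base64, no newlines
end

# Reverses encrypt_field; raises OpenSSL::Cipher::CipherError if the data
# was tampered with or the key is wrong.
def decrypt_field(encoded, key)
  raw = encoded.unpack1("m0")
  iv         = raw[0, IV_BYTES]
  tag        = raw[IV_BYTES, TAG_BYTES]
  ciphertext = raw[(IV_BYTES + TAG_BYTES)..]

  cipher = OpenSSL::Cipher.new("aes-256-gcm")
  cipher.decrypt
  cipher.key = key
  cipher.iv = iv
  cipher.auth_tag = tag
  (cipher.update(ciphertext) + cipher.final).force_encoding("UTF-8")
end

# In a real app the key would come from a secrets manager, not be generated inline.
key = OpenSSL::Random.random_bytes(32)

encrypted = encrypt_field("Smith Anderson", key)
puts encrypted
puts decrypt_field(encrypted, key)
```

Because each value gets a random IV, encrypting the same name twice gives different output, so encrypted columns cannot be compared or indexed for equality without extra design.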
  15. Data Tokenization - DB Obfuscation techniques
    Original:
      Name: Smith Anderson
      Email: [email protected]
      Phone Number: +506 512.151.9159
    Tokenized:
      Name: 0b930571-a645-4ffe-b937-83f82a21819f
      Email: d4f31129-fee4-472e-8bca-c338599f06b0
      Phone Number: gIbR_arYFRZ-vCq6_8yzQjN1fjSbt9DHdz4O097R76A
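Tokenization replaces a value with a random token that has no mathematical relation to it; the mapping lives in a separate, protected store. The sketch below is an illustration only: the `TokenVault` class is hypothetical and keeps the mapping in memory, whereas a real vault would be a separately secured datastore.

```ruby
require "securerandom"

# Hypothetical in-memory token vault. The token is a random UUID, so it
# cannot be reversed without access to the vault itself.
class TokenVault
  def initialize
    @token_by_value = {}
    @value_by_token = {}
  end

  # Returns the existing token for a value, or issues a new one.
  def tokenize(value)
    @token_by_value[value] ||= SecureRandom.uuid.tap do |token|
      @value_by_token[token] = value
    end
  end

  # Looks the original value up; raises KeyError for unknown tokens.
  def detokenize(token)
    @value_by_token.fetch(token)
  end
end

vault = TokenVault.new
token = vault.tokenize("Smith Anderson")

puts token
puts vault.detokenize(token)
```

Issuing the same token for the same value keeps joins and grouping usable on the tokenized data, at the cost of revealing which rows share a value.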
  16. Data Masking - DB Obfuscation techniques
    Original:
      Name: Smith Anderson
      Email: [email protected]
      Phone Number: +506 512.151.9159
    Masked:
      Name: Ahmad Mayert
      Email: [email protected]
      Phone Number: +381 (189) 297-9558
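Masking swaps real values for realistic-looking fake ones. The talk uses Faker for this; to stay dependency-free, the sketch below uses only the standard library with tiny hypothetical name pools, and seeds the random generator from a hash of the input so the same input always masks to the same output. All helper names are assumptions made for this example.

```ruby
require "digest"

# Tiny hypothetical pools; a real setup would use Faker or larger dictionaries.
FIRST_NAMES = %w[Ahmad Maria Chen Olga].freeze
LAST_NAMES  = %w[Mayert Novak Silva Tanaka].freeze

# Deterministic RNG derived from the input, so masking is repeatable.
def seeded_rng(value)
  Random.new(Digest::SHA256.hexdigest(value)[0, 8].to_i(16))
end

def mask_name(name)
  rng = seeded_rng(name)
  "#{FIRST_NAMES.sample(random: rng)} #{LAST_NAMES.sample(random: rng)}"
end

def mask_email(email)
  "user-#{Digest::SHA256.hexdigest(email)[0, 10]}@example.com"
end

# Replaces every digit but keeps punctuation and spacing, so the masked
# value still passes format validations.
def mask_phone(phone)
  rng = seeded_rng(phone)
  phone.gsub(/\d/) { rng.rand(10).to_s }
end

puts mask_name("Smith Anderson")
puts mask_email("smith@example.com")
puts mask_phone("+506 512.151.9159")
```

Deterministic masking keeps foreign keys and duplicates consistent across tables, but an unsalted hash of a low-entropy value can be brute-forced, so a secret salt would be needed before treating this as anonymization.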
  17. Data Exposure
    Case: Excessive Data Exposure occurs when an API response returns more data than the client needs. If a client application needs three fields, you shouldn't return the whole object.
    Types of Excessive Data Exposure:
    - Returning Unfiltered Data
    - Using Auto-Incrementing Primary Keys
    - Returning Personally Identifiable Information in an API Response
    - Exposing Data to 3rd Parties
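One common way to avoid returning unfiltered data is an explicit allowlist of response fields. The sketch below is illustrative: the `user` record, its field names, and the `public_payload` helper are hypothetical, and the random `public_id` stands in for whatever non-sequential identifier an app exposes instead of its auto-incrementing primary key.

```ruby
require "json"
require "securerandom"

# Hypothetical record as it might be loaded from the database.
user = {
  id: 42,                          # auto-incrementing key, kept internal
  public_id: SecureRandom.uuid,    # random identifier safe to expose
  name: "Smith Anderson",
  plan: "pro",
  email: "smith@example.com",
  phone: "+506 512.151.9159",
  password_digest: "not-a-real-digest"
}

# The three fields this client actually needs.
PUBLIC_FIELDS = %i[public_id name plan].freeze

# Builds the response from an allowlist, so new columns added to the
# record later are never exposed by accident.
def public_payload(record)
  record.slice(*PUBLIC_FIELDS)
end

response_body = JSON.generate(public_payload(user))
puts response_body
```

An allowlist fails closed: forgetting to add a field means the client gets too little, whereas a denylist fails open and leaks anything nobody remembered to exclude.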
  18. The Faker gem is a port of Perl's Data::Faker library that generates fake data.
    gem install faker
  19. Default Libraries
    Faker::Address Faker::Alphanumeric Faker::Ancient Faker::App Faker::Appliance Faker::Artist Faker::Avatar Faker::Bank Faker::Barcode Faker::Beer Faker::Blood Faker::Boolean Faker::BossaNova Faker::Business Faker::Camera Faker::Cannabis Faker::ChileRut Faker::ChuckNorris Faker::Code Faker::Coffee Faker::Coin Faker::Color Faker::Commerce Faker::Company
    Faker::Compass Faker::Computer Faker::Construction Faker::Cosmere Faker::Crypto Faker::CryptoCoin Faker::Currency Faker::Date Faker::DcComics Faker::Demographic Faker::Dessert Faker::Device Faker::DrivingLicence Faker::Drone Faker::Educator Faker::ElectricalComponents Faker::Emotion Faker::Esport Faker::File Faker::Fillmurray Faker::Finance Faker::Food Faker::FunnyName Faker::Gender Faker::GreekPhilosophers
    Faker::Hacker Faker::Hipster Faker::Hobby Faker::House Faker::IDNumber Faker::IndustrySegments Faker::Internet Faker::Invoice Faker::Job Faker::Json Faker::Kpop Faker::Lorem Faker::LoremFlickr Faker::LoremPixel Faker::Markdown Faker::Marketing Faker::Measurement Faker::Military Faker::Mountain Faker::Name Faker::Nation Faker::NatoPhoneticAlphabet Faker::NationalHealthService
    Faker::Number Faker::Omniauth Faker::PhoneNumber Faker::Placeholdit Faker::ProgrammingLanguage Faker::Relationship Faker::Restaurant Faker::Science Faker::SlackEmoji Faker::Source Faker::SouthAfrica Faker::Space Faker::String Faker::Stripe Faker::Subscription Faker::Superhero Faker::Tea Faker::Team Faker::Time Faker::Twitter Faker::Types Faker::University Faker::Vehicle Faker::Verbs
  20. • Blockchain: Faker::Blockchain::Aeternity Faker::Blockchain::Bitcoin Faker::Blockchain::Ethereum Faker::Blockchain::Tezos
    • Books: Faker::Book Faker::Books::CultureSeries Faker::Books::Dune Faker::Books::Lovecraft Faker::Books::TheKingkillerChronicle
    • Creature: Faker::Creature::Animal Faker::Creature::Bird Faker::Creature::Cat Faker::Creature::Dog Faker::Creature::Horse
    • Quotes: Faker::Quote Faker::Quotes::Chiquito Faker::Quotes::Rajnikanth Faker::Quotes::Shakespeare
    • Sports: Faker::Sports::Basketball Faker::Sports::Football
    • Japanese Media: Faker::JapaneseMedia::DragonBall Faker::JapaneseMedia::OnePiece
    • Movies: Faker::Movie Faker::Movies::BackToTheFuture Faker::Movies::Departed Faker::Movies::Ghostbusters Faker::Movies::HarryPotter Faker::Movies::Hobbit
    • Music: Faker::Music Faker::Music::GratefulDead Faker::Music::Hiphop Faker::Music::Opera Faker::Music::PearlJam Faker::Music::Phish
    • Tv Shows: Faker::TvShows::BigBangTheory Faker::TvShows::BojackHorseman Faker::TvShows::BreakingBad
    • Games: Faker::Game Faker::Games::ClashOfClans Faker::Games::DnD Faker::Games::Dota Faker::Games::ElderScrolls Faker::Games::Fallout Faker::Games::HalfLife Faker::Games::Heroes Faker::Games::HeroesOfTheStorm Faker::Games::LeagueOfLegends Faker::Games::Minecraft
  21. Faker.rb + Production DB = Grazer.rb
    Work in Progress - gem install grazer
    Grazer: someone who takes food (or drink) from a store and eats it in the store while looking around.
  22. rails g grazer:install # adds the required "grazer-files" that link to sensitive model data, in the folder db/grazer/
  23. rails db:grazer:validate # validates data consistency (on CI or manually) for existing Grazer configs vs current DB state
    rails db:grazer:update # updates Grazer configs with current DB state
  24. Thank you! @sergyenko
    • Ruby News - Ruby Weekly News Digest https://ruby.news
    • BRUG - Belarus Ruby User Group https://theBRUG.t.me