Is Agile Data Science just two buzzwords put together? I argue that agile is a practical and applicable methodology that works well in the real world for all sorts of analytics and data science workflows. Here is what we've learned.
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
* agilemanifesto.org
communication
Very short feedback loop and adaptation cycle
Quality focus
- iterations, timeboxed estimates
- no to tasks by email (with no face-to-face)
- daily standups, pair analysis
- verifiable, reproducible findings
matter; whiteboard next to your desk
2. Work with the decision maker; share preliminary findings
3. Make a research plan; pivot early
4. Book a “Findings” meeting before the project starts
5. MVP for Data Products
6. Do daily stand-ups!
notebooks: Dropbox over Git
2. Google Slides over PowerPoint; Google Slides over email with images (>2 images)
3. Google Spreadsheets over Excel (for analytics)
4. Podio over Jira (for analytics)
5. Data transformations in the DWH in SQL over Hadoop
6. Don’t copy-paste code in IPython notebooks, use functions; don’t copy-paste functions between notebooks, use modules
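A minimal sketch of the last point, assuming a hypothetical shared module called `analytics_utils.py`. The function names and the (visits, orders) data shape are made up for illustration. The idea is that logic is written once as a function, moved into a module, and imported by every notebook that needs it.

```python
# analytics_utils.py -- hypothetical shared module (name is an assumption)


def conversion_rate(visits, orders):
    """Return orders / visits, or 0.0 when there are no visits."""
    return orders / visits if visits else 0.0


def summarize(rows):
    """Aggregate a list of (visits, orders) tuples into one summary dict."""
    total_visits = sum(visits for visits, _ in rows)
    total_orders = sum(orders for _, orders in rows)
    return {
        "visits": total_visits,
        "orders": total_orders,
        "conversion_rate": conversion_rate(total_visits, total_orders),
    }


# In a notebook, instead of pasting the functions above into a cell:
#     from analytics_utils import summarize
#     summarize([(100, 5), (300, 15)])
```

A fix to `conversion_rate` then reaches every notebook that imports it, which is not the case for pasted copies.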
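[Figure: data value pyramid, with axes "complexity" and "value". Levels recoverable from the slide: Analytics, Predictive Analytics, Data Products. Questions shown alongside: Record what happened; Was it good or bad?; Why did it happen?; What will happen?; Affect the outcome.]
* inspired by Agile Data Science, Russell Jurney, O'Reilly Media 2013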
build predictive models that you can’t act upon. Don’t analyse stuff that does not help to make a decision
2. The best way to deal with the Analytics Spiral is to avoid the spiral. Practise Crack a Case and the “what if” method.
3. Climb the Data Value Pyramid fast. Once climbed - optimise the Data Value Loop.
4. Limit the number of “open loops”
access and security is abstracted away. Focus on SQL, not data access
formatting and publishing a .png in one line of code
PyCharm has a great SQL editor
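One way to read "focus on SQL, not data access" is a single helper that hides the connection handling. This is only a sketch: `run_sql`, `get_connection`, and the in-memory SQLite database are assumptions standing in for whatever warehouse and credentials store a team really uses.

```python
import sqlite3

# Hypothetical connection target; a real setup would read this from shared
# configuration or a secrets store rather than from the notebook.
_DSN = ":memory:"
_connection = None


def get_connection():
    """Open the connection once and reuse it for later queries."""
    global _connection
    if _connection is None:
        _connection = sqlite3.connect(_DSN)
    return _connection


def run_sql(query, params=()):
    """Run a statement and return rows as a list of dicts keyed by column name."""
    connection = get_connection()
    cursor = connection.execute(query, params)
    if cursor.description is None:
        # The statement returned no result set (e.g. CREATE or INSERT).
        connection.commit()
        return []
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

With a helper like this, an analyst's notebook cell reduces to something like `run_sql("SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country")`, with no connection or credential code in sight.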
you get rid of Excel
• ipynb are always shared and versioned; prefer simple cloud sharing to VCS
• Streamline data access functions
• Cache long-running code and queries
• Develop a common library
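The "cache long-running code and queries" point can be sketched as a small decorator that stores results on disk. Everything here is an assumption for illustration: the `cached` name, the cache directory under the system temp folder, and the use of `pickle` for storage. It does not handle invalidation when the underlying data changes.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location; any shared or local directory would do.
CACHE_DIR = Path(tempfile.gettempdir()) / "analysis_cache"


def cached(func):
    """Cache func's return value on disk, keyed by function name and arguments."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        key_source = repr((func.__name__, args, sorted(kwargs.items())))
        key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            # Cache hit: load the stored result instead of recomputing.
            with path.open("rb") as handle:
                return pickle.load(handle)
        result = func(*args, **kwargs)
        with path.open("wb") as handle:
            pickle.dump(result, handle)
        return result

    return wrapper
```

Decorating a slow query function with `@cached` means a rerun of the notebook with the same arguments reads the pickled result instead of hitting the database again. Deleting the cache directory forces a refresh.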