A view from the ivory tower: Participating in Apache as a member of academia

The phrase "academics in an ivory tower" conjures images of people toiling away, nicely insulated from many of the concerns of reality. While this has its advantages, anyone who has tried to use a project written for a research paper under a deadline can attest that it doesn't always result in useful code. While completing my PhD, I found an Apache project that fit well with the work I was doing, so I rolled up my sleeves and wrote some code to make it more useful for solving my own problems. I've since had the opportunity to join the project's PMC, and now, as a faculty member, I continue to find value in encouraging my own students to contribute to Apache projects. I'll discuss how academics and Apache projects can find mutual benefit in close collaboration.

Michael Mior

October 01, 2020


  1. A view from the ivory tower: Participating in Apache as a member of academia. Michael Mior, Rochester Institute of Technology
  2. Overview: 1. Introduction; 2. My path to Apache; 3. The Apache Way and the Academic Way; 4. Apache success stories; 5. How to get involved
  3. My Background • 2009: Undergraduate degree in Computing Science • 2011: Master's degree in Computer Science; joined the startup Bunch • 2013: Started a PhD in Computer Science • 2016: Joined Apache Calcite • 2018: Graduated from the PhD program; joined the Rochester Institute of Technology
  4. My Research • NoSQL database schema design and integration • Connecting data sources of diverse formats • Distributed data processing • Open data analysis • Data-driven schema discovery
  5. How I found Apache • Searched for existing work on heterogeneous data processing • Found Apache Calcite, a data processing framework • Contributed an adapter for Apache Cassandra • Also contributed to Apache Spark
  6. Apache Calcite • A dynamic data management framework • Basis for query optimization in Hive and Drill • Used by Alibaba, Uber, Tencent, … • Connects to MongoDB, Cassandra, Spark, …
  7. Apache Calcite History • 2014: Optiq enters incubation; Optiq is renamed Calcite • 2015: Calcite 1.0.0 released; Calcite graduates from incubator • 2016: I join as a committer • 2017: I join the PMC and start a term as PMC chair • 2018: Apache Calcite members publish in ACM SIGMOD
  8. My contributions • Searched for existing work on heterogeneous data processing • Found Apache Calcite, a data processing framework • Contributed an adapter for Apache Cassandra • Materialized view query rewriting with joins • Minor improvements to Apache Spark
  9. The Apache Way • Earned Authority • Community of Peers • Open Communications • Consensus Decision Making • Responsible Oversight
  10. Earned Authority • Everyone can participate • Influence is based on publicly earned merit • The most productive labs have diversity • Anyone can have a good idea
  11. Community of Peers • Individual participation, not organizations • Roles and titles are equal • Collaboration is common and expected • Students are also junior colleagues
  12. Open Communications • All communication is publicly accessible • Private decisions are disallowed • Unfortunately, not all work is public • But publication is critical
  13. Consensus Decision Making • Projects are overseen by volunteers • All votes are equal regardless of position • Shared governance is key
  14. Responsible Oversight • Projects are self-governing • Commits are peer-reviewed • Labs are typically fairly independent • Published work is peer-reviewed
  15. Independence • The ASF is vendor-neutral • No organization has special privilege • True in aspiration, if not always in practice
  16. Community Over Code • Healthy community is high priority • Good communities write good code • Burnout is real, the principles hold true
  17. Success Stories • Apache Flink (TU Berlin, HU Berlin, HPI) • Apache Kylin (eBay R&D) • Apache Mesos (UC Berkeley) • Apache Pig (Yahoo Research) • Apache Spark (UC Berkeley) • Apache Stanbol (Salzburg Research)
  18. Apache Spark Successes • Apache Spark: A Unified Engine for Big Data Processing (1,336 citations) • MLlib: Machine Learning in Apache Spark (1,456 citations) • Spark SQL: Relational Data Processing in Spark (1,121 citations) • Hundreds of other papers
  19. Apache Calcite Successes • Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources, SIGMOD 2018 (47 citations) • One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables, SIGMOD 2019 (10 citations) • Automated Reasoning of Database Queries, Shumo Chu, PhD thesis
  20. Apache Calcite Successes • Out of the 39 test cases that use SQL features supported by Cosette, Cosette is now able to formally prove that Calcite's rewrite in 33 of them are correct. This includes a few fairly complicated ones, like "testPushFilterPassAggThree". The good news is that we haven't found any bugs so far :) … We have also used the test cases to improve Cosette.
  21. Apache Calcite Successes • RLO: a reinforcement learning-based method for join optimization, Xinyi Zhang et al. • We implement RLO in Apache Calcite and Postgres. Extensive experiments demonstrate that Apache Calcite RLO is 10×–56× faster in finding the execution plan and 80% faster in executing the plan than the state-of-the-art heuristics.
  22. Why choose Apache? (for academics) • Find people who may be interested in your problems • Find problems that may be interesting to your people • Discover interesting technology (maybe unpublished) • Save time by building on existing platforms
  23. How to get started (for academics) • Find a project suited to your interests • Look for ways to apply your expertise • Write some code!
  24. Why choose academia? (for Apache folks) • Get more exposure (and potentially more committers) • Find a different perspective on problems to be solved • Meet people who may want to solve your problems • Potentially discover new technologies
  25. How to get started (with a problem) • Contact researchers working on relevant problems • Consider GSoC and other mentorship programs • Many faculty would love to find good problems
  26. How to get started (with a solution) • Academic conferences are not limited to academics! • Many conferences have an “industrial” track • Perhaps find an academic partner to publish with
  27. Challenges • Writing working code is easier than writing code ready to commit • Grad students find it difficult to make time for good code • Many advisors don’t have time for code review • Industry folks find it difficult to find time to write papers
  28. What do I do now? • Still a (somewhat) active member of the Calcite PMC • I’m fortunate to still be able to write code regularly • My students regularly work on code for Apache projects