Growing a biomedical search engine in Clojure

Presentation at Dutch Clojure Days 2019.

Starting a new project is always hard. The blank pages stare at you. Sometimes you know what needs to be done, but sometimes the requirements are totally unknown. Three years ago Doctor Evidence decided to build a biomedical search engine, and this talk will be about our journey into the unknown.

Joël Kuiper

April 06, 2019

Transcript

  1. Evidence Based Medicine

    “Evidence-based medicine (EBM) is an approach to medical practice intended to optimize decision-making by emphasizing the use of evidence from well-designed and well-conducted research.” – Wikipedia
  2. It is estimated that in 2015 a hundred reports of randomized controlled trials were published every day. Over one million new articles are published each year across more than 25,000 journals globally.
  3. The process

    1. Search known literature
    2. Screen the literature for relevancy
    3. Extract data from PDFs
    4. Synthesize data using advanced statistics
    5. Report the findings
    6. (repeat)
  4. “It’s like trying to build a ten piece puzzle, except you have to find the ten pieces within a pile of 10,000,000,000 pieces, all of which are square, and you don’t know what it’s supposed to look like at the end. Good luck.” – Stephen Mann
  5. The Idea

    • Annotate public knowledge with medical ontologies (“knowledge graphs”)
    • Include massive databases of public knowledge
    • Harmonize different data sources (“data warehousing”)
    • Be able to iteratively expand or contract search parameters using these annotations
    • Provide high level analytics about retrieved literature
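One way to read "expand or contract search parameters using these annotations" is as walking a concept hierarchy: broadening a query to include everything below a concept, or narrowing a set of concepts to one sub-branch. The sketch below illustrates that reading only. The toy ontology, the `PARENT` table, and the function names are all hypothetical, not taken from the talk.

```python
from collections import defaultdict

# Hypothetical toy ontology (child concept -> parent concept).
# A real medical ontology would be far larger and is not shown in the talk.
PARENT = {
    "type 1 diabetes": "diabetes mellitus",
    "type 2 diabetes": "diabetes mellitus",
    "diabetes mellitus": "endocrine disease",
    "hypothyroidism": "endocrine disease",
}


def children_index(parent_of):
    """Invert the child -> parent table into parent -> set of children."""
    children = defaultdict(set)
    for child, parent in parent_of.items():
        children[parent].add(child)
    return children


def expand(concept, parent_of):
    """Broaden a query: return the concept plus all of its descendants."""
    children = children_index(parent_of)
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def contract(concepts, keep_root, parent_of):
    """Narrow a query: keep only the concepts at or below keep_root."""
    return set(concepts) & expand(keep_root, parent_of)


broad = expand("endocrine disease", PARENT)
narrow = contract(broad, "diabetes mellitus", PARENT)
```

Here `broad` covers the whole endocrine branch, while `narrow` keeps only the diabetes concepts, so a user can widen or tighten a search without retyping text strings.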
  6. • Part-of-Speech tagging & Dependency Parsing
    • Deal with synonyms, acronyms/abbreviations, misspellings, etc.
    • Entity linking to concept URIs using RDF & SPARQL to provide a conceptual index rather than a text string index
    • Provide Population, Outcome & Intervention disambiguation using a BiLSTM-CRF neural network
    • Predict document characteristics (e.g. is it an RCT)
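The difference between a conceptual index and a text string index can be shown with a small sketch. It assumes a plain in-memory dictionary in place of the RDF store and SPARQL queries named on the slide, and the `example.org` URIs, the lexicon, and all function names are made up for illustration. Misspellings and ambiguous acronyms are not handled here.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: surface term -> concept URI. Synonyms share one URI.
TERM_TO_CONCEPT = {
    "myocardial infarction": "http://example.org/concept/C001",
    "heart attack": "http://example.org/concept/C001",
    "aspirin": "http://example.org/concept/C002",
    "acetylsalicylic acid": "http://example.org/concept/C002",
}


def normalize(text):
    """Lowercase and collapse punctuation so lookups are case-insensitive."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def link_entities(text, lexicon=TERM_TO_CONCEPT, max_len=3):
    """Greedy longest-match lookup of token n-grams in the lexicon."""
    tokens = normalize(text).split()
    concepts = set()
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                concepts.add(lexicon[phrase])
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return concepts


def build_index(docs):
    """Inverted index keyed by concept URI instead of by text string."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept in link_entities(text):
            index[concept].add(doc_id)
    return index


def search(index, query):
    """Return documents that mention every concept found in the query."""
    concepts = link_entities(query)
    if not concepts:
        return set()
    return set.intersection(*(index.get(c, set()) for c in concepts))


docs = {
    1: "Aspirin after myocardial infarction.",
    2: "Acetylsalicylic acid use following a heart attack",
    3: "Statins in primary prevention",
}
index = build_index(docs)
hits = search(index, "heart attack aspirin")
```

Both documents 1 and 2 match the query even though document 1 never contains the string "heart attack" and document 2 never contains "aspirin", which is the point of indexing concepts rather than strings.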
  7. Apply this trick across literature

    Our database currently contains ~30.9 million items and 1.3 million concepts (2.8 million terms)
  8. Our stack

    • Front-end
      • ClojureScript
      • Re-frame / Reagent
      • LESS CSS
      • Boot builds
    • Back-end
      • Clojure on JVM
      • PostgreSQL
      • Elasticsearch
      • ZomboDB
      • RabbitMQ
    • ML
      • Python
      • Keras / TensorFlow
      • Scikit-Learn
  9. At that point

    • Proven technology using Clojure/ClojureScript/Python
    • We have users & clients that are excited
    • Still major academic & technological challenges ahead (Word Sense Disambiguation, more reliable document class prediction, better entity linking…)
    • … feature requests
  10. • Name synonymy / homonymy (same name different person? different name same person?)
    • How to calculate “influence” (PageRank of co-author graph)
    • Drawing sensible networks (Dijkstra’s / spanning trees)
    • Fuzzy matching of affiliation names?
    • etc., etc.
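The "PageRank of co-author graph" idea can be sketched with plain power iteration. This is a minimal illustration under stated assumptions, not the implementation used in the product: the paper list, the weighting of edges by number of shared papers, the damping factor of 0.85, and the function names are all choices made here.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_graph(papers):
    """Undirected graph; edge weight = number of papers two authors share."""
    graph = defaultdict(lambda: defaultdict(int))
    for authors in papers:
        for author in authors:
            graph[author]  # make sure solo authors exist as nodes
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; ranks sum to 1."""
    nodes = list(graph)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by authors without co-authors is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            total = sum(graph[v].values())
            for neighbour, weight in graph[v].items():
                new_rank[neighbour] += damping * rank[v] * weight / total
        rank = new_rank
    return rank


# Hypothetical author lists, one per paper.
papers = [["A", "B"], ["A", "C"], ["A", "D"], ["B", "C"]]
ranks = pagerank(coauthor_graph(papers))
most_influential = max(ranks, key=ranks.get)
```

In this toy graph author A, who has published with everyone else, ends up with the highest score. The name ambiguity problem from the first bullet has to be solved before such a score means anything, since two people merged under one name would inflate it.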
  11. Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
  12. growth | ɡrəʊθ | noun

    1 [mass noun] the process of increasing in size: the upward growth of plants | the growth of the city affects the local climate.
      • the process of developing physically, mentally, or spiritually: keeping a journal can be a vital step in our personal growth.
      • the process of increasing in amount, value, or importance: the rates of population growth are lowest in the north.
      • the increase in number and spread of small or microscopic organisms: some additives slow down the growth of micro-organisms.
      • increase in economic activity or value: the government aims to get growth back into the economy.
    2 something that has grown or is growing: a day's growth of unshaven stubble on his chin.
      • Medicine & Biology a tumour or other abnormal formation: the method enables doctors to distinguish between malignant and benign growths.
  13. • Dynamic language with immutable data structures
    • REPL driven development
    • Sensible dependency system
    • It’s relatively slow moving
    • Draws from decades of Java/JVM and Lisp history
  14. In a next life I’ll become a civil engineer so I can work on something more concrete
  16. How to move

    • Don’t bother with priorities, milestones, planning poker, time keeping, “productivity metrics”, etc.
    • Accept that your backlog is infinite
    • Take breaks
    • The only thing to “manage” is vision and focus
  17. Don’t waste time on

    • logging or metrics for UI/UX decisions
      • No amount of feature tickets or tweaks will save you from a bad idea
    • unit tests
      • In a fail fast system the e2e tests should catch them
    • super complex cloud deployments
      • bare metal and bash will do
    • specs anywhere except system boundaries
      • they’re more annoying than useful
  18. Do spend time on

    • Creating a mental picture, story, or vision
    • Communicating & listening
    • Learning the domain more broadly and deeply
    • Developing a feel for quality (see Pirsig’s Zen and the Art of Motorcycle Maintenance)
  19. “I need my sleep. I need about eight hours a day, and about ten at night.” – Bill Hicks