Growing a biomedical search engine in Clojure

Presentation at Dutch Clojure Days 2019.

Starting a new project is always hard. The blank pages stare at you. Sometimes you know what needs to be done, but sometimes the requirements are totally unknown. Three years ago Doctor Evidence decided to build a biomedical search engine, and this talk will be about our journey into the unknown.

Joël Kuiper

April 06, 2019

Transcript

  1. None
  2. None
  3. None
  4. DOCTOR EVIDENCE®: Transforming data into insights. Joël Kuiper, vortext.systems

  6. Growing a biomedical search engine in Clojure

  7. 3 parts Setting the stage Growth Lessons learned

  8. 1. The stage

  9. Evidence Based Medicine
    “Evidence-based medicine (EBM) is an approach to medical practice intended to optimize decision-making by emphasizing the use of evidence from well-designed and well-conducted research.” – Wikipedia
  10. None
  11. Randomized Controlled Trials

  12. None
  13. It is estimated that in 2015 about a hundred reports of randomized controlled trials were published every day. Over one million new articles are published each year across more than 25 000 journals globally.
  14. Evidence Synthesis

  15. None
  16. 15 613 46

  17. None
  18. The process
    1. Search known literature
    2. Screen the literature for relevancy
    3. Extract data from PDFs
    4. Synthesize data using advanced statistics
    5. Report the findings
    6. (repeat)
  19. Systematic Reviews take years

  20. Can we do better with “AI”?

  21. None
  22. Software always exists at the intersection of technology, people & processes
  23. Learn the domain! & be extremely respectful

  24. 2. Growth

  25. “It’s like trying to build a ten piece puzzle, except you have to find the ten pieces within a pile of 10,000,000,000 pieces, all of which are square, and you don’t know what it’s supposed to look like at the end. Good luck.” – Stephen Mann
  26. The Idea
    • Annotate public knowledge with medical ontologies (“knowledge graphs”)
    • Include massive databases of public knowledge
    • Harmonize different data sources (“data warehousing”)
    • Be able to iteratively expand or contract search parameters using these annotations
    • Provide high level analytics about retrieved literature
  27. • Part-of-Speech tagging & Dependency Parsing
    • Deal with synonyms, acronyms/abbreviations, misspellings, etc.
    • Entity linking to concept URIs using RDF & SPARQL to provide a conceptual index rather than a text string index
    • Provide Population, Outcome & Intervention disambiguation using a BiLSTM-CRF neural network
    • Predict document characteristics (e.g. is it an RCT)
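
To make the difference between a conceptual index and a text string index concrete, here is a deliberately naive dictionary-based sketch. It assumes a hand-written term table and made-up `example.org` URIs; the slides describe RDF, SPARQL and neural models, none of which is reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical term -> concept URI table; synonyms and acronyms share one URI.
TERM_TO_CONCEPT = {
    "myocardial infarction": "http://example.org/concept/MI",
    "heart attack": "http://example.org/concept/MI",
    "mi": "http://example.org/concept/MI",
    "aspirin": "http://example.org/concept/ASPIRIN",
    "acetylsalicylic acid": "http://example.org/concept/ASPIRIN",
}


def link_concepts(text):
    """Return the set of concept URIs whose terms occur as whole words in text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    found = set()
    for term, uri in TERM_TO_CONCEPT.items():
        if re.search(r"\b" + re.escape(term) + r"\b", normalized):
            found.add(uri)
    return found


def build_index(docs):
    """Build a concept URI -> document ids index (a 'conceptual index')."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for uri in link_concepts(text):
            index[uri].add(doc_id)
    return index


docs = {
    1: "Aspirin after heart attack",
    2: "Acetylsalicylic acid and myocardial infarction",
    3: "Statins for stroke",
}
index = build_index(docs)
```

Documents 1 and 2 share no search term, yet both end up under the same two concept URIs. An acronym such as "MI" can mean several things, which is where the disambiguation models on the slide would come in; plain dictionary lookup cannot resolve that.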
  28. None
  29. None
  30. Apply this trick across literature
    Our database currently contains ~30.9 million items and 1.3 million concepts (2.8 million terms)
  31. None
  32. None
  33. None
  34. Our stack
    • Front-end
      • ClojureScript
      • Re-frame / Reagent
      • LESS CSS
      • Boot builds
    • Back-end
      • Clojure on JVM
      • PostgreSQL
      • Elasticsearch
      • ZomboDB
      • RabbitMQ
    • ML
      • Python
      • Keras / TensorFlow
      • scikit-learn
  35. At that point
    • Proven technology using Clojure/ClojureScript/Python
    • We have users & clients that are excited
    • Still major academic & technological challenges ahead (Word Sense Disambiguation, more reliable document class prediction, better entity linking…)
    • … feature requests
  36. For example: Influential Authors

  37. • Name synonymy / homonymy (same name different person? different name same person?)
    • How to calculate “influence” (PageRank of co-author graph)
    • Drawing sensible networks (Dijkstra’s / spanning trees)
    • Fuzzy matching of affiliation names?
    • etc.
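
As a rough illustration of "PageRank of co-author graph", the sketch below builds an undirected co-author graph from author lists and runs a plain power-iteration PageRank. The function names, the damping factor of 0.85 and the fixed iteration count are assumptions made for the example, not details from the talk, and author names are assumed to be already disambiguated.

```python
from itertools import combinations


def coauthor_graph(papers):
    """Build an undirected graph: author -> set of co-authors."""
    graph = {}
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            graph.setdefault(author, set())  # keep solo authors as nodes
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; returns author -> score, scores sum to 1."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for other in neighbours:
                    new_rank[other] += share
            else:
                # Author without co-authors: spread their rank over everyone.
                share = damping * rank[node] / n
                for other in graph:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical author lists, one per paper.
papers = [["A", "B"], ["A", "C"], ["A", "D"]]
ranks = pagerank(coauthor_graph(papers))
most_influential = max(ranks, key=ranks.get)
```

In this toy data author "A" has written with everyone else and therefore scores highest. On real data the hard part is the first bullet on the slide: deciding which name strings refer to the same person before the graph is built.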
  38. Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
  39. “A competitor has X, can we also do X?”

  40. “oh could we also use the blockchain?”

  41. None
  42. I believe the majority of projects struggle (or fail) at this point.
  43. What is growth?

  44. growth | ɡrəʊθ | noun
    1 [mass noun] the process of increasing in size: the upward growth of plants | the growth of the city affects the local climate.
      • the process of developing physically, mentally, or spiritually: keeping a journal can be a vital step in our personal growth.
      • the process of increasing in amount, value, or importance: the rates of population growth are lowest in the north.
      • the increase in number and spread of small or microscopic organisms: some additives slow down the growth of micro-organisms.
      • increase in economic activity or value: the government aims to get growth back into the economy.
    2 something that has grown or is growing: a day's growth of unshaven stubble on his chin.
      • Medicine & Biology a tumour or other abnormal formation: the method enables doctors to distinguish between malignant and benign growths.
  45. Most projects grow like crystals

  46. & like crystals, projects solidify and become brittle when growing

  47. Maybe growth is the wrong concept

  48. Bouba/kiki projects

  49. Clojure helps you to shape projects, not grow them

  50. Why?

  51. • Dynamic language with immutable data structures
    • REPL driven development
    • Sensible dependency system
    • It’s relatively slow moving
    • Draws from decades of Java/JVM and Lisp history
  52. Code is heavy; don’t drag it along

  53. bouba or kiki?

  54. 3. Lessons learned

  55. In a next life I’ll become a civil engineer so I can work on something more concrete
  57. None
  58. How to move
    • Don’t bother with priorities, milestones, planning poker, time keeping, “productivity metrics”, etc.
    • Accept that your backlog is infinite
    • Take breaks
    • The only thing to “manage” is vision and focus
  59. Don’t waste time on
    • logging or metrics for UI/UX decisions
      • No amount of feature tickets or tweaks will save you from a bad idea
    • unit tests
      • In a fail fast system the e2e tests should catch them
    • super complex cloud deployments
      • bare metal and bash will do
    • specs anywhere except system boundaries
      • they’re more annoying than useful
  60. Do spend time on
    • Creating a mental picture, story, or vision
    • Communicating & listening
    • Learning the domain more broadly and deeply
    • Developing a feel for quality (see Pirsig’s Zen and the Art of Motorcycle Maintenance)
  61. “I need my sleep. I need about eight hours a day, and about ten at night.” – Bill Hicks
  62. Learn how to say “no”

  63. Learn how to say “yes”

  64. Software is a cultural construct

  65. A strange juxtaposition … maybe it’s always about insight and narratives
  66. The vision

  67. Clojure is awesome!