Creating an Open Source Genealogical Search Engine with Apache Solr

Set Your Records Free!

LeafSeek is a new tool that helps you turn your genealogical or historical record collections into searchable online databases. Combine multiple datasets of different types — such as birth, marriage, and military records — into one unified searchable website. Find inter-connections in your data that you never noticed before.

With great features like built-in geo-spatial searches, pop-up Google Maps, Beider-Morse Phonetic Matching, name synonyms, and language localization, LeafSeek can help you turn your spreadsheets of names and dates into a full-featured genealogy search engine. It’s designed for researchers and genealogy societies alike.

Oh, and one more thing: LeafSeek is free and open source. No strings attached.

Brooke Schreier Ganz

February 06, 2012

Transcript

  1. Creating an Open Source Genealogical Search Engine With Apache Solr

    Brooke Schreier Ganz [email protected] Twitter: @LeafSeek www.LeafSeek.com
  2. Hi, I’m Brooke

    •  I make web stuff for fun, and (sometimes) for profit
    •  Web Developer at IBM.com and Disney Consumer Products
    •  Lead Programmer at TMZ.com (yikes, sorry about that)
    •  Senior Web Producer at Bravo cable TV network and its spin-off websites
    •  Big dork
    •  Big genealogy dork
    •  #BigData dork
  3. Meet Gesher Galicia

    •  Non-profit 501(c)(3) genealogy society
    •  Founded in 1993
    •  Hundreds of members, worldwide
    •  E-mail discussion group
    •  New website development in progress (existing website is fugly)
    •  Needs a search engine…for data
  4. The New Problem

    •  Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)
    •  Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)
  5. Existing solutions

    •  They’re okay...for small numbers of databases, with small amounts of data
      –  Steve Morse's One-Step Tool Creator
      –  Roll-your-own solution with PHP and MySQL
    •  Both get more difficult to manage as data sets increase in number and complexity
  6. To Sum Up

    •  There are lots of ways to publish your tree
    •  …but not so many ways to publish your data
    •  Surely there must be a way to deal with this?
  7. So I Made A Thing

    But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek
  8. This is the part where I show you all the shiny new All Galicia Database

    http://search.geshergalicia.org/
  9. Meet Apache Solr

    •  Highly functional open source search platform
    •  Based on Apache Lucene (Java)…
    •  …plus a web wrapper/API
    •  Not the prettiest or simplest tool
    •  FREE and open source
  10. How to get your data into Solr

    •  Step 1: Make a properly-formatted spreadsheet
    •  Step 2: Save spreadsheet as a .CSV file
    •  Step 3: Create a MySQL database + table
    •  Step 4: Import CSV into that new table
    •  Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)
    •  Step 6: Add this table’s information to db-data-config.xml
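
Steps 2 through 5 on this slide can be sketched in a few lines of Python. This is only an illustration, not LeafSeek's own import code: it uses the standard library's sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name, column names, and sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a spreadsheet saved as a .CSV file.
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kowalski,Jan,1872,Tarnow
Nowak,Anna,1880,Lwow
"""


def load_csv_into_table(conn, table, csv_text):
    """Create `table` with an auto-incrementing integer `id` and load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first CSV row holds the column names
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', reader
    )
    conn.commit()
    return header


# sqlite3 is an assumption made for portability; the talk uses MySQL.
conn = sqlite3.connect(":memory:")
load_csv_into_table(conn, "births", SAMPLE_CSV)
rows = conn.execute("SELECT id, Surname, Year FROM births ORDER BY id").fetchall()
print(rows)
```

Every column is loaded as text here, which matches the talk's note elsewhere that years are treated as text for now.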
  11. db-data-config.xml

    •  Basic XML file that tells Solr how to grab data from your MySQL database(s)
    •  Add new <dataSource> for new databases
    •  Add new <entity> for new tables within the databases
    •  You need to make sure your MySQL connector .jar is installed for this to work
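
To make the shape of that file concrete, here is a minimal Python sketch that generates a db-data-config.xml skeleton with one <dataSource> and one <entity> per table. The database URL, credentials, and table names are hypothetical, and the real LeafSeek file may well carry more attributes and field mappings than shown here.

```python
import xml.etree.ElementTree as ET


def build_data_config(db_url, user, password, tables):
    """Build a <dataConfig> tree: one JDBC <dataSource>, one <entity> per table."""
    root = ET.Element("dataConfig")
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",  # class provided by the MySQL connector .jar
        "url": db_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    for table in tables:
        # Each entity pulls every row of its table; column names become field names.
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })
    return root


# All connection details and table names below are placeholders.
config = build_data_config(
    "jdbc:mysql://localhost/genealogy", "solr_user", "change-me",
    ["births", "marriages"],
)
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

Adding a new table then means adding one more name to the list, which mirrors the slide's "add new <entity> for new tables".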
  12. schema.xml

    •  FieldTypes, Fields, and CopyFields
    •  FieldTypes give indexing and querying instructions to “buckets”
    •  Fields say what’s what and whether to make something facetable or not
    •  CopyFields copy several Fields together into one extra Field, which gets its own FieldType
  13. schema.xml - FieldTypes

    •  5 Custom FieldTypes (so far):
      –  givenname
      –  surname
      –  surname_bmpm (phonetic)
      –  place (note: not merely town)
      –  year (which we’re treating as text right now)
  14. schema.xml - Fields

    •  Capitalized fields are named after the MySQL columns they come from
    •  Examples:
      –  Year
      –  SchoolYear
      –  Surname
      –  FathersTown
      –  MothersFathersGivenName
      –  MaternalGrandfathersGivenName
  15. schema.xml - Fields

    •  Lowercase fields are added as the data is imported into Solr, and start with the prefix record_
    •  Examples:
      –  record_type (birth, death, tax, whatever)
      –  record_source (name of repository)
      –  record_latlong (latitude,longitude)
      –  record_id (required!)
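
As an illustration of how the two kinds of fields fit together, the sketch below turns one database row into a Solr-style document: the capitalized column names pass through untouched and the record_ fields are added on top. The function name, the record_id format, and the sample values are assumptions made for this example, not LeafSeek's actual import logic.

```python
def to_solr_document(row, record_type, record_source, latlong=None):
    """Combine a database row with the lowercase record_* fields."""
    # Keep the capitalized column names as they are, skipping empty values
    # and the raw database id (it is folded into record_id below).
    doc = {key: value for key, value in row.items()
           if key != "id" and value not in (None, "")}
    doc["record_type"] = record_type
    doc["record_source"] = record_source
    # Assumption: prefixing the id with the record type keeps ids unique
    # when several tables each start counting from 1.
    doc["record_id"] = f"{record_type}-{row['id']}"
    if latlong is not None:
        latitude, longitude = latlong
        doc["record_latlong"] = f"{latitude},{longitude}"
    return doc


# Hypothetical row, as it might come back from the imported table.
row = {"id": 7, "Surname": "Kowalski", "Year": "1872", "FathersTown": ""}
doc = to_solr_document(row, "birth", "Hypothetical Archive", (49.8397, 24.0297))
print(doc)
```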
  16. schema.xml - Fields

    •  You do not have to explicitly define every Field.
    •  If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.
    •  Which is fine.
    •  But IMHO it’s better to define everything anyway so you can remember what’s what and what you are doing to it.
  17. Add-ons and nice-to-haves (for the back-end)

    •  Wildcards, and lots of ‘em
    •  Non-name words handled through stopwords.txt
    •  Nicknames and name synonyms handled through synonyms.txt
    •  Two files included:
      –  synonyms_-_american-anglo-saxon.txt
      –  synonyms_-_polish-ukrainian-jewish.txt
    •  Should be based on your data and your historical/ethnic community standards
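
Solr does the synonym expansion itself at index or query time, but the idea is easy to show outside Solr. The sketch below reads lines in the comma-separated style that Solr synonym files use and expands a given name into its whole group. The sample entries are hypothetical and are not taken from the two files that ship with LeafSeek; explicit `=>` mapping lines are skipped to keep the example short.

```python
# Hypothetical entries in the comma-separated style of a Solr synonyms file.
SAMPLE_SYNONYMS = """\
# nicknames and variants, one group per line
william, bill, will, willy
elizabeth, liz, beth, betty
"""


def load_synonym_groups(text):
    """Map every name to the full set of names listed on its line."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and explicit "a => b" mappings.
        if not line or line.startswith("#") or "=>" in line:
            continue
        names = {name.strip().lower() for name in line.split(",") if name.strip()}
        for name in names:
            groups.setdefault(name, set()).update(names)
    return groups


def expand_name(name, groups):
    """Return the name plus all of its listed synonyms, sorted."""
    key = name.strip().lower()
    return sorted(groups.get(key, {key}))


groups = load_synonym_groups(SAMPLE_SYNONYMS)
print(expand_name("Bill", groups))   # ['bill', 'will', 'william', 'willy']
print(expand_name("Chaim", groups))  # ['chaim'] (no group listed, so only itself)
```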
  18. More add-ons and nice-to-haves (for the back-end)

    •  Translate your site into different languages
      –  Multi-lingual content deserves a real multi-lingual website
      –  Pass user preferences through GET value or through accept-language header or read from a cookie or whatever you want
    •  Built-in performance monitoring hooks for New Relic
    •  Soundalike searches for surname variants
      –  Levenshtein distance
      –  “Regular” Soundex, Metaphone, Caverphone, etc.
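
For readers who have not met it, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a plain-Python sketch of the standard dynamic-programming version, plus a small hypothetical helper that filters a list of surnames by distance. Inside Solr this kind of matching comes from its own fuzzy-search support, not from code like this.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similar_surnames(query, candidates, max_distance=1):
    """Hypothetical helper: candidates within max_distance edits of query."""
    query = query.lower()
    return [name for name in candidates
            if levenshtein(query, name.lower()) <= max_distance]


print(levenshtein("kitten", "sitting"))  # 3
print(similar_surnames("Schreier", ["Schreyer", "Schreiber", "Shrier", "Kowalski"]))
```

Edit distance only catches spelling drift. It does not know that two differently spelled names sound alike, which is the gap the phonetic algorithms on this slide are meant to fill.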
  19. This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)
  20. Relevancy

    •  Right now, we’re using exact matches
    •  (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)
    •  Like “Old Search” on Ancestry.com
    •  DisMax! Boosting fields! Scoring!
    •  (…but not yet)
    •  Problems with records with multiple people’s names in the record
  21. Lots of Front-End Options

    •  Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr
    •  Django/Python: Haystack, Sunburnt, solrpy, pysolr
    •  Older PHP options: PECL, solr-php-client
    •  Plugins for blog/CMS systems: Drupal, WordPress
  22. Meet Solarium

    •  http://www.solarium-project.org/
    •  New, open source PHP wrapper for Solr
    •  Very active development
    •  Version 2.4 coming soon
  23. Meet Solarium: The Guts

    •  You choose the parts of your data to facet
    •  Data is submitted to the front-end by POST, not by GET, so the URL never changes
    •  You can (and should) paginate results listings
    •  You can't actually see the Solr server's URL from the front-end, not even in view-source
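
Behind the scenes, a wrapper like Solarium ends up sending Solr an ordinary HTTP query with faceting and paging parameters. The sketch below builds such a request in Python, not PHP, purely to show the moving parts. The server address, query, and field names are hypothetical; `q`, `start`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import parse_qs, urlencode


def build_select_query(base_url, query, facet_fields, page=1, per_page=20):
    """Build a Solr /select URL with faceting and pagination parameters."""
    params = [
        ("q", query),
        ("start", (page - 1) * per_page),  # zero-based offset of the first result
        ("rows", per_page),                # number of results per page
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field that should be faceted.
    params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical server address and field names.
url = build_select_query(
    "http://localhost:8983/solr", "Surname:Kowalski",
    ["record_type", "Year"], page=3, per_page=20,
)
print(url)
print(parse_qs(url.split("?", 1)[1]))
```

Because this request is made on the server side, the visitor's browser only ever talks to the front-end, which is why the Solr URL never shows up in the page source.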
  24. Add-ons and nice-to-haves (for the front-end)

    •  A welcome screen with information about the database's contents
    •  Instructions (maybe twice)
    •  How many records in the database?
    •  How many datasets?
    •  What features are coming next?
    •  What datasets are coming next?
  25. Add-ons and nice-to-haves (for the front-end)

    •  Make good UI choices
    •  Pop-Up Google Maps
    •  Tooltips to reduce UI clutter
    •  Cross-browser compatibility
    •  Still stuck with IE 7 and 8
    •  CSS and code that degrades gracefully
    •  No small text
  26. Bird’s Eye View of Your Data

    •  What (surnames, towns, etc.) do I have in my data?
    •  What are the TOP (surnames, towns, etc.) in my data?
    •  Finding incorrect data
      –  Outlying years and dates
      –  Figure out that hard-to-read surname
    •  Make charts and graphs from your data
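
In the search engine these overviews would come from facet counts, but the same questions can be asked of any list of records. The sketch below, using made-up records and a hypothetical plausible year range, counts the top values of a field and flags years that look like transcription errors.

```python
from collections import Counter

# Made-up records; years are strings because the talk treats year as text.
records = [
    {"Surname": "Kowalski", "Year": "1872"},
    {"Surname": "Nowak", "Year": "1880"},
    {"Surname": "Kowalski", "Year": "1781"},
    {"Surname": "Kowalski", "Year": "18972"},  # probably a typo for 1872 or 1897
    {"Surname": "Nowak", "Year": "1901"},
]


def top_values(records, field, n=3):
    """Most common values of a field, as (value, count) pairs."""
    counts = Counter(record[field] for record in records if record.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Records whose Year is not a number or falls outside the expected range."""
    flagged = []
    for record in records:
        year = record.get("Year", "")
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged


print(top_values(records, "Surname", n=2))  # [('Kowalski', 3), ('Nowak', 2)]
print(outlying_years(records, 1780, 1920))  # [{'Surname': 'Kowalski', 'Year': '18972'}]
```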
  27. The (Back-End) Future! (Maybe.)

    •  Date ranges, instead of just years
    •  Auto-complete as you type
    •  “Did you mean...?” (based on data frequency)
    •  “More Like This” (would have to do scoring)
    •  Record bookmarking system (hashes?)
  28. The (Front-End) Future! (Maybe.)

    •  Hierarchical facets for locations
    •  Disambiguating locations
    •  Social sharing of individual records
    •  New genealogy data schema http://historical-data.org/
    •  Membership login system
  29. Please Do Not Build That Wall

    •  Password protect some of the databases
    •  Password protect some of the data
    •  Open data, but pay for record or surname bookmarking system
    •  Open data, but pay for API access
    •  Open data, but sell online ads
    •  Open data, but give people guilt trips
  30. Presenting LeafSeek!

    •  Free and Open Source
    •  Code is all on GitHub
    •  Please add, edit, fix, change, tinker
    •  …and use it!