Creating an Open Source Genealogical Search Engine with Apache Solr

Brooke Schreier Ganz
Twitter: @LeafSeek
www.LeafSeek.com

Hi, I’m Brooke
•  I make web stuff for fun, and (sometimes) for profit
•  Web Developer at IBM.com and Disney Consumer Products
•  Lead Programmer at TMZ.com (yikes, sorry about that)
•  Senior Web Producer at Bravo cable TV network and its spin-off websites
•  Big dork
•  Big genealogy dork
•  #BigData dork

Meet Gesher Galicia
•  Non-profit 501(c)(3) genealogy society
•  Founded in 1993
•  Hundreds of members, worldwide
•  E-mail discussion group
•  New website development in progress (existing website is fugly)
•  Needs a search engine…for data

The Old Problem

The New Problem
•  Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)
•  Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)

Diverse Data Shapes

Existing solutions
•  They’re okay...for small numbers of databases, with small amounts of data
   –  Steve Morse's One-Step Tool Creator
   –  Roll-your-own solution with PHP and MySQL
•  Both get more difficult to manage as data sets increase in number and complexity

In space, no one can hear your data scream

To Sum Up
•  There are lots of ways to publish your tree
•  …but not so many ways to publish your data
•  Surely there must be a way to deal with this?

So I Made A Thing
But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek

This is the part where I show you all the shiny new All Galicia Database
http://search.geshergalicia.org/

Meet Apache Solr
•  Highly functional open source search platform
•  Based on Apache Lucene (Java)…
•  …plus a web wrapper/API
•  Not the prettiest or simplest tool
•  FREE and open source

Saves Time, and Heartache

Saves Time, and Stomachache

File Structure: Back-End

Welcome to /conf

The Important Stuff

solrconfig.xml
Make sure the data import request handler (Solr's DataImportHandler, which reads db-data-config.xml) is configured here, so you can import data.

How to get your data into Solr
•  Step 1: Make a properly-formatted spreadsheet
•  Step 2: Save spreadsheet as a .CSV file
•  Step 3: Create a MySQL database + table
•  Step 4: Import CSV into that new table
•  Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)
•  Step 6: Add this table’s information to db-data-config.xml
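Steps 2 through 5 can be tried out locally before touching a real server. The sketch below is only an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table name, column names, and sample rows are all made up. In MySQL the key column would be declared as something like `id INT AUTO_INCREMENT PRIMARY KEY`.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the exported .CSV file (Step 2).
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kowalski,Jan,1875,Lemberg
Schwartz,Chaje,1880,Tarnopol
"""


def load_csv_into_table(conn, table, csv_text):
    """Create `table` with an auto-incrementing `id` and load the CSV rows into it."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row of the spreadsheet holds the column names

    # Steps 3 and 5: create the table, including the auto-incrementing primary key "id".
    # (sqlite3 syntax; this is a stand-in for MySQL, not the real thing.)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )

    # Step 4: import the CSV rows, skipping blank lines.
    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_csv_into_table(conn, "births", SAMPLE_CSV)
records = conn.execute("SELECT id, Surname, Year FROM births ORDER BY id").fetchall()
print(count, records)
```

Every row ends up with its own integer id, which is what Step 5 asks for.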

db-data-config.xml
•  Basic XML file that tells Solr how to grab data from your MySQL database(s)
•  Add new <dataSource> for new databases
•  Add new <entity> for new tables within the databases
•  You need to make sure your MySQL connector .jar is installed for this to work
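To give a feel for the shape of this file, here is a small sketch that generates a db-data-config.xml for one database with one <entity> per table. The element and attribute names follow Solr's DataImportHandler conventions; the JDBC URL, user name, password, and table names are placeholders, not LeafSeek's real settings, and a second database would need its own <dataSource>.

```python
import xml.etree.ElementTree as ET


def build_data_config(jdbc_url, user, password, tables):
    """Return a DataImportHandler-style config with one <entity> per table."""
    root = ET.Element("dataConfig")
    # One <dataSource> per database; the driver class comes from the MySQL connector .jar.
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",
        "url": jdbc_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    for table in tables:
        # One <entity> per table; the query decides which columns Solr gets to see.
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })
    return ET.tostring(root, encoding="unicode")


# All values below are hypothetical.
xml_text = build_data_config(
    "jdbc:mysql://localhost/genealogy", "solr_reader", "change-me",
    ["births", "tax_lists"],
)
print(xml_text)
```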

Import!
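Once the handler and db-data-config.xml are in place, an import is usually kicked off by requesting a URL on the Solr server. The sketch below only builds that URL. It assumes the handler is registered at /dataimport and that Solr runs on the default local port; both are assumptions about a typical setup, not details from this deck.

```python
from urllib.parse import urlencode


def dataimport_url(solr_base, command="full-import", clean=True, commit=True):
    """Build the URL that asks the data import handler to run a command."""
    params = {
        "command": command,           # "full-import" reloads everything
        "clean": str(clean).lower(),  # wipe the existing index first
        "commit": str(commit).lower(),  # make the new data searchable when done
    }
    return f"{solr_base.rstrip('/')}/dataimport?{urlencode(params)}"


# Hypothetical local Solr address.
url = dataimport_url("http://localhost:8983/solr")
print(url)
```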

schema.xml
•  FieldTypes, Fields, and CopyFields
•  FieldTypes give indexing and querying instructions to “buckets”
•  Fields say what’s what and whether to make something facetable or not
•  CopyFields collect several Fields together into an extra Field, which is indexed using one of the FieldTypes

schema.xml - FieldTypes
•  5 Custom FieldTypes (so far):
   –  givenname
   –  surname
   –  surname_bmpm (phonetic)
   –  place (note: not merely town)
   –  year (which we’re treating as text right now)

schema.xml - Fields
•  Fields that start with an uppercase letter are named after the MySQL columns they come from
•  Examples:
   –  Year
   –  SchoolYear
   –  Surname
   –  FathersTown
   –  MothersFathersGivenName
   –  MaternalGrandfathersGivenName

schema.xml - Fields
•  Lowercase fields are added as the data is imported into Solr, and start with the prefix record_
•  Examples:
   –  record_type (birth, death, tax, whatever)
   –  record_source (name of repository)
   –  record_latlong (latitude,longitude)
   –  record_id (required!)
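As a rough picture of how the two kinds of fields sit side by side, the sketch below turns one database row into a Solr-style document and adds the record_ fields. The helper name, the way record_id is built, and the sample values are all assumptions made for illustration; the deck does not say how LeafSeek generates these values.

```python
def to_solr_document(row, record_type, record_source, latlong=None):
    """Combine a MySQL row (capitalized column names) with lowercase record_ fields."""
    # Keep the column-named fields, dropping the database id and any empty values.
    doc = {key: value for key, value in row.items()
           if key != "id" and value not in (None, "")}
    doc["record_type"] = record_type
    doc["record_source"] = record_source
    if latlong is not None:
        latitude, longitude = latlong
        doc["record_latlong"] = f"{latitude},{longitude}"  # "latitude,longitude"
    # Assumption: a unique record_id made from the record type plus the table's id.
    doc["record_id"] = f"{record_type}-{row['id']}"
    return doc


# Hypothetical row and repository name.
row = {"id": 17, "Surname": "Schwartz", "Year": "1880", "FathersTown": ""}
doc = to_solr_document(row, "birth", "Example Archive", latlong=(49.84, 24.03))
print(doc)
```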

schema.xml - Fields
•  You do not have to explicitly define every Field.
•  If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.
•  Which is fine.
•  But IMHO it’s better to define everything anyway so you can remember what’s what and what you are doing to it.

schema.xml - CopyFields

Add-ons and nice-to-haves (for the back-end)
•  Wildcards, and lots of ‘em
•  Non-name words handled through stopwords.txt
•  Nicknames and name synonyms handled through synonyms.txt
•  Two files included:
   –  synonyms_-_american-anglo-saxon.txt
   –  synonyms_-_polish-ukrainian-jewish.txt
•  Should be based on your data and your historical/ethnic community standards
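Solr applies synonyms.txt itself at index or query time, so nothing like the code below is needed in practice. It is only a sketch of what a comma-separated synonym line means: every name on a line is treated as equivalent to the others. The sample entries are invented and are not taken from the two files listed above, and lines using the `=>` mapping syntax are skipped for simplicity.

```python
def parse_synonym_groups(text):
    """Parse comma-separated synonym lines, ignoring blanks, comments, and '=>' rules."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" in line:
            continue
        groups.append([term.strip().lower() for term in line.split(",") if term.strip()])
    return groups


def expand_name(name, groups):
    """Return the name plus every synonym that shares a line with it."""
    key = name.strip().lower()
    variants = {key}
    for group in groups:
        if key in group:
            variants.update(group)
    return sorted(variants)


# Hypothetical synonym entries, for illustration only.
SAMPLE_SYNONYMS = """# given-name equivalents
william, bill, will, wilhelm
chaim, hyman, herman
"""

groups = parse_synonym_groups(SAMPLE_SYNONYMS)
print(expand_name("Bill", groups))
```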

More add-ons and nice-to-haves (for the back-end)
•  Translate your site into different languages
   –  Multilingual content deserves a real multilingual website
   –  Pass user preferences through a GET value, the Accept-Language header, a cookie, or whatever you want
•  Built-in performance monitoring hooks for New Relic
•  Soundalike searches for surname variants
   –  Levenshtein distance
   –  “Regular” Soundex, Metaphone, Caverphone, etc.
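For readers who have not met Levenshtein distance before: it counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Solr has its own fuzzy matching, so the sketch below is not how LeafSeek does it; it is a plain textbook implementation, with made-up surnames and an arbitrary threshold, to show why spelling variants end up close together.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


def close_surnames(query, candidates, max_distance=2):
    """Return the candidates within max_distance edits of the query, ignoring case."""
    lowered = query.lower()
    return [name for name in candidates
            if levenshtein(lowered, name.lower()) <= max_distance]


# Hypothetical surnames.
print(close_surnames("Schwartz", ["Schwarz", "Szwarc", "Kowalski"]))
```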

This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)

Relevancy
•  Right now, we’re using exact matches
•  (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)
•  Like “Old Search” on Ancestry.com
•  DisMax! Boosting fields! Scoring!
•  (…but not yet)
•  Problem: records that contain multiple people’s names
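Since DisMax is not switched on yet, the following is only a sketch of what field boosting might look like as Solr request parameters. `defType`, `qf`, and `fl` are standard Solr parameters; the GivenName field and the boost values are hypothetical.

```python
from urllib.parse import urlencode


def dismax_params(user_query, boosts):
    """Build request parameters for a DisMax query with per-field boosts."""
    # qf lists the fields to search; "^" attaches a boost to each one.
    query_fields = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": query_fields,
        "fl": "*,score",  # return every stored field plus the relevancy score
    }


# Hypothetical boosts: a surname match counts twice as much as a given-name match.
params = dismax_params("schwartz", {"Surname": 3.0, "GivenName": 1.5})
print(urlencode(params))
```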

Lots of Front-End Options
•  Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr
•  Django/Python: Haystack, Sunburnt, solrpy, pysolr
•  Older PHP options: PECL, solr-php-client
•  Plugins for blog/CMS systems: Drupal, WordPress

Meet Solarium
•  http://www.solarium-project.org/
•  New, open source PHP wrapper for Solr
•  Very active development
•  Version 2.4 coming soon

File Structure: Front-End

Meet Solarium: The Config

Meet Solarium: The Guts
•  You choose the parts of your data to facet
•  Data is submitted to the front-end by POST, not by GET, so the URL never changes
•  You can (and should) paginate results listings
•  You can't actually see the Solr server's URL from the front-end, not even in view-source
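Solarium builds the Solr request on the PHP side. To show what faceting and pagination come down to underneath, here is a sketch of the equivalent raw Solr parameters in Python. The parameter names are standard Solr ones; the query, facet fields, and page size are made up for the example.

```python
from urllib.parse import urlencode


def select_params(query, facet_fields, page=1, per_page=25):
    """Build Solr select parameters with faceting and simple pagination."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "q": query,
        "wt": "json",
        "facet": "true",
        "facet.field": list(facet_fields),  # one facet.field entry per chosen field
        "facet.mincount": 1,                # hide facet values with no matches
        "start": (page - 1) * per_page,     # offset of the first result on this page
        "rows": per_page,
    }


# Hypothetical search: third page of results, faceted on record type and year.
params = select_params("Surname:Schwartz", ["record_type", "Year"], page=3)
post_body = urlencode(params, doseq=True)  # form-encoded, suitable for a POST body
print(post_body)
```

Sending this as a POST body from the server side is one way the Solr URL stays out of the visitor's browser.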

Add-ons and nice-to-haves (for the front-end)
•  A welcome screen with information about the database's contents
•  Instructions (maybe twice)
•  How many records in the database?
•  How many datasets?
•  What features are coming next?
•  What datasets are coming next?

Add-ons and nice-to-haves (for the front-end)
•  Make good UI choices
•  Pop-Up Google Maps
•  Tooltips to reduce UI clutter
•  Cross-browser compatibility
•  Still stuck with IE 7 and 8
•  CSS and code that degrades gracefully
•  No small text

Bird’s Eye View of Your Data
•  What (surnames, towns, etc.) do I have in my data?
•  What are the TOP (surnames, towns, etc.) in my data?
•  Finding incorrect data
   –  Outlying years and dates
   –  Figure out that hard-to-read surname
•  Make charts and graphs from your data
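Facet counts are what make this bird's eye view cheap to get. In Solr's default JSON output, the counts for a facet field come back as one flat list that alternates value, count, value, count. The sketch below turns such a list into a "top N" table and flags years outside an expected range. The sample counts and the year range are invented for illustration.

```python
def top_facet_values(facet_list, n=10):
    """Turn Solr's flat [value, count, value, count, ...] list into the top n pairs."""
    pairs = list(zip(facet_list[0::2], facet_list[1::2]))
    # Highest count first; ties are broken alphabetically.
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs[:n]


def outlying_years(year_values, earliest, latest):
    """Return year values (stored as text) that are not numbers in the expected range."""
    outliers = []
    for value in year_values:
        if not value.isdigit() or not (earliest <= int(value) <= latest):
            outliers.append(value)
    return sorted(outliers)


# Hypothetical facet results for a surname field and a year field.
surname_facets = ["Schwartz", 120, "Kowalski", 75, "Adler", 75, "Schwarz", 40]
year_facets = ["1875", 310, "1880", 295, "2880", 1, "188", 2]

print(top_facet_values(surname_facets, n=3))
print(outlying_years(year_facets[0::2], 1700, 1950))
```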

The (Back-End) Future! (Maybe.)
•  Date ranges, instead of just years
•  Auto-complete as you type
•  “Did you mean...?” (based on data frequency)
•  “More Like This” (would have to do scoring)
•  Record bookmarking system (hashes?)
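None of this exists yet, so the following is only one guess at how a frequency-based “Did you mean...?” could work: find the known surnames that are spelled similarly to the query, then suggest the one that occurs most often in the data. It uses Python's standard difflib for the similarity check; the surname counts and the cutoff are made up.

```python
import difflib


def did_you_mean(query, surname_counts, cutoff=0.8):
    """Suggest the most frequent known surname that is spelled close to the query."""
    lowered = {name.lower(): name for name in surname_counts}
    key = query.strip().lower()
    if key in lowered:
        return None  # the query is already a known surname, nothing to suggest
    matches = difflib.get_close_matches(key, list(lowered), n=5, cutoff=cutoff)
    if not matches:
        return None
    # Among the similar spellings, prefer the one that appears most often in the data.
    best = max(matches, key=lambda match: surname_counts[lowered[match]])
    return lowered[best]


# Hypothetical surname frequencies.
counts = {"Schwartz": 120, "Schwarz": 40, "Kowalski": 75}
print(did_you_mean("Shwartz", counts))
```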

The (Front-End) Future! (Maybe.)
•  Hierarchical facets for locations
•  Disambiguating locations
•  Social sharing of individual records
•  New genealogy data schema http://historical-data.org/
•  Membership login system

Please Do Not Build That Wall
•  Password protect some of the databases
•  Password protect some of the data
•  Open data, but pay for record or surname bookmarking system
•  Open data, but pay for API access
•  Open data, but sell online ads
•  Open data, but give people guilt trips

Presenting LeafSeek!
•  Free and Open Source
•  Code is all on GitHub
•  Please add, edit, fix, change, tinker
•  …and use it!

Why is this FREE? And why is this important?

Thank you! :-)
