

mongodb
June 26, 2012

MongoDB DC 2012: How IKANOW Uses MongoDB to Help Organizations Solve Really Big Problems

Craig Vitter, IKANOW
The world is awash in raw data like never before. Important information is buried inside untold petabytes of structured (databases, automated sensor feeds, XML documents, etc.) and unstructured (emails, web pages, tweets, PDFs, PowerPoint presentations, etc.) data sources. Until recently, organizations have had limited tools with which to extract value from these potential treasure troves. Within the past few years the open source community has developed a set of tools that, when combined, can help organizations fuse massive volumes of structured and unstructured data into actionable knowledge. In this presentation, learn how IKANOW built their open source document discovery and analysis platform, Infinit.e, using the best open source tools available, and why MongoDB is at the heart of that platform.


Transcript

  1. The Open Source document analysis platform
     Or, how IKANOW uses MongoDB to help organizations solve really big problems
  2. Agenda
     • What is Document Analysis?
     • The Infinit.e Solution
       – Infinit.e’s Architecture
       – Why and How we use MongoDB
     • Analyzing #MongoDC
     • Questions
  3. This is what Big Data Looks Like
     Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  4. What is Document Analysis?
     "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."
     Source: http://www.text-tech.com/docanalysis/definition.html
  5. Document Analysis
     • Common document source formats: RSS, JSON, XML, HTML, PDF, TXT, RTF, Word, PPT, Multimedia Files, RDBMS Records, etc.
  6. Document Analysis
     • The goal is to:
       – Extract Entities (people, places, things)
       – Create Associations between entities (in the form of noun-verb-noun), e.g.:
         • John Doe lives in Washington, D.C.
         • John Doe is married to Jane Doe
         • John Doe is a Virgo
         • John Doe traveled to Mexico on July 6th, 2011
         • And…
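The noun-verb-noun form on this slide can be sketched as a small data structure. This is only an illustration: the `Entity` and `Association` classes, their field names, and the `about` helper are assumptions made for the example, not Infinit.e's actual schema or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A named thing extracted from a document (hypothetical shape)."""
    name: str
    type: str  # e.g. "Person" or "Place"


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun link between two entities (hypothetical shape)."""
    subject: Entity
    verb: str
    target: Entity


john = Entity("John Doe", "Person")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "Place")),
    Association(john, "is married to", Entity("Jane Doe", "Person")),
    Association(john, "traveled to", Entity("Mexico", "Place")),
]


def about(entity, assocs):
    """Return readable statements for every association whose subject is `entity`."""
    return [
        f"{a.subject.name} {a.verb} {a.target.name}"
        for a in assocs
        if a.subject == entity
    ]


statements = about(john, associations)
```

Keeping the verb as a separate field is what lets a query ask "who traveled where?" without re-parsing the sentence text.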
  7. Document Analysis
     • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
       Who: people, organizations, facilities, company
       What: events, summaries, facts, themes
       When: past, present, future dates
       Where: city, state, country, coordinate
  8. The Infinit.e Solution
     • Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood.
     github.com/ikanow/Infinit.e
  9. The Infinit.e Solution
     Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Structured and Unstructured Documents
  10. Sample RSS Document
      <rss version="2.0">
        <channel>
          …
          <item>
            <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia…</title>
            <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
            <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most …</description>
            <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
            <dc:creator>unknown</dc:creator>
            <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
          </item>
          …
        </channel>
      </rss>
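A minimal sketch of how an RSS item like this one might be turned into a record with `title`, `url`, and `description` fields, the shape used by the document metadata on the following slide. The function name `items_to_docs`, the shortened sample feed, and its placeholder link are assumptions for illustration; this is not Infinit.e's harvester.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical feed modeled on the slide; the link is a placeholder.
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia</title>
      <link>http://www.pressreleasebureau.com/example.html</link>
      <description>Report by egyptlastminute.com CAIRO: ...</description>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""


def items_to_docs(rss_text):
    """Map each <item> element to a dict with the top-level metadata fields."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            # Placeholders that a later enrichment step would fill in.
            "metadata": {},
            "associations": [],
            "entities": [],
        })
    return docs


docs = items_to_docs(SAMPLE_RSS)
```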
  11. Document Metadata
      • doc_metadata.metadata
      {
        "_id" : ObjectId("4f93638e0cf212156d0559d2"),
        "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
        "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
        "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
        "created" : ISODate("2012-04-22T01:49:02Z"),
        "metadata" : {…},
        "associations" : […],
        "entities" : […],
        ...
      }
  12. Harvested Document Metadata
      • doc_metadata.metadata.metadata
      "metadata" : {
        "location" : [
          {
            "region" : "South Asia",
            "citystateprovince" : {
              "stateprovince" : "Rolpa", "city" : "Newang"
            },
            "country" : "Nepal"
          }
        ],
        "icn" : [ "200573487" ],
        "incidentdate" : [ "07/25/2005" ],
        "organization" : [
          "Communist Party of Nepal (Maoist)/United People's Front"
        ],
        ...
      },
      Note: It is okay to laugh at this
  13. Document Enrichment
      • Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and 3rd party NLP APIs including:
  14. Harvested Entities
      • feature.entity
      {
        "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
        "alias" : [
          "Zine el Abidine Ben Ali",
          "Zine El Abidine Ben Ali",
          "Zine el Abidine ben Ali"
        ],
        "batch_resync" : true,
        "communityId" : ObjectId("4f8f138103644ee8003bf518"),
        "db_sync_doccount" : NumberLong(143),
        "db_sync_time" : "1338751174988",
        "dimension" : "Who",
        "disambiguated_name" : "Zine El Abidine Ben Ali",
        "doccount" : 152,
        "index" : "zine el abidine ben ali/person",
        "totalfreq" : 353,
        "type" : "Person"
      }
  15. Harvested Associations
      • feature.association
      {
        "_id" : ObjectId("4f9189d48baf188282a1ca24"),
        "assoc_type" : "Fact",
        "communityId" : ObjectId("4f8f138103644ee8003bf518"),
        "db_sync_doccount" : NumberLong(70),
        "db_sync_time" : "1338491609281",
        "doccount" : NumberLong(73),
        "entity1" : [
          "zine el abidine ben ali",
          "zine el abidine ben ali/person"
        ],
        "entity1_index" : "zine el abidine ben ali/person",
        "entity2" : ["president","president/position"],
        "entity2_index" : "president/position",
        "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
        "verb" : ["career","current","past"],
        "verb_category" : "career"
      }
  16. Geolocation of Entities/Events
      • feature.geo
      {
        "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
        "search_field" : "cairo",
        "country" : "Egypt",
        "country_code" : "EG",
        "city" : "cairo",
        "region" : "Al Qahirah",
        "region_code" : "EG11",
        "population" : 7734602,
        "latitude" : "30.05",
        "longitude" : "31.25",
        "geoindex" : {
          "lat" : 30.05,
          "lon" : 31.25
        }
      }
      Note: MongoDB 2d Index
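The record above stores latitude and longitude twice: as strings, and as numbers under `geoindex`, the field the slide's note associates with a MongoDB 2d index. The sketch below shows, in plain Python, how such a numeric sub-document could be derived and what a bounding-box lookup over it means. The helper names `with_geoindex` and `in_box` are assumptions for illustration, and the in-memory filter only mimics the idea of a box query; it is not how MongoDB executes one.

```python
def with_geoindex(record):
    """Return a copy of `record` with a numeric geoindex built from its string coordinates."""
    doc = dict(record)
    doc["geoindex"] = {
        "lat": float(record["latitude"]),
        "lon": float(record["longitude"]),
    }
    return doc


def in_box(doc, lower_left, upper_right):
    """True if the doc's geoindex lies inside the (lat, lon) box, edges included."""
    lat, lon = doc["geoindex"]["lat"], doc["geoindex"]["lon"]
    return (lower_left[0] <= lat <= upper_right[0]
            and lower_left[1] <= lon <= upper_right[1])


cairo = with_geoindex({"city": "cairo", "latitude": "30.05", "longitude": "31.25"})
near_cairo = in_box(cairo, (29.0, 30.0), (31.0, 32.0))
```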
  17. Why MongoDB? – Reason #1
      Document-Oriented Storage
      • MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
  18. Why MongoDB? – Reason #2
      JSON
      • The standard language of open document analysis
        – JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines
        – BSON (Binary JSON) is MongoDB’s native data format
        – Infinit.e ingests and exports JSON natively via the REST-based API
      Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back
      This is the JSON logo
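The JSON-to-object round trip mentioned in the note can be illustrated with a Python analogue. This is not Gson and not Infinit.e code: it uses the standard `json` and `dataclasses` modules, and the `EntityFeature` class is a hypothetical subset of the fields shown on the feature.entity slide.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EntityFeature:
    """Hypothetical subset of the feature.entity fields from the slides."""
    disambiguated_name: str
    type: str
    dimension: str
    doccount: int


def from_json(text):
    """Parse a JSON object into a typed EntityFeature."""
    return EntityFeature(**json.loads(text))


def to_json(entity):
    """Serialize an EntityFeature back to a JSON string."""
    return json.dumps(asdict(entity), sort_keys=True)


raw = ('{"disambiguated_name": "Zine El Abidine Ben Ali", '
       '"type": "Person", "dimension": "Who", "doccount": 152}')
entity = from_json(raw)
round_tripped = from_json(to_json(entity))
```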
  19. Why MongoDB? – Reason #3
      MongoDB Is Web Scale*
      *Shards are the secret ingredients in the web scale sauce. They just work.
  20. Why MongoDB? – Reason #3
      Scalability
      • Seriously, MongoDB Scales
        – Harvesting and enriching documents requires a lot of disk space
        – MongoDB scales to arbitrary sizes in both read/write dimensions
        – Sophisticated sharding keys provide powerful/flexible balancing
        – BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
  21. Why MongoDB? – Reason #4
      Integration with Hadoop
      • Hadoop is rapidly becoming the de facto standard for data analytics
        – Open Source, very customizable
        – Proven scalability
        – Java libraries
      • The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
  22. Tweeting about MongoDC
      • Source: http://search.twitter.com/search.rss?q=mongodc
        – Who’s Tweeting?
        – What are they Tweeting?
        – What does basic document analysis of these Tweets tell us?