Slide 1

Slide 1 text

The Open Source document analysis platform
Or, how IKANOW uses MongoDB to help organizations solve really big problems

Slide 2

Slide 2 text

Agenda
•  What is Document Analysis?
•  The Infinit.e Solution
   – Infinit.e’s Architecture
   – Why and How we use MongoDB
•  Analyzing #MongoDC
•  Questions

Slide 3

Slide 3 text

This is what Big Data Looks Like
Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/

Slide 4

Slide 4 text

What is Document Analysis?
"Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."
Source: http://www.text-tech.com/docanalysis/definition.html

Slide 5

Slide 5 text

Document Analysis
•  Common document source formats: RSS, JSON, XML, HTML, PDF, TXT, RTF, Word, PPT, multimedia files, RDBMS records, etc.

Slide 6

Slide 6 text

Document Analysis
•  The goal is to:
   – Extract Entities (people, places, things)
   – Create Associations between entities (in the form of noun-verb-noun), e.g.:
      •  John Doe lives in Washington, D.C.
      •  John Doe is married to Jane Doe
      •  John Doe is a Virgo
      •  John Doe traveled to Mexico on July 6th, 2011
      •  And…
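The noun-verb-noun form above can be sketched as a small data structure. This is only an illustration: the class names `Entity` and `Association`, their fields, and the helper `related_to` are hypothetical and are not taken from Infinit.e.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A person, place, or thing extracted from a document."""
    name: str
    type: str  # e.g. "Person" or "Place" (illustrative labels)


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun triple linking two entities."""
    subject: Entity
    verb: str
    obj: Entity


john = Entity("John Doe", "Person")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "Place")),
    Association(john, "is married to", Entity("Jane Doe", "Person")),
    Association(john, "traveled to", Entity("Mexico", "Place")),
]


def related_to(entity, assocs):
    """Return (verb, object name) pairs for every triple whose subject is `entity`."""
    return [(a.verb, a.obj.name) for a in assocs if a.subject == entity]


facts_about_john = related_to(john, associations)
```

Storing associations as explicit triples makes it straightforward to ask "what do we know about this entity?" by filtering on the subject.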

Slide 7

Slide 7 text

Document Analysis
•  Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
Who: people, organizations, facilities, company
What: events, summaries, facts, themes
When: past, present, future, dates
Where: city, state, country, coordinate

Slide 8

Slide 8 text

The Infinit.e Solution
•  Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood.
github.com/ikanow/Infinit.e

Slide 9

Slide 9 text

The Infinit.e Solution
Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Structured and Unstructured Documents

Slide 10

Slide 10 text

IkanMeow!

Slide 11

Slide 11 text

Document Collection
•  Infinit.e harvests documents from:
   – URLs
   – File Shares
   – Databases

Slide 12

Slide 12 text

Sample RSS Document
…
Mediterranean conference seeks to flourish tourism in Egypt, Tunisia…
http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html
Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most …
Latest Press Releases | Press Release Bureau
unknown
Sat, 21 Apr 2012 00:00:00 GMT
…
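As a rough sketch of how an RSS item like this might be turned into a plain record, the following uses only the Python standard library. The XML below is a hypothetical reconstruction with standard RSS 2.0 element names and a placeholder link; the output keys and the function `parse_items` are chosen for illustration and do not describe Infinit.e's actual harvester.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed; element names follow the RSS 2.0 convention,
# and the link is a placeholder rather than the URL from the slide.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia</title>
      <link>http://example.com/article.html</link>
      <description>Report by egyptlastminute.com CAIRO: On Monday ...</description>
      <pubDate>Sat, 21 Apr 2012 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Convert each <item> element into a flat dictionary."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        published = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title"),
            "url": item.findtext("link"),
            "description": item.findtext("description"),
            # RSS dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(published) if published else None,
        })
    return items


docs = parse_items(SAMPLE_RSS)
```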

Slide 13

Slide 13 text

Full Text Source

Slide 14

Slide 14 text

Source Ingestion Data Flow

Slide 15

Slide 15 text

Document DBs and Collections

Slide 16

Slide 16 text

Document Metadata
•  doc_metadata.metadata
{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    "metadata" : {…},
    "associations" : […],
    "entities" : […],
    ...
}

Slide 17

Slide 17 text

Harvested Document Metadata
•  doc_metadata.metadata.metadata
"metadata" : {
    "location" : [
        {
            "region" : "South Asia",
            "citystateprovince" : {
                "stateprovince" : "Rolpa", "city" : "Newang"
            },
            "country" : "Nepal"
        }
    ],
    "icn" : [ "200573487" ],
    "incidentdate" : [ "07/25/2005" ],
    "organization" : [
        "Communist Party of Nepal (Maoist)/United People's Front"
    ],
    ...
},
Note: It is okay to laugh at this

Slide 18

Slide 18 text

Document Enrichment
•  Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and 3rd party NLP APIs including:

Slide 19

Slide 19 text

Harvested Entities
•  feature.entity
{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
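In the record above, the "index" value looks like the lowercased disambiguated name joined to the lowercased type with a slash. The sketch below reproduces that pattern; it is inferred from this one sample record only, and the function names `entity_index` and `same_entity` are hypothetical rather than part of Infinit.e.

```python
def entity_index(disambiguated_name: str, entity_type: str) -> str:
    """Build a key of the form 'name/type', all lowercase (pattern inferred from the sample)."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def same_entity(aliases):
    """Return True when all aliases differ only by letter case."""
    return len({alias.lower() for alias in aliases}) == 1


aliases = [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali",
]

key = entity_index("Zine El Abidine Ben Ali", "Person")
all_match = same_entity(aliases)
```

A case-insensitive key of this kind lets differently capitalized mentions collapse onto one entity record. Real disambiguation needs more than case folding.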

Slide 20

Slide 20 text

Harvested Entities

Slide 21

Slide 21 text

Harvested Associations
•  feature.association
{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
        "zine el abidine ben ali",
        "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : ["president", "president/position"],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : ["career", "current", "past"],
    "verb_category" : "career"
}

Slide 22

Slide 22 text

Harvested Associations

Slide 23

Slide 23 text

Geolocation of Entities/Events
•  feature.geo
{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {
        "lat" : 30.05,
        "lon" : 31.25
    }
}
Note: MongoDB 2d Index
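A minimal sketch of how the "geoindex" field might be prepared and queried with a MongoDB 2d index. It builds only plain Python structures and does not connect to a database; the helper names are hypothetical. With pymongo, the index specification could be passed to `collection.create_index(...)` and the query to `collection.find(...)`, assuming a running MongoDB instance.

```python
GEO_FIELD = "geoindex"


def make_geoindex(record):
    """Convert the string latitude/longitude fields into the numeric sub-document."""
    return {"lat": float(record["latitude"]), "lon": float(record["longitude"])}


def geo_index_spec():
    """Index specification for a MongoDB 2d index on the geo field."""
    return [(GEO_FIELD, "2d")]


def near_query(lat, lon):
    """Query for documents nearest to a point.

    The coordinate order must match the stored field order, which is
    (lat, lon) in the sample record.
    """
    return {GEO_FIELD: {"$near": [lat, lon]}}


record = {"city": "cairo", "latitude": "30.05", "longitude": "31.25"}
record[GEO_FIELD] = make_geoindex(record)

spec = geo_index_spec()
query = near_query(30.05, 31.25)
```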

Slide 24

Slide 24 text

Geolocation of Entities/Events

Slide 25

Slide 25 text

Who, What, Where and When

Slide 26

Slide 26 text

Why MongoDB? – Reason #1
Document-Oriented Storage
•  MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format

Slide 27

Slide 27 text

Why MongoDB? – Reason #2
JSON
•  The standard language of open document analysis
   – JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines
   – BSON (Binary JSON) is MongoDB’s native data format
   – Infinit.e ingests and exports JSON natively via the REST-based API
Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back
This is the JSON logo
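The JSON-to-object round trip mentioned in the note can be illustrated with an analogous Python sketch. It uses dataclasses and the standard `json` module in place of GSON and POJOs; the class `DocSummary` and its fields are hypothetical and only loosely mirror the document metadata shown earlier.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DocSummary:
    """A typed in-memory object, playing the role a POJO plays in Java."""
    title: str
    url: str
    description: str


def to_json(doc: DocSummary) -> str:
    """Serialize the object to a JSON string."""
    return json.dumps(asdict(doc))


def from_json(text: str) -> DocSummary:
    """Rebuild the object from a JSON string with matching keys."""
    return DocSummary(**json.loads(text))


original = DocSummary(
    title="Mediterranean conference seeks to flourish tourism",
    url="http://example.com/article.html",
    description="Report by egyptlastminute.com",
)
restored = from_json(to_json(original))
```

Keeping a typed object on the application side and JSON on the wire means the REST layer and the storage layer can share one representation.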

Slide 28

Slide 28 text

Why MongoDB? – Reason #3
MongoDB Is Web Scale*
*Shards are the secret ingredients in the web scale sauce. They just work.

Slide 29

Slide 29 text

Why MongoDB? – Reason #3
Scalability
•  Seriously, MongoDB Scales
   – Harvesting and enriching documents requires a lot of disk space
   – MongoDB scales to arbitrary sizes in both read/write dimensions
   – Sophisticated sharding keys provide powerful/flexible balancing
   – BUT building an initial cluster can be complex and managing cluster changes is “fiddly”

Slide 30

Slide 30 text

Why MongoDB? – Reason #4
Integration with Hadoop
•  Hadoop is rapidly becoming the de-facto standard for data analytics
   – Open Source, very customizable
   – Proven scalability
   – Java libraries
•  The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
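To give a feel for the kind of analytics job Hadoop would run over harvested documents, here is a pure-Python sketch of the map and reduce steps for counting entity mentions. It illustrates the MapReduce model only. It does not use Hadoop or the adaptor's API, and the document shape (an "entities" list whose items carry an "index" key) is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical harvested documents; the field layout is assumed for illustration.
docs = [
    {"title": "doc A", "entities": [{"index": "cairo/city"}, {"index": "egypt/country"}]},
    {"title": "doc B", "entities": [{"index": "egypt/country"}]},
]


def map_doc(doc):
    """Map step: emit (entity index, 1) for every entity in a document."""
    for entity in doc.get("entities", []):
        yield entity["index"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = (pair for doc in docs for pair in map_doc(doc))
counts = reduce_counts(pairs)
```

In a real cluster the map calls run in parallel across shards of the input and the framework groups keys before the reduce step.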

Slide 31

Slide 31 text

Tweeting about MongoDC
•  Source: http://search.twitter.com/search.rss?q=mongodc
   – Who’s Tweeting?
   – What are they Tweeting?
   – What does basic document analysis of these Tweets tell us?

Slide 32

Slide 32 text

Who’s Tweeting about MongoDC?

Slide 33

Slide 33 text

How are Tweeters Connected?

Slide 34

Slide 34 text

What are they Tweeting About?

Slide 35

Slide 35 text

Sentiment?

Slide 36

Slide 36 text

Twitter has its Limits…

Slide 37

Slide 37 text

Thank You!
Craig Vitter
www.ikanow.com
[email protected]