
MongoDB Sao Paulo 2012: Open Library on MongoDB

July 13, 2012

Luciano Ramalho, Principal, Oficinas Turing
Open Library is an Internet Archive project whose goal is to compile the most complete catalog of books in the world. It currently holds more than 117 million bibliographic records, including a change history that reaches 18 versions in some cases. This presentation shows how the OL dataset was converted to MongoDB, and how Map/Reduce and the new Aggregation Framework were used to analyze the records and inform a deep refactoring of the schema, making it a better fit for MongoDB's performance characteristics and features.


Transcript

  1. Open Library on MongoDB: Leveraging Map/Reduce and the Aggregation Framework for Data Analysis and Schema Design
     Luciano Ramalho • [email protected] • @ramalhoorg
  2. Topics
     • About the Open Library project
     • Converting and importing the OL dataset
     • Data analysis with the Aggregation Framework
     • Data analysis with Map/Reduce
     • Refactoring the schema design for MongoDB
     • Wrapping up
  3. About the Open Library
     • Mission: “One web page for every book”
     • A project by the Internet Archive
     • 117,439,126 bibliographic records as of June 2012
     • More than 1,000,000 free e-books for download
  4. Open Library Technology
     • Infobase: a semi-structured database abstraction built on top of normalized PostgreSQL tables (a.k.a. ThingDB)
     • Record versioning support
     • Many joins to retrieve one conceptual entity (see the sketch below)
     • Heavily dependent on SOLR/Lucene for production
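
     To make the cost of that design concrete: a ThingDB-style store keeps one row per property, so a single conceptual record has to be reassembled from many rows. The table layout and names below are illustrative assumptions, not ThingDB's actual schema, and the fold is shown in plain Python in place of the SQL joins.

        from collections import defaultdict

        # Hypothetical property rows, one per (thing_id, key, value); in
        # PostgreSQL these live in normalized tables and must be joined
        # back together to answer "show me this book".
        rows = [
            (123, "title", "Brave New World"),
            (123, "authors", "/authors/OL1A"),
            (123, "authors", "/authors/OL2A"),
            (123, "publish_date", "1932"),
        ]

        def assemble(thing_id, rows):
            """Fold the property rows of one thing into a single record."""
            record = defaultdict(list)
            for tid, key, value in rows:
                if tid == thing_id:
                    record[key].append(value)
            # Collapse single-valued keys so the result reads like one document.
            return {k: v[0] if len(v) == 1 else v for k, v in record.items()}

        print(assemble(123, rows))
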
  5. Semi-structured data model
     • Theoretical basis
     • Literature
     • Notation
     • Normalizing without the First Normal Form (N1NF), as in the example below
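
     A concrete N1NF illustration (the record shape is assumed for the example, not the exact Open Library schema): the repeating group stays nested inside the record instead of being split into a second relation as First Normal Form would demand, and the nested form maps one-to-one onto a MongoDB document.

        # N1NF: the repeating group ("authors") stays nested in the record,
        # which maps directly onto a single MongoDB document.
        edition_n1nf = {
            "_id": "/books/OL123M",
            "title": "Brave New World",
            "authors": [{"key": "/authors/OL1A"}, {"key": "/authors/OL2A"}],
        }

        # 1NF: the same data flattened into atomic rows across two relations.
        editions_1nf = [("/books/OL123M", "Brave New World")]
        edition_authors_1nf = [
            ("/books/OL123M", "/authors/OL1A"),
            ("/books/OL123M", "/authors/OL2A"),
        ]
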
  6. Handling the dataset
     • 118,598,056 lines (as of June 1st, 2012)
     • 91 GB uncompressed
     • 32 different types of records (see the counting sketch below)
     • 1,158,930 (~1%) are not bibliographic records
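
     The published dumps are tab-separated lines whose first column is the record type (an assumption about the dump layout); a small script along these lines yields the per-type tally:

        import sys
        from collections import Counter

        def count_types(dump_path):
            """Tally the first (type) column of a tab-separated dump file."""
            counts = Counter()
            with open(dump_path, encoding="utf-8") as fp:
                for line in fp:
                    counts[line.split("\t", 1)[0]] += 1
            return counts

        if __name__ == "__main__":
            for type_name, n in count_types(sys.argv[1]).most_common():
                print(f"{n:>12,}  {type_name}")
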
  7. Converting to load
     • The simplest conversion possible
     • Minimal changes to original schema
     • Choosing a primary key for the _id field (see the converter sketch below)
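
     A minimal converter in that spirit, assuming the same tab-separated layout as above and reusing the Open Library key (e.g. "/books/OL1M") as the natural _id; it emits one JSON document per line, which mongoimport can load directly:

        import json
        import sys

        def convert(dump_in, json_out):
            """Rewrite dump lines as one JSON document per line.

            Assumes five tab-separated columns (type, key, revision,
            last_modified, json); the OL key becomes the _id.
            """
            for line in dump_in:
                _type, key, _rev, _mod, raw = line.rstrip("\n").split("\t", 4)
                doc = json.loads(raw)
                doc["_id"] = key  # minimal change: reuse the OL key as primary key
                json_out.write(json.dumps(doc) + "\n")

        if __name__ == "__main__":
            convert(sys.stdin, sys.stdout)
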
  8. Using mongoimport
     • Scripts used
     • Monitoring import performance
     • Monitoring database size increase (see the polling sketch below)
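
     The deck names the scripts without showing them; as one plausible sketch of the monitoring side, a loop can poll MongoDB's dbStats command while mongoimport runs in another process (database name is a placeholder):

        import time
        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed local mongod and db name

        # Poll dbStats while mongoimport runs elsewhere, to watch object
        # counts and data size grow during the load.
        while True:
            stats = db.command("dbStats")
            size_gib = stats["dataSize"] / 2**30
            print(f"objects={stats['objects']:,}  dataSize={size_gib:.2f} GiB")
            time.sleep(60)
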
  9. Deep data analysis
     • Stonebraker’s “schema after”
     • Schema analysis
     • Statistics about record structure (see the sketches below)
     • Statistics about embedded data types
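
     Minimal sketches of the two analysis styles the talk relied on, with assumed database, collection, and field names; note that the mapReduce command has since been deprecated in MongoDB:

        from bson.code import Code
        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        # Map/Reduce: how often does each top-level field occur?
        result = db.command({
            "mapReduce": "editions",  # assumed collection name
            "map": Code("function () { for (var k in this) { emit(k, 1); } }"),
            "reduce": Code("function (key, values) { return Array.sum(values); }"),
            "out": {"inline": 1},
        })
        for pair in sorted(result["results"], key=lambda r: -r["value"]):
            print(f"{pair['value']:12,.0f}  {pair['_id']}")

        # Aggregation Framework: distribution of revision counts, assuming
        # each record carries a numeric "revision" field.
        pipeline = [
            {"$group": {"_id": "$revision", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
        ]
        for row in db.editions.aggregate(pipeline):
            print(row)
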
  10. Verifying the Referential Integrity of the Dataset
     • Schema additions
     • Indexes
     • Scripts (one possible shape is sketched below)
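
     One plausible shape for such a script, assuming the Open Library convention of edition records referencing authors by key: scan the editions and report references that resolve to no author document. The _id lookups ride on the built-in primary-key index, which is what makes the check feasible at this scale:

        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        def dangling_author_refs():
            """Yield (edition _id, author key) pairs whose author is missing."""
            editions = db.editions.find({"authors": {"$exists": True}},
                                        {"authors": 1})
            for edition in editions:
                for ref in edition["authors"]:
                    key = ref.get("key")
                    if key and db.authors.find_one({"_id": key}, {"_id": 1}) is None:
                        yield edition["_id"], key

        for edition_id, key in dangling_author_refs():
            print(f"missing author {key} referenced by {edition_id}")
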
  11. Refactoring the schema
     • Primary key for the _id field
     • Choosing related documents to embed (see the example below)
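
     A sketch of how the embedding choice can play out (collection and field names are assumptions): copy each author's name next to the reference, so rendering an edition takes a single read, at the price of refreshing the copies whenever an author record changes:

        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        # Embed the author's name beside each reference: the classic
        # MongoDB trade-off of faster reads for duplicated data.
        for edition in db.editions.find({"authors": {"$exists": True}}):
            embedded = []
            for ref in edition["authors"]:
                author = db.authors.find_one({"_id": ref["key"]}, {"name": 1})
                name = author.get("name") if author else None
                embedded.append({"key": ref["key"], "name": name})
            db.editions.update_one({"_id": edition["_id"]},
                                   {"$set": {"authors": embedded}})
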
  12. Representing versions
     • Record versioning
     • Statistics from the dataset
     • Alternative representations (two common ones are sketched below)
     • Chosen representation
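
     The transcript does not spell the alternatives out; two common representations, shown here as illustrative shapes rather than the project's actual choice, are embedding the history inside the record versus moving past revisions to a side collection. With histories reaching 18 versions in this dataset, embedding can bloat the most-read documents:

        # Alternative A: embed the full history in the record. Simple
        # reads, but long histories grow the hot document.
        edition_embedded = {
            "_id": "/books/OL123M",
            "revision": 3,
            "title": "Brave New World",
            "history": [
                {"revision": 1, "title": "Brave new world"},
                {"revision": 2, "title": "Brave New World"},
            ],
        }

        # Alternative B: keep only the current version in the main
        # collection; park old revisions in a side collection keyed by
        # (record key, revision).
        edition_current = {"_id": "/books/OL123M", "revision": 3,
                           "title": "Brave New World"}
        edition_versions = [
            {"_id": "/books/OL123M@1", "key": "/books/OL123M", "revision": 1},
            {"_id": "/books/OL123M@2", "key": "/books/OL123M", "revision": 2},
        ]
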
  13. Referential Integrity
     • Maintaining referential integrity going forward
     • Support tools:
       • Schema additions
       • Indexes (see the example below)
       • Application framework support
       • Asynchronous tasks
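
     For example, indexing the referencing field keeps both the integrity checks and any asynchronous repair tasks cheap (collection and field names assumed as before):

        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        # Index the referencing field so background tasks can find every
        # edition pointing at a given author quickly.
        db.editions.create_index("authors.key")

        # A query such a task might run after an author merge:
        for edition in db.editions.find({"authors.key": "/authors/OL2A"}):
            print(edition["_id"])
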