
MongoDB Sao Paulo 2012: Open Library on MongoDB

July 13, 2012

Luciano Ramalho, Principal, Oficinas Turing
Open Library is an Internet Archive project whose goal is to compile the most complete catalog of books in the world. It currently holds more than 117 million bibliographic records, including a change history that reaches 18 versions in some cases. This presentation shows how the OL dataset was converted to MongoDB, and how Map/Reduce and the new Aggregation Framework were used to analyze the records and inform a deep refactoring of the schema, making it a better fit for MongoDB's performance characteristics and features.


Transcript

  1. Open Library on MongoDB: Leveraging Map/Reduce and the Aggregation Framework for Data Analysis and Schema Design
     Luciano Ramalho • [email protected] • @ramalhoorg
  2. Topics
     • About the Open Library project
     • Converting and importing the OL dataset
     • Data analysis with the Aggregation Framework
     • Data analysis with Map/Reduce
     • Refactoring the schema design for MongoDB
     • Wrapping up
  3. About the Open Library
     • Mission: “One web page for every book”
     • A project by the Internet Archive
     • 117,439,126 bibliographic records as of June 2012
     • More than 1,000,000 free e-books for download
  4. Open Library Technology
     • Infobase: a semi-structured database abstraction built on top of normalized PostgreSQL tables (a.k.a. ThingDB)
     • Record versioning support
     • Many joins to retrieve one conceptual entity (see the sketch below)
     • Heavily dependent on SOLR/Lucene for production
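
     To make the cost of that design concrete: a ThingDB-style store keeps one row per property, so a single conceptual record has to be reassembled from many rows. The table layout and names below are illustrative assumptions, not ThingDB's actual schema, and the fold is shown in plain Python in place of the SQL joins.

        from collections import defaultdict

        # Hypothetical property rows, one per (thing_id, key, value); in
        # PostgreSQL these live in normalized tables and must be joined
        # back together to answer "show me this book".
        rows = [
            (123, "title", "Brave New World"),
            (123, "authors", "/authors/OL1A"),
            (123, "authors", "/authors/OL2A"),
            (123, "publish_date", "1932"),
        ]

        def assemble(thing_id, rows):
            """Fold the property rows of one thing into a single record."""
            record = defaultdict(list)
            for tid, key, value in rows:
                if tid == thing_id:
                    record[key].append(value)
            # Collapse single-valued keys so the result reads like one document.
            return {k: v[0] if len(v) == 1 else v for k, v in record.items()}

        print(assemble(123, rows))
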
  5. Semi-structured data model
     • Theoretical basis
     • Literature
     • Notation
     • Normalizing without the First Normal Form (N1NF), as in the example below
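
     A concrete N1NF illustration (the record shape is assumed for the example, not the exact Open Library schema): the repeating group stays nested inside the record instead of being split into a second relation as First Normal Form would demand, and the nested form maps one-to-one onto a MongoDB document.

        # N1NF: the repeating group ("authors") stays nested in the record,
        # which maps directly onto a single MongoDB document.
        edition_n1nf = {
            "_id": "/books/OL123M",
            "title": "Brave New World",
            "authors": [{"key": "/authors/OL1A"}, {"key": "/authors/OL2A"}],
        }

        # 1NF: the same data flattened into atomic rows across two relations.
        editions_1nf = [("/books/OL123M", "Brave New World")]
        edition_authors_1nf = [
            ("/books/OL123M", "/authors/OL1A"),
            ("/books/OL123M", "/authors/OL2A"),
        ]
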
  6. Handling the dataset
     • 118,598,056 lines (as of June 1st, 2012)
     • 91 GB uncompressed
     • 32 different types of records (see the counting sketch below)
     • 1,158,930 (~1%) are not bibliographic records
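
     The published dumps are tab-separated lines whose first column is the record type (an assumption about the dump layout); a small script along these lines yields the per-type tally:

        import sys
        from collections import Counter

        def count_types(dump_path):
            """Tally the first (type) column of a tab-separated dump file."""
            counts = Counter()
            with open(dump_path, encoding="utf-8") as fp:
                for line in fp:
                    counts[line.split("\t", 1)[0]] += 1
            return counts

        if __name__ == "__main__":
            for type_name, n in count_types(sys.argv[1]).most_common():
                print(f"{n:>12,}  {type_name}")
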
  7. Converting to load
     • The simplest conversion possible
     • Minimal changes to original schema
     • Choosing a primary key for the _id field (see the converter sketch below)
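
     A minimal converter in that spirit, assuming the same tab-separated layout as above and reusing the Open Library key (e.g. "/books/OL1M") as the natural _id; it emits one JSON document per line, which mongoimport can load directly:

        import json
        import sys

        def convert(dump_in, json_out):
            """Rewrite dump lines as one JSON document per line.

            Assumes five tab-separated columns (type, key, revision,
            last_modified, json); the OL key becomes the _id.
            """
            for line in dump_in:
                _type, key, _rev, _mod, raw = line.rstrip("\n").split("\t", 4)
                doc = json.loads(raw)
                doc["_id"] = key  # minimal change: reuse the OL key as primary key
                json_out.write(json.dumps(doc) + "\n")

        if __name__ == "__main__":
            convert(sys.stdin, sys.stdout)
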
  8. Using mongoimport
     • Scripts used
     • Monitoring import performance
     • Monitoring database size increase (see the polling sketch below)
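
     The deck names the scripts without showing them; as one plausible sketch of the monitoring side, a loop can poll MongoDB's dbStats command while mongoimport runs in another process (database name is a placeholder):

        import time
        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed local mongod and db name

        # Poll dbStats while mongoimport runs elsewhere, to watch object
        # counts and data size grow during the load.
        while True:
            stats = db.command("dbStats")
            size_gib = stats["dataSize"] / 2**30
            print(f"objects={stats['objects']:,}  dataSize={size_gib:.2f} GiB")
            time.sleep(60)
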
  9. Deep data analysis
     • Stonebraker’s “schema after”
     • Schema analysis
     • Statistics about record structure (see the sketches below)
     • Statistics about embedded data types
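
     Minimal sketches of the two analysis styles the talk relied on, with assumed database, collection, and field names; note that the mapReduce command has since been deprecated in MongoDB:

        from bson.code import Code
        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        # Map/Reduce: how often does each top-level field occur?
        result = db.command({
            "mapReduce": "editions",  # assumed collection name
            "map": Code("function () { for (var k in this) { emit(k, 1); } }"),
            "reduce": Code("function (key, values) { return Array.sum(values); }"),
            "out": {"inline": 1},
        })
        for pair in sorted(result["results"], key=lambda r: -r["value"]):
            print(f"{pair['value']:12,.0f}  {pair['_id']}")

        # Aggregation Framework: distribution of revision counts, assuming
        # each record carries a numeric "revision" field.
        pipeline = [
            {"$group": {"_id": "$revision", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
        ]
        for row in db.editions.aggregate(pipeline):
            print(row)
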
  10. Verifying the Referential Integrity of the Dataset
     • Schema additions
     • Indexes
     • Scripts (one possible shape is sketched below)
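
     One plausible shape for such a script, assuming the Open Library convention of edition records referencing authors by key: scan the editions and report references that resolve to no author document. The _id lookups ride on the built-in primary-key index, which is what makes the check feasible at this scale:

        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        def dangling_author_refs():
            """Yield (edition _id, author key) pairs whose author is missing."""
            editions = db.editions.find({"authors": {"$exists": True}},
                                        {"authors": 1})
            for edition in editions:
                for ref in edition["authors"]:
                    key = ref.get("key")
                    if key and db.authors.find_one({"_id": key}, {"_id": 1}) is None:
                        yield edition["_id"], key

        for edition_id, key in dangling_author_refs():
            print(f"missing author {key} referenced by {edition_id}")
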
  11. Refactoring the schema
     • Primary key for the _id field
     • Choosing related documents to embed (see the example below)
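
     A sketch of how the embedding choice can play out (collection and field names are assumptions): copy each author's name next to the reference, so rendering an edition takes a single read, at the price of refreshing the copies whenever an author record changes:

        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        # Embed the author's name beside each reference: the classic
        # MongoDB trade-off of faster reads for duplicated data.
        for edition in db.editions.find({"authors": {"$exists": True}}):
            embedded = []
            for ref in edition["authors"]:
                author = db.authors.find_one({"_id": ref["key"]}, {"name": 1})
                name = author.get("name") if author else None
                embedded.append({"key": ref["key"], "name": name})
            db.editions.update_one({"_id": edition["_id"]},
                                   {"$set": {"authors": embedded}})
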
  12. Representing versions
     • Record versioning
     • Statistics from the dataset
     • Alternative representations (two common ones are sketched below)
     • Chosen representation
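
     The transcript does not spell the alternatives out; two common representations, shown here as illustrative shapes rather than the project's actual choice, are embedding the history inside the record versus moving past revisions to a side collection. With histories reaching 18 versions in this dataset, embedding can bloat the most-read documents:

        # Alternative A: embed the full history in the record. Simple
        # reads, but long histories grow the hot document.
        edition_embedded = {
            "_id": "/books/OL123M",
            "revision": 3,
            "title": "Brave New World",
            "history": [
                {"revision": 1, "title": "Brave new world"},
                {"revision": 2, "title": "Brave New World"},
            ],
        }

        # Alternative B: keep only the current version in the main
        # collection; park old revisions in a side collection keyed by
        # (record key, revision).
        edition_current = {"_id": "/books/OL123M", "revision": 3,
                           "title": "Brave New World"}
        edition_versions = [
            {"_id": "/books/OL123M@1", "key": "/books/OL123M", "revision": 1},
            {"_id": "/books/OL123M@2", "key": "/books/OL123M", "revision": 2},
        ]
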
  13. Referential Integrity
     • Maintaining referential integrity going forward
     • Support tools:
       • Schema additions
       • Indexes (see the example below)
       • Application framework support
       • Asynchronous tasks
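
     For example, indexing the referencing field keeps both the integrity checks and any asynchronous repair tasks cheap (collection and field names assumed as before):

        from pymongo import MongoClient

        db = MongoClient()["openlibrary"]  # assumed database name

        # Index the referencing field so background tasks can find every
        # edition pointing at a given author quickly.
        db.editions.create_index("authors.key")

        # A query such a task might run after an author merge:
        for edition in db.editions.find({"authors.key": "/authors/OL2A"}):
            print(edition["_id"])
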