Slide 1

Slide 1 text

Leveraging Map/Reduce and the Aggregation Framework for Data Analysis and Schema Design
Luciano Ramalho
[email protected]
@ramalhoorg
Open Library on MongoDB
Tuesday, July 10, 12

Slide 2

Slide 2 text

Topics
• About the Open Library project
• Converting and importing the OLC dataset
• Data analysis with the Aggregation Framework
• Data analysis with Map/Reduce
• Refactoring the schema design for MongoDB
• Wrapping up

Slide 3

Slide 3 text

About the Open Library Project

Slide 4

Slide 4 text

About the Open Library
• Mission: “One web page for every book”
• A project by the Internet Archive
• 117,439,126 bibliographic records as of June 2012
• More than 1,000,000 free e-books for download

Slide 5

Slide 5 text

Open Library Technology
• Infobase: a semi-structured database abstraction built on top of normalized PostgreSQL tables
• a.k.a. ThingDB
• Record versioning support
• Many joins to retrieve one conceptual entity
• Heavily dependent on Solr/Lucene in production

Slide 6

Slide 6 text

Semi-structured data model
• Theoretical basis
• Literature
• Notation
• Normalizing without the First Normal Form (N1NF)

Slide 7

Slide 7 text

Converting and Importing the OLC Dataset

Slide 8

Slide 8 text

Handling the dataset
• 118,598,056 lines (as of June 1st, 2012)
• 91 GB uncompressed
• 32 different types of records
• 1,158,930 (~1%) are not bibliographic records
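The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the two counts quoted above (the variable names are illustrative). The difference matches the 117,439,126 bibliographic records quoted for the Open Library earlier in the deck.

```python
# Counts quoted on the slide (dump of June 1st, 2012).
total_lines = 118_598_056
non_bibliographic = 1_158_930

# Everything that is not a non-bibliographic record is a bibliographic one.
bibliographic = total_lines - non_bibliographic

# Share of non-bibliographic records, as a percentage of all lines.
share = 100 * non_bibliographic / total_lines

print(f"bibliographic records: {bibliographic:,}")
print(f"non-bibliographic share: {share:.2f}%")
```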

Slide 9

Slide 9 text

Converting to load
• The simplest conversion possible
• Minimal changes to original schema
• Choosing a primary key for the _id field
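A minimal sketch of such a conversion, not the scripts actually used for the talk. It assumes each dump line is tab-separated as type, key, revision, last-modified timestamp, and a JSON record; the slide does not state this layout. The function name and the sample record are hypothetical. The record is left untouched apart from reusing its key as `_id`, which is one possible choice of primary key.

```python
import json

def dump_line_to_doc(line):
    """Turn one tab-separated dump line into a dict with an _id field.

    Assumed layout (not stated on the slide):
    type, key, revision, last_modified, JSON record.
    """
    _type, key, _revision, _modified, payload = line.rstrip("\n").split("\t", 4)
    doc = json.loads(payload)
    doc["_id"] = key  # assumption: the record key serves as primary key
    return doc

# Hypothetical sample line, built here so the example is self-contained.
record = {"key": "/authors/OL1A", "name": "Example Author"}
sample = "\t".join(
    ["/type/author", "/authors/OL1A", "1", "2012-06-01T00:00:00", json.dumps(record)]
)

doc = dump_line_to_doc(sample)
print(json.dumps(doc, sort_keys=True))
```

Writing one such JSON document per line gives a file in the format that mongoimport reads by default.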

Slide 10

Slide 10 text

Using mongoimport
• Scripts used
• Monitoring import performance
• Monitoring database size increase

Slide 11

Slide 11 text

Data Analysis with the Aggregation Framework

Slide 12

Slide 12 text

Aggregation Framework Basics

Slide 13

Slide 13 text

Indexing for data analysis

Slide 14

Slide 14 text

Example 1

Slide 15

Slide 15 text

Example 2

Slide 16

Slide 16 text

Limitations of the Aggregation Framework

Slide 17

Slide 17 text

Data Analysis with Map/Reduce

Slide 18

Slide 18 text

Map/Reduce Basics

Slide 19

Slide 19 text

Deep data analysis
• Stonebraker’s “schema after”
• Schema analysis
• Statistics about record structure
• Statistics about embedded data types
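The idea behind collecting structure and type statistics can be sketched in plain Python, without a database. This is an in-process emulation of the map and reduce steps, not the code from the talk: MongoDB's own map/reduce takes JavaScript functions. The function names and the three sample documents are hypothetical.

```python
from collections import Counter

def map_fields(doc):
    """Map step: emit one ((field, type name), 1) pair per top-level field."""
    for field, value in doc.items():
        yield (field, type(value).__name__), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

# Hypothetical documents standing in for imported records.
docs = [
    {"_id": "/books/OL1M", "title": "A", "authors": [{"key": "/authors/OL1A"}]},
    {"_id": "/books/OL2M", "title": "B", "number_of_pages": 120},
    {"_id": "/authors/OL1A", "name": "Example Author"},
]

stats = reduce_counts(pair for doc in docs for pair in map_fields(doc))
for (field, type_name), count in sorted(stats.items()):
    print(field, type_name, count)
```

Keying on the pair (field, type) answers both questions at once: how often a field occurs, and which data types it holds across the dataset.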

Slide 20

Slide 20 text

Verifying the Referential Integrity of the Dataset
• Schema additions
• Indexes
• Scripts
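As an illustration of what such a verification script might check, the sketch below looks for references that point at records missing from the dataset. It assumes that documents refer to authors through an `authors` list of `{"key": ...}` entries and that `_id` holds the record key. Both are assumptions made for this example, and the sample data is hypothetical.

```python
def find_dangling_references(docs):
    """Return sorted (referencing _id, missing key) pairs."""
    known_ids = {doc["_id"] for doc in docs}
    dangling = []
    for doc in docs:
        # Assumed reference field: a list of {"key": <target _id>} entries.
        for ref in doc.get("authors", []):
            if ref["key"] not in known_ids:
                dangling.append((doc["_id"], ref["key"]))
    return sorted(dangling)

# Hypothetical records: the second book points at an author that is absent.
docs = [
    {"_id": "/authors/OL1A", "name": "Example Author"},
    {"_id": "/books/OL1M", "authors": [{"key": "/authors/OL1A"}]},
    {"_id": "/books/OL2M", "authors": [{"key": "/authors/OL9A"}]},
]

dangling = find_dangling_references(docs)
print(dangling)
```

Holding every key in an in-memory set only works for small samples. At the scale of the full dataset, the lookup would presumably rely on an index in the database instead.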

Slide 21

Slide 21 text

Example 3

Slide 22

Slide 22 text

Example 4

Slide 23

Slide 23 text

Refactoring the schema design for MongoDB

Slide 24

Slide 24 text

Refactoring the schema
• Primary key for the _id field
• Choosing related documents to embed

Slide 25

Slide 25 text

Representing versions
• Record versioning
• Statistics from the dataset
• Alternative representations
• Chosen representation

Slide 26

Slide 26 text

Referential Integrity
• Maintaining the referential integrity going forward
• Support tools:
  • Schema additions
  • Indexes
  • Application Framework Support
  • Asynchronous tasks

Slide 27

Slide 27 text

Sharding
• Sharding basics
• Criteria for sharding
• Setting up sharding

Slide 28

Slide 28 text

Incremental conversion
• Record migration on demand
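One way to read "migration on demand" is that a record is converted to the new schema the first time it is fetched, then stored back in its new form. The sketch below illustrates that pattern with a plain dict standing in for the collection. The `schema_version` field, the `author_keys` field, and the conversion itself are hypothetical and not taken from the talk.

```python
CURRENT_VERSION = 2  # hypothetical version number of the refactored schema

def migrate(doc):
    """Hypothetical conversion: flatten author references into a key list."""
    new_doc = dict(doc)
    new_doc["author_keys"] = [ref["key"] for ref in new_doc.pop("authors", [])]
    new_doc["schema_version"] = CURRENT_VERSION
    return new_doc

def fetch(store, doc_id):
    """Return a record, converting and saving it first if it is outdated."""
    doc = store[doc_id]
    # Records without a version marker are treated as version 1.
    if doc.get("schema_version", 1) < CURRENT_VERSION:
        doc = migrate(doc)
        store[doc_id] = doc
    return doc

# A dict stands in for the collection in this sketch.
store = {
    "/books/OL1M": {
        "_id": "/books/OL1M",
        "title": "A",
        "authors": [{"key": "/authors/OL1A"}],
    }
}

first = fetch(store, "/books/OL1M")   # migrated on first access
second = fetch(store, "/books/OL1M")  # already current, returned as stored
print(first)
```

Records that are never read stay in the old format, so the conversion cost is spread over time instead of being paid in one batch.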

Slide 29

Slide 29 text

Wrapping Up

Slide 30

Slide 30 text

Conclusion

Slide 31

Slide 31 text

Next Steps

Slide 32

Slide 32 text

Q & A
• Open Library mailing lists
• ...
• My e-mail: [email protected]