Analytic data report with MongoDB

Li Jia Li
November 01, 2015

A short talk about a pre-aggregated schema for analytic reports.
Presented at the MongoDB & Laravel Meetup, Yangon.

Transcript

  1. How would you design the data schema?

    ‣ No need to retain transactional event data in MongoDB.
    ‣ You require up-to-the-minute data, or up-to-the-second if possible.
    ‣ Queries for ranges of data (by time) must be as fast as possible.
  2. Solution

    ‣ Use a pre-aggregated schema built with upserts and increment operations.
    ‣ This will allow you to:
      - calculate statistics,
      - produce simple range-based queries, and
      - generate filters to support time-series charts of aggregated data.
  3. Schema: One Document Per Page Per Day

        {
          _id: "20101010/site-1/apache_pb.gif",
          metadata: {
            date: ISODate("2010-10-10T00:00:00Z"),
            site: "site-1",
            page: "/apache_pb.gif" },
          daily: 5468426,
          hourly: { "0": 227850, "1": 210231, ... "23": 20457 },
          minute: { "0": 3612, "1": 3241, ... "1439": 2819 }
        }

    Advantages
    ‣ For every request on the website, you only need to update one document.
    ‣ Reports for time periods within the day, for a single page, require fetching a single document.
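The "update one document per request" idea can be sketched as a helper that builds the upsert filter and the `$inc` update for a single hit. This is a minimal sketch, not the deck's own code: the function name `build_hit_update` is hypothetical, the `_id` layout is inferred from the sample document, and timestamps are assumed to be in UTC.

```python
from datetime import datetime, timezone


def build_hit_update(site, page, ts):
    """Build (query, update) for one page hit against the flat daily schema.

    Assumption: `ts` is a UTC datetime and `page` starts with "/".
    """
    day = datetime(ts.year, ts.month, ts.day, tzinfo=timezone.utc)
    # _id layout inferred from the sample document: YYYYMMDD/site/page
    doc_id = "{}/{}{}".format(day.strftime("%Y%m%d"), site, page)
    minute_of_day = ts.hour * 60 + ts.minute

    query = {
        "_id": doc_id,
        "metadata": {"date": day, "site": site, "page": page},
    }
    update = {
        "$inc": {
            "daily": 1,
            "hourly.%d" % ts.hour: 1,
            "minute.%d" % minute_of_day: 1,
        }
    }
    return query, update


query, update = build_hit_update(
    "site-1", "/apache_pb.gif",
    datetime(2010, 10, 10, 13, 5, tzinfo=timezone.utc),
)
print(query["_id"])
print(update)
# With PyMongo and a running server (not exercised here):
# db.stats.daily.update_one(query, update, upsert=True)
```

Because the filter contains only equality conditions, an upsert that finds no match would create the document with the same `_id` and `metadata`, so the first hit of the day needs no separate insert.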
  4. Pre-allocate Documents

    ‣ Initialize all documents with 0 values in all fields. After creation, documents will never grow.
    ‣ There will be no need to migrate documents within the data store.
    ‣ MongoDB will not add padding to the records, which leads to a more compact data representation and better use of memory.
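As an illustration of pre-allocation, the sketch below builds a zero-filled daily document for the flat schema. The helper name `build_preallocated_daily` is hypothetical, and how and when the document gets written (for example, ahead of the day it covers) is an assumption left to the caller.

```python
from datetime import datetime, timezone


def build_preallocated_daily(site, page, day):
    """Return a daily document with every counter present and set to 0.

    Assumption: `day` is a UTC datetime at midnight and `page` starts with "/".
    """
    return {
        "_id": "{}/{}{}".format(day.strftime("%Y%m%d"), site, page),
        "metadata": {"date": day, "site": site, "page": page},
        "daily": 0,
        # Keys are created in ascending order, matching the sample schema.
        "hourly": {str(h): 0 for h in range(24)},
        "minute": {str(m): 0 for m in range(24 * 60)},
    }


doc = build_preallocated_daily(
    "site-1", "/apache_pb.gif", datetime(2010, 10, 11, tzinfo=timezone.utc)
)
print(doc["_id"], len(doc["hourly"]), len(doc["minute"]))
# With PyMongo and a running server (not exercised here):
# db.stats.daily.insert_one(doc)
```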
  5. Add Intra-Document Hierarchy

    MongoDB stores BSON documents as a sequence of fields and values, not as a hash table. As a result, writing to the field minute.0 is considerably faster than writing to minute.1439. In order to update the value in minute #1439, MongoDB must skip over all 1439 entries before it.
  6. Split the minute field up into 24 hour fields

        {
          _id: "20101010/site-1/apache_pb.gif",
          metadata: {
            date: ISODate("2010-10-10T00:00:00Z"),
            site: "site-1",
            page: "/apache_pb.gif" },
          daily: 5468426,
          hourly: { "0": 227850, "1": 210231, ... "23": 20457 },
          minute: {
            "0": { "0": 3612, "1": 3241, ... "59": 2130 },
            "1": { "60": ... },
            ...
            "23": { ... "1439": 2819 }
          }
        }

    To update the value in minute #1439, MongoDB first skips the first 23 hours and then skips 59 minutes, for only 82 skips as opposed to 1439 skips in the previous schema.
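The skip arithmetic can be checked with a few lines of Python. This is a simplified model of the counting used on the slide (one skip per preceding sibling field), not a measurement of MongoDB internals; the helper names are hypothetical.

```python
def skips_flat(minute_of_day):
    """Fields skipped before reaching a minute in the flat `minute` map."""
    return minute_of_day


def skips_nested(minute_of_day):
    """Fields skipped when minutes are grouped under 24 hour sub-documents."""
    hour, minute_in_hour = divmod(minute_of_day, 60)
    return hour + minute_in_hour


def nested_minute_path(minute_of_day):
    """Dotted field path in the nested layout, e.g. for use in a $inc update.

    As in the sample document, inner keys keep the minute-of-day number.
    """
    return "minute.%d.%d" % (minute_of_day // 60, minute_of_day)


for m in (0, 1349, 1439):
    print(m, skips_flat(m), skips_nested(m), nested_minute_path(m))
```

For the last minute of the day (1439) this gives 23 + 59 = 82 skips instead of 1439; for minute 1349 it gives 22 + 29 = 51.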
  7. Separate Documents by Granularity Level

    Daily statistics: the schema in the previous slide.

    Monthly statistics:

        {
          _id: "201010/site-1/apache_pb.gif",
          metadata: {
            date: ISODate("2010-10-01T00:00:00Z"),
            site: "site-1",
            page: "/apache_pb.gif" },
          daily: { "1": 5445326, "2": 5214121, ... }
        }
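With two granularity levels, each hit would also increment the matching day counter in the monthly document. A minimal sketch under the same assumptions as the sample documents (UTC timestamps, `_id` of the form YYYYMM/site/page, day-of-month keys without zero padding); `build_monthly_update` is a hypothetical name.

```python
from datetime import datetime, timezone


def build_monthly_update(site, page, ts):
    """Build (query, update) for one page hit against the monthly document."""
    month_start = datetime(ts.year, ts.month, 1, tzinfo=timezone.utc)
    doc_id = "{}/{}{}".format(month_start.strftime("%Y%m"), site, page)
    query = {
        "_id": doc_id,
        "metadata": {"date": month_start, "site": site, "page": page},
    }
    # Keys in `daily` are the day of the month: "1", "2", ...
    update = {"$inc": {"daily.%d" % ts.day: 1}}
    return query, update


query, update = build_monthly_update(
    "site-1", "/apache_pb.gif",
    datetime(2010, 10, 10, 13, 5, tzinfo=timezone.utc),
)
print(query["_id"], update)
# With PyMongo and a running server (not exercised here):
# db.stats.monthly.update_one(query, update, upsert=True)
```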
  8. Retrieving Data for a Real-Time Chart

    Retrieve the number of hits to a specific resource (e.g. /index.html) with minute-level granularity:

        db.stats.daily.find_one(
            {'metadata': {'date': dt, 'site': 'site-1', 'page': '/index.html'}},
            {'minute': 1})

    Retrieve the number of hits to a specific resource with hour-level granularity:

        db.stats.daily.find_one(
            {'metadata': {'date': dt, 'site': 'site-1', 'page': '/index.html'}},
            {'hourly': 1})

    A few days of hourly data:

        db.stats.daily.find(
            {
                'metadata.date': {'$gte': dt1, '$lte': dt2},
                'metadata.site': 'site-1',
                'metadata.page': '/index.html'},
            {'metadata.date': 1, 'hourly': 1},
            sort=[('metadata.date', 1)])

    Indexing:

        db.stats.daily.create_index([
            ('metadata.site', 1),
            ('metadata.page', 1),
            ('metadata.date', 1)])
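Once a few days of hourly data have been fetched, the documents still need to be flattened into points for a chart. The sketch below shows one way to do that in plain Python; `hourly_series` is a hypothetical helper, and the input is assumed to be an iterable of daily documents sorted by `metadata.date`, as the range query would return them.

```python
from datetime import datetime, timedelta, timezone


def hourly_series(daily_docs):
    """Flatten daily documents into a list of (timestamp, hits) points."""
    points = []
    for doc in daily_docs:
        day = doc["metadata"]["date"]
        hourly = doc.get("hourly", {})
        for hour in range(24):
            # Treat a missing hour as 0 hits (assumption for sparse documents).
            points.append((day + timedelta(hours=hour), hourly.get(str(hour), 0)))
    return points


# Sample documents standing in for a query result.
sample = [
    {"metadata": {"date": datetime(2010, 10, 10, tzinfo=timezone.utc)},
     "hourly": {"0": 227850, "1": 210231}},
    {"metadata": {"date": datetime(2010, 10, 11, tzinfo=timezone.utc)},
     "hourly": {"0": 5}},
]
series = hourly_series(sample)
print(len(series), series[1])
```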
  9. Get Data for a Historical Chart

    Daily data for a single month:

        db.stats.monthly.find_one(
            {'metadata': {'date': dt, 'site': 'site-1', 'page': '/index.html'}},
            {'daily': 1})

    Several months of daily data:

        db.stats.monthly.find(
            {
                'metadata.date': {'$gte': dt1, '$lte': dt2},
                'metadata.site': 'site-1',
                'metadata.page': '/index.html'},
            {'metadata.date': 1, 'daily': 1},
            sort=[('metadata.date', 1)])

    Indexing:

        db.stats.monthly.create_index([
            ('metadata.site', 1),
            ('metadata.page', 1),
            ('metadata.date', 1)])
  10. https://docs.mongodb.org/ecosystem/use-cases

    ‣ Storing Log Data
    ‣ Pre-Aggregated Reports
    ‣ Hierarchical Aggregation
    ‣ Product Catalog
    ‣ Inventory Management
    ‣ Category Hierarchy
    ‣ Metadata and Asset Management
    ‣ Storing Comments