Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Automated Schema Design for NoSQL Databases

Automated Schema Design for NoSQL Databases

Selecting appropriate indices and materialized views is critical for high performance in relational databases. By example, we show that the problem of schema optimization is also highly relevant for NoSQL databases. We explore the problem of schema design in NoSQL databases with a goal of optimizing query performance while minimizing storage overhead. Our suggested approach uses the cost of executing a given workload for a given schema to guide the mapping from the application data model to a physical schema. We propose a cost-driven approach for optimization and discuss its usefulness as part of an automated schema design tool.


Michael Mior

June 22, 2014


  1. Automated Schema Design for NoSQL Databases Michael J. Mior Supervised

    by Kenneth Salem
  2. Schema optimization For a given workload, construct the optimal physical

    schema to answer queries in the workload for a given storage budget
  3. What is NoSQL? • Not SQL? • NoACID? • Not

    only SQL
  4. Types of NoSQL Databases • Document stores • Key-value stores

    • Graph databases • … • Wide column stores
  5. Cassandra data model

  6. Cassandra data model Queries specify A set of row keys

    and either A list of column keys or A prefix of the column keys with a range
  7. Cassandra schema design Physical schema POIsWithHotels POI17 Hotel3 “Holiday Inn”

    … … No single obvious mapping from the conceptual schema Conceptual schema Hotel PointOfInterest Near Hotel8 “Motel 6”
  8. Cassandra schema design POIsWithHotels POI17 Hotel3 “Holiday Inn” … …

    HotelsWithPOIs Hotel3 POI17 “Liberty Bell” … … • Multiple choices of logical and physical schema are possible for a given conceptual schema • Different choices are more suitable for different queries • Harder to add additional structures in the future
  9. Cassandra schema design POIs POI17 Name “Liberty Bell” … …

    Hotels Hotel3 Name “Holiday Inn” … … HotelToPOI Hotel3 POI17 (null) … … POIToHotel POI17 Hotel3 (null) … …
  10. Cassandra schema design POIToHotel POI17 Hotel3 “Holiday Inn” … …

    HotelToPOI Hotel3 POI17 “Liberty Bell” … … HotelToPOI HotelID POIID (null) … … POIToHotel POIID HotelID (null) … …
  11. Challenges • Storage costs • View maintenance • Transactions •

    Data distribution
  12. Relational schema design • Commercial database vendors have existing tools

    ◦ Microsoft AutoAdmin for SQL Server ◦ DB2 Design Advisor ◦ Oracle SQL Tuning Advisor • These tools don’t solve all the problems for NoSQL databases • Usually an existing schema is assumed, and tools suggest secondary indices and materialized views
  13. Relational schema design Conceptual schema Logical schema Physical schema Hotel

    CREATE TABLE Hotel(…); CREATE TABLE POI(…); CREATE TABLE HotelToPOI(…, FOREIGN KEY (HotelID) REFERENCES Hotel(HotelID) FOREIGN KEY (POOID) REFERENCES POI(POIID) ); CREATE MATERIALIZED VIEW POIsWithHotels AS SELECT POI.*, Hotel.* FROM POI, HotelToPOI, Hotel WHERE HotelToPOI.HotelID = Hotel.HotelID AND HotelToPOI.POIID = POI.POIID; Mapping from conceptual to logical schema is straightforward Near PointOfInterest
  14. Cassandra schema design Physical schema POIsWithHotels POI17 Hotel3 “Holiday Inn”

    … … No single obvious mapping from the conceptual schema Conceptual schema Hotel Near Hotel8 “Motel 6” PointOfInterest
  15. Abstract example

  16. Query paths Get all points of interests near hotels a

    guest has stayed at SELECT Name FROM POI WHERE POI.HotelToPOI .Hotel.Room.Reservation .Guest.GuestID = ? Guest Reservation Room Hotel Point of Interest
  17. Index paths • Queries can “skip” entities on a path

    using an index • Indices to the final entity are equivalent to materialized views • Schema design is now the selection of these indices Guest Reservation Room Hotel Point of Interest
  18. Problem definition Input: • ER diagram with entity counts and

    cardinalities • Weighted queries over the diagram • Storage constraint • Cost model for target DB Output: • Suggested index configuration • Plans for executing each query
  19. Proposed solution 1. Enumerate possible useful indices for all queries

    2. Determine the benefit of a given index for each query 3. Select the optimal indices for a given space constraint
  20. Future Work • Retargeting to other NoSQL DBs • Automatic

    implementation of query plans • Handling workloads which include updates • Richer query language • Exploitation of DBMS-specific features
  21. Summary • Schema design in NoSQL databases presents unique challenges

    • A workload-driven approach is necessary • Conceptual schemata and queries are viable abstractions for NoSQL schema design tools
  22. Query language SELECT [attributes] FROM [entity] WHERE [path](=|<|<=|>|>=) [value] AND

    … ORDER BY [path] LIMIT [count] Conjunctive queries with simple predicates and ordering Higher level queries can be composed by the application
  23. Solution Index Enumerator Query Planner Index Selection Index benefit analysis

    Query Plans Selected Indices Workload Input Output
  24. Implementation • Focus on Cassandra as target system • Simple

    cost model for read-only in-memory workloads • DBMS-independent index enumerator and query planner • Index selection ILP implemented with GLPK
  25. Evaluation - TBD • Re-implement existing workloads in the target

    DBMS (e. g. RUBiS, RUBBoS) • Develop new benchmarks • Comparison with a human DBA • Random ER diagrams/queries with realistic properties