Slide 1

Slide 1 text

Automated Schema Design for NoSQL Databases Michael J. Mior Supervised by Kenneth Salem

Slide 2

Slide 2 text

Schema optimization For a given workload, construct the optimal physical schema to answer queries in the workload for a given storage budget

Slide 3

Slide 3 text

What is NoSQL? ● Not SQL? ● NoACID? ● Not only SQL

Slide 4

Slide 4 text

Types of NoSQL Databases ● Document stores ● Key-value stores ● Graph databases ● … ● Wide column stores

Slide 5

Slide 5 text

Cassandra data model

Slide 6

Slide 6 text

Cassandra data model Queries specify A set of row keys and either A list of column keys or A prefix of the column keys with a range

Slide 7

Slide 7 text

Cassandra schema design Physical schema POIsWithHotels POI17 Hotel3 “Holiday Inn” … … No single obvious mapping from the conceptual schema Conceptual schema Hotel PointOfInterest Near Hotel8 “Motel 6”

Slide 8

Slide 8 text

Cassandra schema design POIsWithHotels POI17 Hotel3 “Holiday Inn” … … HotelsWithPOIs Hotel3 POI17 “Liberty Bell” … … ● Multiple choices of logical and physical schema are possible for a given conceptual schema ● Different choices are more suitable for different queries ● Harder to add additional structures in the future

Slide 9

Slide 9 text

Cassandra schema design POIs POI17 Name “Liberty Bell” … … Hotels Hotel3 Name “Holiday Inn” … … HotelToPOI Hotel3 POI17 (null) … … POIToHotel POI17 Hotel3 (null) … …

Slide 10

Slide 10 text

Cassandra schema design POIToHotel POI17 Hotel3 “Holiday Inn” … … HotelToPOI Hotel3 POI17 “Liberty Bell” … … HotelToPOI HotelID POIID (null) … … POIToHotel POIID HotelID (null) … …

Slide 11

Slide 11 text

Challenges ● Storage costs ● View maintenance ● Transactions ● Data distribution

Slide 12

Slide 12 text

Relational schema design ● Commercial database vendors have existing tools ○ Microsoft AutoAdmin for SQL Server ○ DB2 Design Advisor ○ Oracle SQL Tuning Advisor ● These tools don’t solve all the problems for NoSQL databases ● Usually an existing schema is assumed, and tools suggest secondary indices and materialized views

Slide 13

Slide 13 text

Relational schema design Conceptual schema Logical schema Physical schema Hotel CREATE TABLE Hotel(…); CREATE TABLE POI(…); CREATE TABLE HotelToPOI(…, FOREIGN KEY (HotelID) REFERENCES Hotel(HotelID) FOREIGN KEY (POOID) REFERENCES POI(POIID) ); CREATE MATERIALIZED VIEW POIsWithHotels AS SELECT POI.*, Hotel.* FROM POI, HotelToPOI, Hotel WHERE HotelToPOI.HotelID = Hotel.HotelID AND HotelToPOI.POIID = POI.POIID; Mapping from conceptual to logical schema is straightforward Near PointOfInterest

Slide 14

Slide 14 text

Cassandra schema design Physical schema POIsWithHotels POI17 Hotel3 “Holiday Inn” … … No single obvious mapping from the conceptual schema Conceptual schema Hotel Near Hotel8 “Motel 6” PointOfInterest

Slide 15

Slide 15 text

Abstract example

Slide 16

Slide 16 text

Query paths Get all points of interests near hotels a guest has stayed at SELECT Name FROM POI WHERE POI.HotelToPOI .Hotel.Room.Reservation .Guest.GuestID = ? Guest Reservation Room Hotel Point of Interest

Slide 17

Slide 17 text

Index paths ● Queries can “skip” entities on a path using an index ● Indices to the final entity are equivalent to materialized views ● Schema design is now the selection of these indices Guest Reservation Room Hotel Point of Interest

Slide 18

Slide 18 text

Problem definition Input: ● ER diagram with entity counts and cardinalities ● Weighted queries over the diagram ● Storage constraint ● Cost model for target DB Output: ● Suggested index configuration ● Plans for executing each query

Slide 19

Slide 19 text

Proposed solution 1. Enumerate possible useful indices for all queries 2. Determine the benefit of a given index for each query 3. Select the optimal indices for a given space constraint

Slide 20

Slide 20 text

Future Work ● Retargeting to other NoSQL DBs ● Automatic implementation of query plans ● Handling workloads which include updates ● Richer query language ● Exploitation of DBMS-specific features

Slide 21

Slide 21 text

Summary ● Schema design in NoSQL databases presents unique challenges ● A workload-driven approach is necessary ● Conceptual schemata and queries are viable abstractions for NoSQL schema design tools

Slide 22

Slide 22 text

Query language SELECT [attributes] FROM [entity] WHERE [path](=|<|<=|>|>=) [value] AND … ORDER BY [path] LIMIT [count] Conjunctive queries with simple predicates and ordering Higher level queries can be composed by the application

Slide 23

Slide 23 text

Solution Index Enumerator Query Planner Index Selection Index benefit analysis Query Plans Selected Indices Workload Input Output

Slide 24

Slide 24 text

Implementation ● Focus on Cassandra as target system ● Simple cost model for read-only in-memory workloads ● DBMS-independent index enumerator and query planner ● Index selection ILP implemented with GLPK

Slide 25

Slide 25 text

Evaluation - TBD ● Re-implement existing workloads in the target DBMS (e. g. RUBiS, RUBBoS) ● Develop new benchmarks ● Comparison with a human DBA ● Random ER diagrams/queries with realistic properties