Data Modeling in RethinkDB

RethinkDB Meetup San Francisco, California June 29, 2015 Data Modeling
in RethinkDB

Jorge Silva @thejsj Developer Evangelist @ RethinkDB

Introduction Data modeling in other databases

SQL • Very basic data types: strings, numbers, and date/time
• All data is saved in rows under a predefined column (schema) • Complex data is modeled through relations (foreign keys)

SQL • Relations favor data spread across multiple tables •
Limited data types favor very normalized data

SQL Family Tree People id name 1 Darth 2 Luke
3 Leia ancestor descendant 1 1 1 2 1 3 TreePaths

MongoDB • JSON support reduces the need for spreading data
across tables • Joins happen at the application layer • Lack of relations and joins favors storing more data on a single document

MongoDB Family Tree { "id": 1, "name": "Darth", "children": [
{ "name": "Luke" }, { "name": "Leia" } ] }, { "id": 2, "name": "Luke" }, { "id": 3, "name": "Leia", }

MongoDB Family Tree { "id": 1, "name": "Darth", "children": [
2, 3 ] }, { "id": 2, "name": "Luke" }, { "id": 3, "name": "Leia", } • Array of ids, joined at application level

RethinkDB • RethinkDB allows for both: SQL style data modeling
through joins and MongoDB style modeling through subdocuments

Data Types • The introduction of arrays and objects completely
changes how you structure your data • Arrays reduce the need for intersection tables • Objects allow for nested subdocuments and easy lookup

RethinkDB vs SQL • Even though RethinkDB has joins, it
has no relations. • There is no concept of a foreign key or cascading queries at a database-level • Schema enforcement has to be done manually

Data Modeling in RethinkDB • Data modeling in RethinkDB is
incredibly flexible. This flexibility may seem daunting at first • What you already know is applicable

Data Modeling in RethinkDB You still need to think about
your data

Thinking about your database Tradeoﬀs

Main Tradeoff • Do I store this related piece of
data as a subdocument or do I save it in a diﬀerent table? • What you do depends on your use case

Structure and Performance Considerations

How to decide what to do? • The format and
structure of your data is not the only consideration • RethinkDB's flexibility allows you to optimize for ease-of-use and performance

Considerations • Structure of your data • Joins in query
• Atomic updates

Structure of your data • Models should still be simple,
faithful and avoid redundancy • Data should still be normalized and follow the three normal forms • Only break these rules for denormalization, but don't optimize prematurely

Joins in query • How you read your data is
important • If you always read your data in conjunction, you should store data in the same document • If data is read separately, it should be store in diﬀerent tables

Joins in query: Example • Example: You have 'cities' and
'states' and you need to store the relationships between them

Joins in query: Example • Options: • Have an intersection
table • Store the whole city as a subdocument • Store the city IDs in the document

Joins in query: Example // States { "id": 1, "name":
"California" }, { "id": 2, "name": "Oregon" } // Cities { "id": 1, "name": "Portland" }, { "id": 2, "name": "Salem" }, { "id": 3, "name": "San Francisco" }, { "id": 4, "name": "Mountain View" } // State-to-city { "state": 1, "city": 3 }, { "state": 1, "city": 4 }, { "state": 2, "city": 1 }, { "state": 2, "city": 2 }

Joins in query: Example # Reading all cities in state
r.table("states") .get(stateId).merge({ "cities": r.table("cities") .get_all(r.args(r.table("state-to-city") .get_all(stateId, index="state")["city"] )) }) # Reading one city r.table("cities") .get_all("Mountain View", index="name")

Joins in query: Example # Adding a state r.table("states") .insert({
"name": "Washington" }) # Adding a city r.table("cities") .insert({ "name": "San Jose" }) r.table("state-to-city") .insert({ "state": stateId, "city": cityId })

"California", "cities": [ { "name": "San Francisco" }, { "name": "Mountain View" } ] }, { "id": 2, "name": "Oregon", "cities": [ { "name": "Portland" }, { "name": "Salem" } ] }

Joins in query: Example # Getting a state with cities
r.table("states") .get(stateId) # Getting a city r.table("states") .filter(lambda row: row["cities"]["name"] == "San Jose" )[0] .map(lambda row: row.filter(lambda city: city["name"] == "San Jose") )

"name": "Washington", "cities": [] }) # Adding a city r.table("states") .get(stateId) .update({ "cities": r.row["cities"].append({ "name": "San Jose" }) })

"California", "cities": [ 3, 4 ] }, { "id": 2, "name": "Oregon", "cities": [ 1, 2 ] } // Cities { "id": 1, "name": "Portland" }, { "id": 2, "name": "Salem" }, { "id": 3, "name": "San Francisco" }, { "id": 4, "name": "Mountain View" }

Joins in query: Example # Getting a state with cities
r.table("states") .get(stateId).merge({ "cities": r.table("cities") .get_all(r.row["cities"]) }) # Getting a city r.table("cities") .get_all("Mountain View", index="name")

"name": "Washington", "cities": [] }) # Adding a city r.table("cities") .insert({ "name": "San Jose" }) r.table("states") .get(stateId).update({ "cities": r.row["cities"].append(cityId) })

Joins in query: Example # Creating an index for cities
r.table("states") .index_create("city", lambda row: row["cities"]["name"], multi=True) # Query city using index r.table("states") .get_all("San Francisco", index="city") Using an index for subdocument

Atomic Updates • Because RethinkDB is distributed, atomic updates can
only be guaranteed on a per-document basis • If you need atomic updates, you must keep everything in one document

• Example: You have an 'orders' table with an order
id and an array of products Atomic Updates: Example

• Options: • Store everything in one document to make
the update atomic • Store it in separate documents and give up atomic guarantees Atomic Updates: Example

Atomic Updates: Example # Update total and products in order
r.table("orders").get(order_id).update({ "products": r.row["products"].append(product['id']), "total_price": r.row["total_price"] + product["price"]) }) • The complete query is guaranteed to succeed or fail

Atomic Updates: Example # Update total in order r.table('orders').get(order_id).update({ "total_amount":
r.row["total_amount"] + product["price"] }) .do(lambda result: # Update products in intersection table # If this fails, the `update` won't be unapplied r.table("orders_products").insert({ "order": order_id, "product": product["id"] }) ) • The complete query is not guaranteed to succeed or fail

Summary • Data modeling in RethinkDB is very flexible •
The main tradeoﬀ is subdocuments vs joins • How you model your data will depend heavily on how you interact with the database

Summary Think about your data!

Questions?

Data Modeling in RethinkDB

Data Modeling in RethinkDB

More Decks by Jorge Silva

Other Decks in Programming

Featured

Transcript