February 13, 2019 Course Information ▪ Course book ▪ Database System Concepts (Sixth Edition), Abraham Silberschatz, Henry Korth and S. Sudarshan, McGraw-Hill, 2010 ▪ additional information from the book is available online - http://highered.mcgraw-hill.com/sites/0073523321 ▪ Course information (lecture slides, exercises, …) available on Canvas ▪ https://canvas.vub.be/courses/3432
February 13, 2019 Exercises ▪ Course content is going to be applied in the exercise sessions ▪ Weekly exercise session ▪ starting on February 22 - computer room E.1.2, Friday 14:00-16:00 ▪ assistant: Ahmed K.A. Abdullah ▪ Additional content may be covered in exercise sessions ▪ exam covers content of lectures and exercises
February 13, 2019 Course Overview 1. Introduction ▪ overview ▪ conceptual modelling and ER model 2. Extended ER Model and other Modelling Languages 3. Relational Model and Relational Algebra 4. Relational Database Design ▪ reduction ▪ functional dependencies and normalisation 5. Structured Query Language (SQL) 6. Advanced SQL
February 13, 2019 Databases in Action ▪ Online shops ▪ product information, customer data, order data, ... ▪ e.g. Amazon - hundreds of millions of customers - more than 50 terrabytes of data ▪ Human resources ▪ course registration, student grades, employee records, salary information, tax information, ... ▪ e.g. PointCarré - course registration
February 13, 2019 Databases in Action ... ▪ Banking and trading ▪ customer data, account information, transactions, ... ▪ e.g. London Stock Exchange - almost 1 million trades per day ▪ Reservation systems ▪ book flights from multiple airlines, hotel rooms etc. ▪ e.g. Amadeus systems - Global Distribution System (GDS) founded by Lufthansa, Air France and other partners
February 13, 2019 Databases in Action ... ▪ Digital archives ▪ persistently store various types of digital media ▪ e.g. Internet Archive project - access to more than 450 billion archived web pages (http://archive.org) ▪ Libraries ▪ index for traditional paper- based libraries as well as digital libraries ▪ e.g. Open Library project - over 23 million indexed books (http://openlibrary.org)
February 13, 2019 Databases in Action ... ▪ Geographic Information Systems (GIS) ▪ store raster (bitmap) or vector data representing real world objects ▪ geospatial query language ▪ Scientific databases ▪ sensor data, classifications (e.g. human genome) as well as data from simulations ▪ e.g. LHC Computing Grid - LHC experiments at CERN - 15 petabytes of data per year
February 13, 2019 Databases in Action ... ▪ Many everyday devices contain databases ▪ TVs, washing machines, mobile phones, ... ▪ e.g. Android phones with SQLite database ▪ Embedded databases in cars, airplanes etc. ▪ manage configurations and store sensor data ▪ e.g. db4o object database was used in BMW's Car IT system
February 13, 2019 Databases in Action ... ▪ Databases in the WISE research lab (VUB) ▪ database-driven cross-media publishing ▪ database extensions for hypermedia services ▪ personal information management (PIM) ▪ data visualisation - e.g. ArtVis ▪ human-information interaction ▪ paper-digital interfaces
February 13, 2019 Databases in Action ... ▪ Databases touch all aspects of our daily life! ▪ Numerous large database software companies ▪ e.g. Oracle is the 2nd largest software company in 2017 ▪ Databases form an important part of product lines of Microsoft (SQL Server), IBM (DB2), …
February 13, 2019 Basic Terminology ▪ Database ▪ collection of logically related data ▪ database schema describes the database design (blueprint) - format and relationships between stored data (often rather static) ▪ collection of data stored in a database at a given time is called an instance of the database ▪ Database Management Systems (DBMS) ▪ tools (programs) to efficiently store, maintain and retrieve information from a database - support of create, read, update and delete data (CRUD operations) - data definition language (DDL) to define the database schema - data manipulation language (DML) to query and update the data • often declarative fourth generation languages (4GLs) such as SQL - data access control, transactions and concurrency control
February 13, 2019 File Processing System ▪ Why should we not just use multiple files in a file system to store our data? ▪ There are various disadvantages of such an approach ▪ data redundancy and inconsistency - different file formats over time - duplication of information in different files ▪ limited data access - we have to write new programs to carry out new tasks - data cannot be retrieved in a convenient and efficient manner ▪ data isolation - data may be distributed over different files without a common format ▪ integrity - integrity constraints (e.g. balance > 0) are hidden in the program code and not explicitly stated and checked
February 13, 2019 File Processing System … ▪ missing atomic operations - system failures and crashes may leave the data in an inconsistent state (e.g. only parts of a operation have been carried out) - example: transfer of money from one account to another account - we later discuss transaction management as a solution ▪ concurrent update anomalies - concurrent updates may leave the data in an inconsistent state - example: two programs simultaneously removing money from a single account - we later discuss scheduling as a solution ▪ limited security control - difficult to give a user only access to parts of a file ▪ DBMSs offer solutions to all these problems ▪ concepts and algorithms to solve the problems with file processing systems
February 13, 2019 Database Management System (DBMS) Access Methods System Buffers Authorisation Control Integrity Checker Command Processor Program Object Code DDL Compiler File Manager Buffer Manager Recovery Manager Scheduler Query Optimiser Transaction Manager Query Compiler Queries Catalogue Manager DML Preprocessor Database Schema Application Programs Data, Indices and System Catalogue Database Manager Data Manager DBMS Programmers Users DB Admins Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010
February 13, 2019 Data Abstraction ▪ DBMS provides abstract view of data ▪ hide some details how data is stored ▪ Physical level ▪ physical schema describes how the data is stored (complex low-level data structures) ▪ Logical level ▪ logical schema describes what data is stored - simple structures: attribute names, data types and relationships between data - implementation of simple structures might be based on complex physical-level structures but the user of the logical level should not be aware of that → physical data independence ▪ View level ▪ subschemas provide only access to parts of the database - reduce complexity and introduce security Viewn Physical Level Logical Level View1
February 13, 2019 Duality of the Database Schema Database Schema describes describes Application Concepts Database Concepts Application World Computer World
February 13, 2019 Data Models and History of DBMSs ▪ Punched cards used since 1725 (textile looms) ▪ Hollerith cards later used by IBM for data processing ▪ 1950s: Data processing with magnetic tapes as storage ▪ only sequential access to data ▪ reading from one or multiple tapes and writing to a new tape ▪ sometimes combined with input from punched cards etc. Magnetic Tape Punched Card
February 13, 2019 Data Models and History of DBMSs ... ▪ 1960s: Widespread use of hard disks ▪ direct access (random access) to data ▪ opened possibilities for new Navigational DBMSs - IBM's IMS (1968), hierarchical database - Integrated Data Store (IDS), network database IBM 350 Disk Storage Unit Hard Disk
February 13, 2019 Data Models and History of DBMSs ... ▪ A data model is a collection of conceptual tools ▪ describes data, data relationships, data semantics and constraints ▪ Hierarchical model ▪ data organised in a tree structure ▪ used in early mainframe DBMS - e.g. IBM's Information Management System (IMS) ▪ XML documents also described by a hierarchical model ▪ Network model ▪ generalised graph structure ▪ two main constructs - records contain fields and sets define relationships between records ▪ navigational operations - follow the relationship from one record to another record
February 13, 2019 Data Models and History of DBMSs ... ▪ Relational model ▪ collection of tables (relations) containing records ▪ described in a paper by Edgar F. Codd in 1970 ▪ 1970s: Relational DBMS ▪ IBM's System R (1974) "based" on Codd's paper; SQL added later ▪ Entity-Relationship (ER) model ▪ representation of basic objects (entities) and their relationships ▪ widely used in conceptual database design ▪ Object-based data model ▪ introduces object identity, encapsulation and methods ▪ 1980s: Object Databases ▪ seminal work on object databases
February 13, 2019 Data Models and History of DBMSs ... ▪ 1990s: Web Interfaces to Databases ▪ databases deployed much more extensively ▪ Semi-structured data model ▪ no clear separation between data and the schema ("self-describing" data) ▪ individual data items of the same type may have different attributes ▪ XML is widely used to represent semi-structured data ▪ 2000s: XML and XQuery ▪ relational databases often still form the core ▪ Later 2000s: Extremely large-scale distributed DBMS ▪ BigTable or Hadoop and Hbase ▪ "NoSQL databases"
February 13, 2019 Database Design ▪ Conceptual Design ▪ define an abstract conceptual application model containing the main domain concepts - interact with domain experts to get the requirements ▪ describe the entities (with attributes) and their relationships - e.g. via ER model ▪ specify the functional requirements (operations) - ensure that operations can be realised based on the conceptual model ▪ Database implementation based on conceptual model ▪ logical design phase - mapping of the conceptual schema to the implementation data model • e.g. reduction from the ER model to the relational data model - define the logical database schema ▪ physical design phase - define the physical database layout based on the logical database schema
February 13, 2019 Database Design ... ▪ Two major database design pitfalls have to be avoided ▪ we should avoid any redundancy where information is repeated at multiple places since this might lead to inconsistent data - e.g. a lecture management system where a student's name is stored for each lecture they are attending instead of storing it in a separate student entity ▪ a database design may be incomplete and not enable the representation of certain aspects of the application domain - e.g. in a shopping application where the customer information is stored as part of an order we cannot enter new customer data without having an order ▪ There is often more than one "good design" ▪ e.g. when do we model something as a relationship and when as a separate entity? ▪ modelling is a challenging task that requires a combination of engineering skills and "good taste"
February 13, 2019 Entity-Relationship (ER) Model ▪ Conceptual model based on a set of entities and relationships ▪ An entity is a "thing" or "object" that can be distinguished from other objects ▪ A relationship describes an association between multiple entities ▪ Introduced and formalised by Peter Chen ▪ P. Chen, The Entity-Relationship Model - Toward a Unified View of Data, ACM Transactions on Database Systems 1 (1), March 1976 - one of the most cited and influential papers in Computer Science Peter Chen
February 13, 2019 Entities ▪ An entity represents a distinguishable object ▪ e.g. specific person, car or company ▪ An entity is described by a number of attributes ▪ has to be uniquely identifiable by its attributes (ovals) ▪ An entity set is a set of entities with the same type ▪ the extension of the entity set (rectangle) are its entities Beat Signer 1234 1576 Lode Hoste 3212 William Van Woensel Employees Employees id name note that we will use a slightly different notation than in the book!
February 13, 2019 Attributes ▪ The set of permitted attribute values is called the domain or value set ▪ entity instances can be described by a set of (name,value) pairs ▪ e.g. {(id, 1576),(name, Lode Hoste)} ▪ The ER model supports the following attribute types ▪ simple attributes ▪ composite attributes - hierarchy of sub-attributes ▪ multivalued attributes - optional lower and upper bounds ▪ derived attributes - computed via relationships or other attribute values
February 13, 2019 Attributes ... ▪ A multivalued attribute is represented by a double ellipse ▪ Derived attributes are indicated by dashed ellipses ▪ address is an example of composite attribute LocatedAt Offices Employees id name birthday phone age #offices address street city 0..1 0..*
February 13, 2019 Keys ▪ An entity's attribute values must uniquely identify the entity ▪ A subset of attributes that uniquely identify an entity is called superkey ▪ A minimal superkey without any unnecessary attributes is called a candidate key ▪ The primary key is one of the candidate keys chosen by the database designer for unique entity identification ▪ in the ER model, the primary key is highlighted by the set of underlined attributes ▪ the value of a primary key should change very rarely
February 13, 2019 Relationships ▪ A relationship is an association between multiple entities ▪ A relationship set (diamond) is a set of relationships of the same type ▪ e.g. LocatedAt Beat Signer 1234 1576 Reinout Roels 3212 Sandra Trullemans Employees 10F716 10G731e 10G731d 10F703 Offices LocatedAt Offices Employees id name name address
February 13, 2019 Relationships ... ▪ We can have binary or n-ary relationship sets ▪ {(e1 , e2 ,..., en ) | e1 E1 , e2 E2 ,..., en En } ▪ Each relationship instance in an ER schema represents an association between the involved entities ▪ The role defines an entity's function in a relationship ▪ has to be explicitly defined if the same entity set participates more than once in a relationship set (recursive relationship) ▪ A relationship may contain descriptive attributes ▪ A relationship instance must be uniquely identifiable by its entities (without any descriptive attributes) ▪ i.e. a relationship set cannot contain two relationship entities that only differ in their descriptive attributes
February 13, 2019 Cardinality Constraints ▪ A relationship can be one-to-one, one-to-many, many-to-one or many-to-many ▪ An arrow indicates a to-one relationship ▪ cardinality constraints may also be expressed by numbers - e.g. 0..*, 1..*, 0..1, 1..1, 2..5 ▪ a double line or 1..* indicates a total participation constraint MarriedTo Women Men Teaches Courses Teachers StarsIn Films Filmstars
February 13, 2019 Weak Entity Sets ▪ An entity set with a primary key is called a strong entity set ▪ A weak entity set (double rectangle) does not have enough attributes to form a primary key
February 13, 2019 Weak Entity Sets ... ▪ A weak entity set must be associated with an identifying entity set via an identifying relationship (double diamond) ▪ a weak entity set is existence dependent on an identifying entity set - can also participate in other non-identifying relationships ▪ a weak entity set must relate to the identifying entity set via a total participation constraint and each weak entity instance can only be related to one identifying entity instance ▪ a discriminator or partial key (underlined dashed attributes) uniquely identifies a weak entity relative to a strong entity ▪ In some cases a weak entity may also be expressed as a multivalued composite attribute
February 13, 2019 Homework ▪ Study the following two chapters of the Database System Concepts book ▪ chapter 1 - Introduction ▪ chapter 7 - sections 7.1-7.5 and 7.7 - Database Design and the ER Model
February 13, 2019 References ▪ A. Silberschatz, H. Korth and S. Sudarshan, Database System Concepts (Sixth Edition), McGraw-Hill, 2010 ▪ P. Chen, The Entity-Relationship Model - Toward a Unified View of Data, ACM Transactions on Database Systems 1 (1), March 1976 ▪ WISE Lab ▪ https://wise.vub.ac.be ▪ ArtVis Project ▪ https://wise.vub.ac.be/project/artvis-exploring-information-through- advanced-visualisation-techniques/