Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Warehousing Tools in Bioinformatics

Data Warehousing Tools in Bioinformatics

A brief introduction on the subject of data warehousing - why you would want to do this, and typical strategies - followed by an overview of two tools commonly used in bioinformatics groups to build data warehouses: BioMart and InterMine.

I gave this talk during a learning lunch session at work.

Darren Oakley

July 18, 2012
Tweet

More Decks by Darren Oakley

Other Decks in Programming

Transcript

  1. In computing, a data warehouse (DW or DWH) is a

    database used for reporting and analysis. The data stored in the warehouse are uploaded from the operational systems (such as marketing, sales etc). The data may pass through an operational data store for additional operations before they are used in the DW for reporting. http://en.wikipedia.org/wiki/Data_warehouse
  2. Why Use A Data Warehouse • Integrate data from multiple

    systems ◦ Make disparate data more uniform ◦ Improve data quality/usability • Data reporting and analysis/mining ◦ Query-oriented ◦ Bulk-data retrieval
  3. Concepts • Fact/Main Table ◦ A table, or group of

    tables, containing the data that you want to retrieve out of your data warehouse ▪ Sales data ▪ Publication data ▪ Gene data ▪ Mouse breeding data ▪ etc.
  4. Concepts • Dimension/Lookup Table ◦ Tables used for filtering the

    fact tables through the use of foreign keys ◦ Can have multiple fields/filters per table ▪ Product groups (cat id/name) ▪ Dates (year/month/day/time) ▪ etc.
  5. Star Schema • Simplest style of data warehouse • One

    or more fact tables surrounded by multiple dimension tables • Pros ◦ Optimised for end-user ease-of-use ◦ Query performance • Cons ◦ Inefficient data storage
  6. http://en.wikipedia.org/wiki/Star_schema SELECT P.Brand, S.Country, SUM(F.Units_Sold) FROM Fact_Sales F INNER JOIN

    Dim_Date D ON F.Date_Id = D.Id INNER JOIN Dim_Store S ON F.Store_Id = S.Id INNER JOIN Dim_Product P ON F.Product_Id = P.Id WHERE D.YEAR = 1997 AND P.Product_Category = 'tv' GROUP BY P.Brand, S.Country
  7. Snowflake Schema • Extension on the star schema • Dimension

    tables are normalised • Pros ◦ Same pros as star schema... ◦ Lower storage space requirements ◦ Can answer some questions that may not be possible with a star schema design • Cons ◦ Queries can become more complex
  8. http://en.wikipedia.org/wiki/Snowflake_schema SELECT B.Brand, G.Country, SUM(F.Units_Sold) FROM Fact_Sales F INNER JOIN

    Dim_Date D ON F.Date_Id = D.Id INNER JOIN Dim_Store S ON F.Store_Id = S.Id INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id INNER JOIN Dim_Product P ON F.Product_Id = P.Id INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id WHERE D.YEAR = 1997 AND C.Product_Category = 'tv' GROUP BY B.Brand, G.Country
  9. Reverse Star Schema • Inverse of the star schema ◦

    Dimension tables contain bulk/fact data ◦ Main (was Fact) table contains queryable fields ◦ May have multiple main tables • Pros ◦ Same as star schema ◦ Fast retrieval of large datasets • Cons ◦ Inefficient data storage ◦ May not be able to answer some questions that would be possible with a snowflake schema
  10. Reverse Star Schema Main Phenotype Test Results (summary) Phenotype Test

    Results (detailed) Gene Expression Data External Collaborator Data Gene Expression Images Fertility Data
  11. BioMart • Federated data warehousing system built around the reverse

    star schema design • Originally developed at the EBI and Sanger Institute in Cambridge, now OICR, Canada • Originally built for the Ensembl project • Used for... ◦ Providing uniform access to scientific data ◦ Allowing collaborators to easily share data ◦ Reporting / data mining
  12. Data Federation • Allows users to query/join separate datasets/marts as

    if they were one dataset ◦ Datasets on the same server ◦ Remote datasets (transparent via web services)
  13. Making a BioMart • Fairly straight forward ◦ If you

    know the questions you want to ask ◦ If you know your database enough to de- normalise/flatten it • GUI tools ◦ MartBuilder - database preparation ◦ MartEditor - configure interfaces
  14. BioMart • v0.7 ◦ Perl, Apache, mod_perl + RDBMS ◦

    Can only federate across 2 datasets/marts ◦ MartView web interface ◦ XML web services ◦ http://www.biomart.org/biomart/martview • v0.8 ◦ Java + RDBMS ◦ Unlimited dataset/mart federation ◦ Multiple web interface options ◦ XML/JSON/SPARQL web services ◦ http://central.biomart.org
  15. InterMine • Generic data warehousing system (non- federated) • Developed

    and supported by Cambridge University • Originally built for the FlyMine project • Used for... ◦ Model-organism data warehouses ◦ Reporting / data mining
  16. Making an InterMine • XML Configuration ◦ Define objects to

    be represented and how they relate to each other ◦ Define data import settings ◦ Possibly write data import scripts ◦ Configure interfaces (web and service layers) • GUI tools are in development
  17. InterMine • Java, Ant, Tomcat • PostgreSQL • Perl (for

    some build scripts) • Web interface • XML/JSON web services
  18. BioMart vs InterMine "Well, I like BioMart and I like

    InterMine. But which is better? There's only one way to find out... FIGHT!"
  19. • Pros ◦ More generic - suited to any dataset

    type ◦ Federated queries ◦ Active user/developer community • Cons ◦ The GUI tools can be a pain sometimes ◦ Basic web interface BioMart
  20. InterMine • Pros ◦ Slick web interface ◦ Pre-configured notion

    of biological entities ▪ Genes, Proteins etc. ◦ Built-in data analysis/enrichment tools ▪ Ontologies ▪ Publications • Cons ◦ Only suitable for biological data ◦ Developer/user community not as large ◦ No data federation ◦ Not as easy to setup as BioMart