DAGDA: a data manager for heterogeneous computing platforms
Presentation by Gaël Le Mahec (Associate Professor, MIS Laboratory, Université de Picardie Jules Verne) at the Rencontres SaaS, Cloud & innovation event organized by SysFera on May 23 in Clamart.

SysFera

May 31, 2012

Transcript

  1. DAGDA: a data manager for heterogeneous computing platforms

    Gaël Le Mahec (MIS – UPJV)
    Rencontres SaaS, Cloud & innovation, May 23, 2012, Clamart
  2. Outline

    • Data Management & Big Data
    • DAGDA: DIET’s data manager
    • Use cases
    • The future of DAGDA
  3. Data Management & Big Data

    • What is “Big Data”?
      – All kinds of data
      – Valuable insight, but difficult to extract
      – Several dimensions
        • Variety
          – Structured/unstructured
          – Text, audio, video…
        • Velocity
          – Time sensitivity
          – Streaming
        • Volume
          – Large files
          – Small files in large quantities
        • Variability
          – Different meanings/formats over different time periods
  4. Data Management & Big Data

    • Many usages:
      – In-motion analysis
        • Smart grid, cyber-security, real-time promotions, …
      – Analysis of huge data volumes
        • Environmental analysis, risk modeling, scientific data, …
      – Discovery and experimentation
        • Scientific research, hypothesis testing, ad-hoc analysis, …
      – Management and planning
        • Planning and forecasting analysis, predictive analysis, …
      – Variety of information
        • Social/media sentiment analysis, audio/video analysis, …
  5. Data Management & Big Data

    • Data manager’s role:
      – Store and retrieve data
      – Add, remove, (copy), replicate, (move data)
      – Give access to data for read, write or update
    • A data manager should ensure
      – Efficient data access
      – Data integrity and consistency
      – Access permissions, authentication
  6. Data Management & Big Data

    • Efficient data access ➭ data replication

    [Figure: data blocks spread over several nodes, with the operations store, remove, replicate, access, update, copy, and search & retrieve]
  7. Data Management & Big Data

    • Data replication ➭ consistency problem

    [Figure: same diagram of replicated data blocks and operations (store, remove, replicate, access, update, copy, search & retrieve)]
  8. Data Management & Big Data

    • Data replication ➭ data deletion problems

    [Figure: same diagram of replicated data blocks and operations (store, remove, replicate, access, update, copy, search & retrieve)]
  9. Data Management & Big Data

    • Data replication ➭ source selection problem

    [Figure: same diagram of replicated data blocks and operations (store, remove, replicate, access, update, copy, search & retrieve), with a question mark on which replica to read from]
  10. DAGDA: DIET’s data manager

    • DIET: a middleware for HPC on distributed and heterogeneous environments
      – Hierarchical network of agents and servers
      – GridRPC-compliant middleware
      – Advanced scheduling mechanism
      – Complex workflow management
  11. DAGDA: DIET’s data manager

    • DIET’s current data manager, which provides:
      – Efficient default data management strategies (implicit management)
      – User fine-tuning through the DAGDA API and configuration (explicit management)
    • Data persistence and replication
      – To avoid useless data transfers
      – To increase data availability
    • Platform snapshot system for error recovery
    • Resource usage limitation

    [Figure: Dagda, data manager for distributed and heterogeneous platforms]
  12. DAGDA: DIET’s data manager

    • Automatic management
      – Data replication
      – Data persistence
        • Volatile data
        • Persistent data (with/without return to client)
        • Sticky data (with/without return to client)
      – Cache management (LRU, LFU, FIFO)
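The slide lists LRU, LFU and FIFO as cache management policies. As a minimal illustration of the first of these, here is a sketch of a least-recently-used cache bounded by a number of entries. The class and its methods are hypothetical and are not part of DAGDA; a real data manager would more likely bound the cache by bytes than by entry count.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache bounded by a number of entries."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # least recently used first

    def get(self, key):
        """Return the cached value, or None if the key is absent."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        """Insert or update a value, evicting the oldest entry if full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop least recently used

    def keys(self):
        """Keys ordered from least to most recently used."""
        return list(self._items)
```

FIFO would differ only in never reordering on `get`, and LFU would track an access counter per key instead of recency.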
  13. DAGDA: DIET’s data manager

    • DAGDA API:
      – Put data sync/async
      – Get data sync/async
      – Remove data
      – Replicate using replication rules
      – Define “alias”
      – Platform data snapshot
  14. DAGDA: DIET’s data manager

    • Consistency: partially ensured
      – All replicas will be identical… after a while…
      – Updates can take a long time…

    Question: is a consistency mechanism always important?
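One common way to obtain replicas that become identical "after a while" is to tag each update with a version number and let replicas periodically pull newer state from a neighbour. The sketch below illustrates that general idea only. It is an assumption made for this example, not a description of how DAGDA propagates updates, and every name in it is hypothetical. It also assumes a single writer, so two replicas never produce the same version number independently.

```python
class Replica:
    """Toy replica holding one value and a version counter."""

    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value):
        # A local update bumps the version; other replicas are now stale.
        self.version += 1
        self.value = value

    def merge(self, version, value):
        # Adopt remote state only if it is strictly newer than ours.
        if version > self.version:
            self.version = version
            self.value = value


def gossip_round(replicas):
    """One round: every replica pulls from its predecessor in a ring,
    using the state the predecessor had at the start of the round."""
    snapshot = [(r.version, r.value) for r in replicas]
    for i, replica in enumerate(replicas):
        version, value = snapshot[i - 1]  # index -1 wraps around the ring
        replica.merge(version, value)
```

With three replicas, an update written on the first one reaches the second after one round and the third after two, which is the window during which readers may still see stale data.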
  15. DAGDA: DIET’s data manager

    • Data deletion problem:
      – If a node restarts while still holding deleted data, that data can be “born again”
      – But it will disappear after a while if it is not used

    Question: is it always important to remove data immediately?
  16. DAGDA: DIET’s data manager

    • Data source selection:
      – Based on statistics on previous transfers
      – Select source among the DAGDA nodes
      – No other protocols are managed yet
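As a rough illustration of choosing a source from statistics on previous transfers, the sketch below accumulates bytes and seconds per node and picks the candidate with the best average throughput. The class, its methods and the ranking rule are assumptions made for this example; the slides do not say which statistics DAGDA records or how it ranks sources.

```python
from collections import defaultdict


class TransferStats:
    """Toy per-node transfer history used to rank candidate sources."""

    def __init__(self):
        self._bytes = defaultdict(int)
        self._seconds = defaultdict(float)

    def record(self, node, nbytes, seconds):
        """Store the outcome of one completed transfer from `node`."""
        self._bytes[node] += nbytes
        self._seconds[node] += seconds

    def throughput(self, node):
        """Average bytes per second observed so far (0.0 if unknown)."""
        seconds = self._seconds.get(node, 0.0)
        if seconds <= 0:
            return 0.0
        return self._bytes[node] / seconds

    def best_source(self, candidates):
        """Pick the candidate node with the highest observed throughput."""
        if not candidates:
            raise ValueError("no candidate source holds the data")
        return max(candidates, key=self.throughput)
```

In this sketch, nodes with no history rank last. A production policy would probably try them occasionally so that their statistics can be learned.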
  17. Use cases in bioinformatics

    • DIET-Blast: DIET’s implementation of the “Basic Local Alignment Search Tool”
      – Data-intensive bioinformatics application
      – Uses large biological databases to find clues about gene or protein functions
    ➢ Data replication & persistence

    [Figure: the request is divided into several sub-requests, which are then sent to the DIET-Blast SeDs. When needed, the biological database is replicated on these nodes.]
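The splitting step on this slide can be sketched as follows: a batch of query sequences is divided into roughly equal sub-requests, one per available server. The function name and the round-robin split are assumptions for illustration; the slides do not say how DIET-Blast partitions a request.

```python
def split_request(sequences, n_servers):
    """Split a list of query sequences into at most n_servers sub-requests.

    Sequences are dealt out round-robin so sub-request sizes differ by at
    most one. Empty sub-requests are dropped when there are fewer
    sequences than servers.
    """
    if n_servers < 1:
        raise ValueError("need at least one server")
    chunks = [sequences[i::n_servers] for i in range(n_servers)]
    return [chunk for chunk in chunks if chunk]
```

Each sub-request would then be submitted to one server, which needs a local replica of the database before it can run the search. That is where replication and persistence pay off: the database is transferred once and reused by later requests.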
  18. Use cases in bioinformatics

    • De novo genome assembly data-flow
    ➢ Data replication & persistence

    [Figure: data-flow from amplified DNA to an assembled genome]
      – Amplified DNA goes through a high-throughput genome sequencer
      – The sequencer generates thousands of reads (ACCAGTACCTAGTTAC..., CGTAACTTACGTTACA..., ...)
      – Clean the reads (several methods)
      – Assemble the reads (several methods)
      – Try to merge the assemblies (several combinations)
      – Get an assembled genome! (if we are dead lucky...)
  19. Use case: Grid TLSE

    • Sparse linear algebra problem solving

    See Ronan’s presentation.
  20. The future of DAGDA

    • DAGDA 2.0 will introduce many improvements
      – Multi-protocol data transfers
      – Advanced storage management
      – Exploitation of data attributes (key-value metadata)
        • How consistent should the replicas be?
        • What is the data lifetime?
        • Which protocol is better for the transfer?
        • Is the data used in RAM, in the file system, in GPU memory, …?
        • …
      – Extended multi-level cache management
      – And so much more!
  21. The future of DAGDA

    • Highly modular with pluggable:
      – Transfer protocols
      – Storage resources
      – Cache management algorithms
      – Security manager
      – Replication/placement algorithms
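A pluggable design of this kind is often expressed as one small interface per component plus a registry that maps names to implementations. The sketch below shows the idea for transfer protocols only. The interface, the registry and the `in-memory` plugin are hypothetical and are not taken from DAGDA 2.0.

```python
from abc import ABC, abstractmethod


class TransferProtocol(ABC):
    """Hypothetical interface every transfer plugin must implement."""

    @abstractmethod
    def transfer(self, data_id, source, destination):
        """Copy the item `data_id` from `source` to `destination`."""


_PROTOCOLS = {}  # registry: protocol name -> plugin class


def register_protocol(name):
    """Class decorator that makes a plugin selectable by name."""
    def decorator(cls):
        _PROTOCOLS[name] = cls
        return cls
    return decorator


@register_protocol("in-memory")
class InMemoryTransfer(TransferProtocol):
    """Trivial plugin: source and destination are plain dictionaries."""

    def transfer(self, data_id, source, destination):
        destination[data_id] = bytes(source[data_id])


def get_protocol(name):
    """Instantiate the plugin registered under `name`."""
    try:
        return _PROTOCOLS[name]()
    except KeyError:
        raise ValueError(f"unknown transfer protocol: {name}") from None
```

The same pattern would apply to the other pluggable parts listed on the slide (storage resources, cache algorithms, security, placement): the core looks plugins up by name, for example from a configuration value, and only depends on the interface.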