
Lines of work in HPC: Resource Information Policies & Hardware Assisted Tera-scheduling, by Esteban Mocskos

RISC Workshop 2013, Manizales. May 16 and 17.


Jorge I. Meza

May 17, 2013


Transcript

  1. Lines of work in HPC: Resource Information Policies & Hardware Assisted Tera-scheduling

     Esteban Eduardo Mocskos. Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Consejo Nacional de Investigaciones Científicas y Técnicas. Esteban Mocskos (DC-UBA/CONICET), Lines of work in HPC, April 2, 2013. 1 / 28
  2. Current Projects

     Three (at least) important projects are moving the local scene: 1. BIA: Plataforma de Bioinformática Argentina (Argentine Bioinformatics Platform). 2. CSC: Centro de Simulación Computacional para Aplicaciones Tecnológicas (Center for Computational Simulation in Technological Applications), CONICET. 3. SNCAD: Sistema Nacional de Cómputo de Alto Desempeño (National System for Large Computing Equipment).
  3. RISC Project

     Roadmap of High Performance Computing and Supercomputing strategic R&D in Latin America. Strategic research clusters established. A fully functioning network focusing on activities to support and promote coordination of HPC and Supercomputing research between Europe and Latin America.
  4. BIA: Plataforma de Bioinformática Argentina

     Main objective: to develop a service-oriented platform delivering bioinformatics-related services to public institutions and private industry.
  5. Objectives

     Design and development of simulation methods, algorithms and tools in the Computational Biology and Bioinformatics field. Cover identified needs in Bioinformatics in the public and private sectors. Increase the specialized human resources in the area: undergraduate and postgraduate courses, and creation of a Master's in Bioinformatics. Support the new Biology orientation at UBA. Currently searching for and hiring staff, and buying equipment.
  6. CSC: Centro de Simulación Computacional para Aplicaciones Tecnológicas

     To develop computational formulations oriented to the solution of technological problems in Argentinean industries. To train young engineers and scientists in the use of Computational Mechanics methods. To perform original research.
  7. Waves and more waves

     Computational modelling of mechanical and electromagnetic wave propagation in complex media. Technological application for mechanical waves: sonar and seismic prospection. Technological application for electromagnetic waves: radar.
  8. New cluster at CSC

     The CSC cluster will have 36 rack-mounted compute servers, each with 2 system boards and 4 16-core AMD processors per board. Every server will have 512 GB of RAM. In addition, there will be 2 servers with 16 nVidia Tesla GPUs each. In total: 4600 AMD CPU cores, 18 TB of DDR3 RAM, 32 nVidia Tesla GPUs, and about 48 TFLOPS of combined GPU+CPU performance (Top500 Rpeak). A 40 Gbps InfiniBand 4x QDR connection for each server, plus a separate 10 Gb Ethernet connection for resource administration and monitoring.
  9. SNCAD: Sistema Nacional de Cómputo de Alto Desempeño

     To share the large computing facilities acquired with public funding. To foster open access and visibility of the primary data and information produced by publicly funded research. To promote the overall optimization of the National Scientific and Technological Complex. To improve the efficiency of equipment usage and the quality of the services provided with it.
  10. Overview: Resource Information Policies

     More and more computing elements will be available to be interconnected. The scheduling process needs up-to-date information about the resources in the system. Any centralized point of failure must be avoided.
  11. Overview: Hardware Assisted Tera-scheduling

     More cores will be available (without any increase in clock speed). It will become more and more difficult to feed these cores. A change in the underlying architecture and (probably) the computing model will eventually come.
  12. Grid Computing

     Grid technologies allow transparently sharing geographically distributed resources.
  13. Grid Computing

     Grid technologies allow transparently sharing geographically distributed resources. To manage the resources efficiently, it is necessary to know their state and availability (Resource Monitoring and Discovery). [Diagram: monitoring and discovery queries against a node, e.g. MPI = true?, cpu = 4?, use = 20%.]
  14. Mechanisms

     An Index Service uses two mechanisms to share resource information between nodes. Push: A sends information to B. Poll: B requests information from A. [Diagram: resource information flowing between nodes A and B via poll (requests) and push (sends).] When configuring an Index Service, one must indicate its information providers and the nodes to which it must provide information. Resource information is associated with a lifetime and can become obsolete (e.g., CPU usage).
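The push/poll exchange and the lifetime attached to each entry can be sketched in a few lines (a minimal Python sketch; the class and method names are illustrative, not the actual Globus Index Service API):

```python
import time

class IndexService:
    """Toy index node: caches remote resource info with a timestamp."""

    def __init__(self, name, lifetime=300):
        self.name = name
        self.lifetime = lifetime      # seconds before a cached entry expires
        self.cache = {}               # host name -> (info, timestamp)

    def local_info(self):
        """The node's own resource description (values are placeholders)."""
        return {"cpu": 4, "use": 0.20}

    def receive(self, host, info):
        self.cache[host] = (info, time.time())

    def push(self, target):
        """Push mechanism: this node (A) sends its information to B."""
        target.receive(self.name, self.local_info())

    def poll(self, source):
        """Poll mechanism: this node (B) requests information from A."""
        self.receive(source.name, source.local_info())

    def fresh(self, host):
        """Return cached info only while it is still alive, else None."""
        info, ts = self.cache.get(host, (None, 0.0))
        return info if time.time() - ts < self.lifetime else None
```

Both mechanisms end in the same place (a timestamped cache entry); the difference is only who initiates the transfer, which is exactly what the policies on the next slides vary.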
  15. Policies

     A policy defines how each node communicates with the others and determines the way it sends and requests information. Two main groups: predefined hierarchy or peer-to-peer.
  16. Evaluation of system performance

     We propose a new metric that takes into account the associated lifetime of the information. LIR captures the amount of information that a particular host has about the entire grid at a single moment. For host k:

     LIR_k = \sum_{h=1}^{N} \frac{f(age_h, expiration_h) \cdot resourceCount_h}{totalResourceCount}

     where N is the number of hosts in the system, expiration_h is the expiration time of the resources of host h as stored in host k, age_h is the time passed since the information was obtained by node k, resourceCount_h is the amount of resources in host h, and totalResourceCount is the total amount of resources in the whole grid.
  17. Evaluation of system performance

     We propose a new metric that takes into account the associated lifetime of the information. LIR captures the amount of information that a particular host has about the entire grid at a single moment. For host k:

     LIR_k = \sum_{h=1}^{N} \frac{f(age_h, expiration_h) \cdot resourceCount_h}{totalResourceCount}

     where N is the number of hosts in the system, expiration_h is the expiration time of the resources of host h as stored in host k, age_h is the time passed since the information was obtained by node k, resourceCount_h is the amount of resources in host h, and totalResourceCount is the total amount of resources in the whole grid. GIR captures the amount of information that the whole grid knows; it is the mean value of every node's LIR.
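Both metrics follow directly from the definition (a Python sketch; the slides leave the freshness function f abstract, so the 0/1 age-versus-expiration cutoff used as a default here is an assumption):

```python
def lir(k, hosts, age, expiration, resource_count,
        f=lambda age, exp: 1.0 if age < exp else 0.0):
    """LIR for host k: fraction of the grid's resources whose
    information, as cached at k, is still alive under f.
    age[k][h]: time since k obtained info about h.
    expiration[k][h]: expiration time of h's resources at k."""
    total = sum(resource_count.values())
    return sum(f(age[k][h], expiration[k][h]) * resource_count[h]
               for h in hosts) / total

def gir(hosts, age, expiration, resource_count):
    """GIR: mean value of every node's LIR."""
    return sum(lir(k, hosts, age, expiration, resource_count)
               for k in hosts) / len(hosts)
```

With two hosts of 4 resources each, where host b's view of a has expired (age 500 > expiration 300), LIR_a = 1.0, LIR_b = 0.5 and GIR = 0.75.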
  18. GridMatrix

     GridMatrix is an application for simulating and analyzing the behavior of different information distribution policies. SimGrid2 is used as the simulation engine. It offers a graphical interface for designing the underlying network, allows automatically generating networks and measuring their main properties, and displays the simulation execution. Coded in C++, Qt and Python (multiplatform).
  19. Some Results

     GridMatrix2 allows the analysis of different network topologies using many different information propagation policies.
  20. Some Results

     GridMatrix2 allows the analysis of different network topologies using many different information propagation policies. As a case study, we analysed the clique network topology.
  21. Some Results

     GridMatrix2 allows the analysis of different network topologies using many different information propagation policies. As a case study, we analysed the clique network topology, testing the following policies: Random, Hierarchical, Super-Peer.
  22. Hierarchical

     [Diagram: hierarchy with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1, 1.2.2, 1.2.3, 1.2.3.1.] Each node has at most one parent and possibly several children (eventually none). Information flows from the nodes of the lower levels to the top level and back down. To ensure that the information reaches all the nodes, it must have enough lifetime to travel twice the height of the hierarchy.
  23. Random

     Each node is assigned a set of neighbors with whom it shares information. Nodes send resource information randomly to any neighbor. The information can be either sent or requested.
  24. Super-Peer

     The nodes are divided into disjoint subsets. In each subset, one of the nodes is marked as a super-peer, and a centralized scheme is formed using the super-peer as the central point. Among the super-peers, information is shared using a Random policy.
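The partition into disjoint subsets with one super-peer each can be sketched as follows (illustrative only; electing the first node of each subset as its super-peer is an assumption, the slides do not specify an election criterion):

```python
def super_peer_overlay(nodes, group_size):
    """Partition nodes into disjoint subsets of at most group_size
    and mark one node per subset as its super-peer.
    Returns: super-peer -> its subset (super-peer included)."""
    groups = [nodes[i:i + group_size]
              for i in range(0, len(nodes), group_size)]
    return {group[0]: group for group in groups}
```

Ordinary nodes then exchange information only with their super-peer (a centralized scheme per subset), while the super-peers gossip among themselves with the Random policy of the previous slide.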
  25. Basic network topology - Clique

     In this topology the longest path between two nodes is 2 hops, so any policy is expected to behave relatively well. [Plots: GIR vs. time (seconds) for the Hierarchical, Super-Peer and Random policies on clique topologies of 100 and 300 nodes.] Random shows a gradual fall related to the size of the system. Super-Peer has stable and acceptable performance. Hierarchical shows a performance fall in the larger system, matching Super-Peer.
  26. Basic network topology - Clique

     [Plot: GIR vs. time (seconds) in a clique for 2- and 3-level hierarchies with 100, 200 and 300 nodes; expiration time between 200 and 300 seconds.] For two levels, performance remained around 0.8 regardless of the number of nodes. With three levels and 200 or 300 nodes, a sharp fall is observed, due to the hierarchy construction algorithm.
  27. Basic network topology - Clique

     [Plot: GIR vs. time (seconds) in a clique for the 3-level hierarchical policy with 100 and 300 nodes and expiration times of 200, 300 and 400 seconds.] Different expiration times were used to evaluate their impact. The improvement is remarkable, despite the unbalanced hierarchical structure.
  28. Problem: single-core to multi-core

     Today we would like a faster processor, but we cannot increase the clock frequency. We need to run more instructions in parallel.
  29. Problem: single-core to multi-core

     Nowadays each chip contains more than one core. This is the only way to obtain more powerful processors, because the clock frequency is essentially fixed (overclocking aside). It is very difficult to use these processors efficiently. There is (or will be) a need for a new computing model to take advantage of the increasing number of cores in each chip.
  30. Instruction Level Parallelism - Current Techniques

     Out-of-order execution, pipelining, hyperthreading, superscalar execution.
  31. Instruction Level Parallelism - Current Techniques

     Out-of-order execution, pipelining, hyperthreading, superscalar execution. They operate only at the single-instruction level.
  32. Instruction Level Parallelism - Current Techniques

     Out-of-order execution, pipelining, hyperthreading, superscalar execution. They operate only at the single-instruction level and do not take into account the information from each thread.
  33. Improve Instruction Level Parallelism - Microthreads

     [Diagram: the main program on Processor 0 launches Microthreads 1-3 on Processors 1-3 through new/end events and a microthread queue.] One main thread collaborates with several short-lived threads (called microthreads). New instructions are added to the ISA for launching microthreads. Each microthread runs on a different core with a copy of the main thread's context, and runs until it terminates.
  34. Improve Instruction Level Parallelism - Microthreads

     [Diagram: the main program on Processor 0 launches Microthreads 1-3 on Processors 1-3 through new/end events and a microthread queue.] One main thread collaborates with several short-lived threads (called microthreads). New instructions are added to the ISA for launching microthreads. Each microthread runs on a different core with a copy of the main thread's context, and runs until it terminates. We need to do all of this very, very, VERY fast.
  35. Execution model and new instructions

     Assumption: we can count on a mechanism to launch a (micro)thread fast. Objectives: lower overhead allows us to exploit parallelism further, and it must be well supported in software.
  36. Execution model and new instructions

     Assumption: we can count on a mechanism to launch a (micro)thread fast. New primitives: launch and stop threads (mthRun, mthEnd); synchronization (waitForThreads). How is it done? We decompose a program into several parts (eventually this will be a compiler task, maybe user-assisted). Each part can be scheduled to run on a separate core. The cores have a mechanism for duplicating the registers from the master core to the others.
  37. Initial tests

     Sorting an array of 100 elements. This makes no sense with a conventional parallel mechanism (due to overhead). Serial solution: a heap sort (about 15000 instructions until finished). Parallel solution: divide the array into four parts, run a heap sort in a microthread for each part, then merge the four arrays (about 7000 instructions until finished).
  38. Initial tests

     Sorting instruction counts per core, single-threaded vs. microthreaded:

     Core     Single   Microthread
     Core 0    15095          7597
     Core 1        0          2651
     Core 2        0          2588
     Core 3        0          2672
  39. Conclusions

     Studies of resource information policies with very large grids in mind.
  40. Conclusions

     Studies of resource information policies with very large grids in mind. Planning a new architecture to support the increasing number of cores to come. Important: we have some resources and we are willing to share them. Think of student exchanges and short research stays. There are opportunities all around; we have to be smart to use them.
  41. Conclusions

     Studies of resource information policies with very large grids in mind. Planning a new architecture to support the increasing number of cores to come. Briefly: trying to stay a step ahead and understand the dynamics of the systems to come, at the small and the large scale. Important: we have some resources and we are willing to share them. Think of student exchanges and short research stays. There are opportunities all around; we have to be smart to use them.