4th Workshop on Resiliency in High Performance Computing
4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, in conjunction with the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), Bordeaux, France, August 29 to September 2, 2011
Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, each has hardware, management, and usage models particular to a different computational regime. For example, high performance cluster systems designed to support tightly coupled scientific simulation codes typically use high-speed interconnects, whereas commercial cloud systems designed to support software as a service (SaaS) do not. However, to support HPC, all of them must at least use large numbers of resources; hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.
Recent trends in HPC systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of these HPC systems increases from today's tera- and peta-scale to next-generation multi-peta-scale capability and beyond, their number of computational, networking, and storage components will grow from the ten-to-one-hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with the increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to their associated overheads, while proactive resiliency technologies, such as migration, simply fail because random soft errors cannot be predicted.
Moreover, soft errors may even remain undetected, resulting in silent data corruption.
Important Web sites:
Resilience 2011 at http://xcr.cenit.latech.edu/resilience2011
Euro-Par 2011 at http://europar2011.bordeaux.inria.fr
Prior conference Web sites:
Resilience 2010 at http://xcr.cenit.latech.edu/resilience2010
Resilience 2009 at http://xcr.cenit.latech.edu/resilience2009
Resilience 2008 at http://xcr.cenit.latech.edu/resilience2008
HS Algorithm - Hirschberg and Sinclair
This distributed algorithm was designed for the leader election problem in a synchronous ring.
The algorithm requires each process to have a unique identifier (UID). It works in phases: in phase k, each process that is still competing sends its UID outward in both directions. The message travels a distance of 2^k hops and is then routed back to the originating process. While a message is travelling outward, each receiving process compares the incoming UID with its own. If the incoming UID is greater than its own UID, it passes the message on; if the incoming UID is smaller, it does not pass the message on. At the end of a phase, a process decides whether to send messages in the next phase by checking whether it received both of its own messages back. The phases continue until a process receives its own outbound messages from both of its neighbors, meaning its UID has travelled all the way around the ring. At that point the process knows it has the largest UID in the ring and declares itself the leader.
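The rules above can be sketched as a round-by-round simulation. This is a minimal Python sketch, assuming a bidirectional ring where process i sends clockwise to process (i + 1) mod n and counter-clockwise to process (i - 1) mod n; the function name `hs_leader_election` and the `(uid, direction, hops)` message tuple are illustrative choices, not part of the original description. It simulates the synchronous rounds centrally rather than running real distributed processes.

```python
from typing import List, Optional, Tuple

# Hypothetical message format: (uid, direction, hops_left), direction is "out" or "in".
Message = Tuple[int, str, int]


def hs_leader_election(uids: List[int]) -> Tuple[int, int]:
    """Simulate HS on a synchronous bidirectional ring.

    Process i sends clockwise to (i + 1) % n and counter-clockwise to
    (i - 1) % n. Returns (index of the leader, number of rounds used).
    """
    n = len(uids)
    if n == 0 or len(set(uids)) != n:
        raise ValueError("need a non-empty ring with unique UIDs")
    phase = [0] * n
    # Phase 0: every process sends its UID 2**0 = 1 hop in both directions.
    send_cw: List[Optional[Message]] = [(u, "out", 1) for u in uids]
    send_ccw: List[Optional[Message]] = [(u, "out", 1) for u in uids]
    # returned[i][s]: process i got its own token back from side s (0=left, 1=right).
    returned = [[False, False] for _ in range(n)]
    rounds = 0
    while True:
        rounds += 1
        # Deliver all messages sent in the previous round simultaneously.
        from_left = [send_cw[(i - 1) % n] for i in range(n)]
        from_right = [send_ccw[(i + 1) % n] for i in range(n)]
        send_cw = [None] * n
        send_ccw = [None] * n
        for i, u in enumerate(uids):
            for side, msg in ((0, from_left[i]), (1, from_right[i])):
                if msg is None:
                    continue
                v, direction, hops = msg
                # "onward" keeps the direction of travel, "back" reverses it.
                onward = send_cw if side == 0 else send_ccw
                back = send_ccw if side == 0 else send_cw
                if direction == "out":
                    if v == u:
                        # Own UID went around the whole ring: it is the largest.
                        return i, rounds
                    if v > u:
                        if hops > 1:
                            onward[i] = (v, "out", hops - 1)
                        else:
                            back[i] = (v, "in", 1)  # distance reached: turn around
                    # v < u: the message is not passed on.
                elif v != u:
                    onward[i] = (v, "in", 1)  # relay inbound token to its origin
                else:
                    returned[i][side] = True
            if returned[i] == [True, True]:
                # Both tokens came back: next phase uses twice the distance.
                phase[i] += 1
                returned[i] = [False, False]
                send_cw[i] = (u, "out", 2 ** phase[i])
                send_ccw[i] = (u, "out", 2 ** phase[i])
```

For example, `hs_leader_election([3, 7, 2, 9, 5])` elects index 3, the process holding UID 9. Because all surviving processes start each phase in the same round, each output slot is written by at most one message per round, which is why a single slot per direction suffices in this sketch.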
OpenMP News
Parallel Programming in Computational Engineering and Science PPCES 2011
Aachen, Germany http://www.rz.rwth-aachen.de/ppces
This year’s seminar will include a special introduction session on Monday to present the new HPC-cluster to be delivered by Bull. During the remainder of the week, we will cover Serial Programming, Tuning, Debugging and Processor Architectures (Tuesday), Shared Memory Programming with OpenMP (Wednesday), Message Passing with MPI (Thursday) and GPGPU Programming on Friday. Some of these lectures will feature hands-on sessions.
Attendees should be comfortable with C/C++ or Fortran programming and interested in learning more about the technical details of application tuning and parallelization on their favored platform (Windows or Linux). The presentations will be given in English.
Speakers for our Monday event are Dieter an Mey (RWTH), Thomas Warschko (Bull), Herbert Cornelius (Intel), Jean-Pierre Panziera (Bull), Christian Bischof (RWTH) and Felix Wolf (German Research School for Simulation Science). The remainder of the week will be covered by Ruud van der Pas (Oracle), Michael Wolfe (PGI) and speakers of the HPC Team of the RWTH Aachen University.
The seminar is free. Allocation is on a first come, first served basis, as we are limited in capacity. Please register separately for any session you intend to participate in. Go to http://www.rz.rwth-aachen.de/ppces for more information.
The registration deadline is March 14, 2011.
The event is sponsored by Intel, Microsoft and Bull.
Measuring OpenMP Performance
WHAT IS WATSON?
Now with High Performance Computing
High-performance computing (HPC) is a term that arose after the term "supercomputing." HPC is sometimes used as a synonym for supercomputing; but, in other contexts, "supercomputer" is used to refer to a more powerful subset of "high-performance computers," and the term "supercomputing" becomes a subset of "high-performance computing." The potential for confusion over the use of these terms is apparent.
Top 500
A list of the most powerful high-performance computers can be found on the TOP500 list. The TOP500 list ranks the world's 500 fastest high-performance computers, as measured by the High Performance Linpack (HPL) benchmark. Not all computers are listed, either because they are ineligible (e.g., they cannot run the HPL benchmark) or because their owners have not submitted an HPL score (e.g., because they do not wish the size of their system to become public information for defense reasons). In addition, the use of the single Linpack benchmark is controversial, in that no single measure can test all aspects of a high-performance computer. To help overcome the limitations of the Linpack test, the U.S. government commissioned one of its originators, Dr. Jack Dongarra of the University of Tennessee, to create a suite of benchmark tests that includes Linpack and others, called the HPC Challenge benchmark suite. This evolving suite has been used in some HPC procurements, but, because it is not reducible to a single number, it has been unable to overcome the publicity advantage of the less useful TOP500 Linpack test. The TOP500 list is updated twice a year, once in June at the ISC European Supercomputing Conference and again at the US Supercomputing Conference in November.
Many ideas for the new wave of grid computing were originally borrowed from HPC.
Learning OpenMP
Dynamic Programming
Dynamic programming is one of the important techniques that we have to understand in this course. In dynamic programming, as in the divide and conquer approach, the problem is divided into smaller subproblems and the solutions to each subproblem are combined to get the solution to the original problem. But, unlike the divide and conquer technique, dynamic programming is usually applied when there are overlapping subproblems (subproblems share subsubproblems). In these cases, the advantage of dynamic programming is that the subproblems are only solved once, storing the optimal value for each subproblem in a table.
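To make the idea of overlapping subproblems concrete, here is a small Python sketch using the Fibonacci numbers. This example is our own choice, not from the text, and it is not an optimization problem; it simply shows the simplest case of subproblems sharing subsubproblems. The naive recursion re-solves the same subproblems many times, while the memoized version stores each answer in a table (a dict) and solves each subproblem only once. The function names are illustrative.

```python
from typing import Dict, Tuple


def fib_naive(n: int) -> Tuple[int, int]:
    """Plain recursion. Returns (F(n), number of subproblem instances solved)."""
    if n < 2:
        return n, 1
    a, calls_a = fib_naive(n - 1)
    b, calls_b = fib_naive(n - 2)
    # F(n-1) and F(n-2) both recompute F(n-3), F(n-4), ... from scratch.
    return a + b, calls_a + calls_b + 1


def fib_memo(n: int, memo: Dict[int, int]) -> int:
    """Memoized recursion: each F(k) is computed once and stored in memo."""
    if n not in memo:
        memo[n] = n if n < 2 else fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]
```

For n = 20, `fib_naive(20)` returns `(6765, 21891)`: 21,891 subproblem instances are solved. Calling `fib_memo(20, memo)` with an empty dict gives the same value 6765, but `memo` ends up with only 21 entries, one for each distinct subproblem F(0) through F(20).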
The process of developing an algorithm following the dynamic programming approach can be divided into four steps:
· Find the optimal substructure of an optimal solution to the problem.
· Express the value of an optimal solution recursively.
· Compute the value of an optimal solution bottom-up.
· Construct the optimal solution from the information computed in step 3.
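The four steps can be illustrated with the rod-cutting problem. This is a standard textbook example, not one taken from the text above: given a price for each piece length, decide how to cut a rod of length n to maximize total revenue. The sketch below is a minimal Python version. The name `cut_rod` and the price list used in the usage note are illustrative assumptions.

```python
from typing import List, Tuple


def cut_rod(prices: List[int], n: int) -> Tuple[int, List[int]]:
    """prices[i] is the (non-negative) price of a piece of length i; prices[0] is unused.

    Returns (best revenue, list of piece lengths achieving it).
    """
    # Step 1 (optimal substructure): an optimal cutting of length j is a
    # first piece of length i plus an optimal cutting of the remaining j - i.
    # Step 2 (recursive value): r[j] = max over i of prices[i] + r[j - i], r[0] = 0.
    r = [0] * (n + 1)  # r[j]: best revenue for length j
    s = [0] * (n + 1)  # s[j]: length of the first piece in a best cutting
    # Step 3 (bottom-up): solve lengths 1, 2, ..., n in increasing order.
    for j in range(1, n + 1):
        best = -1
        for i in range(1, min(j, len(prices) - 1) + 1):
            if prices[i] + r[j - i] > best:
                best = prices[i] + r[j - i]
                s[j] = i
        r[j] = best
    # Step 4 (construct): follow the recorded first-piece choices.
    pieces = []
    j = n
    while j > 0:
        pieces.append(s[j])
        j -= s[j]
    return r[n], pieces
```

With the illustrative prices `[0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30]`, `cut_rod(prices, 4)` returns `(10, [2, 2])` and `cut_rod(prices, 8)` returns `(22, [2, 6])`. The table `s` is the "information computed in step 3" that makes step 4 a simple walk instead of a second search.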
Minimum spanning tree
Given a connected, undirected graph, a spanning tree of that graph is a subgraph that is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, which is a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree (MST) or minimum weight spanning tree is then a spanning tree with weight less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its connected components.
One example would be a cable TV company laying cable to a new neighborhood. If it is constrained to bury the cable only along certain paths, then there would be a graph representing which points are connected by those paths. Some of those paths might be more expensive, because they are longer, or require the cable to be buried deeper; these paths would be represented by edges with larger weights. A spanning tree for that graph would be a subset of those paths that has no cycles but still connects to every house. There might be several spanning trees possible. A minimum spanning tree would be one with the lowest total cost.
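The text does not name an algorithm for finding an MST; one standard choice is Kruskal's algorithm, sketched below in Python as a swapped-in technique. It sorts the edges by weight and adds each edge that joins two different components, tracked with a union-find structure, so no cycle is ever formed. On a disconnected graph the same loop produces a minimum spanning forest. The `(weight, u, v)` edge format and the neighborhood data in the usage note are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # hypothetical format: (weight, u, v)


def kruskal(num_vertices: int, edges: List[Edge]) -> Tuple[int, List[Edge]]:
    """Return (total weight, chosen edges) of a minimum spanning forest."""
    parent = list(range(num_vertices))  # union-find: each vertex starts alone

    def find(x: int) -> int:
        # Walk to the component's root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0
    chosen: List[Edge] = []
    for w, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # joins two components, so it cannot close a cycle
            parent[root_u] = root_v
            chosen.append((w, u, v))
            total += w
    return total, chosen
```

As a hypothetical cable layout, let vertex 0 be the cable source and vertices 1 to 4 be houses, with possible paths `[(4, 0, 1), (1, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3), (7, 3, 4), (6, 2, 4)]`. Then `kruskal(5, edges)` picks the paths of cost 1, 2, 3 and 6, for a total cost of 12 using 4 edges, the minimum needed to connect 5 points.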