Sunday, December 9, 2012

PARALLEL UNIVERSE


Processor technology has moved from single-processor to multiprocessor systems, with two variants: either many individual processors connected together, or many processors embedded on a single chip, popularly known as Chip Multiprocessors (CMP). I am sure the system on which you are viewing this blog has a CMP architecture. The former variant, generally found in supercomputers, connects many individual processors via some interconnection network and makes them communicate to execute a particular task. These two classes of processors form the base for High Performance Computing. The two architectures differ greatly in the way they perform and also in the view they present to the programmer, giving rise to a new form of computing known as parallel computing or parallel programming, which is the root power of High Performance Systems. The way you have programmed so far, in C, C++, Java, etc., follows the sequential programming model, where you think of a single processing unit with memory around it feeding data to and getting data from that unit. The shocking news is that the world is slowly moving to parallel programming models, in which the programmer has to think about many processing units either sharing a memory system or having the memory distributed among them, corresponding to the two variants of processor collaboration mentioned above. So get ready to face this shift.

Parallel Programming:
Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel") since there are many processing units to work on the many sub-problems. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors.

Types of Parallelism:

Bit-level parallelism:
Word size (your processor being labelled 32-bit or 64-bit) is the amount of information the processor can manipulate per cycle, and it has a great implication on the speed of the processor. Increasing the word size reduces the number of instructions the processor must execute to complete a task. For example, where an 8-bit processor must add two 16-bit integers, it must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, whereas a 16-bit processor would be able to complete the operation with a single instruction.
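To make the add-with-carry step concrete, here is a minimal sketch in C that simulates the two-instruction sequence an 8-bit processor would need (the function name and the test values are just for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit integers the way an 8-bit processor would:
   one ADD for the low bytes, one ADD-with-carry for the high bytes. */
static uint16_t add16_on_8bit(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint16_t lo_sum = (uint16_t)a_lo + b_lo;   /* standard addition instruction */
    uint8_t  carry  = lo_sum > 0xFF;           /* carry bit from the low-order add */
    uint8_t  hi_sum = a_hi + b_hi + carry;     /* add-with-carry instruction */

    return ((uint16_t)hi_sum << 8) | (lo_sum & 0xFF);
}

int main(void)
{
    printf("%u\n", add16_on_8bit(300, 500));   /* prints 800 */
    return 0;
}
```

A 16-bit (or wider) processor does the same work with a single add instruction.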
Instruction-level parallelism:
A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s. Instruction-level parallelism is realized by pipelines in the processor architecture; think of the assembly line at an automobile manufacturing plant. At a particular instant, the number of instructions in execution will equal the number of pipeline stages on a single-issue processor, which issues one instruction per cycle. On a double-issue processor, this number will be double the number of pipeline stages, and so on. Processors capable of issuing more than one instruction per clock cycle are known as superscalar processors.
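A tiny C example may help show what "independent instructions" means to a superscalar core; the values are arbitrary and the comments describe what typical hardware could do, not what any particular processor is guaranteed to do:

```c
#include <stdio.h>

int main(void)
{
    int a = 1, b = 2, c = 3, d = 4;

    /* Independent additions: neither result is needed by the other,
       so a superscalar core can issue both in the same cycle. */
    int e = a + b;
    int f = c + d;

    /* Dependent chain: each addition needs the previous result,
       so these cannot be overlapped and must flow through the
       pipeline one after another. */
    int g = e + f;
    int h = g + a;

    printf("%d %d %d %d\n", e, f, g, h);
    return 0;
}
```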

Data parallelism:
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different computing nodes to be processed in parallel. Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure. Many scientific and engineering applications exhibit data parallelism.
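A minimal sketch of loop-level data parallelism using OpenMP in C (the array size and variable names are illustrative; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data parallelism: the iterations are independent, so the
       index range can be split across the available cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("using up to %d threads, c[N-1] = %f\n",
           omp_get_max_threads(), c[N - 1]);
    return 0;
}
```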

Task parallelism:
            Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.
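For contrast, here is a hedged OpenMP sketch of task parallelism in C: two entirely different calculations (a sum and a maximum, chosen only for illustration) run as separate sections over the same data set:

```c
#include <stdio.h>

/* Two different calculations over the same data. */
static double sum(const double *x, int n)
{
    double s = 0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

static double maxval(const double *x, int n)
{
    double m = x[0];
    for (int i = 1; i < n; i++) if (x[i] > m) m = x[i];
    return m;
}

int main(void)
{
    enum { N = 1000000 };
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = i % 97;

    double s = 0, m = 0;

    /* Task parallelism: each section is a different computation,
       potentially running on a different core. */
    #pragma omp parallel sections
    {
        #pragma omp section
        s = sum(data, N);

        #pragma omp section
        m = maxval(data, N);
    }

    printf("sum = %f, max = %f\n", s, m);
    return 0;
}
```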

Parallel Computer Classes:
Keep in mind that the classes given below are not mutually exclusive.
Multicore computing:
A multicore processor is a processor that includes multiple execution units ("cores") on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); in contrast, a multicore processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multicore processor can potentially be superscalar as well—that is, on every cycle, each core can issue multiple instructions from one instruction stream. Simultaneous multithreading (of which Intel's HyperThreading is the best known) was an early form of pseudo-multicoreism. A processor capable of simultaneous multithreading has only one execution unit ("core"), but when that execution unit is idling (such as during a cache miss), it uses that execution unit to process a second thread. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is another prominent multicore processor.
Symmetric multiprocessing:
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. "Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that a sufficient amount of memory bandwidth exists."
Distributed computing:
A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.
Cluster computing:
A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network. Beowulf technology was originally developed by Thomas Sterling and Donald Becker. The vast majority of the TOP500 supercomputers are clusters.
Massively parallel processing:
A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect.”

Blue Gene/L, the fifth fastest supercomputer in the world according to the June 2009 TOP500 ranking, is an MPP.
Grid computing:
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applications have been created, of which SETI@home and Folding@home are the best-known examples. Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.

Wednesday, October 3, 2012

Cluster vs Grid vs Cloud - Faces of HPC


Nowadays nearly every field requires high performance computing for its cutting-edge progress. With a general definition and outlook, HPC may seem very expensive to afford, which is where Cluster Computing (HPC with commodity components) comes into play. Beowulf and HPVM are some of the participants of the cluster community. There are also two other HPC technologies, Grid and Cloud. By definition, and in the general context, all three technologies seem to mean the same thing: “Get extraordinary computing power by combining individual computing components together.” The main aim of writing this blog is to give a clear picture of where these three HPC technologies stand apart. Let’s start delving into a detailed comparison, beginning with the Cluster.

Cluster Computing:

You can perform HPC in your home with a Cluster. Yes (perhaps with some effort). A cluster is a cheap way of gaining huge computing power. A cluster is a collection of commodity computers connected together by a high-speed network such as Myrinet, SCM or InfiniBand, which then coordinate among themselves to act as one parallel machine pushing out enormous HPC power. To put it in an easy context, a simple high-speed LAN, with all the nodes coordinating and communicating among themselves to solve a problem, is a Cluster. All cluster nodes are tightly coupled. No matter how many systems participate in a cluster, they present a Single System Image (SSI). Clusters are known for their high availability and load balancing along with excellent compute power. They also comprise redundant nodes which are used as backups in case of node failure. Tasks are distributed among all the standalone computers of the cluster in a balanced way. More emphasis is given to the computational load than to other ancillary activities. A unique layer known as ‘middleware’ is required for providing the SSI. An application destined to run on a cluster must be programmed with a task-management routine that divides a particular task among the computers of the cluster.
Clusters are used in various streams like fluid dynamics, life sciences, nuclear simulations, data mining, image processing, aerodynamics and astrophysics, which are popularly known as grand challenge applications. They are used in a lot of commercial applications too.
Condor, Nimrod, PARAMON, MPI and OpenMP are some of the projects and tools used to achieve high performance in cluster computing; a minimal MPI sketch of dividing a task across cluster nodes follows.
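MPI is the usual way such a task-management routine is written by hand. The sketch below is a minimal illustration in C, not a production cluster application: it divides a vector sum (the array size and variable names are made up for this example) across however many processes are launched, and combines the partial results on rank 0:

```c
#include <stdio.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double data[N];
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++) data[i] = 1.0;   /* each rank fills its own copy */

    /* Divide the index range among the processes (one or more per node). */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;

    double partial = 0.0;
    for (int i = lo; i < hi; i++) partial += data[i];

    /* Combine the partial results on rank 0. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched with mpirun across the nodes of the cluster, the same binary runs on every node, and the MPI library plus the job launcher take care of the communication.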

Grid Computing:

Grid computing combines computers from multiple administrative domains to reach a common goal, to solve a single task, and may then disappear just as quickly. It is analogous to the power grid, which gives power to your home appliances on demand. More than one cluster can combine to form a Grid. For any application, a middleware layer apportions the tasks of that application among the several compute nodes of the Grid. To put it in a simple context, HPC appears when needed and disappears when the work is completed. The size of a Grid may vary from a small network within a company to large collaborations of enterprise networks across the world. All the nodes participating in a Grid are loosely coupled and heterogeneous in nature, which makes programming grid applications a lot more complex. Resources in a Grid are owned by many organizations, which also places a burden on system administration.
Globus, EGI-InSPIRE(European Grid Infrastructure - Integrated Sustainable Pan-European Infrastructure for Research in Europe), Information Power Grid, GriPhyN, NEESGrid, National Technology Grid, Particle Physics Data Grid are some of the Grid projects around the globe.

Cloud Computing:

Yes, piracy is a sin. But paying a heavy amount to buy software which is used only once or twice a month is not a fair deal either, is it? This is where Cloud Computing becomes significant. We can use the required software by paying for using it rather than for buying it. A good analogy for Cloud Computing would be hiring a taxi rather than buying a car for a simple short trip. We agree to an SLA (Service Level Agreement) which specifies various pay schemes for using particular software. This face of Cloud Computing is known as Software as a Service (SaaS). Likewise, we can also rent processing power, storage and networking according to our use as Infrastructure as a Service (IaaS). A high-level integrated environment for building, testing and deploying online applications is offered as Platform as a Service (PaaS).



Wednesday, September 5, 2012

INDIAN SUPERCOMPUTERS

Due to the restrictions imposed on India on importing Cray supercomputers, on the grounds that they could be used for modelling nuclear weapons, India was pushed to design its own supercomputer to show the world its potential to compete in technology. For the purpose of achieving self-sufficiency in the field, C-DAC (Centre for Development of Advanced Computing) was established in Pune by the Department of Electronics in 1988. Vijay Bhatkar was hired as the Director of C-DAC. The project was given an initial run of 3 years and an initial funding of INR 30,00,00,000, as the same amount of money and time was usually expended to secure the purchase of a supercomputer from the US. In 1990, a prototype was produced and was benchmarked at the 1990 Zurich Supercomputing Show. It surpassed most other systems, placing India second after the US. The final result of the effort was the PARAM 8000, which was installed in 1991.
C-DAC Terascale Supercomputing Facility at Bangalore.
PARAM 8000 is considered India's first supercomputer. As of June 2012, India has 5 systems on the Top500 list, ranking 58, 86, 129, 224 and 380.

PARAM 8000:



Unveiled in 1991, the PARAM 8000 used Inmos 8000 transputers. Transputers were a fairly new and innovative microprocessor architecture designed for parallel processing at the time. It used a distributed-memory MIMD architecture with a reconfigurable interconnection network. It had 64 CPUs.

The transputer (the name deriving from transistor and computer) was the first general-purpose microprocessor designed specifically to be used in parallel computing systems. The goal was to produce a family of chips ranging in power and cost that could be wired together to form a complete parallel computer. The name was selected to indicate the role the individual transputers would play: numbers of them would be used as basic building blocks, just as transistors had earlier.

MIMD (multiple instruction, multiple data) is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data.

PARAM 8600:


PARAM 8600 was an improvement over the PARAM 8000. It was a 256-CPU computer. For every four Inmos 8000 transputers, it employed an Intel i860 co-processor. The result was over 5 GFLOPS at peak for vector processing. Several of these models were exported.

PARAM 9900/SS:



PARAM 9900/SS was designed to be an MPP system. It used the SuperSPARC II processor. The design was made modular so that newer processors could be easily accommodated. Typically it used 32-40 processors, but it could be scaled up to 200 CPUs using the Clos network topology. PARAM 9900/US was the UltraSPARC variant and PARAM 9900/AA was the DEC Alpha variant.

MPP (Massively Parallel Processing) refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel.
Clos network (used for communication between 200 CPUs) is a kind of multistage circuit switching network, first formalized by Charles Clos in 1953, which represents a theoretical idealization of practical multi-stage telephone switching systems. Clos networks are required when the physical circuit switching needs exceed the capacity of the largest feasible single crossbar switch.


PARAM 10000:


In 1998, the PARAM 10000 was unveiled. The PARAM 10000 used several independent nodes, each based on the Sun Enterprise 250 server, and each such server contained two 400 MHz UltraSPARC II processors. The base configuration had three compute nodes and a server node. The peak speed of this base system was 6.4 GFLOPS. A typical system would contain 160 CPUs and be capable of 100 GFLOPS, but it was easily scalable to the TFLOPS range.
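As a rough back-of-the-envelope check (assuming each UltraSPARC II retires two floating-point operations per cycle and counting the server node's CPUs as well, both of which are assumptions rather than figures from the source), the quoted base-system peak works out as:

```latex
4 \text{ nodes} \times 2\ \tfrac{\text{CPUs}}{\text{node}} \times 400\ \text{MHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} = 6.4\ \text{GFLOPS}
```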



PARAM Padma:

PARAM Padma was introduced in April 2003. It had a peak speed of 1024 GFLOPS (about 1 TFLOPS) and a peak storage of 1 TB. It used 248 IBM Power4 CPUs of 1 GHz each. The operating system was IBM AIX 5.1L. It used PARAMnet II as its primary interconnect. It was the first Indian supercomputer to break the 1 TFLOPS barrier.

PARAMnet is a high-speed, high-bandwidth, low-latency network developed for the PARAM series. The original PARAMnet used an 8-port cascadable non-blocking switch developed by C-DAC. Each port provided 400 Mbit/s in both directions (thus 2x400 Mbit/s), as it was a full-duplex network. It was first used in the PARAM 10000.

PARAMnet II, introduced with PARAM Padma, is capable of 2.5 Gbit/s while working full-duplex. It supports interfaces like the Virtual Interface Architecture and Active Messages. It uses 8- or 16-port SAN switches. The grid computing network GARUDA is also based on it.


PARAM Yuva:




PARAM Yuva was unveiled in November 2008. It has a maximum sustainable speed (Rmax) of 38.1 TFLOPS and a peak speed (Rpeak) of 54 TFLOPS. It has 4608 cores, based on the Intel Xeon 73XX running at 2.9 GHz. It has a storage capacity of 25 TB, scalable up to 200 TB. It uses PARAMnet 3 as its primary interconnect.

SAGA – 220:


SAGA-220 is a supercomputer built by the Indian Space Research Organisation (ISRO). As of May 2011, it is the fastest supercomputer in the nation, with a maximum theoretical speed of 220 TFLOPS. The name SAGA-220 stands for Supercomputer for Aerospace with GPU Architecture - 220 teraflops. It was built using commercially available hardware, open-source software components and in-house developments. The system uses 400 NVIDIA Tesla C2070 GPUs and 400 Intel Quad Core Xeon CPUs supplied by WIPRO. Each NVIDIA Tesla C2070 GPU is capable of delivering 515 gigaflops, compared to the Xeon CPU’s more modest contribution of 50 gigaflops. The system cost about INR 140,000,000 to build and consumes only about 150 kW.
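As a rough check, the quoted per-device peaks add up to roughly the system's theoretical figure (the difference from exactly 220 TFLOPS presumably comes from rounding in the per-device numbers; this is an estimate, not an official breakdown):

```latex
400 \times 515\ \text{GFLOPS} + 400 \times 50\ \text{GFLOPS} \approx 206\ \text{TFLOPS} + 20\ \text{TFLOPS} \approx 226\ \text{TFLOPS}
```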


A graphics processing unit (GPU), also occasionally called visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles.

EKA:



EKA is a supercomputer built by the Computational Research Laboratories (a subsidiary of Tata Sons) with technical assistance and hardware provided by Hewlett-Packard.
EKA uses 14,352 cores based on Intel Quad-Core Xeon processors. The primary interconnect is InfiniBand 4x DDR. EKA occupies about 4,000 sq. ft. of floor area. It was built using off-the-shelf components from Hewlett-Packard, Mellanox and Voltaire Ltd., within a short period of 6 weeks.
At the time of its unveiling, it was the 4th fastest supercomputer in the world and the fastest in Asia. As of 16 September 2011, it is ranked at 58.


Wipro SUPERNOVA:


The product is offered in 3 segments: entry-level, mid-segment and high-end, which have varying performance and storage capacities. The entry-level system costs INR 2,500,000, performs at 1 TFLOPS and has a storage capacity of 4 TB. The systems use the Gluster software stack.
The GlusterFS architecture aggregates compute, storage, and I/O resources into a global namespace. Each server plus attached commodity storage (configured as DAS, JBOD, or SAN) is considered to be a node. Capacity is scaled by adding additional nodes or adding additional storage to each node. Performance is increased by deploying storage among more nodes. High availability is achieved by replicating data n-way between nodes.


Friday, August 31, 2012

The Rise of Extraordinary Computers


Today supercomputing is considered an important backbone of almost all scientific domains, without which the world you see around you would just be a dream. Where did the roots of such exemplary power rise from? To read the history of supercomputing, one should go back to the 1960s, when a legendary man named Seymour Cray lived with his unquenchable thirst for designing extremely powerful computers. He is affectionately known as the ‘Father of Supercomputing’. The CDC (Control Data Corporation) 6600, released in 1964 and generally considered the first supercomputer, was one of his innovations.
The Beginning:
As said above, the supercomputing era began around the 1960s, when the world saw the release of one of its biggest dreams, the CDC 6600. In 1960, Cray had completed the CDC 1604, one of the first solid-state computers, and the fastest computer in the world at a time when vacuum tubes were found in most large computers.
The term solid state means that the computer is built from semiconductors. It is used to mark the transition of computing systems from vacuum tubes to semiconductor devices.
Around 1960, Cray decided to design a computer that would be the fastest in the world by a far greater margin than the 1604. After four years of experimentation along with Jim Thornton, Dean Roush and about 30 other engineers, Cray completed the CDC 6600 in 1964. Given that the 6600 outran all computers of the time by about 10 times, it was dubbed a supercomputer and defined the supercomputing market when one hundred computers were sold at $8 million each. The 6600 gained speed by "farming out" work to peripheral computing elements, freeing the CPU (Central Processing Unit) to process actual data. The Minnesota FORTRAN compiler for the machine was developed by Liddiard and Mundstock at the University of Minnesota, and with it the 6600 could sustain 500 kFLOPS on standard mathematical operations. In 1968, Cray completed the CDC 7600, again the fastest computer in the world. At 36 MHz, the 7600 had about three and a half times the clock speed of the 6600, but ran significantly faster due to other technical innovations. Cray left CDC in 1972 to form his own company. Two years after his departure, CDC delivered the STAR-100, which at 100 megaflops was three times the speed of the 7600. Along with the Texas Instruments ASC, the STAR-100 was one of the first machines to use vector processing, the idea having been inspired around 1964 by the APL programming language.
The CRAY Era:
Around 1976, Cray delivered the 80 MHz Cray-1, and it became one of the most successful supercomputers in history. The Cray-1 was a vector processor which introduced a number of innovations such as chaining, in which scalar and vector registers generate interim results which can be used immediately, without additional memory references that would reduce computational speed.
Chaining is a technique used in computer architecture in which scalar and vector registers generate intermediate results which can be used immediately, without additional memory references that would reduce computational speed.
A vector processor, also known as an array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items.
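A classic example of the kind of loop vector processors (and chaining) were built for is SAXPY. On a Cray-style machine, the multiply and the add over whole vector registers can be chained; the plain C below only shows the operation itself, with illustrative sizes:

```c
#include <stdio.h>

#define N 1024

/* y = a*x + y: on a vector processor this becomes a few vector
   instructions (load, multiply, add, store) over vector registers;
   a scalar processor runs it one element at a time. */
static void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(N, 3.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* 5.0 */
    return 0;
}
```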
In 1982, the Cray X-MP, a 105 MHz shared-memory parallel vector processor, was released with better chaining support and multiple memory pipelines (a concept of overlapping the execution of instructions). All three floating-point pipelines on the X-MP could operate simultaneously. The Cray-2, released in 1985, was a 4-processor liquid-cooled computer, and Fluorinert was pumped through it as it operated. It could perform at 1.9 gigaflops and was the world's fastest until 1990, when the ETA-10G from CDC overtook it. The Cray-2 was a totally new design that did not use chaining and had high memory latency, but used much pipelining and was ideal for problems that required large amounts of memory. The software costs in developing a supercomputer should not be underestimated, as evidenced by the fact that in the 1980s the cost of software development at Cray came to equal what was spent on hardware. That trend was partly responsible for a move away from the in-house Cray Operating System to UNICOS (UNIx-based Cray Operating System), based on Unix. The Cray Y-MP, also designed by Steve Chen, was released in 1988 as an improvement over the X-MP and could have eight vector processors at 167 MHz with a peak performance of 333 megaflops per processor. In the late 1980s, Cray's experiment with the use of gallium arsenide semiconductors in the Cray-3 did not succeed. Cray began to work on a massively parallel computer in the early 1990s, but died in a car accident in 1996 before it could be completed.
The Massive Processing Era:
The Cray-2, which set the frontiers of supercomputing in the mid-to-late 1980s, had only 8 processors. In the 1990s, supercomputers with thousands of processors began to appear. Another development at the end of the 1980s was the arrival of Japanese supercomputers, some of which were modelled after the Cray-1. The SX-3/44R was announced by NEC Corporation in 1989, and a year later it earned the fastest-in-the-world title with a 4-processor model. However, Fujitsu's Numerical Wind Tunnel supercomputer used 166 vector processors to gain the top spot in 1994. It had a peak speed of 1.7 gigaflops per processor. The Hitachi SR2201, on the other hand, obtained a peak performance of 600 gigaflops in 1996 by using 2048 processors connected via a fast three-dimensional crossbar network. In the same timeframe, the Intel Paragon could have 1000 to 4000 Intel i860 processors in various configurations, and was ranked the fastest in the world in 1993. The Paragon was a MIMD (Multiple Instruction, Multiple Data) machine which connected processors via a high-speed two-dimensional mesh, allowing processes to execute on separate nodes, communicating via the Message Passing Interface (a technique used to pass data between processors). By 1995, Cray was also shipping massively parallel systems, e.g. the Cray T3E with over 2,000 processors, using a three-dimensional torus interconnect.
Interconnects, in their various forms like mesh and torus, are ways of connecting a number of processors into a network so that they can communicate with one another.
 The Paragon architecture soon led to the Intel ASCI Red supercomputer which held the top supercomputing spot to the end of the 20th century as part of the Advanced Simulation and Computing Initiative. This was also a mesh-based MIMD massively-parallel system with over 9,000 compute nodes and well over 12 terabytes of disk storage, but used off-the-shelf Pentium Pro processors that could be found in everyday personal computers. ASCI Red was the first system ever to break through the 1 teraflop barrier on the MP-Linpack benchmark in 1996; eventually reaching 2 teraflops.
The PETAFLOP Computing Era:
The 21st century saw significant progress, and it was shown that the power of a large number of small processors can be harnessed to achieve high performance, e.g. as in System X's use of 1,100 Apple Power Mac G5 computers, quickly assembled in the summer of 2003, to reach 12.25 teraflops. The efficiency of supercomputers continued to increase, but not dramatically so. The Cray C90 used 500 kilowatts of power in 1991, while by 2003 the ASCI Q used 3,000 kW while being 2,000 times faster, increasing the performance per watt about 300-fold. In 2004, the Earth Simulator supercomputer, built by NEC at the Japan Agency for Marine-Earth Science and Technology (JAMSTEC), reached 131 teraflops, using 640 nodes, each with eight proprietary vector processing chips. The IBM Blue Gene supercomputer architecture found widespread use in the early part of the 21st century, and 27 of the computers on the TOP500 list used that architecture. The Blue Gene approach is somewhat different in that it trades processor speed for low power consumption, so that a larger number of processors can be used at air-cooled temperatures. It can use over 60,000 processors, with 2048 processors "per rack", and connects them via a three-dimensional torus interconnect. Progress in China has been rapid: China placed 51st on the TOP500 list in June 2003, then 14th in November 2003, 10th in June 2004 and 5th during 2005, before gaining the top spot in 2010 with the 2.5 petaflop Tianhe-1A supercomputer. In July 2011, the 8.1 petaflop Japanese K computer became the fastest in the world, using over 60,000 commercial scalar SPARC64 VIIIfx processors housed in over 600 cabinets. The fact that the K computer is over 60 times faster than the Earth Simulator, and that the Earth Simulator ranks as the 68th system in the world 7 years after holding the top spot, demonstrates both the rapid increase in top performance and the widespread growth of supercomputing technology worldwide.
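The roughly 300-fold figure for performance per watt follows directly from the two numbers just quoted:

```latex
\frac{\text{speedup}}{\text{power ratio}} = \frac{2000}{3000\ \text{kW} / 500\ \text{kW}} = \frac{2000}{6} \approx 333
```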

This is a list of the computers which appeared at the top of the Top500 list since 1993.




Year | Supercomputer | Peak speed | Location
1993 | Fujitsu Numerical Wind Tunnel | 124.50 GFLOPS | National Aerospace Laboratory, Tokyo, Japan
1993 | Intel Paragon XP/S 140 | 143.40 GFLOPS | DoE-Sandia National Laboratories, New Mexico, USA
1994 | Fujitsu Numerical Wind Tunnel | 170.40 GFLOPS | National Aerospace Laboratory, Tokyo, Japan
1996 | Hitachi SR2201/1024 | 220.4 GFLOPS | University of Tokyo, Japan
1996 | Hitachi CP-PACS/2048 | 368.2 GFLOPS | University of Tsukuba, Tsukuba, Japan
1997 | Intel ASCI Red/9152 | 1.338 TFLOPS | DoE-Sandia National Laboratories, New Mexico, USA
1999 | Intel ASCI Red/9632 | 2.3796 TFLOPS | DoE-Sandia National Laboratories, New Mexico, USA
2000 | IBM ASCI White | 7.226 TFLOPS | DoE-Lawrence Livermore National Laboratory, California, USA
2002 | NEC Earth Simulator | 35.86 TFLOPS | Earth Simulator Center, Yokohama, Japan
2004 | IBM Blue Gene/L | 70.72 TFLOPS | -
2005 | IBM Blue Gene/L | 136.8 TFLOPS | DoE/U.S. National Nuclear Security Administration, Lawrence Livermore National Laboratory, California, USA
2005 | IBM Blue Gene/L | 280.6 TFLOPS | DoE/U.S. National Nuclear Security Administration, Lawrence Livermore National Laboratory, California, USA
2007 | IBM Blue Gene/L | 478.2 TFLOPS | DoE/U.S. National Nuclear Security Administration, Lawrence Livermore National Laboratory, California, USA
2008 | IBM Roadrunner | 1.026 PFLOPS | DoE-Los Alamos National Laboratory, New Mexico, USA
2008 | IBM Roadrunner | 1.105 PFLOPS | DoE-Los Alamos National Laboratory, New Mexico, USA
2009 | Cray Jaguar | 1.759 PFLOPS | DoE-Oak Ridge National Laboratory, Tennessee, USA
2010 | Tianhe-1A | 2.566 PFLOPS | National Supercomputing Center, Tianjin, China
2011 | Fujitsu K computer | 10.51 PFLOPS | RIKEN, Kobe, Japan
2012 | IBM Sequoia | 16.32 PFLOPS | Lawrence Livermore National Laboratory, California, USA