How much did it cost to build the fastest supercomputer in the world?

Qblocks · Oct 15, 2020

Fugaku supercomputer from Japan

A few months back (June 2020), a new supercomputer from Japan named Fugaku became the fastest supercomputer in the world, with over 1 exaflops of half-precision performance and more than 400 petaflops of double-precision performance on the LINPACK benchmark.

With this level of computing power, it's nothing short of a breakthrough. A machine like this can perform hundreds of quadrillions of calculations every second, giving scientists and engineers a far more time-efficient way to study the world: thousands of simulated experiments can be run without carrying out slow, expensive physical ones.

Using supercomputers, we can now conduct virtual experiments that are impossible in the real world: from looking deep inside individual atoms, to studying the future climate of the Earth, to following the evolution of the entire universe from the Big Bang.

But what makes a supercomputer computationally powerful?


In computing, floating-point operations per second (FLOPS) is a measure of computer performance, widely used in scientific computing, that counts how many floating-point operations a machine can process in one second.

A higher FLOPS figure indicates a computer with more processing power.

An exascale supercomputer is capable of one quintillion (10^18) floating-point operations per second, or 1,000 petaFLOPS. Supercomputers are used for parallel computations: tasks that can be subdivided and processed in parallel across a large number of individual compute nodes.

In a supercomputer, thousands of compute nodes are connected together to act as a single giant machine, so the total FLOPS figure is calculated as the aggregate of every individual compute node's performance.
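
As a rough illustration, a machine's theoretical peak can be estimated by multiplying the node count, cores per node, clock speed and floating-point operations per core per cycle. The sketch below plugs in Fugaku's published figures; the 32 double-precision FLOPs per cycle is an assumption based on the A64FX's two 512-bit SVE fused multiply-add pipelines:

```python
# Back-of-the-envelope peak-FLOPS estimate for a node-based supercomputer.
# Node count, cores and clock are Fugaku's published specs; the 32
# double-precision FLOPs/cycle per core is an assumption (2 x 512-bit SVE FMA).

nodes = 152_064            # A64FX compute nodes
cores_per_node = 48        # compute cores per A64FX SoC
clock_hz = 2.2e9           # 2.2 GHz
flops_per_cycle = 32       # assumed double-precision FLOPs per core per cycle

peak_flops = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e15:.1f} petaflops")
# ~513.9 petaflops, close to the 513.98 PF peak figure quoted later in this article
```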

GPUs (graphics processing units) can be thought of as consumer-scale supercomputers: they deliver teraflops of performance and process data in parallel rather than sequentially. This method of performing a single task in parallel over multiple input data streams is known as the SIMD (or SIMT) architecture.

SIMD stands for Single Instruction, Multiple Data. When a single program is applied in parallel to multiple data streams (or, in the SIMT variant, to multiple threads), it follows the SIMD/SIMT model. Frame rendering is a great example: the GPU executes a single instruction, rendering a pixel, and the RGB value of each pixel is the input data. Each pixel can therefore be rendered in parallel using the thousands of small CUDA cores inside a GPU, making the entire rendering process much faster.
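
As a loose CPU-side illustration of the same idea, the NumPy sketch below applies one instruction (a gamma correction, chosen arbitrarily as a stand-in for any per-pixel operation) across an entire frame of pixel data at once rather than looping pixel by pixel:

```python
import numpy as np

# One "instruction" (gamma correction) applied to many data elements at once,
# the same single-instruction, multiple-data idea a GPU uses per pixel.
frame = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)  # RGB frame

gamma = 2.2
corrected = (255.0 * (frame / 255.0) ** (1.0 / gamma)).astype(np.uint8)

# Every pixel was transformed by the same operation; on a GPU the same pattern
# is spread across thousands of cores running in parallel.
print(corrected.shape)  # (1080, 1920, 3)
```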

As more of today's problems lend themselves to parallel processing, GPUs are becoming GPGPUs, i.e. general-purpose GPUs. Initially they had a single dominant use case, frame rendering for gaming, but they are now finding strong use cases in machine and deep learning model training, big data analysis, scientific simulation and synthetic data creation.

But a general-purpose supercomputer is not just the sum of its individual processors' computing power. Several other factors make a supercomputer an actual SUPER computer.

At a base level, the overall speed of a supercomputer is heavily dependent on the following:

  1. I/O speed for transferring input data into memory for processing.
  2. The actual computing capability of the processors.
  3. The interconnect speed for data transfer between individual compute nodes.
  4. I/O speed for writing the output data back to disk.

The faster these tasks are performed, the faster the supercomputer will be, and the benchmarked FLOPS figure reflects the combined performance of all of them, not just raw processor speed.
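
To make this concrete, here is a toy timing model (my own simplification with made-up numbers, not a real benchmark) showing how all four stages shape a job's wall-clock time:

```python
# Toy model: end-to-end job time is the sum of its stages, so a fast processor
# alone doesn't make a fast supercomputer. All figures are made-up examples.

def job_time_seconds(data_in_gb, data_out_gb, flop_count,
                     read_gbps, write_gbps, node_flops, interconnect_gbps,
                     exchanged_gb):
    t_read = data_in_gb / read_gbps            # 1. load input data into memory
    t_compute = flop_count / node_flops        # 2. raw processing
    t_comm = exchanged_gb / interconnect_gbps  # 3. node-to-node data exchange
    t_write = data_out_gb / write_gbps         # 4. write results back to disk
    return t_read + t_compute + t_comm + t_write

# Doubling compute speed helps little if I/O and interconnect stay slow:
print(job_time_seconds(500, 100, 1e15, 5, 5, 1e13, 25, 200))  # ~228 s
print(job_time_seconds(500, 100, 1e15, 5, 5, 2e13, 25, 200))  # ~178 s
```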

The overall cost of a supercomputer is driven not just by the compute nodes, but also by fast interconnects (such as InfiniBand links that can offer upwards of 25 GB/s of data transfer), fault-tolerance mechanisms, fast I/O disks and terabytes of memory.

Today, a supercomputer is typically built from:

  1. Thousands of high-end compute nodes powered by computing accelerators such as GPUs and TPUs.
  2. High-speed interconnects that carry data between the compute nodes.
  3. A large amount of shared, high-bandwidth memory for easy access to data across the nodes.

Fugaku was developed by RIKEN in close collaboration with Fujitsu and the application community, over more than a decade of research and development.

Fugaku is built from Fujitsu's 48-core Arm A64FX SoC processors, which handle double-precision calculations without separate accelerators.

Some performance and system specs for the Fugaku supercomputer:

  • Number of compute nodes: 152,064 A64FX nodes
  • Total number of CPU cores: 7.3 million Arm cores
  • CPU core clock: 2.2 GHz
  • Double-precision LINPACK performance: 415.53 petaflops
  • Theoretical peak performance: 513.98 petaflops
  • Total RAM: 4.85 petabytes, with 163 petabytes/second of memory bandwidth
  • OS: Linux and the McKernel lightweight kernel running simultaneously

Fugaku demonstrated more than 2.8 times the performance of the previous list leader, Summit (ORNL), which was benchmarked at 148.6 petaflops and now sits in second place.

Jack Dongarra, one of the pioneers of the supercomputing field, has discussed these results in more detail.

But what about the cost of building such a fast supercomputer?

The cost to build Fugaku was around $1 billion (source), on par with what is projected for the U.S. exascale machines.

Yes, you read that correctly: it took a billion dollars to build the fastest supercomputer in the world.

Significant R&D and data-center upgrades were involved in building this supercomputer. Had off-the-shelf CPUs been used, it would have cost roughly three times as much.

To recover the cost of building such a supercomputer, users are charged exorbitant fees for access.

This has been the trend since the beginning of the computing era.

The fastest computers have always cost a fortune.

Thus, they have always been limited to the select group of researchers, companies and universities that can spend this kind of money on computing.

Because of this, a lot of research stalls at the individual and small-team level. These researchers are restricted to limited computing infrastructure, which slows down innovation at every level.

There needs to be a change in this domain to support research at this scale too.

And it's not just supercomputing: even renting a powerful GPU instance in the cloud for your next machine learning or deep learning project takes a backseat when a huge bill lands in your inbox and puts a dent in your wallet.

For example, a Tesla V100 GPU instance on AWS costs $3.06/hr, so training an NLP model continuously for two weeks would cost you roughly $1,000. And it doesn't stop there: to further improve your model's accuracy and make it work for skewed datasets, you need to retrain it several times with different hyperparameters and datasets.

Running a GPU instance adds up the costs incrementally, and soon enough it becomes quite expensive, as the rough calculation below shows.
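
Here is that arithmetic sketched out; the hourly rate is the one quoted above, while the number of retraining runs is just an assumed figure:

```python
# Rough cloud GPU training cost estimate, using the on-demand rate quoted above.
hourly_rate = 3.06          # USD/hr for a single V100 instance, as cited in the article
hours_per_run = 24 * 14     # two weeks of continuous training
retraining_runs = 4         # assumed extra runs for hyperparameter/dataset changes

single_run = hourly_rate * hours_per_run
total = single_run * (1 + retraining_runs)
print(f"One training run: ${single_run:,.0f}")            # ~$1,028
print(f"With {retraining_runs} retrains: ${total:,.0f}")  # ~$5,141
```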

Cloud platforms have been designed to support generic applications, from web hosting to storage and backups. But high-performance computing applications require specialized, efficient implementations at both the hardware and software level. Adding GPUs to existing server racks and trying to fit them into existing infrastructure results in a non-optimized solution for end customers, leading to higher compute costs.

At Q Blocks, we realized there is a better way to build a more efficient computing platform for the new generation of applications, taking inspiration from a method used by some of the biggest scientific organizations in the world, like CERN and SETI.

Decentralized computing: using idle workstations for compute-intensive jobs.

Folding@home, a distributed computing project for protein folding, recently reached exascale performance by pooling idle computers donated by researchers and science enthusiasts, and is using them to simulate the COVID-19 virus and search for candidate cures.


This is the sheer power of decentralized/distributed computing. In its simplest form, it means workloads are distributed across multiple computers all over the world and computed in parallel.

These computations can mostly be processed independently of one another, and so don't require fast intercommunication between the nodes.
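
Here is a minimal single-machine sketch of that pattern: independent work units are farmed out to a pool of workers that never communicate with each other. The workers here are local processes, but in a decentralized grid they would be remote nodes:

```python
from multiprocessing import Pool

# Each work unit is independent, so workers never need to talk to each other.
# This is the "embarrassingly parallel" shape that suits decentralized grids.
def simulate(work_unit: int) -> int:
    # stand-in for a folding/rendering/training shard
    return sum(i * i for i in range(work_unit)) % 97

if __name__ == "__main__":
    work_units = list(range(10_000, 10_100))
    with Pool(processes=8) as pool:
        results = pool.map(simulate, work_units)   # scatter, compute, gather
    print(len(results), "results collected")
```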

Doing this at scale for business use cases brings several challenges around reliability, security and redundancy.

At Q Blocks, we approached these challenges one by one and designed a decentralized computing architecture from the ground up to serve compute-intensive tasks, processed across a networked grid of remote nodes such as consumer PCs and data centers with idle capacity. Fault tolerance is built in by design, along with network- and node-level security, to offer reliable computing access to the end user.

One of the challenges that comes with decentralized nodes is heterogeneity in the hardware and software stack. Nodes can have different numbers of CPU cores and different amounts of RAM, storage and GPU capacity. The Q Blocks computing engine resolves this by abstracting the infrastructure from the nodes on demand and processing all computing jobs in secure, self-contained environments with pre-configured libraries and drivers to serve a variety of applications. This results in a highly optimized computing experience and, unlike a traditional cloud, drastically reduces the effort of deploying new compute instances.
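
As a purely hypothetical illustration of the heterogeneity problem (this is not Q Blocks' actual scheduler or API), a grid has to reconcile very different node specs before it can place a job:

```python
# Hypothetical illustration of node heterogeneity in a decentralized grid.
# The node records and the matching rule are made up for this example.
nodes = [
    {"name": "home-pc-1",  "cpu_cores": 8,  "ram_gb": 32,  "gpus": 1, "gpu_vram_gb": 8},
    {"name": "lab-rig-2",  "cpu_cores": 32, "ram_gb": 128, "gpus": 4, "gpu_vram_gb": 24},
    {"name": "dc-spare-3", "cpu_cores": 64, "ram_gb": 256, "gpus": 0, "gpu_vram_gb": 0},
]

job = {"min_gpus": 1, "min_gpu_vram_gb": 16, "min_ram_gb": 64}

def fits(node, job):
    return (node["gpus"] >= job["min_gpus"]
            and node["gpu_vram_gb"] >= job["min_gpu_vram_gb"]
            and node["ram_gb"] >= job["min_ram_gb"])

eligible = [n["name"] for n in nodes if fits(n, job)]
print(eligible)  # ['lab-rig-2'] -- only one node meets this job's requirements
```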

With this approach, many fast-evolving commercial applications can be deployed easily: machine and deep learning model training and serving, big data analysis and data processing, 3D rendering, blockchain dApps, video transcoding and much more.

Q Blocks platform enables the following benefits for users:

  • 5–10X lower computing costs
  • No vendor lock-ins
  • Web 3.0 & blockchain application support
  • On-demand and managed computing access
  • No single point of failure

Today our platform offers optimized support for training compute-intensive ML and DL models in the computer vision and natural language processing space.

If you are working on AI research, you will understand the pain of the computing costs involved in training even a moderately large model.

We are working on bringing managed computing support for more use cases soon.

So, check out the Q Blocks decentralized computing platform (https://www.qblocks.cloud) to get multi-GPU instances at up to 1/10th the cost for your AI workloads.

We hope this article gave you a better understanding of how supercomputing works, and how democratizing it can benefit the entire ecosystem.

By:

Co-founder & CTO, Q Blocks
