Outline

Introduction

Zero Knowledge cryptography is one of the most notable innovations in the last fifty years of computer science. Zero Knowledge Proofs (ZKPs) offer unique properties that make them essential components of various blockchain scaling and privacy solutions, including ZK rollups like StarkNet, private ZK rollups like Aztec, and Layer 1 chains like Mina, Filecoin & Aleo.

ZKPs are slow and expensive to produce due to a large number of expensive math operations. However, with the usage of specialized hardware like Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs), they can be accelerated by 10-1000x.

As users seek more expressive, performant, and private computation, the complexity of the statements proven with ZKPs will increase. This will result in slower proof generation, necessitating the use of specialized hardware so that proofs can be produced in a timely manner.

The operators of the hardware will need to be compensated for their work, similar to Bitcoin miners. Eventually, a full ZK mining and proving industry will manifest, starting with hobbyists generating proofs in their CPUs, then GPUs, then FPGAs. In contrast to Bitcoin, we anticipate that ASICs could take a long time to see adoption, if ever.

Why do Zero Knowledge Proofs matter?

Zero Knowledge Proofs have two main use cases.

Outsourced Verifiable Computation

Assume you have some computation that is expensive or impossible to run due to constraints of the platform you are using (e.g. your laptop, a Raspberry Pi, or even Ethereum).

Instead of running the computation on your platform, you must run it on a third-party service that can return to you the output of the computation quickly and cheaply (e.g. an AWS Lambda function, or an oracle service like Chainlink).

Normally, you would need to trust that the computation has been executed correctly, allowing the provider to output an invalid result, with potentially catastrophic consequences.

ZKPs allow a third party provider to also output a proof of computational integrity which guarantees the output you received is correct.

Private Computation

What if you have a computation that is not expensive to run locally, but you’d like to hide parts of it? For example, what if I want to show you that I know the 1000th Fibonacci number without telling you the number, or convince you I sent a payment without revealing either the amount or my identity?

ZKPs allow you to selectively hide some or all inputs around a computational statement.

Both of the above use cases have manifested in the crypto industry in multiple form factors (among others):

Layer 2 scaling: Verifiable computation with ZKPs allow L1s to outsource transaction processing to off-chain high-performance systems (aka Layer 2s). This enables blockchain scaling without compromising on security. As an example, StarkWare is building a scalable smart contract platform, StarkNet, using a special-purpose virtual machine that runs ZK-friendly code. Aztec also enables their Layer 2 programs to run privately, without leaking any information about a user’s transactions.
Private L1s: L1 chains like Aleo, Mina, and Zcash allow transactors to hide senders, receivers, or amounts using ZKPs, either by default (Aleo) or as an opt-in (Mina and Zcash).
Decentralized Storage: Filecoin uses ZKPs (running on GPUs) to prove that nodes in the network store data correctly.
Blockchain Compression: Mina and Celo use ZKPs to compress the blockchain data needed to synchronize to the latest state of the chain into a small proof.

Given the above, it is safe to say that as cryptocurrency adoption increases, ZKPs will be required in order to accommodate the increased demand for performance and privacy from users, and new types of applications and protocols.

ZKPs fundamentally allow scalable and private payments and smart contract platforms to thrive but introduce non-trivial overheads which have historically hindered their adoption.

Why are ZKPs slow and how do we make them fast?

Proving a computation requires first translating it from a classical program to a ZK-friendly format. This is done either by manually rewriting your code to use a low-level library like Arkworks, or by using a Domain Specific Language like Cairo or Circom that compiles down to the necessary primitives to generate the proof.

More expensive and complex operations result in longer proof generation times. It is also common that some operations are not ZK-friendly (e.g. the bitwise operations used in SHA or Keccak), resulting in long proof generation times for what might be a cheap operation on a classical computer.

Once your computation is in ZK-friendly form, you choose some inputs and send it to a proof system. There are many proof systems, some named after the authors of their papers (e.g. Groth16, GM17) and others with more creative names (PLONK, Spartan, STARK). What they all have in common is that they take a computation that’s expressed in a ZK-friendly format along with some inputs and output a proof.

Depending on the proof system, the proof generation process may differ, but the bottleneck always ends up being either:

Multiplications over large vectors of numbers (field or group elements), specifically variable-base and fixed-base multi-scalar multiplications (MSMs); or,
Fast Fourier Transforms (FFTs) and Inverse FFTs (although there are techniques for FFT-less proof systems).

In systems where both FFTs and MSMs exist, about 70% of the time generating a proof is spent on MSMs, and the rest is dominated by FFTs.

Both MSMs and FFTs are slow, but have ways of improving their performance:

MSMs are embarrassingly parallel and can be accelerated by running them over multiple threads. However, even on hundreds of cores, if each vector of elements is 225 long (that’s 33 million elements, a conservative complexity ballpark for an application like a zkEVM), the multiplications still end up taking a lot of time. This means frequently repeating the same operations, and saturating most of the memory available on the device. In short, MSMs require a lot of memory and remain slow even when heavily parallelized.
FFTs heavily rely on the frequent shuffling of data as the algorithm runs. This makes them hard to speed up by distributing the load across a computing cluster, as shown in DIZK. In addition, they require a lot of bandwidth when run on hardware. The shuffling means that you need to ‘randomly’ load and unload elements from, for example, a >100GB dataset on a hardware chip with 16GB of memory or less. While operations on the hardware are very fast, the time to load and unload data over the wire ends up slowing operations significantly.

In short:

MSMs have predictable memory accesses which allows for heavy parallelization, but their cost is still high because of the raw amount of compute and memory needed.
FFTs have random memory accesses which make them not friendly to hardware, on top of being naturally hard to run on distributed infrastructure.

The most promising work we have seen on addressing the slowness of large MSMs and FFTs is PipeZK. In their paper, the authors describe a method to make MSMs cheaper using Pippenger’s algorithm to skip duplicate computation. They also describe a method to “unroll” FFTs so they can be performed without significant shuffling, which allows for speed improvements on hardware due to the now-predictable memory access patterns.

Assuming the above methods address the fundamental bottlenecks of each algorithm, the question then becomes: What is the best hardware to flash with highly optimized MSM and FFT algorithms to accelerate ZKP generation?

Hardware matters

The above acceleration techniques can be implemented on multiple hardware technologies: GPUs, FPGAs, or ASICs. But which one is the best choice?

To answer that, we first have to acknowledge that ZKPs are still early in their development. There is still little standardization on system parameters (e.g. FFT width or bit-size of elements) or choice of proof system.

Due to these factors, there are two core properties of FPGAs which make them preferable to ASICs in the ZK context:

“Write multiple times” vs “write once”: The business logic on an ASIC is write-once. If any ZKP logic changes, you need to start from scratch. FPGAs can be re-flashed any amount of times in under 1 second, meaning they can re-use the same hardware across multiple chains with incompatible proof systems (e.g. because they want to extract MEV across chains), and nimbly adapt as the ZK “meta” changes.
Healthier supply chain: ASIC design, manufacturing, and deployment typically takes 12 to 18 months or more. On the contrary, FPGA supply chains are healthy, with leading providers like Xilinx allowing mass retail orders from the website (i.e. without any point of contact) which arrive within 16 weeks. This allows an FPGA-centric operation to have a tighter feedback loop on its product, and scale up their operation by purchasing and deploying more FPGAs.

We also expect FPGAs to outperform GPUs for similar reasons why they have thrived in machine learning and computer vision:

Hardware Cost: A top-of-class FPGA (leading process node, clock speed, energy efficiency, and memory bandwidth) is about 3x cheaper than a top-of-class GPU. The global demand for GPUs has further exacerbated this issue.
Power Efficiency: FPGAs can be >10x more energy efficient than GPUs, a large reason being the requirement for GPUs to be connected to a host device, which is often consuming a lot of power.

Given the above, we expect that the winning players in the market will be companies that focus on FPGAs over ASICs or GPUs. If however only one or few ZK L1s or L2s end up achieving dominant scale, and ZK proof systems stabilize around a single implementation, then the likelihood of ASICs winning over FPGAs might be higher. But we are likely multiple years away from that scenario, if it ever occurs at all.

Conclusion

In 2021, Bitcoin miners netted over $15 billion in revenue, and Ethereum miners just exceeded $17 billion. It’s plausible that Zero Knowledge Proofs end up becoming the de-facto medium of computational integrity and privacy on the web. In that case, the opportunity for ZK miners/provers could be of similar size to the Proof of Work mining market.

ZKPs are slow and will require hardware acceleration to be made feasible over complex computations. We believe that the technology that will matter the most for ZK hardware acceleration is FPGAs and not GPUs (due to cost and energy efficiency) or ASICs (due to their inflexibility and long iteration cycles).

If you are a hardware, Rust, or cryptography expert interested in further discussing or working together on this subject, please reach out to me at georgios@paradigm.xyz.

Twitter

Farcaster

Contact

Hardware Acceleration for Zero Knowledge Proofs

Introduction

Why do Zero Knowledge Proofs matter?

Why are ZKPs slow and how do we make them fast?

Hardware matters

Conclusion

Biography