## Fastest device to perform single-threaded SHA256 hashing

2

I am designing a cryptographic Proof of Storage (PoS) that relies on single-threaded computation of SHA256 hashes. In practice, my algorithm is equivalent to computing, for some string S, the value SHA256^N(S) = SHA256(SHA256(SHA256(...SHA256(S)...))).

This forces anything computing that function to use only one thread, as each SHA256 step needs the output of the previous as input.
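As a minimal sketch of the function described above (assuming each step hashes the raw 32-byte digest of the previous step, which the question does not specify; `iterated_sha256` is a name made up for illustration):

```python
import hashlib


def iterated_sha256(seed: bytes, n: int) -> bytes:
    """Apply SHA256 n times to seed; n = 0 returns seed unchanged."""
    digest = seed
    for _ in range(n):
        # Each step consumes the previous output, so the loop cannot be
        # split across threads.
        digest = hashlib.sha256(digest).digest()
    return digest


if __name__ == "__main__":
    print(iterated_sha256(b"S", 1000).hex())
```

The data dependency between iterations is what makes the chain sequential: step i+1 cannot start before step i has finished.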

Now, I know that ASIC miners exist that can compute SHA256 hashes at an enormous rate. This, however, is due to their parallelism: completing Proofs of Work (i.e., finding inputs whose hash begins with a run of zeros) is an embarrassingly parallel problem and can be computed very fast on massively parallel devices. This is not what I am looking for.

What I am looking for is the fastest device that would be able to compute my function, i.e., single-threaded computation of a SHA256, then the SHA256 of its output, the SHA256 of the result and so on.

I thought that in this case the best hardware might be the CPU with the fastest single-threaded performance: I found here that the Intel Core i7-7700K could be a good place to start looking.

Is there any other known specialized hardware device that could carry out the task faster?

1

Well, existing Bitcoin mining hardware would indeed not be any good for iterated hashing, as it's not designed to do that. However, SHA-256 is not a hardware-resistant hash function. The most complex operation in the hash function is addition modulo 2³², which is quite easy to implement in hardware.

An attacker could implement a fast SHA core on a chip, designed to calculate a single round as fast as possible. This is the opposite of the optimization used in a Bitcoin miner: miners prefer very many parallel cores that run at slower clock speeds to save energy per hash. Such a hypothetical chip would definitely be much faster than a CPU at calculating iterated hashes.

What you want is a memory-hard hash function such as Argon2. Its execution requires fast cache, which is plentiful on commodity CPUs but costly to build into other hardware. It has a parallelism parameter where you can allow up to, say, 4 cores. Iteration of the function will not be any faster past that number of cores, and custom hardware will not increase the attacker's speed.
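To give a feel for what a memory-hard function looks like in code, here is a rough sketch. Note that it does not use Argon2: Argon2 is not in Python's standard library, so this swaps in scrypt (a different memory-hard function, available as `hashlib.scrypt` on builds linked against OpenSSL 1.1 or later). The parameter values, the salt, and the name `memory_hard_hash` are all assumptions chosen for illustration.

```python
import hashlib


def memory_hard_hash(data: bytes, salt: bytes = b"example-salt") -> bytes:
    """Hash data with scrypt, used here as a stand-in for Argon2."""
    return hashlib.scrypt(
        data,
        salt=salt,
        n=2**14,  # cost parameter; memory use is roughly 128 * n * r bytes
        r=8,      # block size; with n = 2**14 this needs about 16 MiB
        p=1,      # number of independent lanes that may run in parallel
        maxmem=64 * 1024 * 1024,  # allow up to 64 MiB
        dklen=32,
    )


if __name__ == "__main__":
    print(memory_hard_hash(b"S").hex())
```

Each evaluation has to touch the whole working buffer, so the cost is dominated by memory traffic rather than by arithmetic.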

Thank you! So if I understood correctly, the point to leverage on to make a sequential hash function resistant to hardware attacks is to make memory the bottleneck? Why can't memory access be made faster by using dedicated hardware? – Matteo Monti – 2017-01-14T08:58:17.797

Also, you said "much faster" than a CPU, and that's already a relevant answer to my question, but I was wondering if you could give me an idea on the order of magnitude of the speed up. Are we talking about ten or ten billion times faster? I guess that those operations would still be bound by clock, which is ultimately bound by physics...? – Matteo Monti – 2017-01-14T09:00:54.410

1

maservant is correct that you could build specialized hardware to compute this faster. I'm assuming that you're interested only in off-the-shelf hardware.

Intel has proposed adding SHA256 extensions to their CPUs so that they can compute SHA256 faster. None of those are currently on the market, but unconfirmed speculation says that Cannonlake CPUs will have the SHA instructions. (You can check whether a CPU supports these by running `cpuid | grep SHA`.)

Also, you're using the CPU in a somewhat unusual way. For example, most workloads don't fit in L1 cache, so they benefit from better and more cache. That won't matter at all to you.

Some aspects of SHA256 can be parallelized. For example, the compression function and the message expansion can be run at the same time.

I think, but am not sure, that it's also possible to get some degree of parallelism between SHA256 compression rounds. Notice that the only variables that change as a result of the computation are A and E; all the rest are just shifted over one space.
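To make the observation about A and E concrete, here is a sketch of a single SHA256 compression round following the round definition in FIPS 180-4. The function names are made up for illustration, and the round constant `k_t` and message-schedule word `w_t` are simply passed in.

```python
MASK = 0xFFFFFFFF  # keep every value within 32 bits


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_round(state, k_t: int, w_t: int):
    """Apply one compression round to the working variables (a..h)."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & MASK & g)
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t1 = (h + s1 + ch + k_t + w_t) & MASK  # additions modulo 2**32
    t2 = (s0 + maj) & MASK
    new_a = (t1 + t2) & MASK
    new_e = (d + t1) & MASK
    # Only new_a and new_e are computed; the other six words are the old
    # values moved along by one position.
    return (new_a, a, b, c, new_e, e, f, g)
```

Six of the eight output words are copies of inputs, so the serial dependency from one round to the next runs only through the two freshly computed words.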

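You may want to consider overclocking the CPU to a degree that would make most systems unstable. While computing repeated SHA256 requires fast serial performance, verifying it can be done in parallel, provided that intermediate hashes are recorded as checkpoints so that each segment can be checked independently.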
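One way parallel verification could work is sketched below. It assumes the prover records a checkpoint every `interval` hashes, which is an assumption of this example and not something the answer specifies; all function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def iterate(value: bytes, steps: int) -> bytes:
    """Apply SHA256 to value the given number of times."""
    for _ in range(steps):
        value = hashlib.sha256(value).digest()
    return value


def compute_with_checkpoints(seed: bytes, n: int, interval: int):
    """Serial computation; assumes interval divides n evenly."""
    checkpoints = [seed]
    for _ in range(n // interval):
        checkpoints.append(iterate(checkpoints[-1], interval))
    return checkpoints


def verify_checkpoints(checkpoints, interval: int, workers: int = 4) -> bool:
    """Check every segment independently of the others."""
    pairs = list(zip(checkpoints, checkpoints[1:]))
    # Threads are used only to show that the segments are independent;
    # a real verifier would likely use separate processes or machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: iterate(p[0], interval) == p[1], pairs)
        return all(results)
```

The prover still needs n sequential hashes, while a verifier with enough workers needs only about `interval` sequential hashes of wall-clock time.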

PS. I'm very curious about what you're building, and why this is relevant to it.

Thank you! Actually, I am interested in any kind of hardware: to prevent attacks, one needs to be sure that the algorithm will also withstand attacks from specifically designed hardware. – Matteo Monti – 2017-01-14T09:03:24.137

In my question, I greatly oversimplified my algorithm, because I thought that the bottleneck would be its inherent sequentiality. As far as I understand now, it is rather memory access. – Matteo Monti – 2017-01-14T09:04:53.873

What I am trying to achieve is a function with a small input and a very large output (e.g. 32 bytes of input, 512 MB of output) that is inherently sequential and for which computing one byte of the output is more or less as expensive as computing the whole output. This can be used as a proof of persistent storage: you compute the function (it takes 5 minutes) and save its large output; then I ask you for 1 byte of the output and give you 1 second to return it. Either you are honestly storing the output, or you have to recompute it from scratch and miss the timeout. – Matteo Monti – 2017-01-14T09:08:20.050

@MatteoMonti Unless I'm misunderstanding your system, I can store every 8th hash, and forget the rest. When someone asks me for a byte, I only have to do 7 hashes. I only need to store 64MB. – Nick ODell – 2017-01-14T22:29:35.810
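The trade-off described in this comment can be sketched as follows. The example assumes the large output is simply the chain of hashes H_0, H_1, ..., with H_i = SHA256^i(seed), which may not match the real algorithm; the function names are hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_sparse_table(seed: bytes, n: int, stride: int):
    """Compute H_0 .. H_(n-1) but keep only every stride-th hash."""
    table = []
    value = seed
    for i in range(n):
        if i % stride == 0:
            table.append(value)
        value = sha256(value)
    return table


def recover(table, stride: int, index: int) -> bytes:
    """Rebuild H_index from the nearest stored hash at or before it."""
    value = table[index // stride]
    for _ in range(index % stride):  # at most stride - 1 extra hashes
        value = sha256(value)
    return value
```

With `stride = 8` the table is one eighth of the full output (512 MB / 8 = 64 MB), and any requested hash costs at most 7 extra SHA256 calls.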

I'll contact you privately later and give you the full description of the algorithm! :) I didn't explain myself well enough. – Matteo Monti – 2017-01-14T22:47:58.437