## Tuesday, October 2, 2018

### A Zero Knowledge Merkle Tree

In the second post in this series, I presented the very neat concept of a Merkle Tree, but we ended with the problem that in its standard implementation - while being a powerful commitment scheme - a Merkle Tree is not zero knowledge.
As it turns out, this is very easy to fix by adding another level to the tree, such that in it, every pair of siblings is comprised of one node associated with actual data, and the other with a random string.
This is again the notion of mixing real data with randomness in order to obtain zero knowledge.

Here's how this will look with our "Yes Sir I Can Boogie" data:

This has the desired effect because whenever the prover has to reveal an authentication path for a piece of data - all the hashes revealed are affected by random data, and having been mixed through the Sha256 hash - these hashes appear random, and provide zero knowledge (other than the revelaed leaf node).

### Fixing The Code:

Tweaking the MerkleTree class from last time, we get:

class ZkMerkleTree:
"""
A Zero Knowledge Merkle tree implementation using SHA256
"""
def __init__(self, data):
self.data = data
next_pow_of_2 = int(2**ceil(log2(len(data))))
self.data.extend([0] * (next_pow_of_2 - len(data)))
# Intertwine with randomness to obtain zero knowledge.
rand_list = [random.randint(0, 1 << 32) for x in self.data]
self.data = [x for tup in zip(self.data, rand_list) for x in tup]
# Create bottom level of the tree (i.e. leaves).
self.tree = ["" for x in self.data] + \
[hash_string(str(x)) for x in self.data]
for i in range(len(self.data) - 1, 0, -1):
self.tree[i] = hash_string(self.tree[i * 2] + self.tree[i * 2 + 1])

def get_root(self):
return self.tree[1]

def get_val_and_path(self, id):
# Because of the zk padding, the data is now at id * 2
id = id * 2
val = self.data[id]
auth_path = []
id = id + len(self.data)
while id > 1:
auth_path += [self.tree[id ^ 1]]
id = id // 2
return val, auth_path

def verify_zk_merkle_path(root, data_size, value_id, value, path):
cur = hash_string(str(value))
# Due to zk padding, data_size needs to be multiplied by 2, as does the value_id
tree_node_id = value_id * 2 + int(2**ceil(log2(data_size * 2)))
for sibling in path:
assert tree_node_id > 1
if tree_node_id % 2 == 0:
cur = hash_string(cur + sibling)
else:
cur = hash_string(sibling + cur)
tree_node_id = tree_node_id // 2
assert tree_node_id == 1
return root == cur

### Protocol Summary

To summarize the theory, the protocol by which the prover proves knowledge of a satisfying assignment to the Partition Problem is:
1. The prover generates a witness (using get_witness from the first post in this series).
2. The prover creates a ZK Merkle Tree from the witness, and sends its root-hash to the verifier.
3. The verifier sends a random number $i$ to the prover.
4. If $i < n$ then the prover sends to the verifier:
1. The elements in places $i$ and $i + 1$ in the witness.
2. The authentication paths proving that these answers are consistent with the root sent in step (2).
5. If $i == n$ then the prover sends the first and the last elements in the witness, with the authentication paths etc.
6. The verifier checks the authentication paths against the root, and the returned numbers against the problem instance, to verify properties (1) and (2) of the witness as they are described in the first post.
7. The verifier returns true iff everything checks out.

### What If The Prover Is Lying?

Clearly if everything is kosher, the verifier will see that it is (this is called completeness).
But what if the prover is dishonest? What is the probability $p$ that the verifier will catch on? (this is called soundness).

Suppose the witness is kosher in all but one place, which is clearly the hardest case to catch. This means that in a single query, the verifier has a probability of $\frac 1 {n + 1}$ to expose the prover's deception.
If we repeat the protocol $k$ times, then the verifier's probability of catching a lying prover grows to $1 - (1 - \frac 1 {n + 1})^k$.
And if we set $k = 100(n + 1)$ then this is approximately $1 - \frac 1 {e^{100}}$ which is indeed very very very sure.
To give a sense of how sure that is, the prover's odds of convincing the verifier of a false claim are like odds of flipping a coin and having it land on its edge 12 times in a row.

### Fiat-Shamir Heuristic

One must admit that it is somewhat cumbersome to have the prover and the verifier engage in such a long exchange of queries and responses. It means that whenever there's something to prove, both sides need to be available, online, and ready for this ping pong.

Luckily, a neat trick by Amos Fiat and Adi Shamir, known as Fiat-Shamir Heuristic, allows us to take this protocol, and convert it into a single long proof, that the prover generates once, and that everyone in the world can check afterwards.

The heuristic is based on the observation that in many protocols, and specifically in the one described here, the only messages that the verifier ever sends throughout the exchange are random numbers.
So here's the basic idea:
• The prover will simulate the verifier's side in the exchange, but will seed the random number generator it uses in a way that is on one hand random "enough", and on the other hand - replicable.
• The prover will write down the verifier's queries, and the prover's replies (with the authentication paths and all), one after the other, and documentation of the simulated exchange will be the proof!
• After the desired number of queries have been simulated - the prover will send this single long proof to the verifier.
• On the verifier side - the verifier will simulate the exchange, using the same replicable-randomness mechanism, that will convince the verifier that the queries that the prover asked itself were indeed random.

This smells like cheating, I admit. The prover asks itself and answers itself and sends this exchange to the verifier.
But to our aid comes the fact that hash functions behave, to all cryptographic intents and purposes, as if they were random number generators.

So when the prover needs to simulate the first query - it will feed the problem instance into a hash function, and use that to obtain a random number (e.g. take the hash mod n).
When the time comes to generate the second query, and all the subsequent queries - the prover will feed the proof that has been written up to that point into the hash function, and use that to obtain a random number.
Provided that the prover and the verifier agree which hash function they use - this is both random and replicable (since the verifier has the problem instance and the proof, used to seed the randomness) on both sides of the exchange.

### Putting it All Together

And here's the code to obtain a proof and to check it:

def get_proof(problem, assignment, num_queries):
proof = []
randomness_seed = problem[:]
for i in range(num_queries):
witness = get_witness(problem, assignment)
tree = ZkMerkleTree(witness)
random.seed(str(randomness_seed))
query_idx = random.randint(0, len(problem))
query_and_response = [tree.get_root()]
query_and_response += [query_idx]
query_and_response += tree.get_val_and_path(query_idx)
query_and_response += tree.get_val_and_path((query_idx + 1) % len(witness))
proof += [query_and_response]
randomness_seed += [query_and_response]
return proof

def verify_proof(problem, proof):
proof_checks_out = True
randomness_seed = problem[:]
for query in proof:
random.seed(str(randomness_seed))
query_idx = random.randint(0, len(problem))
merkle_root = query[0]
proof_checks_out &= query_idx == query[1]
# Test witness properties.
if query_idx < len(problem):
proof_checks_out &= abs(query[2] - query[4]) == abs(problem[query_idx])
else:
proof_checks_out &= query[2] == query[4]
# Authenticate paths
proof_checks_out &= \
verify_zk_merkle_path(merkle_root, len(problem) + 1, query_idx, query[2], query[3])
proof_checks_out &= \
verify_zk_merkle_path(merkle_root, len(problem) + 1, \
(query_idx + 1) % (len(problem) + 1), query[4], query[5])
randomness_seed += [query]
return proof_checks_out


### A Real Live Proof!

Running the following short script returns true (as the proof indeed checks out) and prints the proof

def test(q):
problem = [1, 2, 3, 6, 6, 6, 12]
assignment = [1, 1, 1, -1, -1, -1, 1]
proof = get_proof(problem, assignment, q)
print(proof)
return verify_proof(problem, proof)

And this is the proof we get (running "test(4)" for only 4 queries):

[['f9f3b1e40626e906b03eb9fd5428b2f2f801e8f3c23627fe7e52a645c3f32632', 3, 1, ['1b7f5356d043c6336c6614fcc24cb77f8807cd2f443b1b77e0002be6b96c40b6', 'a412af57af0b88cdb5cb3d0cbfcd739ebcc3c6fe0ac364db9490b4a208803101', '9f358dd0980f35ea86070d0fb12b2f5726857031ef56968005095cdb13e0a6f0', '05066ac05f174f1f98226c40889c566775592ec3807fbe080324447616773e18'], 7, ['cd6ee891c632e07ad468cd602c8d2d935356ca5901b21a75a2719d164a925382', '4cfc41b83cf64e0cf14a0ea8179aa67c6324699557c508dfc424604674805864', '4efb02f72dbc085ead53657647e893f3ceb29c9f81d411dd817f3be512cad632', '6cd4c16c3d5db473280b64f6b3fce54cb4b6810b46331899f4c07f884fd89aae']], ['580bd4db1071906bcd101600baf51d33b9930ba6e26853e85634bf38c0acef92', 6, 16, ['f8b28423de50f3b0cbcf88caacb6d4f6789ba3cecdc7791b38d5bbcd700ecbd2', '5c41ad0b9d813740b516cb61cf9ce06966efcf82e8ee5881ca86d5b18400d03d', 'af38f9a1873b70d113dab45d6312e6d2a7f4afa45a8c82ebe788abf63dd85650', 'a57a3ccb7cbffdf4d346f1ecf10ead43a4ce1e52b51170789698b7aece6c7687'], 4, ['b703a38bb22b758c5c23c08f096b6c3155c56885d57e1280ff521126282fa857', '4e602f00ef1e1de0b733f467de61805f09a1ebee8db72cc64c62dd8d55836de1', 'af38f9a1873b70d113dab45d6312e6d2a7f4afa45a8c82ebe788abf63dd85650', 'a57a3ccb7cbffdf4d346f1ecf10ead43a4ce1e52b51170789698b7aece6c7687']]]


### One More Thing...

This series, in accordance with Hofstadter's law, turned out to be somewhat longer than I anticipated. However, I left out a few things that are important.

Among these are:
1. Off-the-bat optimizations.
2. Some discussion about proof length, running time, and what modern proof lengths (and running times) look like.
3. A few simple tweaks, suggested by my colleague at Starkware, Lior Goldberg, to make the protocol truly Zero-Knowledge (because it isn't exactly there yet) and slightly more elegant.

So, although I promised a 3-part series, there will be a fourth part. But seeing how all the code is already here, we'll call it an appendix.

1. Great series, thanks! If the verifier executes the protocol n+1 times (one for each element of p), and a third party records the responses from the prover, is it true that this third party could, when challenged by the verifier, appear as if it knows the secret (m) by just replaying the responses? Is there any way to fix this? In general, I'm looking for a ZK protocol that is resistant to this kind of replay attacks.

1. Depends on how you define the protocol. In its interactive form (without the Fiat-Shamir heuristic) yes, recording the prover's answers will not be useful, since the verifier provides different bits of randomness on every query.
Consider this point - maybe I listened in on a few rounds of communications, but then when I want to mimic an honest prover, I half to start a round of communication by sending the root of some Merkle tree. Which one do I send? The one from the first query I listened to? But then, what if the verifier asks for a leaf in the tree that is not the exact one that it asked for when interacting with the honest prover? I can't forge authentication path to make it all work...

On the other hand - for the case of non-interactive proofs, which is what we built here - of course you can pretend to be an honest prover. But this is not an issue, because what we're proving is not a question of identity, but rather a question of computational integrity. That is to say - we're not proving that the prover is honest, but that the claim is true. If such a proof exists then the claim is true, no matter who claims it, so there's no biggie that someone can prove it.