Thursday, October 4, 2018

A Hands-On Tutorial for Zero-Knowledge Proofs: Appendix

This is the final part of this hands-on tutorial. I will assume from now on that you have read Part I, Part II, and Part III of this series.

As promised, this post will deal with:

  1. Some tweaks to the protocol presented in the previous posts.
  2. A complexity analysis of the protocol.
  3. A small optimization.
  4. A few words about modern ZK proving protocols.


Lior's Tweaks


A colleague at Starkware, Lior Goldberg, pointed out a caveat in the zero-knowledge aspect of the protocol, and suggested a nice simplification. Here they are:

A ZK Fix


Suppose the first two numbers in the problem instance are 5 and 6, and that the random shift $r$ is chosen from the range $0..10$.
Now if the random query $i$ happens to be $1$, then the prover is required to reveal the second and third elements in the witness. If they happen to be 15 and 21, then the verifier knows immediately that 5 and 6 (from the problem instance) belong to the same side in the solution: $15 = r + 5$ forces $r = 10$ and a plus sign for the 5 (since $r \leq 10$), and $21 - 15 = 6$ forces a plus sign for the 6. This violates the zero-knowledge property that we wanted.
This happened because we chose uniformly at random from a very small range, and $r$ happened to be the maximal number in that range.

There are two ways to solve this. One is by choosing some arbitrary number and doing all computations modulo that number. A simpler way would be choosing $r$ from a huge domain, such as $0..2^{100}$, which makes the probability of getting a revealing $r$ negligible.
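Concretely, the second fix touches a single line in get_witness from Part I. Here is a minimal sketch (the helper name huge_random_shift is mine, for illustration only):

import random

def huge_random_shift():
    # Drawing the shift r from a huge domain (0..2**100) makes the probability
    # that a revealed pair of witness values pins down r - and with it the
    # assignment bits - negligible.
    return random.randint(0, 2**100)

# In get_witness (Part I), the line
#     shift = random.randint(0, mx)
# would become:
#     shift = huge_random_shift()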


Simplify By Having A Cyclic List


Our witness originally had $n + 1$ elements: the first was a random number, and the rest were the partial sums of the dot product of the problem and the assignment (shifted by that initial random number).
This meant we had two types of queries: one checks that two consecutive elements in the list differ, in absolute value, by the corresponding element in the problem list; the other checks that the first and last elements are equal.

As Lior pointed out, it is much more elegant to omit the last element from the witness entirely, and if $i = n - 1$, check that the first and last elements in the witness differ, in absolute value, by the last element in the problem instance. Essentially, this is like thinking of the witness as cyclic. The nice thing about this is that now we only have one type of query - a query about the difference between two cyclically consecutive elements of the witness (indices taken modulo $n$).
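In code, the verifier-side check for a single query might then look something like the sketch below (the revealed values are assumed to have already been authenticated against the Merkle root; the function name is mine):

def check_cyclic_query(problem, i, val_i, val_next):
    # Single query type for the cyclic witness: the revealed elements at
    # positions i and (i + 1) % n must differ, in absolute value, by problem[i].
    n = len(problem)
    assert 0 <= i < n
    return abs(val_next - val_i) == abs(problem[i])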


Proof Size / Communication Complexity


We'd like to analyze the size of the proof that our code generates. This is often referred to as communication complexity, because the Fiat-Shamir Heuristic (described in Part III) transforms the messages of an interactive protocol into a proof, making these two terms interchangeable in this context.

So, for each query, the proof stores:
  • The value of $i$.
  • The value of the $i$-th element in the witness and of the $((i + 1) \bmod n)$-th element.
  • Authentication paths for both elements.
The authentication paths here are the heavy part. Each of them is a list of $\log(n)$ 256-bit values.
As was discussed in the last post, to get a decent soundness, the number of queries has to be roughly $100n$. 
Putting these two together, the proof size will be dominated by the $\sim 200 \cdot n \cdot \log(n)$ hashes that form the authentication paths.
So a proof that one knows an assignment to a Partition Problem instance with 1000 numbers will require roughly $2,000,000$ hashes, which translates to 64 megabytes of data.
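As a quick sanity check of this back-of-the-envelope estimate (ignoring the revealed indices and witness values themselves, which are comparatively small):

from math import ceil, log2

def estimate_proof_size_bytes(n, hash_bytes=32):
    # ~100 * n queries, each revealing two authentication paths of
    # ceil(log2(n)) hashes, each hash being 256 bits (32 bytes).
    num_queries = 100 * n
    hashes_per_query = 2 * ceil(log2(n))
    return num_queries * hashes_per_query * hash_bytes

print(estimate_proof_size_bytes(1000))  # 64,000,000 bytes, i.e. ~64MB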


Small Merkle Optimization


Since Merkle authentication paths, somewhat surprisingly, make up the vast majority of the proof, maybe we can reduce their number a little.

Note that all queries (except for one) ask about consecutive leaves in the tree. 
Consecutive leaves have their LCA (lowest common ancestor), on average, at height $\frac{\log n}{2}$ or lower. Below the LCA their authentication paths may differ, but from the LCA up to the root their authentication paths are identical, so we're wasting space writing both in the proof.
Omitting the path from the LCA to the root from one of them will bring the proof size down to at most $150 \cdot n \cdot \log (n)$, which is a nice 25% improvement.

Implementing this optimization, as well as Lior's tweaks, is left - as they say in textbooks - as an exercise for the reader.


Modern Protocols




Modern ZK proving protocols, such as ZK-SNARKs, ZK-STARKs, Bulletproofs, Ligero, Aurora, and others, are often compared along these four axes:

  • What type of statements can be proved using the protocol.
  • How much space the proof takes up.
  • How long it takes to create a proof.
  • How long it takes to verify a proof.
Often the topic of trusted setup is discussed, but we won't get into that here.

Let's see how our toy-protocol fares:

Which statements can be proved?


In the toy-protocol, only knowledge of a solution to a Partition Problem instance could be proved. This is in contrast with most protocols, where one can prove knowledge of an input that satisfies some arbitrary arithmetic circuit, or even that a specific program ran for $T$ steps and produced a specified output (this is what ZK-STARKs do).
Well, you may say, if you can prove one NP-complete problem (and the Partition Problem is one) - you can prove them all, due to polynomial-time reductions. Theoretically speaking, you would be right. However, in the practical world of ZK proofs, all these manipulations have costs of their own, and conversions often incur a blow-up of the problem, since "polynomial reduction" is a theoretical term that can translate to an impractical cost. For this reason, modern protocols make an effort to take as input more expressive forms (such as arithmetic circuits and statements about computer programs).

Space


As the analysis showed, our proof takes up $O(n \log (n))$ space, whereas in most modern protocols, the proof size is somewhere between constant and polylogarithmic in $n$ (e.g. $O(\log ^2 (n))$).
This huge gap is what makes the proposed protocol nothing more than a toy example that - while demonstrating certain approaches and tricks - is useless for any real application.
You can trace this gap to the fact that we need a linear number of queries, each costing a logarithmic number of hashes (the Merkle authentication paths).
The approach I took was inspired by tricks from the ZK-STARK protocol, which is slightly more expensive than others in terms of proof size, but is expressive, requires relatively short prover time, and very short verifier time. In STARK, the lion's share of the proof is indeed comprised of Merkle authentication paths, but great care is taken so that the number of queries is minuscule.

Prover Running Time


In our protocol it is roughly $O(n \log (n))$, which is not far from modern protocols.


Verifier Running Time


In our protocol it is linear in the proof size, so $O(n \log n)$, which is not so good. Recall that, at least in the context of blockchains, a proof is written once but verified many times (by miners, for example). Modern protocols thus strive to make the verifier's workload as small as they possibly can without impeding soundness.





This concludes what I hoped to cover in this tutorial. It was fun to write and code. Let's do it again sometime. :)









Tuesday, October 2, 2018

A Hands-On Tutorial for Zero-Knowledge Proofs: Part III

A Zero Knowledge Merkle Tree


In the second post in this series, I presented the very neat concept of a Merkle Tree, but we ended with the problem that in its standard implementation - while being a powerful commitment scheme - a Merkle Tree is not zero knowledge.
As it turns out, this is very easy to fix by adding another level to the tree, such that in it, every pair of siblings is comprised of one node associated with actual data, and the other with a random string.
This is again the notion of mixing real data with randomness in order to obtain zero knowledge.

Here's how this will look with our "Yes Sir I Can Boogie" data:


This has the desired effect because whenever the prover has to reveal an authentication path for a piece of data, all the hashes revealed are affected by random data, and - having been mixed through the Sha256 hash - these hashes appear random and provide zero knowledge (other than the revealed leaf node).


Fixing The Code:

Tweaking the MerkleTree class from last time, we get:

class ZkMerkleTree:
    """
    A Zero Knowledge Merkle tree implementation using SHA256
    """
    def __init__(self, data):
        self.data = data
        next_pow_of_2 = int(2**ceil(log2(len(data))))
        self.data.extend([0] * (next_pow_of_2 - len(data)))
        # Intertwine with randomness to obtain zero knowledge.
        rand_list = [random.randint(0, 1 << 32) for x in self.data]
        self.data = [x for tup in zip(self.data, rand_list) for x in tup]
        # Create bottom level of the tree (i.e. leaves).
        self.tree = ["" for x in self.data] + \
                    [hash_string(str(x)) for x in self.data]
        for i in range(len(self.data) - 1, 0, -1):
            self.tree[i] = hash_string(self.tree[i * 2] + self.tree[i * 2 + 1])

    def get_root(self):
        return self.tree[1]

    def get_val_and_path(self, id):
        # Because of the zk padding, the data is now at id * 2
        id = id * 2
        val = self.data[id]
        auth_path = []
        id = id + len(self.data)
        while id > 1:
            auth_path += [self.tree[id ^ 1]]
            id = id // 2
        return val, auth_path

def verify_zk_merkle_path(root, data_size, value_id, value, path):
    cur = hash_string(str(value))
    # Due to zk padding, data_size needs to be multiplied by 2, as does the value_id
    tree_node_id = value_id * 2 + int(2**ceil(log2(data_size * 2)))
    for sibling in path:
        assert tree_node_id > 1
        if tree_node_id % 2 == 0:
            cur = hash_string(cur + sibling)
        else:
            cur = hash_string(sibling + cur)
        tree_node_id = tree_node_id // 2
    assert tree_node_id == 1
    return root == cur


Protocol Summary

To summarize the theory, the protocol by which the prover proves knowledge of a satisfying assignment to the Partition Problem is:
  1. The prover generates a witness (using get_witness from the first post in this series).
  2. The prover creates a ZK Merkle Tree from the witness, and sends its root-hash to the verifier.
  3. The verifier sends a random number $i$ to the prover.
  4. If $i < n$ then the prover sends to the verifier:
    1. The elements in places $i$ and $i + 1$ in the witness.
    2. The authentication paths proving that these answers are consistent with the root sent in step (2).
  5. If $i = n$ then the prover sends the first and the last elements in the witness, with the authentication paths etc.
  6. The verifier checks the authentication paths against the root, and the returned numbers against the problem instance, to verify properties (1) and (2) of the witness as they are described in the first post.
  7. The verifier returns true iff everything checks out.
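To make the summary concrete, here is a sketch of what the verifier does with a single query's response, reusing verify_zk_merkle_path from above (the function and argument names are mine, not part of the protocol):

def check_single_response(problem, root, i, val1, path1, val2, path2):
    # val1/path1 belong to witness position i; val2/path2 belong to the other
    # revealed position: i + 1, or position 0 when i == n.
    n = len(problem)
    witness_len = n + 1
    j = (i + 1) % witness_len
    ok = verify_zk_merkle_path(root, witness_len, i, val1, path1)
    ok &= verify_zk_merkle_path(root, witness_len, j, val2, path2)
    if i < n:
        ok &= abs(val2 - val1) == abs(problem[i])  # property (2) of the witness
    else:
        ok &= val1 == val2                         # property (1): first == last
    return ok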

What If The Prover Is Lying?


Clearly if everything is kosher, the verifier will see that it is (this is called completeness).
But what if the prover is dishonest? What is the probability that the verifier will catch on? (This is called soundness.)

Suppose the witness is kosher in all but one place, which is clearly the hardest case to catch. This means that in a single query, the verifier has a probability of $\frac 1 {n + 1}$ to expose the prover's deception.
If we repeat the protocol $k$ times, then the verifier's probability of catching a lying prover grows to $1 - (1 - \frac 1 {n + 1})^k$.
And if we set $k = 100(n + 1)$ then this is approximately $1 - \frac 1 {e^{100}}$ which is indeed very very very sure.
To give a sense of how sure that is, the prover's odds of convincing the verifier of a false claim are like the odds of flipping a coin and having it land on its edge 12 times in a row.
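If you want to play with these numbers yourself, the computation is one-liner material:

def catch_probability(n, k):
    # Probability that k independent queries expose a witness that is wrong
    # in exactly one of its n + 1 positions.
    return 1 - (1 - 1 / (n + 1))**k

# With k = 100 * (n + 1) queries this approaches 1 - e**(-100) regardless of n
# (floating point will simply round it to 1.0):
print(catch_probability(10, 100 * 11))
print(catch_probability(1000, 100 * 1001))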


Fiat-Shamir Heuristic


One must admit that it is somewhat cumbersome to have the prover and the verifier engage in such a long exchange of queries and responses. It means that whenever there's something to prove, both sides need to be available, online, and ready for this ping pong.

Luckily, a neat trick by Amos Fiat and Adi Shamir, known as the Fiat-Shamir Heuristic, allows us to take this protocol and convert it into a single long proof, that the prover generates once, and that everyone in the world can check afterwards.

The heuristic is based on the observation that in many protocols, and specifically in the one described here, the only messages that the verifier ever sends throughout the exchange are random numbers. 
So here's the basic idea:
  • The prover will simulate the verifier's side in the exchange, but will seed the random number generator it uses in a way that is on one hand random "enough", and on the other hand - replicable. 
  • The prover will write down the verifier's queries, and the prover's replies (with the authentication paths and all), one after the other, and documentation of the simulated exchange will be the proof!
  • After the desired number of queries have been simulated - the prover will send this single long proof to the verifier.
  • On the verifier side - the verifier will simulate the exchange, using the same replicable-randomness mechanism, which will convince it that the queries the prover asked itself were indeed random.

This smells like cheating, I admit. The prover asks itself and answers itself and sends this exchange to the verifier.
But to our aid comes the fact that hash functions behave, to all cryptographic intents and purposes, as if they were random number generators.

So when the prover needs to simulate the first query - it will feed the problem instance into a hash function, and use that to obtain a random number (e.g. take the hash mod n).
When the time comes to generate the second query, and all the subsequent queries - the prover will feed the proof that has been written up to that point into the hash function, and use that to obtain a random number.
Provided that the prover and the verifier agree which hash function they use - this is both random and replicable (since the verifier has the problem instance and the proof, used to seed the randomness) on both sides of the exchange.


Putting it All Together

And here's the code to obtain a proof and to check it:

def get_proof(problem, assignment, num_queries):
    proof = []
    randomness_seed = problem[:]
    for i in range(num_queries):
        witness = get_witness(problem, assignment)
        tree = ZkMerkleTree(witness)
        random.seed(str(randomness_seed))
        query_idx = random.randint(0, len(problem))
        query_and_response = [tree.get_root()]
        query_and_response += [query_idx]
        query_and_response += tree.get_val_and_path(query_idx)
        query_and_response += tree.get_val_and_path((query_idx + 1) % len(witness))
        proof += [query_and_response]
        randomness_seed += [query_and_response]
    return proof

def verify_proof(problem, proof):
    proof_checks_out = True
    randomness_seed = problem[:]
    for query in proof:
        random.seed(str(randomness_seed))
        query_idx = random.randint(0, len(problem))
        merkle_root = query[0]
        proof_checks_out &= query_idx == query[1]
        # Test witness properties.
        if query_idx < len(problem):
            proof_checks_out &= abs(query[2] - query[4]) == abs(problem[query_idx])
        else:
            proof_checks_out &= query[2] == query[4]
        # Authenticate paths
        proof_checks_out &= \
            verify_zk_merkle_path(merkle_root, len(problem) + 1, query_idx, query[2], query[3])
        proof_checks_out &= \
            verify_zk_merkle_path(merkle_root, len(problem) + 1, \
                                 (query_idx + 1) % (len(problem) + 1), query[4], query[5])
        randomness_seed += [query]
    return proof_checks_out


A Real Live Proof!


Running the following short script returns true (as the proof indeed checks out) and prints the proof:

def test(q):
    problem = [1, 2, 3, 6, 6, 6, 12]
    assignment = [1, 1, 1, -1, -1, -1, 1]
    proof = get_proof(problem, assignment, q)
    print(proof)
    return verify_proof(problem, proof)

And this is the proof we get (running "test(4)" for only 4 queries):

[['f9f3b1e40626e906b03eb9fd5428b2f2f801e8f3c23627fe7e52a645c3f32632', 3, 1, ['1b7f5356d043c6336c6614fcc24cb77f8807cd2f443b1b77e0002be6b96c40b6', 'a412af57af0b88cdb5cb3d0cbfcd739ebcc3c6fe0ac364db9490b4a208803101', '9f358dd0980f35ea86070d0fb12b2f5726857031ef56968005095cdb13e0a6f0', '05066ac05f174f1f98226c40889c566775592ec3807fbe080324447616773e18'], 7, ['cd6ee891c632e07ad468cd602c8d2d935356ca5901b21a75a2719d164a925382', '4cfc41b83cf64e0cf14a0ea8179aa67c6324699557c508dfc424604674805864', '4efb02f72dbc085ead53657647e893f3ceb29c9f81d411dd817f3be512cad632', '6cd4c16c3d5db473280b64f6b3fce54cb4b6810b46331899f4c07f884fd89aae']], ['580bd4db1071906bcd101600baf51d33b9930ba6e26853e85634bf38c0acef92', 6, 16, ['f8b28423de50f3b0cbcf88caacb6d4f6789ba3cecdc7791b38d5bbcd700ecbd2', '5c41ad0b9d813740b516cb61cf9ce06966efcf82e8ee5881ca86d5b18400d03d', 'af38f9a1873b70d113dab45d6312e6d2a7f4afa45a8c82ebe788abf63dd85650', 'a57a3ccb7cbffdf4d346f1ecf10ead43a4ce1e52b51170789698b7aece6c7687'], 4, ['b703a38bb22b758c5c23c08f096b6c3155c56885d57e1280ff521126282fa857', '4e602f00ef1e1de0b733f467de61805f09a1ebee8db72cc64c62dd8d55836de1', 'af38f9a1873b70d113dab45d6312e6d2a7f4afa45a8c82ebe788abf63dd85650', 'a57a3ccb7cbffdf4d346f1ecf10ead43a4ce1e52b51170789698b7aece6c7687']]]




One More Thing...


This series, in accordance with Hofstadter's law, turned out to be somewhat longer than I anticipated. However, I left out a few things that are important.

Among these are:
  1. Off-the-bat optimizations.
  2. Some discussion about proof length, running time, and what modern proof lengths (and running times) look like.
  3. A few simple tweaks, suggested by my colleague at Starkware, Lior Goldberg, to make the protocol truly Zero-Knowledge (because it isn't exactly there yet) and slightly more elegant.

So, although I promised a 3-part series, there will be a fourth part. But seeing how all the code is already here, we'll call it an appendix.

Monday, October 1, 2018

A Hands-On Tutorial for Zero-Knowledge Proofs: Part II

In the previous post we developed the witness for our proof. Simply put, it is a piece of data that the prover claims to have, and that the verifier will ask queries about. In the process we will develop the machinery required to force the prover to be - if not honest, then at least consistent.
Hopefully, our protocol will be such that the prover cannot make a false claim and be consistent about it.

Where were we?

Recall that in our setting, the prover claims knowledge of a (supposedly secret) satisfying assignment to a known Partition Problem instance. The protocol we developed so far was this:
  1. The verifier chooses a random value $i$.
  2. According to the value of $i$, the verifier asks for two values from the witness (usually two consecutive values, but sometimes the first and the last).
(check the previous post for exact details).

Indeed, if the prover is honest in its answers, and doesn't cheat or make mistakes, then after enough queries the verifier should be convinced. But hold on - if the prover is honest, why have this elaborate game of questions and answers? The verifier could just take the prover's word for it, and everyone would go home early.

But the prover may be a liar!!!


The entire raison d'être of ZK proofs is that we assume the prover may be dishonest. So, assuming that the prover knows the protocol that the verifier is using, it can simply send answers that will satisfy the verifier. If the verifier asks for two consecutive values, the prover will provide two random numbers whose difference matches, in absolute value, the verifier's expectation (i.e., the corresponding value in the problem instance), and if it asks for the first and last elements, the prover will just send some random number twice.


The Commitments


What we need here is a mechanism that will:
  1. Force the prover to write down all of the values of $p$ before the verifier asks about them.
  2. When asked, force the prover to return the required values from this previously written list.

This is a concept known in the world of cryptography as a commitment scheme.


A wonderful movie from 1991, I totally recommend, excellent music.


In our case, we're going to work with a 40 year-old commitment scheme, a Merkle Tree. It is a simple and brilliant idea.

A Merkle Tree is just a full binary tree, where each node is assigned a string:

  1. The leaves contain hashes (we'll use Sha256) of the data we'd like to commit to.
  2. Every internal node in the tree is assigned a string that is the hash of its two children's strings.

Suppose we want to commit to this list of four strings: ["Yes", "Sir", "I Can", "Boogie!"].
The Merkle tree will look something like this:


So node 4 is assigned the hash of the word "Yes", node 5 is assigned the hash of the word "Sir", and so on.
Also, every node $0 < i <  4$ is assigned the hash of the strings assigned to nodes $2i$ and $2i + 1$, concatenated.

The string assigned to the root of the tree (i.e. node #1) is referred to as the commitment.
That is because even the slightest change in the underlying data causes the root to change drastically. 
Here's what happens if we omit the exclamation mark from the word "Boogie" (the affected nodes are marked in red):


An even cooler property of Merkle Trees, is that one can prove that a certain string belongs to the underlying data, without exposing the entire data.

Authentication Paths


Suppose I would like to commit to the title of a 1977 song by the Spanish vocal duo Baccara. The title itself is kept a secret (!), but you can ask me about one of the words in the title (well, I put "I" and "Can" in the same leaf... but let's ignore this fine point).
To prove that I won't switch songs half way through our game, I send you the hash from the root node of a Merkle Tree I created.
You now ask me what is the second word in the title, to which I reply "Sir".
To prove that this answer is consistent with the hash I sent you before, I also send you the hashes of nodes 4 and 3. This is called the authentication path of node 5 (which contains the second word from the title).
You can now check that I'm not lying by:
  • Computing the hash of node 5 yourself (by hashing the word "Sir").
  • Using that and the hash I gave you of node 4 to compute the hash of node 2.
  • Using that and the hash I gave you of node 3 to compute the hash of the root node.
  • Compare the hash you computed with the one I originally sent you; if they match, it means that I didn't switch songs between the time I sent you the initial commitment and the time I answered your query about the second word in the title.

It is widely believed that given the Sha256 hash of some string $S_0$, it is infeasible to find another string $S_1 \neq S_0$ that has an identical Sha256 hash. This belief means that indeed one could not have changed the underlying data of a Merkle Tree without changing the root node's hash, and thus Merkle Trees can be used as commitment schemes.

Let's See Some Code


Recall that we need this machinery in order to commit to a list of numbers which we dubbed "the witness", and referred to as $p$ in the previous post.
So we need a simple class with a constructor that gets a list of numbers as input, constructs the necessary Merkle Tree, and allows the user to get the root's hash, and obtain authentication paths for the numbers in the underlying list.

We'll also throw in a function that verifies authentication paths. This function is independent of the class, since verification can be done simply by hashing.

Here's a somewhat naive implementation of a Merkle Tree:

import hashlib
from math import log2, ceil

def hash_string(s):
    return hashlib.sha256(s.encode()).hexdigest()

class MerkleTree:
    """
    A naive Merkle tree implementation using SHA256
    """
    def __init__(self, data):
        self.data = data
        next_pow_of_2 = int(2**ceil(log2(len(data))))
        self.data.extend([0] * (next_pow_of_2 - len(data)))
        self.tree = ["" for x in self.data] + \
                    [hash_string(str(x)) for x in self.data]
        for i in range(len(self.data) - 1, 0, -1):
            self.tree[i] = hash_string(self.tree[i * 2] + self.tree[i * 2 + 1])

    def get_root(self):
        return self.tree[1]

    def get_val_and_path(self, id):
        val = self.data[id]
        auth_path = []
        id = id + len(self.data)
        while id > 1:
            auth_path += [self.tree[id ^ 1]]
            id = id // 2
        return val, auth_path

def verify_merkle_path(root, data_size, value_id, value, path):
    cur = hash_string(str(value))
    tree_node_id = value_id + int(2**ceil(log2(data_size)))
    for sibling in path:
        assert tree_node_id > 1
        if tree_node_id % 2 == 0:
            cur = hash_string(cur + sibling)
        else:
            cur = hash_string(sibling + cur)
        tree_node_id = tree_node_id // 2
    assert tree_node_id == 1
    return root == cur


A few things to note:

  • Since it is a binary tree, we extend the underlying data to the next power of 2 (by padding with 0s), so that it matches the number of leaves of a full binary tree.
  • We store the root of the tree at index 1 and not 0, and its children at 2 and 3 and so on. This is just so that the indexing will be convenient, and the children of node $i$ will always be $2i$ and $2i + 1$.
  • This is far from optimal, but for our purposes (a blog post) clarity and brevity are more important.
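For completeness, here's a tiny usage sketch for the class and function above (the exact hex digests will, of course, depend on the hashed strings):

# Commit to the four strings, then open the second one.
tree = MerkleTree(["Yes", "Sir", "I Can", "Boogie!"])
root = tree.get_root()
val, path = tree.get_val_and_path(1)              # the word "Sir" and its authentication path
print(verify_merkle_path(root, 4, 1, val, path))  # True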

What About Zero Knowledge???

An observant reader will point out that when we provide the authentication path for node 5, we provide the hash of node 4. 
A snooping verifier may try hashing various words from the titles of songs of the Spanish vocal duo Baccara, and when it gets the hash we sent it as "the hash of node 4", it will have found out a leaf of the tree that we never intended to expose!

In the next post in the series, we'll deal with the ZK issue, using a simple but effective trick to get a Zero Knowledge Merkle Tree. 
Also, we'll hopefully tie everything together to get the proof we originally wanted!







Sunday, September 30, 2018

A Hands-On Tutorial for Zero-Knowledge Proofs: Part I

By the end of this series of posts we will have composed some code in Python (roughly 100 lines) that proves the satisfiability of Partition Problem instances with zero knowledge.

Three preliminary notes


  1. I will assume some basic Computer Science background, as well as familiarity with Python, but not much more. 
  2. I haven't seen this specific protocol in the literature (because I haven't searched for it), but it is a combination of well-known techniques in well-known ways, so I'm quite sure some variation of it is out there.
  3. For didactic reasons, we'll start with naive and sometimes broken implementations, and improve on them as we go along.


Background

I'm currently working at Starkware, developing some serious zero-knowledge proofs with some brilliant people, based on state-of-the-art research in the field, and we're usually hiring, so drop me a line if you're interested. 

This series, however, will deal with much more basic stuff, essentially Computer Science from the 1980s. For those familiar with contemporary protocols such as SNARKs, Bulletproofs, and STARKs - I am not going to present any of them; if you don't know what any of these are - fear not.
What I'm shooting for is less "Cheat sheet for modern ZK proofs" and more "ZK proofs for dummies".

With that in mind - let's get going.


Zero Knowledge Proofs

Zero Knowledge (AKA ZK) proofs are stories of the following type: side A states a claim and proves it to side B, after some deliberation between them, such that:
  1. B is sure that the claim is true with 99.99999% certainty.
  2. B has learnt nothing new in the process, other than that the claim is true.
This Wikipedia article contains an excellent explanation of the idea, with some concrete examples.
In this series I'll deal with ZK arguments of knowledge, which are not exactly the same as proofs, but they're close enough. In short: a ZK proof can be trusted completely, even if the side trying to prove its claim (usually referred to as "the prover") has unlimited computational power. A ZK argument-of-knowledge can be trusted under the assumption that if the prover indeed tries to cheat, it is polynomially bounded (if you use credit cards on the internet, then you already assume that, btw).
In the world of ZK proofs, the other side of the exchange is often called "the verifier". I'll stick to this terminology here.

The Partition Problem

Given a sequence of numbers $a_0, a_1, ..., a_{n-1}$, can one partition this sequence into two subsets that sum up to the same number?
If the sequence in question is $1, 9, 8, 0, 2, 2$, then the answer is clearly yes since $2 + 9 = 8 + 1 + 2 + 0$.
However if the sequence is $2, 3, 4, 5, 6, 7$, then the answer is clearly no, since the sum is odd, and therefore there cannot be two subsets summing exactly to half of it each (the numbers are all integers).
While these are simple enough instances, in general this problem is NP-complete (though it has a pseudo-polynomial algorithm).




Let's Start Proving!


Suppose we have a python list $l$ of numbers, that defines our Partition Problem instance. We'll say that another list $m$ is a satisfying assignment if:
  1. len(m) == len(l).
  2. All of the elements in $m$ are either $1$ or $-1$.
  3. The dot-product of $l$ and $m$ is 0.
Note that this is equivalent to the statement of the partition problem, if we think of a '1' in $m$ as assigning its corresponding number in $l$ to the left side of the equation, and '-1' as assigning it to the right side.

Given $l$, a proof for its satisfiability can be given by revealing $m$, but that would violate the ZK requirement.

Let's write down the list of partial sums of the dot product of $l$ and $m$.
Mathematically speaking, let $p_i := \sum _{0\leq k<i} l[k] \cdot m[k]$.

So if $l = [4, 11, 8, 1]$, and $m = [1, -1, 1, -1]$, then $p$ will be one element longer: $p = [0, 4, -7, 1, 0]$.
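In code, computing $p$ (before any zero-knowledge tricks) takes a couple of lines; this little sketch reproduces the example above:

from itertools import accumulate

l = [4, 11, 8, 1]
m = [1, -1, 1, -1]
# p is the list of partial sums of the dot product, starting from 0.
p = [0] + list(accumulate(a * b for a, b in zip(l, m)))
print(p)  # [0, 4, -7, 1, 0]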

Note that $p$ now has two interesting properties, if $m$ is indeed a satisfying assignment:
  • (property 1 of p) It starts and ends with 0.
  • (property 2 of p) For every $0\leq i < n$, we have $|l[i]| = |p[i+1] - p[i]|$.

So here's a first draft for a ZK protocol:
The verifier chooses a random $0 \leq i \leq n$.
If $i = n$, the verifier asks the prover to provide $p[0]$ and $p[n]$ and checks that they are both 0.
Otherwise, the verifier asks the prover to provide $p[i]$ and $p[i+1]$ and checks that indeed $|l[i]| = |p[i+1] - p[i]|$ (recall that $l$ is known to the verifier, as part of the claim made by the prover).


What if the prover is lying???

The above contains an implicit assumption that when the verifier asks the prover to provide some data, the prover will indeed provide it honestly. We don't want to assume that, but we postpone dealing with this issue to the next post. For now, let's assume everything is kosher.

This doesn't prove anything!

An observant reader will probably point out that asking about a single element doesn't mean much. And that's true - we'd like to ask many queries, and after enough of them we'll be certain that the claim is true. We'll quantify this more accurately in the third (and last) post.

This is not Zero-Knowledge!

Each query reveals something about $m$, and so it is not zero-knowledge. Consequently, after enough queries - $m$ can be completely revealed. 
That's terrible! Let's fix it.

Manufacturing Zero-Knowledge 

Mathematically speaking, we usually say that something provides no new information, if it appears random, or more precisely - if it is uniformly distributed over some appropriately chosen domain. Without getting into the exact definition, this means that to make something ZK, we mix it with randomness. So here's how we do it here.
  1. Instead of using $m$ as it was given to us, we flip a coin. If it shows heads, we leave $m$ as it is; if it shows tails, we multiply all of $m$'s elements by $-1$. Note that since its elements were initially $-1$ and $1$, and its dot product with $l$ was 0, this does not change its dot product with $l$ at all.
  2. We choose a random number $r$ and add it to all the elements of $p$. This does not affect the second property of $p$, but it changes the first property: the first and last elements of $p$ may no longer be zero. However, they must still be identical to one another.

Now suppose that before each query - we recompute this randomness (i.e. - flip the coin and change $m$, and choose a random number $r$ and add it to the elements of $p$).

If we choose $r$ carefully, then indeed, every two consecutive elements of $p$ will differ (in absolute value) by the corresponding element in $l$ but look otherwise random. 

So, here's the first piece of code we'll need, something that takes a problem (i.e. $l$) and a satisfying assignment (i.e. $m$) and constructs a witness (i.e. $p$) that will attest to the satisfiability of the problem instance:

import random

def get_witness(problem, assignment):
    """
    Given an instance of a partition problem via a list of numbers (the problem) and a list of
    (-1, 1), we say that the assignment satisfies the problem if their dot product is 0.
    """
    sum = 0
    mx = 0    
    side_obfuscator = 1 - 2 * random.randint(0, 1)
    witness = [sum]
    assert len(problem) == len(assignment)
    for num, side in zip(problem, assignment):
        assert side == 1 or side == -1
        sum += side * num * side_obfuscator
        witness += [sum]
        mx = max(mx, num)
    # make sure that it is a satisfying assignment
    assert sum == 0
    shift = random.randint(0, mx)
    witness = [x + shift for x in witness]
    return witness
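For example, with the instance from earlier in this post (the exact output will vary from run to run because of the coin flip and the random shift):

problem = [4, 11, 8, 1]
assignment = [1, -1, 1, -1]
print(get_witness(problem, assignment))
# A possible output: [7, 11, 0, 8, 7] - consecutive differences are 4, 11, 8, 1
# in absolute value, and the first and last elements are equal.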


This post didn't have nearly as many images as a blog post should have. Here's one to make up for it:


Wednesday, February 21, 2018

Hierarchical Hierarchical Clustering: A Concept So Nice, We Named It Twice

While working at Resonai, I wrote a piece of code that performs Hierarchical Clustering, in collaboration with David Lehavi. In addition to various optimizations I won't get into, we applied a nice heuristic that allowed a considerable improvement in the program's memory footprint, as well as in its running time. The more formal name we gave it was Spatially Sensitive Hierarchical Clustering (SSHC), but we ended up referring to it as Hierarchical Hierarchical Clustering, which is funnier, and better reflects what's really going on.

Hierarchical Clustering in a Nutshell

The following image, taken from the Wikipedia article on HC, illustrates the basic notion rather well:


Suppose we have 6 items that we wish to cluster, and we have some metric that we can compute between subsets of those items. We get a hierarchy of clusters via the following greedy algorithm:
  1. Let each item be a cluster (with a single element)
  2. Let $S$ be the set of all clusters
  3. While $|S| > 1$:
    1. Find the closest pair of clusters $X, Y$ in $S$.
    2. Define a new cluster $Z := X \cup Y$
    3. Add $Z$ to $S$, remove $X$ and $Y$ from $S$
So the diagram above implies the following possible sequence of cluster unions (it may be slightly different):
  1. $\{b\} \cup \{c\}$
  2. $\{d\} \cup \{e\}$
  3. $\{d,e\} \cup \{f\}$
  4. $\{b,c\} \cup \{d,e,f\}$
  5. $\{a\} \cup \{b,c,d,e,f\}$
There are a number of obvious optimizations; for example, one does not have to recompute all distances after each creation of a new cluster - only the distances that involve the new cluster.
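For concreteness, here is a deliberately naive Python sketch of the greedy loop (dist is assumed to be whatever metric you care about between two clusters, represented here as frozensets of items):

def naive_hierarchical_clustering(items, dist):
    # Greedy agglomerative clustering: repeatedly unite the closest pair of
    # clusters. Returns the list of unions performed, in order.
    clusters = {frozenset([x]) for x in items}
    merges = []
    while len(clusters) > 1:
        x, y = min(
            ((a, b) for a in clusters for b in clusters if a != b),
            key=lambda pair: dist(*pair))
        clusters -= {x, y}
        clusters.add(x | y)
        merges.append((x, y))
    return merges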

What is It Good For?

In most of the use cases that we employed, the initial atoms were triangular facets of a 3D mesh. One such use case is mesh segmentation, that is, the process of breaking down a 3D object into meaningful sub-parts. There's a well-known paper from 2006 by Attene et al. that describes such an approach. The metric chosen for the distance between segments (= clusters) is how much they resemble certain primitive shapes the authors chose in advance (cylinder, cube, sphere, etc.).
As can be seen in this image taken from the paper, once one has the full hierarchy of segments, this tree can be trimmed according to the number of clusters one wishes.
source: "Hierarchical mesh segmentation based on fitting primitives" by Attene et al

The Problem With HC

The natural approach we took was to consider this not as a set of atoms where all pairwise distances needed to be considered, but as a graph, where only distances between neighboring vertices had to be considered. In the 3D mesh case, adjacent faces were represented by neighboring nodes in the graph. So now we simply put all the edges of the graph (and the respective distances) into some implementation of a min-heap, and began the simple process of:
  • Extracting the minimal edge
  • Uniting the clusters it connected
  • Updating the graph (and the edges in the heap) accordingly
This became an issue when the number of items stored in the heap was in the millions and tens of millions, which is very much the case when you get a high-quality 3D scan of a room, for example. It turned out that operations on the heap became unbearably expensive, and the memory footprint was terrible, since we had to store this huge heap in RAM.

The Solution: HHC

We realized that changes in the HC tree were almost always very local - one computes the distance between pairs of clusters, which are neighboring nodes in the graph, and sometimes unites them in a way which affects only the neighbors of the united clusters. So why not divide and conquer?
What we ended up doing is this:
  • Break down the model very crudely into K parts, by doing something that is not far from just putting it on a grid and taking cubes. Choose K such that each part will have not too many facets in it, but not too few.
  • Run HC on each part separately until the number of clusters in the part becomes small.
  • Now unite adjacent parts until the number of clusters in each part is again not too big but not too small.
Note that this is in effect doing a Hierarchical Clustering of the parts, hence the winning name.
Also note that it effectively means you work on a number of very small heaps all the time, and never on a very large one. This means heap operations are now considerably cheaper; indeed, the memory footprint went down by a factor of 5 or so, and the running time improved dramatically on machines with slower hard drives (since less swapping was involved).

The Price: Accuracy

As often happens, the price of heuristics is that they mess up the algorithm's accuracy. In our case it means that if the minimal edge happens to connect atoms that lie in different parts, you will get to it much later than you otherwise would have, but the reduction in running time and resource consumption made it worth our while.