Sunday, February 8, 2026

New York City traffic and the future of software development



 It is a truth universally acknowledged that code expands to fill the available capacity to debug it. We’ll get back to software in a minute, but let’s talk about a man first.

Robert Moses was a uniquely powerful, smart, and corrupt figure in mid-20th-century New York City and New York State. He was an unbelievably prolific builder who channeled various public and private resources into the parkways, beaches, and bridges of the New York metropolitan area by every means possible. He had no problem lying, ruining people’s lives, and upending communities if it served his purposes. Under his leadership, the city gained many more roads, wider and better than before. It had more bridges connecting its various parts. It had more parkways.

Robert Moses and his subordinates believed that all this would alleviate the traffic problems the city was already experiencing with the mass production of cars. To their surprise, as they built more roads, more cars seemed to pop up. This offset the change in capacity and left congestion unchanged. They initially believed this to be a good sign. It meant that many people were suddenly able to use the roads, and they assumed the eventual result would be a smoother flow of traffic.

They were wrong.

More roads, they learned slowly, meant more, not less, traffic. This was counterintuitive at the time, but it is well understood today that this is how traffic in heavily populated areas works. It becomes congested up to a point, and adding roads provides a very temporary solution to the problem, if that.

Academics caught wind of this phenomenon and felt the immediate need to name it. Around 1962, economist Anthony Downs formulated "Downs's Law of Peak-Hour Traffic Congestion." By the '90s, the general case became known as induced demand.

For some things, it seemed, increasing the underlying resource’s availability would never be enough.

Fast forward to emerging software development patterns in light of advances in AI. We appear to be more efficient and more productive. We need fewer people to do the same job, and we do it faster than ever. We generate code. More and more code.

The more code we generate, the more bugs there are in the world. The more bugs we find, the more work these code-generating machines will have. To fix them, they will generate more code, and round and round it goes. Therefore, in this software engineer’s opinion, reports of our profession’s death have been greatly exaggerated. The industry overall will end up needing similar numbers of people. The increase in efficiency we currently feel will soon be squashed under masses of code piling ever higher and deeper. There will always be just one last bug to solve, for which the AI apologizes humbly. "It was indeed an oversight on my part," it will say. "I will fix it immediately. Is there anything else you’d like to talk about?"


Sunday, July 27, 2025

The Stravinsky Principle in Software Engineering

(A frame from "Fantasia", set to Stravinsky's "The Rite of Spring", depicting two developers arguing about serialization formats, seconds before the AI meteor hits and wipes us all out.)

---

About 20 years ago, I attended a jazz masterclass by pianist Kenny Werner. That’s where I first heard Stravinsky’s famous line: 
“The more constraints one imposes, the more one frees oneself.”
Werner (and, I assume, Stravinsky) was talking about creativity—how limitations can be a catalyst for it. It’s a beautiful quote that resonates with my own experience. But recently, I’ve started thinking about it in a different way, not as something that fosters creativity, but rather as a hallmark of solid engineering.

---

The gist:
The more technical constraints I have on what I can do, the more freedom I have to not worry about breaking things, and the more confidently I can change code and behavior in the future.

---

Programming languages as built-in constraints


Languages are human inventions, and their design choices reflect certain beliefs about how code should be written.

In C, you must declare every variable’s type explicitly. That’s a constraint.

In C++, you can often use `auto` and let the compiler infer the type. In Go, you can skip some of that too, but you still need to compile before running—another constraint.

In Python? No declarations. No compilation. Just run it. If types don’t match, the interpreter will figure it out when (or if) it gets there. Amazing!

---

At first, that feels great—less ceremony, faster iteration, fewer keystrokes. But once you start working on anything bigger than a toy script, it bites back. Reading unfamiliar code becomes a chore—good luck figuring out what type that variable is supposed to be. And if you make a type-related mistake in an execution branch that isn’t covered by tests, you’ll only find out in production, because there’s no “annoying compiler” to catch these things upfront.

---

Yes, Python has type annotations, smart IDEs, and libraries like pydantic. But that sort of proves the point: as Python grew into a language for large systems, people added tools to enforce constraints—because constraints are useful. They give you freedom.
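To make that concrete, here is a minimal sketch of what those bolted-on constraints buy you (the function and the bad call are invented for illustration, not taken from any real codebase):

    # With annotations, a checker like mypy or a smart IDE flags the mismatch
    # before the code ever runs; without them, the mistake surfaces only on
    # the execution branch that happens to hit it.
    def total_price(prices: list[float], discount: float) -> float:
        return sum(prices) * (1 - discount)

    # total_price([10.0, 20.0], discount="0.1")  # flagged statically; otherwise a runtime TypeError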

---

Same thing with serialization


Take something like Protocol Buffers. You define a schema, compile it into code, and make changes in controlled ways. Lots of work upfront!
Or you can just dump objects to JSON and move on—fast, simple, human-readable.

But then time passes. Someone adds a field, renames another, and suddenly our new code needs to read old data on disk… good luck figuring out what the old format was.

That up‑front “pain” of a strict schema gives you long‑term freedom: easier debugging, safer migrations, faster changes, and fewer surprises in production.
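A tiny Python sketch of the schema-less failure mode described above (the field names are invented for illustration):

    import json

    # Version 1 of the code wrote records like this to disk:
    old_record = json.dumps({"name": "Ada", "phone": "555-1234"})

    # Version 2 renamed "phone" to "phone_number"; nothing forces anyone to
    # notice, so every old record silently comes back without a phone number.
    record = json.loads(old_record)
    phone = record.get("phone_number")  # None, with no error anywhere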

---

Stravinsky for software engineers


So here’s my takeaway:

The more constraints we embrace early on, the freer we are later.

Languages with stricter typing, serialization formats with schemas, databases with defined structure—these may feel like a burden while writing code.
But those constraints are what make future maintenance, debugging, and extension so much easier.

Friday, December 30, 2022

McHeron Algorithm for Square Root

Heron

Famously, Heron's method of computing the square root of some number x goes like this:

1. Guess the root, we'll denote the guess by g

2. Compute the quotient q = x/g

3. Update your next guess to be the arithmetic mean of q and g. So g := (x/g + g)/2

4. Goto 2

This can be viewed as a special case of the Newton–Raphson method, and indeed the results become very accurate very fast.
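In code, the whole loop is just a few lines (a sketch using floating point for simplicity):

    def heron_sqrt(x, guess, rounds=10):
        g = guess
        for _ in range(rounds):
            g = (x / g + g) / 2  # next guess: arithmetic mean of g and x/g
        return g

    print(heron_sqrt(2, 1.5))  # 1.41421356..., accurate after just a few rounds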

Mckay

Somewhat less famous is the anecdote about Mckay's theorem.
In this paper by Laurence Sherzer, the author tells the story of an 8th grader (in 1973) named Robert Mckay, who suggested a way to find a number that lies between two fractions. Suppose the fractions are a/b and c/d; Mckay's suggestion is to use (a+c)/(b+d).
This is a mathematical education paper, so its focus is on how the class and the teacher approached the seemingly weird idea of adding numerators and denominators, and how they explained to themselves why this seems to 'magically' work.
The short version of the explanation is: this amounts to taking a weighted average of the two fractions, where the weights are the denominators. When the denominators are equal, this is exactly the arithmetic mean. When they differ by a lot - the result is much closer to the fraction with the larger denominator.
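Spelled out as a one-line identity, the 'magic' is just:

    (a + c)/(b + d) = (b·(a/b) + d·(c/d)) / (b + d)

so the mediant is the average of a/b and c/d, weighted by their denominators b and d.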

McHeron?

Since taking the arithmetic mean of two fractions, as Heron's method requires, is annoying at times, why not use Mckay's very simple method instead, dubbing the result the McHeron Algorithm?
Mathematically, it will give an increasingly better guess in each round, even though it may not be the arithmetic mean.
How would that look?
Suppose we want to compute the square root of 2:
1. Our first guess is 3/2
2. The quotient is 2/(3/2) which is 4/3 ('just invert and multiply' as they say)
3. Our next guess needs to be between 3/2 and 4/3, so we use Mckay's method to get 7/5
4. Our next guess is therefore 7/5
5. The next quotient is 2/(7/5) which is just 2*5/7 = 10/7
6. Using Mckay's theorem - our next guess will be (7+10)/(5+7) = 17/12
7. The next quotient is 2/(17/12) which is just 2*12/17 = 24/17
8. Using Mckay's - (17+24)/(12+17) = 41/29

And indeed 41/29 = 1.41379...
which approximates the square root of 2 to 2 decimal places, which is not bad.
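Here is the same iteration as a short Python sketch (using the standard fractions module; the function name is mine):

    from fractions import Fraction

    def mcheron_sqrt(x, guess, rounds):
        g = Fraction(guess)
        for _ in range(rounds):
            q = Fraction(x) / g                          # the Heron quotient x/g
            g = Fraction(g.numerator + q.numerator,      # Mckay's mediant instead
                         g.denominator + q.denominator)  # of the arithmetic mean
        return g

    print(mcheron_sqrt(2, Fraction(3, 2), rounds=3))  # 41/29, as in the example above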

Cool, isn't it?

But of course, life is never quite so simple

I specifically chose a nice example where the algorithm converges relatively fast.
Had I chosen 715 (whose square root is 26.7394...) as the initial number, it would have taken me 100 iterations of the algorithm just to get to 26.76..., and I would be dealing with denominators with 145 digits by then.
Why would the algorithm perform so badly?
Well, remember how Mckay's method introduces a bias toward large denominators? In the algorithm above, the quotient's denominator ends up larger than the guess's denominator by a factor of roughly the square root of the input number, so the larger the input number, the greater the bias toward the quotient in the 'Mckay' step of the calculation.

Heron's algorithm has quadratic convergence (which is amazing), and I don't think it would be difficult to show exactly how bad the convergence of the McHeron algorithm is (I suspect that asymptotically, to get decent results, you would have to run as many iterations as the magnitude of the result, so O(2^(n/2)) iterations for an input with n bits, which is terrible), but I should go shopping for the weekend.

Silver Lining 

For small numbers (with small denominators) this works nicely, for the same reason: when the denominators are close, Mckay's method gives a decent approximation of the arithmetic mean.

That's all. I'd love to hear thoughts in the comments, or links if you've seen this exact method somewhere else - I haven't (and for good reason: it yields terrible results in most cases).

Update: a Twitter Tweak

The user @k_yaakov suggested on Twitter bringing the denominators to roughly the same scale by multiplying by the appropriate power of 10. Since determining the correct power of 10 and multiplying by it can be done by counting the number of digits and appending zeros, respectively, this is very easy to do in practice. The consequence is that the weights in Mckay's weighted mean now differ by a factor of 10 at most, which bounds the bias and yields a significantly better convergence rate.
Taking the adversarial example of 715 from before: getting to 2 decimal places of precision now requires only 17 iterations, compared to over 100 iterations in the previous version of McHeron. Very nice!
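Here is a sketch of the tweak as I understand it (the helper name is mine):

    # Scale the fraction with the smaller denominator by a power of 10 so that
    # both denominators have the same number of digits, then take the mediant
    # as before. The values are unchanged, but the mediant's weights now differ
    # by a factor of 10 at most.
    def balanced_mediant(a, b, c, d):
        shift = len(str(d)) - len(str(b))
        if shift > 0:
            a, b = a * 10**shift, b * 10**shift
        elif shift < 0:
            c, d = c * 10**(-shift), d * 10**(-shift)
        return a + c, b + d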

Monday, January 31, 2022

3 Golden Rules

"Just as I cannot step in the same river twice,  the person who created this bug and I are two separate beings in the fabric of the universe. Therefore, I will also not be the one who fixes it."

         Heraclitus of Ephesus, a programmer


I don't like code manifestos.

Rina Artstain once tweeted that we only think we know certain things about parenting because, as luck would have it, we haven't had a child to whom they don't apply.
That's how I feel about code manifestos.

However, I do have a manifesto of my own, short though it may be, so I want to frame it differently. What follows are not the three rules I think everyone should follow to get "cleaner" code or to improve some other metric of goodness. Rather, these are three things I like to see in other people's code, and that annoy me when I see them blatantly disregarded.

Here we go.

1.

"לֹא עָלֶיךָ הַמְּלָאכָה לִגְמוֹר, וְלֹא אַתָּה בֶן חוֹרִין לִבָּטֵל מִמֶּנָּה."

פרקי אבות, ב' טז'

"It is not incumbent upon you to finish the task, but neither are you free to absolve yourself from it."

Pirkei Avot 2:16

This is one of the greatest quotes I know and it sounds so much better in Hebrew.


2.

"Entities should not be multiplied beyond necessity."

William of Ockham

Yes, this is the OG Ockham's razor. How amazing it is that a Franciscan friar who lived 700 years ago foresaw the harms of overusing OOP and polymorphism.


3.

"Writing good code is superior to writing bad code.
Deleting code is superior to writing good code.
Superior to all is concluding that some code does not have to be written at all."

Confucius

Well, not Confucius, but if one wants people to listen, one should attribute their thoughts to someone famous, preferably dead.

I'm pretty sure Oscar Wilde said that one. 




Tuesday, June 23, 2020

Math Shmath - My Podcast

During the first COVID-19 days, I had a couple of weeks' hiatus between two jobs, which I used to create a popular-math podcast in Hebrew that I call "Math Shmath".
There are only 5 episodes so far, but I hope to record some more in the future.
The intended audience is people with no formal math background, but through a listener survey I found that about half of the respondents had at least undergraduate-level studies in math or the sciences.

מתמטיקה שמתמטיקה

The podcast has had over 5k downloads so far, and if you are a Hebrew speaker, I suggest you give it a try. It is available on iTunes, Spotify, and all the usual places, and I'm told that kids with math tendencies, ages 10-15, like it.





Saturday, October 26, 2019

Zero Knowledge Proofs Via A Toy Example: An Overdue Video

I wrote a hands-on tutorial for Zero-Knowledge proofs more than a year ago on this blog (in four parts: I, II, III and IV).
It subsequently turned into a talk at the fifth Algorithms IL meetup, which was recorded on video.
So here's the video, courtesy of the Algorithms IL team, and Tzipi Zanyovka from Waze, which hosted the event.

Thursday, October 4, 2018

A Hands-On Tutorial for Zero-Knowledge Proofs: Appendix

This is the final part of this hands-on tutorial. I will assume from now on that you have read Part I, Part II, and Part III of this series.

As promised, this post will deal with:

  1. Some tweaks to the protocol presented in the previous posts.
  2. A complexity analysis of the protocol.
  3. A small optimization.
  4. A few words about modern ZK proving protocols.


Lior's Tweaks


A colleague at Starkware, Lior Goldberg, pointed out a caveat in the zero-knowledge aspect and suggested a nice simplification of the protocol. Here they are:

A ZK Fix


Suppose the first two numbers in the problem instance are 5 and 6, and that the random shift $r$ is chosen from the range $0..10$.
Now if the random query $i$ happens to be $1$, then the prover is required to reveal the second and third elements in the witness. If they happen to be 15 and 21, then the verifier knows immediately that 5 and 6 (from the problem instance) belong to the same side in the solution. This violates the zero knowledge property that we wanted.
This happened because we chose uniformly at random from a very small range, and $r$ happened to be the maximal number in that range.

There are two ways to solve this. One is by choosing some arbitrary number and doing all computations modulo that number. A simpler way would be choosing $r$ from a huge domain, such as $0..2^{100}$, which makes the probability of getting a revealing $r$ negligible.
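As a sketch, the second option is a one-liner (in real code a cryptographic randomness source such as the secrets module would be preferable):

    import random

    # Drawing the shift from a huge domain makes a revealing value of r
    # vanishingly unlikely.
    r = random.randint(0, 2**100)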


Simplify By Having A Cyclic List


Our witness originally had $n + 1$ elements: the first was a random number, and the rest were the partial sums of the dot product of the problem and the assignment (shifted by that initial random number).
This meant we had two types of queries: one checked that two consecutive elements in the list differ, in absolute value, by the corresponding element in the problem list; the other checked that the first and last elements are equal.

As Lior pointed out, it is much more elegant to omit the last element from the witness entirely, and if $i = n$ - check that the first and last elements in the witness differ, in absolute value, by the last element in the problem instance. Essentially, this is like thinking of the witness as cyclic. The nice thing about this is that now we only have one type of query - a query about the difference between two consecutive elements (indices taken modulo n) in the witness.
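In code, the single check might look something like this (a sketch in the spirit of the tutorial's Python, with 0-based indices; the function name is mine):

    # `witness` now has n elements, `problem` is the list of n input numbers.
    def check_query(problem, witness, i):
        n = len(problem)
        return abs(witness[(i + 1) % n] - witness[i]) == abs(problem[i])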


Proof Size / Communication Complexity


We'd like to analyze the size of the proof that our code generates. This is often referred to as communication complexity, because the Fiat-Shamir Heuristic (described in Part III) transforms the messages of an interactive protocol into a proof, making these two terms interchangeable in this context.

So, for each query, the proof stores:
  • The value of i.
  • The value of the $i$-th element in the witness and of the $((i + 1) \bmod n)$-th element.
  • Authentication paths for both elements.
The authentication paths here are the heavy part. Each of them is a $\log(n)$-element long list of 256-bit values.
As was discussed in the last post, to get a decent soundness, the number of queries has to be roughly $100n$. 
Putting these two together, the proof size will be dominated by the $\sim 200 \cdot n \cdot \log(n)$ hashes that form the authentication paths.
So a proof that one knows an assignment to a Partition Problem instance with 1000 numbers will require roughly $2,000,000$ hashes, which translates to about 64 megabytes of data.
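For the record, here is that arithmetic as a tiny script (just the rough estimates from above, nothing taken from the actual implementation):

    from math import log2

    n = 1000
    queries = 100 * n                # queries needed for decent soundness
    hashes = 2 * queries * log2(n)   # two authentication paths per query
    print(round(hashes))             # ~2,000,000 hashes
    print(round(hashes * 32 / 1e6))  # 256-bit (32-byte) hashes -> ~64 MB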


Small Merkle Optimization


Since Merkle authentication paths, somewhat surprisingly, make up the vast majority of the proof, maybe we can reduce their number by a little.

Note that all queries (except for one) ask about consecutive leaves in the tree. 
Consecutive leaves have their LCA (least common ancestor), on average, at height $\frac {\log n} {2}$. Up to the LCA, their authentication paths may differ, but from the LCA up to the root, their authentication paths are identical, so we're wasting space writing both in the proof.
Omitting the path from the LCA to the root from one of them leaves roughly $1.5 \cdot \log(n)$ hashes per query instead of $2 \cdot \log(n)$, bringing the proof size down to $150 \cdot n \cdot \log (n)$, which is a nice 25% improvement.

Implementing this optimization, as well as Lior's tweaks, is left - as they say in textbooks - as an exercise for the reader.


Modern Protocols




Modern ZK proving protocols, such as ZK-SNARKs, ZK-STARKs, Bulletproofs, Ligero, Aurora, and others, are often compared along these four axes:

  • What type of statements can be proved using the protocol.
  • How much space the proof takes up.
  • How long it takes to create a proof.
  • How long it takes to verify a proof.
Often the topic of trusted setup is discussed, but we won't get into that here.

Let's see how our toy-protocol fares:

Which statements can be proved?


In the toy protocol, only knowledge of a solution to a Partition Problem instance could be proved. This is in contrast with most modern protocols, where one can prove knowledge of an input that satisfies some arbitrary arithmetic circuit, or even that a specific program ran for $T$ steps and produced a specified output (this is what ZK-STARKs do).
Well, you may say, if you can prove one NP-complete problem (and the Partition Problem is one) - you can prove them all, due to polynomial-time reductions. And theoretically speaking, you would be right. However, in the practical world of ZK proofs, all these manipulations have costs of their own, and conversions often incur a blow-up of the problem, since "polynomial reduction" is a theoretical term that can translate to an impractical cost. For this reason, modern protocols make an effort to take as input more expressive forms (such as arithmetic circuits and statements about computer programs).

Space


As the analysis showed, our proof takes up $O(n \log (n))$ space, whereas in most modern protocols, the proof size is somewhere between constant and polylogarithmic in $n$ (e.g. $O(\log ^2 (n))$).
This huge gap is what makes the proposed protocol nothing more than a toy example that - while demonstrating certain approaches and tricks - is useless for any real application.
You can trace this gap to the fact that we need a linear number of queries, each costing a logarithmic number of hashes (the Merkle authentication paths).
The approach I took was inspired by tricks from the ZK-STARK protocol, which is slightly more expensive than others in terms of proof size, but is expressive, requires relatively short prover time, and has very short verifier time. In STARK, the lion's share of the proof is indeed comprised of Merkle authentication paths, but great care is taken so that the number of queries is minuscule.

Prover Running Time


In our protocol it is roughly $O(n \log (n))$, which is not far from modern protocols.


Verifier Running Time


In our protocol it is linear in the proof size, so $O(n \log n)$, which is not so good. Recall that, at least in the context of blockchains, a proof is written once but verified many times (by miners, for example). Modern protocols thus strive to make the verifier's workload as small as possible without impeding soundness.





This concludes what I hoped to cover in this tutorial. It was fun to write and code. Let's do it again sometime. :)