The idea in that other answer as modified per this comment also works, but is prone to collisions for messages differing by few consecutive bytes, contrary to all the methods considered in the present answer.

I'll consider only a non-adversarial model for the requirement of a low collision probability; that is, we are considering naturally-occurring strings only (which implies they are of bounded size; I'll limit it to $2^{64}-1$ bits, over 2305 petabytes). However, I'll consider that we need to reliably detect strings that differ only in a small consecutive segment. Caveat: in the following, hash is thus not used in its default meaning in cryptography.

 

My (new) proposed $h(s)$ is $256$-bit ($32$ bytes), among which the first $64$ bits are the number of bits in $s$ (truncated to the lower $64$ bits); and the last $192$ bits are a $192$-bit Cyclic Redundancy Check (CRC) of $s$, that is, the coefficients of $s(x)\bmod p(x)$, where $s(x)$ is the binary polynomial with coefficients the bits of $s$, and $p(x)$ is a fixed primitive polynomial of degree $192$. Conversions from bitstring to integer or polynomial, and back, are big-endian (the last bit corresponds to the least significant bit of the integer or the constant term of the polynomial).
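
In Python, a minimal sketch of that definition (bit-at-a-time for clarity, not speed; the function and variable names are mine). A bitstring is modeled as a list of bits, most significant first, and a polynomial is stored as an integer whose bit $i$ is the coefficient of $x^i$:

```python
# Assumes p(x) = x^192 + x^149 + x^97 + x^46 + 1 (see below for this choice).
P_LOW = (1 << 149) | (1 << 97) | (1 << 46) | 1   # p(x) minus its x^192 term
DEG = 192
MASK192 = (1 << DEG) - 1

def crc192(bits):
    """s(x) mod p(x) for a bitstring given as a list of bits, first bit = highest degree."""
    r = 0
    for b in bits:
        carry = r >> (DEG - 1) & 1          # coefficient of x^191 before multiplying by x
        r = ((r << 1) | b) & MASK192
        if carry:                           # x^192 = x^149 + x^97 + x^46 + 1 (mod p)
            r ^= P_LOW
    return r

def h(bits):
    """256-bit hash: 64-bit length (truncated), then the 192-bit CRC."""
    return ((len(bits) & ((1 << 64) - 1)) << DEG) | crc192(bits)

# For at most 192 bits, the CRC is the bits themselves, right-aligned:
assert h([1, 0, 1]) == (3 << DEG) | 0b101
```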

We define $g(h_1,h_2)$ as follows:

  • from $h_1$ and $h_2$ we get the integers $l_1$ and $l_2$ as the first $64$ bits, and build the first $64$ bits of $g(h_1,h_2)$ as $l_1+l_2$;
  • from $h_1$ and $h_2$ we get the polynomials $m_1(x)$ and $m_2(x)$ whose coefficients are the last $192$ bits, and build the last $192$ bits of $g(h_1,h_2)$ as the coefficients of $$m_1(x)\cdot x^{l_2}+m_2(x)\bmod p(x)$$

The desired property $g(h(s_1),h(s_2))=h(s_1\|s_2)$ holds, and $g$ is fast to compute, independently of the length of the original strings: $x^{l_2}\bmod p(x)$ is quickly obtained by pre-computing $q_j(x)=x^{2^j}\bmod p(x)$ for $0\le j<64$, and multiplying modulo $p(x)$ those $q_j(x)$ corresponding to the bits set in $l_2$.
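
The following Python sketch (again with my own names; here $x^{l_2}\bmod p(x)$ is computed by square-and-multiply rather than from a stored table of the $q_j$) implements $g$ and checks the merge property against a direct computation of $h$ over byte strings:

```python
# Assumes p(x) = x^192 + x^149 + x^97 + x^46 + 1; polynomials are integers,
# bit i = coefficient of x^i.
P_FULL = (1 << 192) | (1 << 149) | (1 << 97) | (1 << 46) | 1
DEG, MASK192, MASK64 = 192, (1 << 192) - 1, (1 << 64) - 1

def polymulmod(a, b):
    """Carry-less product of a and b in GF(2)[x], reduced modulo p(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> DEG & 1:                    # degree reached 192: subtract (XOR) p(x)
            a ^= P_FULL
    return r

def x_pow(l):
    """x^l mod p(x), by squaring x repeatedly and multiplying on the set bits of l."""
    result, sq = 1, 2                       # 1 is x^0, 2 is x
    while l:
        if l & 1:
            result = polymulmod(result, sq)
        sq = polymulmod(sq, sq)             # sq runs through x^(2^j) mod p(x)
        l >>= 1
    return result

def g(h1, h2):
    """Combine h(s1) and h(s2) into h(s1||s2) without revisiting s1 or s2."""
    l1, m1 = h1 >> DEG, h1 & MASK192
    l2, m2 = h2 >> DEG, h2 & MASK192
    return (((l1 + l2) & MASK64) << DEG) | (polymulmod(m1, x_pow(l2)) ^ m2)

def h(data):
    """Reference h for byte strings: 64-bit bit-length, then s(x) mod p(x)."""
    m = 0
    for byte in data:
        for i in range(7, -1, -1):          # most significant bit of each byte first
            carry = m >> (DEG - 1) & 1
            m = ((m << 1) | (byte >> i & 1)) & MASK192
            if carry:
                m ^= P_FULL & MASK192
    return ((8 * len(data) & MASK64) << DEG) | m

a, b = b"hello, ", b"world"
assert g(h(a), h(b)) == h(a + b)            # hashing parts, then merging, matches
```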

The hash of any bitstring $s$ of at most $192$ bits is its length $l$ (in binary over $64$ bits), followed by $192-l$ zero bits, followed by $s$. For longer bitstrings, we can either use the definition, or use $g(h(s_1),h(s_2))=h(s_1\|s_2)$; both allow computing the hash of an arbitrary bitstring in time roughly proportional to its length.

A practical implementation of $h$ on a modern $64$-bit CPU can use a table of $2048$ pre-computed $192$-bit values ($8$ tables of $256$ entries, each entry $3$ words of $64$ bits); each additional $8$ bytes of $s$ then requires only $8$ fetches of $3$-word entries, and a few simple operations dominated by $24$ wordwise XORs. That's very fast.
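
As an illustration of the table-driven idea, here is a byte-at-a-time Python sketch with a single $256$-entry table (the $8$-table variant described above extends this so that $8$ input bytes are folded per iteration; names are mine):

```python
# Assumes p(x) = x^192 + x^149 + x^97 + x^46 + 1; polynomials are integers,
# bit i = coefficient of x^i.
P_FULL = (1 << 192) | (1 << 149) | (1 << 97) | (1 << 46) | 1
DEG = 192
MASK184 = (1 << (DEG - 8)) - 1

def _times_x192_mod_p(t):
    """(t(x) * x^192) mod p(x) for a byte-sized polynomial t."""
    r = t << DEG                             # degree at most 199
    for i in range(DEG + 7, DEG - 1, -1):
        if r >> i & 1:
            r ^= P_FULL << (i - DEG)         # cancel the coefficient of x^i
    return r

TABLE = [_times_x192_mod_p(t) for t in range(256)]

def crc192_bytes(data):
    """s(x) mod p(x) for a byte string, one table lookup per byte."""
    r = 0
    for byte in data:
        # r*x^8 + byte, reduced: the top byte of r is folded back via the table
        r = ((r & MASK184) << 8) ^ TABLE[r >> (DEG - 8)] ^ byte
    return r

# Quick self-checks on short inputs (CRC of a short string is the string itself):
assert crc192_bytes(b"\x01") == 1
assert crc192_bytes(bytes(23) + b"\x01") == 1    # 24 bytes = 192 bits
```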

It is trivial to make collisions, but they cannot occur between bitstrings of different lengths, or between bitstrings that differ only in a small segment (of size no more than $192$ bits).
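
For instance (a toy example of my own), two equal-length strings whose bitwise difference is exactly the coefficient pattern of $p(x)$, which spans $193$ bit positions, do collide:

```python
# Assumes p(x) = x^192 + x^149 + x^97 + x^46 + 1; polynomials are integers,
# bit i = coefficient of x^i.
P_FULL = (1 << 192) | (1 << 149) | (1 << 97) | (1 << 46) | 1

def crc192(data):
    """s(x) mod p(x) for a byte string (simple shift-and-reduce version)."""
    r = 0
    for byte in data:
        r = (r << 8) | byte
        for i in range(r.bit_length() - 1, 191, -1):
            if r >> i & 1:
                r ^= P_FULL << (i - 192)     # cancel the coefficient of x^i
    return r

a = bytes(32)                                # 256 zero bits
b = (P_FULL << 63).to_bytes(32, "big")       # same length; a XOR b = p(x) * x^63
assert crc192(a) == crc192(b) == 0           # the difference is a multiple of p(x)
# Both strings have the same length, so the full 256-bit hashes collide too.
```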

One suitable $p(x)$ is $x^{192}+x^{149}+x^{97}+x^{46}+1$, taken from Jörg Arndt's Table of weight-5 binary primitive polynomials with (roughly) equally spaced coefficients. We might however want to use a random primitive polynomial, which gives even better assurance that no collision occurs between naturally occurring strings (argument: with neither knowledge of the polynomial nor example hashes, it is computationally impossible to find distinct strings with sizable odds of hash collision).
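
As a sanity check (my own code, and only a partial one: it verifies that this $p(x)$ is irreducible over $\mathrm{GF}(2)$; primitivity additionally requires checking the order of $x$, which needs the factorization of $2^{192}-1$ and is not done here), Rabin's irreducibility test in Python:

```python
# Rabin's test: p of degree n = 192 is irreducible over GF(2) iff
# x^(2^n) = x (mod p) and gcd(x^(2^(n/q)) - x, p) = 1 for each prime q dividing n
# (here n = 192 = 2^6 * 3, so q in {2, 3}). Polynomials are integers, bit i = x^i.
P_FULL = (1 << 192) | (1 << 149) | (1 << 97) | (1 << 46) | 1
DEG = 192

def polymulmod(a, b):
    """Carry-less product in GF(2)[x], reduced modulo p(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> DEG & 1:
            a ^= P_FULL
    return r

def x_pow_pow2(k):
    """x^(2^k) mod p(x), by k successive squarings."""
    t = 2                                    # 2 represents x
    for _ in range(k):
        t = polymulmod(t, t)
    return t

def polymod(a, b):
    """a mod b in GF(2)[x]."""
    db = b.bit_length() - 1
    while a and a.bit_length() - 1 >= db:
        a ^= b << (a.bit_length() - 1 - db)
    return a

def polygcd(a, b):
    while b:
        a, b = b, polymod(a, b)
    return a

assert x_pow_pow2(192) == 2                       # x^(2^192) = x (mod p)
assert polygcd(x_pow_pow2(96) ^ 2, P_FULL) == 1   # no factor of degree dividing 96
assert polygcd(x_pow_pow2(64) ^ 2, P_FULL) == 1   # no factor of degree dividing 64
```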


My original (and in retrospect overly complex) proposed $h(s)$ is $256$-bit ($32$ bytes), among which the first $64$ bits are the number of bits in $s$, and the $192$ last bits are $2^s\bmod p$, where $p$ is a fixed $192$-bit prime such that $(p-1)/2$ is prime and $2^{(p-1)/2}\bmod p=p-1$.

Because $2^s\bmod p=2^{s\bmod(p-1)}\bmod p$, we can compute the hash of a string in time about proportional to its length, by first reducing $s$ modulo $p-1$.

  • from $h_1$ and $h_2$ we get the integers $l_1$ and $l_2$ as the first $64$ bits, and build the first $64$ bits of $g(h_1,h_2)$ as $l_1+l_2$;
  • from $h_1$ and $h_2$ we get the integers $m_1$ and $m_2$ as the last $192$ bits, and build the last $192$ bits of $g(h_1,h_2)$ as $$m_1^{2^{l_2}\bmod(p-1)}\cdot m_2\bmod p$$
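
Putting the two items above together, here is a toy Python sketch (with my own names, using the small prime $p=11$ purely for illustration, since it also satisfies the stated conditions: $(11-1)/2=5$ is prime and $2^5\bmod 11=10$; the packing widths are shrunk accordingly):

```python
# Toy parameters: a 4-bit right part instead of 192 bits, lengths kept exact.
P = 11                                   # toy stand-in for the 192-bit prime
W = 4                                    # width of the right part, in bits

def h(bits):
    """Length, then 2^s mod p, where s is the big-endian integer value of the bits."""
    s = 0
    for b in bits:
        s = (s << 1) | b                 # in practice s would be reduced mod p-1 on the fly
    return (len(bits) << W) | pow(2, s % (P - 1), P)

def g(h1, h2):
    l1, m1 = h1 >> W, h1 & ((1 << W) - 1)
    l2, m2 = h2 >> W, h2 & ((1 << W) - 1)
    # right part: m1^(2^l2 mod (p-1)) * m2 mod p, as in the second item above
    return ((l1 + l2) << W) | (pow(m1, pow(2, l2, P - 1), P) * m2 % P)

s1, s2 = [1, 0, 1, 1, 0], [0, 1, 1]
assert g(h(s1), h(s2)) == h(s1 + s2)     # merging the hashes matches hashing s1||s2
```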

It is easy to find a collision, but it won't occur by accident, and can only occur for strings of equal length. For example, the $24$-byte strings
000000000000000000000000000000000000000000000001 and
ff41208fd469fe8cfdcfdb1b1a977095890fed10d7b6f8a3 both hash to
00000000000000C0000000000000000000000000000000000000000000000002.

The above has the property that it is slightly difficult to come up with a string having a given random value in the right part (first preimage resistance); that's the discrete log problem. By making $p$ much larger, or using a group derived from point addition on an elliptic curve of known order, we can make that very difficult. This is nice to have, but not useful in the context of the question.

 