Chapter 3
Data fingerprinting

Initial version: 2024-02-13
Last update: 2024-04-09

In the previous chapter I discussed the problem of saving in a register information about a wide range of physical objects you may own. It doesn't matter whether it is a car, a home, money, kids, a dog or anything else – some data related to that thing must be saved. This is why, from now on, I will abstract away from real objects and, regardless of their type, treat all data equally, simply as a sequence of characters.

The goal I set for myself in this chapter is to teach you how to identify this sequence of characters (regardless of what real object it represents) and guarantee that any interference with it will leave a noticeable trace.

Table of contents


Fingerprints


Fingerprints are a one-way relation. If you know a person, you can take his or her fingerprint easily. But if all you have is a fingerprint, it's impossible to find out whose it is unless that person is in police records. By fingerprinting data I mean the process of calculating its fingerprint – something unique that characterizes only this one sequence of bytes. Something which resembles a real fingerprint.

Designing a good digital fingerprint is a serious science in itself, and all theoretical considerations are out of the scope of this book. However, I want you to have some indispensable knowledge and intuition in this area, mostly to convince you that it can work properly.

Very easy approach


How can you calculate a fingerprint in practice? You can start with something as simple as the number of characters in your electronic document.

This type of fingerprint has some of the desirable properties every fingerprint should have:

  • You can compute it quickly.
  • It is deterministic – for the same sequence it always returns the same fingerprint.
  • You cannot recreate or guess the "contents" of the string based on the fingerprint.
  • The size of the fingerprint may be fixed (for example you can use a 128-bit binary number, which is big enough for most data).


However, it lacks uniqueness. It is very easy to change a signed sequence of data, or even make a new one, while keeping the same fingerprint:


data:        number of chars:
big apple    9
small dog    9
small pig    9


Improved approach


To make your fingerprint more resistant to content tampering, you may fix the dimensions of a virtual page and assume that every character occupies one cell, always at the intersection of some row and some column. Then, for every column and every row, you can calculate four numbers: the number of letters, the number of digits, the number of other printable characters and the number of white characters.


data:

          1111111111222222222233
01234567890123456789012345678901

This morning I decided to transf (27,0,0,5)
er the amount of 137 (one hundre (22,3,1,6)
d and thirty-seven) dollars to y (25,0,2,5)
our account (1234-56789-012).    (10,12,5,5)

                                     in column:
43333334444312232221222331323323 <-- number of letters
00000000000001111122111011100000 <-- number of digits
00000000000020000100010100011000 <-- number of other printable characters
01111110000111101001101002010121 <-- number of white characters

Fingerprint:
(4,0,0,0)(3,0,0,1)(3,0,0,1)(3,0,0,1)
(3,0,0,1)(3,0,0,1)(3,0,0,1)(4,0,0,0)
(4,0,0,0)(4,0,0,0)(4,0,0,0)(3,0,0,1)
(1,0,2,1)(2,1,0,1)(2,1,0,1)(3,1,0,0)
(2,1,0,1)(2,1,1,0)(2,2,0,0)(1,2,0,1)
(2,1,0,1)(2,1,1,0)(2,1,0,1)(3,0,1,0)
(3,1,0,0)(1,1,0,2)(3,1,0,0)(2,0,1,1)
(3,0,1,0)(3,0,0,1)(2,0,0,2)(3,0,0,1)
(27,0,0,5)(22,3,1,6)(25,0,2,5)(10,12,5,5)
This is much more resistant to forgery attempts. For example, your hash will change if you add an extra space, even if the meaning of the text itself remains unchanged.
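The per-row quadruples in the example can be computed with a short helper (a sketch; lineStats is my name, and I count using Python's built-in character predicates):

```python
def lineStats(line):
  # Count letters, digits, other printable characters
  # and white characters in a single row of the virtual page
  letters = sum(c.isalpha() for c in line)
  digits = sum(c.isdigit() for c in line)
  white = sum(c.isspace() for c in line)
  other = len(line) - letters - digits - white
  return (letters, digits, other, white)

print(lineStats("This morning I decided to transf"))  # (27, 0, 0, 5)
print(lineStats("er the amount of 137 (one hundre"))  # (22, 3, 1, 6)
```

The same function applied column by column (after transposing the page) yields the per-column statistics.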

You can try to find a better way of calculating a fingerprint. While doing so, please keep in mind two very important properties that should be fulfilled:

A fingerprint should have a pseudorandom nature. This has no counterpart in real life, where all the fingerprints of a given finger are the same no matter where you leave them. A digital fingerprint should be different: every time you change even one bit of data, the fingerprint should change too. Moreover, the whole fingerprint should change as an effect of flipping just one bit, and it should be impossible to predict how it will change. In my example above you know in advance that every time you change the contents of only one cell, at most one row and one column (if any) are affected, while all other numbers stay the same.

This behavior, where even the smallest change in input yields a fingerprint so different that the new fingerprint appears uncorrelated with the old one, is known as the avalanche effect.

A fingerprint should be collision resistant. A collision means that you have different documents resulting in the same fingerprint. It is quite obvious that there must be many (sometimes even infinitely many) documents with the same fingerprint. Resistance in this case means that even though many documents share a fingerprint, only a tiny fraction of them have meaningful content and only one is relevant to your case.

Cryptographic hash function


As you may guess, creating an application-ready fingerprint function is not an easy job, but fortunately several such functions, known collectively as cryptographic hash functions, have been developed.

In the wider sense a hash function is any function that can be used to map data of arbitrary size to fixed-size values (though there are some hash functions that support variable length output). The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

A cryptographic hash function implements a hash algorithm that must be able to withstand all known types of cryptanalytic attack. In short those attacks have two goals:

  • You can try to guess the data based on its hash.
  • You can try to predict what the hash will be for given data, or how a (known) hash will change if you modify the (known) data (but without computing the hash).


In theoretical cryptography, the security level of a cryptographic hash function has been defined using the following properties:

  • Pre-image resistance Given a hash value h, it should be difficult to find any message m such that:

    
    h - known
    
    find `m` such that:
    
    h = hash(m)
    
    This concept is related to that of a one-way function.
  • Second pre-image resistance Given an input m1 it should be difficult to find a different input m2 such that:

    
    m1 - known
    
    find `m2` such that:
    
    hash(m1) = hash(m2)
    
    This property is sometimes referred to as weak collision resistance.

    Second pre-image resistance prevents an attacker from crafting a document with the same hash as a document the attacker cannot control.
  • Collision resistance It should be difficult to find two different messages m1 and m2 such that:

    
    find `m1` and `m2` such that:
    
    hash(m1) = hash(m2)
    
    Such a pair is called a cryptographic hash collision. This property is sometimes referred to as strong collision resistance. It requires a hash value at least twice as long as that required for pre-image resistance; otherwise collisions may be found by a birthday attack [sec003].

    Collision resistance prevents an attacker from creating two distinct documents with the same hash.


Informally, these properties mean that a malicious adversary cannot replace or modify the input data without changing its hash. Thus, if two strings have the same hash, one can be very confident that they are identical.

However, even a function meeting these criteria may still have undesirable properties. For example, some cryptographic hash functions are vulnerable to a length-extension attack [sec004]. In practice, collision resistance is insufficient for many practical uses. In addition to collision resistance, it should be impossible for an adversary to find two messages with substantially similar hashes, or to infer any useful information about the data given only its hash. This is why a hash function should additionally behave as much as possible like a random function.

With a cryptographic hash function you have only a fingerprint of your data. Having a fingerprint, you can verify whether it is the fingerprint of a specific document, but you cannot guarantee that the document is yours. For this you need another tool, which you will learn about in the next chapter.


Hash in practice


In this chapter you will gain practical skills in computing hashes. This is good to know even if you don't want to implement a blockchain, simply because it lets you verify the integrity of your data.

Import required libraries


Import the libraries required to prepare data for hashing:


import json
import base64
and the libraries required to display images (which will be an example of binary data you will enclose in your block of data):


import cv2
from PIL import Image
import io 


Data to bytes


Prepare a block of data – I use a dictionary as the most versatile structure – and convert it to bytes. Remember that regardless of what your data is or what it represents, from the hash function's point of view it is always treated as a sequence of bytes.


# Initialize dictionary

testData = {"k1": 1,
            "k2": "abc",
            "k3": ["a", "bc", "def"]}

print(testData)

# From dictionary to bytes:  
# Step 1/2: convert dictionary into JSON string

testString = json.dumps(testData, sort_keys=True)

print(testString)

# Step 2/2: convert string to bytes

testBytes = testString.encode('utf-8')

print(testBytes)
As a result you will see the following output:


{'k1': 1, 'k2': 'abc', 'k3': ['a', 'bc', 'def']}
{"k1": 1, "k2": "abc", "k3": ["a", "bc", "def"]}
b'{"k1": 1, "k2": "abc", "k3": ["a", "bc", "def"]}'
The last output seems to be the same as the two previous ones, but only because your data consists of ASCII characters. It would be much clearer if you used, for example, Polish letters. If you change the value abc under the key k2 into Łódź, you will clearly see the binary nature of the last result:


{'k1': 1, 'k2': 'Łódź', 'k3': ['a', 'bc', 'def']}
{"k1": 1, "k2": "\u0141\u00f3d\u017a", "k3": ["a", "bc", "def"]}
b'{"k1": 1, "k2": "\\u0141\\u00f3d\\u017a", "k3": ["a", "bc", "def"]}'
In any case, Python signals a byte sequence with the letter b in front of it.
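The \uXXXX escapes come from json.dumps itself: by default it runs with ensure_ascii=True and replaces every non-ASCII character with an escape sequence. With ensure_ascii=False you would instead get the raw UTF-8 bytes after encoding. A quick check:

```python
import json

# Default: non-ASCII characters are escaped to \uXXXX
print(json.dumps({"k2": "Łódź"}))
# {"k2": "\u0141\u00f3d\u017a"}

# With ensure_ascii=False the string keeps the original characters,
# so encoding yields their raw UTF-8 bytes
print(json.dumps({"k2": "Łódź"}, ensure_ascii=False).encode('utf-8'))
# b'{"k2": "\xc5\x81\xc3\xb3d\xc5\xba"}'
```

Both forms round-trip back to the same dictionary; they only differ in how the intermediate JSON string is written.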



Bytes to data


Restore data from bytes:


# From bytes to dictionary
# Step 1/2: convert bytes to JSON string

testString2 = testBytes.decode('utf-8')

print(testString2)

# Step 2/2: convert JSON string to dictionary
testData2 = json.loads(testString2)

print(testData2)
As a result you will see the following output:


{"k1": 1, "k2": "abc", "k3": ["a", "bc", "def"]}
{'k1': 1, 'k2': 'abc', 'k3': ['a', 'bc', 'def']}


Add binary field to dictionary


Not all your data is clear text; sometimes you may have various binary data, for example an image. To put it in a dictionary, you have to serialize it, i.e. turn it into a sequence of printable characters, a pure ASCII string. You can do this with base64 encoding:


pathImageSrc = "/content/drive/MyDrive/ul/blockchain/fotka04.png"

# Read binary data (PNG image), convert it to text with Base64
# and add it to dictionary under the 'image' key

with open(pathImageSrc, "rb") as data:
  image = data.read()
  Image.open(io.BytesIO(image)).show()
  print(image)
  testData["image"] = base64.b64encode(image).decode('utf-8') 

print(testData)

testString = json.dumps(testData, sort_keys=True)

print(testString)

testBytes = testString.encode('utf-8')

print(testBytes)
As a result you will see the following output:

Figure: An image I include in a dictionary as a part of data



b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x00\x00\x00\x01\x00\x08\x02\x00[...]\xa1c\xd2\xaa\x00\x00\x00\x00IEND\xaeB`\x82'
{'k1': 1, 'k2': 'abc', 'k3': ['a', 'bc', 'def'], 'image': 'iVBORw0KGgoAAAANSUhEUgAAA[...]FTkSuQmCC'}
{"k1": 1, "k2": "abc", "k3": ["a", "bc", "def"], "image": "iVBORw0KGgoAAAANSUhEUgAAA[...]FTkSuQmCC"}
b'{"k1": 1, "k2": "abc", "k3": ["a", "bc", "def"], "image": "iVBORw0KGgoAAAANSUhEUgAAA[...]FTkSuQmCC"}'


Get image back



testString = testBytes.decode('utf-8')
testData = json.loads(testString)
image = base64.b64decode(testData["image"].encode('utf-8'))
Image.open(io.BytesIO(image)).show()
print(image)

pathImageDst = "/content/drive/MyDrive/ul/blockchain/fotka04_result.png"

with open(pathImageDst, "wb") as data:
  data.write(image)
As a result you will see the following output:

Figure: An image I include in a dictionary as a part of data



b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x00\x00\x00\x01\x00\x08\x02\x00[...]\xa1c\xd2\xaa\x00\x00\x00\x00IEND\xaeB`\x82'


Calculating Hash


Python is equipped with a standard set of popular hash functions; they are available in the hashlib library. Compare the following two examples, which differ in only one character of code: the letter e in the string Hello in the first example is replaced with the letter d in the second, resulting in the string Hdllo.

Example 1:

Code:


import hashlib

testData = {"k1": "Hello"}
testString = json.dumps(testData, sort_keys=True)
testBytes = testString.encode('utf-8')
hashObj = hashlib.sha256(testBytes)

print(hashObj.hexdigest())
Result:


86dc834e4217edf2a217cdc370e3e44746276ef64362e5c506c5f2059a99a8a3


Example 2:

Code:


testData = {"k1": "Hdllo"}
testString = json.dumps(testData, sort_keys=True)
testBytes = testString.encode('utf-8')
hashObj = hashlib.sha256(testBytes)
print(hashObj.hexdigest())
Result:


33faddb754cbbb19c12150c4605c348fc0058fdf7bcd35c4ed100d6dee9ca750


As you can see, even a one-bit change (the code for the letter e is U+0065, 0x65, while d is U+0064, 0x64) results in totally different hashes.
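You can quantify this avalanche behavior by counting how many of the 64 hexadecimal digits actually differ between the two digests. For two unrelated 256-bit values you would expect about 60 of the 64 positions to differ (each hex digit matches by chance with probability 1/16):

```python
import hashlib
import json

h1 = hashlib.sha256(json.dumps({"k1": "Hello"}, sort_keys=True).encode('utf-8')).hexdigest()
h2 = hashlib.sha256(json.dumps({"k1": "Hdllo"}, sort_keys=True).encode('utf-8')).hexdigest()

# Count the hex positions where the two digests differ
diff = sum(a != b for a, b in zip(h1, h2))
print(diff, "of 64 hex digits differ")
```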

Sending a tamper-evident message


If you want to send a message and be sure that no byte of it has been changed, you can follow this procedure:

  1. Alice and Bob both share a secret phrase S.
  2. Alice creates a combined sequence C by concatenating the message M with the secret S appended to the end:

    
    C = M + S
    
  3. Alice calculates a hash H_a of C:

    
    H_a = hash(C) = hash(M + S)
    
  4. Alice sends the message M and the computed hash H_a to Bob.
  5. Bob checks the message integrity by calculating hash of M + S himself:

    
    H_b = hash(M + S)
    
    to see if it’s the same as the hash sent by Alice:

    
    H_b =?= H_a
    
  6. If both hashes are equal, Bob can be pretty sure that the message he received has not been tampered with – what he received is actually what was sent. However, Bob has no guarantee that the message was written by Alice.


The code implementing steps 1-3 may look like this:


import hashlib

secretPhrase = "PaSSwOrD"

def hashDataWithSecretPhrase(inputData, secretPhrase):
  combined = inputData + secretPhrase
  return hashlib.sha256(combined.encode()).hexdigest()

emailText = ...
hash = hashDataWithSecretPhrase(emailText, secretPhrase)
Now you are ready to send both emailText and hash to the recipient.
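A word of warning: naive concatenation constructions like this have known weaknesses (recall the length-extension attack mentioned earlier). In practice you would use HMAC, available in Python's standard hmac module. A sketch, with an illustrative secret and message of my own choosing:

```python
import hmac
import hashlib

secretPhrase = b"PaSSwOrD"
emailText = b"Transfer 137 dollars to account 1234-56789-012"

# Alice computes the authentication tag and sends it with the message
tag = hmac.new(secretPhrase, emailText, hashlib.sha256).hexdigest()

# Bob recomputes the tag and compares it in constant time
expected = hmac.new(secretPhrase, emailText, hashlib.sha256).hexdigest()
print(hmac.compare_digest(tag, expected))  # True
```

hmac.compare_digest avoids leaking information through comparison timing, which a plain == check could do.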

Proof of work – solving a hash puzzle


The hashes of our interest are random and impossible to predict. This is highly desirable. However, sometimes, for reasons you will learn later, you may require that a computed hash fulfil some condition. For example, you may require that the hash of some data contains a given number of selected digits – say, five leading zeros.

How is this possible? The hash of data is deterministic – if the sequence doesn't change, its hash doesn't change either.

The hash puzzle is: find a nonce that, combined with the data (as Alice combined message M with secret S), yields a hash value starting with five leading zeros (or any other reasonable condition of this type – see chapter 8: Difficulty and target for a description of the difficulty measure used in Bitcoin). Typically the nonce is a number, hence its name: nonce is an abbreviation of number used once.

Why would someone need to solve such a puzzle? Well, as you saw, this is not a trivial puzzle. To tell the truth, it is not really a puzzle at all, because you can't solve it any other way than by brute-force testing candidates one by one until you finally meet the requirements. Any heuristic, artificial intelligence, or guessing is useless.

Because it's not trivial, it's time consuming and doesn't guarantee success, you must have a good reason to take on this work. On the other hand, when you complete your work, you can easily prove that you did it by presenting the nonce value: anyone can combine it with the data and calculate the hash, which will certainly meet the requirements. Because of the price that must be paid for finding a nonce, it is very unlikely that the result you present was produced by someone else (it is possible, but they would have to pay a lot to offset the incurred costs). This is why solving such a puzzle is an example of Proof of Work (PoW for short).

The process of searching for a nonce is called mining, probably because the probability of success is as low as the probability of finding a gold nugget while working in a mine, but the benefits of success are significant – enrichment of the miner, both in the real and the virtual (digital) world.

In Bitcoin the mining procedure is required to create new blocks of transactions. A miner, after successfully completing the mining of a block, receives a reward: the right to insert a transaction from nobody to himself, with an amount of Bitcoin that halves every 210,000 blocks. This is how new Bitcoins are generated and placed on the market.

Below you have an implementation of the above Proof of Work procedure:


import json
import random

import hashlib
from timeit import default_timer as timer

def proofOfWork(blockOfData, difficulty):
  leadingZeros = "0" * difficulty
  counter = 1
  
  start = timer()
  while True:
    blockOfData["nonce"] = format(random.getrandbits(64), "x")
    dataString = json.dumps(blockOfData, sort_keys=True)
    dataBytes = dataString.encode('utf-8')
  	 
    hash = hashlib.sha256(dataBytes).hexdigest()
    
    if hash.startswith(leadingZeros):
      blockOfData["difficulty"] = difficulty
      blockOfData["hash"] = hash
      blockOfData["counter"] = counter
      end = timer()
      blockOfData["powTime"] = end-start
      break
      
    counter += 1

  return blockOfData

dataJSON = {"name": "Piotr", "country": "Poland"}
blockOfData = {"data" : dataJSON,
               "nonce": None}     

proofOfWork(blockOfData, difficulty = 1)

print(blockOfData)
Of course the nonce is not uniquely determined – there exist many different nonces conforming to the assumed rule. When I called the proofOfWork() function twice:


import copy

dataJSON = {"name": "Piotr", "country": "Poland"}
blockOfData = {"data" : dataJSON,
               "nonce": None}     

minedBlockOfData1 = proofOfWork(copy.deepcopy(blockOfData), difficulty = 3)
print(minedBlockOfData1)

minedBlockOfData2 = proofOfWork(copy.deepcopy(blockOfData), difficulty = 3)
print(minedBlockOfData2)
I got the following results:

{
  'data': {'name': 'Piotr', 'country': 'Poland'},
  'nonce': '1d7e53d71a2d60c7',
  'difficulty': 3,
  'hash': '000d26177ca0504cd510d64e7544141eb1e4db201a849883ca9fbf87eadaa542',
  'counter': 1080,
  'powTime': 0.009034951
}

{
  'data': {'name': 'Piotr', 'country': 'Poland'},
  'nonce': '4ce38ee0be35cf5b',
  'difficulty': 3,
  'hash': '00068006d0a6ee6c8e94ff6d2af909b683361c6e0842835995aa3fa868aa774b',
  'counter': 4023,
  'powTime': 0.033487146999999995
}
As you can see, two different nonces, 1d7e53d71a2d60c7 and 4ce38ee0be35cf5b, result in two different hashes, both meeting the assumption about the number of leading zeros.

I strongly recommend spending some time playing with different difficulty definitions. If you stay with the one I gave above, you will obtain results similar to mine, given in the following table, where you can observe the consequences of the difficulty parameter for computational complexity (time and number of iterations averaged over 10 runs; get full test results):

Difficulty Number of iterations Time [s]
1       10.0   0.0001459932
2      156.9   0.0013508364
3     2259.5   0.0195884146
4    41935.7   0.4406605445
5  1858598.7  17.0438893153
6 12791193.6 124.2700059172
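The growth in the table matches a simple model: if each hex digit of the digest behaves as uniformly random, a candidate nonce succeeds with probability 16^-d, so the expected number of attempts for difficulty d is 16^d. The measured averages, taken over only 10 runs, are of the same order of magnitude:

```python
# Expected attempts to find d leading hex zeros:
# success probability per attempt is 16**-d, so the expectation
# of the geometric distribution of attempts is 16**d
for d in range(1, 7):
  print(d, 16**d)
```

This is also why each extra required zero multiplies the running time by roughly 16, as the Time column shows.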


What does it mean to hash data?


When you have a static set of data – a set which will stay in its present form forever – hashing is quite simple. However, when your data is constantly changing – some data is appended, changed or deleted – you have to decide how you will hash it.

You can distinguish the following types of hashing:

  • repeated,
  • independent,
  • combined,
  • sequential,
  • hierarchical.


Now I will explain all of them so you can make the right decision when you have to choose one.

Repeated hashing


To tell the truth, this type of hashing shouldn't be considered here, as it is not about evolving data (data which is still being appended); it treats each data block independently, so there is no special procedure or data structure you should maintain. However, it is good to know, as it is a part of other hashing procedures and helps protect your data (you will learn about this soon).

It is the simplest form of hashing: you calculate the hash of a hash of your data.


[IMG: hashing_repeated]


Of course you can repeat this procedure more than twice:


[IMG: hashing_repeated]
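In code, repeated hashing just feeds a digest back into the hash function; Bitcoin, for example, uses double SHA-256 this way. A minimal sketch with illustrative data:

```python
import hashlib

data = b"my block of data"

# First pass produces the raw digest bytes,
# second pass hashes that digest again
once = hashlib.sha256(data).digest()
twice = hashlib.sha256(once).hexdigest()

print(twice)
```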


Independent hashing


Independent hashing means applying the hash function to each incoming data independently:


[IMG: hashing_independent]


It is fast (at least in the case of relatively small blocks) but it preserves neither the sequence of the data nor the order of their hashing.

Combined hashing


Combined hashing allows you to get a single hash value for more than one independent block of data. To do this you simply combine them into one block of data and calculate its hash value afterwards. This is particularly useful if you want to create one single hash value for a collection of data that is available at a given time:


[IMG: hashing_combined]


This type of hashing requires you to repeat computations on old data every time you add new data. Because of this you should use it only when the individual pieces of data are small:


[IMG: hashing_combined_new_data_001-kopia]


Another property is that you can use the hash value only when you have finished computing it for the last byte of your data. No matter how big the dataset is and how long it takes to compute the hash, you have no "intermediate" results confirming partial correctness of your data. In the case of big datasets this can be a performance bottleneck; however, in some other situations it can be a desired property (it makes forgery more difficult).
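Combined hashing can be sketched in a few lines (the helper name combinedHash is mine); note how appending one element changes the single hash and forces rehashing the whole collection:

```python
import hashlib
import json

def combinedHash(blocks):
  # Serialize the whole collection deterministically, then hash once
  allBytes = json.dumps(blocks, sort_keys=True).encode('utf-8')
  return hashlib.sha256(allBytes).hexdigest()

blocks = [{"data": "data 1"}, {"data": "data 2"}]
h1 = combinedHash(blocks)

# Appending new data invalidates the previous hash
# and requires recomputing over everything
blocks.append({"data": "data 3"})
h2 = combinedHash(blocks)
print(h1 != h2)  # True
```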

Sequential hashing


Combined hashing may be highly inefficient in the case of rapidly arriving new data, or of changes in existing (big) data blocks, because it requires you to recalculate the hash again and again as new data or changes appear.

With sequential hashing you can incrementally update a hash value as new data arrive.

This is achieved by using combined and repeated hashing at the same time. Instead of hashing the new data alone, you first combine it with the existing hash value. Then you apply a hash function to this combination to get a new hash value:


[IMG: hashing_sequential]


What is important, the new hash doesn't replace the previous one; rather, you have a sequence of hash values whose evolution you can trace back. Of course, the "youngest" hash is the most important, as it is a fingerprint of all the data you have. If you want, you can delete all previous hashes. However, keeping them allows you to verify the integrity of your data partially, without recomputing the hashes of all the data.

This is particularly useful if you want to have a hash value for a collection of data which evolves in such a way that new data is appended all the time, often quickly and in large numbers.

Hierarchical hashing


This is another type of "incremental" hashing. In sequential hashing, every change in one of the existing blocks of data requires you to recompute all subsequent hashes – this is quite obvious. Because you do this for all subsequent blocks of data combined with hashes, you may need a lot of computing power:


[IMG: hashing_sequential_new_data]


Similar to sequential hashing, the idea of hierarchical hashing is to create one single hash value for a collection of data. However, in this case you create a tree-like hierarchy of hash values with a single value at its top:


[IMG: hashing_hierarchical]


Every time you change a block of data you have to recompute only some hashes:


[IMG: hashing_hierarchical_new_data]


Hierarchical hashing forms a tree-like structure:


[IMG: hashing_hierarchical_as_a_tree_01]


To make it more efficient, data is usually located only in the leaf nodes at the lowest level of the binary tree (binary means that every node has at most two child nodes):


[IMG: hashing_hierarchical_as_a_tree_02]



[IMG: hashing_hierarchical_as_a_tree_binary]


This representation is known as a hash tree or Merkle tree after Ralph Merkle, who patented it in 1979.

Implement hashing


You start by implementing a parent class Hasher with a set of basic methods you may need:


import json

import hashlib
from enum import Enum

class Hasher:
  class OperationType(Enum):
    ADD = 1
    REPLACE = 2
    DELETE = 3
    
  def __init__(self):
    self.data = []
    self.hash = []

  def addData(self, data, index=None):
    if index is not None:
      self.data.insert(index, data)
    else:
      self.data.append(data)
    self.calculateHash(self.OperationType.ADD, index)
    
  def replaceData(self, data, index):
    self.data[index] = data
    self.calculateHash(self.OperationType.REPLACE, index)
    
  def deleteData(self, index):
    del self.data[index]
    self.calculateHash(self.OperationType.DELETE, index)
  
  def calculateHash(self, operationType, index):
    pass
Now you can implement any hashing method based on this blueprint of a child class:


class SomeHasher(Hasher):
  def calculateHash(self, operationType, index=None):
    if operationType == self.OperationType.ADD:
      # Put adequate code for ADD operation here
      pass
    elif operationType == self.OperationType.REPLACE:
      # Put adequate code for REPLACE operation here
      pass
    elif operationType == self.OperationType.DELETE:
      # Put adequate code for DELETE operation here
      pass


Implementing independent hashing


You can start implementing hashing with the simplest one, independent hashing:


class IndependentHasher(Hasher):
  def calculateHash(self, operationType, index=None):
    if operationType == self.OperationType.ADD:
      if index is not None:
        dataAsString = json.dumps(self.data[index])
        dataAsBytes = dataAsString.encode('utf-8')
        h = hashlib.sha256(dataAsBytes).hexdigest()
        self.hash.insert(index, h)
      else:
        dataAsString = json.dumps(self.data[-1])
        dataAsBytes = dataAsString.encode('utf-8')
        h = hashlib.sha256(dataAsBytes).hexdigest()
        self.hash.append(h)
    elif operationType == self.OperationType.REPLACE:
      dataAsString = json.dumps(self.data[index])
      dataAsBytes = dataAsString.encode('utf-8')
      h = hashlib.sha256(dataAsBytes).hexdigest()
      self.hash[index] = h
    elif operationType == self.OperationType.DELETE:
      del self.hash[index]
You can test it with the following code:


def testHasher(hasher):
  data = {"data": "data 1"}
  hasher.addData(data)
  print(hasher.data)
  print(hasher.hash)
  
  data = {"data": "data 2"}
  hasher.addData(data,0)
  print(hasher.data)
  print(hasher.hash)
  
  data = {"data": "data 3"}
  hasher.replaceData(data,1)
  print(hasher.data)
  print(hasher.hash)
  
  hasher.deleteData(0)
  print(hasher.data)
  print(hasher.hash)
  
  data = {"data": "data 4"}
  hasher.addData(data)
  print(hasher.data)
  print(hasher.hash)
  
  data = {"data": "data 5"}
  hasher.addData(data,1)
  print(hasher.data)
  print(hasher.hash)
  
 
hasher = IndependentHasher()
testHasher(hasher)
As a result you will see:


[{'data': 'data 1'}]
['fe18...27dc']
[{'data': 'data 2'}, {'data': 'data 1'}]
['e9e0...4ba1', 'fe18...27dc']
[{'data': 'data 2'}, {'data': 'data 3'}]
['e9e0...4ba1', '295a...f765']
[{'data': 'data 3'}]
['295a...f765']
[{'data': 'data 3'}, {'data': 'data 4'}]
['295a...f765', 'd8b4...6cfa']
[{'data': 'data 3'}, {'data': 'data 5'}, {'data': 'data 4'}]
['295a...f765', 'fb9c...6621', 'd8b4...6cfa']
where the full hashes are:


fe18f6a79ac237177d8fa7f5ad42f9a3529976db4d5ed0e1cfe3146770fa27dc
e9e07129af6e53ea7687de71b58019e75b3b9b52ee923931752576abbc1d4ba1
295a3e540a659b33d0e8fd825e97a1ae48c3930d04f90294e34d39359743f765
d8b435b3fac36463a53d703dce3b4b64d2ec495c48953392d2749ef713c16cfa
fb9c3a7893767b1437ffc8c1efb1eab0694d15b0d5daaf4687a1f961e9506621


Implementing sequential hashing



[code]
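One possible implementation is sketched below – a standalone version written in the spirit of the Hasher blueprint, covering only the add and replace operations. The naming follows the earlier examples, but the code itself is my illustration, not a canonical version:

```python
import json
import hashlib

class SequentialHasher:
  def __init__(self):
    self.data = []
    self.hash = []   # the whole chain; the last entry covers all data

  def _hashOne(self, previousHash, data):
    # Combine the previous hash with the new data,
    # then hash the combination (combined + repeated hashing)
    combined = {"previousHash": previousHash, "data": data}
    combinedBytes = json.dumps(combined, sort_keys=True).encode('utf-8')
    return hashlib.sha256(combinedBytes).hexdigest()

  def addData(self, data):
    previousHash = self.hash[-1] if self.hash else None
    self.data.append(data)
    self.hash.append(self._hashOne(previousHash, data))

  def replaceData(self, data, index):
    # A change invalidates all subsequent hashes,
    # so the chain is recomputed from the changed position
    self.data[index] = data
    del self.hash[index:]
    previousHash = self.hash[-1] if self.hash else None
    for d in self.data[index:]:
      previousHash = self._hashOne(previousHash, d)
      self.hash.append(previousHash)

hasher = SequentialHasher()
hasher.addData({"data": "data 1"})
hasher.addData({"data": "data 2"})
print(hasher.hash)
```

Appending is cheap (one hash computation), while replacing block i costs recomputing the chain from position i onward, exactly as described above.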


Implementing hierarchical hashing


Hierarchical hashing is not very difficult, but it is the most complex of them all. Below I present a simplified version of hierarchical hashing. To make my code more understandable, I always recalculate the whole tree of hashes. In my repository (see ) you can find the full version, which recalculates only the hashes that must be recalculated because their child nodes have changed.

Please look at the simplified code carefully. When you understand its idea, you can check its more economical cousin. In both versions I organize hashes in the same way: as a list of lists (in terms of the Python programming language) or an array of arrays (in general terms). At the lowest level, at index 0, there is a list of the hashes of all data; the number of hashes and their order correspond to the data. One level up there are hashes of the corresponding "child hashes". Look at the example below:


data = [data_1, data_2, data_3]

hash = [
  level_0 = [hash(data_1), hash(data_2), hash(data_3)],
  level_1 = [hash(level_0[0], level_0[1]), hash(level_0[2], None)],
  level_2 = [hash(level_1[0], level_1[1])]   # At the top most level (level indexed by -1 in Python)
]                                            # there must be only one hash.
The hash at level_0 is a list (array) of three hashes, each corresponding to the adequate data block from the data list (array).

At level_1 you have two hashes:
  • First hash is a "parent hash" of two child hashes from previous level: hash of data_1 denoted as level_0[0] and hash of data_2 denoted as level_0[1].
  • Second hash is a "parent hash" of: one child hash from previous level – hash of data_3 denoted as level_0[2] and None as there is no more hashes at level_0.
At level_2 you have one hash which is a "parent hash" of two child hashes from the previous level: the hash of (level_0[0], level_0[1]) denoted as level_1[0] and the hash of (level_0[2], None) denoted as level_1[1]. Because at this level there is only one hash, this ends the process of building the tree-like hierarchy of hashes.

Here you have the code:


import uuid

class HierarchicalHasherSimple(Hasher):
  def calculateHash(self, operationType, index=None):
    # For simplicity I always recalculate
    # a whole hierarchy
    newTree = []
    newTreeOneLevel = []
    for d in self.data:
      dataAsString = json.dumps(d, sort_keys=True)
      dataAsBytes = dataAsString.encode('utf-8')
      h = hashlib.sha256(dataAsBytes).hexdigest()
      hashData = {
        "uuid": uuid.uuid4().hex,
        "hash": h,
        "uuidL": None,
        "uuidR": None
      }
      newTreeOneLevel.append(hashData)
      
    newTree.append(newTreeOneLevel)

    while len(newTree[-1]) > 1:
      size = len(newTree[-1])
      i = 0
      newTreeOneLevel = []
      while i < size:
        childL = newTree[-1][i]
        i += 1
        
        if i==size:
          childR = None
        else:
          childR = newTree[-1][i]
        i += 1

        data = {"childL": childL, "childR": childR}
        dataAsString = json.dumps(data, sort_keys=True)
        dataAsBytes = dataAsString.encode('utf-8')
        h = hashlib.sha256(dataAsBytes).hexdigest()
        hashData = {
          "uuid": uuid.uuid4().hex,
          "hash": h,
          "uuidL": childL["uuid"],
          "uuidR": childR["uuid"] if childR is not None else None
        }
        newTreeOneLevel.append(hashData)
      
      newTree.append(newTreeOneLevel)
      
    self.hash = newTree
If you test it the same way as IndependentHasher, you will see results similar to mine:


[{'data': 'data 1'}]
[
  [
    {'uuid': '9b...13', 'hash': 'fe...dc', 'uuidL': None, 'uuidR': None}
  ]
]

[{'data': 'data 2'}, {'data': 'data 1'}]
[
  [
    {'uuid': '12...93', 'hash': 'e9...a1', 'uuidL': None, 'uuidR': None},
    {'uuid': '98...f8', 'hash': 'fe...dc', 'uuidL': None, 'uuidR': None}
  ],
  [
    {'uuid': '2e...c5', 'hash': 'f5...e7', 'uuidL': '12...93', 'uuidR': '98...f8'}
  ]
]

[{'data': 'data 2'}, {'data': 'data 3'}]
[
  [
    {'uuid': '38...43', 'hash': 'e9...a1', 'uuidL': None, 'uuidR': None},
    {'uuid': '4f...66', 'hash': '29...65', 'uuidL': None, 'uuidR': None}
  ],
  [
    {'uuid': '57...c4', 'hash': 'e8...8b', 'uuidL': '38...43', 'uuidR': '4f...66'}
  ]
]

[{'data': 'data 3'}]
[
  [
    {'uuid': 'b2...81', 'hash': '29...65', 'uuidL': None, 'uuidR': None}
  ]
]

[{'data': 'data 3'}, {'data': 'data 4'}]
[
  [
    {'uuid': 'c1...f8', 'hash': '29...65', 'uuidL': None, 'uuidR': None},
    {'uuid': '24...24', 'hash': 'd8...fa', 'uuidL': None, 'uuidR': None}
  ],
  [
    {'uuid': 'bc...64', 'hash': 'c6...89', 'uuidL': 'c1...f8', 'uuidR': '24...24'}
  ]
]

[{'data': 'data 3'}, {'data': 'data 5'}, {'data': 'data 4'}]
[
  [
    {'uuid': '52...c0', 'hash': '29...65', 'uuidL': None, 'uuidR': None},
    {'uuid': 'ce...b1', 'hash': 'fb...21', 'uuidL': None, 'uuidR': None},
    {'uuid': 'a2...03', 'hash': 'd8...fa', 'uuidL': None, 'uuidR': None}
  ],
  [
    {'uuid': '59...96', 'hash': 'b0...35', 'uuidL': '52...c0', 'uuidR': 'ce...b1'},
    {'uuid': 'c0...48', 'hash': '32...5c', 'uuidL': 'a2...03', 'uuidR': None}
  ],
  [
    {'uuid': 'be...f2', 'hash': 'f3...ad', 'uuidL': '59...96', 'uuidR': 'c0...48'}
  ]
]




Salting hashes


Hashes are commonly used to store information about passwords. Instead of keeping the password itself, you store only its hash:


newPasswordHash = hash(newPassword)
SaveInPasswordFile(hash: newPasswordHash, forUser: u)
When the user u tries to log in, they enter a password and its hash is computed:


hCheck = hash(enteredPassword)
CheckUserCredentials(hash: hCheck, forUser: u)
This way, even if someone gains unauthorized access to the password file, they will see only hashes, which are useless on their own because you must provide the correct password to produce the right hash.
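The storage-and-check scheme above can be sketched in Python. This is a minimal illustration, not a production scheme: passwordFile is a hypothetical in-memory stand-in for the real password file, and savePassword/checkCredentials are illustrative names.

```python
import hashlib

# Hypothetical in-memory stand-in for the password file.
passwordFile = {}

def savePassword(user, password):
    # Store only the hash, never the password itself.
    passwordFile[user] = hashlib.sha256(password.encode('utf-8')).hexdigest()

def checkCredentials(user, enteredPassword):
    # Hash the entered password and compare with the stored hash.
    hCheck = hashlib.sha256(enteredPassword.encode('utf-8')).hexdigest()
    return passwordFile.get(user) == hCheck

savePassword("alice", "s3cret")
print(checkCredentials("alice", "s3cret"))   # True
print(checkCredentials("alice", "wrong"))    # False
```

Note that the file never contains "s3cret"; only the verifier who knows the password can reproduce the stored hash.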

And here lies a vulnerability. Notice that in practice you don't need to know the true newPassword – anything which, after hashing, yields newPasswordHash will do:


hash(newPassword) -> newPasswordHash
hash(someMaybeStrangeSequenceOfCharacters) -> newPasswordHash
Both newPassword and someMaybeStrangeSequenceOfCharacters are equivalent in terms of security: both let you pass the password check, because they produce an identical hash and are therefore indistinguishable during verification.

Potentially an attacker may precompute such a table (if the list of candidate passwords is very, very long, techniques like a Rainbow Table [sec002] can save some space):


hash1 = hash(someMaybeStrangeSequenceOfCharacters1)
hash2 = hash(someMaybeStrangeSequenceOfCharacters2)
...
Of course this requires a lot of time, but you do it only once and can then break any password protected by the given hash function.
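A toy version of such a pre-computed table looks like this; the candidate passwords and the stolen hash are made up for illustration:

```python
import hashlib

# Pre-compute the hash of every candidate password once...
candidates = ["123456", "password", "qwerty", "letmein"]
table = {hashlib.sha256(p.encode('utf-8')).hexdigest(): p for p in candidates}

# ...then invert any stolen hash with a single dictionary lookup.
stolenHash = hashlib.sha256(b"qwerty").hexdigest()
print(table.get(stolenHash))   # 'qwerty'
```

Real attack dictionaries contain billions of candidates, but the lookup step stays this cheap – which is exactly why pre-computation pays off.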

Note that multiple hashing:


newPasswordHash = hash(hash(hash(enteredPassword)))
is not a solution to this problem at all, because again it is enough to find anything satisfying this relation:


hash(hash(hash(enteredPassword))) = hash(hash(hash(anything)))
Multiple hashing doesn't protect your data by itself; however, it may slow down brute-force attempts to break your secret.
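The triple hashing above generalizes to a loop, sketched here with the illustrative name multiHash. Note that the result is still fully deterministic, so an attacker can tabulate multiHash values exactly like single-hash values:

```python
import hashlib

def multiHash(password, rounds=3):
    # Feed each hex digest back in as the next round's input.
    h = password.encode('utf-8')
    for _ in range(rounds):
        h = hashlib.sha256(h).hexdigest().encode('utf-8')
    return h.decode('ascii')

# Still deterministic: the same input always yields the same final
# hash, so a pre-computed dictionary of multiHash values works too.
print(multiHash("password") == multiHash("password"))   # True
```

The only gain is that each table entry now costs the attacker (and you) three hash computations instead of one.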

The weakness of hashes (even applied multiple times) is that they can be pre-computed, stored in a table, and then simply looked up to find the sequence you need. The question is: how do you thwart such pre-computed dictionary attacks?

The answer is simple: make the pre-computed dictionary useless.

Consider a password hash that is generated using the following function (where "+" is the concatenation operator):


newPasswordSaltedHash = hash(password + salt)
or even better with multiple hashing:


newPasswordSaltedHash = hash(hash(password) + salt)
The salt value doesn't have to be secret; it may be generated at random and stored together with the password hash. For the best security it should never be repeated, so it serves as the nonce mentioned earlier and is sometimes called just that. A large salt value prevents pre-computation attacks by ensuring that each user's password is hashed uniquely. This means that two users with the same password will have different password hashes (as long as different salts are used). To succeed, an attacker would need to precompute tables for every possible salt value, which is impractical. If the salt is not secret, an attacker only has to recompute hashes for that one salt, which is simpler, but it still takes enough time to discourage anyone from doing it many times over (for many different passwords).
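The scheme hash(hash(password) + salt) can be sketched as follows; saltedHash is an illustrative name, and the 16-byte random salt length is an assumption, not a requirement:

```python
import hashlib
import os

def saltedHash(password, salt=None):
    # Sketch of hash(hash(password) + salt); the salt is a random,
    # non-secret value stored next to the resulting hash.
    if salt is None:
        salt = os.urandom(16)
    inner = hashlib.sha256(password.encode('utf-8')).digest()
    return salt, hashlib.sha256(inner + salt).hexdigest()

# Two users with the same password get different hashes...
saltA, hashA = saltedHash("qwerty")
saltB, hashB = saltedHash("qwerty")
print(hashA != hashB)   # True (with overwhelming probability)

# ...yet verification still works when the stored salt is reused.
_, hashCheck = saltedHash("qwerty", saltA)
print(hashCheck == hashA)   # True
```

Because each user's salt is different, a single pre-computed table no longer covers everyone – the attacker must start over for every salt.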

In this case, hashing the password repeatedly – especially sufficiently many times [sec001] – slows down the hashing process and so makes it more resistant to brute-force cracking attempts.

Summary


The blockchain heavily depends on hashes. Hashes and hashing are used in the blockchain for:

  • computing a digital fingerprint of data;
  • storing data in a way that allows easy identification of any kind of change, guaranteeing its integrity;
  • defining the computational cost of a new block of data.


Hence, understanding hashes and the way you can compute them is crucial for understanding the blockchain.