Chapter 6
Chain of blocks

Initial version: 2024-02-19
Last update: 2024-05-07

In this part you will see basic data structure used in blockchain: a sequence of blocks. In typical sequence of object you can easily exchange selected elements: you take out one of them and put in a new position. If you have a chain, it is not so easy to take out one link and "install" it on a new position. Yes, it is possible, but requires you to do a lot of work. This is why, instead of sequence of blocks, I will be talking about a blocks connected like a links in a chain – it would be really very hard to "exchange" any block in this data structure. Hence, blockchain.

Table of contents


Simple blockchain data structure


The blockchain data structure is a specific kind of data structure that is made up of ordered units called blocks. The illustration given in figure below shows a simplified blockchain data structure that consists of two blocks labeled BLOCK 1 and BLOCK 2.


[IMG: simplified_blockchain_data_structure_containing_four transactions]
Each block consists of two major data structures:

  • block header that keeps all metadata required to properly operate with each block;
  • and a data structure, which is typically a Merkle tree, that keeps informations about application-specific data (for example transactions) assigned to (maintained by) this block.
Each block header:

  • is identified by its cryptographic hash value corresponding to its content,
  • contains a hash reference to its preceding block header,
  • and a hash reference to the application-specific data, which is typically the root of Merkle tree.
A block can contain any data you want or need for your application. Moreover, blocks also may contain executable code as part of their data as it is in case of Ethereum. Because blockchain acts as a register allowing you to track ownership, thus it tracks ownership flow: who acquires something, who gives something. In consequence each one data stored in blockchain is very often called transaction even if it represents something different than real transaction. You can think about transaction as an act of adding new data to a blockchain regardless of this data kind and type.

Notice that BLOCK 1 is the very first block in this data structure, hence, it does not have a preceding block, and, consequently, Block 1 Header does not contain any reference to a preceding block header. The very first block is sometimes called the genesis block.

The idea of referencing by each block header its preceding block header allows blockchain preserves the order of the individual block headers which in turn preserves the order of blocks and in consequence the order of data. That make up the core of blockchain data structure.

In Python a single block may look like in example below:


block = {
  "header": {
    "blockHash": [HASH OF THIS BLOCK],
    "previousBlockHash": [HASH OF PREVIOUS BLOCK],
    "timestamp": [TIMESTAMP OF THIS BLOCK CREATION TIME],
    "dataMainHash": [HASH OF DATA, in my case ROOT OF MERKLE TREE]
    "difficulty": [DIFFICULTY OF POW],
    "nonce": [NONCE FOR POW],
    "mineTime": [TIME OF MINEING (POW)]
  },
  "data": [
    data1,
    ...
    dataN
  ],
  "metadata": { [OTHER APPLICATION OR IMPLEMENTATION SPECIFIC METADATA]
     "merkleTree": [FULL MERKLE TREE OF DATA in my case]
  }
}
I assume that all data1 to dataN are string representation of your data. To store [MERKLE TREE OF DATA] you may have a structure like below (here it is given for six data):


[[H_1=hash(data1),
  H_2=hash(data2),
  H_3=hash(data3),
  H_4=hash(data4),
  H_5=hash(data5),
  H_6=hash(data6)],
  
 [H_2_1=hash(H_1+H2),
  H_2_2=hash(H_3+H4),
  H_2_3=hash(H_5+H6)],
  
 [H_3_1=hash(H_2_1+H_2_2),
  H_3_2=H_2_3],
  
 [hash(H_3_1+H_3_2)]]
The last element on the list, the [hash(H_3_1+H_3_2)], is the root of this hashing data structure.

It may be computed with the help of the following method, which is a simplified version of the method presented in a chapter 3: Data fingerprinting, section: Implement hashing (subsection: Implementing hierarchical hashing). Notice that you calculate Merkle tree for static data – data which you know in advance and for sure will not change during calculations. I also decided to not include references from parent hashes to corresponding child hashes which saves storage space. Here you have the code:


'''
Input:
data - list with JSONs with application specific data

Return:
List with hashes.
'''    
def computeMerkleTree(data):
  dataHashes = []
  
  for d in data:
    jsonAsString = json.dumps(d, sort_keys=True)
    dataAsBytes = jsonAsString.encode('utf-8')
    dataHashes.append(sha256(dataAsBytes).hexdigest())
  
  merkleTree = [dataHashes]
  
  while len(merkleTree[-1]) > 1:
    size = len(merkleTree[-1])
    i = 0
    hashes = []
    while i < size:
      h1 = merkleTree[-1][i]
      i += 1
      if i == size:
        hashes.append(h1)
        break
      h2 = merkleTree[-1][i]
      i += 1
      
      h = sha256(h1.encode('utf-8')+h2.encode('utf-8')).hexdigest()
     
      hashes.append(h)
    
    merkleTree.append(hashes)
    
  return merkleTree   


Calculating hash of block


I should explain you how you calculate the hash for a block.

In the form of blockchain I consider, the cost is defined by the need of solving the hash puzzle. You consider slightly smaller data structure than "full" header (I will call this data structure as HHhash header – to avoid any confusion with complete header):


{
  "previousBlockHash": [HASH OF PREVIOUS BLOCK],
  "timestamp": [TIMESTAMP OF THIS BLOCK CREATION TIME],
  "dataMainHash": [HASH OF DATA, in my case ROOT OF MERKLE TREE]
  "difficulty": [DIFFICULTY OF POW],
  "nonce": [NONCE FOR POW],
}
Two fields of HH are important as you need them solving puzzle (see chapter 3: Data fingerprinting, section: Solving hash puzzle): these are difficulty and nonce field. nonce is the number you choose randomly so the hash of HH starts with difficulty leading zeros (or difficulty can be a parameter of any other reasonable procedure of this type).

Remaining fields are:
  • previousBlockHash – this is a hash of previous block for all blocks but the first; for the first this is predefined hash like a hash of a Genesis string string.
  • timestamp – the time you created the block (the time before you start computing hash for HH).
  • dataMainHash – this is a main hash of your data hashes; in my case this is a root of Merkle tree.
As you can see two fields are missing:
  • blockHash – this is what you are going to calculate so for obvious reasons it can not be included in HH.
  • mineTime – this is a time you needed to solve a puzzle for this block.
You will add both fields when you mine the block.

The whole procedure of calculating hash of the block is as follows:
  1. Instantiate the hash header (HH) data structure.
  2. Set the previousBlockHash field of HH to the hash of directly preceding block if it exist; otherwise set its value to any arbitrary chosen (for example being a hash of Genesis string text).
  3. The field timestamp should already be set during the time when block was created. If not, set it now to current timestamp.
  4. The field dataMainHash should already be set during the time when block was created.
  5. Set the field difficulty to the value reflecting currently required difficulty (determined by consensus algorithm – this will be a topic of following chapters).
  6. Select (randomly) nonce and assign its value to the field nonce of HH.
  7. Calculate hash h of HH.
  8. If h comply required difficulty reflected in the difficulty field, go to the step 10.
  9. Go to step 6 to repeat hashing procedure with different nonce.
  10. Add to the HH data structure two new fields with corresponding values:
    • blockHash with value equal to h.
    • mineTime with value equal to the total time you need to find correct nonce.
  11. End of block hashing procedure.


Consequences


As I wrote above, the idea of referencing by each block header its preceding block header allows blockchain preserves the order of the individual blocks. However it has also more serious consequences influencing what you can (not) do with such a chain.

Note that you cannot interfere with the data in the chain to keep all changes unnoticed. Even if you change a one bit in a data this will result in totally different hash of it (remember about the avalanche effect of cryptographic hash function I have mentioned in chapter 3: Data fingerprinting). This in turn will affect a hash of block containing this data. This is not the end. Avalanche will continue making "louder noise" as hashes of all subsequent blocks are calculated also for hashes of previous blocks. In consequence all hashes starting from the point that causes the change until the head of the whole chain are no valid anymore.


[IMG: effect of manipulating data for "hello" text in each block]


How does it protect data from being forged?


For reasons I have explained in previous chapters, the creators of blockchain decided to design it as a purely distributed peer-to-peer system. What is more, it is open to everyone. As such it is highly exposed to attempts to manipulate or forge the history of transaction (data) by dishonest peers. The very clear challenge is how to protect the history of transaction from being forged or manipulated while still keep blockchain open.

The challenge is how to protect the history of transaction from being forged or manipulated while still keep blockchain open.


The main idea how blockchain makes its data safe by ensuring its immutability is quite simple, at least in theory because in implementation this simplicity complicates "a little". You should design a system in which you have to somehow "pay" for placing your data in it. Because the system is open, everyone can verify what you have added to the chain, especially verify correctness of all hashes and if they are computed according to all rules. Because it cost you some money and it is a subject to a verification procedure it doesn't pay off to cheat which will inevitably result in a loss of money and your data being removed from the chain. What is more, the cost should growth exponentially respectively to the age of data in the system. The older data is the more effort (understood as a resources needed to do this, for example: money, hardware or time) you have to put in in order to manipulate it.

In the form of blockchain I consider, the cost is defined by hash puzzle. You have to solve hash puzzle finding right nonce value to be able to add new block in acceptable by others way. You don't know how long will that take, how many repetition of steps from 6 to 8 of the calculating hash of the block procedure you will need. Every second of running you computer trying to solve the puzzel cost some money as you have to pay for an electric energy consumed by it. You can do this faster by utilizing more powerful computer (whether one with better hardware or multiple commodity nodes) but this will cost you no less money as you will need more energy (and don't forget about money required to extend your processing power).

The cost is mainly determined by difficulty. This should be set at the high enough level to deteriorate anybody from any fraud attempt. On the other side you may ask if very high difficulty would stop normal blockchain operation? High cost means long computing time; it would be useless if you have to wait for placing your data in a blockchain for example one month. Don't forget about one very important assumption underlying the operation of blockchain: openness and constituting consensus on what is true and what should be left based on single votes of a big group of participants of equal rights. There are many nodes trying at the same time to add a new block. So even the time required to accomplish this task is set to be very long if it is done by one node, in practice it becomes much shorter thanks to massive attempts taken in parallel by many different nodes. Because chances to solve puzzle are purely random governed by statistical rules, the more nodes try it, the less time will that take. For example from chapter 3: Data fingerprinting, section: Solving hash puzzle you know that if difficulty is set to 6, number of iterations required to find nonce is around 15138250 which takes approximately 218.6903 seconds on my computer. However if I run this procedure in parallel on many nodes, one of them has found it in just a 0.0301 second (saying the truth it wasn't in parallel – I run this procedure 1000 times and selected the lowest time) (get full test results):

Difficulty Number of iterations Time [s]
average min max average min max
1 10 1 23 0.0001 0.000012975 0.000284
2 156.9 2 585 0.00135 0.000023121 0.0049
3 2259.5 206 4023 0.01958 0.00175 0.0344
4 41935.7 8972 121268 0.4406 0.07925 1.4449
5 1858598,7 207823 4247928 17,04389 1.807712 39.80722
6 12791193,6 213637 31763337 124.270 2.112669 307.675219


If cost doesn't prevent you from doing bad things and you still want to manipulate the data, you should do it as quickly as possible, on the latest block (the head of the whole chain) as this would cost you the least. Assume you want to forge the latest block. So you start solving puzzle for "modified" data. When you finish you are ready to replace it. Hold on! You are not the only one in the system. During the time you were recalculating forged version of the lates block many other nodes were trying to add new block on top of lates you trying to exchange! If you replace it, they will realize something is wrong. To make your attack successful you should recalculate the lates block and the also the next block after lates before any other node will finish its calculations based on true lates block. This significantly increases the difficulty of the task you would have to solve. Beside this remember that truth is what is backed up by most of the nodes. So replacing latest block and the following block before all other nodes is not enough. You have to do this on more than 50% of nodes to be the majority. In practice you should somehow control more than 50% of blockchain's processing power which in healthy system of this type is (almost) impossible.

Conclusions


There are no 100% guarantee that the blockchain data structure will not suffer any fraud, but the probability of this is really very, very low. The reasons for this are:

  • Data are link to each other in indisputable way. Even the smallest manipulation of its content stands out and becomes noticeable.
  • In consequence of previous, any changes in one of a blocks requires you to make recalculation for all other subsequent blocks.
  • To add a new data to the chain you have to make very expensive computations induced by hash puzzles that are unique for each block.


Implementing blockchain


At this moment you are ready to implement blockchain at very basic level. It should be functional, but not fully complete because some topics are still waiting for you to be presented. Anyway you should be ready to work with it.

The code should be self descriptive. If not, let me know and I will clarify what is vague. You will find that I use ThreadSafeList() class – see appendix section for its code and some explanations why I decide to use it.


import json
import random
import time
import uuid

from datetime import datetime
from hashlib import sha256

class Blockchain(object):
  class DATA_TYPE:
    TRANSACTION_CODE = 100
    TRANSACTION = {"code": TRANSACTION_CODE, "description": "transaction"}

  def __init__(self):
    self._chain = []
    self._pendingData = ThreadSafeList()
    self._dataToIncludeInNewBlock = []

  def prepareDataToIncludeInNewBlock(self):
    self._dataToIncludeInNewBlock = self._pendingData.getChunk()
    #self._dataToIncludeInNewBlock = self._pendingData[:min(2,len(self._pendingData))]

  def addNewBlock(self):
    def dataCompareFunction(data1, data2):
      if data1["uuid"] == data2["uuid"]:
        return True
      return False

    self.prepareDataToIncludeInNewBlock()
    newBlock = self.generateNewBlock()
    newBlock = self.proofOfWorkForBlock(newBlock, difficulty=4)
    
    # If newBlock == None then calculating PoW must be
    # interrupted by some external action, for example
    # arrival of block mined by other node.
    if newBlock:
      self._chain.append(newBlock)
      # BEGIN: Remove from _pendingData everything which is in _dataToIncludeInNewBlock
      self._pendingData.removeDuplicates(self._dataToIncludeInNewBlock, dataCompareFunction)
      # END
      self._dataToIncludeInNewBlock = []
      
      return True
    else:
      return False

  def acceptNewBlock(self, block):
    # In future version add some code here
    # to add to this blockchain structure
    # block mined by other node

    # Verify block
    # If verification is positive then add new block    
    # Remove from pendingData data present in new block
    pass

  def computeHashForData(self, data):
    return computeMerkleTree(data)

  def generateNewBlock(self):
    data = self._dataToIncludeInNewBlock
    merkleTree = self.computeHashForData(data)

    if len(self._chain) > 0:
      previousBlock = self._chain[-1]
      previousBlockHash = previousBlock["header"]["blockHash"]
    else:
      genesisString = 'Genesis string'
      genesisStringBytes = genesisString.encode('utf-8')
      previousBlockHash = sha256(genesisStringBytes).hexdigest()


    block = {
      "header": {
        "previousBlockHash": previousBlockHash,
        "timestamp": datetime.utcnow().isoformat(),
        "dataMainHash": merkleTree[-1][0],
        "difficulty": None,
        "nonce": None
      },
      "data": data,
      "metadata": {
        "merkleTree": merkleTree
      }
    }

    return block

  def proofOfWorkForBlock(self, block, difficulty):
    leadingZeros = "0" * difficulty

    start = time.time()
    block["header"]["difficulty"] = difficulty

    print("Start mining")
    while True:
      block["header"]["nonce"] = format(random.getrandbits(64), "x")
      dataString = json.dumps(block, sort_keys=True)
      dataBytes = dataString.encode('utf-8')

      hash = sha256(dataBytes).hexdigest()

      if hash.startswith(leadingZeros):
        block["header"]["blockHash"] = hash
        end = time.time()
        block["header"]["mineTime"] = end-start
        break

    return block 


  def addNewTransaction(self, sender, recipient, amount, document, signatureOfDocument):
    data = {
      "uuid": uuid.uuid4().hex,
      "dataType": self.DATA_TYPE.TRANSACTION,
      "sender": sender,
      "recipient": recipient,
      "amount": amount,
      "document": document,
      "sendersSignature": signatureOfDocument
    }

    self._pendingData.push(data)

  def hashBlock(self, block):
    blockAsBytes = json.dumps(block, sort_keys=True).encode('utf-8')
    return sha256(blockAsBytes).hexdigest()

  def saveBlockchain(self):
    data = {
      "chain": self._chain,
      "pendingData": self._pendingData._list,
      "dataToIncludeInNewBlock": self._dataToIncludeInNewBlock
    }

    with open("blockchain.txt", "w") as fileData:
      json.dump(data, fileData)

  def loadBlockchain(self):
    with open("blockchain.txt") as fileData:
      data =  json.load(fileData)

      self._chain = data["chain"]
      self._pendingData._list = data["pendingData"]
      self._dataToIncludeInNewBlock =  data["dataToIncludeInNewBlock"]
Here is how you can use it:


blockchain = Blockchain()
sender = "Foo"
recipient = "Bar"
amount = "Gold, 0.2kg"
document = {"sender": sender, "recipient": recipient, "amount": amount}
signatureOfDocument = "XYZ"
blockchain.addNewTransaction(sender, recipient, amount, document, signatureOfDocument)
print(blockchain._chain)
print(blockchain._pendingData._list)
print(blockchain._dataToIncludeInNewBlock)
print("Try to add a new block")
blockchain.addNewBlock()
print(blockchain._chain)
print(blockchain._pendingData._list)
print(blockchain._dataToIncludeInNewBlock)
The corresponding results of execution above code are given below (reformatted for clarity):


[]
[{
  'uuid': 'fb3f6bee857d42debf5ef6ecff3bd808',
  'dataType': {
    'code': 100,
    'description': 'transaction'
  },
  'sender': 'Foo',
  'recipient': 'Bar',
  'amount': 'Gold, 0.2kg',
  'document': {
    'sender': 'Foo',
    'recipient': 'Bar',
    'amount': 'Gold, 0.2kg'
  },
  'sendersSignature': 'XYZ'
}]
[]
Try to add a new block
Start mining
[{
  'header': {
    'previousBlockHash': '257c0cd7d1e5c9e8be39344583c5f08446cd0ba2537cc5ee4cfa7afae8f7a1ef',
    'timestamp': '2023-04-25T08:13:05.367967',
    'dataMainHash': '655a8bdd0598e7e2bcaee1332babd6ed25d8f61cc143c47a89aad3d87523d4d6',
    'difficulty': 4,
    'nonce': '216806b3e9030b3e',
    'blockHash': '00001f410f37dec7377c234dc594e95d9dbba6169d4b18e449c4e12692ac26e1',
    'mineTime': 2.916616916656494
  },
  'data': [{
    'uuid': 'fb3f6bee857d42debf5ef6ecff3bd808',
    'dataType': {
      'code': 100,
      'description': 'transaction'
    },
    'sender': 'Foo',
    'recipient': 'Bar',
    'amount': 'Gold, 0.2kg',
    'document': {
      'sender': 'Foo',
      'recipient': 'Bar',
      'amount': 'Gold, 0.2kg'
    },
    'sendersSignature': 'XYZ'
  }],
  'metadata': {
    'merkleTree': [['655a8bdd0598e7e2bcaee1332babd6ed25d8f61cc143c47a89aad3d87523d4d6']]
  }
}]
[]
[]


Facts, myths and questions


How long will it take me to generate a block?
No one can say exactly. In the Bitcoin the rate is limited by difficulty to one block every 10 minutes in average. But this is only an average and in practice you may spend 2 seconds, 2 days or even more – no one knows (compare with the table given above in this chapter). Spending your time trying to solve the puzzle (proof of work) doesn't get you any closer to success. Believing otherwise is what's known as the Gambler's fallacy (also known as the Monte Carlo fallacy). It's like trying to flip 5 coins at once and have them all come up heads. Each time you try, your chances of success are the same.

The gambler's fallacy is the belief that, if an event (whose occurrences are independent and identically distributed) has occurred more frequently than expected, it is less likely to happen again in the future. The fallacy is commonly associated with gambling, where it may be believed, for example, that the next dice roll is more than usually likely to be six because there have recently been fewer than the expected number of sixes. However because of independency of events six next dice roll has always the same chances no matter how long you play.

This makes all systems existed in number lottery games totally useless. Assumption that it is worth betting on numbers that have not been drawn for a long time, and not on those that were drawn in the last draw and analysing historical results doesn't make any sense. This is because every time before the drawing, each combination of numbers has exactly the same chance of being drawn.

Amos Tversky and Daniel Kahneman, in their article Belief in the Law of Small Numbers [amokah], described an experiment in which subjects demonstrated the false belief that the properties of large random samples also apply to small samples.


In consequence there's no such thing as making progress towards solving the puzzle. After working on it for some time, no matter how long or short, your chances of solving it are equal to what your chances were at the start or at any other moment before.

What makes a block to be valid? Is it possible to have two different blocks which are valid and ready to be appended to the chain?
There is no something such a true or right next block which everyone is trying to find. Usually there are many blocks that could be the next one. Each node (independently) builds up a list of pending transactions, and they pick some or all of them to go into the new block. Once they have their (own) list of accepted transaction (transactions they decide to include in a block), they calculate the hash of a Merkle tree for that list.

Even if every node had the same set of awaiting transactions and apply the same rules for picking and ordering them, they would all be starting from different point anyway, because the first transaction in every block is unique to the node trying to mine that block. This is a reward transaction describing how much you will earn if you successfully mine and place the block as a next block in the chain. A reward transaction is named the coinbase transaction. This is the only transaction that does not point to an existing (preceding transaction "explaining" how did you get bitcoins), but instead mints new bitcoin according to the bitcoin protocol monetary policy. You may read about this in the chapter 11: Bitcoin's money lifecycle.

You must be warned that even if you successfully mine a block, there is no guarantee that you will get the reward – other node may be faster than you.

So the starting point for each node is different and there can't be two identical candidate blocks. Saying the truth, there can be two identical candidate blocks but this may happen only when other nodes will mine block for you which is very low probable (however theoretically possible).

Even if you allow for the possibility of identical candidate blocks existence remember that solution to the cryptographic puzzle (proof of work) is not unique (see chapter 3: Data fingerprinting, section: Proof of work – solving a hash puzzle). There are multiple valid solutions for any given block - only one of the solutions needs to be found for the block to be solved.

Is there any proof that for every block there exists at least one nonce?
No. I would say even something much worse: you should expect that such a nonce does not exist for the block you are trying to mine.

How do you know, that nonce does not exists?
You don't know. I explained it in previous question: How long will it take me to generate a block? Two approaches are possible:
  • Time constrain In this approach you assume a maximum time or number of iteration you will try to mine the block.
  • Space constrain In this approach you assume that nonce comes from a finite interval, for example it may be an integer from interval $[0, 2^{32}-1] = [0, 4294967295]$. Finite interval naturally defines a finite space of numbers (for example 4294967296 in this case) and this way you know that you have at most 4294967296 trials to find correct nonce (if you select numbers one by one from the minimum value to the maximum).
In case of failure you change something in your block, for example add and/or remove some transaction(s) or simply alter something in the block header (the simples choice is to update timestamp) and start mining again. See below Real block example – Bitcoin case to know how this problem is solved in real case.

What is the maximum number of blocks?
There is no such thing as the maximum number of blocks. You simply add next block to the chain as soon as you solve a puzzle. In the Bitcoin the rate is limited by difficulty to one block every 10 minutes in average. And this is the only one limitation.

But I heard that the amount of coins possible to be mined is limited to 21 million – what will hapened when all of them will be mined?
Yes, you are right. 21 million block limit is a pure consequence of how currency is issued. You may read about this in the chapter 11: Bitcoin's money lifecycle. You should keep in mind for what you need blocks: the blocks are for proving that transactions occurred at a particular time. You can still trade using currency currently issued – you don't need new currency to be added to the general circulation of money to be able to sell or buy. In consequences transactions, as a transfer of ownership, will still occur once all the coins have been generated because people will still sell or buy something using existed currency. In result blocks will still be created (as long as people will be trading using Bitcoins).

Real block example – Bitcoin case


Below I present very general structure of block in case of Bitcoin:

Field Comment Size [B]
Magic no value always 0xD9B4BEF9 4
Blocksize number of bytes following up to end of block 4
Blockheader consists of 6 other items 80
Transaction counter positive variable length integer [lmb001] 1-9
Transactions the (non empty) list of transactions Depend on the number of transactions


The Blockheader field is rather a "metafield" as it contains no single value but all informations which in our simplified model are enclosed in header section. These are:

Field Description Size [B] When set or changed
Version Block version number 4 You upgrade the software and it specifies a new version
hashPrevBlock 256-bit hash of the previous block header 32 A new block comes in
hashMerkleRoot 256-bit hash based on all of the transactions in the block 32 A transaction is accepted
Time Current block timestamp as seconds since 1970-01-01T00:00 UTC 4 Every few seconds
Bits Current target in compact format 4 The difficulty is adjusted
Nonce 32-bit positive integer number 4 Nonce is tried to get valid hash


Names of the fields are self-descriptive and correspond to the names I have introduce for our simplified case, so I think this part does not require any additional comment from my side. You have to "extract" some of the data you see above and pack them into new data structure (I called it hash header) in order to solve cryptographic puzzle which is an indispensable condition to successfully add a valid block to the chain.

You may find interesting that Bitcoin mining proof of work (the same as the one I have described in previous chapter) is based on the hashcash proof of work algorithm. Hashcash refers to a proof-of-work system that was created by Adam Back in 1997. His creation was initially meant to limit email spamming and DDoS attacks. The main idea of this anti-spam protection was to accompany every e-mail with special string which is moderately hard to prepare (compute) which requires noticeable time interval, but not intractable; on the other hand it is trivial to verify. This way sender (spammer) hits some speed limit as he/she can not send message with incorrect special string as it wouldn't be accepted; and finding the correct string takes a few seconds.

The X-Hashcash: header allows you to insert the special string into the email headers section of the email the user sends. The header line looks something like this:


X-Hashcash: 1:16:2405080145:piotr@fulmanski.pl::ZjRqQ1VC:AGw=
which conforms to general format [hacaor, section: stamp format (version 1)]:


VERSION:BITS:DATE:RESOURCE:[EXTENSION]:RANDOM:COUNTER
where:
  • VERSION: Hashcash format version, 1 (which supersedes version 0).
  • BITS: How many bits of partial pre-image the stamp is claimed to have. Other words, number of leading zero bits in the hashed code (see also below).
  • DATE: The time that the message was sent, in the format YYMMDD[hhmm[ss]].
  • RESOURCE: Resource data string being transmitted, e.g., an IP address or email address.
  • EXTENSION: Extension (optional; ignored in version 1). Extension is a list of key-value pairs.
  • RANDOM: String of random characters, encoded in base-64 format.
  • COUNTER: Binary counter, encoded in base-64 format.
The fields chosen to be included in the header string was selected to minimize the risk of a DoS attack succeeding as much as possible. The presence of the recipient's email address requires that a different header be computed for each recipient. The date allows the recipient to record headers received recently and to ensure that the header is unique to the email message.

And what is a partial hash-collision? A full-collision is when all bits of hash(x) match hash(y). A $k$-bit partial collision is when only the $k$ most significant bits match.

To prepare above header string I used a simple Python script:


import base64
import random
import string

from hashlib import sha1

charsPool = string.ascii_letters + string.digits

def randomCharsSequence(size=6, chars=charsPool):
  return ''.join(random.SystemRandom().choice(chars) for _ in range(size))

version = "1"
bits = "16"
date = "2405080145"
resource = "piotr@fulmanski.pl"
extension = ""
dataBytes = randomCharsSequence().encode('utf-8')
random = base64.b64encode(dataBytes).decode("utf-8")

c = 0

while True:
  dataBytes = c.to_bytes(2, 'big') 
  counter = base64.b64encode(dataBytes).decode("utf-8")

  fields = [version, bits, date, resource, extension, random, counter]
  header = ":".join(fields)
  hash = sha1(header.encode('utf-8')).hexdigest()
  
  if hash.startswith("00"): # 16 zero bits = 2 zero bytes
    print(f"for counter={c}:")
    print(header)
    break
    
  c += 1
  
  if c%10 == 0:
    print(c)
I got the following result:

for counter=108:
1:16:2405080145:piotr@fulmanski.pl::ZjRqQ1VC:AGw=
You can easily verify its correctness even in a terminal (below in MacOS, where shasum command computes by default SHA1 checksum but I added -a option to make it clear; -n option prevents printing the trailing newline character):

MacBook-Piotr:case03 fulmanp$ echo -n 1:16:2405080145:piotr@fulmanski.pl::ZjRqQ1VC:AGw= | shasum -a 1
00582f58b83adcea6730325391b93ddcae75af16  -




In most cases in Bitcoin, when a hash is computed, it is computed twice. Most of the time SHA-256 hashes are used, however RIPEMD-160 is also used when a shorter hash is desirable (for example when creating a bitcoin address). Bitcoin uses: SHA256(SHA256(blockHeader)) but you have to be careful about byte-order – almost all integers are encoded in little endian. Only IP or port number are encoded in big endian.

Similarly Merkle tree in Bitcoin use a double SHA-256, the SHA-256 hash of the SHA-256 hash of something. General scheme of tree creation remains as it was discussed earlier. First you form the very bottom row of the tree with the ordered double-SHA-256 hashes of the byte streams of the transactions in the block. Then you form next level of the tree, up to the case of only one hash at some level, by concatenating hashes of two-hashes groups from preceding level. The row above consists of half that number of hashes. Each entry is the double-SHA-256 of the 64-byte concatenation of the corresponding two hashes below it in the tree. This procedure repeats recursively until you reach a row consisting of just a single double-hash. The question that arises is: what to do when forming a row in the tree (other than the root of the tree), it would have an odd number of elements? In my implementation I presented in chapter 3: Data fingerprinting, section: Implement hashing (subsection: Implementing hierarchical hashing) I add an extra None value:


data = [data_1, data_2, data_3]

hash = [
  level_0 = [hash(data_1), hash(data_2), hash(data_3)],
  level_1 = [hash(level_0[0], level_0[1]), hash(level_0[2], None)],
  level_2 = [hash(level_1[0], level_1[1])]
]
In Bitcoin the last hash (double-hash to be precise) is duplicated to ensure that the row has an even number of hashes:


data = [data_1, data_2, data_3]

hash = [
  level_0 = [dHash(data_1), dHash(data_2), dHash(data_3), dHash(data_3)],
  level_1 = [dHash(level_0[0], level_0[1]), dHash(level_0[2], level_0[2])],
  level_2 = [dHash(level_1[0], level_1[1])]
]


The body of the block contains the transactions. These are hashed only indirectly through the Merkle root. Because transactions aren't hashed directly, hashing a block with 1 transaction takes exactly the same amount of effort as hashing a block with 100, 1000 or any other number of transactions.

Extra nonce


The nonce Nonce field is 4-byte length and can hold numbers between 0 and 4294967295 (from 0x0 to 0xffffffff in hex). Miners increment (change systematically to benefit from space constrain) the nonce value when mining. Of course they do this in hope to find valid nonce – nonce that will result a block hash that fulfils some criterias. Unfortunately miners will usually exhaust the 4-byte nonce field in the block header without finding a valid nonce. To be able to find it, you have to change something in your block. The simplest and natural candidate is a Time timestamp filed as time flows naturally and when you finish exploring nonce space the time would be different than time when you started. The problem is that the passage of time is measured with an accuracy of one second but miners are so fast that they explore the 4-byte nonce in less than 1 second, so the time field isn't much help in this situation. Of course you may wait until next second but this way you will waste your time and thus chances to be the first who will find valid nonce. Therefore, miners look for other ways to modify the block header without having to wait or reconstruct the block of transactions.

The miners have found a solution which wasn't officially designed but become an official practice, and thus de facto a standard – they will adjusting what's referred to as an extra nonce (extraNonce). The procedure is as follows. A miner increments nonce until it overflows. Then it change extraNonce (and timestamp), resets nonce and starts again exploring nonce space. Why it works. Well, extraNonce is located in the coinbase transaction (the very first transaction in a block), so changing it alters the Merkle root and this way data you hash. The extraNonce is located inside scriptSig of the coinbase transaction. Without diving into details you should know at this moment that scriptSig field allows you to spend money you earn in previous (preceding) transaction. The coinbase transaction has no previous transaction so the scriptSig is basically an unused field that miners can use to put any data they like in to it. The only restriction is that it must be between 2 and 100 bytes in size. This way scriptSig became an unofficial field for an additional (extra) nonce value.



Summary


In this section you have learned data structure underlying blockchain and how it is turned into an immutable append-only data store.

The purpose of the chain of blocks is to keep the entire history of appended data in an ordered fashion. Just as important as maintaining the order of transactions is also the easy and quick detection of any changes (in most cases illegal) made to the data leading to their forgery. The structure of chains of blocks (blockchain in short) forces the potential forger not only to change the data of he or she is interested in, but also to recalculate all blocks from the moment of introducing the change until the head of the whole chain. However theoretically possible, this attempt should be so expensive (in terms of money, time, system resources etc.) that nobody should not even try to think about doing this. The cost is determined by the need of solving a hash puzzle of the required difficulty level.