In this blog post, I will describe the development of a hashing algorithm in CλaSH. Using CLaSH, a functional description of a circuit in Haskell can be compiled into hardware (VHDL, Verilog). Why CLaSH? This tool is developed at the CAES-chair at the University of Twente, and resulted in a spin-off company QBayLogic. At the Unviversity of Twente, I did my masters assignment and PhD. During that time, I’ve heard a lot of positive stories from colleagues who used CLaSH. The only thing that was missing for me, was a project to use CLaSH in. My recent interest in cryptocurrencies changed that. Let’s implement a hashing algorithm in CLaSH, and start mining one of the cryptocurrencies on an FPGA!

In this first post, I focus on the basics of CLaSH, without optimizing the design for an implementation on an FPGA. It should, however, result in synthesizable VHDL / Verilog code. I first introduce the mathematics behind a hashing algorithm, and converting the mathematical description step-by-step into synthesizable Haskell code. A follow-up post will go into more details about the optimization for an actual FPGA implementation on a SoCKit / DE-10-Nano board.

For a more detailed introduction of CLaSH, have a look at the blog post of a former colleague. I will assume you have a basic knowledge about the syntax of Haskell. However, don’t be afraid to read on without the knowledge; I also started using CLaSH without prior knowledge of Haskell.

Edit: this post is now updated for the recent release of Clash-0.99.

Cryptocurrency mining

Mining is finding a hash of a block that satisfies a certain difficulty. The hash is calculated of the header of a block, where the miner can freely choose the value of a number of bytes within that header: the nonce. The miner keeps on trying different values for the nonce until it find a hash that satisfies the difficulty. The difficulty can be converted into a target hash. The calculated hash should be a smaller value than this target hash. This basically means is that the hash should contains a number of leading ‘0’-bits.

The following Python-code demonstrates the main loop of a mining algorithm:

def mine(header, target):
  for nonce in range(0, 0xFFFFFFFF):
    input = header + nonce
    output = hash(hash(input))

    if output < target:
      return nonce

This examples shows that there are actually two hashes that have to be calculated for each nonce. The first one consists of the concatenation of the header and nonce, together 80 bytes. A second hash is calculated of the 64 bytes (512 bits) result. This second hash is then compared to the target hash.

Groestl hashing algorithm

The hashing algorithm I’m implementing is Grøstl. Grøstl is one of the finalists for the SHA-3. Moreover, it is one of the least used hashing algorithms for cryptocurrencies, such that there are less existing FPGA implementation available. A clear mathematical description of the Grøstl specification is available of their website, as well as an Grøstl implementation guide. The design of the hashing algorithm is inspired by the Rijndael block cipher, AES.

Hash function

The specification defines hash functions with different input and output sizes. For mining purposes, only the largest one is of interest, Grøstl-512, which returns a 512-bit hash. The hashing function compresses a message stream () into this 512-bit output. For this 512-bit output, the input is split into blocks of 1024 bits, where the last block is padded to this same length. A simplification is made here specifically for mining, since the message can always be contained in a single block.

The compression function is specified as follows:

It consists of two 1024-bit permutations P and Q and a few XOR-operations (). After the compression function processes all message blocks (a single one for mining), a final P permutations is applied, another XOR-operation, and a truncation to 512-bit, to end up with the final hash:

Permutation round

The permutation functions P and Q, consist of a number of round transformations. In Grøstl-512, 14 rounds of these transformations are applied. Each of these round consists of 4 transformations:

  1. AddRoundConstant
  2. SubBytes
  3. ShiftBytes
  4. MixBytes

A round R is composed for these four functions:

Such a function composition can directly be expressed in CLaSH:

round :: BitVector 1024 -> BitVector 4 ->  BitVector 1024
round r = MixBytes . ShiftBytes . SubBytes . AddRoundConstant r

Round matrix

The input of each round is a vector of 128 bytes (1024 bits). A sequence of bytes is mapped to a 8x16 matrix of bytes as follows:

In CλaSH, the data type BitVector n and Vec n a are defined, which can be used to represent data structures. The size of the fixed-size BitVector is specified by n. A BitVector directly translates to a std_logic_vector of the same size in VHDL. The Vec type specifies a fixed-length vector of type a and size n.

Conversion functions are defined in CλaSH that can convert a large BitVector into a Vec of smaller BitVectors and vice versa. As an example, pack can convert a BitVector 1024 into a Vec 128 (BitVector 8). The reverse is possible using the unpack function. Moreover, the function unconcat can split a Vec (n * m) a into Vec n (Vec m a). The function concat does the reverse.

toVec :: BitVector 1024 -> Vec 128 (BitVector 8)
toVec = unpack

fromVec :: Vec 128 (BitVector 8) -> BitVector 1024 
fromVec = pack

vecToMat :: Vec 128 (BitVector 8) -> Vec 16 (Vec 8 (BitVector 8)) 
vecToMat = unconcatI

matToVec :: Vec 16 (Vec 8  (BitVector 8)) -> Vec 128 (BitVector 8) 
matToVec = concat

These functions can therefore be used to change the representatation of the round matrix.

AddRoundConstant

The AddRoundConstant transformation performs and XOR of the input with a constant matrix. For different rounds this matrix is slightly different. In mathematical notation, AddRoundConstant is defined as:

In Clash, I will first construct the P and Q matrices before the input is XOR’ed. In both matrices, there is only a single row that is not filled with either all zeros or ones (0xFF). Also, note that all the constants in the P matrix are the complement of the constants in the Q matrix. The constants in the first row of the P matrix start at 0 and are incremented by 0x10 for each element. In Haskell, a list of these constants can be specified based on the start, increment and final value in the list. This list can then be converted to a fixed sized vector type that is supported by Clash.

pc :: Vec 16 (BitVector 8)
pc = $(listToVecTH [0::BitVector 8,0x10..0xF0])

For each element (byte), e, in this vector, the XOR with the round number is calculated. Since there are 14 rounds, a BitVector of size 4 is used to represent this number. The size of this BitVector is extended to 8 bits, to match the size of the constants. Finally, each element is converted into a column vector by adding 7 (d7 as a type-level natural number) replications of 0 for the P matrix. For the Q matrix, each column starts with the 7 replications and ends with the element from the constant vector. The same constant vector is used, of which the complement is calculated first.

pcvR :: BitVector 4 -> BitVector 1024      
pcvR i = pack m
  where
    pr i e = xor i e :> replicate d7 0
    i'     = zeroExtend i :: BitVector 8
    m      = map (pr i') pc

qcvR :: BitVector 4 -> BitVector 1024      
qcvR i = pack m
  where
    qr i e = replicate d7 0xFF :< xor i e
    i'     = zeroExtend i :: BitVector 8
    qc     = map complement pc
    m      = map (qr i') qc

The functions return a BitVector 1024 using the pack function, which flattens the Vectors. Both the input of the round and the constants can now be XOR’ed as one 1024-bit wide XOR-operation.

addCP :: BitVector 4 -> BitVector 1024 -> BitVector 1024
addCP i m = xor (pcvR i) m

addCQ :: BitVector 4 -> BitVector 1024 -> BitVector 1024
addCQ i m = xor (qcvR i) m

SubBytes

The second transformation SubBytes performs a substitution of each byte by another byte. SubBytes is defined as:

where S is the S-box used in AES.

For the Clash implementation of this transformation, I again start a vector of constants. In this case each byte maps to another byte. I therefore use the input byte as an index in a vector that points to the result of the substitution. A 8-bit wide index corresponds to a Vector that contains 256 elements.

aes :: Vec 256 (BitVector 8)
aes = 0x63:> 0x7c:> 0x77:> 0x7b:> 0xf2:> 0x6b:> 0x6f:> 0xc5:> 0x30:> 0x1:> 0x67:> 0x2b:> 0xfe:> 0xd7:> 0xab:> 0x76:> 0xca:> 0x82:> 0xc9:> 0x7d:> 0xfa:> 0x59:> 0x47:> 0xf0:> 0xad:> 0xd4:> 0xa2:> 0xaf:> 0x9c:> 0xa4:> 0x72:> 0xc0:> 0xb7:> 0xfd:> 0x93:> 0x26:> 0x36:> 0x3f:> 0xf7:> 0xcc:> 0x34:> 0xa5:> 0xe5:> 0xf1:> 0x71:> 0xd8:> 0x31:> 0x15:> 0x4:> 0xc7:> 0x23:> 0xc3:> 0x18:> 0x96:> 0x5:> 0x9a:> 0x7:> 0x12:> 0x80:> 0xe2:> 0xeb:> 0x27:> 0xb2:> 0x75:> 0x9:> 0x83:> 0x2c:> 0x1a:> 0x1b:> 0x6e:> 0x5a:> 0xa0:> 0x52:> 0x3b:> 0xd6:> 0xb3:> 0x29:> 0xe3:> 0x2f:> 0x84:> 0x53:> 0xd1:> 0x0:> 0xed:> 0x20:> 0xfc:> 0xb1:> 0x5b:> 0x6a:> 0xcb:> 0xbe:> 0x39:> 0x4a:> 0x4c:> 0x58:> 0xcf:> 0xd0:> 0xef:> 0xaa:> 0xfb:> 0x43:> 0x4d:> 0x33:> 0x85:> 0x45:> 0xf9:> 0x2:> 0x7f:> 0x50:> 0x3c:> 0x9f:> 0xa8:> 0x51:> 0xa3:> 0x40:> 0x8f:> 0x92:> 0x9d:> 0x38:> 0xf5:> 0xbc:> 0xb6:> 0xda:> 0x21:> 0x10:> 0xff:> 0xf3:> 0xd2:> 0xcd:> 0xc:> 0x13:> 0xec:> 0x5f:> 0x97:> 0x44:> 0x17:> 0xc4:> 0xa7:> 0x7e:> 0x3d:> 0x64:> 0x5d:> 0x19:> 0x73:> 0x60:> 0x81:> 0x4f:> 0xdc:> 0x22:> 0x2a:> 0x90:> 0x88:> 0x46:> 0xee:> 0xb8:> 0x14:> 0xde:> 0x5e:> 0xb:> 0xdb:> 0xe0:> 0x32:> 0x3a:> 0xa:> 0x49:> 0x6:> 0x24:> 0x5c:> 0xc2:> 0xd3:> 0xac:> 0x62:> 0x91:> 0x95:> 0xe4:> 0x79:> 0xe7:> 0xc8:> 0x37:> 0x6d:> 0x8d:> 0xd5:> 0x4e:> 0xa9:> 0x6c:> 0x56:> 0xf4:> 0xea:> 0x65:> 0x7a:> 0xae:> 0x8:> 0xba:> 0x78:> 0x25:> 0x2e:> 0x1c:> 0xa6:> 0xb4:> 0xc6:> 0xe8:> 0xdd:> 0x74:> 0x1f:> 0x4b:> 0xbd:> 0x8b:> 0x8a:> 0x70:> 0x3e:> 0xb5:> 0x66:> 0x48:> 0x3:> 0xf6:> 0xe:> 0x61:> 0x35:> 0x57:> 0xb9:> 0x86:> 0xc1:> 0x1d:> 0x9e:> 0xe1:> 0xf8:> 0x98:> 0x11:> 0x69:> 0xd9:> 0x8e:> 0x94:> 0x9b:> 0x1e:> 0x87:> 0xe9:> 0xce:> 0x55:> 0x28:> 0xdf:> 0x8c:> 0xa1:> 0x89:> 0xd:> 0xbf:> 0xe6:> 0x42:> 0x68:> 0x41:> 0x99:> 0x2d:> 0xf:> 0xb0:> 0x54:> 0xbb:> 0x16:> Nil

Using this vector, the substitution is implemented in Clash by mapping the indexing function (!!) over all bytes in the BitVector.

subbytes :: BitVector 1024 -> BitVector 1024
subbytes v = r
  where
    bs = unpack v :: Vec 128 (BitVector 8)
    s  = map (aes !!) bs
    r  = pack s

ShiftBytes

The ShiftBytes transformation applies a different circular shift to the left, to each row in the state matrix. The shift per row is specified in the following lists, which are different for P and Q:

Two vectors are constructed in Clash to represent these shifts.

rhoP :: Num a => Vec 8 a
rhoP = 0 :> 1 :> 2 :> 3 :> 4 :> 5 :> 6 :> 11 :> Nil
rhoQ :: Num a => Vec 8 a
rhoQ = 1 :> 3 :> 5 :> 11 :> 0 :> 2 :> 4 :> 6 :> Nil

In order to apply the shifts to each row, the BitVector needs to be transformed into a matrix, a vector of vectors of bytes. The function unconcat creates a vector of columns; however, the shift must be applied to rows. Therefore, the matrix is transposed before the shift, and transposed back after the shift is applied.

shiftBytes :: Enum i => Vec 8 i -> BitVector 1024 -> BitVector 1024
shiftBytes rho i = pack o
  where
    bytes   = unpack i :: Vec 128 (BitVector 8)
    matrix  = unconcat d8 bytes :: Vec 16 (Vec 8 (BitVector 8))
    matrixT = transpose matrix
    oT      = zipWith rotateLeft matrixT rho
    o       = transpose oT

shiftBytesP :: BitVector 1024 -> BitVector 1024
shiftBytesP = shiftBytes rhoP

shiftBytesQ :: BitVector 1024 -> BitVector 1024
shiftBytesQ = shiftBytes rhoQ

The rotateLeft function implements a dynamic rotation that leads to inefficient hardware, since it does not use knowledge of constant shifts. The function rotateLeftS does generate more efficient hardware. However, we need a workaround in Clash since it is not possible to specify a Vector of SNats that specify the shift per row. In the next blog post this will be addressed.

MixBytes

The final, and most complicated and computationally intensive transformation, is MixBytes. This transformation performs a matrix-matrix multiplication:

However, this is not a regular multiplication, but a multiplication of Galois fiels. Just like for AES, the finite field GF(2^8) is used, using the same irreducible polynomial:

Each term in this polynominal can be expressed as a bit. This results in “0b1_0001_1011” or “0x11B”.

The B-matrix is a circulant matrix, where each row is shifted to the right by a single position relative to the previous row. In short, this matrix is defined as:

I will, however, use its transpose:

Luckily the B matrix contains only 5 distinct values. I will therefore implement a Galois multiplication function that uses pattern matching to distinguish between these 5 values as one of the multiplicants. Addition is easy for these Galois fields, since it’s an XOR-operation. I make use of the fact that . The only multiplication that I therefore need to implement is the times 2; the other multiplications can be rewritten as additions (XOR’s).

The Galois multiplication is a regular multiplication modulo the irreducible polynomial. The multiplication with 2 is implemented using a bit shift. The results of the multiplication is the remainder after the division with the irreducible polynome after the shift. This remainder is obtained by inspecting the most-significant-bit. If this is high, the lower 8 bits of the irreducible polynomial are subtracted from the product, otherwise no subtraction has to be performed. Like addition, subtraction is also defined as an XOR.

prim11B = 0x11B :: BitVector 9

gfMult :: BitVector 8 -> Int -> BitVector 8
gfMult a 1 = a
gfMult a 2 = gfMult2 a
gfMult a 3 = xor a $ gfMult2 a
gfMult a 4 = gfMult2 $ gfMult2 a
gfMult a 5 = xor a . gfMult2  $ gfMult2 a
gfMult a 7 = r
  where
    a2 = gfMult2 a
    a4 = gfMult2 a2
    r  = xor a4 $ xor a2 a

gfMult2 :: BitVector 8 -> BitVector 8
gfMult2 a = r
  where
    s = shiftL a 1
    r | msb a == 1 = xor s $ truncateB prim11B
      | otherwise  = s

A separate gfMult2 function is created, since recursive functions cannot automatically be compiled into hardware.

The matrix-matrix multiplication itself is implemented by multiplying each element with the first column of matrix B. The result is rotated to correct for the circulant nature of the matrix. This creates a 3D structure. After a transpose of the two innermost vectors, a XOR-tree flattens the 3D structure into the final matrix.

mixbytes :: BitVector 1024 -> BitVector 1024
mixbytes v = r
  where
    b = unpack v :: Vec 128 StateByte
    yss = unconcatI b :: Vec 16 (Vec 8 StateByte)
    mxm = gfBxM yss
    r = pack mxm

gfBxM yss = map gfBxV yss
gfBxV ys  = map (fold xor) $ transpose $ smap gfBxEr ys
gfBxE y   = y2 :> y7 :> y5 :> y3 :> y5 :> y4 :> y3 :> y2:> Nil
  where
    y1 = y
    y2 = gfMult y 2
    y4 = gfMult y 4
    y3 = xor y y2
    y5 = xor y y4
    y7 = xor y4 y3 
gfBxEr s y = rotateRightS (gfBxE y) s

Combining all tranformations

Now that all four transformation are described, these can be composed into a single function for the P and Q permutation:

roundP r = mixbytes . subbytes . shiftBytesP . addCP r
roundQ r = mixbytes . subbytes . shiftBytesQ . addCQ r

Multiple rounds

One option to implement multiple (14) rounds is to create a recursive function. In this recursive function the input of the current round is the result of the previous round. The input of round 0 is the input of this permutation function, which internally calculates multiple rounds.

roundsPr :: BitVector 4 -> BitVector 1024 -> BitVector 1024
roundsPr 0 = roundP 0
roundsPr r = roundP r . roundsPr (r-1)

roundsQr :: BitVector 4 -> BitVector 1024 -> BitVector 1024
roundsQr 0 = roundQ 0
roundsQr r = roundQ r . roundsQr (r-1) 

Another options, which can automatically be translated to hardware, uses higher-order functions. The ifoldl function, as shown in the figure below, automatically increments the round number (i) and chains the functions together. The number of rounds is set using a dummy vector of which only the size is used.

ifoldl

roundsPf
  :: (CLog 2 n ~ 4, KnownNat n) =>
     SNat n -> BitVector 1024 -> BitVector 1024
roundsPf d i = ifoldl (\a i b -> roundP (unpack $ pack i) a) i (replicate d undefined)
roundsQf
  :: (CLog 2 n ~ 4, KnownNat n) =>
     SNat n -> BitVector 1024 -> BitVector 1024
roundsQf d i = ifoldl (\a i b -> roundQ (unpack $ pack i) a) i (replicate d undefined)

Top level

The compression function , for a single message block :

and final output:

can now be defined by combining the permutation functions. The initial value, iv512 for this Groestl-hash, is 512.

iv512 = 512 :: BitVector 1024

The final output can be calculated in Clash by combining the recursive function that calculated the permutations for all 14 rounds.

groestlr :: BitVector 1024-> BitVector 512
groestlr m = truncateB o
  where
    mP  = xor iv512 m
    oP  = roundsPr 13 mP
    oQ  = roundsQr 13 m
    h0  = xor iv512 $ xor oP oQ
    oP2 = roundsPr 13 h0
    o   = xor iP2 oP2

Or the fold-based permutation functions can be use to obtain a similar hashing function. This hashing function can be automatically translated into hardware.

groestlf :: BitVector 1024-> BitVector 512
groestlf m = truncateB o
  where
    mP  = xor iv512 m
    oP  = roundsPf d14 mP
    oQ  = roundsQf d14 m
    h0  = xor iv512 $ xor oP oQ
    oP2 = roundsPf d14 h0
    o   = xor iP2 oP2

One minor detail is still missing. Usually the input of a hashing function is not exactly 1024 bits. Therefore, the input must first be padded. The padding function extends the length of the last message block to 128 bytes, e.g. 1024 bits. For cryptocurrency mining, the first hash is calculated of a 80 byte message (640 bits). This results in a hash of 640 bits. The second hash is calculated of the 640 bits intermediate hash. In both cases the input must be extended to 1024 bits first.

Given that the padding function is limited to only support a single message, it can be implemented in Clash by adding a few Vectors of constants. A sufficiently large Vector of 0s ensures that I can always take 120 bytes, independently of the input. Finally, the last 8 bytes are appended, such that the result is a Vector of 128 bytes.

pad :: Vec n (BitVector 8) -> Vec 128 (BitVector 8)
pad i = ov
  where
    blocks  = 1 :: BitVector 64
    blocksV = unpack blocks :: Vec 8 (BitVector 8)
    w       = singleton 0x80
    ip      = i ++ w ++ replicate d120 0
    ip120   = take d120 ip
    ov      = ip120 ++ blocksV

A small helper function is used to pad a BitVector, by first unpacking it into a Vector of BitVectors.

padBitVector :: KnownNat n => BitVector (n * 8) -> BitVector 1024
padBitVector = pack . pad . unpack

Now all functions are implemented in Clash, that together form the hashing algorithm suitable for mining! A one-liner is enough to combine them all.

groestlGroestlF :: BitVector 640 -> BitVector 512
groestlGroestlF = groestlf . padBitVector . groestlf . padBitVector

To test the hashing functions, I use data from the GroestlCoin blockchain, GroestlCoin Block 1971106. The GroestlCoin applies the Groest-hashing function twice to the header of a block. The following Clash code contains the block header, and the expected hash. The comparison at the end tests the calculated hash against the expected hash.

block1971106 :: BitVector 640
block1971106 = 0x00000020f501ca94ae05eea597de359b1c4b02b5f97ab06cd3cbc4a259ca0c0000000000e4143d74891c36407e06dc1a4be40350bd2704510eb8b5eb785212ec38bca986d62b8b5ab47b1c1b187aea67
hashBlock1971106 :: BitVector 256
hashBlock1971106 = 0x00000000001b71007302e17daa16c0b357990778c586def3f1b0f5d0c75694d2

toByteV :: KnownNat n => BitVector (n*8) -> Vec n (BitVector 8)
toByteV a = unpack a

reverseTruncHash :: BitVector 512 -> BitVector 256
reverseTruncHash = pack . reverse . take d32 . toByteV

testHash :: Bool
testHash = hashBlock1971106 == reverseTruncHash (groestlGroestlF block1971106)

It works!

Test bench

Finally, I will create a test bench using Clash that can be used in a hardware simulator. Therefore the magic testInput, expectedOutput and topEntity function need to be defined.

import Clash.Explicit.Testbench

topEntity :: Clock System Source -> Reset System Asynchronous -> BitVector 640 -> Signal System (BitVector 256)
topEntity = exposeClockReset $ pure . reverseTruncHash . groestlGroestlF

testBench :: Signal System Bool
testBench = done
  where
    testInput      = block1971106
    expectedOutput = outputVerifier clk rst (hashBlock1971106  :> Nil)
    done           = expectedOutput (topEntity clk rst testInput)
    clk            = tbSystemClockGen (not <$> done)
    rst            = systemResetGen
runTestBench = sampleN 2 testBench

The complete Clash code can be downloaded from: Groestl.hs. The Clash code, including the test bench, can now be converted into VHDL using the following command:

[email protected]:~$ clash --vhdl Groestl.hs

The resulting hardware implementation is not very efficient. All calculations are performed in a single clock cycle. This results in a huge design that can only run at a low clock frequency.

Conclusion

I hope that this blog post showed that it is easy to convert a mathematical description of an algorithm into synthesizable Clash code. However, the current implementation is not very efficient. In a future blog post, I will address this issue and try to optimize the design to fit on a SoCKit / DE-10-Nano board. Moreover, the goal is to maximize the hashing rate that can be obtained using the board.