# Groestl on FPGA using CLaSH

In this blog post, I will describe the development of a hashing algorithm in CλaSH. Using CLaSH, a functional description of a circuit in Haskell can be compiled into hardware (VHDL, Verilog). Why CLaSH? This tool is developed at the CAES-chair at the University of Twente, and resulted in a spin-off company QBayLogic. At the Unviversity of Twente, I did my masters assignment and PhD. During that time, I’ve heard a lot of positive stories from colleagues who used CLaSH. The only thing that was missing for me, was a project to use CLaSH in. My recent interest in cryptocurrencies changed that. Let’s implement a hashing algorithm in CLaSH, and start mining one of the cryptocurrencies on an FPGA!

In this first post, I focus on the basics of CLaSH, without optimizing the design for an implementation on an FPGA. It should, however, result in synthesizable VHDL / Verilog code. I first introduce the mathematics behind a hashing algorithm, and converting the mathematical description step-by-step into synthesizable Haskell code. A follow-up post will go into more details about the optimization for an actual FPGA implementation on a SoCKit / DE-10-Nano board.

For a more detailed introduction of CLaSH, have a look at the blog post of a former colleague. I will assume you have a basic knowledge about the syntax of Haskell. However, don’t be afraid to read on without the knowledge; I also started using CLaSH without prior knowledge of Haskell.

**Edit:** this post is now updated for the recent release of Clash-0.99.

## Cryptocurrency mining

Mining is finding a hash of a block that satisfies a certain difficulty. The hash is calculated of the header of a block, where the miner can freely choose the value of a number of bytes within that header: the **nonce**. The miner keeps on trying different values for the nonce until it find a hash that satisfies the difficulty. The difficulty can be converted into a target hash. The calculated hash should be a smaller value than this target hash. This basically means is that the hash should contains a number of leading ‘0’-bits.

The following Python-code demonstrates the main loop of a mining algorithm:

This examples shows that there are actually two hashes that have to be calculated for each nonce. The first one consists of the concatenation of the header and nonce, together 80 bytes. A second hash is calculated of the 64 bytes (512 bits) result. This second hash is then compared to the target hash.

## Groestl hashing algorithm

The hashing algorithm I’m implementing is Grøstl. Grøstl is one of the finalists for the SHA-3. Moreover, it is one of the least used hashing algorithms for cryptocurrencies, such that there are less existing FPGA implementation available. A clear mathematical description of the Grøstl specification is available of their website, as well as an Grøstl implementation guide. The design of the hashing algorithm is inspired by the Rijndael block cipher, AES.

### Hash function

The specification defines hash functions with different input and output sizes. For mining purposes, only the largest one is of interest, Grøstl-512, which returns a 512-bit hash. The hashing function compresses a message stream () into this 512-bit output. For this 512-bit output, the input is split into blocks of 1024 bits, where the last block is padded to this same length. A simplification is made here specifically for mining, since the message can always be contained in a single block.

The compression function is specified as follows:

It consists of two 1024-bit permutations P and Q and a few XOR-operations (). After the compression function processes all message blocks (a single one for mining), a final P permutations is applied, another XOR-operation, and a truncation to 512-bit, to end up with the final hash:

### Permutation round

The permutation functions P and Q, consist of a number of round transformations. In Grøstl-512, 14 rounds of these transformations are applied. Each of these round consists of 4 transformations:

- AddRoundConstant
- SubBytes
- ShiftBytes
- MixBytes

A round *R* is composed for these four functions:

Such a function composition can directly be expressed in CLaSH:

#### Round matrix

The input of each round is a vector of 128 bytes (1024 bits). A sequence of bytes is mapped to a 8x16 matrix of bytes as follows:

In CλaSH, the data type *BitVector n* and *Vec n a* are defined, which can be used to represent data structures. The size of the fixed-size BitVector is specified by *n*. A BitVector directly translates to a *std_logic_vector* of the same size in VHDL. The *Vec* type specifies a fixed-length vector of type *a* and size *n*.

Conversion functions are defined in CλaSH that can convert a large BitVector into a Vec of smaller BitVectors and vice versa. As an example, *pack* can convert a *BitVector 1024* into a *Vec 128 (BitVector 8)*. The reverse is possible using the *unpack* function. Moreover, the function *unconcat* can split a Vec (n * m) a into Vec n (Vec m a).
The function *concat* does the reverse.

These functions can therefore be used to change the representatation of the round matrix.

#### AddRoundConstant

The AddRoundConstant transformation performs and XOR of the input with a constant matrix. For different rounds this matrix is slightly different. In mathematical notation, AddRoundConstant is defined as:

In Clash, I will first construct the P and Q matrices before the input is XOR’ed. In both matrices, there is only a single row that is not filled with either all zeros or ones (0xFF). Also, note that all the constants in the P matrix are the complement of the constants in the Q matrix. The constants in the first row of the P matrix start at 0 and are incremented by 0x10 for each element. In Haskell, a list of these constants can be specified based on the start, increment and final value in the list. This list can then be converted to a fixed sized vector type that is supported by Clash.

For each element (byte), *e*, in this vector, the XOR with the round number is calculated. Since there are 14 rounds, a BitVector of size 4 is used to represent this number. The size of this BitVector is extended to 8 bits, to match the size of the constants. Finally, each element is converted into a column vector by adding 7 (d7 as a type-level natural number) replications of 0 for the P matrix. For the Q matrix, each column starts with the 7 replications and ends with the element from the constant vector. The same constant vector is used, of which the complement is calculated first.

The functions return a BitVector 1024 using the pack function, which flattens the Vectors. Both the input of the round and the constants can now be XOR’ed as one 1024-bit wide XOR-operation.

#### SubBytes

The second transformation *SubBytes* performs a substitution of each byte by another byte. SubBytes is defined as:

where S is the S-box used in AES.

For the Clash implementation of this transformation, I again start a vector of constants. In this case each byte maps to another byte. I therefore use the input byte as an index in a vector that points to the result of the substitution. A 8-bit wide index corresponds to a Vector that contains 256 elements.

Using this vector, the substitution is implemented in Clash by mapping the indexing function (!!) over all bytes in the BitVector.

#### ShiftBytes

The *ShiftBytes* transformation applies a different circular shift to the left, to each row in the state matrix. The shift per row is specified in the following lists, which are different for P and Q:

Two vectors are constructed in Clash to represent these shifts.

In order to apply the shifts to each row, the BitVector needs to be transformed into a matrix, a vector of vectors of bytes. The function *unconcat* creates a vector of columns; however, the shift must be applied to rows. Therefore, the matrix is transposed before the shift, and transposed back after the shift is applied.

The *rotateLeft* function implements a dynamic rotation that leads to inefficient hardware, since it does not use knowledge of constant shifts. The function *rotateLeftS* does generate more efficient hardware. However, we need a workaround in Clash since it is not possible to specify a Vector of SNats that specify the shift per row. In the next blog post this will be addressed.

#### MixBytes

The final, and most complicated and computationally intensive transformation, is *MixBytes*. This transformation performs a matrix-matrix multiplication:

However, this is not a regular multiplication, but a multiplication of Galois fiels. Just like for AES, the finite field GF(2^8) is used, using the same irreducible polynomial:

Each term in this polynominal can be expressed as a bit. This results in “0b1_0001_1011” or “0x11B”.

The B-matrix is a *circulant matrix*, where each row is shifted to the right by a single position relative to the previous row. In short, this matrix is defined as:

I will, however, use its transpose:

Luckily the B matrix contains only 5 distinct values. I will therefore implement a Galois multiplication function that uses pattern matching to distinguish between these 5 values as one of the multiplicants. Addition is easy for these Galois fields, since it’s an XOR-operation. I make use of the fact that . The only multiplication that I therefore need to implement is the times 2; the other multiplications can be rewritten as additions (XOR’s).

The Galois multiplication is a regular multiplication modulo the irreducible polynomial. The multiplication with 2 is implemented using a bit shift. The results of the multiplication is the remainder after the division with the irreducible polynome after the shift. This remainder is obtained by inspecting the most-significant-bit. If this is high, the lower 8 bits of the irreducible polynomial are subtracted from the product, otherwise no subtraction has to be performed. Like addition, subtraction is also defined as an XOR.

A separate *gfMult2* function is created, since recursive functions cannot automatically be compiled into hardware.

The matrix-matrix multiplication itself is implemented by multiplying each element with the first column of matrix B. The result is rotated to correct for the circulant nature of the matrix. This creates a 3D structure. After a transpose of the two innermost vectors, a XOR-tree flattens the 3D structure into the final matrix.

### Combining all tranformations

Now that all four transformation are described, these can be composed into a single function for the P and Q permutation:

### Multiple rounds

One option to implement multiple (14) rounds is to create a recursive function. In this recursive function the input of the current round is the result of the previous round. The input of round 0 is the input of this permutation function, which internally calculates multiple rounds.

Another options, which can automatically be translated to hardware, uses higher-order functions. The *ifoldl* function, as shown in the figure below, automatically increments the round number (i) and chains the functions together. The number of rounds is set using a dummy vector of which only the size is used.

### Top level

The compression function , for a single message block :

and final output:

can now be defined by combining the permutation functions. The initial value, *iv512* for this Groestl-hash, is 512.

The final output can be calculated in Clash by combining the recursive function that calculated the permutations for all 14 rounds.

Or the *fold*-based permutation functions can be use to obtain a similar hashing function. This hashing function can be automatically translated into hardware.

One minor detail is still missing. Usually the input of a hashing function is not exactly 1024 bits. Therefore, the input must first be padded. The padding function extends the length of the last message block to 128 bytes, e.g. 1024 bits. For cryptocurrency mining, the first hash is calculated of a 80 byte message (640 bits). This results in a hash of 640 bits. The second hash is calculated of the 640 bits intermediate hash. In both cases the input must be extended to 1024 bits first.

Given that the padding function is limited to only support a single message, it can be implemented in Clash by adding a few Vectors of constants. A sufficiently large Vector of 0s ensures that I can always *take* 120 bytes, independently of the input. Finally, the last 8 bytes are appended, such that the result is a Vector of 128 bytes.

A small helper function is used to pad a BitVector, by first unpacking it into a Vector of BitVectors.

Now all functions are implemented in Clash, that together form the hashing algorithm suitable for mining! A one-liner is enough to combine them all.

To test the hashing functions, I use data from the GroestlCoin blockchain, GroestlCoin Block 1971106. The GroestlCoin applies the Groest-hashing function twice to the header of a block. The following Clash code contains the block header, and the expected hash. The comparison at the end tests the calculated hash against the expected hash.

*It works!*

### Test bench

Finally, I will create a test bench using Clash that can be used in a hardware simulator. Therefore the **magic** *testInput*, *expectedOutput* and *topEntity* function need to be defined.

The complete Clash code can be downloaded from: Groestl.hs. The Clash code, including the test bench, can now be converted into VHDL using the following command:

```
[email protected]:~$ clash --vhdl Groestl.hs
```

The resulting hardware implementation is not very efficient. All calculations are performed in a single clock cycle. This results in a huge design that can only run at a low clock frequency.

## Conclusion

I hope that this blog post showed that it is easy to convert a mathematical description of an algorithm into synthesizable Clash code. However, the current implementation is not very efficient. In a future blog post, I will address this issue and try to optimize the design to fit on a SoCKit / DE-10-Nano board. Moreover, the goal is to maximize the hashing rate that can be obtained using the board.