"Below you can see an example of a tensor with a (symmetric) dynamic range of $x_{f}$ $[-amax, amax]$ mapped through quantization to a an 8 bit integer, $2^8=256$ discrete values in the interval $[-128, 127]$ (32-bit floating-point can represent ~4B numbers in the interval $[-3.4e38, 3.40e38]$).\n",
"Quantization of floating point numbers can be achieved using the quantization operation\n",
"\n",
...
...
@@ -37,7 +37,7 @@
"\n",
"On FPGA, we do not use int8 quantization, but fixed-point quantization, bu the idea is similar. Fixed-point representation is a way to express fractions with integers and offers more control over precision and range. We can split the $W$-bits making up an integer (in our case $W=8$) to represent the integer part of a number and the fractional part of the number. We usually reserve 1-bit representing the sign of the digit. The radix splits the remaining $W-1$ bits to $I$ most significant bits representing the integer value and $F$ least significant bits representing the fraction. We write this as $<W,I>$, where $F=W-1-I$. Here is an example for an unsigned $<8,3>$:\n",
"This fixed point number corresponds to $2^4\\cdot0+2^3\\cdot0+2^2\\cdot0+2^1\\cdot1+2^0\\cdot0+2^{-1}\\cdot1+2^{-2}\\cdot1+2^{-3}\\cdot0=2.75$.\n",
...
...
%% Cell type:code id:d60b2756 tags:
``` python
# Install some dependencies
%pip install pyarrow hls4ml pyparser
```
%% Cell type:markdown id:bd9c5bc4 tags:
# Quantization aware training with QKeras
Quantization is a powerful way to reduce model memory and resource consumption. In this tutorial, we will use the library QKeras to perform quantization aware training (QAT).
In contrast to Keras, where models are trained in floating-point precision, QKeras quantizes each of the model weights and activation functions during training, allowing the network to adapt to the numerical precision that will eventually be used on hardware.
During the forward pass of the network, each floating-point weight is mapped to one of $2^{bitwidth}$ buckets; which bucket it ends up in is determined by the rounding and clipping scheme.
Below you can see an example of a tensor with a (symmetric) dynamic range of $x_{f} \in [-amax, amax]$ mapped through quantization to an 8-bit integer, i.e. $2^8=256$ discrete values in the interval $[-128, 127]$ (for comparison, 32-bit floating point can represent roughly 4 billion values in the interval $[-3.4\times10^{38}, 3.4\times10^{38}]$).
Quantization of floating point numbers can be achieved using the quantization operation
$$x_{q} = Clip(Round(x_{f}/scale))$$
where $x_{q}$ is the quantized value and $x_{f}$ is the floating-point value. $Round$ applies a rounding scheme to each number and $Clip$ clips outliers that fall outside the $[-128, 127]$ interval. The $scale$ parameter is obtained by dividing the floating-point dynamic range into 256 equal parts.
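As a concrete illustration (a minimal numpy sketch, not part of the tutorial code), using the convention above where the dynamic range $[-amax, amax]$ is divided into 256 equal parts:

``` python
import numpy as np

def quantize_int8(x_f):
    """Symmetric int8 quantization as described above: x_q = Clip(Round(x_f / scale))."""
    amax = np.max(np.abs(x_f))
    scale = 2 * amax / 256  # divide the dynamic range [-amax, amax] into 256 equal parts
    x_q = np.clip(np.round(x_f / scale), -128, 127).astype(np.int8)
    return x_q, scale

x_f = np.random.randn(10).astype(np.float32)
x_q, scale = quantize_int8(x_f)
print(x_q)
print(x_q * scale)  # multiplying back by the scale "dequantizes" to approximate floats
```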
On FPGAs we do not use int8 quantization but fixed-point quantization; the idea, however, is similar. Fixed-point representation is a way to express fractions with integers, and it offers more control over precision and range. We can split the $W$ bits making up an integer (in our case $W=8$) into an integer part and a fractional part. We usually reserve 1 bit to represent the sign. The radix point splits the remaining $W-1$ bits into $I$ most significant bits representing the integer value and $F$ least significant bits representing the fraction. We write this as $<W,I>$, where $F=W-1-I$. Here is an example for an unsigned $<8,5>$ (no sign bit, so $F=W-I=3$):
This fixed-point number corresponds to $2^4\cdot0+2^3\cdot0+2^2\cdot0+2^1\cdot1+2^0\cdot0+2^{-1}\cdot1+2^{-2}\cdot1+2^{-3}\cdot0=2.75$.
The choice of $I$ and $F$ has to be derived as a trade-off between representation range and precision, where $I$ controls the range and $F$ the precision.
In the following we will use a bitwidth of 8 with 0 integer bits. Not counting the sign bit, the smallest number you can represent (the precision) and the largest number (the range) are:
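For example (a small sketch using the bit assignment described above):

``` python
W, I = 8, 0
F = W - 1 - I            # 7 fractional bits after reserving the sign bit
precision = 2 ** -F      # smallest representable increment: 2^-7 = 0.0078125
largest = 1 - 2 ** -F    # largest representable value:      1 - 2^-7 = 0.9921875
print(f"precision = {precision}, largest value = {largest}")
```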
With zero integer bits, the largest number you can represent is just below (but not including) 1. For weights in deep neural networks, being constrained to values smaller than 1 is often a reasonable assumption.
What QKeras (and other QAT libraries) does is to include the quantization error during training, in the following way (a minimal sketch of the idea follows this list):
- "Fake quantize" the floating-point weights and activations during the forward pass: quantize the weights and use them for the layer operations
- Immediately un-quantize the parameters so the rest of the computations take place in floating-point
- During the backward pass, the gradient of the weights is used to update the floating point weight
- The quantization operation gradient (zero or undefined) is handled by passing the gradient through as is ("straight through estimator")
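Schematically, the "fake quantization" with a straight-through estimator can be sketched like this (a hedged TensorFlow illustration of the idea, not QKeras' actual implementation):

``` python
import tensorflow as tf

def fake_quantize(x, bits=8, integer=0):
    """Quantize x to a fixed grid in the forward pass, but pass the gradient straight through."""
    scale = 2.0 ** -(bits - 1 - integer)  # quantization step (1 bit reserved for the sign)
    x_q = tf.clip_by_value(tf.round(x / scale) * scale, -1.0, 1.0 - scale)
    # Forward pass uses x_q; backward pass sees the identity (straight-through estimator)
    return x + tf.stop_gradient(x_q - x)
```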
## Inspect the original model
In the following we will use the QKeras library to add quantizers to our model. First, let's load the baseline model and remind ourselves what the architecture looks like:
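A minimal sketch of what this could look like (the file name `baseline_model.h5` is a placeholder, not necessarily the file used in this tutorial):

``` python
from tensorflow.keras.models import load_model

model = load_model('baseline_model.h5')  # placeholder path for the baseline Keras model
model.summary()
```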
So we have 3 hidden layers with [64,32,32] neurons. We don't see it here, but they are all followed by an "elu" activation. The output is one node activated by a sigmoid activation function.
## Translating to a QKeras QAT model
There are two ways to translate this into a QKeras model that can be trained quantization aware; let's first do it manually:
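Here is a hedged sketch of what the manual translation could look like; the layer sizes follow the architecture above, while the input dimension, layer names and the choice of `quantized_relu` for the activations are illustrative assumptions:

``` python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, quantized_bits

n_inputs = 16  # placeholder input dimension
quantizer = quantized_bits(bits=8, integer=0, symmetric=0, alpha=1)

inputs = Input(shape=(n_inputs,))
x = QDense(64, kernel_quantizer=quantizer, bias_quantizer=quantizer, name='fc1')(inputs)
x = QActivation('quantized_relu(8,0)', name='act1')(x)
x = QDense(32, kernel_quantizer=quantizer, bias_quantizer=quantizer, name='fc2')(x)
x = QActivation('quantized_relu(8,0)', name='act2')(x)
x = QDense(32, kernel_quantizer=quantizer, bias_quantizer=quantizer, name='fc3')(x)
x = QActivation('quantized_relu(8,0)', name='act3')(x)
# The final logit and sigmoid are left unquantized (see below)
outputs = Dense(1, activation='sigmoid', name='output')(x)

qmodel = Model(inputs=inputs, outputs=outputs, name='qkeras_model')
qmodel.summary()
```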
The magic happens in ```quantized_bits``` (see implementation [here](https://github.com/google/qkeras/blob/master/qkeras/quantizers.py#L1245)), where the parameters are the following:
- ```bits```: The bitwidth, allowing you to have $2^{bits}$ unique values for each weight parameter
- ```integer```: How many of these bits are integer bits, in this case zero. All bits (apart from the sign bit) are used to represent the fractional part of the weight parameter, with no bits dedicated to representing whole numbers. This forces the value to be between -1 and 1. For DNNs this can be useful because the focus is entirely on the precision of the fraction rather than the magnitude of the number. Question: would this also work on the output node if your algorithm is a regression of the jet mass?
- ```symmetric```: Should the values be symmetric around 0? In this case it doesn't have to be.
- ```alpha```: With $2^W$ unique values available, we could let them span $[-2^{W-1}, 2^{W-1}-1]$ (times the quantization step) like above, but we can also scale all of them by a factor $\alpha$, i.e. $[-2^{W-1}\alpha,\,(2^{W-1}-1)\alpha]$. ```alpha``` is thus a scaling of the weights. Enabling this often leads to improved performance, but it doesn't play so nicely with hls4ml, so we recommend leaving it at 1 (or get ready for some debugging)
Having added this, QKeras will automatically apply fake quantization for us during the forward pass, accounting for the quantization error and returning a network that is optimized for the precision you plan on using in hardware.
Another thing to notice is that we leave the sigmoid and the final output logit unquantized. This is where we want the values to be very accurate, and quantizing them is not going to save us a lot of resources.
### Automatic model quantization through config
You can also set the quantization for the full model using a model configuration. Sometimes this can be faster if you're using the same quantizer for all layers of the same type. You don't have to use this for this tutorial, since we already have a model, but we leave it here as an example:
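For illustration, a hedged sketch using `qkeras.utils.model_quantize`; the exact quantizer configuration shown here is an assumption, not the one used in this tutorial:

``` python
from qkeras.utils import model_quantize

# Quantizer configuration per layer type (could also be keyed by layer name)
quantizer_config = {
    "QDense": {
        "kernel_quantizer": "quantized_bits(8,0,alpha=1)",
        "bias_quantizer": "quantized_bits(8,0,alpha=1)",
    },
    "QActivation": {"relu": "quantized_relu(8,0)"},
}

# `model` is the floating-point baseline; activation_bits applies to activations
# not explicitly listed in the configuration above
qmodel_auto = model_quantize(model, quantizer_config, activation_bits=8, transfer_weights=True)
```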
But be careful that activation functions like softmax/sigmoid, and perhaps logit layers that you want to keep at full precision, do not get quantized!
%% Cell type:markdown id:e59eea22 tags:
## But how many bits?
So now we know how to quantize our models, but how do we know which precision to choose?
Finding the best number of bits and integer bits to use is non-trivial, and there are two ways we recommend:
- The easiest strategy is to scan over the possible bit widths from binary up to some maximum value and choose the smallest one that still has acceptable accuracy, and this is what we often do.
Code for how to do this can be found [here](https://github.com/thesps/keras-training/blob/qkeras/train/train_scan_models.py#L16); a minimal sketch is also included after this list, and the result of such a scan is illustrated below.
For binary and ternary quantization, we use the special ```binary(alpha=1.0)(x)``` and ```ternary(alpha=1.0)(x)``` quantizers.
<imgsrc="images/quant_scan.png"width="400"/>
- Another thing you can do is to use our library for automatic quantization, [AutoQKeras](https://github.com/google/qkeras/blob/master/notebook/AutoQKeras.ipynb), to find the optimal quantization for each layer. This runs hyperparameter optimisation over quantizers/nodes/filters simultaneously. An example can be found at the end of [this notebook](https://github.com/fastmachinelearning/hls4ml-tutorial/blob/main/part6_cnns.ipynb), "Bonus exercise: Automatic quantization with AutoQKeras".
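For reference, a minimal sketch of such a bit-width scan; `build_model(bits)` is an assumed helper that constructs the QKeras model above with `quantized_bits(bits, 0, alpha=1)` quantizers, and the training arrays are placeholders:

``` python
# Hedged sketch of a precision scan: train the same architecture at several bit widths
# and keep track of the best validation accuracy for each
scan_results = {}
for bits in [2, 3, 4, 6, 8, 10, 12, 14, 16]:
    qmodel_b = build_model(bits)  # assumed helper building the QKeras model at this bitwidth
    qmodel_b.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = qmodel_b.fit(X_train, y_train, validation_data=(X_val, y_val),
                           epochs=20, batch_size=1024, verbose=0)
    scan_results[bits] = max(history.history['val_accuracy'])
print(scan_results)
```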
%% Cell type:markdown id:17da0954 tags:
## Pruning
Besides reducing the numerical precision of all the weights, biases and activations, I also want to remove neurons and synapses that do not contribute much to the network's overall decision. We do that by pruning; let's remove 50% of the weights (sparsity=0.5):
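A hedged sketch of the pruning setup using the TensorFlow Model Optimization toolkit; the schedule parameters are illustrative and `qmodel` is the QKeras model from above:

``` python
import tensorflow_model_optimization as tfmot

# Prune 50% of the weights, keeping the sparsity constant during training
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.5, begin_step=0, frequency=100
    )
}
pruned_qmodel = tfmot.sparsity.keras.prune_low_magnitude(qmodel, **pruning_params)
```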
Great, we now have our model ready to be trained! There is one last important thing we have to think about and that is the *precision of the input*. In the L1T, all of the inputs are quantized. For instance, the precision used for the GT is listed [here](https://github.com/cms-l1-globaltrigger/mp7_ugt_legacy/blob/master/doc/scales_inputs_2_ugt/pdf/scales_inputs_2_ugt.pdf).
Ideally, when you train your network, you use the hardware values that the algorithm will actually receive when running inference in the trigger.
We saw, however, that the inputs were all scaled to have a mean of zero and variance of one in the previous exercise. That means that the new optimal precision for the inputs has changed and you need to define what the precision will be. Here we will do it by inspection and intuition, and use the same precision for all of the input features. Let's now load and scale the data, and look at the input value distribution:
In this case, the values seem to be below 50, so let's assume 6 integer bits ($2^6=64$). The number of fractional bits will define our precision, and will affect the network performance. Let's assume 10 is sufficient (the smallest increment we can represent is $2^{-10}=0.0009765625$). To make our network adapt to this input precision, we need to "treat" our training and testing set with a quantizer to go from FP32 $\rightarrow <16,6>$:
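A hedged sketch of this input "treatment"; the array names are placeholders, and note that QKeras counts the sign bit separately from the integer bits:

``` python
import numpy as np
from qkeras import quantized_bits

# NOTE: depending on the convention, quantized_bits(16, 6) corresponds to ap_fixed<16,7>
# because QKeras does not count the sign bit in `integer`; adjust if you want exactly <16,6>
input_quantizer = quantized_bits(bits=16, integer=6, symmetric=0, alpha=1)

# Quantize the standardized inputs from FP32 to fixed point (array names are placeholders)
qX_train = np.array(input_quantizer(X_train.astype(np.float32)))
qX_test = np.array(input_quantizer(X_test.astype(np.float32)))
```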
%% Cell type:code tags:
``` python
# Plot the distribution of the (quantized) input features; the histogramming of the
# individual features is assumed to happen just above these lines
import matplotlib.pyplot as plt
plt.xlabel('Quantized Feature Value (standardized)')
plt.ylabel('Frequency')
plt.title('Stacked Histogram of all features')
plt.legend(loc='upper right', ncol=2)
plt.semilogy()
plt.show()
```
%% Cell type:markdown id:3e0e7438 tags:
The quantized distribution looks similar to the original, but we cannot really say how much performance we lose before training with different input precisions.
Phew, okay, finally time to train. For this part there are two things to note: you need to add a pruning callback, and you might need to adjust the learning rate (e.g. add a learning rate decay). Let's train!
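A hedged sketch of such a training setup; the optimizer settings, epochs, batch size and array names are illustrative assumptions:

``` python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

pruned_qmodel.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-3),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),                      # required when training a pruned model
    tf.keras.callbacks.ReduceLROnPlateau(patience=5, factor=0.5),  # simple learning-rate decay
]

history = pruned_qmodel.fit(
    qX_train, y_train,
    validation_split=0.2,
    epochs=30, batch_size=1024,
    callbacks=callbacks,
)
```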
So it seems that despite having reduced the numerical precision of the model and the input, as well as removing 50% of the model weights, we're doing pretty well! This can be tuned to get even better, by carefully adjusting the input precision and the model precision, especially by increasing the precision of the logit layer.
# Generating firmware with
<imgsrc="images/hls4ml_logo.png"width="400"/>
Time to translate this model into HLS (which we will integrate into the emulator) and use it to generate the VHDL to be integrated into the trigger firmware.
hls4ml talks seamlessly to QKeras, making the job way easier for us, but there is still some work to do to make sure we get good hardware model accuracy. Let's start!
There are a few things I already know in advance and would like my model to include:
- Be executed fully in parallel (=unrolled) to reach the lowest possible latency. We set ReuseFactor=1 and Strategy=Latency
- The correct input precision
- The correct model output (that's something you have to figure out yourself!)
- Use "correct" precisions to make sure the hardware model performs the same as the software one. QKeras handles weights/biases and activation functions for us, but there are a few other parameters that need to be set by hand
For the final point, have a look at the following diagram:
Whereas the $weight$ and $bias$ precisions are set to their optimal values from the QKeras model, the accumulator $accum$ and the $result$ precisions are set to default values that might not be optimal for a given model and might need tuning. Let's do a first attempt and compare the ROC curves:
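A hedged sketch of how these hand-set precisions can be specified in the hls4ml configuration; layer names like `fc1` and `input_1` are placeholders for the names in your model:

``` python
import hls4ml
from tensorflow_model_optimization.sparsity.keras import strip_pruning

# Strip the pruning wrappers before conversion, then build a per-layer config
model_for_hls = strip_pruning(pruned_qmodel)
config = hls4ml.utils.config_from_keras_model(model_for_hls, granularity='name')

config['Model']['ReuseFactor'] = 1        # fully unrolled, lowest latency
config['Model']['Strategy'] = 'Latency'

# Input precision matching the <16,6> used for the data (placeholder input-layer name)
config['LayerName']['input_1']['Precision']['result'] = 'ap_fixed<16,6>'
# Accumulator/result precisions that may need tuning (placeholder layer name)
config['LayerName']['fc1']['Precision']['accum'] = 'ap_fixed<16,6>'
config['LayerName']['fc1']['Precision']['result'] = 'ap_fixed<16,6>'
```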
%% Cell type:code tags:
``` python
# Hedged reconstruction: the start of the original cell is not shown; `model_for_hls` and `config` come from the sketch above
hls_model = hls4ml.converters.convert_from_keras_model(
    model_for_hls,
    hls_config=config,
    output_dir='L1TMLDemo_v1',
    part='xcu250-figd2104-2L-e',   # Target FPGA; ideally you would use the VU9P/VU13P that we use in the L1T, but they are not installed at lxplus, and this one is close enough
    clock_period=2.5,              # Target clock period, 1/2.5 ns = 400 MHz
    input_data_tb='qX_test.npy',   # For co-simulation
    output_data_tb='qy_test.npy',  # For co-simulation
)
hls_model.compile()
```
%% Cell type:markdown id:9e00bf3f tags:
First, what does our newly created model look like?
Here you can see that the precision is what we set it to be in QKeras as well as what we set manually in the config. One thing to note is the different definitions used in QKeras and in ap_fixed:
- ```quantized_bits(8,0) -> ap_fixed<8,1>```
- ```quantized_relu(8,0) -> ap_ufixed<8,0>```
Also, you can see that the default value for result/accum is set to ```<16,6>```. This can also be tuned to more optimal values.
%% Cell type:code id:660f657e tags:
``` python
# Let's also run predict on the C++ implementation of our model and make sure it's the ~same as for the QKeras model:
# This is very slow for the C++ implementation of our model, so let's only do 1000 events
# NOTE: hedged reconstruction; the original predict code is not shown, and `model_for_hls`/`qX_test` are assumed from earlier cells
y_qkeras = model_for_hls.predict(qX_test[:1000])
y_hls = hls_model.predict(np.ascontiguousarray(qX_test[:1000]))
```
%% Cell type:markdown tags:
Oh! That was easier than expected. If you see the accuracies differing significantly, it's a good idea to look into the accumulator and result precisions. Tools like Trace and Profiling, which you can learn about in the [official hls4ml tutorial](https://github.com/fastmachinelearning/hls4ml-tutorial/blob/main/part2_advanced_config.ipynb), can also be helpful! In this case it doesn't seem necessary. Now let's build it! Let's run C-synthesis (C++ to register-transfer level), Vivado logic synthesis (gate-level representation) and co-simulation (send test vectors through the design for an exhaustive functional test of the implemented logic).
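A hedged sketch of the build step; the flags below correspond to C-synthesis, co-simulation and Vivado logic synthesis (this can take a while):

``` python
# The returned dictionary contains the parsed reports used below
report = hls_model.build(csim=False, synth=True, cosim=True, vsynth=True)
```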
Now let's look at the reports! The latency can be extracted from the C-synthesis report and validated with the co-simulation report (where actual data is sent through the logic). The resource consumption can be extracted from the implementation report (Vivado logic synthesis) and is more accurate than what is quoted in the C-synthesis report:
%% Cell type:code id:dbc8b9f2 tags:
``` python
print("\nC synthesis report (latency estimate):")
print_dict(report["CSynthesisReport"])
# print_dict(report["CosimReport"])  # Not working due to missing libc header sys/cdefs.h :(
```
%% Cell type:markdown tags:
A latency of $2.5\cdot16=40$ ns, that is not bad! The network is also using very few resources: 8k out of 1728k LUTs and 15 out of 12k DSPs, i.e. less than 1% of the total available resources. We now have a set of HLS files that will be integrated into the CMSSW emulator (```L1TMLDemo_v1/firmware/```) and VHDL that will be integrated into the uGT firmware (```L1TMLDemo_v1/myproject_prj/solution1/impl/vhdl/```). That's next!