## Fixed-Point Made Easy: A Guide for Newcomers and Seasoned Engineers

Dan Boschen - Duration: 36:56

Fixed-point implementation is popular for lowest-power, lowest-cost solutions when it is critical to make the most of limited computing resources. However, the jargon and rules can be overwhelming to newcomers and seasoned engineers alike.

In this theatre talk, Dan will guide you through the common representations and rules for working with binary fixed point. This will include the Q notation for fractional number representation, two's complement, signed and unsigned numbers, considerations for truncation, rounding and overflow, and easy-to-follow rules for binary arithmetic. There will be plenty of fun examples to demonstrate the key concepts and practical use of the methodologies. If you are new to fixed-point or rusty and would like a refresher, this talk is for you! It is particularly suited to anyone who needs a recap on fixed-point and is interested in attending Dan's talk "Fixed-Point Filters - Modelling and Verification Using Python".

Even those exposed to fixed point in the past will appreciate this work-out session to quickly get back in top fixed-point shape!

**johned**

**DBoschen** (Speaker)

Thanks Johned! Sounds interesting. Could you elaborate with an example? I am not sure I am following it yet…

**johned**

Sure thing, Dan,

If we look at your negation of 18 and write down all the bits from the LSB up to and including the first '1', we have: 10

Now negate all of the rest: 000100

Combine the two: 00010010

PS Glad to see you're using the correct Q format (Arm, which used to be called AMD format) :-)
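
The trick in this example can be checked against the usual "flip all the bits and add 1" rule with a short Python sketch. The function names and the 8-bit default width are assumptions made for illustration; they are not taken from the talk.

```python
def negate_flip_add_one(x: int, width: int = 8) -> int:
    """Textbook method: invert every bit, then add 1 (modulo 2**width)."""
    mask = (1 << width) - 1
    return ((x ^ mask) + 1) & mask


def negate_copy_then_invert(x: int, width: int = 8) -> int:
    """Keep the bits from the LSB up to and including the first '1',
    then invert every bit above that position."""
    mask = (1 << width) - 1
    x &= mask
    if x == 0:
        return 0  # zero has no '1' bit and is its own negation
    lowest_one = x & -x            # isolates the lowest set bit
    keep = (lowest_one << 1) - 1   # mask covering the LSB through the first '1'
    return (x & keep) | (~x & mask & ~keep)


minus_18 = 0b11101110  # -18 as an 8-bit two's-complement pattern
print(format(negate_copy_then_invert(minus_18), "08b"))  # 00010010
```

Both functions agree for every 8-bit pattern, including the corner cases 0 and 10000000.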

**DBoschen** (Speaker)

Ah I see! That is very nice, thanks for sharing that.

**RickLyons**

Hi Dan. This terrific presentation of yours reminds me of a quote from English poet Geoffrey Chaucer [1340-1400], "Gladly would he learn, and gladly teach."

**DBoschen** (Speaker)

Ha! Thanks for watching Rick. You hit it on the head. I have such good memories and learned a lot from going through the 2nd order resonator in such detail with you, starting with your great write up reminding us of the interesting quantization patterns for the poles in such structures and the Coupled Form: https://www.dsprelated.com/showarticle/183.php

**nathancharlesjones**

Are you able to recommend any good fixed-point math libraries, Dan? I've played around with fixed-point before a little and I feel like I could make +,-,*,/ functions but I'd get stuck if I needed to do anything more complex like sin or sqrt.

**DBoschen** (Speaker)

I recommend fpbinary for Python, which I cover in detail in my other workshop. I believe they have plans for adding trig and sqrt functions. However, my concern for verification is whether the algorithm used exactly matches the implementation; for that I would typically model the particular implementation when it is known (LUT, etc.).

**nathancharlesjones**

What about on an embedded controller? Any C/C++ libraries for fixed-point math/DSP?

**DBoschen** (Speaker)

I’ve come across this but have no experience using it: https://github.com/deftio/fr_math

**nathancharlesjones**

Thanks! I'll check it out.

**rokath**

Great in-depth explanation, Dan!

Just let me add how one can handle a calculation like this one without floating point or division:

`temp = (( 5281 * ADCRaw) >> 16) - 50; // r=(adc* 3300 /4095 - 500 )/10 = adc* 3300/40950 - 50;`

The 12-bit ADC value needs to be multiplied by 0.80586... (3300/4095), and as we know division is costly. But you can turn it into a multiply and a shift if you know the divisor at compile time, here 40950. A calculator gives: (3300/40950) * 2^16 = 5281.289... ≈ 5281.

The max ADC value is 4095, so 5281 * 4095 = 21,625,695 -> no 32-bit overflow possible. To decrease the error further, a 22-bit shift is even better. This allows a really fast computation at runtime. Hope that helps someone.
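
A minimal Python sketch of the multiply-and-shift idea, useful for checking the multiplier, the 32-bit headroom, and the error for each shift before committing the constants to firmware. The helper names are hypothetical, and the reference used for comparison (a floored integer division) is an assumption about what "exact" means here.

```python
ADC_MAX = 4095  # full scale of the 12-bit ADC


def multiplier(shift: int) -> int:
    """Nearest integer to (3300 / 40950) * 2**shift."""
    return round(3300 * (1 << shift) / 40950)


def temp_fixed(adc: int, shift: int = 16) -> int:
    """Multiply-and-shift form of the expression above."""
    return ((multiplier(shift) * adc) >> shift) - 50


def temp_reference(adc: int) -> int:
    """Same quantity using a floored integer division, for comparison."""
    return (adc * 3300) // 40950 - 50


def worst_case_error(shift: int) -> int:
    """Largest difference from the reference over the whole ADC range."""
    return max(abs(temp_fixed(a, shift) - temp_reference(a))
               for a in range(ADC_MAX + 1))


for shift in (16, 22):
    k = multiplier(shift)
    fits = k * ADC_MAX < 2**31  # largest product must fit a signed 32-bit word
    print(f"shift={shift} k={k} fits_int32={fits} worst_error={worst_case_error(shift)}")
```

Against this reference, the 16-bit version is off by one count for some inputs (for example adc = 273), while the 22-bit multiplier 338003 matches it over the full range and its largest product still fits in a signed 32-bit word.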

**DBoschen** (Speaker)

Thank you Rokath!

**RemingtonFurman**

Thanks! The negative weight representation for the sign bit is very useful, and I'm surprised I haven't understood it that way before.
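
The negative-weight view can be written as a short Python sketch: every bit contributes its usual power of two, except the MSB, which contributes a negative one. The function name and the `frac_bits` parameter are assumptions made for illustration.

```python
def twos_complement_value(bits: str, frac_bits: int = 0) -> float:
    """Sum the bit weights, giving the MSB (sign bit) a negative weight."""
    n = len(bits)
    total = 0
    for i, b in enumerate(bits):
        weight = 1 << (n - 1 - i)   # 2**position, with the LSB at position 0
        if i == 0:
            weight = -weight        # the sign bit weighs -2**(n-1)
        total += int(b) * weight
    return total / (1 << frac_bits)  # place the binary point


print(twos_complement_value("11101110"))               # -18.0
print(twos_complement_value("11101110", frac_bits=1))  # -9.0
```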

The different "TI" vs "ARM" Q notations are unfortunate. I like the "ARM" notation more too, because it's explicit with how many bits are being used. On a microcontroller it's often going to be a multiple of 8, so if you see "Q3.4" you know you're looking at "TI" notation. But on an FPGA or other chip that might not be immediately clear. Also, after seeing negative m and n values, I agree that Q(m,n) is a better notation.

I'm a fan of Randy Yates' fixed point documents, and it looks like he just made an update to "Fixed-Point Arithmetic: An Introduction" four days ago on April 21, 2023, though the revision history doesn't say what changed.

Another neat fixed-point math fact is that overflowing a fixed point value during a computation doesn't matter, as long as later operations in the computation bring the result back in range before the final result. Your binary wheel diagram makes that much easier to visualize.
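
A small Python sketch of that overflow property, assuming wrapping (modular) two's-complement arithmetic rather than saturation. The `wrap` and `accumulate` helpers are hypothetical names.

```python
def wrap(value: int, width: int = 8) -> int:
    """Wrap an integer into the signed two's-complement range of `width` bits."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def accumulate(terms, width: int = 8) -> int:
    """Add terms one at a time, wrapping after every addition."""
    acc = 0
    for t in terms:
        acc = wrap(acc + t, width)
    return acc


print(wrap(100 + 100))               # -56: the intermediate sum overflows 8 bits
print(accumulate([100, 100, -120]))  # 80: the final result is still correct
```

This works because each wrap only discards multiples of 2^8, so the final value is correct whenever the true result fits in the word. With saturating arithmetic the intermediate clipping would corrupt the result.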

**DBoschen** (Speaker)

Thanks for the good comments Remington. I believe Randy simply uploaded his "PA10" version from July 3, 2021 as listed in the Revision table. The date at the top is the print date, but the Rev matches the table. However this was a significant addition as my "PA9" version was only 1111 pages long and now it stands at a whopping 11001 pages!

**DBoschen** (Speaker)

In the recorded presentation I mention 2's complement as "Flip All the Bits and add 1 LSB". This would be clearer as "Flip All the Bits and Add 1 LSB Weight" (or "Flip All the Bits and Add 1 Bit"), which is what is now shown on the downloadable PDF. I wanted to be explicit since "Flip all the bits and add 1", as often stated, can be confusing. To convert negative numbers to and from 2's complement, we add the weight given to the LSB in the Q representation we are using (e.g., for Q15.1: flip all the bits and add 0.5).
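
The Q15.1 case can be sketched in Python as follows, assuming the Arm convention (16 bits in total, 1 fractional bit, so the LSB weighs 0.5). The helper names are hypothetical.

```python
FRAC_BITS = 1                # Q15.1, Arm convention: 16 bits, 1 fractional bit
WIDTH = 16
MASK = (1 << WIDTH) - 1
LSB = 2.0 ** -FRAC_BITS      # weight of the least significant bit: 0.5


def to_bits(value: float) -> int:
    """Real value -> 16-bit two's-complement bit pattern."""
    return round(value / LSB) & MASK


def to_value(bits: int) -> float:
    """16-bit two's-complement bit pattern -> real value."""
    if bits >> (WIDTH - 1):
        bits -= 1 << WIDTH
    return bits * LSB


def negate(value: float) -> float:
    flipped = to_bits(value) ^ MASK                    # flip all the bits
    return to_value(to_bits(to_value(flipped) + LSB))  # add one LSB weight (0.5)


print(to_value(to_bits(4.5) ^ MASK))  # -5.0: flipping alone overshoots by one LSB
print(negate(4.5))                    # -4.5
```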

**johned**

Great talk, Dan,

Back in the days before everyone had PCs and laptops, I was designing some filters by hand and negating values using the "Flip All the Bits and add 1 LSB" technique I'd been taught at Uni.

The chief engineer told me about a much more efficient solution: "starting with the LSB, write down all the bits up to and including the first '1', then negate all of the other bits". Much easier to do in your head and also easier to implement in an FPGA :-)