Fixed-Point Filters – Modelling and Verification Using Python
Dan Boschen - Watch Now - EOC 2023 - Duration: 02:21:36
NEW: All files related to this workshop have been zipped and can be downloaded by clicking on the link in the left column "Click Here to Download Slides (PDF)"
Digital filters are commonly used in signal processing, whether for wireless waveforms, captured sounds, or biomedical signals such as ECG, typically to pass certain frequencies and suppress others. Fixed-point implementation is attractive for the lowest-power, lowest-cost solutions when it is critical to make the most of limited computing resources; however, implementing filters in fixed-point binary arithmetic can bring significant performance challenges. When a fixed-point implementation is required, a typical design process is to start with a floating-point design that has been validated to meet all performance requirements, and then simulate a fixed-point implementation of that design, adjusting the precision used until the requirements are met.
In this workshop, Dan takes you through the practical process of simulating a fixed-point digital filter using open-source Python libraries. This is of interest to participants wanting a motivating example for learning Python, as well as to those already experienced with Python. Also included: a quick recap of basic filter structures and filter performance concerns. A significant background in Digital Signal Processing (DSP) or digital filter design is not required; having taken an undergraduate Signals and Systems course is sufficient. For a more detailed review of the binary fixed-point operations and notations used in this workshop, please attend Dan's Theatre Talk "Fixed-Point Made Easy: A Guide for Newcomers and Seasoned Engineers," scheduled before this workshop. After attending, participants will be equipped to confidently convert a given filter implementation to fixed-point prior to detailed implementation. If you have a floating-point filter design and need to implement it in fixed-point, this workshop is for you!
What this presentation is about and why it matters
This workshop walks through the practical process of converting a validated floating‑point digital filter design into a fixed‑point implementation, using Python for simulation and verification. Fixed‑point arithmetic is still widely used in low‑power, low‑cost embedded systems (MCUs, DSPs, FPGAs, and ASICs). The talk explains how quantization and finite word‑length effects appear in filters, how those errors accumulate, and how to pick bit widths and placements so the implemented filter meets your system SNR and rejection requirements. If you must move a working floating‑point filter to an embedded platform (or evaluate whether you can), a clear methodology and the right simulation tools make that migration predictable rather than scary.
Who will benefit the most from this presentation
- Embedded systems and FPGA engineers who must implement filters with constrained data widths and power budgets.
- DSP engineers migrating prototypes from floating point to production fixed point.
- Students and practitioners who want an applied, hands‑on introduction to fixed‑point effects (using Python/Jupyter).
- Anyone deciding tradeoffs between floating‑point ease and fixed‑point cost/power savings.
What you need to know
The talk is approachable with a single undergraduate Signals & Systems course under your belt. To get the most out of it, be comfortable with:
- Basic filter representations: impulse response, transfer function H(z), numerator/denominator coefficients, and the distinction between FIR (feedforward) and IIR (feedback) filters.
- Implementation forms: direct form I, direct form II, and transposed forms, plus the idea of factoring a high‑order IIR into second‑order sections (SOS / biquads).
- Sampling and spectral analysis: what an FFT represents, how an FFT bin is a narrowband filter, and why increasing FFT length reduces per‑bin noise (processing gain).
- Fixed‑point notation Qm.n (signed/unsigned), where m is integer bits and n fractional bits, and the practical meaning of resizing (shifting the fixed window) and saturation/wrap behavior.
- Key sizing rules you will see in the talk: an a-bit input times a b-bit coefficient requires \(a+b\) bits at the multiplier output to be exact; accumulators need roughly \(\log_2(N)\) extra bits when summing N terms; and the quantization error for a uniform LSB step Q has variance \(\sigma^2 = Q^2/12\), which underpins the common SNR rule of thumb.
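The Qm.n resizing and saturation/wrap behavior above can be sketched in a few lines of NumPy (a minimal illustration only, not the fp-binary library used in the workshop; here m includes the sign bit, matching the m+n total-bits convention):

```python
import numpy as np

def quantize_q(x, m, n, saturate=True):
    """Quantize x to signed Q(m.n): m integer bits (sign included), n fractional bits."""
    scale = 2.0 ** n
    q = np.round(np.asarray(x, dtype=float) * scale)   # integer code after rounding
    lo, hi = -2 ** (m + n - 1), 2 ** (m + n - 1) - 1   # signed range of an (m+n)-bit word
    if saturate:
        q = np.clip(q, lo, hi)                          # clamp at full scale
    else:
        q = ((q - lo) % (hi - lo + 1)) + lo             # two's-complement wraparound
    return q / scale

# Q1.15 covers [-1, 1 - 2**-15] in steps of 2**-15
print(quantize_q(0.3, 1, 15))                 # nearest representable value to 0.3
print(quantize_q(1.5, 1, 15))                 # saturates to 32767/32768
print(quantize_q(1.0, 1, 15, saturate=False)) # wraps to -1.0 (the classic overflow)
```

Trying the same inputs with `saturate=False` makes the case for saturation logic at every resize: a tiny overshoot past full scale flips the sign entirely.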
A couple useful formulas introduced in the talk (MathJax syntax): the quantization variance for a step Q is \(\sigma^2 = \frac{Q^2}{12}\). For a full‑scale sinusoid, the quantization SNR approximation is often written as \(\text{SNR}_{\text{dB}} \approx 6.02\cdot B + 1.76\ \text{dB}\), where B is the number of bits.
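As a quick sanity check (a standalone sketch, not taken from the workshop notebook), the 6.02 dB-per-bit rule can be confirmed numerically by quantizing a full-scale sine and measuring the error power directly:

```python
import numpy as np

# Numerically confirm SNR ≈ 6.02*B + 1.76 dB for a full-scale sine with a B-bit quantizer.
B = 12
Q = 2.0 / 2**B                               # LSB step for a [-1, 1) full-scale range
t = np.arange(1 << 16)
x = np.sin(2 * np.pi * 0.1234567 * t)        # incommensurate frequency avoids periodic error
e = np.round(x / Q) * Q - x                  # quantization error, roughly uniform over ±Q/2
snr_db = 10 * np.log10(np.mean(x**2) / np.mean(e**2))
print(snr_db, 6.02 * B + 1.76)               # the two agree closely
```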
Glossary
- Fixed‑point: A binary representation with a fixed radix point; precision and range are set by the number of integer and fractional bits (Qm.n).
- Floating‑point: A numeric format that encodes a significand and exponent (e.g., IEEE‑754); convenient for design but typically uses more power/area in hardware.
- Quantization noise: Random error introduced when mapping a higher‑precision value to a finite set of levels (e.g., rounding or truncation).
- Qm.n: Notation for fixed‑point formats: m integer bits, n fractional bits (total bits = m+n, sign bit included if signed).
- SNR (Signal‑to‑Noise Ratio): Ratio of signal power to noise power, typically expressed in dB; used here to measure filter output quality after quantization.
- FFT bin / processing gain: Each FFT bin is a narrowband filter; larger FFTs concentrate white noise into more bins and reduce per‑bin noise by 10·log10(N).
- FIR: Finite Impulse Response filter—no feedback, inherently stable, often easier to design in fixed point.
- IIR: Infinite Impulse Response filter—uses feedback. Quantization inside loops gets shaped by the filter and can accumulate or be amplified.
- SOS (Second‑Order Section): Factoring a high‑order IIR into cascaded 2nd‑order biquads to reduce sensitivity and ease fixed‑point realization.
- Accumulator growth: When summing N values, required bit growth is about \(\log_2(N)\); multipliers produce outputs with bitwidth equal to the sum of operand widths.
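The last two glossary entries reduce to a pair of one-line sizing helpers (a sketch of the rules as stated in the talk, applied to a hypothetical 64-tap FIR):

```python
import math

def mult_out_bits(a_bits, b_bits):
    # Exact product of an a-bit and a b-bit value fits in a_bits + b_bits
    return a_bits + b_bits

def acc_bits(term_bits, n_terms):
    # Summing n_terms values of term_bits each needs ceil(log2(n_terms)) growth bits
    return term_bits + math.ceil(math.log2(n_terms))

# Hypothetical 64-tap FIR: 16-bit data times 18-bit coefficients
p = mult_out_bits(16, 18)     # 34-bit products
print(acc_bits(p, 64))        # 40-bit accumulator for a bit-exact sum
```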
Final note — why this presentation is worth your time
Dan packs practical experience into a clear, example‑driven workflow: start from a validated floating‑point design, use targeted simulations (Jupyter + fp‑binary), iterate on coefficient quantization and node sizing, and verify using FFTs and SNR/EVM metrics. He balances intuition (why truncation introduces a DC offset, how quantization noise sums in power) with concrete recipes (where to resize, how many bits to add for accumulation, and why SOS often wins for stability). If you want a hands‑on, pragmatic path to get floating‑point filters onto constrained hardware with predictable results, this talk — and its notebook demos — will save you a lot of trial and error. Enjoy the practical demonstrations and the clear explanations; they make a tricky topic much more approachable.
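As a taste of the kind of experiment the notebook demos run (a hypothetical windowed-sinc lowpass, not Dan's actual filter), here is how coefficient quantization alone erodes stopband rejection:

```python
import numpy as np

# Hypothetical 63-tap Hamming-windowed-sinc lowpass, cutoff ~0.2*fs
N = 63
n = np.arange(N) - (N - 1) / 2
h = 0.4 * np.sinc(0.4 * n) * np.hamming(N)       # float reference taps

def stopband_db(taps, edge=0.3):
    """Worst stopband level (dB relative to the passband peak) above 'edge' cycles/sample."""
    H = np.fft.rfft(taps, 8192)
    f = np.fft.rfftfreq(8192)
    return 20 * np.log10(np.abs(H[f >= edge]).max() / np.abs(H).max())

for frac_bits in (8, 12, 16):
    hq = np.round(h * 2**frac_bits) / 2**frac_bits   # quantize taps to n fractional bits
    print(frac_bits, round(stopband_db(hq), 1))
# Fewer fractional coefficient bits -> a raised stopband noise floor (worse rejection)
```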
Hi Mathieu, thanks for these insightful comments.
The final bit sizes shown are not necessarily the optimum bit sizes; the optimum depends on the total allowable SNR degradation as well as the allowable reduction in stop-band rejection. What I do is simply iterate on the bit widths while monitoring both (SNR degradation and achieved stop-band rejection) with a maximum allowable degradation in mind. As mentioned in the talk, I used a sine wave because it would be easiest for a wider audience to follow in a quick demonstration, but I recommend actually doing this with a test signal that contains a reference waveform filling the bandwidth used, plus elevated noise at worst-case interference levels (the reference waveform is used throughout to confirm waveform quality with the EVM function). As for setting the bit width between stages, a rule of thumb is that it should be higher precision than the final output (similar to the rule of thumb, and for the same reasons, that I demonstrate for the multiplier outputs in an FIR filter being higher precision than the filter's final output), but how much higher really depends on the gain (and structure!) of the subsequent stages. By using the iterative approach with Q(m,n) format, we adjust both gain and precision between each stage and quickly see each stage's contribution to the final output performance. Not demonstrated, but I recommend further using the floating-point model to create a reference waveform at each major node (between the SOS sections), so that the SNR degradation can be monitored directly at that stage. Ultimately you work with an SNR budget representing the total allowable degradation for the whole filter, and balance how much of the budget to use for each section.
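The iterative monitoring described here can be sketched as a simple sweep (a toy single-node example with a made-up test signal; a real sweep would re-run the full fixed-point filter model per candidate width and compare each node against its floating-point reference):

```python
import numpy as np

# Hypothetical test signal: reference sine plus a little noise
rng = np.random.default_rng(1)
ref = np.sin(2 * np.pi * 0.05 * np.arange(4096)) + 0.01 * rng.standard_normal(4096)

def node_snr_db(ref, frac_bits):
    """SNR at one node after quantizing it to the given fractional width."""
    fx = np.round(ref * 2**frac_bits) / 2**frac_bits   # quantize the node value
    return 10 * np.log10(np.mean(ref**2) / np.mean((fx - ref)**2))

for n_bits in (8, 10, 12, 14, 16):
    print(n_bits, round(node_snr_db(ref, n_bits), 1))
# Each extra fractional bit buys ~6 dB; stop once the stage meets its share of the SNR budget
```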
You'll see from this the sensitivity of each stage, so that the distribution of the noise degradation is optimized across the bit-width growth (growing the bit width of one stage lets you shrink that of another, and we find the optimum balance with that in mind). I am developing a tool that automates much of this while also conveniently displaying the total resources (adds, delays, multipliers) used, but in this talk I didn't want that tool to hide the educational detail of manually tuning the filter.
I am happy you noticed the utility (and sophistication) of my EVM function. I call this a "Rho Tool," which is equivalent to what we would traditionally call EVM when limited to decision samples only. Here, however, for a sine wave (or any other reference waveform where we care about the accuracy of every sample), the "error vector" is the difference between each sample and its noise-free replica. I detail the process of creating such a tool here: https://dsp.stackexchange.com/questions/86682/issue-with-snr-and-sinad-measurement-using-matlab-functions-in-specific-cases/87596#87596 . I have implemented this for very high dynamic range, high accuracy measurements, and the 2D search you mention was accomplished very efficiently via binary search (log2(N) steps to converge to the floor).
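A minimal sketch of such a tool for a known-frequency sine (not Dan's actual Rho Tool, and omitting his delay/frequency search; the complex-gain fit here replaces a 2D magnitude/phase search with a least-squares solve):

```python
import numpy as np

def sine_evm_db(meas, freq):
    """Residual error-vector power (dB) of meas against a best-fit sine at freq."""
    n = np.arange(len(meas))
    ref = np.exp(2j * np.pi * freq * n)             # complex reference tone
    basis = np.column_stack([ref.real, ref.imag])   # amplitude & phase as 2 real coeffs
    coef, *_ = np.linalg.lstsq(basis, meas, rcond=None)
    fit = basis @ coef                              # noise-free replica of the measurement
    return 10 * np.log10(np.mean((meas - fit)**2) / np.mean(fit**2))

# Example: a sine quantized to 10 fractional bits; the residual is quantization noise
t = np.arange(8192)
x = 0.9 * np.sin(2 * np.pi * 0.0371 * t + 0.4)
x_q = np.round(x * 2**10) / 2**10
print(round(sine_evm_db(x_q, 0.0371), 1))
```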
All files have been zipped and can now be downloaded from the "Click Here to Download Slides (PDF)" link on the left.
Thank you, Stephane. And thank you, Dan, for a very well-done and highly interesting session. Bravo.
One of my motivations for attending the EOC is to learn from experts like Dan. The value of learning from such experts is incredible. In Dan's case, I learn something new every time I see his presentations. His methods and resources go well beyond those of many courses I've attended in person. Dan's attention to detail is obvious, as is his passion. I plan to retake a few of his courses, and I would like to see the material he mentioned he is putting together for a possible filters course. I find Dan's courses well worth the cost and time of attending.
Thank you for the kind words, David. It's always great to have interested co-learners such as yourself!
Cannot join -- "the host has another meeting in progress"
This live event is over. A recording will be posted later today.
Ah, my mistake. I thought the schedule said 10am.
Yes, 10 am EDT.
Hi, thanks a lot for this excellent presentation on a subject that is way underrated today. I got a lot of value from it.
I just have a few questions:
1) I did not get why the SOS#1 and SOS#2 output formats were set to 20 bits wide (maybe I missed the reason in the presentation).
Did you resize the SOS#1 and SOS#2 output formats to a smaller size in order to "relax" the constraint on the downstream coefficients?
Is there a rule of thumb for selecting the datapath width between second-order sections?
Shouldn't we try to limit the quantization noise as much as possible in the first section of the filter, so that less noise is cascaded through the chain?
2) Finally, as an exercise, I would like to run the Jupyter notebook, so I want to replicate the EVM function.
I have never really computed EVM for sine waves.
Do I have to perform a 2D search to find the best {delay, magnitude} of the output vs. the reference, and then compute the error vector?
I don't need the code, just let me know if I am on the right track ;-)
Best regards