From LLMs to SLMs in Embedded World
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning, multimodal understanding, and natural-language interaction, but their computational and memory demands place them far beyond the reach of resource-constrained embedded platforms. As embedded systems increasingly require on-device intelligence, supporting tasks such as voice interfaces, anomaly detection, semantic understanding, and autonomous decision-making, there is a growing need to scale language models down without sacrificing essential accuracy, responsiveness, or safety.
This presentation traces the evolution from cloud-scale LLMs to optimized Small Language Models (SLMs) tailored for embedded systems. It examines the algorithmic, architectural, and co-design innovations that make this transition possible, including model compression, quantization, structured pruning, distillation, sparsity-aware compute, and memory-efficient attention mechanisms. It also highlights system-level considerations such as real-time inference, energy constraints, thermal limits, secure deployment, and domain-specific customization.
The session provides insight into how SLMs enable embedded devices to run meaningful language and reasoning workloads locally, reducing latency, improving privacy, increasing reliability in disconnected environments, and enabling new classes of intelligent edge applications. Representative use cases across automotive, industrial automation, smart IoT, and wearable devices illustrate the emerging potential of deploying compact language models directly at the edge.
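To make the memory gap described above concrete, here is a minimal back-of-the-envelope sketch of weight storage at different precisions. The parameter counts and the 2 GiB budget are hypothetical illustration values, not figures from the talk, and the estimate covers weights only (no activations, KV cache, or runtime overhead).

```python
# Rough weight-storage estimate: parameters * bits per weight / 8 bytes.
# All model sizes and the RAM budget below are hypothetical illustration values.

def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_weight // 8


def fits(num_params: int, bits_per_weight: int, ram_budget_bytes: int) -> bool:
    """True if the weights alone fit in the given RAM budget."""
    return weight_bytes(num_params, bits_per_weight) <= ram_budget_bytes


GIB = 1024 ** 3
budget = 2 * GIB  # hypothetical device budget reserved for model weights

for name, params in [("7B-parameter model", 7_000_000_000),
                     ("1B-parameter model", 1_000_000_000)]:
    for bits in (16, 8, 4):
        size = weight_bytes(params, bits)
        verdict = fits(params, bits, budget)
        print(f"{name} @ {bits}-bit: {size / GIB:.2f} GiB, fits={verdict}")
```

Under these assumptions, the larger model does not fit the budget even at 4 bits per weight, while the smaller one fits at every precision shown. That is the kind of arithmetic that pushes embedded designs toward SLMs rather than compressed LLMs.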
What this presentation is about and why it matters
What changes when the model is no longer allowed to assume cloud-scale compute, constant connectivity, or generous memory? Shivangi Agrawal frames that tension through embedded use cases like wearables, automotive assistants, industrial IoT, smartphones, and agriculture, then walks through the practical shift from large language models to small language models. The session blends a grounded industry overview with an architecture-and-deployment tour, including latency, privacy, hardware limits, compression, and on-device inference. It is a good fit if you are trying to understand where SLMs fit in edge systems, and what tradeoffs matter when the target is a constrained device rather than a data center.
Who will benefit the most from this presentation
- Embedded systems engineers evaluating whether on-device AI can fit within power, RAM, and latency budgets
- ML engineers moving from cloud-hosted LLM workflows to edge deployment
- Product architects working on privacy-sensitive or intermittently connected devices
- Technical leads in automotive, industrial IoT, healthcare, or mobile products considering local inference
What you need to know
Helpful background, but not required:
- Basic familiarity with LLMs and the idea of inference
- Some awareness of embedded constraints such as memory, power, and latency
- High-level understanding of model compression terms like quantization or pruning
Glossary (terms used in this talk)
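- Quantization: The process of mapping a high-precision or continuous value to a limited set of representable values. In model compression, this usually means storing weights and activations at lower precision (for example, 8-bit integers instead of 32-bit floats), which shrinks the model but introduces approximation error that can affect accuracy.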
- State-space model (SSM): A system representation using internal state variables and matrices that define state evolution and input/output behavior.
- Attention (QKV, Query-Key-Value): A mechanism that computes weighted combinations of values using similarity scores derived from queries and keys.
- Transformer: A sequence model that uses attention mechanisms to capture long-range dependencies without recurrence.
- Pruning: A model compression technique that removes unnecessary weights, channels, or connections to reduce model size and compute.
- Small language model (SLM): A language model designed to run with fewer parameters and lower resource demands than a large general-purpose model. SLMs are often used where latency, memory, power, or connectivity are constrained.
- Edge AI: The practice of running AI inference, and sometimes training, on devices near where data is produced instead of sending everything to a centralized cloud. It is used to reduce latency, protect privacy, and improve resilience to poor connectivity.
- NPU (Neural Processing Unit): A specialized processor designed to accelerate neural network workloads, especially matrix-heavy inference. NPUs are commonly used in mobile and embedded systems because they can deliver better performance per watt than general-purpose CPUs.
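Two of the glossary terms above, quantization and pruning, can be illustrated on a toy weight list. This is a minimal sketch, assuming symmetric per-tensor int8 quantization and unstructured magnitude pruning; the function names and weight values are hypothetical and are not taken from the talk or from any particular library.

```python
# Symmetric per-tensor int8 quantization and magnitude pruning on a toy weight list.
# Function names and values are illustrative assumptions, not from the talk.

def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.02, -0.75, 0.31, 1.27, -0.004, 0.5]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
pruned = prune_by_magnitude(weights, 0.5)

print("int8 codes:", codes)
print("max round-trip error:", max_error)
print("50% pruned:", pruned)
```

The round-trip error stays within about half a quantization step, which is the approximation error the glossary entry refers to. Production toolchains typically add per-channel scales, calibration data, and structured sparsity patterns that this sketch leaves out.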
Toolbox (mentioned in this talk)
- TensorFlow Lite: A lightweight runtime and model format for deploying machine learning models on mobile and embedded devices. It is commonly used as an intermediate step before target-specific compilation or conversion.
- Apple A18 Pro: A mobile SoC platform with integrated AI acceleration used for on-device workloads. It represents the kind of consumer hardware that can host low-latency edge inference.
- NVIDIA Jetson Orin NX: An edge AI module aimed at robotics and other embedded systems that need substantial on-device compute. It combines GPU acceleration with a compact form factor for deployment near sensors and actuators.
- Google Edge TPU: An edge accelerator for running machine learning inference with low power consumption. It is designed for always-on or battery-sensitive deployments where efficiency matters.
- VITOS N 78: A low-power edge AI platform mentioned as an example of ultra-low-power hardware. It represents the class of devices used when power budgets are extremely tight.
- ONNX: An open model representation used to move trained neural networks between frameworks and runtimes. It helps with model portability across different deployment targets.
Final thoughts
Practical and forward-looking, this session gives a useful mental model for why edge AI is not just a smaller version of cloud AI. The value is in how it connects architecture, deployment constraints, and hardware choices into one picture, so embedded teams can reason more clearly about feasibility and tradeoffs. It will help engineers, architects, and product leads who need to place intelligence close to the device without losing sight of real-world constraints. The talk keeps returning to a simple idea: where intelligence lives changes what it can safely and usefully do.
This overview is AI-generated from the session transcript.

BLERP: Bandwidth, Latency, Energy, Reliability, Privacy