For the past few years, the default answer to "how do we add AI to our product?" has been to call a cloud API. Send data up, get predictions back. It works, it scales, and for many use cases it remains the right choice. But as we built more AI-powered features for our clients and our own products, we kept running into the same friction points: latency that broke real-time experiences, privacy concerns that blocked adoption, and connectivity requirements that ruled out entire categories of users.
That friction pushed us toward on-device inference — running ML models directly on the user's hardware. The trade-offs are real, but for certain classes of problems, local inference isn't just viable — it's superior.
When On-Device Wins
Not every AI feature belongs on-device. Large language models with hundreds of billions of parameters still need server-side compute. But a surprising number of practical AI tasks run well locally: speech recognition, text correction, image classification, anomaly detection, and real-time sensor processing. These models are small enough to fit in memory and fast enough to run at interactive speeds.
The advantages compound quickly. Latency drops from hundreds of milliseconds to single-digit milliseconds. Data never leaves the device, which sidesteps entire categories of compliance concerns — GDPR, HIPAA, data residency laws. The feature works without an internet connection. And there are no per-request API costs, which matters more than people think at scale.
The Framework Landscape
Two years ago, on-device inference meant wrestling with Core ML conversions, quantization pipelines, and platform-specific optimizations. Today the ecosystem is maturing fast. Apple's MLX framework brings a NumPy-like interface for running models on Apple Silicon with unified memory. WhisperKit wraps OpenAI's Whisper model in a native Swift package optimized for Apple's Neural Engine. TensorFlow Lite and ONNX Runtime continue to improve for cross-platform scenarios.
We have been using WhisperKit and MLX extensively. WhisperKit in particular has been impressive: it runs Whisper large-v3 on an M-series Mac with a real-time factor well below 1x, meaning transcription finishes faster than the audio plays. Combined with MLX for post-processing and text correction, you can build sophisticated speech-to-text pipelines that run entirely on-device with quality rivaling cloud APIs.
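Real-time factor is simply processing time divided by audio duration, so anything below 1x means transcription outpaces playback. A minimal sketch of the arithmetic (the numbers here are illustrative, not benchmarks from our setup):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio.
    RTF < 1.0 means transcription finishes before the clip would."""
    return processing_seconds / audio_seconds

# Illustrative only: a 60-second clip transcribed in 12 seconds.
rtf = real_time_factor(12.0, 60.0)
print(f"RTF: {rtf:.2f}x")  # 0.20x, i.e. 5x faster than real time
```

An RTF comfortably below 1x is what makes live transcription viable: the pipeline keeps up with the microphone with headroom left for post-processing.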
The Quantization Trade-off
Making models small enough for on-device deployment usually involves quantization — reducing weight precision from 32-bit floats to 8-bit or even 4-bit integers. Modern quantization techniques preserve most of the model's accuracy — in our experience with Gongje, 4-bit quantized LLMs (like Qwen 3 1.7B at roughly 1.2 GB) perform well enough for text correction tasks on machines with as little as 8 GB of RAM. The key is using well-calibrated quantization methods rather than naive rounding, which hurts quality noticeably.
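The core idea is easy to see in miniature. Here is a toy, stdlib-only sketch of symmetric 8-bit quantization with a single per-tensor scale; real toolchains use calibrated, group-wise schemes far more sophisticated than this, so treat it purely as an illustration of the round-trip:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: derive one scale from the largest
    absolute weight, then round each weight to a signed byte in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.07, -1.20, 0.33]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
print(q)        # small integers instead of 32-bit floats
print(max_err)  # worst-case rounding error, bounded by scale / 2
```

Even this naive scheme keeps the error within half a quantization step; the calibrated methods the post refers to spend their effort choosing scales (often per group of weights) so that the error lands where the model is least sensitive to it.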
Privacy as a Feature
Privacy is often framed as a constraint — something you have to comply with. But when you run AI on-device, privacy becomes a feature you can market. Users are increasingly aware of where their data goes. A speech recognition feature that explicitly never sends audio to the cloud is a compelling differentiator. We have seen this firsthand with our own products: users choose the on-device option even when a cloud version with slightly better accuracy is available.
When to Stay in the Cloud
On-device inference is not a universal solution. If your model needs frequent updates based on collective user data, cloud deployment is simpler. If the model is too large for consumer hardware (most LLMs above 7B parameters), cloud is the only option. If you need to aggregate results across users in real time, the data has to go somewhere central. And if your users are on older or low-powered devices, you cannot guarantee the inference experience.
The best architecture often combines both. Use on-device inference for latency-sensitive, privacy-critical features, and cloud inference for heavy-lift tasks that benefit from larger models. The key is making this decision deliberately for each feature rather than defaulting to cloud because that is what everyone does.
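One way to make that per-feature decision deliberate is to write it down as a routing policy. A hypothetical sketch — the feature attributes and the 7B cutoff are assumptions for illustration, not a prescription:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    latency_sensitive: bool   # must respond at interactive speeds?
    privacy_critical: bool    # would a cloud call ship raw user data?
    model_params_b: float     # required model size, billions of parameters

def route(feature: Feature, device_max_params_b: float = 7.0) -> str:
    """Decide where a feature's inference runs, explicitly, per feature."""
    if feature.model_params_b > device_max_params_b:
        return "cloud"        # too large for typical consumer hardware
    if feature.latency_sensitive or feature.privacy_critical:
        return "on-device"    # local wins on latency and privacy
    return "cloud"            # no local advantage; cloud is simpler to run

print(route(Feature("speech-to-text", True, True, 1.6)))
print(route(Feature("long-document summarization", False, False, 70.0)))
```

The value of a function like this is less the code than the conversation it forces: every feature gets an explicit answer instead of inheriting the cloud default.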
Looking Ahead
Hardware is getting better fast. Every generation of Apple Silicon, and increasingly Qualcomm's Snapdragon chips, ships with more powerful neural engines. Models are getting more efficient — distillation and architecture innovations are shrinking the compute needed for a given quality level. We expect that within two years, most of the AI features that currently require cloud APIs will be viable on-device for flagship hardware.
For us at Glasir, on-device AI is not just a technical interest — it is a strategic bet. We are building expertise in local inference, optimized model deployment, and privacy-first AI architecture because we believe this is where a significant portion of the industry is heading. Our work on Gongje has been the proving ground, and we are carrying those lessons into every client engagement where on-device makes sense.