Gongje (講嘢) started as an internal experiment. A few of us on the team speak Cantonese, and we were frustrated that existing speech-to-text tools either did not support Cantonese well or required sending audio to cloud servers. We wanted something that worked offline, ran natively on macOS, and actually understood Cantonese — including the way people naturally mix it with English and Mandarin.
What started as a side project became our first open-source release. Here is what we built and what we learned along the way.
How It Works
Gongje is a macOS menu bar app written in Swift. You press a global hotkey (Option-Space by default), speak, and press it again. The transcribed text gets injected directly into whatever app you are using — your editor, a chat window, a browser field. Everything runs on-device. No audio ever leaves your machine.
Under the hood, the pipeline has two stages. First, WhisperKit runs a CoreML-optimized Whisper model to transcribe your speech in real time using Apple Silicon's Neural Engine. We support both OpenAI's standard Whisper models (which produce written Chinese) and community-trained Cantonese models (which produce spoken Cantonese characters). Second, an optional on-device LLM powered by MLX Swift post-processes the transcription to fix homophone errors and add punctuation.
The Homophone Problem
Cantonese has a lot of homophones (同音字) — different characters that share the same pronunciation. Whisper frequently picks the wrong one. A cloud-based solution could lean on a massive language model to resolve these, but we wanted to stay fully offline. Our solution is an on-device LLM correction layer using MLX Swift. It takes the raw Whisper output, fixes homophone errors, adds missing punctuation, and preserves Cantonese rather than converting to Mandarin.
To keep this responsive, we debounce LLM requests by 300 ms so that the model is not invoked on every incremental transcription update. We also implemented a drift guard based on Levenshtein distance: if the LLM's output deviates too far from the original transcription, we reject it. The LLM should correct, not rewrite.
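A drift guard of this kind can be sketched in a few lines of Swift. This is an illustration, not the code from the repository: the function names and the `maxDriftRatio` threshold of 0.5 are assumptions made for the example. The distance is counted in Swift `Character`s, so one Chinese character is one edit unit.

```swift
import Foundation

/// Levenshtein edit distance between two strings, counted in Characters.
func levenshtein(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }

    // Two-row dynamic programming: `previous` is the row for the prefix of
    // `s` one character shorter than the one `current` is being filled for.
    var previous = Array(0...t.count)
    var current = [Int](repeating: 0, count: t.count + 1)

    for i in 1...s.count {
        current[0] = i
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            let deletion = previous[j] + 1
            let insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        }
        swap(&previous, &current)
    }
    return previous[t.count]
}

/// Returns the LLM's correction if it stays close to the raw transcript,
/// otherwise falls back to the raw transcript.
/// `maxDriftRatio` is a hypothetical threshold chosen for this sketch.
func guardedCorrection(raw: String, corrected: String,
                       maxDriftRatio: Double = 0.5) -> String {
    guard !raw.isEmpty else { return raw }
    let drift = Double(levenshtein(raw, corrected)) / Double(raw.count)
    return drift <= maxDriftRatio ? corrected : raw
}
```

With these assumed settings, correcting 佢系我朋友 to 佢係我朋友。 costs two edits over five characters and is accepted, while a rewrite into Mandarin such as 他是我的朋友 costs three edits and is rejected. Normalizing by the length of the raw transcript keeps the threshold meaningful for both short and long utterances.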
Design Decisions Worth Sharing
Text injection via simulated paste. We considered using macOS Accessibility APIs to insert text directly into the focused app, but AX-based text insertion is unreliable across apps and does not handle CJK input methods well. Instead, we save the clipboard contents, write the transcription to the pasteboard, simulate Cmd-V via CGEvent, and restore the original clipboard after a short delay. This works with any app that supports paste.
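The save, write, paste, restore sequence can be sketched as follows. This is a minimal illustration rather than the app's actual code: the `Clipboard` protocol, `injectText`, `SystemClipboard`, and `simulateCommandV` are names invented for the example, and only plain-string clipboard contents are preserved here. The AppKit-specific parts sit behind `canImport(AppKit)` so the ordering logic can be exercised with a fake clipboard.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Hypothetical clipboard abstraction used only in this sketch.
protocol Clipboard: AnyObject {
    var text: String? { get set }
}

/// Saves the clipboard, writes `text`, triggers a paste, then hands a
/// restore action to `scheduleRestore` so the caller can run it after a
/// delay, once the target app has read the pasteboard.
func injectText(
    _ text: String,
    clipboard: Clipboard,
    paste: () -> Void,
    scheduleRestore: (@escaping () -> Void) -> Void
) {
    let saved = clipboard.text
    clipboard.text = text
    paste()
    scheduleRestore { clipboard.text = saved }
}

#if canImport(AppKit)
/// Clipboard backed by the general pasteboard (plain strings only;
/// images or rich text on the pasteboard would not be restored).
final class SystemClipboard: Clipboard {
    var text: String? {
        get { NSPasteboard.general.string(forType: .string) }
        set {
            NSPasteboard.general.clearContents()
            if let value = newValue {
                NSPasteboard.general.setString(value, forType: .string)
            }
        }
    }
}

/// Posts a Cmd-V key-down and key-up pair.
func simulateCommandV() {
    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey: CGKeyCode = 9 // kVK_ANSI_V
    let down = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let up = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    down?.flags = .maskCommand
    up?.flags = .maskCommand
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
}

// Usage on macOS (the 0.2 s delay is an assumed value):
// injectText("你好", clipboard: SystemClipboard(), paste: simulateCommandV) { restore in
//     DispatchQueue.main.asyncAfter(deadline: .now() + 0.2) { restore() }
// }
#endif
```

Passing the paste and restore steps in as closures keeps the ordering testable without touching the real pasteboard. The restore delay matters because the target app reads the pasteboard asynchronously after it receives the key event.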
vDSP noise reduction instead of a neural model. The app already runs Whisper and, optionally, an LLM at the same time, leaving limited headroom on 8 GB machines. Instead of adding a neural noise suppression model, we used Apple's Accelerate framework: a high-pass biquad filter to remove low-frequency rumble, and spectral gating that learns a noise profile from the first half-second of audio. Accelerate runs these on the CPU's SIMD and AMX hardware on Apple Silicon, so the added cost is negligible and the GPU and Neural Engine stay free for the models.
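To show what the high-pass stage does, here is a plain-Swift sketch of a second-order (biquad) high-pass filter using the widely used Audio EQ Cookbook coefficients. It deliberately avoids Accelerate so the math is visible. The app itself uses vDSP, and the type name, the 80 Hz cutoff, and the Q value are assumptions for illustration. Spectral gating is not shown.

```swift
import Foundation

/// Second-order high-pass filter (direct form I), written in plain Swift.
struct HighPassBiquad {
    // Coefficients, already normalized by a0.
    private let b0, b1, b2, a1, a2: Double
    // Previous two inputs and outputs.
    private var x1 = 0.0, x2 = 0.0, y1 = 0.0, y2 = 0.0

    init(cutoffHz: Double, sampleRate: Double, q: Double = 0.7071) {
        let w0 = 2.0 * Double.pi * cutoffHz / sampleRate
        let alpha = sin(w0) / (2.0 * q)
        let cosW0 = cos(w0)
        let a0 = 1.0 + alpha
        b0 = (1.0 + cosW0) / 2.0 / a0
        b1 = -(1.0 + cosW0) / a0
        b2 = (1.0 + cosW0) / 2.0 / a0
        a1 = -2.0 * cosW0 / a0
        a2 = (1.0 - alpha) / a0
    }

    /// Filters a buffer of samples, carrying state across calls so that
    /// consecutive audio frames can be processed as one stream.
    mutating func process(_ samples: [Float]) -> [Float] {
        var output = [Float](repeating: 0, count: samples.count)
        for (i, sample) in samples.enumerated() {
            let x0 = Double(sample)
            let y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
            x2 = x1; x1 = x0
            y2 = y1; y1 = y0
            output[i] = Float(y0)
        }
        return output
    }
}
```

For 16 kHz audio, the rate Whisper models expect, a filter created with `HighPassBiquad(cutoffHz: 80, sampleRate: 16_000)` removes DC offset and rumble below the speech band while leaving speech frequencies essentially untouched.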
A setup wizard for first launch. Without onboarding, new users would see a menu bar icon appear, get hit with permission prompts, and watch a large model download start without context. We built a 6-step wizard covering microphone permission, accessibility permission, model selection and download, optional LLM setup, and hotkey configuration. It only shows once, but it makes the first experience significantly better.
Why Open Source
Cantonese is an underserved language in the AI space. The community of developers working on Cantonese NLP is small. By open sourcing Gongje, we hope to attract contributors who can improve the tool in ways we cannot — particularly around dialect support and model fine-tuning.
There is a practical benefit too. Open sourcing forces you to write better code and documentation. When you know others will read your source, you invest more in clean architecture and sensible defaults. For potential clients, the repository is the most honest portfolio piece we have — they can see exactly how we write code and think about architecture.
If you are interested in Cantonese speech-to-text, on-device AI, or just want a transcription tool that respects your privacy, check out Gongje on GitHub.