Voice to Prompt in 3 Hours
Built it for $0. Saves me 50 minutes a day.
I was sick of typing long AI prompts every day. So I stopped typing. Here's exactly how I built a hands-free voice system using free APIs โ and how you can too.
Start Reading โ๐ก If you're still typing every prompt by hand, you're wasting at least 40 minutes a day. This guide fixes that.
WisprFlow went paid. I'd been using it to dictate AI prompts hands-free โ and suddenly it was $12/month. I was about to close the tab and pay it, when I thought: how hard can this actually be?
I was spending 30โ40 minutes a day just typing prompts. Not thinking. Not building. Just typing the same kinds of instructions into Claude, over and over. That's 4+ hours a week on pure friction.
The Decision Moment
What I Actually Needed to Build
- Press a keyboard shortcut โ start recording
- Release โ send audio to Whisper for transcription
- Clean up the raw transcript (remove filler words)
- Convert spoken intent into a structured AI prompt
- Paste it into whatever app I'm in
The Full Stack
| Tool | What It Does | Cost |
|---|---|---|
| Groq (Whisper V3 Turbo) | Speech-to-text transcription | Free tier |
| GPT OSS 120B (via OpenRouter) | Cleans raw transcript, removes filler words | Free tier |
| Llama 4 Scout | Converts cleaned text โ structured prompt | Free tier |
| Python + xdotool | Keyboard shortcut + clipboard paste on Linux | $0 |
The total infrastructure cost: $0. All three AI models have free tiers generous enough to handle 50โ100 prompts a day without hitting limits.
The One Honest Caveat
Set Up Groq API
Created account, grabbed API key, tested Whisper V3 Turbo with a 10-second clip. Worked first try. Shocked.
Built the Recording Script
Python script using sounddevice. Keyboard shortcut with xdotool. Hold to record, release to transcribe. Raw transcript back in ~1.2 seconds.
Added the Cleaning Layer
Raw Whisper output was full of 'um', half sentences, repeated words. Piped it through GPT OSS 120B: 'Clean this transcript, preserve meaning, remove filler.' Night and day difference.
Added the Prompt Layer
The real unlock. Llama 4 Scout converts spoken intent into a structured AI prompt. 'Make this shorter' becomes a proper Claude instruction. This is where it got powerful.
Clipboard + Auto-paste
xdotool pastes the result wherever my cursor is. Works in Claude, Cursor, browser, anywhere. Done.
The Part Where It Almost Broke
The keyboard shortcut detection was flaky on Wayland. Spent 45 minutes debugging. Switched from pynput to xdotool + a background listener. Fixed. That's the messiest 45 minutes of the build โ everything else was clean.
Raw โ Clean โ Prompt
Clean: "Make this shorter and more punchy"
Prompt: "Rewrite the following text to be more concise and punchy. Remove filler words and unnecessary phrases. Preserve the core meaning."
The Real Number
Prerequisites
- Python 3.10+ installed
- Free Groq account (groq.com) โ takes 2 minutes
- Free OpenRouter account for GPT OSS 120B and Llama 4 Scout
- Linux (xdotool) or macOS (use pbpaste + Automator instead)
Install deps: pip install groq sounddevice numpy openai
Set env vars: GROQ_API_KEY and OPENROUTER_API_KEY in your .bashrc
Create the Python script (link below) โ 80 lines total
Add a keyboard shortcut: Settings โ Keyboard โ Custom Shortcut โ run the script
Test: hold shortcut, say something, release โ see the structured prompt paste
The Script
Build It This Weekend
3 hours. $0. 50 minutes saved daily.
Set up Groq account and test Whisper V3 Turbo with a sample clip.
Build the recording script + keyboard shortcut. Test basic record โ transcribe flow.
Add cleaning layer (GPT OSS 120B) + prompt conversion layer (Llama 4 Scout). Test end-to-end.
I build things like this every week.
Follow on Threads for real builds, free tools, and the messy parts nobody shows.
Follow @utkarsh.gen โ