May 23, 2026

Talk, Don't Type

I've started talking to my computer more than typing. So have a lot of others. Now that voice input is actually useful, the next question is where it belongs.

Calendar event creation is a good test case. Entering an event by hand is tedious: choose a calendar, pick a date, spin through time wheels, then add a title, location, and guests.

What if you could explain the event in plain language and let the app create the draft for you?

Start Simple

As always, I start simple and get something working, then build up from there.

For this example, let's build an iOS app. From there we can capture the audio, transcribe it, send it up to a server, and ask the model to hand us back something that fits an event schema.

iOS is a great fit here because Apple's Speech framework handles the transcription. I'll send that transcript through the server to OpenAI, have it return a structured JSON response, and then parse it in the app with Codable.

The first transcript can be simple:

I have lunch with Kristen next Tuesday at North Star Cafe.
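
As a rough sketch of what "fits an event schema" could mean, here is how a server might check the model's JSON before handing it back to the app. It is written in Python only because the post doesn't say what the server runs. The field names (title, start, location, attendees) and the sample reply are assumptions made up for illustration, not the real schema or real model output.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class EventDraft:
    # Hypothetical fields; the real schema may differ.
    title: str
    start: datetime
    location: Optional[str] = None
    attendees: List[str] = field(default_factory=list)


def parse_event_draft(raw_json: str) -> EventDraft:
    """Turn the model's JSON reply into a typed draft, failing loudly on bad input."""
    data = json.loads(raw_json)
    return EventDraft(
        title=data["title"],                          # required: KeyError if missing
        start=datetime.fromisoformat(data["start"]),  # required: ValueError if malformed
        location=data.get("location"),
        attendees=list(data.get("attendees", [])),
    )


# Hand-written sample reply for the lunch transcript (not real model output).
sample_reply = """
{
  "title": "Lunch with Kristen",
  "start": "2026-05-26T12:00:00",
  "location": "North Star Cafe",
  "attendees": ["Kristen"]
}
"""

draft = parse_event_draft(sample_reply)
print(draft.title, draft.start.isoformat(), draft.location)
```

On the iOS side, the same shape would map onto a Codable struct, so a reply that passes this check should also decode cleanly in the app.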

Make It Faster

Once that worked, the next question was speed. If single-event extraction is this reliable, the big model may be overkill: slower and more expensive than the task needs.

So I swapped it for a smaller one. Same transcripts through both, compare the output, look for the tradeoff.

Turns out the smaller model gave back nearly the same fields with noticeably less latency. The product got faster and the UI never changed. In a production setting, we'd want to verify the smaller model with evals before trusting it.
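
One simple way to put a number on "nearly the same fields" is to score how often the small model's output matches the big model's, field by field. This is a minimal sketch, not a real eval: the function names are hypothetical, the sample outputs are hand-written, and exact-match comparison is an assumption that would be too strict for free-text fields like titles.

```python
def field_agreement(big: dict, small: dict) -> float:
    """Fraction of the big model's fields that the small model matched exactly."""
    if not big:
        return 1.0
    matches = sum(
        1 for key, value in big.items()
        if key in small and small[key] == value
    )
    return matches / len(big)


def compare_runs(pairs: list) -> float:
    """Average agreement across (big_output, small_output) pairs."""
    scores = [field_agreement(big, small) for big, small in pairs]
    return sum(scores) / len(scores)


# Hand-written stand-ins for two transcripts run through both models.
pairs = [
    (
        {"title": "Lunch with Kristen", "location": "North Star Cafe", "attendees": ["Kristen"]},
        {"title": "Lunch with Kristen", "location": "North Star Cafe", "attendees": ["Kristen"]},
    ),
    (
        {"title": "Dentist", "start": "2026-06-02T09:00:00", "location": None, "attendees": []},
        {"title": "Dentist appointment", "start": "2026-06-02T09:00:00", "location": None, "attendees": []},
    ),
]

print(compare_runs(pairs))
```

The second pair shows the limit of exact matching: "Dentist" and "Dentist appointment" count as a miss even though a user would probably accept either.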

Multiple Events

A single event is already much faster by voice.

But what if we want to push that speedup much further, closer to 50x or 100x over manual entry? What if the user wants to enter multiple events at once?

That is the next step in the feature: can we handle multiple events through a single transcript?

Using the same pipeline as above, we could send a transcript like this:

I'm working at the ice cream shop June 8th, June 12th, um, June 18th, and June 24th. These are all eight-hour shifts starting at 11 a.m. Oh, and this is at Mitchell's Ice Cream.

For the smaller model, that transcript may be too complicated. It has to separate several dates, apply the shared shift details to each one, and return multiple event drafts. So for requests like this we switch back to the larger model, which is better suited to that kind of multi-step reasoning.
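
To make the difficulty concrete, here is the output a correct extraction of the shift transcript would need to contain, built by hand rather than by a model. The helper name, the event title, and the year are assumptions: the transcript only gives a month and day, and 2026 is taken from the date of this post.

```python
from datetime import date, datetime, time, timedelta


def expand_shifts(title, location, days, start_time, hours):
    """Build one event draft per date, reusing the details the speaker said once."""
    drafts = []
    for day in days:
        start = datetime.combine(day, start_time)
        drafts.append({
            "title": title,
            "location": location,
            "start": start.isoformat(),
            "end": (start + timedelta(hours=hours)).isoformat(),
        })
    return drafts


# The year is an assumption; the speaker only said "June 8th", and so on.
shift_days = [date(2026, 6, d) for d in (8, 12, 18, 24)]
drafts = expand_shifts(
    "Ice cream shop shift",   # hypothetical title
    "Mitchell's Ice Cream",
    shift_days,
    time(11, 0),              # "starting at 11 a.m."
    8,                        # "eight-hour shifts"
)

for draft in drafts:
    print(draft["start"], "->", draft["end"])
```

Nothing in the transcript states an end time. The model has to derive 7 p.m. from the start time and the shift length, then repeat that for all four dates while ignoring the "um" and attaching the location that arrives last.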

Add Routing

Looking at this through a product lens, users will most likely create single events the majority of the time. But now every request is stuck with the higher-latency model, just because we need to support multiple events.

The fix here is to create a model router.

The app still sends one transcript, but the server doesn't dive straight into the full extraction. First it makes a quick routing call with a small model.

That call has one job: is this a single event or a multi-event request? Single goes to the fast model. Multi goes to the bigger model.
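
The routing step can be sketched as a small function that takes the transcript plus whatever does the classifying. Everything named here is hypothetical: the model names are placeholders rather than real model IDs, and the keyword classifier is a toy stand-in for the small-model call so the example runs without a network.

```python
from typing import Callable

FAST_MODEL = "small-model"    # placeholder names, not real model IDs
LARGE_MODEL = "large-model"


def choose_model(transcript: str, classify: Callable[[str], str]) -> str:
    """Pick the extraction model from the router's one-word answer."""
    label = classify(transcript).strip().lower()
    # Anything other than a clear "single" goes to the larger model,
    # so a confused router costs latency rather than accuracy.
    return FAST_MODEL if label == "single" else LARGE_MODEL


def keyword_classifier(transcript: str) -> str:
    """Toy stand-in for the small-model routing call."""
    text = transcript.lower()
    hints = ("these are all", " each ", "every ", "shifts")
    return "multi" if any(hint in text for hint in hints) else "single"


lunch = "I have lunch with Kristen next Tuesday at North Star Cafe."
shifts = (
    "I'm working at the ice cream shop June 8th, June 12th, um, June 18th, "
    "and June 24th. These are all eight-hour shifts starting at 11 a.m."
)

print(choose_model(lunch, keyword_classifier))
print(choose_model(shifts, keyword_classifier))
```

Falling back to the larger model on an unclear answer is a design choice of this sketch, not something the post prescribes. The reasoning is that a wrong "multi" only makes one request slower, while a wrong "single" could drop events.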

Show The Thinking

There is still room to make the experience better. Our multiple-events call may run for a few seconds, so what do we do in the meantime? We do not want to just show the user a spinner.

Using LLMs once more, we can make a parallel call that extracts quick bites of information.

For this, we can use a lightweight model, potentially GPT-5.4 nano. Its only goal is to return little chips of information quickly, so we can show the user the types of details we are going to extract.
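
Here is a minimal sketch of the parallel call, using Python's asyncio with sleeps standing in for the two model calls. The function names, the chip strings, and the timings are all made up for illustration. The point is only the ordering: both calls start together, and the chips reach the UI before the full extraction finishes.

```python
import asyncio


async def extract_chips(transcript: str) -> list:
    """Stand-in for the lightweight call: quick, shallow details."""
    await asyncio.sleep(0.01)  # pretend this is the fast model
    return ["4 dates", "11 a.m.", "Mitchell's Ice Cream"]


async def extract_events(transcript: str) -> list:
    """Stand-in for the slower full extraction."""
    await asyncio.sleep(0.05)  # pretend this is the larger model
    return [{"title": "Ice cream shop shift"} for _ in range(4)]


async def handle(transcript: str, show) -> list:
    """Start both calls together and surface chips as soon as they arrive."""
    events_task = asyncio.create_task(extract_events(transcript))
    chips = await extract_chips(transcript)
    show(("chips", chips))      # the UI can render these right away
    events = await events_task  # already running while the chips were shown
    show(("events", events))
    return events


updates = []
asyncio.run(handle("I'm working at the ice cream shop...", updates.append))
print([kind for kind, _ in updates])
```

In a real app the `show` callback would be whatever streams updates to the client. How that happens (server-sent events, a WebSocket, two separate requests) is left open here.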

The goal is to improve every part of the experience. We should never leave users staring at a spinner without context.

Voice vs. Typing

What started as one model call is now a small product system. The app transcribes on device, routes on the server, chooses the right model, extracts the event structure, and keeps the user informed while slower requests finish.

Now compare the two paths:

Manual entry means opening the calendar, typing a title, setting a date, scrolling through the time wheels, adding a location, inviting people, and repeating all of it for every event. That takes most people 30 seconds to a minute, and with multiple events it can easily become several minutes.

Voice entry means saying it once and confirming the result. That can happen in under 10 seconds.

So voice isn't just nicer here. It's several times faster, and it scales to multiple events without asking you to do any more work.

None of this means every feature should work this way. It works here because transcription is accurate, LLMs can turn loose speech into structure, and a little orchestration keeps the experience fast.

For the right tasks, talking really is better than typing. You just have to let the UX and the orchestration grow up together.