Most speech recognition apps have no trouble transcribing a native speaker being recorded with a pro microphone in a quiet room. This isn’t a challenge.
So to test them more thoroughly, I created a nightmare recording of two non-native speakers with loud city background noise.
How did they fare?
Let’s find out.
Otter was one of the most frequently mentioned solutions when we asked for suggestions on Twitter and in the Ahrefs community. And for good reason. It is easy to set up, has an intuitive interface, and offers clear pricing.
What sets it apart from the rest is its ability to record and transcribe online meetings: you simply paste the meeting URL. But you can also import a video/audio file or record audio right in the app.
You can also connect your calendar so that you never miss a meeting.
I got decent results, but there was a lot to edit too.
It didn’t get some names right. But I can’t blame any tool for not picking up “Ahrefs” or “Tim Soulo” 100% of the time.
One thing I found is that after it notified me that the transcription was ready, it was sometimes still doing something in the background (adjusting time stamps, tagging speakers, etc.). Like a student still scribbling on a test paper while passing it to the teacher.
You can start for free and upgrade to a paid plan later. You can import up to three files and record 290 minutes of meetings before you need to upgrade (as of April 2023).
Setting up an account was a no-brainer. I found the interface easy to navigate as well. One personal remark is that it felt a little too “cold” to use since I saw things like “Place Order,” “Billing,” and “Invoice” way too often.
You might get the impression that it was designed by an accounting team (as opposed to Descript, which comes next in this roundup).
Besides auto-generated transcripts, Rev offers live captions for Zoom meetings. You also have the option to place an order for human transcriptions.
Poor audio with city noise was a bit too much for Rev. Some words were missing, while others were misrecognized. As a result, some paragraphs didn’t make much sense, while others were fine.
You can transcribe the first audio file (up to 45 minutes) for free. I got a bill for $1.25 with a discount that resulted in a total of $0.00. Thanks, accounting team. 😉
Rev also has a 14-day trial of its paid plan. But that was tricky to find. To locate it, you need to go to the footer of the homepage and look for it under “Services.”
Descript welcomed me by name (which was a nice coincidence). The main thing you have to know is that it is a standalone app rather than a web service. It is much more than a speech-to-text converter. It’s basically a video editing tool. And there’s definitely a learning curve. But thankfully, onboarding is extremely funny and engaging.
As I mentioned, Descript is more of a video editing tool that is good at transcribing. I’d call it “Canva for video/captions.” You can add B-roll, effects, animations, and more.
You can easily drag and drop and basically produce a complete video with its help. But if you just need a transcript or captions of a video or audio, you can do that too.
The results for my sample audio were quite muddy. At times, it had difficulty recognizing abbreviations (e.g., SEO). I also had a problem with removing filler words like “uh” and “um.”
I found that if I didn’t choose an option to remove them, they, um, just stayed there even though I didn’t need them most of the time. But if I did choose to remove them, it occasionally ate up parts of other words, causing even more trouble.
Also, it couldn’t recognize parts that a human being would have no problem understanding just from context, e.g., “Jack of all trades” became “jackal, trades.”
On the bright side, I believe you can still understand what the text is about.
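If you would rather clean up filler words yourself after exporting a transcript, the "eating parts of other words" problem can be avoided by matching whole words only. Here is a minimal sketch of that idea. It is not how Descript (or any other tool in this roundup) does it, and the filler list and function name are my own assumptions for illustration.

```python
import re

# Hypothetical filler list, for illustration only. Extend it to taste.
FILLERS = ("uh", "um", "erm")

# \b anchors each match to a whole word, so the "um" inside "umbrella"
# or "humming" is left alone. A trailing comma or period and any
# following whitespace are removed together with the filler.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)


def remove_fillers(text: str) -> str:
    """Return text with standalone filler words stripped out."""
    cleaned = _FILLER_RE.sub("", text)
    # Collapse any double spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Calling `remove_fillers("they, um, just stayed there")` drops the filler and its comma but touches nothing else. Note that this simple version does not re-capitalize a sentence that started with a filler.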
You can start with basic functions for free and upgrade if needed.
MacWhisper is a transcription tool powered by Whisper, an automatic speech recognition (ASR) system developed by OpenAI, the same company that brought us ChatGPT.
As OpenAI states on its website:
Whisper is trained on 680,000 hours of multilingual and multitask supervised data collected from the web.
Whisper is not something you can simply “run” as is. What’s more, it is pretty complicated to set up if you do want to run it yourself. GitHub, Python: you get the gist.
Luckily, there are tools like MacWhisper that take this off your shoulders and let you use the power of AI in a simple user interface.
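For the curious, here is roughly what the do-it-yourself route looks like once everything is installed. This is a rough sketch, assuming the open-source `openai-whisper` Python package and `ffmpeg` are already set up on your machine. The helper names (`format_timestamp`, `format_segments`, `transcribe`) are my own, and the model names are the size presets that package ships with.

```python
def format_timestamp(seconds: float) -> str:
    """Turn a number of seconds into HH:MM:SS."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments) -> str:
    """Render Whisper-style segments as one time-stamped line each."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    )


def transcribe(path: str, model_name: str = "small") -> str:
    """Transcribe an audio file; model_name can be e.g. "small" or "large"."""
    # Imported here so the helpers above work without the package installed.
    import whisper  # pip install openai-whisper (also needs ffmpeg)

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path)         # dict with "text" and "segments"
    return format_segments(result["segments"])
```

With that in place, something like `print(transcribe("interview.mp3", "large"))` would print a time-stamped transcript (the file name is a placeholder). As with the apps, the larger model takes noticeably longer to run.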
Just plain speech-to-text recognition with time stamps. Unfortunately, it doesn’t auto-tag the speakers.
When you run the tool, you have to choose a “model” to work with. Basically, the lighter the model, the quicker it will run. But larger models will produce better results. Also, in MacWhisper, those larger (better but slower) models are only available in the paid version.
I decided to start with the free “small” model, which was stated to have “normal speed with good accuracy.”
It was OK, but no better than the competitors. I assumed it would work fine with high-quality audio, but not with the horrible examples I fed to it.
“AI is overrated,” I thought. But before closing the Mac and switching back to my dear Windows PC, I decided to give the “large” model a try.
And you know what, AI is not overrated. I found the results to be much better than anything else.
The transcript was really, really good. It even got things like “Ahrefs” and “SaaS” right! Though still not 100% of the time.
You can run smaller models for free. For a large model, you’ll need to purchase a license.
This tool is the easiest to use. Simply drag and drop your file—then it’s ready. It takes some time to process, though.
Nothing besides downloading a transcription.
My first impression was that the results were perfect because, visually, it delivered confident-looking text.
But after proofreading, I realized that it simply did not include the parts it failed to recognize—sometimes several words in a row.
It’s free to use.
Premiere Pro is not exactly a “transcription tool” but rather video editing software. I’m including it because I assume that some companies may already have it in their arsenal (like we do).
To get to the transcription feature in Premiere Pro, just go to the “Captions and graphics” workspace and click “Create transcription.”
If we take only speech recognition into account here, what it does well is creating precise time stamps, auto-tagging the speakers and, if needed, automatically adding an editable captions track to a video project.
Let’s be straightforward: I found the noisy audio transcript to be a failure. I couldn’t comprehend what people were talking about in the first place.
Still, I think this feature can be really helpful if you are creating captions from high-quality audio. I used it myself several times and had nothing to complain about when the recording quality was good.
You need an Adobe Creative Cloud subscription to use Premiere Pro.
While signing up and uploading files is rather straightforward, you have to spend some time answering questions about you and your company before you can finally get to the tool itself. And no, you can’t skip typing in your company name, your role, and your company size.
But once you get through this, the interface is clean and intuitive.
You can generate a transcript or captions for video or audio. There is also an option to request a manual review of the transcript. Alternatively, you can generate subtitles in a different language, so you have transcription and translation in one click.
Happy Scribe did a really good job transcribing the audio. It had no problem with words like “SEO” and “SaaS” (obviously the weakest point for many tools). It could also auto-tag the speakers, which might be helpful in certain situations.
I could test one file for free. After that, I would need to buy credits to be used for each minute of video or audio transcribed.
Sonix is a tool for automatic transcriptions, translations, and integration with meeting apps.
Besides meeting integration, which is almost a given for most tools, AI summary generation is an interesting feature (in beta as of April 2023). Even so, I already got impressive results from it.
You also get some extra tools to work with video captions—a timeline view and an option to split captions into several lines. You can also import an existing transcript, and Sonix will sync it with the audio.
Sonix has a custom vocabulary feature. I found that it helped a bit with names like “Tim Soulo” and “Ahrefs,” but it didn’t work 100% of the time. It mostly did well. But at times, it mistook SEO for CEO and returned the word “Excel” seemingly out of nowhere.
The transcript made sense in general but required quite a lot of edits if it needed to be perfect.
Sonix has a free trial for 25 minutes of transcriptions. After that, you need to purchase pay-as-you-go credits or get a subscription.
Notta is yet another transcription service that works for both real-time meetings and existing recordings.
Besides transcription, Notta focuses on streamlining certain workflows and offers features such as calendar sync and scheduler (in beta as of April 2023).
Background noise and poor audio quality were not deal breakers for Notta. The transcription results turned out mostly OK but still had some problems.
Sentence structure was sometimes a bit weird, certain words went missing, and my favorite “Jack of all trades” part wasn’t that neat this time.
Another thing worth noting is that, for some reason, it failed to recognize two speakers, and the whole interview was tagged as “Speaker 1.”
You can start with a free basic subscription and try a three-day trial of the paid plan, Notta Pro.
As you can see, there are plenty of tools to choose from. Still, it seems that OpenAI stirred things up a bit by releasing a free ASR (automatic speech recognition) system, which I found to be considerably more capable than the others.
But pure speech recognition quality is just one factor. Maybe you do need to record your Zoom meetings (Otter), work with captions in a large video project (Premiere Pro), or quickly create a Canva-style video (Descript).
Also, I need to stress that I was trying to push these tools to the edge by giving them the worst-case scenario recording. For more natural uses, the differences in the outcome might be much less noticeable.
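My verdicts in this roundup came from reading the transcripts, not from a formal metric. If you want to put a number on the differences with your own recordings, one common option is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the tool's output into a hand-corrected reference, divided by the length of that reference. Below is a minimal sketch. The function names and the simple lowercase-and-strip-punctuation normalization are my assumptions, and I did not use this for the results above.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring “jackal, trades” against “Jack of all trades” gives 0.75: one substituted word and two missing ones out of four. Lower is better, and 0.0 means a word-for-word match.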
It’s great to see that there are so many options out there, and I hope this review will help a bit in finding the one that is perfect for you.
Got questions? Ping me on Twitter.