Google Makes Speech-to-Text API More ‘Business Friendly’

Google has rolled out several major updates to its Cloud Speech-to-Text speech recognition technology.

The overhaul is the biggest since Google announced the service two years ago and is designed to make Speech-to-Text more useful for businesses.

Among the updates are pre-built models for transcribing phone calls and video, features that support automatic punctuation, and a new tagging and grouping mechanism for transcription workloads. In keeping with its business focus, the updates also come with a standard service level agreement (SLA) guaranteeing 99.9 percent availability.

“Access to quality speech transcription technology opens up a world of possibilities for companies that want to connect with and learn from their users,” wrote Google product manager Dan Aharon in an April 9 blog post. The update draws on Google’s latest machine learning research, he said.

Google announced Cloud Speech-to-Text in June 2016. The technology gives developers a way to convert audio to text. Google has described Speech-to-Text as an API that applies neural network models to the task of converting speech to text. The technology is designed to process both pre-recorded audio and real-time streaming audio, so it can work as well in a call-center setting as it does transcribing voicemail messages.

The API can be used to transcribe short and long-form audio in 120 languages and dialects in near real time. It is tailored to recognize and transcribe speech in real-world conditions involving multiple speakers and background noise. According to Google, Speech-to-Text can even transcribe proper nouns and appropriately format content such as dates and phone numbers.

Because Cloud Speech-to-Text is powered by Google’s machine learning technology, its transcription accuracy improves over time, the company has claimed.

Aharon listed several enterprise use cases for the technology including human-computer interactions, call-center analytics and automated transcriptions of phone calls, audio and video content.

As an example of the updated API’s capabilities, Aharon pointed to a TV broadcast of a game involving four speakers and a lot of background noise. Depending on the length of the game, Speech-to-Text would be able to transcribe the broadcast in about two hours, he claimed.

The pre-built models that Google has made available with the latest update include models tailored to specific use cases, such as transcribing the audio from video and transcribing phone calls.
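For developers, choosing one of these models is a matter of setting a field on the recognition request. The sketch below builds (but does not send) a JSON body for a synchronous `speech:recognize` REST call that selects a model and turns on automatic punctuation. The field names follow the public v1 REST reference as we understand it; some of them were first exposed in a beta version of the API, so treat the exact endpoint version as an assumption. The helper name, the `gs://example-bucket/...` URI and the default values are hypothetical, and authentication is not shown.

```python
import json

# Assumed endpoint for short, synchronous requests; longer recordings
# go through the long-running recognition method instead.
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_body(gcs_uri: str, model: str = "default",
                         language_code: str = "en-US",
                         sample_rate_hertz: int = 8000,
                         punctuation: bool = True) -> dict:
    """Hypothetical helper: return the JSON body for a recognize request.

    `model` selects a pre-built model such as "phone_call" or "video";
    verify the accepted values against the current API reference.
    """
    return {
        "config": {
            "encoding": "LINEAR16",  # uncompressed 16-bit PCM audio
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "model": model,
            "enableAutomaticPunctuation": punctuation,
        },
        # Audio stored in Cloud Storage is referenced by URI.
        "audio": {"uri": gcs_uri},
    }


if __name__ == "__main__":
    body = build_recognize_body("gs://example-bucket/call.wav",
                                model="phone_call")
    print(json.dumps(body, indent=2))
```

The body would then be POSTed to the endpoint with appropriate credentials; keeping request construction separate from the network call makes it easy to inspect what is being sent.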

The updates reflect feedback from organizations that have been testing Cloud Speech-to-Text since its launch in 2016, Aharon said. Information provided by customers of the technology has allowed Google to prioritize features and decide what to focus on next, he said.

Pricing for the API starts at $0.006 per 15 seconds of audio. The video models start at $0.012 per 15 seconds, though they are available at a discount through May 31.
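To put those rates in more familiar terms, the sketch below converts them into a per-file estimate. It assumes that each request is billed in whole 15-second increments, rounded up, and it does not model any free tier or the temporary video discount; the function and constant names are illustrative only.

```python
import math

# List prices quoted above, in US dollars per 15-second increment.
STANDARD_RATE = 0.006
VIDEO_RATE = 0.012
INCREMENT_SECONDS = 15


def estimate_cost(duration_seconds: float, rate_per_increment: float) -> float:
    """Estimate the list-price cost of transcribing one audio file.

    Assumption: usage is rounded up to the next 15-second increment.
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    increments = math.ceil(duration_seconds / INCREMENT_SECONDS)
    return round(increments * rate_per_increment, 6)


if __name__ == "__main__":
    hour = 60 * 60
    print(f"1 hour, standard model: ${estimate_cost(hour, STANDARD_RATE):.2f}")
    print(f"1 hour, video model:    ${estimate_cost(hour, VIDEO_RATE):.2f}")
    print(f"20 s clip, standard:    ${estimate_cost(20, STANDARD_RATE):.3f}")
```

At these rates an hour of audio works out to 240 increments, or $1.44 with the standard models and $2.88 with the video models. Under the rounding assumption, a 20-second clip is billed as two increments.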

The updates to the Speech-to-Text API are the second major announcement from Google’s Cloud AI speech products group in recent weeks. Last month, Google introduced Cloud Text-to-Speech, a speech synthesis API that converts text to speech.