A simple browser extension that can transcribe and translate audio on any web page. The extension works by capturing audio on the web page and streaming it to Deepgram's Speech-to-Text API for transcription; the resulting transcript is then sent to Azure's Translator service for translation into a target language.
Here's a very high-level overview of the architecture.
Here is a demo showing the extension transcribing and translating audio content from various sites. I was impressed by how quickly both Deepgram and Azure were able to transcribe and translate the audio. Chaining real-time streams through two separate services isn't an ideal solution, but as a prototype it performed pretty well.
I believe some target use cases for an extension like this would be:
older content that no one is going to update with new transcriptions or translations
content from smaller organizations/individuals that lack the resources/ability to provide multiple translations
when you need translation into a less common language
My goal for this extension is to make content more accessible.
Submission Category:
I'm submitting this under the "Accessibility Advocates" category, based on its simple function of making content more accessible.
Simple browser extension that can transcribe and translate any web page with audio content.
Transcribe and Translate
A simple browser extension that can transcribe and translate audio from any web page. The extension works by capturing audio on the web page, streaming it to Deepgram's Speech-to-Text API for transcription, and then sending the transcript to Azure's Translator service for a final translation into a target language. It is my entry to the Deepgram + DEV hackathon.
Development Setup
The project contains two parts: the main extension source and an API service proxy responsible for requesting short-lived access tokens for the extension to use.
Here is an example of how to stream audio data to Deepgram over their WebSocket endpoint.
```javascript
const socket = new WebSocket('wss://api.deepgram.com/v1/listen', ['token', token.key]);

mediaRecorder.ondataavailable = function (evt) {
  if (socket && socket.readyState === socket.OPEN) {
    socket.send(evt.data);
  }
};

socket.onmessage = function (results) {
  // parse the results from Deepgram
};
```
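To flesh out that `onmessage` handler, here's a rough sketch of pulling the transcript text out of a streaming result. The response shape (`channel.alternatives[0].transcript`) is how Deepgram's streaming API returned results at the time of writing, but verify it against their docs:

```javascript
// Sketch: extract the transcript text from a raw Deepgram streaming result.
// Returns an empty string for messages without a transcript.
function parseTranscript(rawMessage) {
  const data = JSON.parse(rawMessage);
  const alternatives = data.channel && data.channel.alternatives;
  return (alternatives && alternatives[0] && alternatives[0].transcript) || '';
}
```

Guarding each level of the lookup keeps the handler from throwing on keep-alive or metadata messages that don't carry a transcript.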
Azure Translator
I used Azure's Translator service to translate the transcripts coming out of Deepgram STT into a target language. Again, similar to Deepgram, you'll need to sign up for an account and grab an API key.
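As a sketch of what the translation call could look like: the endpoint, headers, and response shape below follow Azure Translator's v3 REST API, while the `translate` helper name and its key/region parameters are placeholders of my own. The request building is split out so it can be tested without a network call.

```javascript
// Build the Azure Translator v3 request; split out so it's easy to test.
function buildTranslateRequest(text, targetLang) {
  return {
    url: 'https://api.cognitive.microsofttranslator.com/translate' +
         '?api-version=3.0&to=' + encodeURIComponent(targetLang),
    body: JSON.stringify([{ Text: text }]),
  };
}

// Placeholder helper: supply your own subscription key and region.
async function translate(text, targetLang, key, region) {
  const req = buildTranslateRequest(text, targetLang);
  const res = await fetch(req.url, {
    method: 'POST',
    headers: {
      'Ocp-Apim-Subscription-Key': key,
      'Ocp-Apim-Subscription-Region': region,
      'Content-Type': 'application/json',
    },
    body: req.body,
  });
  // Azure returns one result per input text, each with a translations array
  const results = await res.json();
  return results[0].translations[0].text;
}
```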
This was my first time developing a browser extension, so I'll do my best to explain how it works. I used the following components:
background scripts are where you put long-running or complex logic, or code that may need access to the underlying browser
content scripts are sandboxed scripts that run in the context of a web page (e.g., a tab); they are mostly used for displaying content and collecting data from the web page.
a browser action is an icon representing your extension, and a common way for users to interact with it.
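All of these pieces get declared in the extension's manifest.json. Here's a minimal sketch of what ours might look like; the file names and permission list are illustrative, and the Manifest V2 format matches the `chrome.browserAction` and `chrome.tabCapture` APIs this extension relies on:

```json
{
  "manifest_version": 2,
  "name": "Transcribe and Translate",
  "version": "1.0",
  "permissions": ["activeTab", "tabs", "tabCapture"],
  "background": { "scripts": ["background.js"] },
  "content_scripts": [
    { "matches": ["<all_urls>"], "js": ["content.js"] }
  ],
  "browser_action": { "default_title": "Transcribe and Translate" }
}
```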
// Sending data from web page to an extension's content scriptwindow.postMessage({type:'some-action',data:...});// Listening to a web page from a content scriptwindow.addEventListener('message',(evt)=>{...});
In summary, this was just a brief overview of some of the browser extension features and components used in our extension; later we'll see how they are used in our implementation.
Functional Components
So far we've seen how to set up and use both Deepgram's Speech-to-Text and Azure's Translator services. We also touched briefly on some common browser extension components and their functions. Next, we'll define how our extension should function:
user clicks on the browser action, which activates/opens our extension if it's not running or closes it if it's already running
detect when we've opened or switched to a tab, and stop our extension if it was open for a previous tab
create the translation results container and display on the web page
start capturing audio for the current web page
create the MediaRecorder to listen to the audio
create the WebSocket to send the audio to Deepgram
send streaming audio to Deepgram whenever the MediaRecorder captures any audio
send transcript to Azure Translator whenever we get a transcript
send the translation results to content script for display whenever we get a translation
This list is not exhaustive, but it describes the key functionality of our extension. Let's take a look at each item to see how it could be implemented.
Activating the Extension
To activate the extension from the browser action, we add a listener—when triggered, it will either open or close our extension.
```javascript
chrome.browserAction.onClicked.addListener(function () {
  if (someStateAboutCurrentTab.isOpen) {
    // ... handle closing the extension
  } else {
    // ... handle opening the extension
  }
});
```
Handle Opening or Switching to a Tab
Browsers can have multiple tabs, but our extension only works on one tab at a time (i.e., the currently active tab). If there was a previous tab, we should always force the extension closed, regardless of whether it was open for that tab, just to be safe.
To close the extension, we should clean up resources (e.g., the WebSocket and MediaRecorder) and notify the content script to remove the translation results container.
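A sketch of that cleanup might look like the following; the `state` object and its field names are placeholders of my own for whatever the background script tracks per tab:

```javascript
// Sketch: release audio/network resources when the extension closes for a tab.
function cleanupResources(state) {
  // Stop recording if the MediaRecorder is still running
  if (state.mediaRecorder && state.mediaRecorder.state === 'recording') {
    state.mediaRecorder.stop();
  }
  // Close the Deepgram WebSocket if it's still open
  if (state.socket && state.socket.readyState === state.socket.OPEN) {
    state.socket.close();
  }
  // Return fresh state for the next tab
  return { mediaRecorder: null, socket: null };
}
```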
The purpose of the content script is to display our translations to the user, and to notify the background script when the user prefers a new target translation language.
Here's an example of how we could display/hide translation results.
```javascript
// Create the div that will display the translations, and add it to the body
function createTranslationDisplay() {
  const template = document.createElement('template');
  template.innerHTML = '<div id="translations"></div>';
  document.getElementsByTagName('body')[0].append(template.content.firstChild);
}

function hideTranslationDisplay() {
  document.getElementById('translations').style.display = 'none';
}

function showTranslationDisplay(translation) {
  const div = document.getElementById('translations');
  div.style.display = 'block';
  div.innerText = translation;
}
```
Here's how a content script can handle receiving new translations, or events (e.g., open, close), from the background script.
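As a sketch, assuming a small message protocol of our own (the type values `'open'`, `'close'`, and `'translation'` are placeholders), the routing can live in a pure function that's easy to test, wired up to `chrome.runtime.onMessage` and the display helpers shown earlier:

```javascript
// Decide what to do with a message from the background script
function routeMessage(message) {
  if (message.type === 'open') return { action: 'create' };
  if (message.type === 'close') return { action: 'hide' };
  if (message.type === 'translation') return { action: 'show', text: message.data };
  return { action: 'ignore' };
}

// Wire the router to the extension messaging API (browser only)
if (typeof chrome !== 'undefined' && chrome.runtime) {
  chrome.runtime.onMessage.addListener(function (message) {
    const route = routeMessage(message);
    if (route.action === 'create') createTranslationDisplay();
    if (route.action === 'hide') hideTranslationDisplay();
    if (route.action === 'show') showTranslationDisplay(route.text);
  });
}
```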
To capture audio from the active tab, we use the chrome.tabCapture API:

```javascript
chrome.tabCapture.capture({ audio: true, video: false }, function (stream) {
  // Once a stream has been established, we need to make sure it continues
  // on to its original destination (e.g., your speaker device),
  // but we can now also record it
  const audioContext = new AudioContext();
  audioContext.createMediaStreamSource(stream).connect(audioContext.destination);
});
```
Note: our transcriber and translator extension was developed against Chrome-specific extension APIs, so it will only work in Chromium-based browsers (e.g., Chrome, Edge).
Recording Audio Stream for Transcribing
We've just seen how to capture the audio from any web page; now our extension needs to use that stream to start the transcription process.
```javascript
// Receives an audio stream from tabCapture
function startAudioTranscription(stream) {
  mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

  // Start a connection to Deepgram for streaming audio
  socket = new WebSocket(deepgramEndpoint, ['token', key]);
  socket.onmessage = handleTranscriptionResults;

  mediaRecorder.ondataavailable = function (evt) {
    if (socket && socket.readyState === socket.OPEN) {
      socket.send(evt.data); // audio blob
    }
  };
}
```
Transcriptions to Translation
Next, on each successful transcription from Deepgram's Speech-to-Text service, we take the transcript and send it off to Azure's Translator service.
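A sketch of that hand-off, with the translator and the message sender passed in as functions so the flow can be exercised without network or browser APIs; the function names and message shape are placeholders of my own:

```javascript
// Sketch: forward a final Deepgram transcript to the translator, then to the tab
async function onTranscript(transcript, translateText, sendToTab) {
  if (!transcript || transcript.trim() === '') return false; // skip empty interim results
  const translation = await translateText(transcript);       // call Azure Translator
  sendToTab({ type: 'translation', data: translation });     // e.g. chrome.tabs.sendMessage
  return true;
}
```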
Phew, we made it :-). Thanks for checking out my submission. I hope you liked it and, more importantly, learned a little about Deepgram and how browser extensions work.