From the course: OpenAI API and MCP Development
Audio API: Transcribe audio
Do you remember listening to a great podcast or a meeting and wishing you had kept notes? Since November 2022, OpenAI has provided access to a wide set of language models with different capabilities and uses, including the Speech-to-Text API, which can do exactly that: convert speech into written text. And you have two options: the Speech-to-Text API provides two endpoints, transcriptions and translations.

These are the types of use cases you can build with the audio API. You can build voice agents, meaning chat applications with very human-like vocal features. You can also transcribe and translate audio. Speech to text is part of the specialized models that you can find all the way at the bottom right here, and it provides those two endpoints.

For the next example, we're going to split the demonstration into two parts. First, we'll look at one example that transcribes speech into written text. Next, we'll see how to translate a speech audio file into text: no matter what language the original audio is in, it's going to be translated into English. And you have a list of the supported languages here.

So let's go back here to find out how to proceed with transcription first. We're going to copy lines 6 to 11 and go find the corresponding exercise files and starter project. For this example, you have access to helper functions in the utils file. First, there's a function to save files: we're going to need to save some temporary files in the files directory. Right below, you'll find the function that converts speech to text. This is where you'll find the endpoint, on line 42, that transcribes an audio file into written text. It takes one audio file as input.
And we use the whisper-1 language model. Once the transcription completes successfully, we return it and then display it in our app. So let's actually look at how the application is set up. We have some starter boilerplate code, and for this example we're using Streamlit, which gives us a nice user interface for our application. We'll be able to use a file uploader to upload an audio file, run the transcription process, and then display the written text on the user interface. So we have a file uploader and also a Submit button. This is where the file gets processed, and this is where we want to run our function to convert speech to text. We're going to have one temporary file path once you upload the file; I'll show you that next.

All right, let's actually run it. And remember that you always have access to a readme file for every starter project. For this example, you'll see a presentation of every library we use, including Streamlit, and below, all the way at the bottom, you'll see how to run this app using the Streamlit command. So let's do that. Let's go back to main, open the terminal, and run this app with streamlit run main.py, and we'll see what it looks like. Here you have a drag-and-drop feature, and you can browse files from here. You have a few files available as media sources, and we're going to use this one, which is an English testimonial. Let's open it and then submit it. What we want is to process the file to transcribe from speech to text. For now, we can just read that it was successfully processed, but what we actually want is to display that exact same text on the user interface.
So for that, we're just going to use a few features and components from the Streamlit library. First, st.markdown to specify that this is a transcription, like this. Next, we're going to use st.markdown again, like this, to display the original text. It's not going to be a text area; instead, I'd like to display this in blue, so it's going to be a paragraph. The nice thing about being able to add some HTML is that we can use inline styles, like this. So let's use this and try again. I'm going to refresh. All right, this time let's try with this testimonial file, which is in German, and submit. Now you can see that the transcription is actually in German.

Okay, so let me go back. What I'd also like to do is listen to the original audio file. For that, I'm going to use the temporary file path, the one we use when you upload the file. Once you upload a file in the app, it creates a temporary file path, and we'll use it to play back the original audio. Okay, let's run the app again and see what we have now.

All right, let's try another one. We also have this audio file, which is in French, the original language here. Let's listen: "Our experience with Connecticut was only positive." Now I can verify that the original audio sample matches the text written here, so I know it's correctly transcribed. What would be nice for any English user and reader is to be able to read this original French text in English. That's why we're going to use another endpoint, the translation endpoint, again with the audio API, to translate the audio file this time.