The other day I was sent an audio recording of a meeting I had been part of, but no textual transcript was available.
With a relatively small amount of hacking, it is now very easy to use the OpenAI Whisper model, via the transcriptions endpoint, to generate a high-quality text transcript from such a file.
This example uses Python; it’s a great language for this sort of ad-hoc tool.
Process
The process breaks down into three logical parts:
- install all dependencies
- prepare the audio file (in this case, splitting it to stay within the 25MB file-size limit of the Whisper transcriptions endpoint)
- run the prepared files via OpenAI and reassemble a single text file
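Assuming the two scripts below are saved as split.py and transcribe.py, the whole run for a hypothetical recording called meeting.m4a boils down to two commands:
python split.py meeting.m4a
python transcribe.py "meeting_part*.m4a"
The quotes around the wildcard matter: the transcription script expects the pattern itself as its single argument and expands it with glob, rather than relying on the shell to expand it.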
Environment and Dependencies
Because the source was in m4a format, it was necessary to install FFmpeg on my machine.
The suggested methods are:
- Windows
winget install ffmpeg
- macOS
brew install ffmpeg
- Linux
sudo apt-get install ffmpeg
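A quick way to confirm FFmpeg ended up on the PATH (pydub calls out to the ffmpeg executable rather than linking against it) is:
ffmpeg -version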
I then needed a Python environment with the key dependencies added.
I mainly use conda to manage environments, so I set up a new environment like this:
conda create -n myenv python=3.12
conda install -n myenv openai
conda install -n myenv -c conda-forge pydub
conda activate myenv
The last preparation step was to create an API key in my OpenAI account and add it to the environment variables for my new conda environment:
conda env config vars set OPENAI_API_KEY=VALUE_OF_MY_KEY
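Before doing anything lengthy, it is worth a quick sanity check that the key is actually visible to the openai library. This minimal snippet makes one cheap authenticated call and will raise an authentication error if the key is missing or wrong:
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
models = client.models.list()  # any successful call confirms the key works
print(f"Key OK, {len(models.data)} models visible")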
Splitting the audio
The meeting was over an hour long, so to split the audio into smaller chunks I used this script, which makes use of the pydub library:
"""
This module splits an audio file into multiple segments based on a maximum segment length.
The default maximum segment length is 15 minutes.
Output format is m4a.
Usage: `python split.py <path_to_audio_file>`
Your environment must meet the following requirements:
- pydub Python package installed
- FFmpeg installed on your platform and available on the PATH
  - Windows `winget install ffmpeg`
  - macOS `brew install ffmpeg`
  - Linux `sudo apt-get install ffmpeg`
"""
import sys
import os
from pydub import AudioSegment
import math
def split_audio(file_path, segment_length=15*60*1000):  # 15 minutes in milliseconds
    # Load the audio file
    audio = AudioSegment.from_file(file_path)
    # Get the total length of the audio file
    total_length = len(audio)
    # Calculate the number of segments needed
    num_segments = math.ceil(total_length / segment_length)
    # Loop through and create each segment
    for i in range(num_segments):
        start_time = i * segment_length
        end_time = min((i + 1) * segment_length, total_length)  # Ensure the last segment does not exceed total length
        segment = audio[start_time:end_time]
        # Generate the output file name
        output_file = f"{file_path[:-4]}_part{i+1}.m4a"
        # Export the segment as an m4a file
        segment.export(output_file, format="ipod")  # see https://github.com/jiaaro/pydub/issues/755
        print(f"Exported: {output_file}")

if __name__ == "__main__":
    source_path = os.path.abspath(sys.argv[1])
    split_audio(source_path)
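Because the command-line handling sits behind the __main__ guard, the function can also be imported and called with a different chunk size if 15-minute segments still exceed the 25MB limit (a higher-bitrate source, say). A short sketch, with a hypothetical file name:
from split import split_audio

split_audio("meeting.m4a", segment_length=10 * 60 * 1000)  # 10-minute chunks instead of 15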
Speech to text
The second script takes a wildcard list of audio files and runs them through the OpenAI transcriptions endpoint using the official openai Python library.
The results are saved to text files, and all the text files are then concatenated into one output.
"""
This module reads audio files and transcribes them using OpenAI's API.
Usage: `python transcribe.py <input_wildcard>`
Where input_wildcard is a wildcard pattern that matches the audio files to be transcribed.
For example, if the audio files are named "test/audio1.m4a", "test/audio2.m4a", etc.,
you can use the wildcard "test/audio*.m4a" to transcribe all the files.
Each transcription is saved to a text file with the same name as the audio file, but
with a .txt extension.
All the text files are then combined into a single output file named "output_file.txt".
Your environment must meet the following requirements:
- OpenAI Python package installed
- set OPENAI_API_KEY environment variable to your OpenAI API key
"""
import os
import sys
import glob
import shutil
from openai import OpenAI
client = OpenAI()
input_wildcard = sys.argv[1]
input_files = sorted(glob.glob(input_wildcard))  # sort so the segments stay in order
print(f"Transcribing from {input_files}")
output_files = []
for x in input_files:
    print("Transcribing file " + x)
    # Send the audio to the transcriptions endpoint and ask for plain text back
    with open(x, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
            language="en"
        )
    output_file = f"{x[:-4]}.txt"
    output_files.append(output_file)
    with open(output_file, "a", encoding="utf-8") as f:
        print(transcription, file=f)
    print(f"Transcription written to: {output_file}")
print(f"Outputs are {output_files}")
# Stitch the per-segment transcripts together into a single file
concat_file = os.path.join(os.path.dirname(output_files[0]), "output_file.txt")
with open(concat_file, 'wb') as wfd:
    for f in output_files:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
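It is also easy to double-check before uploading that each chunk really is under the 25MB limit mentioned earlier. A small sketch, separate from the script above and treating 25MB as 25 * 1024 * 1024 bytes:
import glob
import os
import sys

LIMIT = 25 * 1024 * 1024  # the endpoint's 25MB cap, assumed binary megabytes here
for path in glob.glob(sys.argv[1]):
    size = os.path.getsize(path)
    status = "OK" if size < LIMIT else "TOO BIG, re-split with a shorter segment length"
    print(f"{path}: {size} bytes, {status}")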
Results
Because we are generating the text from a simple audio file, the transcription has no way of associating speech with different speakers. If you can get an AI transcript generated by the online meeting platform from the meeting organisers, that will always be better.
On the first trial, one of the output files started with several paragraphs of English speech transcribed as Welsh. On investigation, the relevant audio file started with a long contribution from two people who were sharing a laptop mic in a “boomy” office; my guess is that this poor audio at the start of the clip confused the model.
Changing the split boundaries to avoid this seemed to cure the problem.
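One more lever worth knowing about: the transcriptions endpoint accepts an optional prompt parameter, a short piece of text that nudges the decoding towards a particular language and style. I did not need it once the splits were adjusted, but a sketch of how it would slot into the call above (the prompt wording is purely illustrative):
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text",
    language="en",
    prompt="A recording of a business meeting held in English.",
)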