In the modern era of content creation, video is everywhere. From educational tutorials to entertainment, videos serve as a powerful medium for conveying information. However, the accessibility and searchability of video content are often hindered by the lack of a textual representation, which is exactly what video transcription addresses. Video transcription is the process of converting the spoken words in a video into written text. In this article, we will implement a video transcription pipeline that prints a timestamp along with the speaker's corresponding speech.

How does video transcription work?

Video transcription converts the spoken language in a video (MP4) into written text, creating a textual representation of the audio content. This is important for several reasons, such as improving accessibility, aiding content searchability, and facilitating language translation. The first key step is extracting the audio track from the video file, which can be done easily with audio extraction tools or libraries. Once extracted, the audio file serves as the input for the transcription process.
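For instance, the audio track can be pulled out of an MP4 with MoviePy. This is a minimal sketch, assuming placeholder file names (and the MoviePy 1.x import path):

```python
# Minimal sketch of the audio-extraction step (file names are placeholders).
from moviepy.editor import VideoFileClip

clip = VideoFileClip("input_video.mp4")
clip.audio.write_audiofile("audio.wav")  # this WAV file feeds the transcription step
```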
Build a Video Transcription

Installing required modules

First, we need to install the required modules in our runtime:

!pip install SpeechRecognition

Importing required libraries

Now we import all the required Python modules, such as speech_recognition, MoviePy, and pydub.
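The original import listing is not reproduced here; below is a minimal sketch of the imports such a pipeline needs, assumed from the modules named in the text. MoviePy and pydub may need to be installed separately if they are not already available in your environment, and the MoviePy import path shown is for MoviePy 1.x.

```python
# Minimal sketch of the imports (assumed from the modules named in the article).
import speech_recognition as sr              # wrapper around Google's speech recognition
from moviepy.editor import VideoFileClip     # audio extraction from MP4 (MoviePy 1.x path)
from pydub import AudioSegment               # audio loading and slicing
from pydub.silence import detect_nonsilent   # silence-based segmentation
```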
Driver functions

The overall process is controlled by just two driver functions, which are sketched below.
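The original listings are not reproduced here, so the following is a minimal sketch of what the two driver functions could look like, based on the steps described in the article: audio extraction, silence-based splitting, Google speech recognition, and timestamped output. The name transcribe_video is taken from the article; extract_audio, the parameter defaults, and the file paths are assumptions.

```python
# Hedged sketch of the two driver functions; names other than transcribe_video,
# parameter defaults, and file paths are assumptions, not the article's exact code.
import speech_recognition as sr
from moviepy.editor import VideoFileClip
from pydub import AudioSegment
from pydub.silence import detect_nonsilent


def extract_audio(video_path, audio_path="/content/audio.wav"):
    """Extract the audio track of an MP4 file to a WAV file (hypothetical helper)."""
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    return audio_path


def transcribe_video(video_path, output_path="transcription.txt",
                     min_silence_len=3000, silence_thresh=-40):
    """Split the extracted audio on silence and transcribe each non-silent chunk,
    writing '[start-seconds] Speaker: text' lines to output_path."""
    audio_path = extract_audio(video_path)
    audio = AudioSegment.from_wav(audio_path)

    # Find non-silent [start_ms, end_ms] spans; the two parameters can be tuned
    # to capture shorter pauses, as noted at the end of the article.
    spans = detect_nonsilent(audio, min_silence_len=min_silence_len,
                             silence_thresh=silence_thresh)

    recognizer = sr.Recognizer()
    with open(output_path, "w") as out:
        for start_ms, end_ms in spans:
            chunk_path = "/content/chunk.wav"
            audio[start_ms:end_ms].export(chunk_path, format="wav")
            with sr.AudioFile(chunk_path) as source:
                data = recognizer.record(source)
            try:
                text = recognizer.recognize_google(data)
            except sr.UnknownValueError:
                continue  # skip chunks Google could not understand
            out.write(f"[{start_ms / 1000:.3f}] Speaker: {text}\n")
    print(f"Transcription saved to {output_path}")
```

Splitting on silence keeps each request to the recognizer short and gives natural segment boundaries, which is what allows a start timestamp to be printed for each utterance.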
User Input

Now we call the transcribe_video function to transcribe the input video. We have used a sample video (MP4) file for this. Remember that highly animated audio, noisy speech, or corrupted videos cannot be used here, as Google's speech recognition may fail on that kind of input and raise common exceptions such as bad request or broken pipe.
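A minimal usage sketch is shown below; the video path is a placeholder for your own sample file.

```python
# Hypothetical call; replace the path with your own MP4 file.
transcribe_video("/content/sample_video.mp4",
                 output_path="/content/transcription.txt")
```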
Output:

MoviePy - Writing audio in /content/audio.wav

Once you see this output, you can simply download the transcription text file from the location given in the output.

[0.069] Speaker: let's have a look at another example in this case we will look at a very short file containing multiple languages sentences that are long enough to easily detect what language they are in some of this is in Russian or at least some Cyrillic script some of it is Italian some of it is German and some of it is French once again we can look at the translated extracted text if we had already translate it this is not the case here so let's go back to the original right click select translate of multilingual in this case it will analyze every single sentence each segment that you have here will be analyzed to see if we can detect the language and then translate from that to English
Looking at the sample video, we can cross-check that the speech starts at exactly the printed timestamp (0.07 s) and that it has only one speaker. The transcription is also correct, including the capitalization of proper nouns. We could tune the silence threshold and the minimum silence length to capture sharper silence segments (shorter than 3 s), but for general purposes this implementation is sufficient, since overly aggressive silence splitting can lead to wrong transcriptions in some cases.

Conclusion

Video transcription is very useful for many purposes but involves multiple steps. We can handle it efficiently with the various modules available in Python. In more advanced cases, we can also perform speech-to-speech translation to translate the video transcription into a desired language.