In the modern era of content creation, video is everywhere. From educational tutorials to entertainment, videos serve as a powerful medium for conveying information. However, the accessibility and searchability of video content are often hindered by the lack of a textual representation, which is exactly what video transcription addresses. Video transcription is the process of converting the spoken words in a video into written text. In this article, we will implement a video transcription pipeline that prints a timestamp along with the speaker's corresponding speech.

How does video transcription work?

Video transcription converts the spoken language in a video (MP4) into written text, creating a textual representation of the audio content. This is important for several reasons, such as improving accessibility, aiding content searchability, and facilitating language translation. The first key step is extracting the audio track from the video file, which can be done easily with audio extraction tools or libraries. Once extracted, the audio file serves as the input for the transcription process.
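For instance, the audio track can be pulled out of an MP4 with MoviePy. This is a minimal sketch, assuming placeholder file names (and the MoviePy 1.x import path):

```python
# Minimal sketch of the audio-extraction step (file names are placeholders).
from moviepy.editor import VideoFileClip

clip = VideoFileClip("input_video.mp4")
clip.audio.write_audiofile("audio.wav")  # this WAV file feeds the transcription step
```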
Build a Video Transcription

Installing required modules

First, we need to install the required modules in our runtime:

!pip install SpeechRecognition

Importing required libraries

Now we import all the required Python modules, such as speech_recognition, MoviePy, and pydub.
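The original import listing is not reproduced here; below is a minimal sketch of the imports such a pipeline needs, assumed from the modules named in the text. MoviePy and pydub may need to be installed separately if they are not already available in your environment, and the MoviePy import path shown is for MoviePy 1.x.

```python
# Minimal sketch of the imports (assumed from the modules named in the article).
import speech_recognition as sr              # wrapper around Google's speech recognition
from moviepy.editor import VideoFileClip     # audio extraction from MP4 (MoviePy 1.x path)
from pydub import AudioSegment               # audio loading and slicing
from pydub.silence import detect_nonsilent   # silence-based segmentation
```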
Driver functions

The overall process is controlled by just two driver functions, which are sketched below.
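The original listings are not reproduced here, so the following is a minimal sketch of what the two driver functions could look like, based on the steps described in the article: audio extraction, silence-based splitting, Google speech recognition, and timestamped output. The name transcribe_video is taken from the article; extract_audio, the parameter defaults, and the file paths are assumptions.

```python
# Hedged sketch of the two driver functions; names other than transcribe_video,
# parameter defaults, and file paths are assumptions, not the article's exact code.
import speech_recognition as sr
from moviepy.editor import VideoFileClip
from pydub import AudioSegment
from pydub.silence import detect_nonsilent


def extract_audio(video_path, audio_path="/content/audio.wav"):
    """Extract the audio track of an MP4 file to a WAV file (hypothetical helper)."""
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    return audio_path


def transcribe_video(video_path, output_path="transcription.txt",
                     min_silence_len=3000, silence_thresh=-40):
    """Split the extracted audio on silence and transcribe each non-silent chunk,
    writing '[start-seconds] Speaker: text' lines to output_path."""
    audio_path = extract_audio(video_path)
    audio = AudioSegment.from_wav(audio_path)

    # Find non-silent [start_ms, end_ms] spans; the two parameters can be tuned
    # to capture shorter pauses, as noted at the end of the article.
    spans = detect_nonsilent(audio, min_silence_len=min_silence_len,
                             silence_thresh=silence_thresh)

    recognizer = sr.Recognizer()
    with open(output_path, "w") as out:
        for start_ms, end_ms in spans:
            chunk_path = "/content/chunk.wav"
            audio[start_ms:end_ms].export(chunk_path, format="wav")
            with sr.AudioFile(chunk_path) as source:
                data = recognizer.record(source)
            try:
                text = recognizer.recognize_google(data)
            except sr.UnknownValueError:
                continue  # skip chunks Google could not understand
            out.write(f"[{start_ms / 1000:.3f}] Speaker: {text}\n")
    print(f"Transcription saved to {output_path}")
```

Splitting on silence keeps each request to the recognizer short and gives natural segment boundaries, which is what allows a start timestamp to be printed for each utterance.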
User Input

Now we call the transcribe_video function to transcribe the input video. We have used a sample video (MP4) file for this. Remember that highly animated audio, noisy speech, or corrupted videos cannot be used here, as Google's speech recognition may fail on that kind of input and raise common exceptions such as bad request or broken pipe.
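A minimal usage sketch is shown below; the video path is a placeholder for your own sample file.

```python
# Hypothetical call; replace the path with your own MP4 file.
transcribe_video("/content/sample_video.mp4",
                 output_path="/content/transcription.txt")
```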
Output:

MoviePy - Writing audio in /content/audio.wav

Once you see this output, you can simply download the transcription text file from the location given in the output.

[0.069] Speaker: let's have a look at another example in this case we will look at a very short file containing multiple languages sentences that are long enough to easily detect what language they are in some of this is in Russian or at least some Cyrillic script some of it is Italian some of it is German and some of it is French once again we can look at the translated extracted text if we had already translate it this is not the case here so let's go back to the original right click select translate of multilingual in this case it will analyze every single sentence each segment that you have here will be analyzed to see if we can detect the language and then translate from that to English
Looking at the sample video, we can cross-check that the speech starts at exactly the printed timestamp (0.07 s) and that it has only one speaker. The transcription is also correct, including the capitalization of proper nouns. We could tune the silence threshold and the minimum silence length to capture sharper silence segments (shorter than 3 s), but for general purposes this implementation is sufficient, since overly aggressive silence splitting can lead to wrong transcriptions in some cases.

Conclusion

Video transcription is very useful for many purposes but involves multiple steps. We can handle it efficiently with the various modules available in Python. In more advanced cases, we can also perform speech-to-speech translation to translate the video transcription into a desired language.