Week 3 - Starting with data extraction

I started with the audio clustering part first.
Audio is a term for any sound or noise within the range the human ear is capable of hearing, and clustering is a type of unsupervised learning where the aim is to find hidden structure within the feature space; unlike supervised learning algorithms, no labels are provided here. A cluster is often a region of density in the feature space where examples from the domain (observations or rows of data) are closer to one another than to examples in other clusters.
There are generally two types of clustering (a quick illustration contrasting them follows this list):
a) Hard clustering: each data point either belongs to a cluster completely or not at all.
b) Soft clustering: instead of assigning each data point to exactly one cluster, a probability or likelihood of the data point belonging to each cluster is assigned.
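Here is a quick illustration using scikit-learn on toy 2-D data (the data and parameters are arbitrary): KMeans gives a hard assignment, one cluster per point, while a Gaussian mixture returns per-cluster membership probabilities.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs

# hard clustering: each point gets exactly one label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])

# soft clustering: each point gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:2])
```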

There are different categories of clustering algorithms, such as the following (a side-by-side sketch follows the list):
a) Connectivity models: these are based on the notion that data points closer together in the data space are more similar to each other than data points lying farther apart.
b) Centroid models: these are iterative clustering algorithms in which similarity is derived from the closeness of a data point to the centroid of a cluster.
c) Distribution models: these are based on the notion of how probable it is that all data points in a cluster belong to the same distribution.
d) Density models: these search the data space for regions of varying density, isolate the different density regions, and assign the data points within each region to the same cluster.
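For orientation, scikit-learn offers at least one algorithm in each of these categories. A minimal sketch with one representative per category, run on the same kind of toy data as above (the hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

agglo = AgglomerativeClustering(n_clusters=2).fit_predict(X)              # connectivity model
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)   # centroid model
gmm = GaussianMixture(n_components=2, random_state=0).fit(X).predict(X)   # distribution model
dbscan = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)                    # density model
```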

I referred to these useful resources for clustering algorithms:
1.clustering-algorithms-with-python
2.audio-signal-feature-extraction-and-clustering
3.an-introduction-to-clustering-and-different-methods-of-clustering

However, in our case there can be two different types of clustering in the data:
a) Based on the speakers, also called speaker diarization
b) Based on the pronunciation

Speaker diarization involves identifying the speaker along with identifying the boundary/frame of the speech spoken by that speaker. It enhances the readability of an automatic speech transcription by structuring the audio stream into speaker turns and by providing the speaker’s true identity. It is a combination of speaker segmentation and speaker clustering.
I referred to these materials for understanding the basics of speaker diarization:
1.Speaker Diarization Basic information
2.Speaker Diarization with kaldi

I will be starting with the clustering based on pronunciation first.


Coding Starts
Link -> Image and audio clustering GSOC-2020 Code github
Part 1 - Data extraction
Let's first understand why we need to automate the data extraction step. Later on we will be extracting a particular section of the audio based on a word (as in the case of pronunciation clustering); doing this manually would be time-consuming, so my mentor and I decided to start with a data extraction script.
Python script 1 -> trims the audio so that only the 5 words to the left and 5 words to the right of the query word (in our first example, "ideology") remain out of the entire audio. The number of words is a variable and can differ in future use cases.
The following code shows how the audio start and end times are first updated according to the number of words taken into account (five in our case right now); the URL is then updated and the new audio file is downloaded using requests.
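A minimal sketch of that logic, assuming we already have per-word timestamps for the recording and that the audio server accepts start/end times as URL parameters; the word_times structure, the query parameters, and the function names are illustrative assumptions, not the exact script (the real code is in the GitHub repo linked above):

```python
import requests

N_WORDS = 5  # context words on each side; kept as a variable for future use cases

def trim_window(word_times, query_word, n_words=N_WORDS):
    """word_times: list of (word, start_sec, end_sec) tuples for the full audio."""
    idx = next(i for i, (w, _, _) in enumerate(word_times) if w == query_word)
    left = max(0, idx - n_words)
    right = min(len(word_times) - 1, idx + n_words)
    # window runs from the start of the first context word to the end of the last
    return word_times[left][1], word_times[right][2]

def download_clip(base_url, start, end, out_path):
    # hypothetical query parameters; the real URL scheme of the audio server may differ
    resp = requests.get(f"{base_url}?start={start}&end={end}")
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

# toy example: a 5-word window around "ideology" in a 3-word transcript
words = [("an", 0.0, 0.2), ("ideology", 0.2, 0.9), ("is", 0.9, 1.1)]
print(trim_window(words, "ideology"))  # -> (0.0, 1.1)
```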



Shell script -> This script uses the curl command to fetch the TextGrid from the WebMAUS server. A TextGrid file is used for labeling certain points or regions in an audio file, and WebMAUS is a web application that automatically aligns speech recordings with their corresponding text. It takes two input files: a) the audio file and b) a text file containing the textual encoding of the words spoken in the audio (both files should have the same name); the output is a TextGrid file with the same filename and the .TextGrid extension. We could also use the WebMAUS web interface for this, but automating it with a shell script is always better.
Since we have to pass a different audio and text file in each call, we need to loop over all the filenames; for this I stored all the filenames in a txt file.
Note that the output of the curl command is saved into a demo.html file: the curl command returns an HTML page in which there is a download-link tag containing the URL of the TextGrid file. So we also need to first extract the URL from that HTML file and then download the TextGrid file from it, which requires another Python script (extract.py). We need to call this script after every curl command in the loop.
The shell script is therefore a loop of curl calls, one per file, each followed by a call to the extraction script.
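A minimal sketch of that loop, written here as a Python stand-in that invokes curl via subprocess. The WebMAUS endpoint URL and the form field names (SIGNAL, TEXT, LANGUAGE, OUTFORMAT) follow the public BAS runMAUSBasic service and should be treated as assumptions to verify:

```python
import subprocess

# assumed endpoint of the public BAS/WebMAUS basic MAUS service
MAUS_URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSBasic"

with open("filenames.txt") as f:  # one base filename per line, shared by .wav and .txt
    names = [line.strip() for line in f if line.strip()]

for name in names:
    # POST the audio and its transcription; save the returned page to demo.html
    subprocess.run(
        [
            "curl", "-X", "POST", MAUS_URL,
            "-F", f"SIGNAL=@{name}.wav",
            "-F", f"TEXT=@{name}.txt",
            "-F", "LANGUAGE=eng-US",
            "-F", "OUTFORMAT=TextGrid",
            "-o", "demo.html",
        ],
        check=True,
    )
    # pull the download link out of demo.html and fetch the TextGrid (script below)
    subprocess.run(["python", "extract.py", name], check=True)
```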



Python script 2 -> This script is called after every curl command from the shell script. Its aim is to extract the URL from the link tag present in the HTML file; it then uses the requests library to download the TextGrid file from the extracted URL. It takes a savefile parameter as input, passed from the shell script, with the same name as the audio and text files.
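A minimal sketch of what extract.py might look like, assuming the returned page contains a plain URL ending in .TextGrid; that link format is an assumption, so the pattern may need adjusting against the real demo.html:

```python
import re
import sys

import requests

def fetch_textgrid(savefile, html_path="demo.html"):
    """Pull the TextGrid download link out of the page curl saved, then fetch it."""
    with open(html_path) as f:
        page = f.read()
    # assumes the page holds a plain URL ending in .TextGrid; use an HTML
    # parser instead if the real markup wraps the link differently
    match = re.search(r"https?://\S+?\.TextGrid", page)
    if match is None:
        raise ValueError(f"no TextGrid link found in {html_path}")
    resp = requests.get(match.group(0))
    resp.raise_for_status()
    with open(f"{savefile}.TextGrid", "wb") as out:
        out.write(resp.content)

if __name__ == "__main__":
    fetch_textgrid(sys.argv[1])  # savefile is passed in from the shell script
```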

That was all for this week. Thanks!