Let's get started with week 9!
The tasks for this week were:
1. Deploying the bio-metric clustering
2. Starting with the autoencoder joint clustering
Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don't have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run it.
For more details about Singularity containers, visit https://singularity.lbl.gov/
The first step is to write the Singularity recipe, in which we list all the modules and libraries we require. My Singularity recipe looks like this:
https://github.com/Himani2000/GSOC_2020/blob/master/Singularity.clustering
Then log in to Singularity Hub to build the container, and add the GitHub repo in which the Singularity recipe exists.
Link for Singularity Hub: https://singularity-hub.org/
After this, whenever we push an update to the Singularity recipe on GitHub, Singularity Hub automatically rebuilds the container.
To pull the Singularity image, run:
```bash
singularity pull --name clustering-image_and_audio.img shub://Himani2000/GSOC_2020:clustering
```
The deployable bio-metric code can be found here:
To run the bio-metric code using Singularity:
```bash
singularity exec -B `pwd` biometric_deploy/bio-metric.img python3 biometric_deploy/bio-metric_clustering.py --input_file biometric_deploy/muslim_concordance_rapiannotator.xlsm --output_file clustering.csv
```
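The script takes an input spreadsheet and writes a clustering CSV. As a rough illustration of the command-line interface implied by the flags above (a hypothetical sketch; the actual argument handling in bio-metric_clustering.py may differ):

```python
import argparse

# Hypothetical sketch of the CLI implied by the command above;
# the real bio-metric_clustering.py may parse its arguments differently.
parser = argparse.ArgumentParser(description="Deployable bio-metric clustering")
parser.add_argument("--input_file", required=True,
                    help="Path to the annotated spreadsheet (e.g. an .xlsm file)")
parser.add_argument("--output_file", required=True,
                    help="Path of the CSV file to write the cluster table to")
args = parser.parse_args()
```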
Now let's understand the deployable bio-metric code.
We have already seen the bio-metric clustering in a previous week (please refer to the week 4 blog post for the detailed explanation). Now we want to make it deployable: for each file id, the output will have three columns named "cluster largest 1", "cluster largest 2", and "cluster largest 3". Here "cluster largest 1" holds the cluster assigned to the largest face present in the image (when more than one face is present), and so on. A short sketch of how such a table could be assembled follows the sample below.
A sample table looks like this:
File id | Cluster largest 1 | Cluster largest 2 | Cluster largest 3 |
---|---|---|---|
378248 | 1 | 4 | 7 |
471341 | 2 | NaN | NaN |
341434 | 1 | 4 | NaN |
374231 | 3 | NaN | 7 |
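Here is a minimal sketch of assembling that table, assuming we already have, for each file id, the detected faces with their bounding-box areas and cluster labels (the names faces_by_file, and the tuple layout, are hypothetical; the actual deployable code may organise this differently):

```python
import numpy as np
import pandas as pd

# Hypothetical input: file id -> list of (face_area, cluster_label),
# produced by the face detection + clustering steps from week 4.
faces_by_file = {
    378248: [(5200, 1), (3100, 4), (900, 7)],
    471341: [(4000, 2)],
}

rows = []
for file_id, faces in faces_by_file.items():
    # Sort faces by area, largest first, and keep the clusters of the top three.
    clusters = [c for _, c in sorted(faces, key=lambda f: f[0], reverse=True)]
    clusters = clusters[:3] + [np.nan] * (3 - len(clusters))
    rows.append([file_id] + clusters)

table = pd.DataFrame(rows, columns=["File id", "Cluster largest 1",
                                    "Cluster largest 2", "Cluster largest 3"])
table.to_csv("clustering.csv", index=False)
```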
2. Starting with deep clustering
a) Deep Embedded Clustering (DEC)
This is a pioneering work on deep clustering and is often used as the benchmark for comparing the performance of other models. DEC combines an autoencoder (AE) reconstruction loss with a cluster assignment hardening loss. It defines a soft cluster assignment distribution q based on Student's t-distribution, with the degree of freedom α set to 1. To further refine the assignments, it also defines an auxiliary target distribution p derived from this soft assignment, which is updated after every T iterations.
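To make the two distributions concrete, here is a minimal NumPy sketch of the soft assignment q and the auxiliary target p; this follows the standard DEC formulation and is not necessarily the exact code in dec.py:

```python
import numpy as np

def soft_assignment(z, mu, alpha=1.0):
    # q_ij: Student's t similarity between embedding z_i and centroid mu_j,
    # normalised over clusters (degree of freedom alpha = 1, as in DEC).
    dist_sq = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # p_ij: squares q and renormalises, so confident assignments are
    # emphasised; recomputed after every T iterations during training.
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)
```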
b) Discriminatively Boosted Clustering (DBC)
It builds on DEC by using a convolutional autoencoder instead of a feed-forward autoencoder. It uses the same training scheme, reconstruction loss, and cluster assignment hardening loss as DEC. DBC achieves good results on image datasets because of its use of convolutional neural networks.
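To illustrate the architectural change DBC makes, here is a minimal convolutional autoencoder sketch in PyTorch (an illustrative assumption; DBC's actual architecture, and the code in dec.py, may differ):

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    # The encoder downsamples the image into a compact embedding; the decoder
    # reconstructs it, giving the reconstruction loss that is used alongside
    # the cluster assignment hardening loss.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```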
We will look at other types of deep clustering in later weeks.
The code (Python file) can be found at:
https://github.com/Himani2000/GSOC_2020/blob/master/clustering-models/dec.py
The code (Slurm file) can be found at:
https://github.com/Himani2000/GSOC_2020/blob/master/clustering-models/dec.slrum
See you next week!