My mentor and I agreed on the following tasks for this week:

  • 1. Exploring image clustering with different feature extraction techniques and clustering methods
  • 2. Starting with the biometric clustering
  • 3. Extracting other audio clustering features and visualizing the results

Starting with the image clustering
This week our main focus was the image side of clustering (both general and biometric), so I will start this post with image clustering.

Images are clustered based on their similarity. However, similarity can mean many things: images that look alike, images with similar backgrounds, images of similar size, and so on. Our aim is to cluster images with similar content.

Feature extraction: This is the process of extracting, from the raw data, the information that is useful for solving the problem. In computer vision there are various feature descriptors (color, edge, texture) that transform an image from its raw pixels into another representation. Feature extraction can be done manually (as we did on the audio data in previous weeks) or automatically (using transfer learning).
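
To make the idea of a manual feature descriptor concrete, here is a minimal sketch of a color descriptor: a normalized per-channel histogram. The function name `color_histogram`, the bin count, and the random test image are only illustrative assumptions; this is not the feature extraction code used in the project.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Return a normalized per-channel color histogram as a 1-D feature vector.

    `image` is assumed to be an (H, W, 3) uint8 array (hypothetical input format).
    """
    channels = []
    for c in range(image.shape[2]):
        # Count how many pixels of this channel fall into each intensity bin.
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        channels.append(hist)
    features = np.concatenate(channels).astype(np.float64)
    # Normalize so that images of different sizes are comparable.
    return features / features.sum()


# Random pixels standing in for a real image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
features = color_histogram(image)
print(features.shape)
```

Each image becomes one fixed-length vector, which is the form a clustering algorithm expects as input.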

Transfer learning: In transfer learning, a deep learning model is first trained on a large dataset containing thousands or millions of samples. What that model has learned is then transferred so that it can work on another, smaller dataset with just hundreds or a few thousand images. For feature extraction, we use the representations learned by the pretrained network to extract meaningful features from new samples.
This week I tried the following pretrained models:
1. VGG16
2. VGG19
3. ResNet50
4. InceptionV3

I added two more algorithms to the image clustering this week:
1. Image-cluster is a Python library for image clustering. It uses a pretrained VGG16 for feature extraction and agglomerative clustering from scipy (not sklearn). I use the scipy agglomerative function in my code (without grid search).
2. pixplot is also a Python project, and it uses HDBSCAN for image clustering. I took the HDBSCAN clustering function from there (also without grid search).
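
The scipy-based agglomerative step mentioned in the first item can be sketched roughly as follows. This is only an illustration on synthetic vectors: the linkage method (`average`), the cut criterion (`maxclust` with 3 clusters), and the feature dimension are my assumptions, not necessarily what Image-cluster or my own code uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Three well-separated groups of 16-d vectors standing in for image features.
centers = np.array([[0.0] * 16, [5.0] * 16, [10.0] * 16])
features = np.vstack(
    [c + 0.1 * rng.standard_normal((20, 16)) for c in centers]
)

# Build the merge tree bottom-up, then cut it into at most 3 flat clusters.
Z = linkage(features, method="average", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")

n_clusters = len(np.unique(labels))
print(n_clusters)
```

With real image features the number of clusters is usually not known in advance, so cutting the tree at a distance threshold (`criterion="distance"`) may be more practical than fixing the cluster count.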
The main reason I ran the image clustering both with and without grid search is as follows. In the previous week we saw that audio clustering performed better with grid search (where we used the silhouette coefficient as the scoring function), so last week I ran the image clustering with grid search too. However, while working on the face clustering (next section), I observed that DBSCAN without grid search also performs quite well. So I decided to run the image clustering without grid search as well, i.e. with the default sklearn implementation, to see the results.
If we compare the results of the two approaches (Table 2 below), we can see that clustering without grid search performs noticeably better, which is a turning point here. I looked into this and found some possible reasons:

1. The default parameters may simply happen to suit this particular dataset better than they suited the audio dataset, i.e. it may be a matter of luck.
2. I think one of the main reasons is the scoring function used in the grid search. I am using the silhouette score, but for some algorithms (e.g. DBSCAN) the grid search returns a single cluster because that configuration has the highest score. For example, when I ran face clustering with the default sklearn DBSCAN function, it gave a silhouette score of -0.03 and 90+ well-defined clusters, but with grid search it gave a higher silhouette score of around 0.123 and only 1 cluster.
Possible solutions, as far as I understand, are:
1. Try a more diverse range of parameters in the grid search; the suitable range of values for image data may differ from that for audio data.
2. Try a different scoring function in the grid search, or the average of several scoring functions.
3. Combine the scoring function with a condition on the number of clusters, e.g. prefer results with more than 10 clusters. Basically, adding a minimum number of clusters alongside the scoring function in the grid search can prevent DBSCAN from returning very few clusters (i.e. 1, 2 or 3).
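
The third idea could look roughly like the sketch below: a hand-written grid search over DBSCAN parameters that scores each setting with the silhouette score but skips settings with too few clusters. The function name `dbscan_grid_search`, the parameter grid, and the synthetic blob data are all assumptions made for illustration; this is not the grid search code from the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def dbscan_grid_search(X, eps_values, min_samples_values, min_clusters=2):
    """Return the best DBSCAN setting by silhouette score, or None.

    Settings that produce fewer than `min_clusters` clusters are skipped.
    """
    best = None
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # DBSCAN marks noise with the label -1; do not count it as a cluster.
            n_clusters = len(set(labels) - {-1})
            if n_clusters < min_clusters:
                continue
            # silhouette_score needs between 2 and n_samples - 1 distinct labels
            # (noise, if present, is treated as one more label).
            n_labels = len(set(labels))
            if not 2 <= n_labels <= len(X) - 1:
                continue
            score = silhouette_score(X, labels)
            if best is None or score > best["score"]:
                best = {"eps": eps, "min_samples": min_samples,
                        "n_clusters": n_clusters, "score": score}
    return best


# Synthetic 2-d data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)
best = dbscan_grid_search(X, eps_values=[0.3, 1.0, 3.0, 20.0],
                          min_samples_values=[3, 5], min_clusters=2)
print(best)
```

Because silhouette_score treats the noise label as just another group, a result with one real cluster plus noise can still receive a score, which may be how a "1 cluster" setting ends up winning an unconstrained search. Filtering on the cluster count before scoring rules that case out.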
I will discuss these points with my mentors and post an update on the solution in next week's blog!
Table 1: Codebase links for image clustering

S No. | Code Type | Code link
1 | Image feature extraction | Image feature extraction code
2 | Image clustering with grid search | clustering-code with Grid
3 | Image clustering without grid search | clustering-code withoutGrid

Table 2: Results of the image clustering

S No. | Evaluation criteria | Results link
1 | Using grid search | resultswithGridsearch.html
2 | Without using grid search | resultswithGridsearch.html
3 | Using a dimensionality reduction technique such as PCA, with grid search | results with pca-with grid.html


Getting started with the Bio-metric clustering
I started with face clustering. First I extracted the face encodings (128-d vector representations) from the images using the face-recognition library in Python. The feature extraction code can be found here. I ran the DBSCAN algorithm with and without grid search; both performed well, but the grid search version was better.
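
As a rough illustration of the clustering step, the sketch below runs DBSCAN with the sklearn default parameters on 128-d vectors and groups the image names by cluster label. The vectors here are synthetic stand-ins, not real face encodings (real encodings have a different scale and would need their own `eps`), and the file names are hypothetical.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-ins for 128-d face encodings: 3 "identities", 10 images each.
identity_centers = rng.standard_normal((3, 128))
encodings = np.vstack(
    [c + 0.01 * rng.standard_normal((10, 128)) for c in identity_centers]
)
image_names = [f"img_{i:02d}.jpg" for i in range(len(encodings))]  # hypothetical

# eps=0.5 and min_samples=5 are the sklearn defaults ("without grid search").
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(encodings)

# Group image names by cluster label; label -1 would hold the noise points.
clusters = defaultdict(list)
for name, label in zip(image_names, labels):
    clusters[int(label)].append(name)

for label, names in sorted(clusters.items()):
    print(label, len(names))
```

Grouping the names this way makes it easy to inspect each cluster visually, which is a useful complement to a single score.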

S No. | Code Type | Code link
1 | Face encoding extraction | face-extractioncode
2 | Face clustering | clustering-code
3 | Results visualization | results-viz.html
4 | Complete code | complete_bio-metric.html