April 5, 2022
TED Talks is a fascinating source of content. Kaggle offers a dataset on TED Talks posted on the website. As a TED Talk fan, I wanted to understand the type of resources available.
This is the third article in a series covering a larger project that applies several data science techniques: scraping a website, data wrangling, classification techniques (PCA and clustering), and machine learning. Here are the links to all the articles in this series:
To learn more about the full code on this project, please visit the code library. You can also use this Colab notebook to follow along.
With the two dataframes created in the previous article, we can now apply data science techniques to understand the dataset and explore its key variables.
Even though the dataset is intuitively simple to understand, I wanted to practice a PCA analysis, which is normally reserved for models with many variables. To do this I followed these steps:
Starting from the feature dataframe `X`, we standardize the data with `StandardScaler` from the `sklearn.preprocessing` library and fit a PCA:

```python
import pandas as pd
from sklearn import decomposition
from sklearn.preprocessing import StandardScaler

# Standardize the features so each has mean 0 and variance 1
X_std = StandardScaler().fit_transform(X)

# Fit PCA and collect the transformed data into labeled components
pca = decomposition.PCA()
pca_X = pd.DataFrame(pca.fit_transform(X_std),
                     columns=[f'PC{i+1}' for i in range(len(X.columns))])
```

Looking at `pca.explained_variance_ratio_`, we can see that the first two components contain almost 50% of the variance:

```python
print(pca.explained_variance_ratio_)
```
```
array([2.61155893e-01, 2.09808747e-01, 1.27262369e-01, 1.19521228e-01,
       1.01198399e-01, 9.06161518e-02, 8.05875806e-02, 9.81195975e-03,
       3.76713129e-05])
```
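To confirm the "almost 50%" reading, we can accumulate the ratios printed above. A minimal sketch, using the explained-variance values copied from the output:

```python
import numpy as np

# Explained-variance ratios copied from the PCA output above
ratios = np.array([2.61155893e-01, 2.09808747e-01, 1.27262369e-01,
                   1.19521228e-01, 1.01198399e-01, 9.06161518e-02,
                   8.05875806e-02, 9.81195975e-03, 3.76713129e-05])

# Running total of variance explained by the first k components
cumulative = np.cumsum(ratios)
print(f"First two components explain {cumulative[1]:.1%} of the variance")
# → First two components explain 47.1% of the variance
```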
```python
# Plot the feature loadings of the first two principal components
(pd.DataFrame(pca.components_, columns=X.columns)
 .iloc[:2]
 .plot.bar()
 .legend(bbox_to_anchor=(1, 1)))
```
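To read the bar chart numerically, one can also list the strongest-loading feature for each of the first two components. A minimal sketch with hypothetical feature names and loadings (not the actual TED-Talk values):

```python
import pandas as pd

# Hypothetical loadings matrix: rows = components, columns = features
loadings = pd.DataFrame(
    [[0.7, 0.1, -0.6],
     [0.2, -0.8, 0.3]],
    columns=['views', 'likes', 'duration'],
    index=['PC1', 'PC2'])

# Feature with the largest absolute loading per component
top = loadings.abs().idxmax(axis=1)
print(top.to_dict())
# → {'PC1': 'views', 'PC2': 'likes'}
```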
Using three separate visualization methods, we concluded that 4 clusters is the best solution for this data.
The first method is a linear representation, nicknamed 'the elbow', where you search for a break in the line. This method suggests two potential cuts, one at two clusters and another at three, but does not provide a clear answer.
The second method is the Silhouette plot, where the shapes of the clusters show a reasonable grouping when the samples are split into 4 clusters. In this case we see that the 'bellies' of the clusters are not as pronounced as with 2 or 3 clusters, but there is no significant loss in the average score.
The third method, a dendrogram, is even clearer: the merge distance increases sharply beyond the fourth cluster. This is perhaps the clearest confirmation of our assumption.
Elbow Method | Silhouettes Method | Dendrogram Method |
---|---|---|
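The elbow and silhouette diagnostics above can be sketched in code. This is a minimal, hypothetical example on synthetic blobs (standing in for the standardized TED-Talk features), not the article's actual data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 4 well-separated groups (hypothetical)
X_demo, _ = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.5, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo)
    inertias[k] = km.inertia_                          # elbow: look for a bend
    silhouettes[k] = silhouette_score(X_demo, km.labels_)

# The k with the highest average silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Plotting `inertias` against `k` reproduces the elbow curve, while `silhouettes` summarizes the per-sample silhouette plots as a single average per `k`.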
```python
from sklearn import cluster

# Fit k-means with the 4 clusters suggested by the methods above
k9 = cluster.KMeans(n_clusters=4, random_state=42)
k9.fit(X_std)
labels = k9.predict(X_std)
print(pd.Series(labels).value_counts().sort_index())
```
```
0    1628
1    2063
2    1684
3      65
dtype: int64
```
```python
# Summary statistics for the samples in cluster 0
(X.assign(label=labels)
 .query('label == 0')
 .describe()
)
```
```python
# Mean of each feature per cluster, shaded to highlight contrasts
(X.assign(label=labels)
 .groupby('label')
 .mean()
 .T
 .style.background_gradient(cmap='RdBu', axis=1)
)
```
4) Describing the clusters
With the information provided above, we can describe the clusters as follows:
0 - Newer videos released in fall
1 - Newer videos released earlier in the year
2 - Older videos with longer duration (in seconds)
3 - Highest views & likes
Please read the next article in this series on ML: Predicting Performing Videos