April 5, 2022
TED Talks are a fascinating source of content, and Kaggle offers a dataset of TED Talks posted on the website. As a TED Talk fan, I wanted to understand the type of resources available.
This is the second article in a series covering a larger project that applies several data science techniques: scraping a website, data wrangling, data classification techniques (PCA and clustering), and Machine Learning. Here are the links to all the articles in this series:
For the full code for this project, please visit the code library. You can also use this Colab notebook to follow along.
With the more robust dataset obtained in the previous article, we will proceed to clean the data and generate two base DataFrames to be used in the following steps. If you wish to see the entire code, you can visit this Jupyter notebook.
With the final joined dataset, we proceed to apply the following data processing steps:
Dropped duplicated or useless columns: ‘description2’, ‘Unnamed: 0’, ‘link’, ‘author_y’, ‘title’, ‘url’, ‘title_1’
Converted the date_recorded and date_released columns to datetime, extracting YEAR, MONTH, HOUR & MINUTES into numerical columns that can later be fed into Machine Learning models (see the sketch after this list)
Removed the list formatting from the keywords column in order to later apply the pandas get_dummies() function
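As a minimal sketch, the datetime conversion could look like this (the derived column names such as year_released are my own, and the raw date format may differ):

```python
import pandas as pd

# parse the raw date columns into proper datetimes
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
df['date_released'] = pd.to_datetime(df['date_released'])

# extract numerical components that Machine Learning models can consume
df['year_released'] = df['date_released'].dt.year
df['month_released'] = df['date_released'].dt.month
df['hour_released'] = df['date_released'].dt.hour
df['minute_released'] = df['date_released'].dt.minute
```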
The keywords step was particularly tricky for me. The original column looks like this:
```
0    ['TED', ' talks', ' climate change', ' global ...]
1    ['TED', ' talks', ' education', ' women', ' TE...]
2    ['TED', ' talks', ' business', ' leadership', ...]
3    ['TED', ' talks', ' climate change', ' polluti...]
4    ['TED', ' talks', ' climate change', ' Countdo...]
```
The desired structure looks more like:
```
0    'ted', 'talks', 'climatechange', 'globalissues...
1    'ted', 'talks', 'education', 'women', 'ted-ed'...
2    'ted', 'talks', 'business', 'leadership', 'wor...
3    'ted', 'talks', 'climatechange', 'pollution', ...
4    'ted', 'talks', 'climatechange', 'countdown', ...
```
To achieve this, a for loop was used to remove the list structure, apply lowercase, and strip the extra spaces that some of the quoted words carried, as shown in the sketch below.
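Here is a minimal sketch of that cleaning step, assuming the keywords column stores each list as its string representation (the helper name clean_keywords is hypothetical):

```python
import ast

def clean_keywords(raw):
    # parse the string representation of the list, e.g. "['TED', ' talks', ...]"
    words = ast.literal_eval(raw)
    # lowercase each keyword and drop stray spaces (' climate change' -> 'climatechange')
    return ','.join(w.strip().replace(' ', '').lower() for w in words)

df['keywords2'] = df['keywords'].apply(clean_keywords)
```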
Applied the get_dummies function to the new keywords2 column, which makes it possible to sum each dummy column within a group by and obtain a simple count of the number of times each keyword is used across the dataset. It is also important to note that all talks carry the keyword ‘ted’, which provides no value overall and was therefore removed during this step.
```python
import os

import numpy as np
import pandas as pd

def create_dummies_file(df):
    # converting keywords into dummy columns (one 0/1 column per keyword)
    df2 = df.keywords2.str.get_dummies(',')
    # joining the dummy columns with df
    df = pd.concat([df, df2], axis=1)
    # removing the 'ted' column (position 362 in this dataset), since every talk carries it
    column_to_drop = df.columns[362]
    df = df.drop(column_to_drop, axis=1)
    # counting dummies and creating a file to rename categories;
    # positions 16-348 hold the keyword dummy columns in this dataset
    dummy_columns = pd.Series(np.arange(15, 349, 1))[1:]
    df_dummies = df.iloc[:, dummy_columns].sum().reset_index()
    df_dummies.columns = ['keyword', 'sum']
    # aggregating in case a keyword appears more than once, then sorting by frequency
    df2 = (df_dummies
           .groupby(['keyword'])
           .agg({'sum': 'sum'})
           .reset_index())
    df2 = df2.sort_values(by='sum', ascending=False)
    # writing the keyword counts to keywords.csv in the current working directory
    cwd = os.getcwd()
    df2.to_csv(cwd + '/keywords.csv')
    return df
```
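For reference, a typical call would then be (assuming df already carries the cleaned keywords2 column):

```python
df = create_dummies_file(df)
```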
The result of this function is: 1) a DataFrame df with the main data structure to be processed later, and 2) a keywords.csv file containing the list of keywords and the number of times each appears in the data, for example:
keyword | sum |
---|---|
‘talks’ | 5440 |
‘science’ | 1266 |
‘technology’ | 1200 |
There are 334 keywords, which we will call sub_categories of content. To analyze them, we need to consolidate them into bigger buckets. For that reason, I created a one-to-many relationship between new categories and the sub_categories, grouping the latter into bigger buckets of content that we can understand better. This categorization was done separately and stored in a file called ‘keywords_categories.csv’.
Using this mapping, I transformed ‘keywords.csv’ into a DataFrame with the following structure:
index | sub_category | num_talks | category | likes | views |
---|---|---|---|---|---|
0 | ‘3dprinting’ | 9 | technology | 201574 | 6655100 |
1 | ‘activism’ | 352 | values & emotions | 21752759 | 714057797 |
2 | ‘addiction’ | 20 | health | 1870500 | 60982000 |
3 | ‘africa’ | 197 | global | 9097799 | 299541000 |
4 | ‘aging’ | 93 | society | 8152092 | 269034199 |
This DataFrame is named df_dummies.
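As a rough sketch, this transformation could be done with a merge; the column names assumed for ‘keywords_categories.csv’ (sub_category, category) are my own, and the likes/views aggregation is only indicated in a comment:

```python
import pandas as pd

# keyword counts produced by create_dummies_file
df_keywords = pd.read_csv('keywords.csv')

# manually built one-to-many mapping; assumed columns: sub_category, category
df_categories = pd.read_csv('keywords_categories.csv')

df_dummies = (df_keywords
              .rename(columns={'keyword': 'sub_category', 'sum': 'num_talks'})
              .merge(df_categories, on='sub_category', how='left'))

# likes and views per sub_category can be aggregated from the main DataFrame,
# e.g. df.loc[df['activism'] == 1, ['likes', 'views']].sum()
```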
Please read the next article in this series, on PCA and Clustering.