April 5, 2022
TED Talks are a fascinating source of content. Kaggle offers a dataset of the talks posted on the TED website, and as a TED Talks fan, I wanted to understand the kind of resources available.
This is the first article in a series on a larger project that covers several data science techniques, including web scraping, data wrangling, classification techniques (PCA and clustering), and machine learning. Here are the links to all the articles in this series:
To see the full code for this project, please visit the code library. You can also use this Colab notebook to follow along.
The Kaggle data has been loaded into a DataFrame. Looking at a brief sample of the data:
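As a minimal sketch of this loading step (the CSV file name below is an assumption; use the name of your Kaggle download), the dataset can be read with pandas:

```python
import pandas as pd

# Load the Kaggle TED Talks dataset into a DataFrame
# (file name is an assumption; match it to your Kaggle download)
df_kaggle = pd.read_csv("ted_talks.csv")

# Preview the first few rows
print(df_kaggle.head())
```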
From this general view, we can immediately ask some basic questions, like:
However, some other, more interesting questions might be:
To answer some of these questions, the current dataset is insufficient. However, this information may be available on each video's page, which presents an opportunity to extract it with a targeted web scraping technique.
The goal of this section is to obtain a more robust dataset than the one provided by Kaggle by scraping the TED Talks site. To view the entirety of the code used for this section, please visit this link.
It is important to note that, throughout the process of understanding the HTML and iteratively tweaking the code, I stored the initial HTML locally and/or used samples of 1-10 links at a time, with flags in the code to avoid disrupting the site's servers while developing. I believe this is how any scraper should approach a target.
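As an illustration of that approach (the cache directory and helper name here are hypothetical, not from the original code), fetched pages can be stored locally so that repeated development runs never hit the server twice:

```python
import os
from urllib.request import urlopen

CACHE_DIR = "html_cache"  # hypothetical local cache directory

def fetch_cached(url):
    """Fetch a URL, serving it from the local cache when available."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Derive a simple file name from the last segment of the URL
    path = os.path.join(CACHE_DIR, url.rstrip("/").split("/")[-1] + ".html")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = urlopen(url).read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```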
The first step is to understand what kind of HTML we would get back and identify where we could obtain the information to answer the questions above and generate new valuable insights.
Fetch an example
Using one of the links provided in the Kaggle dataset obtained in Step 2, we scrape the HTML using the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the talk's page and parse the HTML into a navigable tree
html = urlopen("https://www.ted.com/talks/ozawa_bineshi_albert_climate_action_needs_new_frontline_leadership")
soup = BeautifulSoup(html.read(), 'html.parser')
Parse the desired content
Our goal is to extract each talk's key details, such as the author, title, and description, and store them in a dictionary we can query later.
Now we need to analyze the HTML tree in more detail. Using the prettify() method (print(soup.prettify())), we can see the full HTML in an indented, readable form.
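For example, slicing the output keeps it manageable while we inspect the structure (the slice length here is arbitrary):

```python
# Print the parsed HTML in indented form; slice to avoid flooding the console
print(soup.prettify()[:2000])
```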
In the HTML, we find that the information we need is stored in tags called "meta".
# Collect all the "meta" tags on the page
meta = soup.find_all("meta")
When we study these tags, we see that the elements inside hold the information we need at the following indices (a sketch of how to read them follows the table):
| Position in the 'meta' tag | Data |
|---|---|
| 1 | url |
| 27 | title (style 1) |
| 28 | title (style 2) |
| 29 | description (style 1) |
| 30 | description (style 2) |
| 33 | duration in seconds |
| 34 | keywords |
| 35 | release date |
| 37 | author |
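As a minimal sketch of reading these entries (the indices come from the table above; pulling the "content" attribute reflects how standard meta tags carry their data, and the dictionary keys are my own naming, not necessarily the original code):

```python
# Meta tags store their values in the "content" attribute;
# the indices below follow the table above
talk = {
    "link": meta[1].get("content"),
    "title": meta[27].get("content"),
    "description": meta[29].get("content"),
    "duration": meta[33].get("content"),
    "keywords": meta[34].get("content"),
    "release_date": meta[35].get("content"),
    "author": meta[37].get("content"),
}
print(talk)
```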
Through some data transformations in Python, we are able to clean the data and obtain a new DataFrame we called `df_scrapped`. To see the full data cleaning process, you can look at the code.
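As a hedged sketch of how the scraped records could become `df_scrapped` (the `records` list and the column handling below are assumptions; the actual cleaning lives in the linked code):

```python
import pandas as pd

# `records` stands for the list of per-talk dictionaries
# (like the `talk` dict above), one per scraped link
records = [talk]  # in a full run, this holds every scraped talk

df_scrapped = pd.DataFrame(records)

# Make the duration numeric so it can be analyzed later
df_scrapped["duration"] = pd.to_numeric(df_scrapped["duration"], errors="coerce")
```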
Because we included the initial `link` column from `df_kaggle`, we can join both data frames using pandas' merge.
# Join the scraped data with the Kaggle data on the shared "link" column
df = df_scrapped.merge(df_kaggle, left_on='link', right_on='link')
The final joined data was stored in this file.
Please read the next article of this series on preprocessing the data here.