Ted Talks - ML- Recommendation Engine

AAAS24

April 5, 2022

Coding

TED Talks is a fascinating source of content. Kaggle offers a dataset on TED Talks posted on the website. As a TED Talk fan, I wanted to understand the type of resources available.

This is the last article in a series of a larger project that includes several data science techniques, including: scrapping a website, data wrangling, data classification techniques (PCA and clustering) and Machine Learning techniques. Here are the links to all the articles in this series:

To learn more about the full code on this project, please visit the code library. You can also use this Colab notebook to follow along.

Building a recommendation engine

Build Model

        #define variables to use in model
        review=df.description_1
        title=df.title

        #Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
        tfidf = TfidfVectorizer(stop_words='english')#max_features=5000

        #Construct the required TF-IDF matrix by fitting and transforming the data
        tfidf_matrix = tfidf.fit_transform(review)

        #Output the shape of tfidf_matrix
        tfidf_matrix.shape

        #create matrix
        cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

        #extract indices
        indices = (pd.Series(df.index, index=title)
        .reset_index()
        .drop_duplicates(subset=['title'], keep='first')
        ).set_index('title')
                
        indices.columns=['index']
        indices=indices.squeeze()


        def get_recommendations(title, cosine_sim=cosine_sim):
                '''
                Function that returns ten indices of top talks based on model created above and a talk title passed
                '''
                # Get the index of the talk that matches the title
                idx = indices[title]

                #     Get the pairwsie similarity scores of all talks with that movie
                sim_scores = list(enumerate(cosine_sim[idx]))
                
                #     Sort the talk based on the similarity scores
                sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

                #     Remove duplicates scores 
                sim_scores=pd.Series(v[0] for v in sim_scores).drop_duplicates()

                # Get the talk indices
                recommendations=(
                        df.title.iloc[sim_scores]
                        .drop_duplicates()
                        [1:11]
                        .reset_index()
                ).drop('index', axis=1)
                
                # Return the top 10 most similar values
                return recommendations

Test Models

        #select talk
        talk_liked='Can machines read your emotions?'


        # change display option to be able to see ful title name
        pd.set_option('display.max_colwidth', None)
        get_recommendations(talk_liked)        

        #compare results of recommendation engine
        df_graph=df.query('title.str.contains("machines", "emotions")', engine='python')
        df_graph[['author', 'title', 'likes']].drop_duplicates(subset='title').sort_values(by=['likes'], ascending=False)
        

FINDINGS

The current model’s outputs does not perform well against a simple df.query using keywords from the title. The model currently used is based on TF-IDF (term frequency and Inverse Document frequency) applied to the talk description. The model could be improved by adding other variables available like: keywords, likes & author.