How to build powerful deep recommender systems using Spotlight.

Building movie recommender systems with deep learning.

Spotlight on a stage. — Photo by Nick Fewings on Unsplash

In my previous article, I demonstrated how to build shallow recommender systems based on techniques such as matrix factorization using Surprise.

https://towardsdatascience.com/how-you-can-build-simple-recommender-systems-with-surprise-b0d32a8e4802

But what if you want to build a recommender system that uses techniques that are more sophisticated than simple matrix factorization? What if you want to build recommender systems with deep learning? What if you want to use a user’s viewing history to predict the next movie that they will watch?

This is where Spotlight, a Python library that uses PyTorch to create recommender systems with deep learning, comes into play. Spotlight features an interface similar to that of Surprise but supports both matrix factorization models and sequential deep learning models.

In this article, I will demonstrate how you can use Spotlight to build deep recommender systems for movie recommendations using both matrix factorization and sequential models.

Installation

The official documentation page provides a command for installing Spotlight with Conda, but I would recommend installing the library directly from Git using the following commands, especially if you have Python 3.7.

git clone https://github.com/maciejkula/spotlight.git
cd spotlight
python setup.py build
python setup.py install

Building a Movie Recommendation System

In this practical example, I decided to use data from The Movies Dataset available on Kaggle. This dataset contains files with 26 million ratings from 270,000 users for 45,000 movies. For the purpose of this example, I used the ratings_small file from the dataset, which contains a subset of 100,000 ratings from 700 users on 9,000 movies. The full dataset takes a much longer time to train on but you can try using it if you have a machine with a powerful GPU. You can find the full code for this practical example on GitHub.

Import Basic Libraries

In the code below, I simply imported the basic libraries that I use for most data science projects.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Reading the Data

In the code below, I read three different data files:

ratings_small.csv: contains the rating data for different users and movies.
movies_metadata.csv: contains the metadata for all 45,000 movies in the dataset.
links.csv: contains the IDs that can be used to look up each movie when joining this data with the movie metadata.

ratings_data = pd.read_csv('./data/ratings_small.csv.zip')
metadata = pd.read_csv('./data/movies_metadata.csv.zip')
links_data = pd.read_csv('./data/links.csv')

The columns obtained for each data frame using the info function in Pandas are displayed below.

ratings_data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB

metadata

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  revenue                45460 non-null  float64
 16  runtime                45203 non-null  float64
 17  spoken_languages       45460 non-null  object 
 18  status                 45379 non-null  object 
 19  tagline                20412 non-null  object 
 20  title                  45460 non-null  object 
 21  video                  45460 non-null  object 
 22  vote_average           45460 non-null  float64
 23  vote_count             45460 non-null  float64
dtypes: float64(4), object(20)
memory usage: 8.3+ MB

links_data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  45843 non-null  int64  
 1   imdbId   45843 non-null  int64  
 2   tmdbId   45624 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.0 MB

Notice the following mappings and column relationships:

The movieId column in links_data maps to the movieId column in ratings_data.
The imdbId column in links_data maps to the imdb_id column in metadata.

Preprocessing the Metadata

In this next section, I preprocessed the data using the steps listed below. Refer to my Jupyter Notebook on GitHub and the dataset description on Kaggle for clarification.

Removed rows from the metadata data frame where the imdb_id was null.
Converted each element of the imdb_id column in metadata to an int by applying a lambda function.
Merged metadata and links_data by joining the data frames on the imdb_id and imdbId columns, respectively.

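# Keep only the rows that have an IMDb ID; copy so the assignment below does not modify a view
metadata = metadata[metadata['imdb_id'].notna()].copy()

def remove_characters(string):
    # Strip every non-digit character from the string
    return ''.join(filter(str.isdigit, string))

# Convert the IMDb IDs to integers so they match the imdbId column in links_data
metadata['imdb_id'] = metadata['imdb_id'].apply(lambda x: int(remove_characters(str(x))))

# Join the metadata with the links on the IMDb ID
full_metadata = pd.merge(metadata, links_data, left_on='imdb_id', right_on='imdbId')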

Running the code above produces a single data frame that we can use to retrieve the metadata for a movie based on the movie ID alone.
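To see why the digit-stripping step makes the two ID columns comparable, here is a small standalone sketch. It assumes the IMDb identifiers in the metadata are strings with a letter prefix and leading zeros, while links_data stores plain integers; the sample value is only an illustration, so check the dataset description on Kaggle for the exact format.

```python
def remove_characters(string):
    # Keep only the digit characters of the string
    return ''.join(filter(str.isdigit, string))

# Hypothetical IMDb-style identifier, used purely for illustration
raw_id = 'tt0114709'

# int() drops the leading zero, giving a value that can match an integer column
numeric_id = int(remove_characters(raw_id))
print(numeric_id)
```

After this conversion, both columns hold integers, so an ordinary equality join can line them up.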

Creating a Spotlight Interactions Dataset

Like Surprise, Spotlight has a dataset object that we need to use in order to train models on our data. In the code below, I created an Interactions object by supplying the following parameters, all of which must be NumPy arrays:

user_ids — the user IDs in the rating data
item_ids — the item IDs in the rating data
ratings — the corresponding ratings in the rating data.
timestamps (optional) — the timestamps for each user/item interaction.

from spotlight.interactions import Interactions

dataset = Interactions(user_ids=ratings_data['userId'].values,
                       item_ids=ratings_data['movieId'].values,
                       ratings=ratings_data['rating'].values,
                       timestamps=ratings_data['timestamp'].values)

Training a Matrix Factorization Model

Now that a dataset has been created, we can train a deep learning-based matrix factorization model using the ExplicitFactorizationModel class from Spotlight as demonstrated below.

from spotlight.cross_validation import random_train_test_split
from spotlight.evaluation import rmse_score
from spotlight.factorization.explicit import ExplicitFactorizationModel

train, test = random_train_test_split(dataset)

model = ExplicitFactorizationModel(n_iter=10)
model.fit(train, verbose=True)

rmse = rmse_score(model, test)
print('RMSE = ', rmse)

Running the code above produced the following output with loss values for each epoch:

Epoch 0: loss 4.494929069874945
Epoch 1: loss 0.8425834600011973
Epoch 2: loss 0.5420750372064997
Epoch 3: loss 0.38652444562064103
Epoch 4: loss 0.30954678428190163
Epoch 5: loss 0.26690390673145314
Epoch 6: loss 0.24580617306721325
Epoch 7: loss 0.23303465699786075
Epoch 8: loss 0.2235499506040965
Epoch 9: loss 0.2163570392770579
RMSE = 1.1101374661355057
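For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings, so lower values are better. The sketch below illustrates the definition on made-up numbers; it is not Spotlight's implementation, and the rmse helper is a hypothetical name.

```python
import math

def rmse(predicted, actual):
    # Square root of the mean squared difference between paired values
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings, for illustration only
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 3.0, 3.0]
print(rmse(predicted, actual))
```

Because the errors are squared before averaging, a few badly wrong predictions raise the RMSE more than many slightly wrong ones.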

Generating Movie Recommendations with the Factorization Model

Now that we have trained a matrix factorization model, we can use it to generate movie recommendations. The predict method takes a single user ID or an array of user IDs and generates predicted ratings or “scores” for each movie item in the dataset.

model.predict(user_ids=1)

The output of the predict method is an array of values that each correspond to the predicted rating or score for an item (in this case a movie) in the dataset.

array([0.42891726, 2.2079964 , 1.6789076 , ..., 0.24747998, 0.36188596, 1.658421  ], dtype=float32)

I created some utility functions below for converting the output of the predict method to actual movie recommendations.

https://gist.github.com/AmolMavuduru/67a1e67171a88055832716f8fcc37e67
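The utility functions themselves live in the gist linked above and are not reproduced here. As a rough sketch of the core idea only, the snippet below picks the highest-scoring item IDs from an array of scores. It assumes that the position of each score corresponds to an item ID; top_n_item_ids is a hypothetical helper and the scores are made up.

```python
import numpy as np

def top_n_item_ids(scores, n=5):
    # argsort on the negated scores orders indices from highest to lowest score
    return np.argsort(-np.asarray(scores))[:n]

# Made-up scores; index i is assumed to hold the score for item ID i
scores = np.array([0.4, 2.2, 1.7, 0.2, 3.1, 1.6], dtype=np.float32)
print(top_n_item_ids(scores, n=3))  # [4 1 2]
```

The resulting IDs could then be looked up in a metadata table to obtain titles, release dates, and genres.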

We can call the recommend_movies function to generate movie recommendations for the user with a specific ID as demonstrated below.

recommend_movies(1, full_metadata, model)

Calling this function produces the following output, which contains a list of dictionaries with metadata for each recommended movie.

[[{'original_title': '2001: A Space Odyssey',
   'release_date': '1968-04-10',
   'genres': "[{'id': 878, 'name': 'Science Fiction'}, {'id': 9648, 'name': 'Mystery'}, {'id': 12, 'name': 'Adventure'}]"}],
 [{'original_title': 'Rocky',
   'release_date': '1976-11-21',
   'genres': "[{'id': 18, 'name': 'Drama'}]"}],
 [{'original_title': "The Young Poisoner's Handbook",
   'release_date': '1995-01-20',
   'genres': "[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}]"}],
 [{'original_title': 'Thinner',
   'release_date': '1996-10-25',
   'genres': "[{'id': 27, 'name': 'Horror'}, {'id': 53, 'name': 'Thriller'}]"}],
 [{'original_title': 'Groundhog Day',
   'release_date': '1993-02-11',
   'genres': "[{'id': 10749, 'name': 'Romance'}, {'id': 14, 'name': 'Fantasy'}, {'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]"}]]

Training a Sequence Model

We can take a different approach to building recommender systems by using sequential deep learning models. Rather than learning a matrix factorization, sequence models learn to use a sequence of user-item interactions to predict the next item that a user will interact with. In the context of this example, a sequential model will learn to use the viewing/rating history of each user to predict the next movie that they will watch.
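The idea of turning a rating history into next-item prediction examples can be sketched in plain Python. This is a simplified illustration on made-up data, not how Spotlight builds its sequences internally; build_sequences and next_item_examples are hypothetical helpers.

```python
from collections import defaultdict

# Made-up (user_id, item_id, timestamp) interactions, for illustration only
interactions = [
    (1, 10, 300), (1, 20, 100), (1, 30, 200),
    (2, 40, 50), (2, 10, 60),
]

def build_sequences(interactions):
    # Collect each user's items in the order the interactions happened
    by_user = defaultdict(list)
    for user_id, item_id, _ in sorted(interactions, key=lambda row: row[2]):
        by_user[user_id].append(item_id)
    return dict(by_user)

def next_item_examples(sequence):
    # Pair every prefix of the history with the item that followed it
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

sequences = build_sequences(interactions)
print(sequences[1])
print(next_item_examples(sequences[1]))
```

Each (history, next item) pair is the kind of example a sequence model learns from: given what came before, rank the item that actually came next as highly as possible.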

The ImplicitSequenceModel allows us to train a sequence model with deep learning as demonstrated below. Note that I had to convert the train and test datasets to sequence datasets using the to_sequence method.

from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.cross_validation import user_based_train_test_split

train, test = user_based_train_test_split(dataset)

train = train.to_sequence()
test = test.to_sequence()

model = ImplicitSequenceModel(n_iter=10,
                              representation='cnn',
                              loss='bpr')

model.fit(train)

We can evaluate this model on the testing data using the mean reciprocal rank (MRR) metric. The reciprocal rank metric takes a ranking of items and outputs the reciprocal of the rank of the correct item. For example, let’s say that our recommender system produces the following ranking of movies for a user:

Batman Begins
Spider-Man
Superman

Let’s assume that the user actually decides to watch Spider-Man next. In this case, Spider-Man is ranked second in the list above, meaning the reciprocal rank would be 1/2. The MRR averages these reciprocal ranks across all predictions in the testing set. We can compute the MRR for a model in Spotlight using the sequence_mrr_score function as demonstrated below.

from spotlight.evaluation import sequence_mrr_score

sequence_mrr_score(model, test)

The result is a NumPy array of MRR scores, one for each sequence in the testing set. Taking the mean of this array gives a single MRR value for the model.

[0.00100402 0.00277778 0.00194932 ... 0.00110619 0.0005305  0.00471698]
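The reciprocal-rank arithmetic from the Batman Begins, Spider-Man, and Superman example can be reproduced in a few lines of plain Python. The helper names here are hypothetical, and this sketch only illustrates the metric; it is not how Spotlight computes its scores.

```python
def reciprocal_rank(ranking, correct_item):
    # Ranks are 1-based: the first position in the list has rank 1
    return 1.0 / (ranking.index(correct_item) + 1)

def mean_reciprocal_rank(rankings, correct_items):
    # Average the reciprocal ranks over all predictions
    ranks = [reciprocal_rank(r, c) for r, c in zip(rankings, correct_items)]
    return sum(ranks) / len(ranks)

ranking = ['Batman Begins', 'Spider-Man', 'Superman']
print(reciprocal_rank(ranking, 'Spider-Man'))  # 0.5

# Two predictions with the same ranking but different correct items
print(mean_reciprocal_rank([ranking, ranking], ['Spider-Man', 'Batman Begins']))  # 0.75
```

An MRR of 1.0 would mean the correct item was always ranked first, and values close to 0 mean it usually appeared far down the list.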

Generating Movie Recommendations with a Sequence Model

The sequence model also has a predict method that allows us to generate predictions, but rather than using a user ID, this method accepts a sequence of items and predicts the next item that a general user would likely select.

model.predict(sequences=np.array([1, 2, 3, 4, 5]))

Calling the method above produces a NumPy array with a score for each item that the model knows about. Higher scores indicate items that the model considers more likely to be viewed next by the user.

array([ 0.      , 16.237215, 11.529311, ..., -2.713985, -2.403066,
       -3.747315], dtype=float32)

As in the previous example with the matrix factorization model, I created some more utility functions below to generate movie recommendations from the predictions generated by sequence models.

https://gist.github.com/AmolMavuduru/966b985252bde7cfcda9cf903909b45d

The recommend_next_movies function uses a list of movie names to recommend the top movies that the user is likely to watch next. The n_movies parameter specifies the number of movies to return. Consider the example code below.

movies = ['Shallow Grave', 'Twilight', 'Star Wars', 'Harry Potter']
recommend_next_movies(movies, full_metadata, model, n_movies=5)

The sequence model produces the following movie recommendations for a user that has watched Shallow Grave, Twilight, Star Wars, and Harry Potter:

[[{'original_title': 'Azúcar amarga',
   'release_date': '1996-02-10',
   'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"}],
 [{'original_title': 'The American President',
   'release_date': '1995-11-17',
   'genres': "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"}],
 [{'original_title': 'Jaws 2',
   'release_date': '1978-06-16',
   'genres': "[{'id': 27, 'name': 'Horror'}, {'id': 53, 'name': 'Thriller'}]"}],
 [{'original_title': 'Robin Hood',
   'release_date': '1973-11-08',
   'genres': "[{'id': 16, 'name': 'Animation'}, {'id': 10751, 'name': 'Family'}]"}],
 [{'original_title': 'Touch of Evil',
   'release_date': '1958-04-23',
   'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}, {'id': 80, 'name': 'Crime'}]"}]]

As you can see from this example, the sequence models provided in Spotlight are quite powerful and can generate recommendations for users that don’t exist in the training dataset. The best part about Spotlight is that it makes it really easy to build these powerful deep recommender systems without having to reinvent the wheel by implementing your own neural networks.

Summary

Spotlight is an easy-to-use and powerful library for building recommender systems using deep learning. It supports both matrix factorization and sequence models for recommender systems, which makes it well-suited for building both simple and more complex recommender systems.

As always, you can find the code for this article on GitHub. If you enjoyed this article, feel free to check out some of my previous articles on deep learning below.

https://towardsdatascience.com/how-you-can-build-simple-recommender-systems-with-surprise-b0d32a8e4802

Sources

N. Hug, Surprise: A Python library for recommender systems, (2020), Journal of Open Source Software.
R. Banik, The Movies Dataset, (2017), Kaggle.
M. Kula, Spotlight, (2017), GitHub.