Building movie recommender systems with deep learning.
In my previous article, I demonstrated how to build shallow recommender systems based on techniques such as matrix factorization using Surprise.
But what if you want to build a recommender system that uses techniques that are more sophisticated than simple matrix factorization? What if you want to build recommender systems with deep learning? What if you want to use a user’s viewing history to predict the next movie that they will watch?
This is where Spotlight, a Python library that uses PyTorch to create recommender systems with deep learning, comes into play. Spotlight features an interface similar to that of Surprise but supports both recommender systems based on matrix factorization and sequential deep learning models.
In this article, I will demonstrate how you can use Spotlight to build deep recommender systems for movie recommendations using both matrix factorization and sequential models.
Installation
The official documentation page provides a command for installing Spotlight with Conda, but I would recommend installing the library directly from Git using the following commands, especially if you have Python 3.7.
git clone https://github.com/maciejkula/spotlight.git
cd spotlight
python setup.py build
python setup.py install
Building a Movie Recommendation System
In this practical example, I decided to use data from The Movies Dataset available on Kaggle. This dataset contains files with 26 million ratings from 270,000 users for 45,000 movies. For the purpose of this example, I used the ratings_small file from the dataset, which contains a subset of 100,000 ratings from 700 users on 9,000 movies. The full dataset takes much longer to train on, but you can try using it if you have a machine with a powerful GPU. You can find the full code for this practical example on GitHub.
Import Basic Libraries
In the code below, I simply imported the basic libraries that I use for most data science projects.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Reading the Data
In the code below, I read three different data files:
- ratings_small.csv — contains the rating data for different users and movies.
- movies_metadata.csv — contains the metadata for all the 45,000 movies in the dataset.
- links.csv — contains the IDs that can be used to look up each movie when joining this data with the movie metadata.
ratings_data = pd.read_csv('./data/ratings_small.csv.zip')
metadata = pd.read_csv('./data/movies_metadata.csv.zip')
links_data = pd.read_csv('./data/links.csv')
The columns of each data frame, as reported by the info method in pandas, are displayed below.
ratings_data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 100004 non-null int64
1 movieId 100004 non-null int64
2 rating 100004 non-null float64
3 timestamp 100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
metadata
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 adult 45466 non-null object
1 belongs_to_collection 4494 non-null object
2 budget 45466 non-null object
3 genres 45466 non-null object
4 homepage 7782 non-null object
5 id 45466 non-null object
6 imdb_id 45449 non-null object
7 original_language 45455 non-null object
8 original_title 45466 non-null object
9 overview 44512 non-null object
10 popularity 45461 non-null object
11 poster_path 45080 non-null object
12 production_companies 45463 non-null object
13 production_countries 45463 non-null object
14 release_date 45379 non-null object
15 revenue 45460 non-null float64
16 runtime 45203 non-null float64
17 spoken_languages 45460 non-null object
18 status 45379 non-null object
19 tagline 20412 non-null object
20 title 45460 non-null object
21 video 45460 non-null object
22 vote_average 45460 non-null float64
23 vote_count 45460 non-null float64
dtypes: float64(4), object(20)
memory usage: 8.3+ MB
links_data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movieId 45843 non-null int64
1 imdbId 45843 non-null int64
2 tmdbId 45624 non-null float64
dtypes: float64(1), int64(2)
memory usage: 1.0 MB
Notice the following mappings and column relationships:
- The movieId column in links_data maps to the movieId column in ratings_data.
- The imdbId column in links_data maps to the imdb_id column in metadata.
Preprocessing the Metadata
In this next section, I preprocessed the data using the steps listed below. Refer to my Jupyter Notebook on GitHub and the dataset description on Kaggle for clarification.
- Removed rows from the metadata data frame where the imdb_id was null.
- Converted each element of the imdb_id column in metadata to an int by applying a lambda function.
- Merged metadata and links_data by joining the data frames on the imdb_id and imdbId columns respectively.
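# Copy the filtered frame so the column assignment below does not act on a view.
metadata = metadata[metadata['imdb_id'].notna()].copy()
def remove_characters(string):
    # Keep only the digit characters, e.g. 'tt0114709' -> '0114709'.
    return ''.join(filter(str.isdigit, string))
metadata['imdb_id'] = metadata['imdb_id'].apply(lambda x: int(remove_characters(str(x))))
full_metadata = pd.merge(metadata, links_data, left_on='imdb_id', right_on='imdbId')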
Running the code above produces a single data frame that we can use to retrieve the metadata for a movie based on the movie ID alone.
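To make the cleaning and join steps easier to follow, here is a minimal sketch that runs the same logic on two tiny hand-made data frames. The rows are illustrative stand-ins for the real files, and the final lookup is just one assumed way of retrieving a title from a movie ID.

```python
import pandas as pd

# Toy stand-ins for the real files; the rows are illustrative only.
metadata = pd.DataFrame({
    'imdb_id': ['tt0114709', 'tt0113497', None],
    'original_title': ['Toy Story', 'Jumanji', 'Unknown'],
})
links_data = pd.DataFrame({
    'movieId': [1, 2],
    'imdbId': [114709, 113497],
    'tmdbId': [862.0, 8844.0],
})

def remove_characters(string):
    """Keep only the digit characters of a string."""
    return ''.join(filter(str.isdigit, string))

# Drop rows without an IMDb ID, then turn 'tt0114709' into 114709.
metadata = metadata[metadata['imdb_id'].notna()].copy()
metadata['imdb_id'] = metadata['imdb_id'].apply(
    lambda x: int(remove_characters(str(x))))

# Join on the now-numeric IMDb ID so each row also carries its movieId.
full_metadata = pd.merge(metadata, links_data,
                         left_on='imdb_id', right_on='imdbId')

# Look up a title from a movieId alone.
title = full_metadata.loc[full_metadata['movieId'] == 2,
                          'original_title'].iloc[0]
print(title)
```

The row with a missing imdb_id is dropped before the conversion, which is why the int cast never sees a null value.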
Creating a Spotlight Interactions Dataset
Like Surprise, Spotlight has its own dataset object that we need to use in order to train models on our data. In the code below, I created an Interactions object by supplying the following parameters, all of which must be NumPy arrays:
- user_ids — the user IDs in the rating data
- item_ids — the item IDs in the rating data
- ratings — the corresponding ratings in the rating data.
- timestamps (optional) — the timestamps for each user/item interaction.
from spotlight.interactions import Interactions
dataset = Interactions(user_ids=ratings_data['userId'].values,
                       item_ids=ratings_data['movieId'].values,
                       ratings=ratings_data['rating'].values,
                       timestamps=ratings_data['timestamp'].values)
Training a Matrix Factorization Model
Now that a dataset has been created, we can train a deep learning-based matrix factorization model using the ExplicitFactorizationModel class from Spotlight as demonstrated below.
from spotlight.cross_validation import random_train_test_split
from spotlight.evaluation import rmse_score
from spotlight.factorization.explicit import ExplicitFactorizationModel
train, test = random_train_test_split(dataset)
model = ExplicitFactorizationModel(n_iter=10)
model.fit(train, verbose=True)
rmse = rmse_score(model, test)
print('RMSE = ', rmse)
Running the code above produced the following output with loss values for each epoch:
Epoch 0: loss 4.494929069874945
Epoch 1: loss 0.8425834600011973
Epoch 2: loss 0.5420750372064997
Epoch 3: loss 0.38652444562064103
Epoch 4: loss 0.30954678428190163
Epoch 5: loss 0.26690390673145314
Epoch 6: loss 0.24580617306721325
Epoch 7: loss 0.23303465699786075
Epoch 8: loss 0.2235499506040965
Epoch 9: loss 0.2163570392770579
RMSE = 1.1101374661355057
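For readers who want to see what the RMSE figure measures, the sketch below computes a root mean squared error by hand on a few made-up ratings. The rmse helper is a hypothetical illustration of the metric, not Spotlight's own code. Keep in mind that the per-epoch loss is reported on the training data while the RMSE is computed on the held-out test set, so the two numbers are not directly comparable.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up ratings and predictions, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]

# Squared errors are 1, 0, 1 and 4, so the mean is 1.5.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single prediction that is off by two stars contributes as much as four predictions that are each off by one.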
Generating Movie Recommendations with the Factorization Model
Now that we have trained a matrix factorization model, we can use it to generate movie recommendations. The predict method takes a single user ID or an array of user IDs and generates predicted ratings or “scores” for each movie item in the dataset.
model.predict(user_ids=1)
The output of the predict method is an array of values that each correspond to the predicted rating or score for an item (in this case a movie) in the dataset.
array([0.42891726, 2.2079964 , 1.6789076 , ..., 0.24747998, 0.36188596, 1.658421 ], dtype=float32)
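A score array like this can be turned into a ranked list by sorting it. The following is a minimal sketch, assuming that the position of each score in the array corresponds to the item ID. The top_n_items helper and the scores are made up for illustration and are not taken from the utility functions linked in this article.

```python
import numpy as np

def top_n_items(scores, n=5):
    """Return the indices of the n highest scores, best first."""
    scores = np.asarray(scores)
    # argsort is ascending, so reverse it and keep the first n entries.
    return np.argsort(scores)[::-1][:n]

# Made-up scores for six items, for illustration only.
scores = np.array([0.43, 2.21, 1.68, 0.25, 0.36, 1.66], dtype=np.float32)
top = top_n_items(scores, n=3)
print(top)
```

The returned indices can then be matched against the movieId column of the merged metadata to retrieve titles and genres.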
I created some utility functions below for converting the output of the predict method to actual movie recommendations.
https://gist.github.com/AmolMavuduru/67a1e67171a88055832716f8fcc37e67
We can call the recommend_movies function to generate movie recommendations for the user with a specific ID as demonstrated below.
recommend_movies(1, full_metadata, model)
Calling this function produces the following output, which contains a list of dictionaries with metadata for each recommended movie.
[[{'original_title': '2001: A Space Odyssey',
'release_date': '1968-04-10',
'genres': "[{'id': 878, 'name': 'Science Fiction'}, {'id': 9648, 'name': 'Mystery'}, {'id': 12, 'name': 'Adventure'}]"}],
[{'original_title': 'Rocky',
'release_date': '1976-11-21',
'genres': "[{'id': 18, 'name': 'Drama'}]"}],
[{'original_title': "The Young Poisoner's Handbook",
'release_date': '1995-01-20',
'genres': "[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}]"}],
[{'original_title': 'Thinner',
'release_date': '1996-10-25',
'genres': "[{'id': 27, 'name': 'Horror'}, {'id': 53, 'name': 'Thriller'}]"}],
[{'original_title': 'Groundhog Day',
'release_date': '1993-02-11',
'genres': "[{'id': 10749, 'name': 'Romance'}, {'id': 14, 'name': 'Fantasy'}, {'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]"}]]
Training a Sequence Model
We can take a different approach to building recommender systems by using sequential deep learning models. Rather than learning a matrix factorization, sequence models learn to use a sequence of user-item interactions to predict the next item that a user will interact with. In the context of this example, a sequential model will learn to use the viewing/rating history of each user to predict the next movie that they will watch.
The ImplicitSequenceModel allows us to train a sequence model with deep learning as demonstrated below. Note that I had to convert the train and test datasets to sequence datasets using the to_sequence method.
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.cross_validation import user_based_train_test_split
train, test = user_based_train_test_split(dataset)
train = train.to_sequence()
test = test.to_sequence()
model = ImplicitSequenceModel(n_iter=10,
                              representation='cnn',
                              loss='bpr')
model.fit(train)
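To give a feel for what a sequence dataset contains, the sketch below orders one user's interactions by timestamp and left-pads the result to a fixed length. This is a simplified illustration only: build_sequence is a hypothetical helper, the item IDs and timestamps are made up, and the use of 0 as the padding value is an assumption. Spotlight's to_sequence method has its own windowing logic that is not reproduced here.

```python
def build_sequence(interactions, max_length=5, pad_value=0):
    """Order one user's (item_id, timestamp) pairs by time and left-pad."""
    # Sort by timestamp so the sequence reflects viewing order.
    ordered = [item for item, _ in sorted(interactions, key=lambda pair: pair[1])]
    # Keep only the most recent max_length items.
    recent = ordered[-max_length:]
    # Pad on the left so the latest item always sits in the last position.
    return [pad_value] * (max_length - len(recent)) + recent

# Made-up (item_id, timestamp) pairs in arbitrary order.
history = [(1061, 30), (31, 10), (1029, 20)]
sequence = build_sequence(history)
print(sequence)
```

If 0 is reserved for padding in this way, it should not also be used as the ID of a real movie.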
We can evaluate this model on the testing data using the mean reciprocal rank (MRR) metric. The reciprocal rank metric takes a ranking of items and outputs the reciprocal of the rank of the correct item. For example, let’s say that our recommender system produces the following ranking of movies for a user:
- Batman Begins
- Spider-Man
- Superman
Let’s assume that the user actually decides to watch Spider-Man next. In this case, Spider-Man is ranked second in the list above, meaning the reciprocal rank would be 1/2. The MRR averages these reciprocal ranks across all predictions in the testing set. We can compute the MRR for a model in Spotlight using the sequence_mrr_score function as demonstrated below.
from spotlight.evaluation import sequence_mrr_score
sequence_mrr_score(model, test)
The result is an array of MRR scores, one for each sequence in the testing set.
[0.00100402 0.00277778 0.00194932 ... 0.00110619 0.0005305 0.00471698]
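The reciprocal rank arithmetic from the Batman Begins, Spider-Man, and Superman example can be checked with a few lines of plain Python. This is a hand-rolled sketch of the metric itself. The two helper functions are hypothetical and are not how Spotlight computes its scores internally.

```python
def reciprocal_rank(ranking, target):
    """Return 1 / rank of target in ranking (ranks start at 1), or 0.0 if absent."""
    for position, item in enumerate(ranking, start=1):
        if item == target:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings, targets):
    """Average the reciprocal ranks over several (ranking, target) pairs."""
    scores = [reciprocal_rank(r, t) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores)

ranking = ['Batman Begins', 'Spider-Man', 'Superman']

# Spider-Man is ranked second, so the reciprocal rank is 1/2.
single = reciprocal_rank(ranking, 'Spider-Man')
print(single)

# Averaging a rank-2 hit (0.5) and a rank-1 hit (1.0) gives 0.75.
average = mean_reciprocal_rank([ranking, ranking],
                               ['Spider-Man', 'Batman Begins'])
print(average)
```

Since sequence_mrr_score returns one score per test sequence, calling .mean() on its result gives a single summary number for the model.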
Generating Movie Recommendations with a Sequence Model
The sequence model also has a predict method that allows us to generate predictions, but rather than using a user ID, this method accepts a sequence of items and predicts the next item that a general user would likely select.
model.predict(sequences=np.array([1, 2, 3, 4, 5]))
Calling the method above produces a NumPy array with a score for each movie in the dataset. Higher scores correspond to movies that the model considers more likely to be viewed next by the user.
array([ 0. , 16.237215, 11.529311, ..., -2.713985, -2.403066,
-3.747315], dtype=float32)
As in the previous example with the matrix factorization model, I created some more utility functions below to generate movie recommendations from the predictions generated by sequence models.
https://gist.github.com/AmolMavuduru/966b985252bde7cfcda9cf903909b45d
The recommend_next_movies function uses a list of movie names to recommend the top movies that the user is likely to watch next. The n_movies parameter specifies the number of movies to return. Consider the example code below.
movies = ['Shallow Grave', 'Twilight', 'Star Wars', 'Harry Potter']
recommend_next_movies(movies, full_metadata, model, n_movies=5)
The sequence model produces the following movie recommendations for a user that has watched Shallow Grave, Twilight, Star Wars, and Harry Potter:
[[{'original_title': 'Azúcar amarga',
'release_date': '1996-02-10',
'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"}],
[{'original_title': 'The American President',
'release_date': '1995-11-17',
'genres': "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"}],
[{'original_title': 'Jaws 2',
'release_date': '1978-06-16',
'genres': "[{'id': 27, 'name': 'Horror'}, {'id': 53, 'name': 'Thriller'}]"}],
[{'original_title': 'Robin Hood',
'release_date': '1973-11-08',
'genres': "[{'id': 16, 'name': 'Animation'}, {'id': 10751, 'name': 'Family'}]"}],
[{'original_title': 'Touch of Evil',
'release_date': '1958-04-23',
'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}, {'id': 80, 'name': 'Crime'}]"}]]
As you can see from this example, the sequence models provided in Spotlight are quite powerful and can generate recommendations for users that don’t exist in the training dataset. The best part about Spotlight is that it makes it really easy to build these powerful deep recommender systems without having to reinvent the wheel by implementing your own neural networks.
Summary
Spotlight is an easy-to-use and powerful library for building recommender systems using deep learning. It supports both matrix factorization and sequence models for recommender systems, which makes it well-suited for building both simple and more complex recommender systems.
As always, you can find the code for this article on GitHub. If you enjoyed this article, feel free to check out some of my previous articles on deep learning below.
https://towardsdatascience.com/how-you-can-build-simple-recommender-systems-with-surprise-b0d32a8e4802
Sources
- N. Hug, Surprise: A Python library for recommender systems, (2020), Journal of Open Source Software.
- R. Banik, The Movies Dataset, (2017), Kaggle.
- M. Kula, Spotlight, (2017), GitHub.