Building a simple personalised search engine with Spark and Elasticsearch

Searching for Star on IMDB. Given my movie preferences I would have expected Star Wars or Star Trek.

Introduction

Hmmm, no Star Wars in the IMDB search results when I search for star. These results aren’t wrong, but they don’t take my personal movie preferences into account (which makes sense for a general database like IMDB).

But for fun, let’s build a simple search engine that does take personal preferences into account. If I type star, I would expect Star Wars or Star Trek to show up, since I like science-fiction movies! For people who really like movies about music, A Star Is Born is probably a good suggestion.

Prerequisites

You’ll need Apache Spark, Docker (to run Elasticsearch 7.10 and Cerebro), and Python in order to follow along. If you don’t have these installed right now, that’s not a problem; you can just read along. All the code is available in a GitHub repository.

Getting the data and training the recommendation model

First, let’s get some data: the MovieLens 25M dataset, which contains 25 million ratings (on a five-star scale) and one million tag applications applied to 62,000 movies by 162,000 users. You can find this dataset on the MovieLens website.

!unzip -o ml-25m.zip

We’ll use pandas and Spark to load the movies and ratings in a Jupyter notebook.

import findspark
findspark.init()

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('recommender').getOrCreate()

# load movies and extract the year as a new column
movies = pd.read_csv('ml-25m/movies.csv', sep=',', engine='python')
movies['year'] = movies['title'].str.extract(r'\(([0-9]{4})\)', expand=False).str.strip()

# load ratings
ratings = spark.read.options(inferSchema=True, header=True) \
    .csv('ml-25m/ratings.csv')

# create a training and validation set
(training, validation) = ratings.randomSplit([0.8, 0.2])

With the data loaded, we can start building a movie-to-user recommendation model using Alternating Least Squares (ALS) matrix factorization (link). Such matrix factorization algorithms work by decomposing the interaction matrix into the product of two lower-dimensional matrices: one representing the latent factors of the movies and the other representing the latent factors of the users. The challenge with these methods is the sparsity of the interaction matrix. Not all users will have rated all movies; in fact, the opposite is more likely, meaning that most values of the interaction matrix are unknown. The aim is to fill in the missing entries of the user-item association matrix by iterating over possible latent factors for the movies and users and checking the error between the true rating and the predicted rating at the end of each step.

from pyspark.ml.recommendation import ALS

als = ALS(maxIter=10, regParam=0.05, rank=48, userCol="userId", itemCol="movieId",
          ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
itemfactors = spark.createDataFrame(model.itemFactors.rdd)

By default the rank (i.e. the number of latent factors) is 10. More latent factors (i.e. higher dimensionality) allow for more nuance in the personalisation, but beware of overfitting if the rank is too high. In order to determine the best rank and regularisation parameter (regParam), a quick grid search was performed to select the best model based on the RMSE on the validation data. This yielded a rank of 48 and a regularisation parameter of 0.05. Grid search plus training took about 1 hour on a Google Colab Notebook. (To install Apache Spark, here is a nice starter.)
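If you want to reproduce such a grid search, a minimal sketch using Spark’s RegressionEvaluator could look like the following. The parameter ranges below are illustrative, not the exact grid that was used.

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# evaluate candidate models on the validation set using RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

best_rmse, best_model = float("inf"), None
for rank in [10, 24, 48]:            # illustrative ranks
    for reg in [0.01, 0.05, 0.1]:    # illustrative regularisation values
        als = ALS(maxIter=10, rank=rank, regParam=reg, userCol="userId",
                  itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
        candidate = als.fit(training)
        rmse = evaluator.evaluate(candidate.transform(validation))
        if rmse < best_rmse:
            best_rmse, best_model = rmse, candidate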

Now that we have the latent factors for each movie and each user (a vector of 48 floats), we can start recommending!

The cosine similarity will be close to 1 for similar vectors, close to -1 for opposite vectors and 0 for orthogonal vectors.

A good recommendation (movie to movie, or movie to user) is determined by the angle between the respective latent vectors (i.e. their dot product). The orange arrow in the figure above represents a random user; the purple vectors are associated with three different movies. The first movie (left) clearly shares a vast amount of properties with the preferences of the user. The second movie (middle) is the opposite of what the user likes, most likely a bad recommendation. The last movie (right) does not really share any property with the user; it’s impossible to tell whether or not this user will like this particular movie.
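To make that concrete, here is a tiny sketch of the cosine similarity between a (hypothetical) user vector and a movie vector in NumPy:

import numpy as np

def cosine_similarity(a, b):
    # angle-based similarity between two latent vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 48-dimensional latent vectors for a user and a movie
user_vector = np.random.rand(48)
movie_vector = np.random.rand(48)
print(cosine_similarity(user_vector, movie_vector))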

Loading the data in Elasticsearch and query

First, we need a running Elasticsearch. I’m using Docker for this. Perfect!

# This script pulls the elasticsearch:7.10.0 docker container and runs the database
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.10.0
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.0

Next, we need to create a new index for our data by setting the index properties, the available analysers (analysis) and the mapping (link).

Analysis defines how text should be processed in Elasticsearch, e.g. removing stop words, tokenizing, stemming, etc.

Mapping is the process of defining how a document is stored and indexed in Elasticsearch, for example:

  • which string fields should be treated as full-text fields and how they should be indexed
  • which fields contain numbers, dates, or vectors.

Since this is a local setup, we’ll use 1 shard and 0 replicas. We’re assuming the dataset consists primarily of English movie titles, so we’re going to use an English stemmer and remove some common English stop words for our search engine. We’ll also add the lowercase filter to avoid case-sensitive search. The latent vectors are stored in a dense_vector field with 48 dimensions.

from elasticsearch import Elasticsearch

es_client = Elasticsearch(http_compress=True)
index_name = "movielens"

# delete the index if it already exists
try:
    es_client.indices.delete(index=index_name)
except Exception as e:
    print(e)

index_body = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 0,
        'analysis': {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                }
            },
            "analyzer": {
                "stem_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    'mappings': {
        'properties': {
            'title': {
                'type': 'text',
                'analyzer': 'standard',
                'fields': {
                    'english': {
                        'type': 'text',
                        'analyzer': 'stem_english'
                    }
                }
            },
            'year': {'type': 'integer'},
            "profile_vector": {
                "type": "dense_vector",
                "dims": 48
            }
        }
    }
}

es_client.indices.create(index=index_name, body=index_body)
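As a quick, optional sanity check, the _analyze API lets you inspect how the stem_english analyzer tokenizes a title (the example text below is arbitrary):

# sanity check of the stem_english analyzer (example text is arbitrary)
tokens = es_client.indices.analyze(
    index=index_name,
    body={"analyzer": "stem_english", "text": "The Empire Strikes Back"}
)
print([t["token"] for t in tokens["tokens"]])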

The Python Elasticsearch module comes with some helper methods that make it easy to insert documents into Elasticsearch.

from elasticsearch import helpers
import uuid

# collect the movie latent factors and join them with the original dataframe
items_frame = itemfactors.select('id', 'features').toPandas() \
    .rename(columns={"id": "movie_id", "features": "features"})
db_movies = movies.merge(items_frame, left_on='movieId', right_on='movie_id')

# create a dataset for Elasticsearch
es_dataset = [{"_index": index_name,
               "_id": uuid.uuid4(),
               "_source": {"title": doc[1]["title"],
                           "profile_vector": doc[1]["features"]}}
              for doc in db_movies.iterrows()]

# bulk insert them
helpers.bulk(es_client, es_dataset)
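Optionally, you can refresh the index and count the documents to verify the bulk insert worked; the count should match the number of movies that have latent factors.

# optional: verify that all documents made it into the index
es_client.indices.refresh(index=index_name)
print(es_client.count(index=index_name)["count"])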
Cerebro (link) to interact with Elasticsearch

Time to write an Elasticsearch query. Remember, we want to take the preferences of the user into account, but the results still need to honour the search query! (In the case of the example from the introduction, all results need to contain the term star in the title.) After those matches are found, we rescore them. The rescore query runs a script that takes your personal profile (in the form of a latent vector that best describes the user’s preferences) and uses `cosineSimilarity` to determine good recommendations (be aware that Elasticsearch does not allow negative scores, hence the + 1.0). For the final score, 90% of the weight lies with the cosine similarity and 10% with the original score based on the query alone (depending on the use case these values definitely need further tuning).

{
  "size": 100,
  "query": {
    "match": {
      "title.english": {
        "operator": "or",
        "query": "star"
      }
    }
  },
  "rescore": {
    "window_size": 500,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "(cosineSimilarity(params.query_vector, 'profile_vector') + 1.0)",
            "params": {
              "query_vector": [PUT LATENT FACTORS]
            }
          }
        }
      },
      "query_weight": 0.1,
      "rescore_query_weight": 0.9
    }
  },
  "_source": {
    "include": [
      "title"
    ]
  }
}
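To run this query from Python, one approach is to pull a user’s latent vector out of the trained ALS model (the same way we extracted the item factors) and plug it in as the query vector. This is a sketch; the user id is arbitrary.

# collect the user latent factors from the trained ALS model
userfactors = spark.createDataFrame(model.userFactors.rdd)
user_vector = userfactors.filter(userfactors.id == 42).first()['features']  # arbitrary user id

search_body = {
    "size": 10,
    "query": {"match": {"title.english": {"operator": "or", "query": "star"}}},
    "rescore": {
        "window_size": 500,
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "source": "(cosineSimilarity(params.query_vector, 'profile_vector') + 1.0)",
                        "params": {"query_vector": list(user_vector)}
                    }
                }
            },
            "query_weight": 0.1,
            "rescore_query_weight": 0.9
        }
    },
    "_source": {"include": ["title"]}
}

# print the personalised results for this user
for hit in es_client.search(index=index_name, body=search_body)["hits"]["hits"]:
    print(hit["_source"]["title"])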

Results

Let’s do some queries. In the first example, the user posting the query likes movies such as Hairspray, La La Land and Bohemian Rhapsody.

Searching on `star` if the user really likes movies like La La Land and Hairspray

If the same query is repeated on the same dataset by somebody who is really into science-fiction movies, the results are very different! The first three results are all Star Wars movies! Yeah, just like I wanted.

Searching on `star` if the user really likes science-fiction movies

