A Recommendation Engine for My Bookclub
Being a full-time student and a mostly-full-time employee, I have very little time left for a social life, but the one event I will always make time for is my book club. Half of our six members are English teachers, and the rest of us read like our lives depend on it. It’s one of the few social gatherings where you can talk endlessly about reading without boring anyone.
My husband, a member of a much less rigorous and prestigious book club, likes to tease mine by calling it “the rule club.” He thinks our club has far too many rules and is much too strict about our book-choosing system, but in actuality the rules are quite simple.
Our system for picking books is a round robin: when it is a member’s turn to choose a book, they host dinner for the evening at their location of choice.
There are really only three rules for book club:
No picking a book you’ve read before
No picking a book someone else has read without special dispensation from the group
No inviting outside members or guest stars without unanimous group approval
It’s both thoroughly exciting and terrifying when it is your turn to choose a book. Since it must be a book no one else in the club has read, it can be really hard to choose one that will live up to our expectations. In addition to picking a good book, you want to choose one that will lend itself to discussion. Finally, while there is no hard-and-fast rule about length, we try not to go far into the “too-long” category without discussing it first.
Bottom line: you pick a bad book (as we all have done before), you spend the next 5 cycles (up to 8 months) in shame until you have the chance to redeem yourself.
So, as I find myself up, once again, in the picking cycle, I am feeling a little stressed and a little anxious about what I will pick. My first pick, The Turner House, was a huge hit, but one of my other picks, The Shining Girls, lives on in infamy as one of the worst picks of all time. Then the idea hit me: why not build a book recommendation engine based on some Goodreads data I found on Kaggle, and let something much smarter than me do all the work (and potentially take all the blame)?
Building A Book Recommender System
The data I found was very, very messy and required a lot of clean-up. Here are the columns and data types I started out with:
From clearing out newline characters to extracting author names from a URL, it took a lot of work to get the data into semi-wrangled shape.
A Quick Write-Up of My Cleaning
One issue with the Goodreads dataset is that it contains a lot of duplicate entries: each book can be represented four or five times with nearly identical rows. To sort the duplicates out I used the drop_duplicates function, sorting the dataframe by score first so that I would keep the most popular entry for a given book.
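That dedup step looked roughly like this — a minimal sketch with toy data, since the real Kaggle file has different column names and many more fields:

```python
import pandas as pd

# Toy stand-in for the Goodreads frame; real column names may differ.
books = pd.DataFrame({
    "title": ["The Goldfinch", "The Goldfinch", "Station Eleven"],
    "score": [95421, 1204, 80000],
})

# Sort so the highest-scoring copy of each book comes first,
# then keep only that first (most popular) copy.
books = books.sort_values("score", ascending=False)
books = books.drop_duplicates(subset="title", keep="first")
```

After this, each title appears exactly once, carrying its highest score.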
The author genres and author names needed some tidying up as well. The dataset had some nasty ‘\n’ characters, and about eight genres smushed into one author-genre column. I chose the first genre for each author and cleaned out any ugly characters.
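A sketch of that tidying, with hypothetical column names (author_name, author_genres) and a comma delimiter standing in for however the real file smushed its genres together:

```python
import pandas as pd

# Hypothetical raw values: a stray newline and several genres in one cell.
authors = pd.DataFrame({
    "author_name": ["Donna\nTartt", "Emily St. John Mandel"],
    "author_genres": ["fiction,literary,historical", "fiction,science-fiction"],
})

# Strip newline characters from the names.
authors["author_name"] = authors["author_name"].str.replace("\n", " ", regex=False)

# Keep only the first listed genre per author.
authors["author_genre"] = authors["author_genres"].str.split(",").str[0]
```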
In addition to getting knee-deep in some messy data, I also had to wrangle my book club's own dataset. Megan, the bookretary, has recorded each book, who picked it, and the author in a Google Docs file. I had to convert the file to CSV and then edit entries until they matched the exact format, capitalization, and spelling of the Goodreads dataset. About 13 of the 53 books on our list were not in the Goodreads dataset, so I had to impute the values for all columns for those books as well.
Building The Dataset
Once I had the two files pretty well cleaned, I could set about merging and appending the datasets. I knew that I wanted to include genre in the recommender system, but not all of the genres in the training set (the Goodreads dataset) appeared in the book club dataset (we really need to broaden our horizons). I was planning on using natural language processing to stem the genres and then one-hot encoding by genre.

Stemming reduces a word down to its base stem (so “running” becomes “run”), which helps clean up the many ways a single genre can be written (e.g. thriller vs. thrillers). One-hot encoding takes a categorical variable like genre and, for each unique value of that variable, creates a new column with a 1 in each row where that value is present and a 0 where it is not. If one dataset has values for a given variable that the other does not, the training and test sets end up with different numbers of columns, and the model cannot run. To work around this, I appended the book club dataset to the bottom of the Goodreads dataset prior to one-hot encoding.
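To illustrate the stemming idea, here is a crude stand-in; the real project would use a proper stemmer (NLTK's PorterStemmer, for example), but the effect on genre labels is the same:

```python
def stem_genre(genre: str) -> str:
    """Crude stemmer for illustration only: lowercase and strip a
    trailing 's' so 'Thrillers' and 'thriller' collapse to one label.
    A real stemmer (e.g. NLTK's PorterStemmer) handles far more suffixes."""
    genre = genre.strip().lower()
    return genre[:-1] if genre.endswith("s") else genre
```

With this, stem_genre("Thrillers") and stem_genre("thriller") both come back as "thriller", so the two spellings land in the same one-hot column.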
I also wanted to use the published date in the model, but with over 3,000 unique dates, one-hot encoding would have produced far too large a dataset, so I split the books into date categories I made myself. The first represented books that came out very recently (in the last year), so I could potentially weed those out before modeling (buying or loaning new hardcover books is a little cost-prohibitive for this teacher’s-salary book club). The next was books released since book club has existed (I joined in 2015, but the club has been around since 2013). Then I included a category for books that came out while we were in college (and honestly probably not reading a lot). I also included some categories based on recent decades, and separated the remaining books into pre- and post-modern titles. The books are highly imbalanced by publishing date, so I had to get a little creative with my categories, and I had to do some sampling prior to building the model to adjust for this imbalance.
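The date bucketing can be sketched with pandas’ cut function; the cut-points and labels below are illustrative, not the exact ones I used:

```python
import pandas as pd

# Hypothetical cut-points mirroring the categories described above.
bins = [0, 1900, 1960, 1990, 2000, 2013, 2017, 2019]
labels = ["pre-modern", "modern", "recent decades", "college years",
          "pre-bookclub", "bookclub era", "new release"]

# A few sample publication years; pd.cut assigns each to its bucket.
years = pd.Series([1885, 1955, 2014, 2018])
date_category = pd.cut(years, bins=bins, labels=labels)
```

Each year falls into exactly one labeled bucket, so the date variable one-hot encodes to seven columns instead of thousands.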
The Final Columns
The final columns included in the training set and the test set (book club picks) were one-hot encoded columns for author gender, author genre, genre 1, genre 2, and date category. Here is a snippet of the one-hot encoding process.
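A minimal sketch of that encoding step with toy data (column names are illustrative): concatenating the club picks onto the training frame before calling get_dummies guarantees both slices come out with identical dummy columns.

```python
import pandas as pd

# Toy frames standing in for the Goodreads training set and the
# book club picks; in the real data there are several genre columns.
goodreads = pd.DataFrame({"genre_1": ["thriller", "romance"]})
bookclub = pd.DataFrame({"genre_1": ["literary"]})

# Append the club picks before encoding so every genre value is seen
# once, and both slices end up with the same columns.
combined = pd.concat([goodreads, bookclub], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["genre_1"])

# Split back apart: the last rows are the book club test set.
train = encoded.iloc[: len(goodreads)]
test = encoded.iloc[len(goodreads):]
```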
A Word on Recommendation Engines
There are two main approaches to recommendation engines. The first is user-based collaborative filtering: users are compared based on previous reviews to find who has the most similar tastes, and the engine then recommends things a user’s similar counterparts have read that the user has not. A user-based engine has some weaknesses, namely that users can change tastes over time and the engine might not pick up on that. This type of engine isn’t really an option with the Goodreads dataset: because it is aggregated, with ratings that come as averages rather than per user, we can’t find similar members using this system.
I used the other type, a content-based recommendation engine, to find items (in this case books) that are similar to a given book I choose. I polled the members to come up with our top picks of all time as well as our biggest failures. Here’s what we ended up with for top picks:
A Constellation of Vital Phenomena by Anthony Marra
The Goldfinch by Donna Tartt
All the Light We Cannot See by Anthony Doerr
Station Eleven by Emily St. John Mandel
The Flamethrowers by Rachel Kushner
Once I had removed the book club picks from the bottom of the training dataset, I just had to build the model and use these books to look up recommendations. Building the model was a little difficult at first: the way I had formatted my data meant that I could not use the item-based recommendation engines included in my reference books. I kept thinking of my machine learning class unit on kNN. I wanted to find the books from the Goodreads dataset that were the most similar to my favorite book club picks. In a way, I was looking for the nearest data points to these books, and it seemed like an easy way to go about this would be to use a nearest-neighbors search to determine recommendations.
Since this would be an unsupervised project, I knew I would need the sklearn NearestNeighbors package rather than the supervised kNN classification algorithm.
I did some digging online to see if anyone had used kNN this way before, and lo and behold, I found another book recommendation engine built on nearest neighbors. That engine used a compressed sparse row matrix to fit the model, which saves time on vectorized calculations and helps efficiently carve up the matrix when finding the nearest neighbors for a prediction.
I set k = 5 and fit my model, and then I just had to feed our top picks to the engine and see the nearest neighbors. kNN is a lazy learner, which means it won’t learn anything about my data until I enter my test data. It waits until I am ready to find my book recommendations to do the real work, and it repeats that work each time I enter a new title that I want recommendations for.
Let’s look at A Constellation of Vital Phenomena first:
A Stop in the Park
A Thousand Splendid Suns
And the Mountains Echoed
Tell the Wolves I'm Home
The Crying Tree
Part of being a good data scientist is not relying on the model to make decisions for you; it’s applying your own domain expertise to make sense of the results. So for each recommendation, I looked up the book on Goodreads. To be honest, A Stop in the Park looks like a bit of a sappy romance novel, but the rest of the books look really interesting to me. I have read A Thousand Splendid Suns before, and after reading the description of Tell the Wolves I’m Home, I am very interested in potentially picking that book.
It’s hard for me to look at just one book’s predictions because books are so wonderfully diverse. A Constellation of Vital Phenomena is a brilliant and complex book that will leave you sobbing and heartbroken. The Goldfinch, on the other hand, is like four books wrapped into one; it’s equal parts love story, heist, and action, and I love it so much I have read it twice. Let’s see which picks I get for The Goldfinch:
Recommendations for The Goldfinch:
A Brief History of Seven Killings
The Way the Crow Flies
The Golden Notebook
The Emperor Waltz
After researching each of these picks, I can see how they’re close to The Goldfinch. A Brief History of Seven Killings takes a real-life event, the attempted assassination of Bob Marley, and weaves it into a narrative much like The Goldfinch does with the eponymous painting by Carel Fabritius. After reviewing all my choices, I land on the following list for my picks:
Tell the Wolves I’m Home
A Brief History of Seven Killings
The Emperor Waltz
Before I go, I just wanted to check that none of these books turn up as recommendations for our least favorite books. Two books live in infamy in our club: The Log from the Sea of Cortez by John Steinbeck, or Log as we call it, and Independent People by Halldór Laxness. Independent People is a long slog through 19th-century Iceland that tells the story of the raving sexist Bjartur and his ill-fated flock of sheep. I was the only member to finish it, and it was a painful day of reading. Log is the literal log Steinbeck kept while sailing. Both were picked by Eric, who has a reputation as a bad picker. I don’t want to cross over into Eric territory, so a quick check of both reveals the following lists:
The Ragged Trousered Philanthropists:
Look Homeward, Angel
The Tin Drum:
Dianetics: The Modern Science of Mental Health
The Good Soldier Švejk
The 120 Days of Sodom and Other Writings
Well, I guess I’m off the hook for reading Don Quixote, and I always knew Eric was a secret Scientologist. The good news is my top three selections are in the clear and certified bad-pick-free. At our July 27th book club for Paulette Jiles’ News of the World, I’ll put the choices to the club and we can all select a winner.
I’ll update this blog with the results. Thanks for reading.