A Recommendation Engine for My Bookclub

Being a full-time student and a mostly full-time employee, I have very little time left for a social life, but the one event I will always make time for is my book club. Half of our six members are English teachers, and the rest of us read like our lives depend on it. It’s one of the few social gatherings where you can talk endlessly about reading without boring anyone.

My husband, a member of a much less rigorous and prestigious book club, likes to tease mine by calling it “the rule club.” He thinks our club has far too many rules and is much too strict about our book-choosing system, but in actuality the rules are quite simple.

Our system for picking books is a round robin: when it is a member’s turn to choose a book, they host dinner for the evening at their location of choice.

There are really only three rules for book club:

  1. No picking a book you’ve read before

  2. No picking a book someone else has read without special dispensation from the group

  3. No inviting outside members or guest stars without unanimous group approval

It’s both thoroughly exciting and terrifying when it is your turn to choose a book. Since it must be a book no one else in the club has read before, it can be really hard to choose one that will live up to our expectations. In addition to picking a good book, you want to choose one that will lend itself to discussion. Finally, while there is no hard and fast rule about length, we try not to go far into the “too-long” category without discussing it first.

Bottom line: if you pick a bad book (as we all have done before), you spend the next five cycles (up to eight months) in shame until you have the chance to redeem yourself.

So, as I find myself up, once again, in the picking cycle, I am feeling a little bit stressed and a little bit anxious about what I will pick. My first pick, The Turner House, was a huge hit, but one of my other picks, The Shining Girls, lives on in infamy as one of the worst picks of all time. Then the idea hit me: why not build a book recommendation engine based on some Goodreads data I found on Kaggle, and let something much smarter than me do all the work (and potentially take all the blame)?

Building A Book Recommender System

The data I found was very, very messy, and required a lot of clean-up. Here are the columns and data types I started out with:

Columns as Included in the Dataset from Kaggle

From clearing out newline characters to extracting author names from a URL, it took a lot of work to get the data into semi-wrangled shape.

A Quick Write Up of My Cleaning

One issue with the Goodreads dataset is that it contains a lot of duplicate entries: each book can be represented four or five times with nearly identical rows. To sort the duplicates out I used the drop_duplicates function, sorting the dataframe first by score so that I would keep the most popular entry for a given book.

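The de-duplication step looks roughly like this (a minimal sketch with toy data; the real column names may differ from my assumptions here):

```python
import pandas as pd

# Toy stand-in for the Goodreads data; real column names may differ.
books = pd.DataFrame({
    "title": ["The Goldfinch", "The Goldfinch", "Station Eleven"],
    "score": [4.5, 3.9, 4.2],
})

# Sort by score so the most popular entry comes first, then keep
# only the first occurrence of each title.
deduped = (
    books.sort_values("score", ascending=False)
         .drop_duplicates(subset="title", keep="first")
)
```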
The author genres category as well as the author names needed some tidying up too. The dataset had some nasty ‘\n’ characters, and about eight genres smushed into one author genre column. I chose the first genre for each author, and cleaned out any ugly characters.
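That cleanup can be sketched like this (the column name and the comma-separated genre format are assumptions on my part):

```python
import pandas as pd

# Hypothetical author_genres column with embedded newlines and many
# genres packed into one string.
authors = pd.DataFrame({
    "author_genres": ["historical-fiction\n,thriller,mystery", "poetry\n"],
})

# Strip the '\n' characters and keep only the first listed genre.
authors["author_genre"] = (
    authors["author_genres"]
    .str.replace("\n", "", regex=False)
    .str.split(",")
    .str[0]
)
```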

In addition to getting knee-deep into some messy data, I also had to wrangle my book club's own dataset. Megan, the bookretary, has recorded each book, who picked it, and the author in a Google Docs file. I had to convert the file to CSV, and then edit entries until they matched the exact format, capitalization, and spelling of the Goodreads dataset. About 13 of the 53 books on our list were not included in the Goodreads dataset, so I had to impute the values for all columns for those books as well.


Building The Dataset

Once I had the two files pretty well cleaned, I could set about merging and appending the datasets. I knew that I wanted to include genre in the recommender system I built, but not all of the genres included in the training set (the Goodreads dataset) were included in the book club dataset (we really need to broaden our horizons). I was planning on using natural language processing to stem the genres and then one-hot encoding by genre.

Stemming reduces a word down to its base stem (so “running” becomes “run”), which helps clean up the many ways one genre can be written (e.g. “thriller” vs. “thrillers”). One-hot encoding takes a categorical variable like genre and, for each unique value of that variable, creates a new column with a 1 in each row where that value is present and a 0 where it is not. If one dataset had values for a variable that were not included in the other, the training and test datasets would end up with different numbers of columns, and the model would not run successfully. To work around this, I appended the book club dataset to the bottom of the Goodreads dataset prior to one-hot encoding.
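The workaround can be sketched with toy data (the column names here are illustrative, not the real ones):

```python
import pandas as pd

goodreads = pd.DataFrame({"title": ["A"], "genre": ["thriller"]})
bookclub = pd.DataFrame({"title": ["B"], "genre": ["romance"]})

# Append the book club rows before one-hot encoding so both sets
# end up with the same encoded columns.
combined = pd.concat([goodreads, bookclub], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["genre"])

# The book club rows can later be split back off the bottom by position.
```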

A snippet of my stemming code
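Since the snippet above is an image, here is a rough stand-in for the idea. A real project would use a proper stemmer (such as NLTK's PorterStemmer); this toy version just strips a trailing "s" to show how plural and singular genre spellings collapse together:

```python
def simple_stem(word: str) -> str:
    # Minimal stand-in for a real stemmer (e.g. NLTK's PorterStemmer):
    # stripping a trailing "s" collapses singular/plural genre spellings.
    return word[:-1] if word.endswith("s") else word

genres = ["thrillers", "thriller", "mystery"]
stemmed = [simple_stem(g) for g in genres]
# "thrillers" and "thriller" now map to the same category.
```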

I also wanted to use the published date in the model, but with over 3,000 unique dates that would have resulted in far too large a dataset after one-hot encoding, so I simply split the books up by date categories I made myself. The first represented books that came out very recently (in the last year), so I could potentially weed those out before modeling (buying or loaning new hardcover books is a little cost-prohibitive for this teacher’s-salary book club). The next was books released since book club has existed (I joined in 2015, but the club has been around since 2013). Then I included a category for books that came out while we were in college (and honestly probably not reading a lot). I also included some categories based on recent decades, and separated the remaining books into pre- and post-modern titles. The books are highly imbalanced by publishing date, so I had to get a little creative with my categories, and I also had to do some sampling prior to building the model to adjust for this imbalance.

Using datetime to parse the dates

Binning the Dates
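The two snippets above are images, so here is a hedged sketch of the same idea with pandas; the bin edges and labels here are illustrative guesses, not the exact categories from my notebook:

```python
import pandas as pd

books = pd.DataFrame({
    "published": ["2019-03-01", "2014-06-15", "1995-01-01", "1860-01-01"],
})

# Parse the raw date strings and pull out the publication year.
books["year"] = pd.to_datetime(books["published"]).dt.year

# Hand-made era bins (edges and labels are illustrative assumptions).
bins = [0, 1900, 2008, 2013, 2018, 2100]
labels = ["pre-modern", "modern", "pre-bookclub", "bookclub era", "new release"]
books["date_category"] = pd.cut(books["year"], bins=bins, labels=labels)
```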

The Final Columns

The final columns included in the training set and the test set (the book club picks) were one-hot encoded columns for the author gender, author genre, genre 1, genre 2, and date category. Here is a snippet of the one-hot encoding process.

One-hot encoding the columns

Setting the title equal to the index
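Since both snippets are images, here is roughly what those two steps look like together (toy data; the real columns differ):

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["The Goldfinch", "Station Eleven"],
    "genre_1": ["fiction", "sci-fi"],
    "author_gender": ["female", "female"],
})

# One-hot encode the categorical columns...
encoded = pd.get_dummies(books, columns=["genre_1", "author_gender"])

# ...then set the title as the index so each row can be looked up
# (and reported back) by name.
encoded = encoded.set_index("title")
```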

A Word on Recommendation Engines

There are two main approaches to recommendation engines. The first is a user-based collaborative filtering system: users are compared based on previous reviews to find who has the most similar tastes, and the engine then recommends items that a user’s similar counterparts have rated but that the user has not yet tried. One weakness of a user-based engine is that users’ tastes can change over time, and the engine might not pick up on that. This type of engine isn’t really an option with the Goodreads dataset: because it is aggregated, with ratings that come as averages rather than per user, we can’t find similar members in the Goodreads dataset using this system.

I used the other type of recommendation engine, a content-based recommendation engine, to find other items (in this case books) that are similar to a given book I choose. I polled the members to come up with our top picks of all time as well as our biggest failures. Here’s what we ended up with for top picks:

Recommendations

  1. A Constellation of Vital Phenomena by Anthony Marra

  2. The Goldfinch by Donna Tartt

  3. All the Light We Cannot See by Anthony Doerr

  4. Station Eleven by Emily St. John Mandel

  5. The Flamethrowers by Rachel Kushner

Once I had removed the book club picks from the bottom of the training dataset, I just had to build the model and use these books to look up recommendations. Building the model was a little difficult at first: the way I had formatted my data meant that I could not use the item-based recommendation engines included in my reference books. I kept thinking of my machine learning class unit on kNN. I wanted to find the books from the Goodreads dataset that were the most similar to my favorite book club picks. In a way, I was looking for the nearest data points to these books, and it seemed like an easy way to go about this would be to use nearest neighbors to determine recommendations.

I knew I would need to use sklearn’s NearestNeighbors class rather than the kNN classification algorithm, since this would be an unsupervised project.

I did some digging online to see if anyone had used kNN for recommendations before, and lo and behold, I found another book recommendation engine that used it. This engine used a compressed sparse row matrix to fit the model, which saves time on vectorized calculations and helps efficiently carve up the matrix when finding the nearest neighbors.

Importing the kNN package and fitting my model
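The snippet above is an image, so here is a sketch of the same setup with toy data; the metric and algorithm choices are assumptions on my part:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy one-hot feature matrix: rows are books, columns are encoded features.
features = csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
]))

# Cosine distance with a brute-force search over the sparse matrix.
model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(features)

# Find the 3 nearest neighbors of the first book.
distances, indices = model.kneighbors(features[0], n_neighbors=3)
# The closest neighbor is always the query book itself, at distance 0.
```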

I set k = 5, fit my model, and then just had to feed our top picks to the engine and see the nearest neighbors. kNN is a lazy learner, which means that it won’t learn anything about my data until I enter my test data. It waits until I am ready to find my book recommendations to do its work, and it runs the search again each time I enter a new title that I want recommendations for.

My recommender interface. Note the query_index = 7 refers to the index of A Constellation of Vital Phenomena.
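The interface above is an image; the lookup step amounts to something like this (again a toy sketch, with names of my own invention, and the query book dropped from its own results):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy title-indexed feature matrix standing in for the real dataset.
encoded = pd.DataFrame(
    [[1, 0, 1], [1, 1, 1], [0, 1, 0]],
    index=["Book A", "Book B", "Book C"],
)

model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(encoded.values)

def recommend(query_index, k=1):
    # The query book is always its own closest neighbor, so ask for
    # k + 1 neighbors and drop the first result.
    distances, indices = model.kneighbors(
        encoded.iloc[[query_index]].values, n_neighbors=k + 1
    )
    return [encoded.index[i] for i in indices[0][1:]]
```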

Let’s look at A Constellation of Vital Phenomena first:

  1. A Stop in the Park

  2. A Thousand Splendid Suns

  3. And the Mountains Echoed

  4. Tell the Wolves I'm Home

  5. The Crying Tree

Part of being a good data scientist is not relying on the model to make decisions for you; it’s applying your own domain expertise to make sense of the results. So for each recommendation, I looked up the book on Goodreads. To be honest, A Stop in the Park looks like a bit of a sappy romance novel, but the rest of the books look really interesting to me. I have read A Thousand Splendid Suns before, and after reading the description of Tell the Wolves I’m Home, I am very interested in potentially picking that book.

It’s hard for me to look at just one book’s predictions because books are so wonderfully diverse. A Constellation of Vital Phenomena is a brilliant and complex book that will leave you sobbing and heartbroken. The Goldfinch, on the other hand, is like four books wrapped into one; I love it so much I have read it twice. It’s equal parts love story, heist, and action. Let’s see which picks I get for The Goldfinch:

Recommendations for The Goldfinch:

  1. A Brief History of Seven Killings

  2. The Corrections

  3. The Way the Crow Flies

  4. The Golden Notebook

  5. The Emperor Waltz

After researching each of these picks, I can see how they’re close to The Goldfinch. A Brief History of Seven Killings takes a real-life event, the attempted assassination of Bob Marley, and weaves it into a narrative much like The Goldfinch does with the eponymous painting by Carel Fabritius. After reviewing all my choices, I land on the following list for my picks:

  1. Tell the Wolves I’m Home

  2. A Brief History of Seven Killings

  3. The Emperor Waltz

Before I go, I just wanted to check that these books aren’t rated similarly to any of our least favorite books. Two books live in infamy in our club: The Log from the Sea of Cortez by John Steinbeck, or Log as we call it, and Independent People by Halldór Laxness. Independent People is a long slog through 19th-century Iceland that tells the story of the raving sexist Bjartur and his ill-fated flock of sheep. I was the only member to finish this book, and it was a painful day of reading. Log is the literal log Steinbeck kept while sailing. Both were picked by Eric, who has a reputation as a bad picker. I don’t want to cross over into Eric territory, so a quick check of both reveals the following lists:

Log:

  1. Aşk-ı Memnu

  2. The Ragged Trousered Philanthropists

  3. Two Lives

  4. Look Homeward, Angel

  5. The Tin Drum

Independent People:

  1. Dianetics: The Modern Science of Mental Health

  2. The Good Soldier Švejk

  3. Don Quixote

  4. The 120 Days of Sodom and Other Writings

Well, I guess I’m off the hook for reading Don Quixote, and I always knew Eric was a secret Scientologist. The good news is my top three selections are in the clear and certified bad-pick-free. At our July 27th book club for Paulette Jiles’ News of the World, I’ll put the choices to the club and we can all select a winner.

I’ll update this blog with the results. Thanks for reading.

Kate Eytchison