ML - Book Recommendation Engine using KNN

Tell us what’s happening:
Hey, I’ve been working on this project and I am unable to get my recommendations to perform well at all. At least not well enough to pass the validation at the end.

The project description recommends removing books with < 100 reviews and ratings from users with < 200 reviews, to keep statistical relevancy. However, in my first run I decided not to do that. I finished coding the project, but I only got 1 valid result out of 15.

I went back to clean up the data a bit and run it again. Now, I get only 2 (out of 4) recommended books and the distances are way off.

I’ve tried different strategies to clean up and format my data, but none have been as successful as I’d hoped. I’m unsure where I’m going wrong. Thank you for your help!

Your code so far

Your browser information:

User Agent is: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36

Challenge: Machine Learning with Python Projects - Book Recommendation Engine using KNN

Link to the challenge:

That’s not a recommendation; it’s a requirement. Since this is algorithmic machine learning, there really is a right and wrong answer. So you will need to look closely at the project instructions and follow all the data cleaning steps exactly, or it won’t pass the final test. There is an old ML project this one is based on somewhere on the web, if you want to look for it and compare notes.

Secondly, the test at the end of the boilerplate notebook has a bug; it’s supposed to test for five books in ascending order. You can find the details by searching the forum.

Hey Jeremy,

Thanks for your response.

First, I must have misread or misunderstood the language in the instructions. In the paragraph where they mention filtering out data, they also wrote (optional) at the beginning, so I came to the wrong conclusion.

Secondly, I have in fact found ML projects whose functionality resembles what I’m working on here. I am posting on the forum because, despite that, I’m still not able to see clearly where I am going wrong…

My intuition tells me that maybe I’m not cleaning the data properly or there’s something that I am not considering. Looking at other projects for this is not helpful because their data might be malformed in a different way than my own, leaving me with no greater insight than before…

I ultimately need help troubleshooting, or advice on my processing of the data, so that my model can give me more accurate predictions.

Do you happen to have a link to that forum post where the tests are corrected?

My approach has been this:

  • Find all duplicate books that have different ISBNs (based on title and author; there are many)
  • Update the reviews in df_ratings that point to the 2nd, 3rd, …, 5th entries of The Bible so they point to the 1st
  • Delete all the duplicate versions of The Bible from df_books
  • Then apply the filters recommended by the instructions (a rough sketch of this consolidation follows the list)
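For concreteness, here is a minimal sketch of the consolidation step described above, assuming df_books has columns isbn, title, author and df_ratings has user, isbn, rating (the boilerplate names):

import pandas as pd

# df_books and df_ratings are assumed to be loaded as in the boilerplate.
# Pick the first ISBN in each (title, author) group as the canonical one.
canonical = df_books.groupby(['title', 'author'], dropna=False)['isbn'].transform('first')
isbn_map = dict(zip(df_books['isbn'], canonical))

# Repoint every rating at the canonical ISBN, then keep one row per book.
df_ratings['isbn'] = df_ratings['isbn'].map(isbn_map).fillna(df_ratings['isbn'])
df_books = df_books.drop_duplicates(subset=['title', 'author'])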

I don’t believe this approach is wrong. In the real world I wouldn’t necessarily want to lose tens or hundreds of thousands of rows of data just because we have 3 copies of “The Anomous” and people have reviewed all 3. I would consolidate the data into 1 copy to get the most thorough user-sentiment database.

This approach didn’t work, and my results got further off. If I don’t try to consolidate and instead just delete all duplicates of a book and all the reviews that were pointing to the duplicated versions, my distances and recommendations get better, but still not to a passing grade.

If I don’t drop duplicates in df_books, then my code throws a RuntimeError because my matrix lookup ends up broken (due to non-unique book titles).
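For context, here is roughly the step that blows up (a sketch; the name combined stands in for my merged ratings-plus-books frame):

# pivot() refuses duplicate (index, columns) pairs, so any title that still
# appears under more than one ISBN makes this step (or a later .loc lookup) fail.
matrix = combined.pivot(index='title', columns='user', values='rating').fillna(0)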

The graphing is optional, not the cleaning.

The necessary steps are not terribly clear, since there are several ways this could be implemented correctly yet generate different results. Your steps are approximately the reverse of the ones done by the original project. Here are some shapes for the data frames through the cleaning process:

# original data
books:  (271379, 3)
ratings:  (1149780, 3)
# ratings per ISBN and user, respectively, to check the 200 and 100 thresholds
ratings per book:  (340556,)
ratings per user:  (105283,)
# filtered copy of original ratings by simultaneously checking that the thresholds are met
cleaned ratings:  (49781, 3)
# combination of the cleaned ratings and original books on ISBN, with duplicates dropped
combined:  (49136, 5)
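If it helps, here is a minimal sketch of a pipeline consistent with these shapes. It assumes the boilerplate column names (isbn, title, author for books; user, isbn, rating for ratings), and the key used to drop duplicates at the end is my assumption:

import pandas as pd

# df_books and df_ratings as loaded by the boilerplate notebook.
# Ratings per ISBN and per user, to check the 100 and 200 thresholds.
ratings_per_book = df_ratings['isbn'].value_counts()   # (340556,)
ratings_per_user = df_ratings['user'].value_counts()   # (105283,)

# Filter a copy of the ORIGINAL ratings, checking both thresholds at once;
# filtering sequentially changes the counts and gives different shapes.
keep_books = ratings_per_book[ratings_per_book >= 100].index
keep_users = ratings_per_user[ratings_per_user >= 200].index
cleaned = df_ratings[
    df_ratings['isbn'].isin(keep_books) & df_ratings['user'].isin(keep_users)
].copy()                                               # (49781, 3)

# Combine the cleaned ratings with the ORIGINAL books on ISBN, then drop
# duplicates, here on (title, user), which is an assumption.
combined = cleaned.merge(df_books, on='isbn').drop_duplicates(
    subset=['title', 'user'])                          # (49136, 5)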

You’ll know you’re on the right track if you’re hitting the shapes above. The links I mentioned can be found by searching the forum, and since they contain spoilers, I’ll leave finding them to you rather than reposting them.

Hello, I’m also having a hard time with this challenge.
Should we be dropping the ratings equal to 0 beforehand (since the problem states that ratings should go from 1 to 10)? If I do that, applying just the >= 200 ratings-per-user filter gives me 69426 rows.
I also don’t understand the outcome of the model: the problem says “distance”, but the resulting scores are in decreasing order. Clearly they are not distances; closer books should have smaller distances.
Thank you in advance!

That’s not in the project description, so no.

It’s the distance in the “nearest neighbor” space. As mentioned earlier, the test at the end has the wrong number of books in the wrong order; the fix is available by searching the forum.
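Here’s a minimal sketch of where those numbers come from, assuming a (title x user) matrix like the one sketched earlier (query_title is a placeholder, not a value from the project):

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Fit on the (books x users) matrix; cosine distance is the usual choice here.
model = NearestNeighbors(metric='cosine')
model.fit(csr_matrix(matrix.values))

# kneighbors returns distances in ASCENDING order: the first hit is the query
# book itself at distance ~0.0, and each later neighbor is farther away.
# query_title is a placeholder for whichever book you look up.
distances, indices = model.kneighbors(
    matrix.loc[query_title].values.reshape(1, -1), n_neighbors=6)
titles = matrix.index[indices[0]]

# If an expected output shows scores in decreasing order, that list has been
# reversed; the metric itself is still a distance.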

We may be looking at different project descriptions then, because mine definitely says that.

Nope, we’re reading the same spec. It says the scale is 1–10, but it doesn’t say anything about removing ratings that don’t comply with the scale. Looking at the project, the project from which it’s likely descended, the data, and my version that passes the test, I’m not sure the description of the scale is correct: there are many zero ratings in the set, I didn’t filter them out while passing the test, and neither does the antecedent project. The antecedent project also says the scale is 1–10 and then does nothing with that bit of information. So do you believe the scale or the data? The test passes with code that uses 0–10 and not 1–10, so I believe the data in this case.
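If you want to verify the scale against the data yourself, a one-liner does it:

# Expect 0 to show up, despite the 1-10 wording in the description.
print(sorted(df_ratings['rating'].unique()))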

Like I said, there are many different, correct ways to do this kind of thing, and this project and its test are asking for a very specific, if difficult to hit upon, solution. Clearly there are a lot of rough edges on this project.

I was able to look up the proper test cases and fix mine to solve the challenge. The instructions were definitely poorly written compared to the other ones.

Nonetheless, I also got confused because the instructions explicitly say the ratings are 1–10, but then you open the dataset and see that they are actually 0–10. I was also curious whether I was supposed to remove ratings of 0, but I decided against it because it didn’t feel right. However, I can relate to @rflameiro being unsure about how to process the data.