Tell us what’s happening:
Hello freeCodeCampers! I’m currently working through the Machine Learning certification, and so far it’s going well: challenging enough to force me to make progress on my own. However, I’m completely stuck on the book recommendation exercise. My program runs and gives me book titles and distances, but only one of them matches the list expected by the test. I’ve gone through many tutorials and threads about this issue, but none of them resolved my problem. Can anyone here tell me what I might have forgotten in my code? Thank you so much!
Your code so far
Your browser information:
User Agent is:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
Challenge: Book Recommendation Engine using KNN
Link to the challenge:
I might be wrong, but I think you are missing the step where you remove duplicate reviews from your data (df.drop_duplicates), and that might mess some things up.
Also while you are lacking a “get_recommends()” function, a quick heads up already: You will have to return the recommendations in reverse order.
Thanks for the insight! I do use drop_duplicates, but I might be using it wrong; I tend to get lost when working with dataframes. I’m going to go back to it and come back here with an update.
So I tried looking at your code and comparing it to mine, but I have trouble finding any difference that looks meaningful.
Though one thing that would help me and also give you a better insight in general is getting more “visible” data. Meaning using df.info() at the start, or looking at df.head() from time to time, as well as df.shape after filtering data and whatnot.
In a real-life scenario you’d want to provide more insight into how your actions change the data, plus you probably wouldn’t be told that you have to filter at exactly 200 / 100 reviews. So using methods to look at the data is good practice.
For example, I could compare if shapes match up. My final dataframe pivot has a shape (673, 888).
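As a minimal sketch of that kind of inspection, the snippet below prints df.info(), df.head() and compares df.shape before and after a drop_duplicates step. The toy dataframe and its values are made up for illustration; only the column names follow the ones used in this thread.

```python
import pandas as pd

# Toy data (hypothetical values); user 1 rated book "a" twice.
df = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["a", "a", "a", "b"],
    "bookRating": [5, 5, 3, 4],
})

df.info()          # column names, dtypes and non-null counts
print(df.head())   # first rows, to eyeball the content

shape_before = df.shape

# Keep one rating per (user, book) pair.
deduped = df.drop_duplicates(subset=["userID", "ISBN"])
shape_after = deduped.shape

print(shape_before, "->", shape_after)
```

Printing the shape after every filtering step makes it easy to compare intermediate results with someone else's run.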
First, there is a well-documented bug with this project that you can find in the several threads about it: the final test has the books in reverse order. I think the number of returned books and the data structure are wrong too; you’ll need to look in the forums for discussion of those problems and adjust for the test accordingly.
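To illustrate the reverse-order point, here is a rough sketch of a helper that takes neighbors sorted nearest-first and returns them farthest-first. The function name format_recommendations, the titles and the distances are all hypothetical, and the exact return structure is an assumption that should be checked against the test cell in the notebook.

```python
def format_recommendations(query_title, neighbor_titles, neighbor_distances):
    """Pair each neighbor with its distance, skip the query book, reverse the order."""
    pairs = [
        [title, float(dist)]
        for title, dist in zip(neighbor_titles, neighbor_distances)
        if title != query_title  # the query book is usually its own nearest neighbor
    ]
    pairs.reverse()  # nearest-first becomes farthest-first
    return [query_title, pairs]


# Hypothetical neighbors, already sorted by increasing distance.
result = format_recommendations(
    "Book Q",
    ["Book Q", "Book A", "Book B"],
    [0.0, 0.2, 0.5],
)
print(result)
```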
The problems are in the processing. Two are here:
# Removing users with less than 200 ratings and books with less than 100 ratings
counts1 = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
counts2 = ratings['bookRating'].value_counts()
ratings = ratings[ratings['bookRating'].isin(counts2[counts2 >= 100].index)]
counts2 needs to be counted by the book ('ISBN') and not the rating value. These two counts need to be and’ed together (you want books with 100 ratings and users with 200 ratings). The way you have it, you’re likely pulling in some books that meet one requirement but not the other. Another may be here:
# Combining ratings and books and removing unnecessary columns
combine_book_rating = pd.merge(ratings, books, on='ISBN')
columns = ['yearOfPublication', 'publisher',
'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
# Remove rows with no title
combine_book_rating = combine_book_rating.dropna(axis=0, subset=['bookTitle'])
# Adding the total number of ratings and grouping per book
book_ratingCount = (combine_book_rating.
# Merging the previous dataframe with the ratings+books dataframe
rating_with_totalRatingCount = combine_book_rating.merge(
book_ratingCount, left_on='bookTitle', right_on='bookTitle', how='left')
# Removing duplicate ratings
rating_with_totalRatingCount = rating_with_totalRatingCount.drop_duplicates([
# Reshaping rating_with_totalRatingCount to have book titles as indices, user IDs as columns and rating as values
rating_with_totalRatingCount_pivot_with_na = rating_with_totalRatingCount.pivot(
index='bookTitle', columns='userID', values='bookRating')
rating_with_totalRatingCount_pivot = rating_with_totalRatingCount_pivot_with_na.fillna(
All I know you need is to merge the ratings and books, drop the duplicates, and create the pivot table. The rest may or may not be necessary. (I cut out most of it and got the correct results, so…)
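A minimal sketch of those three steps (merge, drop duplicates, pivot) on toy data might look like the following. The values are invented, the column names follow the ones quoted above, and filling missing ratings with 0 is an assumption rather than something confirmed in this thread.

```python
import pandas as pd

# Toy data (hypothetical); user 2 rated book "b" twice.
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2],
    "ISBN": ["a", "b", "a", "b", "b"],
    "bookRating": [5, 3, 4, 2, 2],
})
books = pd.DataFrame({
    "ISBN": ["a", "b"],
    "bookTitle": ["Book A", "Book B"],
})

# 1. Attach titles to ratings.
merged = pd.merge(ratings, books, on="ISBN")

# 2. Keep one rating per (user, title); pivot() raises on duplicate pairs.
merged = merged.drop_duplicates(["userID", "bookTitle"])

# 3. Books as rows, users as columns, ratings as values; missing ratings become 0 (assumed).
pivot = merged.pivot(index="bookTitle", columns="userID", values="bookRating").fillna(0)

print(pivot.shape)
```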
@jeremy.a.gray @Jagaya Thank you both!
After trying what you said, I found out that filtering for 200/100 ratings AT THE SAME TIME did the trick!
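For anyone landing here later, filtering "at the same time" can be sketched roughly like this: both counts are taken on the original ratings table, and the two conditions are combined with & in a single step. The data is made up and the thresholds are shrunk from 200/100 to 2/2 so the example stays small; the constant names are mine.

```python
import pandas as pd

# Toy ratings (hypothetical values).
ratings = pd.DataFrame({
    "userID":     [1, 1, 1, 1, 2, 2, 3],
    "ISBN":       ["a", "b", "c", "d", "a", "b", "c"],
    "bookRating": [5, 3, 4, 1, 2, 5, 1],
})

MIN_USER_RATINGS = 2  # 200 in the challenge
MIN_BOOK_RATINGS = 2  # 100 in the challenge

# Count on the unfiltered table: ratings per user and ratings per book (by ISBN).
user_counts = ratings["userID"].value_counts()
book_counts = ratings["ISBN"].value_counts()

keep_users = user_counts[user_counts >= MIN_USER_RATINGS].index
keep_books = book_counts[book_counts >= MIN_BOOK_RATINGS].index

# Apply both conditions together.
filtered = ratings[
    ratings["userID"].isin(keep_users) & ratings["ISBN"].isin(keep_books)
]

print(ratings.shape, "->", filtered.shape)
```

Here user 3 is dropped for having too few ratings and book "d" for being rated only once, while everything else survives.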
This was interesting and important for me.