Book Recommendation KNN System

Hi all,

I am trying to complete the ML certification project related to the Book recommendation system.

My colab file is here:

I cannot pass the test, as I'm finding only one similar book in common with the expected test results.

I could not work out whether it is related to the model training or to how the data was processed, since there is no way to check that from the test cases.

Could you provide some help on this?

https://colab.research.google.com/drive/1btmJ_98LiOhG5YMGC5TtvfBPme-QRrkF?usp=sharing

I am also stuck on this. I basically got to the same point you are but in a slightly different way.

Hey. I’m also stuck in exactly the same place… I can’t see any obvious mistakes; maybe we are supposed to use certain parameters (e.g. the number of neighbours when fitting) that we don’t know about?

Same problem here. At first I thought it was going to be easy, but I can’t find a way to make it work as it’s supposed to. I have similar results to the ones shared. Has anyone made any progress?

I’ve tried all sorts of things now, but never get the same results as the test case! One point is that there are “0” ratings, which are not bad ratings but basically mean the book has been read but not rated… Usually I would remove these. A different idea would be to replace them with the average value. Also, it makes a difference whether we first remove the users with too few entries or the books with too few entries. Still nothing… NA values can also be replaced by zero or by the average rating. The cosine similarity should first be de-biased by removing the average rating per user…
Unfortunately none of these things worked, so I think there’s probably some mistake since I’m running out of ideas quite quickly :nauseated_face:

After being stuck on this for what feels like an eternity, I have finally managed to get the required answer (I think that this answer is not the most sensible one though). This post has been around for a while but just in case anyone is still stuck, here are some steps to help you out:

  1. Do NOT remove the 0 values.
  2. Remove users that appear <200 times in the list and books that have <100 users from df_ratings (using the isbn column for the latter). This should be done at the same time (i.e. you can’t just remove users first then books from the resulting dataset).
  3. Now merge df_ratings with df_books.
  4. Drop the title duplicates with the default keep='first' parameter setting (so no need to get the max or the mean rating for the duplicated bunch) when using the drop_duplicates function.
  5. For your final answer, the recommendations sublist needs to be in reverse order, for some reason, to pass.
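Putting steps 1–4 together, here is a minimal sketch. The column names (`user`, `isbn`, `rating`, `title`) and the helper name `filter_ratings` are assumptions based on the code shared later in this thread; the thresholds default to the ones described above.

```python
import pandas as pd

def filter_ratings(df_ratings, df_books, min_user=200, min_book=100):
    # Count BEFORE filtering, so both filters use the original counts
    # (step 2: apply them "at the same time", not one after the other).
    user_counts = df_ratings['user'].value_counts()
    book_counts = df_ratings['isbn'].value_counts()
    keep_users = user_counts[user_counts >= min_user].index
    keep_books = book_counts[book_counts >= min_book].index
    df = df_ratings[df_ratings['user'].isin(keep_users)
                    & df_ratings['isbn'].isin(keep_books)]
    # Step 3: merge with the books table; step 4: drop title duplicates
    # per user, keeping the first occurrence (the drop_duplicates default).
    df = df.merge(df_books, on='isbn')
    return df.drop_duplicates(subset=['title', 'user'])
```

Note that the zero ratings are deliberately left in place (step 1).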

Wow, great, congratulations. Will def. try this… Could you please explain, what you mean in 4. (default keep value)? Thanks a lot!

There are many cases where a book title has several different rating values by the same user for different isbn entries, such as the one below.
[screenshot: several rows with the same user and title but different ISBNs and ratings]
When you get rid of duplicates, you need to keep the first one of the bunch and drop the rest, no matter if the first value is a 0 or an actual rating. If you use the drop_duplicates function (I’ll add this to the original reply), the keep parameter is set to keep='first' by default and you don’t need to change it.
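A toy illustration of the default `keep='first'` behaviour (the ISBNs and user id here are made up):

```python
import pandas as pd

# Toy example: the same user rated two different ISBN editions of one title.
df = pd.DataFrame({
    'user':   [276747, 276747],
    'title':  ['The Firm', 'The Firm'],
    'isbn':   ['0385416342', '0440245923'],  # hypothetical ISBNs
    'rating': [0, 9],
})

# keep='first' is the default: the 0-rating row survives, the 9 is dropped.
deduped = df.drop_duplicates(subset=['title', 'user'])
print(deduped['rating'].tolist())  # [0]
```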

Thanks, I would never have thought of the fact that we can have multiple isbns with the same title :frowning:
Strangely, I still don’t quite get the right result following your instructions as I understand them… Could you please post your code maybe?

Is this, what you did?

df = df_ratings

counts1 = df['user'].value_counts()
counts2 = df['isbn'].value_counts()

df = df[~df['user'].isin(counts1[counts1 < 200].index)]
df = df[~df['isbn'].isin(counts2[counts2 < 100].index)]

df = pd.merge(right=df, left = df_books, on="isbn")

df.drop_duplicates(subset=["title", "user"], inplace=True)

piv.fillna(0,inplace=True) 

Thanks!

The above line basically does nothing to df, because it just computes the result without saving it back to a variable. So if you assign that line to a variable

df = df.drop_duplicates(['title','user'])

that should fix your issue. Same with the

piv = piv.fillna(0)

line. I tried the code the way you wrote it and it works fine once you include

piv = df.pivot(index='title', columns='user', values='rating')

before the final line. So you’ve basically done it and just need to add a couple of equals signs.
:+1:

Thanks, but the inplace=True actually has this function, it means, write it back into the variable, so basically e.g. piv = piv.fillna(0) and piv.fillna(0, inplace=True) do exactly the same. Just tried it, to be quite sure, but still get exactly the same, wrong result :frowning:
Really don’t get what I’m doing wrong…
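For reference, a quick check confirming that the two forms really are equivalent:

```python
import numpy as np
import pandas as pd

piv_a = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 2.0]})
piv_b = piv_a.copy()

piv_a = piv_a.fillna(0)         # reassignment
piv_b.fillna(0, inplace=True)   # in-place modification

print(piv_a.equals(piv_b))  # True
```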

Ah sorry about that – I didn’t know about the inplace part :man_shrugging:. Everything you wrote works just fine apart from the missing

piv = df.pivot(index='title', columns='user', values='rating')

part. What do you do afterwards? Have you created a csr_matrix? What is your final output?

Oh, I have that too, not sure, why I didn’t paste it here, sorry…
Then:

titles = list(piv.index.values)
X = piv.values

#m = np.mean(X)
#X=X-m

def title_2_index(title):
  ind = titles.index(title)
  return X[ind,:]

def index_2_title(ind):
  return titles[ind]

nn = NearestNeighbors(metric="cosine",algorithm="brute", p=2)

nn.fit(X)

row = title_2_index("Where the Heart Is (Oprah's Book Club (Paperback))")

dist, inds = nn.kneighbors(np.reshape(row,[1,-1]),len(titles),True)

Looking at the result I get the following:
0.0 Where the Heart Is (Oprah's Book Club (Paperback))
0.7234864 The Lovely Bones: A Novel
0.7677075 I Know This Much Is True
0.7699411 The Surgeon
0.77085835 The Weight of Water
0.8016211 I'll Be Seeing You
0.802759 The Dive From Clausen's Pier (Alex Awards)
0.8060607 Tis: A Memoir
0.8064316 Icy Sparks
0.81043994 Unspeakable
...

So basically “0.7234864 The Lovely Bones: A Novel” should not be there; apart from that, it seems correct…

I think you got the right answer there; you just need to create the top 5 recommendations and reverse the order when you assemble the list. So once you assemble this in reverse, such that the 5th item, ["I'll Be Seeing You", 0.8016211], appears first and the 1st item, ['The Lovely Bones: A Novel', 0.7234864], appears last, you will pass the challenge.
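A sketch of that reversed assembly, assuming the dist, inds and titles variables from the code above (the helper name is made up):

```python
# Build the top-5 list (excluding the query book itself, which comes back
# first with distance 0) and reverse it so the 5th-nearest appears first.
def assemble_recommendations(dist, inds, titles, n=5):
    pairs = [[titles[i], float(d)]
             for d, i in zip(dist[0][1:n + 1], inds[0][1:n + 1])]
    return pairs[::-1]  # reversed: 5th-nearest first, nearest last
```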

It makes no sense to reverse a top 5 list from a logical point of view when thinking about a list of recommendations. I think when they wrote the test function, they got mixed up between the distance between the books (in which case lower is better) and the cosine similarity (in which case higher would be better). The test function also only lists 4 items, despite asking for a top 5 list. That’s why 'The Lovely Bones: A Novel' doesn’t appear in the test function; otherwise it would be the 5th item in their recommended_books list.

Wow, thanks a lot, this was really driving me crazy for a long time :smiley: But yeah, it seems I had the right answer and didn’t notice it. Reversing the list does seem like cheating / wrong though, because KNN always gives us the best results first, whether we use a distance or a similarity, just as you said… But I also find it a bit weird to leave the zeros in, and also that we just use the first rating (even if it’s 0)… The test function is more than weird and hacky in my opinion :expressionless: All in all a very disappointing and badly made task imho :nauseated_face: I wonder if the admins look into these things? If so: please change this task; it’s extremely frustrating and challenging, but not in a good way… :wink:
Thank you very much for your help once again, was about to give up :smiley:
PS: The next task also puzzles me; in my first tests I wasn’t even able to reach the required accuracy on the training data with a linear model. Have you already found out anything there?

I think most of the posts on the forum are for JavaScript and introductory projects. So I doubt that anyone from the fcc staff will see this post.

This task was difficult for all the wrong reasons. The other four were quite well put together, as they provide just enough guidance for you to figure out what you need to do but not to give away the answer. The issue with this particular task is that the instructions can be interpreted in many different ways, without any of them being “wrong”, which leads to many of us doing this in a way that makes sense but results in the wrong answer. There should be a little more clarity (without giving anything away) and the test function at the end should be fixed, because that makes no sense. Overall, I’m very pleased with all 5 projects and they set you up quite well for doing more projects on your own.

For the other task, you just need to build a basic network with TensorFlow rather than a linear model (I assume you tried that based on the Titanic dataset part of the accompanying lectures). They give you a hint in the imports cell

from tensorflow.keras import layers

and the absence of a linear classifier import, which suggests that you need to build a neural network with some layers. However, this is probably better suited for a separate topic or a direct message. Hope this helps.
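A minimal sketch of that kind of network, assuming numeric feature columns and a single continuous target (the layer sizes and the helper name are just illustrative, not the required architecture):

```python
import numpy as np
from tensorflow.keras import layers, models

def build_model(n_features):
    # Small dense network for regression: two hidden layers and a
    # single linear output unit (e.g. the predicted cost).
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mae', metrics=['mae'])
    return model
```

Train it with `model.fit(X_train, y_train, epochs=...)` as usual; the mean-absolute-error loss matches the metric the project checks.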

Thanks. Yes, using a simple MLP it’s quite easy to pass the task, but the title of the challenge says “Machine Learning with Python Projects - Linear Regression Health Costs Calculator”, and it’s not linear if we use an arbitrary neural network (MLP). It is possible to build a linear neural network in Keras, of course, but then I don’t even reach the required accuracy on the training data :expressionless:

I tried that too but it got me nowhere so I started fresh with a neural network.

Haha, ok, then the “linear” is hopefully just a mistake…

Do we know if this was resolved? Looking for other posts on the topic but can’t seem to find any.