I have difficulty understanding how the distance between two books is defined. A simple explanation of the definition would be helpful.
As I understand it, I should calculate the difference between the ratings that the same reader gave to the two books. My result is clearly different from the one in the example.
(Since I cannot post a link, I am just pasting the code of the distance calculation.)
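import pandas as pd

def dist(b1, b2):
    # ISBNs of every edition that carries each title
    i1=list(df_books[df_books.title==b1].isbn)
    i2=list(df_books[df_books.title==b2].isbn)
    dfr1=dfr[dfr.isbn.isin(i1)]
    dfr2=dfr[dfr.isbn.isin(i2)]
    # keep only the users who rated both books
    dfrj=pd.merge(dfr1, dfr2, how='inner', on='user')
    return dfrj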
ld=[] #list of dictionaries
def cal(b1):
    for b in dfb.title.unique():
        if b==b1: continue
        dfrj=dist(b1,b)
        if len(dfrj)<5: di=10
        else: di=sum(i**2 for i in dfrj.rating_x-dfrj.rating_y)
        ld.append({'book':b, 'dist':di})
    dfd=pd.DataFrame(ld)
    return dfd
dfd=cal('The Queen of the Damned (Vampire Chronicles (Paperback))')
dfd.sort_values('dist').head(5)
Think about what information the data provides. There is a table listing users’ ratings of different books. Can you rearrange it into a table where each book’s ratings from every user are listed? What should the rows of the new table be? And what should the columns be? As you can see, users do not rate every book, so what should we do with the missing values in the new table?
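One possible rearrangement, sketched in plain Python on made-up data (the `ratings` list, the user names and the book names are hypothetical). Here the rows are books and the columns are users, and a missing rating is filled with 0, which is only one of several possible choices:

```python
# Hypothetical long-format ratings: (user, book, rating)
ratings = [
    ("u1", "X", 8), ("u2", "X", 5),
    ("u1", "Y", 9), ("u3", "Y", 3),
    ("u2", "Z", 7), ("u3", "Z", 7),
]

# Fixed, shared ordering of users (columns) and books (rows)
users = sorted({user for user, _, _ in ratings})
books = sorted({book for _, book, _ in ratings})

# Look up a rating by (book, user); absent pairs mean "not rated"
lookup = {(book, user): rating for user, book, rating in ratings}

# One row per book, one entry per user, always in the same user order.
# Filling missing ratings with 0 is an assumption, not the only option.
table = {book: [lookup.get((book, user), 0) for user in users] for book in books}

for book, row in table.items():
    print(book, row)
```

With pandas, `DataFrame.pivot` followed by `fillna(0)` would be one way to get the same layout.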
Suppose only two users give ratings. Book A gets a rating of 10 from user 1 but only 0 from user 2. Book B gets 0 from user 1 but 10 from user 2. Book C gets 8 from user 1 and 6 from user 2. Book D gets 7 from both users. In this case, is it reasonable to think that Book C and Book D are similar in some way?
To generalize: if there are n users, each book can be represented by the vector of its n ratings, that is, a point in an n-dimensional vector space. How can we say that book C is the most similar to book D among all books? By measuring the ‘distance’ between the two end points? Or by measuring the angle between the two vectors, using the so-called cosine distance or cosine similarity, to say how closely they point in the same direction?
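To make the two options concrete, here is a small sketch using the four books from the two-user example. The helper names `euclidean` and `cosine_distance` are only for illustration, and cosine distance is assumed to mean 1 minus the cosine similarity:

```python
import math

def euclidean(u, v):
    # Straight-line distance between the end points of the two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cos(angle between u and v); assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Ratings from (user 1, user 2) for each book in the example
books = {"A": [10, 0], "B": [0, 10], "C": [8, 6], "D": [7, 7]}

# Compare book C with every other book using both measures
for other in ("A", "B", "D"):
    print(
        "C vs", other,
        "euclidean:", round(euclidean(books["C"], books[other]), 3),
        "cosine:", round(cosine_distance(books["C"], books[other]), 3),
    )
```

In this tiny example both measures rank book D as the closest to book C, and books A and B come out as far apart as cosine distance allows for non-negative ratings.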
Hello SzeYeung1,
thanks a lot for the detailed explanation!
I calculate the matrix for each book with the following function:
def matrix(b0):
    i0=list(dfb[dfb.title==b0].isbn) #book dataframe
    dfr0=dfr[dfr.isbn.isin(i0)] #rating dataframe
    m0=dfr0.rating.value_counts().rename_axis('rating').reset_index(name='counts')
    stdiff=st10.difference(set(m0.rating))
    df_diff=pd.DataFrame({'rating':list(stdiff), 'counts':[0 for _ in stdiff]})
    m0=pd.concat([m0, df_diff], ignore_index=True)
    m0=m0.sort_values('rating').reset_index(drop=True)
    return list(m0.counts) #return sorted rating counts
For the example books ‘The Queen of the Damned (Vampire Chronicles (Paperback))’ and ‘Catch 22’, I got the following rating counts:
[76, 0, 0, 0, 0, 2, 2, 5, 4, 2, 4]
[16, 0, 0, 0, 0, 0, 0, 0, 2, 1, 3]
This means that for the first book, 76 people gave a rating of 0, 4 people gave a rating of 10, and so on.
I calculated the cosine distance between these two books and got 0.0145, or 0.2922 if I ignore the rating of 0, but both differ from the example’s result of 0.7939.
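For reference, a minimal sketch of that calculation on the two count vectors above, assuming cosine distance means 1 minus the cosine similarity (the helper name `cosine_distance` is only for illustration):

```python
import math

def cosine_distance(u, v):
    # 1 - cos(angle between u and v); assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Counts of ratings 0..10 for each book, copied from the post
queen = [76, 0, 0, 0, 0, 2, 2, 5, 4, 2, 4]
catch22 = [16, 0, 0, 0, 0, 0, 0, 0, 2, 1, 3]

with_zero = cosine_distance(queen, catch22)
# Drop the first entry (the count of 0 ratings) from both vectors
without_zero = cosine_distance(queen[1:], catch22[1:])

print(round(with_zero, 4), round(without_zero, 4))  # 0.0145 0.2922
```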
Hi CID,
Sorry, I cannot fully follow your code. st10, in the line stdiff=st10.difference(set(m0.rating)), comes from nowhere; it seems to be a dataframe defined outside the function. Anyway, I understand from your description that your function is meant to return the sorted rating counts of a book. That may be more than is needed. You can simply compare the unsorted rating vectors (the arrays of each user’s ratings after data cleaning) of the two books. I have checked it, and the cosine distance between those two vectors is indeed 0.7939. I have also checked the sorted rating counts of the two books in my code, and they match yours. I think that means you have done the cleaning correctly and are on the right track.
Hello Sze Yeung,
Thanks for your fast reply. It was my mistake to leave out the definition of st10. It is used to fill in the ratings from 0 to 10, in case a book never received a particular rating, 3 for example.
But I still don’t get your idea. One book may be rated by 100 readers and the other by 150, in which case the rating vectors do not have the same length. Should I pad the shorter one with zeros at the end? Even so, I don’t get the same distance.
By the way, I’m not sure I understand the data cleaning correctly. Does ‘remove from the dataset users with less than 200 ratings’ mean that ratings from users with fewer than 200 ratings do not count?
df_ratings.rating=df_ratings.rating.astype('int8')
df_user_counts=df_ratings.user.value_counts().rename_axis('user').reset_index(name='counts')
df_book_counts=df_ratings.isbn.value_counts().rename_axis('isbn').reset_index(name='counts')
# keep users with at least 200 ratings and books with at least 100 ratings
dfr=df_ratings[(df_ratings.user.isin(df_user_counts[df_user_counts.counts>=200].user)) &
               (df_ratings.isbn.isin(df_book_counts[df_book_counts.counts>=100].isbn))]
               #&(df_ratings.rating>0)]
dfb=df_books[df_books.isbn.isin(dfr.isbn.unique())]
You can fill in missing ratings with 0, but you should not disturb the order of the users’ ratings by adding zeros at the beginning or the end, because the ratings that the same user gives to different books carry important information. Let’s say there are 4 readers and 4 books, and the ratings each book received from readers 1 to 4 are listed in an array:
book A: 1, 10, NaN, NaN
book B: 1, 10, 5, NaN
book C: 10, 1, NaN, 5
book D: NaN, NaN, 7, 7
In this hypothetical example, I think you will agree that books A and B should be similar to each other and very different from book C: reader 1 gives books A and B a rating of 1 and book C a rating of 10, while reader 2 gives books A and B a rating of 10 and book C a rating of 1. Readers 1 and 2 have totally opposing views of what makes a good book, but both identify books A and B as the same type and book C as the opposite type. If we disturb the order of the array, this important information will be lost.
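A small sketch of this point, using the arrays above with NaN replaced by 0 and cosine distance taken as 1 minus the cosine similarity (the helper name `cosine_distance` is only for illustration). Sorting each array stands in for "disturbing the order":

```python
import math

def cosine_distance(u, v):
    # 1 - cos(angle between u and v); assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Ratings from readers 1..4, in that order, with NaN filled as 0
book_a = [1, 10, 0, 0]
book_b = [1, 10, 5, 0]
book_c = [10, 1, 0, 5]

# Reader order preserved: A is close to B and far from C
d_ab = cosine_distance(book_a, book_b)
d_ac = cosine_distance(book_a, book_c)
print("A vs B:", round(d_ab, 3), "A vs C:", round(d_ac, 3))

# Reader order destroyed by sorting each array independently:
# A and C now look similar, although the readers disagreed about them
d_ac_sorted = cosine_distance(sorted(book_a), sorted(book_c))
print("A vs C after sorting:", round(d_ac_sorted, 3))
```

With the order preserved, A is much closer to B than to C. After sorting, A and C appear even closer than A and B did, which is exactly the information loss described above.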
Yes, I think you are right to exclude the records of users who made fewer than 200 ratings.