
I am reading a book (A Programmer's Guide to Data Mining) that comes with a dataset, BX-Dump, containing ratings from roughly 100k users, each of whom has rated some books. I'd like to move the whole dataset into a pandas DataFrame, which loads about 10x faster than the author's implementation:

%time r.loadBookDB('/Users/danialt/Downloads/BX-Dump/')
1700018
CPU times: user 16.1 s, sys: 373 ms, total: 16.5 s
Wall time: 16.5 s

and mine:

# Mine
%time ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
%time books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
%time users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")

#Output[5]
CPU times: user 484 ms, sys: 73.3 ms, total: 557 ms
Wall time: 567 ms
CPU times: user 1.28 s, sys: 138 ms, total: 1.41 s
Wall time: 1.45 s
CPU times: user 148 ms, sys: 25.7 ms, total: 173 ms
Wall time: 178 ms
#/Output

Now, to calculate the correlation with ratings.corr(), I need to put users on the index, books on the columns, and the ratings as values:

ratings_piv = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating') 

But this is going to fail, since it would form a 100k x 400k matrix!

Is there a better/more elegant way of calculating the correlation on this very sparse matrix without looping through each row?

Example code (do not run the last line, it will kill your RAM):

import numpy as np
import pandas as pd
import codecs
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5, "The Strokes": 2.5,
                      "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5,
                  "Deadmau5": 4.0, "Phoenix": 2.0,
                  "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0,
                  "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
                    "Norah Jones": 5.0, "Phoenix": 5.0,
                    "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                    "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
         }


class recommender:
    def __init__(self, data, k=1, metric='pearson', n=5):
        """initialize recommender
        currently, if data is a dictionary the recommender is initialized to it.
        For all other data types, no initialization occurs.
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id

    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v) for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))

    def loadBookDB(self, path=''):
        """loads the BX book dataset.
        Path is where the BX files are located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        # Now load user info into both self.userid2name and
        # self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # print(line)
            # separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + ' (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)

    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n) *
                       sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username], self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
        return distances

    def recommend(self, user):
        """Give list of recommendations"""
        recommendations = {}
        # first get list of users ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        #
        # now get the ratings for the user
        #
        userRatings = self.data[user]
        #
        # determine the total distance
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(self.k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if artist not in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist] * weight)
                    else:
                        recommendations[artist] = (recommendations[artist] +
                                                   neighborRatings[artist] * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
        # Return the first n items
        return recommendations[:self.n]


r = recommender(users)

# The author's implementation
r.loadBookDB('/Users/danialt/Downloads/BX-Dump/')

# The alternative loading
ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")

# do not run this line, it will kill your RAM
pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')

1 Answer


Be careful with benchmarks such as this one. Pandas might be doing some form of lazy loading, i.e. read_csv may return before all the data has actually been read, in which case the measured wall time would be meaningless. Try performing a simple operation on all the data to make sure it has really been read.
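For example, a quick sanity check along these lines (paths and column name taken from the question; in practice pandas' read_csv is eager, but scanning a full column still confirms the frame is fully materialised):

import pandas as pd

ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv',
                      sep=";", quotechar="\"", escapechar="\\")

print(ratings.shape)                      # rows/columns actually parsed
print(ratings['Book-Rating'].describe())  # touches every value in the column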

As for the correlation: your input matrix may be sparse, but the correlation matrix will likely not be that sparse, as there is usually at least some small correlation between any two things.

Note that the correlation matrix will be square, i.e. if you have 100k users, the user-user correlation matrix will be 100k x 100k (you can save half of that memory due to symmetry, but that doesn't help much).
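A rough back-of-the-envelope estimate, assuming dense float64 storage:

n_users = 100_000
bytes_per_value = 8                              # float64
full = n_users * n_users * bytes_per_value
print(full / 1e9)        # ~80 GB for the full user-user matrix
print(full / 2 / 1e9)    # ~40 GB even when exploiting symmetry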

If you want to speed up the computation, consider whether you need all the data, whether you need it at full precision, and whether you can exploit the data layout in memory. For example, covariance (as used in correlation) can easily be sped up by exploiting sparsity and working only on the non-zero values of each column instead of iterating over full rows.
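As a minimal sketch of that idea (not the book's code): build a scipy.sparse user-item matrix directly from the ratings frame instead of the dense pivot, and compute correlations only for the pairs you actually need. The column names come from the question; the zero-means-unrated convention and the user_corr helper are my own simplifications.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv',
                      sep=";", quotechar="\"", escapechar="\\")

# Map string IDs to dense integer indices instead of pivoting.
user_idx = ratings['User-ID'].astype('category').cat.codes.to_numpy()
item_idx = ratings['ISBN'].astype('category').cat.codes.to_numpy()

# Sparse user x item matrix: only the ratings that exist are stored.
m = csr_matrix((ratings['Book-Rating'].to_numpy(dtype=float),
                (user_idx, item_idx)))

def user_corr(a, b):
    """Pearson correlation of two users over the books both have rated
    (treating 0 as 'not rated', a simplification for the BX data)."""
    ra = m.getrow(a).toarray().ravel()
    rb = m.getrow(b).toarray().ravel()
    common = (ra != 0) & (rb != 0)
    if common.sum() < 2:
        return 0.0
    return np.corrcoef(ra[common], rb[common])[0, 1]

print(user_corr(0, 1))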

However, to become really fast, you will have to abandon thinking in matrices. Instead, consider hashing and index structures that avoid comparing everything with everything else (which is inherently quadratic in cost). And when you do think in matrices, think in sparse matrices that never actually exist in full in your memory or on disk.
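To illustrate the hashing idea (a rough sketch, not the answerer's code): MinHash signatures approximate the overlap between users' rating sets, and banding the signatures means only users who land in the same bucket are compared at all. The salting scheme and all names below are mine, chosen for illustration; the toy users dict from the question stands in for real data.

import random
from collections import defaultdict

random.seed(0)
NUM_HASHES = 32
BAND = 4
# Random salts standing in for a family of hash functions (illustrative only).
salts = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def minhash(items):
    """MinHash signature of a set of item names; similar sets agree on many minima."""
    return tuple(min(hash((salt, it)) for it in items) for salt in salts)

# user -> set of rated items (here the toy 'users' dict from the question)
user_items = {u: set(r) for u, r in users.items()}
signatures = {u: minhash(its) for u, its in user_items.items()}

# Bucket users by bands of their signature: only users sharing a bucket
# become candidate neighbours, so we never compare everyone with everyone.
buckets = defaultdict(set)
for u, sig in signatures.items():
    for i in range(0, NUM_HASHES, BAND):
        buckets[(i, sig[i:i + BAND])].add(u)

candidates = {frozenset((a, b))
              for bucket in buckets.values() if len(bucket) > 1
              for a in bucket for b in bucket if a != b}
print(len(candidates), "candidate pairs instead of all pairs")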


2 Comments

Can you elaborate on your last paragraph? Did you mean hashing like this: google.com/trends/correlate/nnsearch.pdf? Let's say I really have 300k users and 30k items. How should I proceed?
I did not read that article. I was referring to hashing as in "sketching and hashing" and "locality sensitive hashing".
