Netflix Prize Again

So, I took another look at the training data for this contest. Absolutely freaking enormous: 100 million ratings of 17000 movies from nearly 500K users. Unfortunately, the user ids run from 1 to 2.6M and have lots of gaps. So importing the ratings as a dense matrix in a C app is not an option - even with the user ids packed down to remove the gaps, 17000 movies by 500K users is over 8.5 GB at one char per rating, and indexing by the raw user ids would be several times that.

So, I thought SQL might be good - the biggest requirement here is being able to index into the data like a matrix, by user and by movie. SQL allows that well enough, especially with indexes on both of those columns. I started with SQLite3, and it seemed slow. So now I'm trying MySQL. It's slower. A lot slower. Either I suck at configuring it (possible, but it shouldn't be this bad out of the box) or it's just slower than Christmas for lots of INSERTs. I may go back to SQLite.

Either way, I need an easy read-only DB for this part. CDB seemed like a good option - CDB files are really quick, read-only, and even fast to create. The downside is that CDB is a hash table only. Since I need a matrix-style system, that seems like a bad plan. I could just use "userid.movieid" as the key, but then I can't read in all the ratings from a user or all the ratings on a movie. Even duplicating the data in different views doesn't help.

This needs to be either SQL or a matrix - there are things I'll need to do that only work with those access methods. The rest of this program can be done in C/C++ without databases, but this part really needs to be a DB. Definitely going back to SQLite though. And then I'm going to let this run overnight. Because it's going to take that freaking long to import this data.
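
For illustration, here is a minimal sketch of the matrix-style layout in SQLite, written in Python with the standard sqlite3 module rather than C to keep it short. The table and column names (ratings, user_id, movie_id, rating), the helper function names, and the toy rows are all made up for this example and are not from the contest files. It also assumes the slow import comes from committing every INSERT separately, which is a common cause but only a guess here, so it loads the whole batch in one transaction and builds the movie index afterwards.

```python
import sqlite3


def build_ratings_db(path, ratings):
    """Create a ratings table and bulk-load (user_id, movie_id, rating) tuples.

    All names here are hypothetical; adapt them to the real data files.
    """
    conn = sqlite3.connect(path)
    # The composite primary key covers "row" access: lookups by user,
    # or by (user, movie) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings ("
        " user_id INTEGER NOT NULL,"
        " movie_id INTEGER NOT NULL,"
        " rating INTEGER NOT NULL,"
        " PRIMARY KEY (user_id, movie_id))"
    )
    # `with conn` commits once at the end, so the whole batch is a
    # single transaction instead of one commit per INSERT.
    with conn:
        conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", ratings)
        # The secondary index covers "column" access: lookups by movie.
        # It is created after the load so inserts do not have to maintain it.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_ratings_movie ON ratings (movie_id)"
        )
    return conn


def ratings_by_user(conn, user_id):
    """Return [(movie_id, rating), ...] for one user, like reading a matrix row."""
    cur = conn.execute(
        "SELECT movie_id, rating FROM ratings WHERE user_id = ? ORDER BY movie_id",
        (user_id,),
    )
    return cur.fetchall()


def ratings_for_movie(conn, movie_id):
    """Return [(user_id, rating), ...] for one movie, like reading a matrix column."""
    cur = conn.execute(
        "SELECT user_id, rating FROM ratings WHERE movie_id = ? ORDER BY user_id",
        (movie_id,),
    )
    return cur.fetchall()


def rating(conn, user_id, movie_id):
    """Return a single cell of the matrix, or None if the user never rated the movie."""
    row = conn.execute(
        "SELECT rating FROM ratings WHERE user_id = ? AND movie_id = ?",
        (user_id, movie_id),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    # Toy rows only; the gaps in the user ids cost nothing in this layout.
    toy = [(6, 1, 3), (6, 2, 5), (2600000, 1, 4)]
    db = build_ratings_db(":memory:", toy)
    print(ratings_by_user(db, 6))
    print(ratings_for_movie(db, 1))
    print(rating(db, 6, 2))
```

Because only the ratings that exist are stored, the sparse user ids stop mattering, and both the per-user and per-movie reads stay indexed. Whether this is fast enough at 100 million rows would still need to be measured.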

About this Entry

This page contains a single entry by Scott published on May 28, 2008 1:11 AM.
