So, I took another look at the training data for this contest. Absolutely freaking enormous: 100 million ratings of 17,000 movies from nearly 500K users. Unfortunately, the user ids run from 1 to 2.6M with lots of gaps. So importing the ratings as a matrix in a C app is not an option - the matrix would be over 8.5 GB, even using just chars.

So, I thought SQL might be good - the biggest requirement here is being able to index into the data like a matrix, by user or by movie. SQL allows that well enough, especially with indexes on both of those columns. I started with SQLite3, and it seemed slow. So now I'm trying MySQL. It's slower. A lot slower. Either I suck at configuring it (possible, but it shouldn't be this bad out of the box) or it's just slower than Christmas for lots of INSERTs. I may go back to SQLite.

Either way, I need an easy read-only DB for this part. CDB seemed like a good option - CDB files are really quick to read, read-only, and even fast to create. The downside is that CDB is a hash table only. Since I need a matrix-style system, that seems like a bad plan. I could just use "userid.movieid" as the key, but then I can't read in all the ratings from one user or all the ratings on one movie. Even duplicating the data in different views doesn't help.

This needs to be either SQL or a matrix - there are things I'll need to do that only work with those access methods. The rest of this program can be done in C/C++ without databases, but this part really needs to be a DB. Definitely going back to SQLite though. And then I'm going to let this run overnight. Because it's going to take that freaking long to import this data.
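
For the curious, here is a minimal sketch of the SQLite layout described above, written in Python for brevity rather than C. Everything named here (the `ratings` table, its column names, the index names, the helper functions, the sample rows) is made up for illustration and is not the actual import code. It also assumes that batching all the INSERTs into a single transaction and building the indexes after the load helps with the slow import, which is commonly suggested for SQLite but was not measured on this data set. The first helper just redoes the matrix arithmetic: 17,000 x 500K one-byte cells is 8.5 GB, and indexing by raw user id (up to 2.6M) would be several times larger.

```python
import sqlite3


def dense_matrix_bytes(n_movies, n_users, bytes_per_cell=1):
    """Size of a dense movies-by-users matrix, one cell per possible rating."""
    return n_movies * n_users * bytes_per_cell


def build_db(rows, path=":memory:"):
    """Load (user_id, movie_id, rating) tuples into a hypothetical ratings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE ratings ("
        " user_id INTEGER NOT NULL,"
        " movie_id INTEGER NOT NULL,"
        " rating INTEGER NOT NULL)"
    )
    # Assumption: one transaction around the whole batch, instead of one
    # commit per INSERT. The 'with' block commits once at the end.
    with conn:
        conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    # Indexes on both columns, created after the load, so the table can be
    # read by user or by movie, like a row or a column of the matrix.
    conn.execute("CREATE INDEX idx_ratings_user ON ratings (user_id)")
    conn.execute("CREATE INDEX idx_ratings_movie ON ratings (movie_id)")
    conn.commit()
    return conn


def ratings_by_user(conn, user_id):
    """All (movie_id, rating) pairs for one user: a 'row' of the matrix."""
    cur = conn.execute(
        "SELECT movie_id, rating FROM ratings"
        " WHERE user_id = ? ORDER BY movie_id",
        (user_id,),
    )
    return cur.fetchall()


def ratings_for_movie(conn, movie_id):
    """All (user_id, rating) pairs for one movie: a 'column' of the matrix."""
    cur = conn.execute(
        "SELECT user_id, rating FROM ratings"
        " WHERE movie_id = ? ORDER BY user_id",
        (movie_id,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    # Made-up sample rows; the sparse user ids mimic the gaps in the real ids.
    sample = [(6, 1, 3), (2649429, 1, 5), (6, 2, 4)]
    db = build_db(sample)
    print(ratings_by_user(db, 6))
    print(ratings_for_movie(db, 1))
    print(dense_matrix_bytes(17000, 500000) / 1e9, "GB")
```

The gaps in the user ids cost nothing here, since only ratings that exist get a row. A real import would stream tuples from the data files into `build_db` with a file path instead of `:memory:`.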
Netflix Prize Again