May 2008 Archives

SQLite is slow

Since this is apparently breaking news at this point: SQLite is really, really slow at creating indices on big tables. It’s still trying to create the first index on users. I’m about to kill it, wipe the database, and set the indices up to be created while inserting the data. Maybe that will go faster.
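To make the two options concrete, here is a minimal sketch that loads the same rows twice, once with the index created before the inserts and once with it created afterwards, and times both. The table layout, the index name, and the synthetic data are assumptions for illustration, not the actual Netflix import, and the timings on a small in-memory table may not carry over to 100 million rows on disk.

```python
import random
import sqlite3
import time


def synthetic_rows(n, seed=0):
    """Made-up (user_id, movie_id, rating) tuples with sparse user ids."""
    rng = random.Random(seed)
    return [
        (rng.randint(1, 2_600_000), rng.randint(1, 17_000), rng.randint(1, 5))
        for _ in range(n)
    ]


def load_ratings(rows, index_first):
    """Load rows into an in-memory table; return (connection, seconds taken)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER)"
    )
    start = time.perf_counter()
    if index_first:
        # Index exists up front, so every INSERT also has to update it.
        conn.execute("CREATE INDEX idx_ratings_user ON ratings (user_id)")
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    if not index_first:
        # Index is built in one pass over the finished table.
        conn.execute("CREATE INDEX idx_ratings_user ON ratings (user_id)")
    conn.commit()
    return conn, time.perf_counter() - start


if __name__ == "__main__":
    rows = synthetic_rows(200_000)
    for index_first in (True, False):
        conn, seconds = load_ratings(rows, index_first)
        label = "index first" if index_first else "index last"
        print(f"{label}: {seconds:.2f} s")
        conn.close()
```

Running it on a sample of the real data would be a cheap way to pick an approach before committing to another multi-hour run.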

Because while I love Ruby on Rails, it’s just too damn slow to run my blog on my Slicehost 256 MB VPS, at least with mod_rails, and I’m too lazy to figure out Mongrel or any of the other Rails systems. It also seems less than stable, especially with how much I’m messing with it. Plus, Movable Type has an iPhone interface plugin.

NSTX

So, NSTX is awesome. I started playing with this just before I left to come home, but my DNS changes didn't have time to propagate until tonight. It's actually quite easy to set up, and surprisingly fast - at least from where I'm currently testing it (my home in Midland). Definitely usable. There's also an IP-over-ICMP program; I may look into it next. That approach seems to have some advantages over the DNS one, and could be an interesting project to hack on. It looks easier to set up, too - no DNS changes that require propagation.

Netflix update: still creating the users index. I think creating the index while inserting data may have been a better idea after all. So far, top reports 34:54.75 of runtime. I'll let it keep running and check it again tomorrow.

Netflix Update

Importing the ratings took about 2 hours. That was with a little tuning, too. Now I'm trying to generate those indices. It's going to take just as long. Ugh. Too much data.
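The post doesn't say which settings were tuned, so the following is only a sketch of the usual suspects for a one-off SQLite bulk load: relaxed durability pragmas and large batches inside explicit transactions. The function name `bulk_import`, the table layout, and the batch size are all assumptions for illustration.

```python
import itertools
import sqlite3


def bulk_import(db_path, rows, batch_size=50_000):
    """Insert (user_id, movie_id, rating) tuples in batches; return rows inserted."""
    conn = sqlite3.connect(db_path)
    # Assumed tuning: trade crash safety for speed during a throwaway import.
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA journal_mode = MEMORY")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(user_id INTEGER, movie_id INTEGER, rating INTEGER)"
    )
    total = 0
    source = iter(rows)
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch instead of one per row
            conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", batch)
        total += len(batch)
    conn.close()
    return total
```

Because `rows` can be any iterable, the ratings can be streamed from the source files through a generator rather than held in memory all at once.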

Netflix Prize Again

So, I took another look at the training data for this contest. Absolutely freaking enormous: 100 million ratings on 17,000 movies from nearly 500K users. Unfortunately, the user ids run from 1 to 2.6M and have lots of gaps. So importing the ratings as a matrix in a C app is not an option - even with the ids remapped to close the gaps, a dense matrix would be over 8.5 GB using just chars, and roughly five times that if indexed by raw user id.

So I thought SQL might be good - the biggest requirement here is being able to index into the data like a matrix, by user and by movie. SQL allows that well enough, especially with indexes on both of those columns. I started with SQLite3, and it seemed slow. So now I'm trying MySQL. It's slower. A lot slower. Either I suck at configuring it (possible, but it shouldn't be this bad out of the box) or it's just slower than Christmas for lots of INSERTs. I may go back to SQLite.

Either way, I need an easy read-only DB for this part. CDB seemed like a good option - it's really quick, read-only, and even fast to create. The downside is that it is hash-table only. Since I need a matrix-style system, that seems like a bad plan. I could just use "userid.movieid" as the key, but then I can't read in all the ratings from a user or all the ratings on a movie. Even duplicating the data in different views doesn't help.

This needs to be either SQL or a matrix - there are things I'll need to do that only work with those access methods. The rest of this program can be done in C/C++ without databases, but this really needs to be a DB. Definitely going back to SQLite though. And then I'm going to let this run overnight, because it's going to take that freaking long to import this data.
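For what "index in like a matrix" could look like in SQLite, here is a rough sketch with one index per column, so that a row (all ratings by a user), a column (all ratings on a movie), or a single cell can each be fetched with one query. The table, index, and function names are made up for illustration, and the three sample rows are not real contest data.

```python
import sqlite3


def open_ratings(rows):
    """Build an in-memory ratings table with an index on each 'matrix axis'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER)"
    )
    with conn:
        conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_ratings_user ON ratings (user_id)")
    conn.execute("CREATE INDEX idx_ratings_movie ON ratings (movie_id)")
    return conn


def ratings_by_user(conn, user_id):
    """One matrix row: {movie_id: rating} for a single user."""
    cur = conn.execute(
        "SELECT movie_id, rating FROM ratings WHERE user_id = ?", (user_id,)
    )
    return dict(cur.fetchall())


def ratings_for_movie(conn, movie_id):
    """One matrix column: {user_id: rating} for a single movie."""
    cur = conn.execute(
        "SELECT user_id, rating FROM ratings WHERE movie_id = ?", (movie_id,)
    )
    return dict(cur.fetchall())


def rating_at(conn, user_id, movie_id):
    """One matrix cell, or None if that user never rated that movie."""
    row = conn.execute(
        "SELECT rating FROM ratings WHERE user_id = ? AND movie_id = ?",
        (user_id, movie_id),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_ratings([(6, 1, 3), (6, 2, 5), (2_600_000, 1, 4)])
    print(ratings_by_user(conn, 6))
    print(ratings_for_movie(conn, 1))
    print(rating_at(conn, 6, 2))
```

Gaps in the user ids cost nothing here, since only ratings that actually exist take up rows.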

Netflix Prize

So, the Netflix Prize has me intrigued. Really, really intrigued. I'm not good enough at machine learning, stats, or whatever to win, but I'm going to be playing with it in the future. Code to follow. I'll be implementing something like this at first.

Welcome

So, I finally created a blog. We’ll see how this goes…I’ll post something a little more interesting later today. Just wanted to say Welcome!

About this Archive

This page is an archive of entries from May 2008 listed from newest to oldest.
