Thursday, 9 September 2010

Roomba Recon - A musichackday brain dump

So this past weekend I attended the 2nd (annual?) London Music Hackday at the Guardian's offices at King's Cross. For the hack I created an algorithm that generates playlists between arbitrary start and end songs on soundcloud. It does this with almost no pre-indexing, allowing for playlists to cover the entire network and always use an up-to-date graph. It's (mostly) running live if you'd like to play with it.

Briefly, it performs a sort of bastardized A* search, bilaterally from both the start and the end song, to form the playlist. There's a parameter to limit the length of the two playlist segments; by default this is 4, so the max playlist length is 10 (2*4+2 for the end points).

The search algorithm collects the social links of the artist corresponding to the given song. For each of these connections (you know, 'friends', or in soundcloud jargon 'followings'), the cost of adding that artist's song is calculated in the following way (for the half built from the start song):
cost(n, m) = d(n, m) / d(m, e)

where cost(n, m) is the cost to add song m to the list after song n, d(n, m) is some measure of distance from song n to song m, and d(m, e) is the same measure of distance from song m to song e. Song e is the end song for the whole playlist. So basically the idea is that the cost of moving to a node is the ratio of how far away it is from where you were to how far it is from where you're trying to get. The whole thing is reversed for the other half, so the cost function makes it cheap to move toward the start song. If you simply want to randomly traverse social links, the cost can be set to an arbitrary equal value (I used 1) for all links.
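As a rough sketch, the cost and one greedy expansion step might look like the following. The names here (cost, expand, distance, get_followings) are my own stand-ins, not the actual code; the real thing fetches followings from the soundcloud API and plugs in one of the distance measures described below.

```python
def cost(n, m, e, distance):
    """Cost of adding song m after song n while aiming for end song e:
    the ratio of the distance from n to m over the distance from m to e."""
    return distance(n, m) / distance(m, e)

def expand(current, target, distance, get_followings):
    """One search step: among the current artist's followings, pick the
    candidate song with the cheapest cost. The backward half of the
    search is the same with start and end swapped."""
    candidates = get_followings(current)
    return min(candidates, key=lambda m: cost(current, m, target, distance))
```

With songs stood in by points on a number line and distance(a, b) = abs(a - b), expanding from 0 toward 10 over candidates [2, 5, 9] picks 2, since cost(0, 2) = 2/8 is the smallest ratio.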

This leaves the matter of distance.

Starting with what I know best, I decided to try a content-based distance first. I should say that from the outset I figured this would be insanely slow, but nonetheless I gave it a go. I implemented (available directly as well) a little object that grabs the echonest timbre features for any two soundcloud songs, summarizes each track's features as a single multidimensional gaussian (mean and std), then takes the cosine distance between the two tracks (other distance metrics could be computed as well, but cosine seemed reasonable). That takes something on the order of 45 seconds for every pair of tracks, so when using it in the above playlister the whole thing would take maybe 4 hours (I think; I never actually let it complete). Clearly way too slow.
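The summarize-and-compare step boils down to something like this. It's a sketch under the assumption that the per-segment timbre vectors have already been fetched (in practice that fetching, via the echonest wrapper, is where most of the ~45 seconds goes), and the function names are mine, not the actual object's.

```python
import numpy as np

def summarize(timbre_segments):
    """Collapse a track's per-segment timbre vectors into one gaussian
    summary: the per-dimension mean concatenated with the per-dimension std."""
    t = np.asarray(timbre_segments)
    return np.concatenate([t.mean(axis=0), t.std(axis=0)])

def cosine_distance(a, b):
    """1 minus the cosine similarity of two summary vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def content_distance(segments_a, segments_b):
    """Content-based distance between two tracks' timbre segment lists."""
    return cosine_distance(summarize(segments_a), summarize(segments_b))
```

A track compared against itself comes out at distance 0, as you'd hope.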

So, taking inspiration from my about-to-be-published work at WOMRAD, I thought some NLP could save the day. The other distance measure I implemented (no direct access yet) is based on a track's tags and comments. First I tokenize the comments and combine them with the tags to create a vector space model of the track's descriptive text. I then weight everything using tfidf (the idf was populated with a random sample of tracks from across soundcloud that I gathered over the weekend, about 41,000 tracks in total; this is the only indexing that is done in advance). From the tfidf-weighted terms in the vector space, I take the cosine distance. This is both quite quick and gives pretty good results.
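Stripped of the gensim plumbing, the text distance amounts to something like this sketch. The tiny idf dict here is a hypothetical stand-in for the table built from the ~41,000-track sample, and the function names are mine.

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """Build a sparse tfidf vector (term -> weight) from a token list,
    using a precomputed idf table; unseen terms get zero weight."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

def cosine_distance(u, v):
    """1 minus cosine similarity over sparse term->weight dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def text_distance(tokens_a, tokens_b, idf):
    """Distance between two tracks' tag+comment token lists."""
    return cosine_distance(tfidf(tokens_a, idf), tfidf(tokens_b, idf))
```

Identical token lists come out at 0 and tracks sharing no weighted terms at 1, so the measure slots straight into the cost ratio above.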

Everything was built in python; the app is running in cherrypy, using numpy and scipy for the data handling and gensim for the tfidf-related bits. Soundcloud and echonest interaction is all via their respective python wrappers. There's also a more terse write-up over at the musichackday wiki. I'll stick the code in my repository on github once it's cleaned up a bit (though that might be a little while, as I seem to be rather busy with something at the moment...)

Right. Back to writing my thesis.
