Tuesday, 14 July 2009

musichackday

So I spent the weekend holed up in the Guardian offices at the musichackday. I went in with some perhaps overly ambitious plans to generate playlists across the SoundCloud user graph, with song selection optimized using features from theechonest. This might have been barely possible if I had been working with a couple of other people of similar background, but circumstances led to me hacking mostly solo at this particular event.

In the end I spent a substantial amount of time beating the SoundCloud python wrapper into being more helpful for what I wanted it to do (which is perhaps not what its envisioned use was, but hey, that's what hacks are for), namely walking the user (artist) space and creating a Complex Network so I can move the playlist generation tools we've created around myspace crawls over to SoundCloud.

So, to that end, I've created some bits of python that walk through the user graph on SoundCloud and build a graph using iGraph. This code base is living over at a new github repository I've created called pySomethingClever. Included over there are diff files documenting the changes I made to the official SoundCloud-api-wrapper, which will enable any willing victims to grab and run the hacky bits of code I have up.
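For a rough idea of what such a walk looks like, here is a minimal sketch of a breadth-first crawl over a follow graph. It is not the code in the repository: `get_followings` is a hypothetical stand-in for whatever API call returns the users a given user follows, the toy dictionary replaces the real service, and plain sets are used instead of iGraph so the snippet runs on its own.

```python
from collections import deque


def crawl_follow_graph(seed, get_followings, max_users=None):
    """Breadth-first walk over a follow graph.

    get_followings is a hypothetical callable: user id -> iterable of the
    user ids that user follows. Returns (vertices, edges), where edges are
    directed (follower, followed) pairs.
    """
    vertices = {seed}
    edges = set()
    queue = deque([seed])
    expanded = 0  # number of users whose outlinks have been fetched
    while queue and (max_users is None or expanded < max_users):
        user = queue.popleft()
        expanded += 1
        for followed in get_followings(user):
            edges.add((user, followed))
            if followed not in vertices:
                vertices.add(followed)
                queue.append(followed)
    return vertices, edges


# Toy stand-in for the real API: who follows whom.
toy_follows = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
vertices, edges = crawl_follow_graph("a", lambda u: toy_follows.get(u, []))
```

Passing `max_users` stops the walk early, which leaves vertices that have been seen but whose outlinks were never fetched.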

Once I got the api wrapper to a place where it could do a bit of what I wanted, I fired off a crawl. I got through about 4,000 users (roughly 2.3% of a complete user network of about 170k nodes) in SoundCloud's network before the presentations started on Sunday. To clarify slightly, the network contains all the users of SoundCloud, but only the outlinks (users a given user follows) from 4,000 nodes. This is to say I had a (mostly) complete vertex list and a very incomplete edge list. With the super great help of kurtjx this sampled network was pushed through the lanet k-core decomposition visualization to draw out some of the community structure and related forms of the sample graph. Here's that graph:



The size of each node is tied to the number of links (either direction) touching that node. The color and placement have to do with how critical the node is to the rest of the network maintaining its current state of connectedness.
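To make the k-core idea a bit more concrete, here is a small sketch of computing core numbers by repeatedly peeling off the lowest-degree node. This is a generic illustration, not the visualization tool's own implementation, and it assumes the follow edges are treated as undirected (a reciprocal follow counts as a single link). A node's core number is the largest k for which it still belongs to a subgraph where every node has at least k neighbours.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected view of an edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    remaining = set(neighbors)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the fewest remaining neighbours.
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in neighbors[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


# A triangle (a, b, c) with one pendant node d hanging off c.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
toy_cores = core_numbers(toy_edges)
```

In the toy graph the triangle ends up in the 2-core while the pendant node sits in the 1-core. The linear scan for the minimum is fine for a toy example but would be slow on a network with 170k nodes.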

Since the hack I've continued to gather edges toward a complete representation of SoundCloud. I currently have the outlink edges from more than 17,000 SoundCloud users (about 10% of the user base) and should have a full capture in the next few days. Below you can see the same visualization with the edges from 16,000 users (the graph is set to write every 2k):




As the crawl continues, my guess is the middle bits will continue to fill in, which would be expected if SoundCloud behaves in the usual Power Law fashion (as most of The Internet's networks, social or otherwise, tend to).
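One quick way to eyeball that guess is to tally how many users have each follower count. The sketch below assumes edges are stored as (follower, followed) pairs, which is my own choice of representation here, and it only builds the histogram. It is not a statistical test for a power law, and users with no followers do not appear in it.

```python
from collections import Counter


def in_degree_histogram(edges):
    """Map follower count -> number of users with that many followers."""
    in_degree = Counter(followed for _, followed in edges)
    return Counter(in_degree.values())


# Toy edge list: c has three followers, a and b have one each.
sample_edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "a")]
histogram = in_degree_histogram(sample_edges)
```

In a heavy-tailed network the histogram would show a great many users with only a handful of followers and a very small number with a huge following.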

It should be noted that these visualizations, while very interesting, are just the beginning of what is possible once the whole user network is captured. I'm going to be building some playlist generators and recommenders around this in the coming weeks. If things look good (and from here I'm quite excited) I'll push some of it to the ISMIR late breaking demos and possibly to AdMIRe. More to come!