Friday 29 June 2012

Licensed listening based on the habits of pirates or lessons from sloppy item resolution

from flickr user nozoomii, CC by-nc-sa
So I spent Thursday and Friday a couple of weeks ago at the Barcelona Music Hack Day, part of the Sónar Music Festival. There were loads of excellent hacks (full list), including my own, Legalize It!. In this post I'm going to go into a bit more depth about the hack, lessons learned, and teasers for things I might do next.

Motivation

The core idea is a simple one – straightforward listening to things that are popular on BitTorrent (note that "popular on BitTorrent" is a slightly fuzzy concept, since BitTorrent is a protocol for ad-hoc distribution, but we'll get back to that in a bit), without all the nastiness (and DNS blocking!) of looking at, say, The Pirate Bay's top music torrents (that's a proxy of TPB, btw). And of course this removes any legal trouble that would be associated with gathering and listening to music via those torrent charts.

How it all works


tl;dr - It's a torrent-chart-metadata-based content resolver, written in Python and JS; you can fork the code.

Legalize It! has two parts, a client and a server. The server is a fairly simple Pyramid web server with two main tasks (it's deployed on Heroku). The first is fetching the torrent charts and resolving torrent release groups to legal streaming albums (Spotify, currently). This is simply a matter of fetching the daily torrent release-group chart from the Musicmetric API (full disclosure: they're my employer and I wrote most of the chart endpoints...), then walking through the top N items that look like albums and matching them to Spotify albums. The matching is done through a fantastically naive string title + artist search on Spotify's metadata API via the very useful Spotimeta Python wrapper. This album resolution process has a simple web interface (if it returns an error, try a refresh; Heroku workers on the free tier sleep a bit too much) that I mostly built for testing, but it can be quite useful without the commitment of installing the Spotify app. In addition to the human-readable page, you can get the response back as JSON, which is handy on the client side.
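As a rough illustration, the matching step amounts to something like the sketch below. The spotimeta call and the result's field names are from memory and should be treated as assumptions; check the wrapper's docs for the real shapes.

```python
# A minimal sketch of the naive matching step, assuming spotimeta's
# search_album helper and these result field names (both assumptions).
import spotimeta

def resolve_release(artist, title):
    """Match a torrent release group to a Spotify album URI, very naively."""
    results = spotimeta.search_album("%s %s" % (artist, title))
    for album in results.get("result", []):
        # Take the first hit whose artist name matches, case-insensitively.
        if album.get("artist", {}).get("name", "").lower() == artist.lower():
            return album["href"]  # e.g. "spotify:album:..."
    return None
```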

The second job for the server is necessary to help select which songs from the top albums we'll be listening to. The final Spotify app will only select one song per album, so users can get a taste of every album in the top N without having to listen to N complete albums. But this should be done with care and grace, to enhance the listening experience. Thankfully, with the assistance of The Echo Nest's searchable audio summary song-level features, this is quite easy. Using their API we can search for a song by title and artist name (see a pattern of terrible ID matching forming? What can I say, it's a hack.) and get back a set of descriptors that looks like this:
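The field names below are the standard Echo Nest audio summary descriptors; the values are illustrative only.

```python
# Illustrative Echo Nest audio summary for a single track (values made up).
{
    "key": 1,
    "mode": 0,
    "time_signature": 4,
    "duration": 251.27,
    "tempo": 117.02,
    "loudness": -6.95,
    "danceability": 0.74,
    "energy": 0.81,
    "liveness": 0.11,
    "speechiness": 0.05,
}
```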


Once we have the set of song-level descriptors for every song in the neighbouring albums, it's simply a matter of minimizing the step size between consecutive songs. I've done a bit of work on playlists before, and this seemed like a reasonable approach. While there are a number of approaches to step-size minimization, for this particular application we're doing a greedy sort of optimization that goes something like:

  1. Select the song a in album A and the song b in album B such that, for a given audio descriptor (we'll use danceability by default, but it could just as easily be a different measure, e.g. tempo or loudness), the absolute difference between the descriptor of a and the descriptor of b is minimized. That is to say, we're looking for the song pair from these two albums that is closest in terms of whatever descriptor is being used.
  2. Select song c from album C such that the absolute difference from its audio descriptor to song b's is similarly minimized.
  3. Repeat (2) for the remaining albums, using the last chosen song against the next album.
It's worth noting that this is not the globally optimal shortest path from the first album to the last, but it comes with a tremendous advantage -- the first two songs are selected without any need to deal with the rest of the playlist, which can be worked on over time. This allows for pseudo-real-time playlist creation, since we just need to know the next song before the current one is done playing. A quick sketch of the greedy selection follows below.
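Here's that selection in miniature, assuming each album has already been flattened to a list of (track URI, descriptor value) pairs for whichever summary feature is in play:

```python
# Greedy step-size minimization: one song per album, chained left to right.
def greedy_playlist(albums):
    """albums: list of lists of (track_uri, descriptor_value) pairs."""
    first, second = albums[0], albums[1]
    # Step 1: exhaustively find the closest pair across albums A and B.
    a, b = min(
        ((x, y) for x in first for y in second),
        key=lambda pair: abs(pair[0][1] - pair[1][1]),
    )
    picks = [a, b]
    # Steps 2-3: chain each remaining album off the last chosen song.
    for album in albums[2:]:
        prev = picks[-1]
        nxt = min(album, key=lambda song: abs(song[1] - prev[1]))
        picks.append(nxt)
    return [uri for uri, _ in picks]
```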

To facilitate this algorithm, the server has a hook that performs step (1) or (2) on either a pair of albums or a song and an album (specified as Spotify URIs). It can be accessed via a URL that looks like
http://legalize-it.herokuapp.com/paired/[spotify URI of track or album]/[spotify URI of album]
For example, http://legalize-it.herokuapp.com/paired/spotify:album:6LBiuhK7PZKjVXyMfPxPoh/spotify:album:5nVUqrdkEMlWTm9sqjrYBt returns a bit of JSON showing the closest (by danceability) song pair between My Beautiful Dark Twisted Fantasy by Kanye West and Up All Night by One Direction. This method also supports distance by any other audio summary feature, specified by adding it to the end of the URI. For example, http://legalize-it.herokuapp.com/paired/spotify:album:6LBiuhK7PZKjVXyMfPxPoh/spotify:album:5nVUqrdkEMlWTm9sqjrYBt/tempo finds the closest song pair from the same two records, but by tempo rather than by danceability.
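Hitting the hook from Python is a one-liner with requests; the exact keys in the returned JSON are best discovered by inspection, so this just dumps whatever comes back.

```python
# Fetch the closest song pair (by tempo) between the two example albums.
import requests

BASE = "http://legalize-it.herokuapp.com/paired"
kanye = "spotify:album:6LBiuhK7PZKjVXyMfPxPoh"
one_direction = "spotify:album:5nVUqrdkEMlWTm9sqjrYBt"

resp = requests.get("%s/%s/%s/tempo" % (BASE, kanye, one_direction))
print(resp.json())
```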

These two features are laced together in the client, which is a simple Spotify app: basically just a small bit of jQuery that grabs the list of the top 25 albums, then asynchronously gets the selected track for each album, updating the playlist as each track is selected.

The Legalize It! app, as demo'd at the Music Hack Day
If you'd like to install the App, here are some instructions for installation in developer mode.  I'll be cleaning up the UI to conform to Spotify's app guidelines and submitting the app to their App finder thing, but that will take a while.

Going Forward

While this is mostly complete for what it does, there are a number of feature adds that I'll be slowly dealing with as time goes on. Highest among these is taking the same idea and applying it to different resolvers (maybe use Tomahawk rather than picking one...). There are also some smaller feature adds to the app, things like autoplay once the first track has loaded and switching between different Echo Nest summary features or Musicmetric charts (e.g. P2P release groups by acceleration).

Also, thanks very much to Spotify, who awarded me a prize for my hack!

So what do you think? Should we care about the taste of a bunch of peers on BitTorrent? Or am I doing it wrong?

Tuesday 22 May 2012

Some brief thoughts on the London homebrew scene

Not that homebrew, the beer one.

A few weeks back I was asked by The Strongroom bar to write up a few words on homebrewing in London for the literature at their London Beer Festival. Unfortunately, it had to be cut to make room for some write-ups from breweries added at the last minute, but it occurs to me that it might be interesting to others, so I'm throwing it up here. Without further ado:

---
The Rise and Rise of Homebrewing

Homebrewing in London is enjoying something of a comeback. Over the last few years, along with the rise (some might say return) of craft and artisanal beer in London, amateur and hobby brewing has been growing mightily. Like the current trends in professional craft brewing, these new hobbyists are focusing on quality and experimentation over matters of the bottom line. In London, this DIY brewing movement is focused in two clubs, each on different sides of the city and each with a different emphasis.

Meeting in East London is the London Amateur Brewers (LAB), of which I am a member. Until its recent closure, LAB met monthly at The Wenlock Arms in Hackney (for current meeting locations, consult the website). LAB has an open structure, operating without formal dues and with a very small set of officers. The meetings consist of a short technical talk on some aspect of the brewing process or an overview of a particular style of beer, followed by tastings of members' beers. A typical meeting will involve tasting 8-12 beers over an hour to an hour and a half. These beers are typically very diverse in style, and at a single meeting you can encounter everything from a best bitter to a new-world IPA to a Belgian saison. In addition to its regular meetings, LAB holds homebrewing competitions and festivals, the last of which was on 12 November 2011 in Wimbledon.

Across London in Durden Park is the eponymous Durden Park Beer Circle. The Beer Circle is a more formal group than LAB, having both a formal membership process and thematic meetings, where the homebrew tastings all keep to a particular style, which changes from month to month. Additionally, this group has something of a focus on understanding and preserving the historical beers of Britain. Over the years they have sought out, archived, and tested accurate recreations of many styles of beer long out of fashion. These recipes have been gathered into a book the group puts out, "Old British Beers and How To Make Them", an excellent resource in any homebrewer's library. In fact, you can taste some (slightly modified versions) of the recipes from this book in action at some of London's fine craft breweries, where it has served as inspiration for novel interpretations of classic local styles, most especially porters and stouts.

Think you might want to give homebrewing a try? It's easier than you might think. Come to a meeting (if you aren't in London, the Craft Brewers Association can point you in the right direction) or simply give it a try in your kitchen. Aside from the previously linked websites, information to get you started can be found at How to Brew, Homebrewing Stack Exchange, The Homebrewer's Association, and Jim's Beer Kit, among others. Good luck and happy brewing!


Monday 17 October 2011

Upcoming talks and travels

I'm going to be up to a number of things that may be of interest to the readers of this (rather sparse) blog.

Quick summary


Our workshop (the other co-chairs are Amélie Anglade, Òscar Celma, Paul Lamere, and Brian McFee) will cover a diverse array of approaches and angles for music recommendation and discovery. The workshop runs the full day and is part of RecSys 2011, though I sadly can't stay for much of the conference beyond our workshop (see below). It should prove to be an interesting day of research. Are you planning on attending? Let us know.


I'll be dashing off from Chicago to Miami to attend ISMIR, and to present some new (not-yet-released) API features from Musicmetric. While things aren't quite live yet, I can say that in addition to our artist based endpoints, we'll be offering track-based endpoints soon as well, and aligning them with an I-bet-you-can-guess-which large public audio feature test set. More detail on this one to come.


Ignite is a series of lightning talks that have taken place in cities all over the world, unified not by a common theme, but a common format: all talks last five minutes, contain 20 slides, and the slides are automatically advanced every 15 seconds. The matching ethos of this structure is perhaps best seen in the Ignite slogan, "Enlighten us, but make it quick." I'll be speaking about beer, style and critical tasting in a talk titled "Ale or Lager and Other False Choices." Here's a brief description:
In a word, my talk is about beer. In a few more words, the driving narrative behind the talk is a crash course in beer styles and, more generally, critical tasting. After an extraordinarily brief description of beer, broad ideas of style, and the critical tasting process, the core of the talk will be made up of live lightning tastes of commercial examples of various styles of beer (one slide per style, 12 styles covered with one commercial example each). For coherence, these tasting slides will be grouped into broader styles, with an aim toward breadth, rather than depth, of coverage. The styles will be approximately based on those from the BJCP and the Brewers Association.
I still haven't sorted out the exact spread of beers or how they will be grouped, though I'm leaning toward something simple and obviously tied to ingredients (something like: lagers; ales, yeast driven; ales, malt forward; ales, hop heavy; with three beers in each group, each from a different recognized style). If anyone has any thoughts about style divisions or specific examples, do let me know. If you'd like to go to Ignite (and you know you would), the tickets will be available over this way later this week.


So, lots of things going on. Plus there's this other thing I've been working on.

Right, back to it.

Monday 15 August 2011

A SXSW panel proposal - The Wisdom of Thieves: Meaning in P2P Behavior

So I've submitted a proposal for SXSW Interactive 2012 entitled "The Wisdom of Thieves: Meaning in P2P Behavior". If you're the sort of person that might be interested in that sort of thing, you can comment and/or vote on it over here. The talk will basically be a tour of all the fun and exciting ways you can use BitTorrent data to make better applications (mostly music, but also TV, film, and app-store type things), with data sources like this. Here's the abstract and questions:

The act of piracy is typically viewed as devaluing content - the track that wasn’t streamed, the video game that wasn’t purchased. However, peer-to-peer networks of piracy are rich descriptions of fans who are interested enough to find content. By observing these descriptions, artists can better understand their fan base; recommendation and discovery can be better tuned. In this talk we’ll explore the similarities between BitTorrent downloads and a number of other means of online interaction, such as likes, mentions, and scrobbles. We’ll show how interactions vary between popular artists and works versus those found in the long tail, whether they’re emerging artists or niche films. Our audience will leave with a utility belt of tools to leverage data about and around peer-to-peer sharing of music and video. This talk will use data available via the Semetric API and open source Python scripts, freely available for download prior to the talk.
Questions Answered:
  1. How is peer-to-peer activity different from communities on Facebook, Twitter or Spotify?
  2. Can you use location data and a torrent network to optimize a tour schedule?
  3. Which countries should I syndicate my TV show in?
  4. How can you use co-occurrence in piracy to recommend content?
  5. Why should I consider the behavior of roving bands of thieves?
Also, my colleagues have panel proposals for SXSW music and film as well, go check them out here and here.

Monday 16 May 2011

Doing ridiculous things with natural language processing

So, the weekend before last, while I jealously followed from afar the SF Music Hack Day that I was unable to attend (I'm awaiting the result of a visa application), I started mucking about with the beta Musicmetric API (full disclosure: they are my employer), in particular the sentiment analyzer.

So the first thing I put together was a bit of Python to fetch the content of a tweet and use the Musicmetric API to determine its tone. This can be done quite simply (full source):
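A minimal sketch of the idea follows; the Musicmetric endpoint path and response keys here are assumptions for illustration (consult the API docs for the real calls), as is the Twitter call, which uses the unauthenticated v1 API of the era.

```python
# Sketch: fetch a tweet's text, then score it with the sentiment analyzer.
# Endpoint paths and response keys below are assumptions, not the real API.
import requests

MM_API_KEY = "YOUR_MUSICMETRIC_KEY"  # placeholder

def tweet_text(tweet_id):
    # Twitter's old unauthenticated v1 endpoint (since retired).
    url = "https://api.twitter.com/1/statuses/show/%s.json" % tweet_id
    return requests.get(url).json()["text"]

def tweet_sentiment(tweet_id):
    # Hypothetical sentiment endpoint; returns a score from 1 to 5.
    url = "http://api.semetric.com/sentiment?token=%s" % MM_API_KEY
    resp = requests.post(url, data={"text": tweet_text(tweet_id)})
    return resp.json()["score"]
```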



This gives a number between 1 and 5, with 1 indicating the text is 'very negative' and 5 indicating the text is 'very positive' (gory details of the sentiment analyzer). While the sentiment analyzer is trained on larger chunks of text (500-word album/movie reviews and that sort of thing), it in fact does fairly well with tweets (though sarcasm is its downfall). So I thought I'd do something a bit silly and build a 'flamewar detector and troll finder' for conversations on Twitter.

I've called the initial command line tool firealarm.

To gather the conversations, I'm just piggybacking on @jwheare's great tool Exquisite Tweets. Once a conversation is archived over on Exquisite Tweets, the CLI can be pointed at it via the conversation's URL. Each tweet in the conversation is pushed through the sentiment analyzer; the simple mean (µ) of all the sentiment scores is then dubiously used to determine if the conversation is a flamewar. If the sentiment is generally negative (µ < 3) it's a flamewar, if it's generally positive (µ > 3) it's not a flamewar, and if it is exactly neutral (µ = 3) it's declared a tossup. The troll finder is equally straightforward (and equally dubious!). Across the sequence of tweets, the author of the tweet with the largest negative sentiment delta from the tweet preceding it is considered the troll. In the case of a tie, the first occurrence wins. In miniature, the logic looks something like the sketch below.
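Glossing over the scraping of Exquisite Tweets, the verdict and troll hunt reduce to this, where scores is the per-tweet sentiment (1-5) in conversation order and authors is the matching list of screen names:

```python
def classify(scores):
    """Flamewar verdict from the mean sentiment over the conversation."""
    mu = sum(scores) / float(len(scores))
    if mu < 3:
        return "flamewar"
    if mu > 3:
        return "not a flamewar"
    return "tossup"

def find_troll(scores, authors):
    """Author of the tweet with the biggest drop from the preceding tweet."""
    deltas = [scores[i] - scores[i - 1] for i in range(1, len(scores))]
    worst = deltas.index(min(deltas))  # index() takes the first on a tie
    return authors[worst + 1]
```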

Here's an example (note that the linked-to example is a nasty, nasty flame war. If you offend easily, might want to skip it. Also, obviously, views expressed are not mine, etc.):


This generates the following output:


A plot of the sentiments, with the maximum negative delta in red, looks like this:



A quick read of the tweets and you can see that the actual sentiment of the tweets is a bit more negative overall than the analyzer output, but this is good enough for binary classification and a fairly reasonable, albeit fairly naive, troll ID mechanism.

The code is over at github if you want to have a look or run it yourself. If you want to run it, you'll need a musicmetric api key which takes about 30 seconds to get (apply here). Eventually I'm going to turn this into a web app, and when that happens I'll let everybody know. Also if you happen to find any really bad mislabels, let us know, as it helps us tune up our process.

Have fun being algorythmically judgemental!


Tuesday 5 April 2011

Free Beer: A Plea for Open Data [About Beer]

(I wrote most of this right after my viva, but got a bit sidetracked...)
Hey look, my first blog post about beer (or at least beer metadata).

So yesterday (well, a few weeks ago now), Tim Cowlishaw (@mistertim) stated this on twitter:
To which I replied with this:


(Scraping is a way to get the info a human reads, say on a website, into a format a computer program can read. More on why I'd want to do that in a minute...)
Which was followed by what I thought was a reasonable request:

Now at this point I had figured that was the end of it. Both RateBeer and Beer Advocate ignored my requests for data a year or so ago, so I was expecting the same this time. However, Beer Advocate responded via twitter:

(Note that the link to the tweet no longer resolves, because Beer Advocate decided to delete the tweet a couple of hours later. The screen capture was taken from my twitter client just after the deletion...)
Now, I hadn't been prepared for such knee-jerk nastiness regarding seemingly reasonable data requests, and neither had Tim, as he quickly pushed out this series of messages:

While I and others pushed out some similar responses, Tim's messages summarize things really well: boo, disdain, technical critique. (After this, both Tim and I appeared to be blocked from following Beer Advocate...)

The crux of all this is that I (and it would appear others, but from here I speak only for myself) would love to have access to structured data (as a service or, better yet, as documents) about beer and the people who drink it.

I'd love to build browser-based applications that do cross-domain recommendation of, say, beer and music. But in order to do that I'd need data about people's taste in beer and music. There are lots of options to work with in the music domain. But beer? Machine-readable beer data is harder to find.

Both RateBeer and Beer Advocate have a great deal of this data; it's just not (openly) machine readable. In RateBeer's case the data is entirely crowdsourced, and for Beer Advocate this is true of their community pages. There's a compelling case that crowdsourced data should be as open as possible, given that the data itself comes from the public at large. But beyond the moral case, opening your data means that the wide world of evening-and-weekend software developers/architects/designers/whatevers (many have the same job during the day) will expand what is possible with a site's data in a way that will benefit said site (like my half-baked idea above). This, in essence, is the commercial argument for supporting open data, and it has been shown to be extremely effective in other domains (say, to pick one at random, music). And there is a simply massive spread of open data APIs (again, both service and document), but barely any covering data about my favourite topic that isn't music: beer. So what do you say, RateBeer or Beer Advocate? How about some nice structured data?

note: I should mention that there are a couple of beer-related sites that are open: Untappd and beerspotr. Both are good sites, though neither has quite hit critical mass in terms of data coverage and usefulness just yet. Either might at some point in the future, but RateBeer and Beer Advocate already have; the data just isn't accessible.

Friday 1 April 2011

Viva passed, corrections approved, blog barely updated...

The last couple months have proven me to be a terrible blogger, as I haven't posted at all.

Anyway, that aside, I'm pleased to announce that I passed my viva with minor corrections (back on March 2nd) and, as of about an hour ago, had my submitted corrections approved, which means I'm totally done!

Hoorah!

So before I run off for a bit of celebratory drinking, I thought I'd post the soft copy to the series of tubes (here's the full PDF). Here is a brief chapter-by-chapter summary:
  • Chapter 1: Introduction. We present the set of problems this thesis will address, through a discussion of relevant contexts, including changing patterns in music consumption and listening. The core terms are defined. Constraints imposed on this work are laid out along with our aims. Finally, we provide this outline to expose the structure of the document itself.
  • Chapter 2: Playlists and Program Direction. We survey the state of the art in playlist tools and playlist generation. A framework for types of playlists is presented. We then give a brief history of playlist creation. This is followed by a discussion of music similarity, the current state of the art and how playlist generation depends on music similarity. The remainder of the chapter covers a representative survey of all things playlist. This includes commercially available tools to make and manage playlists, research into playlist generation and analysis of playlists from a selection of available playlist generators. Having reviewed existing tools and generation methods, we aim to demonstrate that a better understanding of song-to-song relationships than currently exists is a necessary underpinning for a robust playlist generation system, and this motivates much of the work in this thesis.
  • Chapter 3: Multimodal Social Network Analysis. We present an extensive analysis of a sample of a social network of musicians. First we analyse the network sample using standard complex network techniques to verify that it has similar properties to other web-derived complex networks. We then compute content-based pairwise dissimilarity values using the musical data associated with the network sample, and the relationship between those content-based distances and distances from network theory is explored. Following this exploration, hybrid graphs and distance measures are constructed and used to examine the community structure of the artist network. We close the chapter by presenting the results of these investigations and consider the recommendation and discovery applications these hybrid measures improve.
  • Chapter 4: Steerable Optimizing Self-Organized Radio. Using request radio shows as a base interactive model, we present the Steerable Optimizing Self-Organized Radio system as a prototypical music recommender system alongside robust automatic playlist generation. This work builds directly on the hybrid models of similarity described in Chapter 3 through the creation of a web-based radio system that interacts with current listeners through the selection of periodic request songs from a pool of nominees. We describe the interactive model behind the request system. The system itself is then described in detail. We detail the evaluation process, though note that the inability to rigorously compare playlists creates some difficulty for a complete study.
  • Chapter 5: A Method to Describe and Compare Playlists. In this chapter we survey current means of evaluating playlists. We present a means of comparing playlists in a reduced dimensional space through the use of aggregated tag clouds and topic models. To evaluate the fitness of this measure, we perform prototypical retrieval tasks on playlists taken from radio station logs gathered from Radio Paradise and Yes.com, using tags from Last.fm with the result showing better than random performance when using the query playlist’s station as ground truth, while failing to do so when using time of day as ground truth. We then discuss possible applications for this measurement technique as well as ways it might be improved.
  • Chapter 6: Conclusions. We discuss the findings of this thesis in their totality. After summarizing the conclusions we discuss possible future work and directions implied by these findings.
Enjoy!

(Also, if you find any deeply hidden typos, I'd love to know about them. Not sending it to the printer/binder till Monday...)