Monday, 17 October 2011

Upcoming talks and travels

I'm going to be up to a number of things that may be of interest to the readers of this (rather sparse) blog.

Quick summary


Our workshop (other co-chairs are Amélie Anglade, Òscar Celma, Paul Lamere, and Brian McFee) will cover a diverse array of approaches and angles for music recommendation and discovery. The workshop runs a full day and is part of RecSys 2011, though I sadly can't stay for most of the conference aside from our workshop (see below). It should prove to be an interesting day of research. Are you planning on attending? Let us know.


I'll be dashing off from Chicago to Miami to attend ISMIR, and to present some new (not-yet-released) API features from Musicmetric. While things aren't quite live yet, I can say that in addition to our artist-based endpoints, we'll be offering track-based endpoints soon as well, and aligning them with an I-bet-you-can-guess-which large public audio feature test set. More detail on this one to come.


Ignite is a series of lightning talks that have taken place in cities all over the world, unified not by a common theme, but a common format: all talks last five minutes, contain 20 slides, and the slides automatically advance every 15 seconds. The matching ethos of this structure is perhaps best seen in the Ignite slogan, "Enlighten us, but make it quick." I'll be speaking about beer, style and critical tasting in a talk titled "Ale or Lager and Other False Choices." Here's a brief description:
In a word, my talk is about beer. In a few more words, the driving narrative behind the talk is a crash course in beer styles and, more generally, critical tasting. After an extraordinarily brief description of beer, broad ideas of style and the critical tasting process, the core of the talk will be made up of live lightning tastes of commercial examples of various styles of beer (one slide per style, 12 styles covered with one commercial example each). For coherence these tasting slides will be grouped into broader styles, with an aim toward breadth, rather than depth, of coverage. The styles will be approximately based on those from the BJCP and the Brewers Association.
I still haven't sorted out the exact spread of beers or how they will be grouped, though I'm leaning toward something simple and obviously ingredient-tied (something like: lagers; ales, yeast-driven; ales, malt-forward; ales, hop-heavy, with three beers in each group, each from a different recognized style). If anyone has any thoughts about style divisions or specific examples, do let me know. If you'd like to go to Ignite (and you know you would), the tickets will be available over this way later this week.


So, lots of things going on. Plus there's this other thing I've been working on.

Right, back to it.

Monday, 15 August 2011

A SXSW panel proposal - The Wisdom of Thieves: Meaning in P2P Behavior

So I've submitted a proposal for SXSW interactive 2012 entitled "The Wisdom of Thieves: Meaning in P2P Behavior". If you're the sort of person that might be interested in that sort of thing you can comment and/or vote on it over here. The talk will basically be a tour of all the fun and exciting ways you can use BitTorrent data to make better (mostly music, but also TV, film, and app-store type things) applications, with data sources like this. Here's the abstract and questions:

The act of piracy is typically viewed as devaluing content - the track that wasn’t streamed, the video game that wasn’t purchased. However, peer-to-peer networks of piracy are rich descriptions of fans who are interested enough to find content. By observing these descriptions, artists can better understand their fan base; recommendation and discovery can be better tuned. In this talk we’ll explore the similarities between BitTorrent downloads and a number of other means of online interaction, such as likes, mentions, and scrobbles. We’ll show how interactions vary between popular artists and works versus those found in the long tail, whether they’re emerging artists or niche films. Our audience will leave with a utility belt of tools to leverage data about and around peer-to-peer sharing of music and video. This talk will use data available via the Semetric API and open source Python scripts, freely available for download prior to the talk.
Questions Answered:
  1. How is peer-to-peer activity different from communities on Facebook, Twitter or Spotify?
  2. Can you use location data and a torrent network to optimize a tour schedule?
  3. Which countries should I syndicate my TV show in?
  4. How can you use co-occurrence in piracy to recommend content?
  5. Why should I consider the behavior of roving bands of thieves?
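As a taste of the co-occurrence idea in question 4, here's a minimal sketch. The data format and function names are my own illustration, not the Semetric API's: given (user, item) download pairs observed in a swarm, items fetched by many of the same users are likely related, which is enough for a crude recommender.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(downloads):
    """downloads: iterable of (user, item) pairs from a download log.
    Returns {(item_a, item_b): shared_user_count} with item_a < item_b."""
    items_by_user = defaultdict(set)
    for user, item in downloads:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

def recommend(counts, item, top_n=3):
    """Rank other items by how often they co-occur with `item`."""
    scored = []
    for (a, b), n in counts.items():
        if item == a:
            scored.append((n, b))
        elif item == b:
            scored.append((n, a))
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]

log = [("u1", "artist_a"), ("u1", "artist_b"),
       ("u2", "artist_a"), ("u2", "artist_b"),
       ("u3", "artist_a"), ("u3", "artist_c")]
print(recommend(cooccurrence_counts(log), "artist_a"))  # artist_b ranks first
```

The same shape of computation works for the location and scheduling questions; you just swap the item axis for a geographic one.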
Also, my colleagues have panel proposals for SXSW music and film as well, go check them out here and here.

Monday, 16 May 2011

Doing ridiculous things with natural language processing

Over the weekend before last, while I jealously followed from afar the SF musichackday that I was unable to attend (I'm awaiting the result of a visa application), I started mucking about with the beta musicmetric API (full disclosure: they are my employer), in particular the sentiment analyzer.

So the first thing I put together was a bit of python to fetch the content of a tweet and use the mm_api to determine its tone. This can be done quite simply (full source):



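The embedded snippet didn't survive archiving, so here's a rough reconstruction of the idea. The endpoint path and response shape below are my assumptions for illustration, not the documented API; consult the actual musicmetric docs for the real calls.

```python
# Reconstruction sketch: fetch a tweet's text, send it to a sentiment
# endpoint, and map the returned 1-5 score to a label. The endpoint URL
# and JSON layout are placeholders, not the real musicmetric API.
import json
import re
import urllib.request

def tweet_id_from_url(url):
    """Pull the numeric status id out of a twitter status URL."""
    match = re.search(r"/status(?:es)?/(\d+)", url)
    if match is None:
        raise ValueError("not a tweet URL: %r" % url)
    return match.group(1)

def sentiment_label(score):
    """Map the analyzer's 1-5 score to a human-readable label."""
    labels = {1: "very negative", 2: "negative", 3: "neutral",
              4: "positive", 5: "very positive"}
    return labels[round(score)]

def text_sentiment(text, api_key, endpoint="http://api.example.com/sentiment"):
    """POST text to a (hypothetical) sentiment endpoint; return a 1-5 score."""
    request = urllib.request.Request(
        "%s?token=%s" % (endpoint, api_key),
        data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]["score"]
```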
Which gives a number between 1 and 5, with 1 indicating the text is 'very negative' and 5 indicating the text is 'very positive' (gory details of the sentiment analyzer). While the sentiment analyzer is trained on larger chunks of text (500-word album/movie reviews and that sort of thing), it in fact does fairly well with tweets (though sarcasm is its downfall). So I thought I'd do something a bit silly and build a 'flamewar detector and troll finder' for conversations on twitter.

I've called the initial command line tool firealarm.

To gather the conversations, I'm just piggybacking on @jwheare's great tool exquisite tweets. Once a conversation is archived over on exquisite tweets, the cli can be pointed to it via the conversation's url. Each tweet in the conversation is pushed through the sentiment analyzer; the simple mean (µ) of all the sentiment scores is then dubiously used to determine if the conversation is a flamewar. If the sentiment is generally negative (µ < 3) it's a flamewar, if it's generally positive (µ > 3) it's not a flamewar, and if it is exactly neutral (µ == 3) it's declared a tossup. The troll finder is equally straightforward (and equally dubious!). Across the sequence of tweets, the author of the tweet with the largest negative sentiment delta relative to the tweet preceding it is considered the troll. In the case of a tie, the first occurrence wins.
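The classification logic above can be sketched in a few lines. This is a minimal illustration of the idea, not firealarm's actual source; tweets are assumed to be (author, score) pairs in conversation order, with scores on the analyzer's 1-5 scale.

```python
def classify_conversation(scores, neutral=3.0):
    """Label a conversation by the mean of its sentiment scores."""
    mu = sum(scores) / len(scores)
    if mu < neutral:
        return "flamewar"
    if mu > neutral:
        return "not a flamewar"
    return "tossup"

def find_troll(tweets):
    """Return the author whose tweet caused the largest drop in sentiment
    relative to the tweet preceding it; the first occurrence wins ties."""
    troll, worst_delta = None, 0.0
    for (_, prev_score), (author, score) in zip(tweets, tweets[1:]):
        delta = score - prev_score
        if delta < worst_delta:  # strictly worse, so earlier ties are kept
            troll, worst_delta = author, delta
    return troll

tweets = [("alice", 4.0), ("bob", 1.5), ("alice", 2.0), ("carol", 3.5)]
print(classify_conversation([s for _, s in tweets]))  # -> flamewar (mean 2.75)
print(find_troll(tweets))  # -> bob (delta of -2.5)
```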

Here's an example (note that the linked-to example is a nasty, nasty flame war. If you offend easily, you might want to skip it. Also, obviously, views expressed are not mine, etc.):


This generates the following output:


A plot of the sentiments, with the maximum negative delta in red, looks like this:



A quick read of the tweets and you can see that the actual sentiment of the tweets is a bit more negative overall than the analyzer output, but this is good enough for binary classification and a fairly reasonable troll ID mechanism, albeit a fairly naive one.

The code is over at github if you want to have a look or run it yourself. If you want to run it, you'll need a musicmetric api key which takes about 30 seconds to get (apply here). Eventually I'm going to turn this into a web app, and when that happens I'll let everybody know. Also if you happen to find any really bad mislabels, let us know, as it helps us tune up our process.

Have fun being algorithmically judgemental!


Tuesday, 5 April 2011

Free Beer: A Plea for Open Data [About Beer]

(I wrote most of this right after my viva, but got a bit sidetracked...)
Hey look, my first blog post about beer (or at least beer metadata).

So a few weeks ago, Tim Cowlishaw (@mistertim) said this on twitter:
To which I replied with this :


(Scraping is a way to get the info a human reads, say on a website into a format a computer program can read. More on why I'd want to do that in a minute...)
Which was followed by what I thought was a reasonable request:

Now at this point I had figured that was the end of it. Both ratebeer and beeradvocate ignored my requests for data a year or so ago, so I was expecting the same this time. However, beeradvocate responded via twitter:

(Note that the link to the tweet no longer resolves, because beeradvocate decided to delete this tweet a couple hours later. The screen capture was taken from my twitter client just after the deletion...)
Now, I hadn't been prepared for such knee-jerk nastiness regarding a seemingly reasonable data request, and neither had Tim, as he quickly pushed out this series of messages:

While I and others pushed out some similar responses, Tim's summarized things really well: boo, disdain, technical critique. (After this, both Tim and I appeared to be blocked from following beeradvocate...)

The crux of all this is that I (and it would appear others, but from here I speak only for myself) would love to have access to structured data (as a service or, better yet, as documents) about beer and the people who drink it.

I'd love to build browser-based applications that do cross-domain recommendation of, say, beer and music. But in order to do that I'd need data about people's taste in beer and music. There are lots of options to work with in the music domain. But beer? Machine-readable beer data is much harder to find.

Both ratebeer and beeradvocate have a great deal of this data; it's just not (openly) machine readable. In ratebeer's case this data is entirely crowdsourced, and for beeradvocate this is true of their community pages. There's a compelling case that crowdsourced data should be as open as possible, given that the data itself comes from the public at large. But beyond the moral case, opening your data means that the wide world of evening-and-weekend software developers/architects/designers/whatevers (many have the same job during the day) will expand what is possible with a site's data in ways that benefit said site (like my half-baked idea above). This, in essence, is the commercial argument for supporting open data, and it has been shown to be extremely effective in other domains (say, to pick one at random, music). And there is a simply massive spread of open data APIs (again, both service and document), but barely any covering data about my favourite topic that isn't music: beer. So what do you say, ratebeer or beeradvocate? How about some nice structured data?

note: I should mention that there are a couple of sites that are beer related and open: untappd and beerspotr. Both are good sites, though neither has quite hit critical mass in terms of data coverage and usefulness just yet. Either might at some point in the future, but ratebeer and beeradvocate already have; their data just isn't accessible.

Friday, 1 April 2011

Viva passed, corrections approved, blog barely updated...

The last couple months have proven me to be a terrible blogger, as I haven't posted at all.

Anyway, that aside, I'm pleased to announce that I passed my viva with minor corrections (back on March 2nd) and, as of about an hour ago, had my submitted corrections approved, which means I'm totally done!

Hoorah!

So before I run off for a bit of celebratory drinking, I thought I'd post the soft copy to the series of tubes (here's the full pdf), and here is a brief chapter-by-chapter summary:
  • Chapter 1: Introduction. We present the set of problems this thesis will address, through a discussion of relevant contexts, including changing patterns in music consumption and listening. The core terms are defined. Constraints imposed on this work are laid out along with our aims. Finally, we provide this outline to expose the structure of the document itself.
  • Chapter 2: Playlists and Program Direction. We survey the state of the art in playlist tools and playlist generation. A framework for types of playlists is presented. We then give a brief history of playlist creation. This is followed by a discussion of music similarity, the current state of the art and how playlist generation depends on music similarity. The remainder of the chapter covers a representative survey of all things playlist. This includes commercially available tools to make and manage playlists, research into playlist generation and analysis of playlists from a selection of available playlist generators. Having reviewed existing tools and generation methods, we aim to demonstrate that a better understanding of song-to-song relationships than currently exists is a necessary underpinning for a robust playlist generation system, and this motivates much of the work in this thesis.
  • Chapter 3: Multimodal Social Network Analysis. We present an extensive analysis of a sample of a social network of musicians. First we analyse the network sample using standard complex network techniques to verify that it has similar properties to other web-derived complex networks. We then compute content-based pairwise dissimilarity values using the musical data associated with the network sample, and the relationship between those content-based distances and distances from network theory is explored. Following this exploration, hybrid graphs and distance measures are constructed and used to examine the community structure of the artist network. We close the chapter by presenting the results of these investigations and consider the recommendation and discovery applications these hybrid measures improve.
  • Chapter 4: Steerable Optimizing Self-Organized Radio. Using request radio shows as a base interactive model, we present the Steerable Optimizing Self-Organized Radio system as a prototypical music recommender system alongside robust automatic playlist generation. This work builds directly on the hybrid models of similarity described in Chapter 3 through the creation of a web-based radio system that interacts with current listeners through the selection of periodic request songs from a pool of nominees. We describe the interactive model behind the request system. The system itself is then described in detail. We detail the evaluation process, though note that the inability to rigorously compare playlists creates some difficulty for a complete study.
  • Chapter 5: A Method to Describe and Compare Playlists. In this chapter we survey current means of evaluating playlists. We present a means of comparing playlists in a reduced dimensional space through the use of aggregated tag clouds and topic models. To evaluate the fitness of this measure, we perform prototypical retrieval tasks on playlists taken from radio station logs gathered from Radio Paradise and Yes.com, using tags from Last.fm, with results showing better-than-random performance when using the query playlist’s station as ground truth, while failing to do so when using time of day as ground truth. We then discuss possible applications for this measurement technique as well as ways it might be improved.
  • Chapter 6: Conclusions. We discuss the findings of this thesis in their totality. After summarizing the conclusions we discuss possible future work and directions implied by these findings.
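The playlist comparison in Chapter 5 can be illustrated with a toy version. This is my simplification, not the thesis's actual method (which also uses topic models for dimension reduction): aggregate each playlist's per-track tags into one tag-count vector, then compare playlists by cosine similarity.

```python
from collections import Counter
from math import sqrt

def tag_cloud(playlist):
    """Aggregate per-track tag lists into one tag count vector."""
    cloud = Counter()
    for track_tags in playlist:
        cloud.update(track_tags)
    return cloud

def cosine(a, b):
    """Cosine similarity between two sparse tag count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

rock = tag_cloud([["rock", "classic rock"], ["rock", "blues"]])
jazz = tag_cloud([["jazz", "bebop"], ["jazz", "piano"]])
print(cosine(rock, rock))  # 1.0: identical clouds
print(cosine(rock, jazz))  # 0.0: no shared tags
```

A retrieval task like the one in the chapter then amounts to ranking candidate playlists by this similarity against a query playlist's cloud.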
Enjoy!

(Also, if you find any deeply hidden typos, I'd love to know about them. Not sending it to the printer/binder till Monday...)