Tuesday, 5 April 2011

Free Beer: A Plea for Open Data [About Beer]

(I wrote most of this right after my viva, but got a bit sidetracked...)
Hey look, my first blog post about beer (or at least beer metadata).

So yesterday (well, a few weeks ago now), Tim Cowlishaw (@mistertim) stated this on twitter:
To which I replied with this:

(Scraping is a way to get the info a human reads, say on a website, into a format a computer program can read. More on why I'd want to do that in a minute...)
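(For the curious, here is a rough idea of what scraping looks like in practice. This is only a minimal sketch using Python's standard library: the HTML fragment, the class names and the beers in it are all made up for illustration and don't come from any real site.)

```python
from html.parser import HTMLParser

# Made-up page fragment: the markup, class names and beers are hypothetical.
PAGE = """
<ul>
  <li class="beer"><span class="name">Example Stout</span> <span class="score">4.2</span></li>
  <li class="beer"><span class="name">Sample IPA</span> <span class="score">3.9</span></li>
</ul>
"""


class BeerScraper(HTMLParser):
    """Collects the name and score of each <li class="beer"> entry."""

    def __init__(self):
        super().__init__()
        self.beers = []      # one dict per beer found so far
        self._field = None   # which field the current text belongs to, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "beer":
            self.beers.append({})
        elif tag == "span" and cls in ("name", "score"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside a name or score span.
        if self._field and self.beers:
            self.beers[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Turn human-readable HTML into a list of plain dictionaries."""
    parser = BeerScraper()
    parser.feed(html)
    parser.close()
    return [{"name": b["name"], "score": float(b["score"])} for b in parser.beers]


beers = scrape(PAGE)
```

(The point being that all this fiddly parsing breaks whenever the page layout changes, which is exactly why properly structured data is nicer for everyone.)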
Which was followed by what I thought was a reasonable request:

Now at this point I had figured that was the end of it. Both ratebeer and beer advocate ignored my requests for data a year or so ago, so I was expecting the same this time. However, beer advocate responded via twitter:

(Note that the link to the tweet no longer resolves, because beeradvocate decided to delete this tweet a couple hours later. The screen capture was taken from my twitter client just after the deletion...)
Now, I hadn't been prepared for such knee-jerk nastiness regarding seemingly reasonable data requests, and neither had Tim, as he quickly pushed out this series of messages:

While I and others pushed some similar responses, Tim's messages summarize things really well: boo, disdain, technical critique. (After this, both Tim and I appeared to be blocked from following beeradvocate...)

The crux of all this is that I (and it would appear others, but from here I speak only for myself) would love to have access to structured data (as a service or, better yet, as documents) about beer and the people who drink it.

I'd love to build browser-based applications that do cross-domain recommendation of, say, beer and music. But in order to do that I'd need data about people's taste in both beer and music. There are lots of options to work with in the music domain. But beer? Machine-readable beer data is harder to find.

Both ratebeer and beer advocate have a great deal of this data; it's just not (openly) machine readable. In ratebeer's case the data is entirely crowdsourced, and for beer advocate this is true for their community pages. There's a compelling case that crowdsourced data should be as open as possible, given that the data itself comes from the public at large. But beyond the moral case, opening your data means that the wide world of evening and weekend software developers/architects/designers/whatevers (many have the same job during the day) will expand what is possible with a site's data in a way that will benefit said site (like my half-baked idea above). This, in essence, is the commercial argument for supporting open data, and it has been shown to be extremely effective in other domains (say, to pick one at random, music). And there is a simply massive spread of open data APIs (again, both service and document) but barely any covering data about my favourite topic that isn't music: beer. So what do you say, ratebeer or beeradvocate? How about some nice structured data?

note: I should mention that there are a couple sites that are beer related and open: untappd and beerspotr. Both are good sites, though neither is quite to the point of hitting critical mass in terms of data coverage and usefulness just yet. Either might at some point in the future, but ratebeer and beeradvocate already have; the data just isn't accessible.

Friday, 1 April 2011

Viva passed, corrections approved, blog barely updated...

The last couple months have proven me to be a terrible blogger, as I haven't posted at all.

Anyway, that aside, I'm pleased to announce that I have passed my viva with minor corrections (back on March 2nd) and, as of about an hour ago, had my submitted corrections approved, which means I'm totally done!


So before I run off for a bit of celebratory drinking, I thought I'd post the soft copy into the series of tubes (here's the full pdf), and here is a brief chapter-by-chapter summary:
  • Chapter 1: Introduction. We present the set of problems this thesis will address, through a discussion of relevant contexts, including changing patterns in music consumption and listening. The core terms are defined. Constraints imposed on this work are laid out along with our aims. Finally, we provide this outline to expose the structure of the document itself.
  • Chapter 2: Playlists and Program Direction. We survey the state of the art in playlist tools and playlist generation. A framework for types of playlists is presented. We then give a brief history of playlist creation. This is followed by a discussion of music similarity, the current state of the art and how playlist generation depends on music similarity. The remainder of the chapter covers a representative survey of all things playlist. This includes commercially available tools to make and manage playlists, research into playlist generation and analysis of playlists from a selection of available playlist generators. Having reviewed existing tools and generation methods, we aim to demonstrate that a better understanding of song-to-song relationships than currently exists is a necessary underpinning for a robust playlist generation system, and this motivates much of the work in this thesis.
  • Chapter 3: Multimodal Social Network Analysis. We present an extensive analysis of a sample of a social network of musicians. First we analyse the network sample using standard complex network techniques to verify that it has similar properties to other web-derived complex networks. We then compute content-based pairwise dissimilarity values using the musical data associated with the network sample, and the relationship between those content-based distances and distances from network theory is explored. Following this exploration, hybrid graphs and distance measures are constructed and used to examine the community structure of the artist network. We close the chapter by presenting the results of these investigations and consider the recommendation and discovery applications these hybrid measures improve.
  • Chapter 4: Steerable Optimizing Self-Organized Radio. Using request radio shows as a base interactive model, we present the Steerable Optimizing Self-Organized Radio system as a prototypical music recommender system alongside robust automatic playlist generation. This work builds directly on the hybrid models of similarity described in Chapter 3 through the creation of a web-based radio system that interacts with current listeners through the selection of periodic request songs from a pool of nominees. We describe the interactive model behind the request system. The system itself is then described in detail. We detail the evaluation process, though note that the inability to rigorously compare playlists creates some difficulty for a complete study.
  • Chapter 5: A Method to Describe and Compare Playlists. In this chapter we survey current means of evaluating playlists. We present a means of comparing playlists in a reduced dimensional space through the use of aggregated tag clouds and topic models. To evaluate the fitness of this measure, we perform prototypical retrieval tasks on playlists taken from radio station logs gathered from Radio Paradise and Yes.com, using tags from Last.fm. The results show better than random performance when using the query playlist’s station as ground truth, but fail to do so when using time of day as ground truth. We then discuss possible applications for this measurement technique as well as ways it might be improved.
  • Chapter 6: Conclusions. We discuss the findings of this thesis in their totality. After summarizing the conclusions we discuss possible future work and directions implied by these findings.
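
To give a flavour of the aggregated tag cloud idea from Chapter 5, here is a minimal sketch. It is not the method from the thesis: the thesis compares playlists in a reduced dimensional space using topic models, whereas this toy version swaps that out for plain cosine similarity on raw tag counts. The song names and tags are invented for illustration and are not Last.fm data.

```python
from collections import Counter
from math import sqrt

# Hypothetical song -> tags lookup (invented, not real Last.fm tags).
SONG_TAGS = {
    "song_a": ["rock", "guitar", "90s"],
    "song_b": ["rock", "indie"],
    "song_c": ["jazz", "piano"],
    "song_d": ["jazz", "saxophone", "piano"],
}


def tag_cloud(playlist):
    """Aggregate the tags of every song in a playlist into one bag of counts."""
    cloud = Counter()
    for song in playlist:
        cloud.update(SONG_TAGS.get(song, []))
    return cloud


def cosine_similarity(a, b):
    """Cosine similarity between two tag clouds (0.0 if either is empty)."""
    dot = sum(a[tag] * b[tag] for tag in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


rock = tag_cloud(["song_a", "song_b"])
jazz = tag_cloud(["song_c", "song_d"])
mixed = tag_cloud(["song_b", "song_c"])

rock_vs_jazz = cosine_similarity(rock, jazz)    # no shared tags
rock_vs_mixed = cosine_similarity(rock, mixed)  # some shared tags
```

With these toy playlists the rock and jazz clouds share no tags, so they score zero, while the mixed playlist lands somewhere in between. Raw tag counts are sparse and noisy, which is part of the motivation for moving to a reduced dimensional space.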

(Also, if you find any deep hiding typos, I'd love to know about them. Not sending it to the printer/binder till Monday...)