Sunday, 26 September 2010

womrad live blog part the last

last session: Long tail stuff:

Kibeom Lee (presenting), Woon Seung Yeo and Kyogu Lee
  • focusing on popularity bias - referencing oscar's thesis work (Help! I'm stuck in the head)
  • Goal: keep the awesome of collaborative filtering but sort out popularity bias
  • the mystery of unpopular but 'loved' songs on last.fm -- shouldn't loved songs be played frequently... perhaps an area of music the user likes but doesn't venture very far into
  • 'My tail is your head' - find the users who have a 'head' that overlaps with your 'tail' to draw recs from (a rough sketch of the idea follows this list)
  • personal story about how this idea came about -- one person's popularity bias is another person's novel rec.
  • refs oscar and paul's ISMIR 07 rec tutorial - this system is geared toward the top half of the user type pyramid
  • scraped last.fm to get more tracks per user (the API gives 50 per user; scraping gives 500)
  • lots of tracks (about 9 million)
  • eval by asking users how things worked out, comparing recs from the proposed algorithm v. a traditional model; used a 1-5 rating scale
  • promo'd the website in various ways, but not too much response
  • but, the limited response did show some improvement over traditional approach
  • overall - some improvement, much potential
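
A minimal sketch of how I read the 'my tail is your head' idea (my own illustration, not the authors' code): split each user's artist play counts into a head and a tail, then score candidates drawn from the heads of users whose head overlaps the target user's tail. The names and the head/tail split fraction are placeholders.

```python
from collections import Counter

def split_head_tail(plays, head_fraction=0.2):
    """Split a user's artist play counts into a 'head' (their most-played
    artists) and a 'tail' (everything else). head_fraction is a guess."""
    ranked = [artist for artist, _ in Counter(plays).most_common()]
    cut = max(1, int(len(ranked) * head_fraction))
    return set(ranked[:cut]), set(ranked[cut:])

def my_tail_your_head(target_plays, other_users, n_recs=10):
    """Recommend artists drawn from the heads of users whose head
    overlaps the target user's tail ('my tail is your head')."""
    _, target_tail = split_head_tail(target_plays)
    known = set(target_plays)
    scores = Counter()
    for plays in other_users.values():
        head, _ = split_head_tail(plays)
        overlap = len(head & target_tail)
        if overlap == 0:
            continue  # this user's head doesn't reach into our tail
        for artist in head - known:
            scores[artist] += overlap  # weight by how much their head matches our tail
    return [artist for artist, _ in scores.most_common(n_recs)]
```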
Q how many users?: see above
Q so were your recs in the global head?:
sorta, mostly in the midsection


Mark Levy (presenting) and Klaas Bosteels
  • an overview of the literature showing various rec biases, especially the idea of positive feedback reinforcing the head (not this kind of bias though)
  • this work looks at 7 billion scrobbles - all scrobbles from Jan-Mar this year (holy crap, that's some scale)
  • recs just from the last.fm radio
  • how do you define the long tail? use a fixed ref of overall artist ranks (number of listeners from last.fm) + a fit model; ~50-60k artists in the 'head'
  • looked at rec radio, non-rec radio, all music
  • the last.fm radio has less head bias than general listening, but only just
  • used an experimental cohort of listeners: new, active, but not insane spamming amounts of scrobbling. two subsets: radio users and those who aren't
  • this shows very little difference in the non-radio long tail listening among those who use last.fm radio v. those who don't
  • but: perhaps there's some demographic trouble
  • so split radio users into high users and low users
  • still no tail bias to speak of
  • perhaps from the fact that real systems only rec new tracks, mitigating reinforcement
  • so: built a simple item-based rec which limited candidates to the 'play direct-from-artist' scheme and was not allowed to recommend artists with more than 10,000 fans (see the sketch after this list)
  • deployed on playground.last.fm
  • eval based on a sample of the last.fm user traffic
  • effectively pushes the curve out another order of magnitude
  • try online
  • [me: this is great!]
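
For concreteness, a rough sketch of the candidate-filtering idea as I understood it (my own stand-in, not last.fm's code): do ordinary item-based scoring from the seed artists, but drop any candidate whose listener count puts it in the head.

```python
def long_tail_recs(seed_artists, similar_artists, listener_counts,
                   max_listeners=10_000, n_recs=20):
    """Item-based recs restricted to the tail.
    similar_artists: artist -> list of (candidate, similarity) pairs (hypothetical).
    listener_counts: artist -> number of listeners (hypothetical)."""
    scores = {}
    for seed in seed_artists:
        for candidate, sim in similar_artists.get(seed, []):
            if candidate in seed_artists:
                continue
            if listener_counts.get(candidate, 0) > max_listeners:
                continue  # too popular: keep the head out of the candidate pool
            scores[candidate] = scores.get(candidate, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:n_recs]
```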
Q Do you see a problem, in terms of scholarship, with the fact that in practice you have access to all this data and the public does not?
well, hrm. how about being an intern
Q Does this make better recs?
Better, eh, interesting sure.

And WOMRAD done. feedback is elicited

afternoon papers


content-based stuff now:

Dmitry Bogdanov, Martín Haro, Ferdinand Fuhrmann, Emilia Gómez and Perfecto Herrera

Dmitry presenting
  • sim is not rec - but rec needs similarity
  • can we improve content based rec by merging pref data?
  • gmm + pref model
  • process:
  1. ask user for small set of tracks that specify the user's preference by example
  2. get bag-of-frames features on these
  3. SVMs to get semantics (probabilistic)
  4. in this semantic space, search for tracks
  • can search in a variety of ways (use of Pearson's correlation is taken from prev work) - see the sketch after this list
  • for eval, compare our method to a bunch of existing methods: content-based, contextual, random
  • some users did a test: get a pref set (varies from 19 to 178 tracks per user); this takes a long time
  • get lots of tracks from all the methods, shuffle, stick them in front of the user, ask lots of Qs per track
  • created categories based on the evals: hits, trusts, fails, plus unclear
  1. Hits -user likes, is new
  2. trusts - user likes, is not new
  3. fails - no to all
  4. unclear - the rest (18%)
  • A good system should provide many hits and some trusts while avoiding fails
  • in the results, last.fm (via api) is very good for hits and trusts
  • everyone else was bad at trusts
  • the new method was best among the non-last.fm methods for hits, but last.fm draws from a different set of music, so they're better
  • proposed semantics offer an improvement over pure timbral features
  • but still inferior to industrial approaches, though this proposed work improves considerably - perhaps a good way to cold start
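
Roughly how I understand steps 3-4 of the process above, once tracks are mapped into a semantic space (e.g. per-tag/genre probabilities from the SVMs): average the user's preference-by-example tracks into one point and rank the collection by Pearson correlation to it. This is my own sketch, not the authors' code, and it assumes the semantic vectors already exist.

```python
import numpy as np
from scipy.stats import pearsonr

def preference_vector(example_semantic_vecs):
    """Average the semantic (e.g. per-tag probability) vectors of the
    user's preference-by-example tracks into a single preference point."""
    return np.mean(np.vstack(example_semantic_vecs), axis=0)

def rank_by_preference(pref_vec, collection):
    """Rank candidate tracks by Pearson correlation between their semantic
    vector and the user's preference vector (higher = closer).
    collection: track_id -> semantic vector (hypothetical)."""
    scored = [(tid, pearsonr(pref_vec, vec)[0]) for tid, vec in collection.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```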
Q (oscar) I don't understand the last.fm bit - why didn't you use it for sim?
we tried, couldn't get enough info
(oscar follow up) low trust on the content, do you think it's tied to a lack of transparency?
maybe, but our definition of trust just meant user likes and knows.

Q() was the SEM-ALL about finding songs that are close to any or all?
any


UPDATE (~5pm):

Pedro Mercado and Hanna Lukashevich
Hanna is presenting

  • clustering can help you swim in the sea of data
  • users can fix incorrect clusters, positive feedback
  • system diagram:

  • similarity can be considered as a graph; then you can do random walks, calc eigenvalues, etc. (a rough sketch follows this list)
  • but what if the user doesn't care about some things? User-preference-based feature selection.
  • in the given space, you can then find distance (paper uses Pearson's but other dist could be used)
  • constrain the space (tricky math, see paper...)
  • eval: used the MIREX 04 content description data
  • constraints from genre labels
  • using test/train as an example: what's in the constraint space, what isn't
  • mutual information, something else I didn't catch
  • some graphs showing that there's more awesome with presented method
  • when looking at outliers, things are less clear but still seem positive
  • [graphs are page 6 of the pdf, have a look for details]
  • to wrap up: ML approaches can improve recs at least with our simulated user...
  • our clustering methods are speedy; scale is tricky, but since our matrix is sparse it should be doable
  • Way better than random constraints
  • future work: stick constraints in the feature selector - we did this, to appear in ICML, gives significant improvement, but causes some trouble, read the paper for details [excellent ICML tease...]
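
A rough sketch of the similarity-as-graph idea (my own stand-in, not the paper's constrained method): restrict the feature space to the dimensions the user cares about, use Pearson correlation between tracks as edge weights, and spectral-cluster the resulting graph. scikit-learn's SpectralClustering stands in for the eigenvector machinery; constraints are not handled here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_tracks(features, selected_dims, n_clusters=5):
    """Build a similarity graph over tracks using only the feature
    dimensions the user cares about, then spectral-cluster it.
    features: (n_tracks, n_features) array; selected_dims: column indices."""
    X = features[:, selected_dims]
    # Pearson correlation between tracks as the edge weight, shifted into [0, 1]
    affinity = (np.corrcoef(X) + 1.0) / 2.0
    np.fill_diagonal(affinity, 1.0)
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(affinity)
```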
-- coffee and demos now...

WOMRAD Afternoon live blog

Afternoon live blog for WOMRAD. Intro in first post.

Afternoon session.

1500: Industrial panel

panelists:
  • Òscar Celma (BMAT, Spain), moderator
  • Tom Butcher (Microsoft Bing, US)
  • Mark Levy (last.fm, UK)
  • Michael S. Papish (Rovi Corporation) subbing for Henri-Pierre Mousset (Yozik, France)
  • Gilad Shlang (Meemix, Israel)

Q (asked by OC): Do we need recommenders anymore? Are they relevant? (SFMusicTech quote about music rec only needed for people w/o friends)

TB: still valid, but personally more interested in discovery now...
ML: don't sell things, but tremendous effort in this direction, important to users, builds trust. Plus we complement, not replace, social connections
MP: no need to draw lines, reinforces the complementary-service idea. Many users may not need them, but perhaps that's not who these systems are for
GS: What's wrong with not having friends? Also, a tight group of friends may not have discovery, as a group. The removal of place opens more possibility to access the long tail or different parts of the head. Perhaps more personalization than rec, but this is a fine line
MP: the opposition of the individual v. the group. If you listen to music without a community you lose the social experience.
GS: fair enough. some points about individual optimization of education. Group important, but also personal growth.

Q (asked by OC): Netflix prize. What is a good recommendation (in music)? How do we evaluate (in music)?
GS: a good rec will get people interested. wow factor. acknowledge you and surprise you at the same time. music is short, which makes it easier to tune a rec profile. sharing implies liking, that's useful. tagging: more tags = more popular
(...small aside...)
ML: we run controlled experiments. quietly divide users to test different methods. Netflix fails in that it evaluates with data that's already been seen not new data.
MP: good rec means different things at different times. gives an example of a good rec that is not interesting: an artist you've previously bought releases a new album. not interesting but good. this would be bad for radio.
TB: in industry there are many ways to test. more purchase is different than more enjoyment.
ML: we at last.fm would love for some theory to be developed for rec based on user logs.
OC: more data please
ML: you can always ask...
(this goes back and forth)
MP: yet there will never be good data. sparse data is hard, but it makes you a better human (eat your greens)

Q (OC) discuss user interfaces, user experience, etc.:
TB: pandora is a winner, don't ignore the interaction
ML: thesixtyone is great. interface is v. playful, good long tail; would love for a last.fm interaction, but discussing issues
MP: name-checks Paul Lamere, who he cites saying thesixtyone is an exotic rec, but MP thinks this is the way people normally use music; we should work to have systems that act like this. Need a toolbelt, not one ubiquitous tool
GS: we tackle similar things. in B2B you need systems that complement clients' existing systems. If you over-rec, you can scare people off; social dynamics
MP: think about the inverse rec - what should you not rec? Also from a UX standpoint, to build trust, change recs over time. General to personal as a user interacts with a system
GS: different recs can be very personal
ML: last.fm takes a sort of opposite approach
-crosstalk-
MP: this is possible from last.fm's transparency

Q (OC) What do you want solved?

TB: we're hiring. Also, algorithms must scale to be useful
ML: how to merge data sources? how to use human-to-human recs?
MP: see my keynote, exploit user psych, What are good Qs to ask users to build profiles
GS: more info for recs, params in audio files. map user params to extractable params. Moving techniques to non-Western musics. What about China and India? We should serve them.
MP: is the sonic data really the key? I don't think so; too much effort in this direction. sim is different than rec
GS: but sim is a good start

Q(OC) Do you use content-based features (y/n please)? ISMIR fun, glass ceiling, do you follow this work in academics, do you think it's solved?
GS: Yes. see last discussion - core to our business, vital to start a relationship; move to social and such over time.
OC: what sorts of things do you use
GS: lots of things, around 10 (does not list them). aggressiveness is very important.
ML: I come from the MIR community. we do content-based ID, have tried to intro content-based stuff and it's never been successful. But our hearts are in it. We have enough users that cold start doesn't matter. auto-tagging would be sweet though for the holes in our social data (musicological tags for instance). maybe youtube
TB: yes for the most part to ML's comments. content-based is too costly, tags and metadata are super effective
GS: what about a new company that doesn't have lots of data, if you're just getting into the game? these people need results. can't tell them to go gather data for a year and we'll sort it out
MP: item to item is very different than personalized
ML: check that P2P paper from ISMIR
Hanna from Fraunhofer: we have clients (like film producers) with no data - what then?
MP: exactly.
(Eugenio Tacchini): GS, is the DNA all of it? really?
GS: no, not really. music DNA for rich space, but still need personalized info

Q (OC) if you were to hire a researcher (aside: researchers cannot program) what kind of skills do you want (not resume skills, fancy skills)?
TB: domain experience - audio, music, computer vision; breadth better than depth. production coding skill in some language
ML: we're hiring as well. If you don't want to code, it probably won't work. CS skills really important. Big database skills. Hadoop is a win. strong C++, and for research also Python. data and viz as well.
MP: we are hiring as well. growing r & d group. we have offices all over. we like building things. though we have room for research. we like solving problems. again broad. can you pivot. don't need a PhD to be useful.
GS: we're also hiring (that's 4 for 4) we're a start up. data analysis and mining. core CS + creative skills, willing to sweat.
OC: perhaps also adaptability
GS: yes. you're there to invent. plus we're in Tel Aviv and that's sweet

Q (claudio): What is the relationship your company has with musicians? are they just a commodity?
OC: our missing speaker (Henri) does this
GS: I spoke with him, he thinks: for young musicians it's hard to reach your audience.
OC: BMAT does this with jamendo. when they type in 'Michael Jackson' what do you do?
MP: but don't sell recsys as a way to push new artists only. In a certain context, i.e. neg search, but be careful. Don't exploit users or artists
GS: but the state of the art pushes new bits

(from audience) What about piracy?
GS: it's not good.
TB: there are 2 viewpoints: piracy increases consumption. other side: do we know that?
OC: now we're over time, sorry.

---
in light of the near transcript I just typed, I'm starting a new post for the afternoon talks.

Updated (5:33) : corrected questioner ID

A womrad live blog

I'm in Barcelona today for the Workshop on Music Recommendation and Discovery (WOMRAD). The theme is 'Is Music Recommendation Broken? How Can We Fix It?'
I'm giving a talk at 11am ( in about 2 hours ) and I'll be doing some (mostly) live updates about the program...

Update (10:05am):
UPDATE ( 28 Sept 2010, 11:52am): Michael has posted the slides to his talk.
  • The view from outside, as his industrial work has used and observed recs
  • Been there since the beginning (which appears to be about 2000)
  • Recommenders must combine humans and machines
  • understand both content and listeners, transparency, embrace emosocio aspects, optimize trust
  • What is science? Must be falsifiable (Popper) or Solvable, reproducible puzzles (gah, missed name)
  • Puzzle - understand the listeners preferences -- foundations (ISMIR 2001 resolution) - testable reusable
  • Lots of metrics though (too many?) (do we need a metric for metrics?)
  • MIREX (summary of AMS task) (haha it's automated, tell that to andy and mert) - very acoustically focused, not exactly recommendation; similarity != recommendation
  • use of statistical measures across datasets e.g. Netflix prize -- but what about discovery? -- Netflix produces better numbers but does it produce better recommendations?
  • More holistic measures -- survey users about trust and satisfaction (Swearingen & Sinha) -- may miss UI issues -- practical 'business' metrics -- bottomline measurements -- does this remove the science?
  • appreciated history of MIR (from a rec POV) will stick pic here -- currently hitting the 'Wall of Good Recs': since recs don't suck, it's now harder to test
  • easy to test for bad recs -- hard to test for good recs
  • What if the emerging problems (like UI and trust) are no longer measurable?
  • Is user preference too variable and unstable to be useful?
  • from science to art?
  • 2 options:
  • 1: focus on unsolved MIR: better encoding of preference (more socio-cultural research)
  • What are the limits of the avg listener (hey it's our playlist survey!) -- playlist Turing tests, understand artist v. album v. tracks -- can we build tools/games to expand this
  • listener profile -- can you quantify the sonic v. social preference -- add relevance layers to search and retrieval
  • 2: adjourn to the Beach
  • Questions:
  • Mark Levy: Do you think you're too embarrassed about good engineering? What about controlled experiments by people like google/last.fm? -- Move from science to engineering (this confuses me slightly; ISMIR has always been Engineering not Pure Science) It is fruitful but is it science.
  • Claudio: Can you speak a bit about your experience combining human knowledge vs. algorithms --- yes. what do you do with human knowledge? it's tricky. look for the ideal rec experience - sit around with your friends and play records: how do you scale that in a system? It's not about classification - humans are good at putting things together - train people to be qualitative assessors
  • Oscar: Since you used to be in college radio, how do you think this experience could inform playlisting? Do you use playlisters? Well, only a 1.5yr experience, but it made me think about the groups of listeners. Name-checks John Peel. What about presentation - In terms of what Rovi does: Minimally we can stop making bad playlists: gives example then breaks - v. hard to differentiate btwn good - v. good - excellent
  • Me: what about bypassing order by selecting good sets:
  • (Eugenio Tacchini): how much is the expert transparency necessary? yes give justification but need to avoid the feeling of stereotyping, weird vague directions, not just look at this user but look at this part of this user.
  • tom butcher: Is music rec really a unique snowflake? - Every domain is unique. -- One thing: a bad rec in music costs 2 minutes, a bad film rec costs you 2hrs; music has a lower penalty cost for bad recs. Also diff in features: will sonic features get you to pref? prob not in music [I think this is a thing which may improve...]
(update 2 10:31am)
session 1:
Time Dependency
  • personal ex. showing diff between early day v. late night playlist
  • trying to link 2 concepts - Day- hour - (weather?) and Music track selection
  • few papers on this idea -- take things from Human Dynamics -- trying to enable playing music 'at the right moment' -- explore circular stats
  • Circular stats (eqs in paper at link) basically transform raw data by a periodicity (days, weeks) - a rough sketch follows this list
  • Circular stats have analogous tools to trad stats - hyp tests for instance
  • Data for eval is full listening history of 992 unique last.fm users with artist/title + time of day (ToD) also got genre via track.getTopTags, keeping genre -- discarded users w/o enough data
  • scraped about half the data
  • attempt to make predictions - use two years of data to predict the ToD of play in next year
  • results: by day about 2.5x better than chance, by hour about 3-5x better than chance (moving from half-hour to hour tolerance doubles the data)
  • note that the figures are overall, some users are v. predictable in this way, some are not.
  • Concl: temporal patterns can be predicted - not just what but when. plugs the last.fm clocks
  • Q (dunno who asked): what about user-to-user offsets (e.g. if a user gets up at 6 v. 8am, 8am means something different)? Currently can't do this, need sensor data. Would be sweet if we could, though note that the predictions are per user, so this is to some extent already dealt with
  • Q (again, people, say who you are): Method issue - when comparing day v. hour there's a percentage diff in the error tolerance? Sure, this could work; look at the baseline comparison...
  • Q (Eugenio Tacchini): I tried this a while ago with aggregated data and didn't find much spread; do you think aggregation is the issue? yeah, must be specific to the user: right time + right user, not just right user
  • Q (Klaas): do you think it would work with less data (can't wait 2 years)? Probably. This was a very conservative methodology, could probably get by with maybe three months. For this work we wanted lots of data to make things clear
  • Q (seriously, ID yourselves guys): did you use a popularity filter? No. tested if pref for a genre is different than the average for that genre
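
A tiny sketch of the circular-stats idea referenced above (my own illustration, not the paper's method): wrap play times onto a 24-hour circle and summarise them with a circular mean and a concentration measure, which could then feed hypothesis tests or time-of-day predictions.

```python
import math

def circular_profile(hours):
    """Map hours-of-day (0-24) onto a circle and return the circular mean
    hour and the mean resultant length R (0 = spread out, 1 = concentrated)."""
    angles = [2 * math.pi * h / 24.0 for h in hours]
    s = sum(math.sin(a) for a in angles) / len(angles)
    c = sum(math.cos(a) for a in angles) / len(angles)
    mean_hour = (math.atan2(s, c) % (2 * math.pi)) * 24.0 / (2 * math.pi)
    concentration = math.hypot(s, c)
    return mean_hour, concentration

# e.g. plays of a late-night listener: mean near midnight, high concentration
print(circular_profile([23.5, 0.5, 1.0, 23.0, 2.0]))
```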
Break time then my talk. no notes for my talk as I'm talking...

Update (12:16): I was without my machine for the social tag session, not just my talk. I'll get my hand notes in another post but for now here are the papers:
next paper is being skipped since the author was unable to attend due to illness:

Now joining the presentation already in progress by Audrey Laplante:
  • qualitative study of adolescents
  • 'Did your music taste change significantly in the last three years?' Yes, whys: New boyfriend, New school therefore new friends, important discussion topic
  • "who in your 'gang' or group exert the most influence on others in terms of music?" -- 3 self-identified. Characteristics: highly invested in music, good comm, willing to share info. People who are opinion leaders want to stay opinion leaders, will invest heavily in effort
  • in other domains, work shows that weak ties are more important than strong ties in finding new information - works almost all the time -- for 2 participants weak ties were important -- for others, strong ties with a significantly different social network are important -- music as a vehicle for social interaction
  • strong ties have different roles -- not important for discovery, but critical for legitimization of musical taste
  • similar and reliable social connections are critical
  • social network maps (pic forthcoming...)
  • unknown how common these results are (same survey) as yet unknown exact implications for recommenders
  • Q (unknown)- Weak ties v. Strong ties -- how do you define the difference?: not really about newness, but it's entirely possible with new detail
  • Q (claudio) - What kinds of systems are implied with this work? Not necessarily a different system for adolescents. tight connectivity is critical, perhaps the difference is that strong ties may become more critical
  • (claudio) - does the notion that music describes you change as you get older?: not really actually, adolescents are interested in individual uniqueness
  • Q (Mark L) are social networks online somewhat different?: yes and no. in facebook you can find relatives, but noise is a big problem. But trust is not known
  • I asked about using graph difference. Answer could work, also other automatic methods...
lunch now. I'll make a new post for the afternoon session.

updated again (5:14pm) Eugenio Tacchini is Italian not Finnish (oops)

Thursday, 9 September 2010

Roomba Recon - A musichackday brain dump

So this past weekend I attended the 2nd (annual?) London Music Hackday at the Guardian's offices at King's Cross. For the hack I created an algorithm that generates playlists between arbitrary start and end songs on soundcloud. It does this with almost no pre-indexing, allowing for playlists to cover the entire network and always use an up-to-date graph. It's (mostly) running live if you'd like to play with it.

Briefly, it performs a sort of bastardized A* search, bilaterally from both the start and the end song to form the playlist. There's a parameter to limit the length of the two playlist segments; by default this is 4, so the max playlist length is 10 (2*4+2 for the endpoints).

The search algorithm collects the social links of the artist corresponding to the given song. For each of these connections (you know, 'friends' or in soundcloud jargon 'followings') the cost of adding that song is calculated in the following way (for the half built from the start song):
cost(n, m) = d(n, m) / d(m, e)

where cost(n, m) is the cost to add song m to the list after song n, d(n, m) is some measure of distance from song n to song m, and d(m, e) is the same measure of distance from song m to song e. Song e is the end song for the whole playlist. So basically the idea is that the cost of moving to a node is a ratio of how far away it is from where you were to how far it is to where you're trying to get. The whole thing is reversed for the other half, so the cost function makes it cheap to move toward the start song. If you simply want to randomly traverse social links the cost can be set to an arbitrary equal value (I used 1) for all links.
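
For concreteness, here is a minimal sketch of that cost function and one greedy forward step, not the deployed code. The distance(a, b) and get_followings(song) callables are hypothetical stand-ins (the real hack uses soundcloud followings and the distances described below), and the orientation of the ratio follows the literal description above; the deployed hack may differ in detail.

```python
def step_cost(n, m, e, distance):
    """Cost of appending song m after song n while aiming at end song e:
    the ratio of the step distance d(n, m) to the remaining distance d(m, e),
    per the description above (guarded against division by zero)."""
    return distance(n, m) / max(distance(m, e), 1e-9)

def extend_towards(current, goal, get_followings, distance):
    """Greedily pick the cheapest next song among the social links
    ('followings') reachable from the current song's artist."""
    candidates = get_followings(current)
    if not candidates:
        return None
    return min(candidates, key=lambda m: step_cost(current, m, goal, distance))
```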

This leaves the matter of distance.

Starting with what I know best, I decided to try a content-based distance first. I should say that from the onset I figured this would be insanely slow, but nonetheless, I gave it a go. I implemented (available directly as well) a little object that grabs the echonest timbre features for any two soundcloud songs, summarizes the features into a single multidimensional gaussian (mean and std), then takes the cosine distance between the two tracks (other distance metrics could be computed as well, but cosine seemed reasonable). That takes something on the order of 45 seconds for every pair of tracks. When using it in the above playlister the whole thing would take maybe 4 hours (I think; I never actually let it complete). Clearly way too slow.
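
The actual object is linked above; this is just a rough sketch of the summarize-then-compare step, assuming you already have each track's per-segment echonest timbre matrix (n_segments x 12). The function names are my own.

```python
import numpy as np
from scipy.spatial.distance import cosine

def summarize_timbre(timbre_frames):
    """Collapse a track's per-segment timbre vectors (n_segments x 12) into
    a single mean+std summary vector (the 'multidimensional gaussian')."""
    frames = np.asarray(timbre_frames)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def timbre_distance(frames_a, frames_b):
    """Cosine distance between the gaussian summaries of two tracks."""
    return cosine(summarize_timbre(frames_a), summarize_timbre(frames_b))
```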

So, taking inspiration from my about-to-be-published work at WOMRAD, I thought some NLP could save the day. The other distance measure I implemented (no direct access yet) is based on a track's tags and comments. First I tokenize the comments and combine them with the tags to create a vector space model of a track's descriptive text. I then weight everything using tfidf (the idf was populated with a random sample of tracks from across soundcloud that I gathered over the weekend, about 41,000 tracks in total; this is the only indexing done in advance). From the tfidf-weighted terms in a vector space, I take the cosine distance. This is both quite quick and gives pretty good results.
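
Not the code I ran, but a rough sketch of the tag/comment tfidf distance using gensim's current API. The 41k-track sample is assumed to already be in hand, and the track dicts with 'tags' and 'comments' keys are hypothetical.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import cossim

def tokenize(track):
    """Combine a track's tags with its tokenized comment text."""
    comment_tokens = " ".join(track["comments"]).lower().split()
    return [t.lower() for t in track["tags"]] + comment_tokens

def build_tfidf(sampled_tracks):
    """Fit the idf side of tfidf on a pre-gathered sample of tracks."""
    docs = [tokenize(t) for t in sampled_tracks]
    dictionary = Dictionary(docs)
    tfidf = TfidfModel(dictionary=dictionary)
    return dictionary, tfidf

def text_distance(track_a, track_b, dictionary, tfidf):
    """Cosine distance between the tfidf-weighted tag/comment vectors."""
    vec_a = tfidf[dictionary.doc2bow(tokenize(track_a))]
    vec_b = tfidf[dictionary.doc2bow(tokenize(track_b))]
    return 1.0 - cossim(vec_a, vec_b)
```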

Everything was built in python; the app is running in cherrypy, using numpy and scipy for the data handling and gensim for the tfidf-related bits. Soundcloud and echonest interaction is all via their respective python wrappers. Also there's a more terse write-up over at the musichackday wiki. I'll stick the code in my repository on github once it's cleaned up a bit (though that might be a little while, as I seem to be rather busy with something at the moment...)

Right. Back to writing my thesis.