Monday, 16 May 2011

Doing ridiculous things with natural language processing

The weekend before last, while I jealously followed the SF musichackday from afar (I was unable to attend as I'm awaiting the result of a visa application), I started mucking about with the beta musicmetric API (full disclosure: they are my employer), in particular the sentiment analyzer.

So the first thing I put together was a bit of Python to fetch the content of a tweet and use the mm_api to determine its tone. This can be done quite simply (full source):

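import urllib2
from json import loads

# twitter_id and API_KEY are assumed to be defined elsewhere (see the full source)
# grab some data from twitter
tweet_url = "http://api.twitter.com/1/statuses/show/{0}.json".format(twitter_id)
data = loads(urllib2.urlopen(tweet_url).read())
tweet_content = data["text"]
# push to sentiment analysis (passing data makes urlopen send a POST)
senti_url = "http://apib1.semetric.com/musicmetric/sentiment?token=" + API_KEY
raw_senti = loads(urllib2.urlopen(senti_url, data=tweet_content).read())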


This gives a number between 1 and 5, with 1 indicating the text is 'very negative' and 5 indicating the text is 'very positive' (gory details of the sentiment analyzer). While the sentiment analyzer is trained on larger chunks of text (500 word album/movie reviews and that sort of thing), it in fact does fairly well with tweets (though sarcasm is its downfall). So I thought I'd do something a bit silly and build a 'flamewar detector and troll finder' for conversations on twitter.

I've called the initial command line tool firealarm.

To gather the conversations, I'm just piggybacking on @jwheare's great tool exquisite tweets. Once a conversation is archived over on exquisite tweets, the cli can be pointed to it via the conversation's url. Each tweet in the conversation is pushed through the sentiment analyzer; the simple mean (µ) of all the sentiment scores is then dubiously used to determine if the conversation is a flamewar. If the sentiment is generally negative (µ < 3) it's a flamewar, if it's generally positive (µ > 3) it's not a flamewar, and if it is exactly neutral (µ == 3) it's declared a tossup. The troll finder is equally straightforward (and equally dubious!). Across the sequence of tweets, the author of the tweet showing the largest drop in sentiment relative to the tweet preceding it is considered the troll. In the case of a tie the first occurrence wins.
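
To make the two rules above concrete, here's a minimal sketch in Python. It is not the actual firealarm code: the function names, the 'no' and 'tossup' return labels, and the (author, score) tuple layout are all made up for illustration, and the sentiment scores are assumed to have been fetched already.

```python
from statistics import mean

NEUTRAL = 3  # midpoint of the 1-5 sentiment scale


def classify_conversation(scores):
    """Label a conversation from the mean of its sentiment scores.

    The return labels are illustrative, not firealarm's own.
    """
    mu = mean(scores)
    if mu < NEUTRAL:
        return "yes"      # generally negative: flamewar
    if mu > NEUTRAL:
        return "no"       # generally positive: not a flamewar
    return "tossup"       # exactly neutral


def find_troll(tweets):
    """Return the author of the tweet with the largest sentiment drop.

    tweets is a list of (author, score) pairs in conversation order
    (a hypothetical layout). Returns None if sentiment never drops.
    """
    worst_delta = 0
    troll = None
    for (_, prev_score), (author, score) in zip(tweets, tweets[1:]):
        delta = score - prev_score
        # strict comparison, so the first occurrence wins a tie
        if delta < worst_delta:
            worst_delta = delta
            troll = author
    return troll


# made-up conversation, purely for illustration
conversation = [("alice", 4), ("bob", 2), ("carol", 3), ("dave", 1), ("alice", 2)]
is_flamewar = classify_conversation([score for _, score in conversation])
troll = find_troll(conversation)
```

Note that the first tweet can never be the troll under this rule, since it has no preceding tweet to compare against.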

Here's an example (note that the linked-to example is a nasty, nasty flame war. If you offend easily, might want to skip it. Also, obviously, views expressed are not mine, etc.):


This generates the following output:

fetching the conversation at http://www.exquisitetweets.com/collection/RodBegbie/402
Is the conversation a flame war?
yes
Where's the troll?
http://twitter.com/JohnONolan/status/63027551760691200
Author: John O'Nolan username: JohnONolan
Now stop feeding the trolls!

A plot of the sentiments, with the maximum negative delta in red, looks like this:



A quick read of the tweets and you can see that the actual sentiment of the tweets is a bit more negative overall than the analyzer output, but this is good enough for binary classification and makes for a fairly reasonable, albeit naive, troll ID mechanism.

The code is over at github if you want to have a look or run it yourself. If you want to run it, you'll need a musicmetric api key which takes about 30 seconds to get (apply here). Eventually I'm going to turn this into a web app, and when that happens I'll let everybody know. Also if you happen to find any really bad mislabels, let us know, as it helps us tune up our process.

Have fun being algorithmically judgemental!