So the first thing I put together was a bit of Python to fetch the content of a tweet and use the mm_api to determine its tone. This can be done quite simply (full source):
# Python 2 snippet; twitter_id and API_KEY are assumed to be defined elsewhere
import urllib2
from json import loads

# grab some data from twitter
data = loads(urllib2.urlopen(
    "http://api.twitter.com/1/statuses/show/{0}.json".format(twitter_id)).read())
tweet_content = data["text"]
# push to sentiment analysis (passing data makes this a POST of the tweet text)
raw_senti = loads(urllib2.urlopen(
    "http://apib1.semetric.com/musicmetric/sentiment?token=" + API_KEY,
    data=tweet_content).read())
This gives a number between 1 and 5, with 1 indicating the text is 'very negative' and 5 indicating the text is 'very positive' (gory details of the sentiment analyzer). While the sentiment analyzer is trained on larger chunks of text (500-word album/movie reviews and that sort of thing), it in fact does fairly well with tweets (though sarcasm is its downfall). So I thought I'd do something a bit silly and build a 'flamewar detector and troll finder' for conversations on Twitter.
I've called the initial command line tool firealarm.
To gather the conversations, I'm just piggybacking on @jwheare's great tool exquisite tweets. Once a conversation is archived over on exquisite tweets, the CLI can be pointed to it via the conversation's URL. Each tweet in the conversation is pushed through the sentiment analyzer; the simple mean (µ) of all the sentiment scores is then dubiously used to determine if the conversation is a flamewar. If the sentiment is generally negative (µ < 3) it's a flamewar, if it's generally positive (µ > 3) it's not a flamewar, and if it is exactly neutral (µ == 3) it's declared a tossup. The troll finder is equally straightforward (and equally dubious!). Across the sequence of tweets, the author of the tweet preceded by the largest-magnitude negative change in sentiment (that is, the biggest drop from the previous tweet's score) is considered the troll. In the case of a tie, the first occurrence wins.
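The two rules above fit in a few lines of Python. This is only a rough sketch, not the code from firealarm itself: the function names, the `(author, score)` tuples, and the sample conversation are all made up for illustration, and it assumes the "delta" is each tweet's score minus the score of the tweet before it.

```python
from statistics import mean

NEUTRAL = 3  # midpoint of the 1-5 sentiment scale


def is_flamewar(scores):
    """Classify a conversation from its mean sentiment: 'yes', 'no' or 'tossup'."""
    mu = mean(scores)
    if mu < NEUTRAL:
        return "yes"
    if mu > NEUTRAL:
        return "no"
    return "tossup"


def find_troll(tweets):
    """Return the author of the tweet with the biggest drop from the previous tweet.

    tweets is a list of (author, score) pairs in conversation order (a
    hypothetical structure). Returns None if the sentiment never drops.
    """
    troll, worst_delta = None, 0
    for (_, prev_score), (author, score) in zip(tweets, tweets[1:]):
        delta = score - prev_score
        if delta < worst_delta:  # strict comparison, so the first occurrence wins ties
            troll, worst_delta = author, delta
    return troll


# Invented scores, purely for illustration
conversation = [("alice", 4), ("bob", 3), ("alice", 3),
                ("carol", 1), ("bob", 3), ("alice", 1)]
flamewar = is_flamewar([score for _, score in conversation])
troll = find_troll(conversation)
```

With these invented scores the mean is 2.5, so the conversation counts as a flamewar. Both carol and the final alice tweet drop the sentiment by 2, and carol gets the blame because her drop came first.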
Here's an example (note that the linked-to example is a nasty, nasty flame war. If you offend easily, might want to skip it. Also, obviously, views expressed are not mine, etc.):
$ python firealarm.py http://www.exquisitetweets.com/collection/RodBegbie/402
This generates the following output:
fetching the conversation at http://www.exquisitetweets.com/collection/RodBegbie/402
Is the conversation a flame war?
yes
Where's the troll?
http://twitter.com/JohnONolan/status/63027551760691200
Author: John O'Nolan username: JohnONolan
Now stop feeding the trolls!
A plot of the sentiments, with the maximum negative delta in red, looks like this:
A quick read of the tweets shows that their actual sentiment is a bit more negative overall than the analyzer output suggests, but this is good enough for binary classification and makes for a fairly reasonable, albeit naive, troll ID mechanism.
The code is over at GitHub if you want to have a look or run it yourself. If you want to run it, you'll need a musicmetric API key, which takes about 30 seconds to get (apply here). Eventually I'm going to turn this into a web app, and when that happens I'll let everybody know. Also, if you happen to find any really bad mislabels, let us know, as it helps us tune up our process.
Have fun being algorithmically judgemental!