Sunday, 12 October 2008

mypyspace status update

So MyPySpace has been getting a facelift.  Kurt, with some input from users (we apparently have users!  Who knew.), has been refactoring the RDF translators and fine-tuning the MySpace ontology as well.  Most (all?) of these changes are also being reflected in the live service.  While all of this has been going on, I've refactored the page scraping, crawling and downloading into a much more sensible class architecture from its former stream-of-consciousness implementation (I believe the polite description is 'research code').  It still has quite a ways to go (alpha!) but it's starting to resemble an actual library.  If you want to play with my refactored bits you can check them out like this:

> svn co https://mypyspace.svn.sourceforge.net/svnroot/mypyspace/myspaceCrawler/trunk/ myspaceCrawler

Then you can do nifty things like the following (inside your favorite Python interpreter or script; I'm using ipython here):

In [1]: import mpsUser

In [2]: gearmonkey = mpsUser.mpsUser('http://www.myspace.com/gearmonkey')

You simply give the class a valid MySpace user URL to initialize it.  (This one is my artist page.  If you want to play with this, don't feel the need to listen to my music...)

In [3]: gearmonkey.isArtist
Out[3]: True


In [4]: gearmonkey.downloadTracks('~/Music/mpsUsertest/gearmonkey/')
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/gearmonkey/1_Cheeky.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/gearmonkey/2_TrainTune.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/gearmonkey/3_Give Way.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/gearmonkey/4_En La Selva Mvt II GMO vip.mp3; creating tag from scratch
Out[4]: (4, 4)

Once the user is initialized, you can find out whether it is an artist by checking the boolean isArtist.  If it is, you can download their songs with downloadTracks, which returns a tuple of (songs successfully downloaded, downloads attempted).
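
Since the return value pairs successes with attempts, a script can check it to spot partial failures.  Here is a minimal sketch that assumes nothing beyond the tuple shape described above; the helper name summarize_downloads is made up for illustration and is not part of mypyspace:

```python
def summarize_downloads(result):
    """Turn a (downloaded, attempted) tuple into a short status string."""
    downloaded, attempted = result
    if attempted == 0:
        return "no tracks found"
    if downloaded == attempted:
        return "all %d tracks downloaded" % attempted
    return "%d of %d tracks downloaded, %d failed" % (
        downloaded, attempted, attempted - downloaded)


# With the real library the tuple would come from something like:
#   result = gearmonkey.downloadTracks('~/Music/mpsUsertest/gearmonkey/')
print(summarize_downloads((4, 4)))  # all 4 tracks downloaded
print(summarize_downloads((3, 4)))  # 3 of 4 tracks downloaded, 1 failed
```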

In [5]: gearmonkey.songs[0].title
Out[5]: u'Cheeky'

As part of the download process, each song is stored in the user's songs list as an instance of the class mpsSong (more on that class in a bit).

You can use the mpsUser class to crawl the artist network like this:

In [6]: artistFriends = gearmonkey.findArtistTopFriends()

In [7]: artistFriends
Out[7]: 
[mpsUser.mpsUser instance at 0x1a3a7b0,
 mpsUser.mpsUser instance at 0x1c5d788,
 mpsUser.mpsUser instance at 0x1a3ad50,
 mpsUser.mpsUser instance at 0x1a3a968,
 mpsUser.mpsUser instance at 0x1c7fc88]

In [8]: artistFriends[0].artist
Out[8]: u'Mike'

In [9]: for entry in artistFriends:
   ...:     print entry.artist
   ...:    
Mike
Otto Von Schirach
GEIN
The Dead Hookers' Bridge Club
EVOL


In [10]: artistFriends[2].downloadTracks('~/Music/mpsUsertest/GEIN/')
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/GEIN/1_Life Of Sin GEIN edit.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/GEIN/2_Deadly Algorhythm GEIN Remix.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/GEIN/3_GEIN KJ Sawka Break the Enemy.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/GEIN/4_GEIN  Warden.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/GEIN/5_GEIN vsThe ChosenAbomination.mp3; creating tag from scratch
INFO:root:No ID3 header found for /Users/bfields/Music/mpsUsertest/GEIN/6_GEIN  Hell Audio rmx.mp3; creating tag from scratch
Out[10]: (6, 6)

In [11]: artistFriends[2].topFriends
Out[11]: 
[u'11187934',
 u'2123795',
 u'2177245',
 u'706581',
 u'20492111',
 u'66601290',
 u'5017015',
 u'207669100',
 u'52365642',
 u'2186134',
 u'3378431',
 u'55609497',
 u'30244',
 u'26629700',
 u'80613962',
 u'74772580',
 u'28841051',
 u'317327']

In [12]: geinArtistFriends = artistFriends[2].findArtistTopFriends()

In [13]: geinArtistFriends
Out[13]: 
[mpsUser.mpsUser instance at 0x1d83a30,
 mpsUser.mpsUser instance at 0x1d95710,
 mpsUser.mpsUser instance at 0x1d99198,
 mpsUser.mpsUser instance at 0x1d95be8,
 mpsUser.mpsUser instance at 0x1d956c0,
 mpsUser.mpsUser instance at 0x1e6fa80,
 mpsUser.mpsUser instance at 0x1e81b98,
 mpsUser.mpsUser instance at 0x1da7080,
 mpsUser.mpsUser instance at 0x1e81b70,
 mpsUser.mpsUser instance at 0x1f2dee0]

In [14]: for friend in geinArtistFriends:
   ....:     print friend.artist
   ....:    
EVOL
GUERILLA®
THE GUN
Habit Recordings
Mumblz / Delusional
Tech Itch
Lost Soul Recordings
None
Donny
NECRO THE SEXORCIST SPECIAL EDITION CD/DVD SOON!!!


And so on and so forth.  Once you've initialized the songs for an artist, you can use the mpsSong class to find things out about the songs as well:

In [16]: gearmonkey.songs
Out[16]: 
[mpsUser.mpsSong instance at 0x1a155d0,
 mpsUser.mpsSong instance at 0x1a2ff08,
 mpsUser.mpsSong instance at 0x1a338f0,
 mpsUser.mpsSong instance at 0x1a338c8]

In [17]: for song in gearmonkey.songs:
   ....:     print song.title + " by " + song.parent.artist + " has been played " + song.playcount + " times." 
   ....:    
Cheeky by G_M_O has been played 117 times.
TrainTune by G_M_O has been played 168 times.
Give Way by G_M_O has been played 88 times.
En La Selva Mvt II GMO vip by G_M_O has been played 9 times.

In [18]: 
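
The hopping from one artist's top friends to the next shown earlier could be automated as a breadth-first crawl.  What follows is only a sketch: StubUser is a hypothetical stand-in that mimics the artist attribute and findArtistTopFriends() method seen in the session so that it runs without touching the network, and crawl_artists is not part of the library.  It dedupes on the artist name purely for illustration; since a name can come back as None (as in the output above), a real crawl would want to key on something unique to the profile instead.

```python
from collections import deque


class StubUser(object):
    """Hypothetical stand-in for mpsUser.mpsUser (no network access)."""

    def __init__(self, artist, friends=None):
        self.artist = artist
        self._friends = friends if friends is not None else []

    def findArtistTopFriends(self):
        # The real method scrapes MySpace; this one returns canned data.
        return list(self._friends)


def crawl_artists(seed, max_depth=2):
    """Breadth-first walk over artist top friends, up to max_depth hops."""
    seen = {seed.artist}
    order = [seed.artist]
    queue = deque([(seed, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for friend in user.findArtistTopFriends():
            if friend.artist in seen:
                continue  # already visited via another path
            seen.add(friend.artist)
            order.append(friend.artist)
            queue.append((friend, depth + 1))
    return order


# A tiny made-up network loosely shaped like the session above.
evol = StubUser('EVOL')
gein = StubUser('GEIN', [evol, StubUser('Tech Itch')])
seed = StubUser('G_M_O', [StubUser('Mike'), gein, evol])
print(crawl_artists(seed))
# ['G_M_O', 'Mike', 'GEIN', 'EVOL', 'Tech Itch']
```

The depth limit matters in practice: every hop multiplies the number of pages fetched, so a small max_depth keeps a crawl polite.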


There are also some simple hooks to call fftExtract on the songs of an artist, but I'll save those bits for another post.  One quick note: I don't believe we've fixed the bug that prevents song downloads in the US (and maybe Canada), but the URL requests have been changed slightly, so if anyone tries it over there, let me know.  All the scraping should be fine in the States, and everything should work everywhere else.  Also, you need the mutagen ID3 tag library installed before using this.
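
If you're not sure whether mutagen is installed, a quick check before you start downloading saves a confusing traceback later.  This is a generic sketch using only the standard library; has_module is a made-up helper name, not something mypyspace provides:

```python
import importlib


def has_module(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if not has_module("mutagen"):
    print("mutagen is missing; install it before calling downloadTracks")
```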

If any readers do give this a try, let me know down below if you have any thoughts (especially interface-related ones).