Tuesday, 18 March 2008

how much data?

In my work with Kurt on MyPySpace we've been dealing with fairly large amounts of data, at least compared to typical loads in content-based MIR research. As a point of reference, I'd say a fairly standard music similarity or classification study will have a data set on the order of 10^3 songs, while our initial research efforts have used a data set of about 55,000 songs by approximately 16,000 artists. Further, there have been some studies (this one for instance) that use entire commercial mp3 datasets (said paper used Yahoo!'s digital download library, on the order of 10^7 songs). These papers tend to focus on the problems specific to large datasets, since at that scale it becomes impossible to brute force your way out of the situation.
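
To put rough numbers on that last point, here is a minimal sketch. It assumes "brute force" means comparing every song against every other song once, which is an illustrative assumption rather than something the cited work spells out; the function name `pairwise_comparisons` is made up for this example.

```python
def pairwise_comparisons(n_songs: int) -> int:
    """Number of unordered song pairs in a collection of n_songs (n choose 2)."""
    return n_songs * (n_songs - 1) // 2


# Dataset sizes taken from the figures quoted above.
datasets = [
    ("typical MIR study (~10^3 songs)", 10**3),
    ("MyPySpace (~55,000 songs)", 55_000),
    ("commercial library (~10^7 songs)", 10**7),
]

for label, n in datasets:
    # Thousands separators make the growth easier to read.
    print(f"{label}: {pairwise_comparisons(n):,} pairs")
```

The pair count grows with the square of the collection size, so going from 10^3 to 10^7 songs multiplies the work by roughly 10^8, not 10^4.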

So anyway, all of this has me thinking: how much music data is out there? How many musical recordings exist? Anybody know? I could Google it a bit, but I'm lazy.