How many MusicBrainz pages will have RDFa?

Update: we grossly overestimated the number of RDFa pages at first. There is no multiplication by artist numbers, only an additional summation of recordings, releases, works, and release groups.

The question was recently posed: how many pages will contain RDFa in the new MusicBrainz release? Here are some "back of the napkin" calculations.

In the most recent NGS database dump we have:

  • 539,988 artists
  • 710,665 release groups
  • 850,853 releases
  • 9,138,660 recordings
  • 190,103 works
  • 42,300 labels

First we can sum the release groups, releases, recordings, works, and labels and multiply by two, because each of these entities has RDFa on both its main page and its /details page.

But then with artist pages, it starts to really explode.  The pagination stipulates listings of 50 entities per page.  Every release group, release, recording and work is credited to an artist, so we count these again.  We can estimate RDFa pages associated with artists as:

 (release groups/50) + (releases/50) + (recordings/50) + (works/50) 

That results in a whopping 12.1 million pages of RDFa!!! Note, however, that there is a fair amount of overlap in terms of triples. For example, most of the triples in the release pages are also present in the artist /releases pages. Please let me know if I'm calculating this incorrectly; I know it seems really high! But if my math is correct, this is actually an underestimate, because some artists will have fewer than 50 releases, recordings, works, etc., and each of those partial listings still takes up a full page.
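
For anyone who wants to check the arithmetic, here is a minimal sketch of the two steps described above, using the counts from the dump. The names (`counts`, `PAGE_SIZE`, `entity_pages`, `artist_listing_pages`) are made up for illustration, and rounding each listing up to whole pages is an assumption on top of the plain division in the formula. Read literally, these steps give a total near 22.1 million, so the sketch does not reproduce the 12.1 million figure; the doubling or the set of entities counted may have been treated differently in the original arithmetic, and the sketch does not try to guess how.

```python
# Entity counts quoted above from the NGS database dump.
counts = {
    "release_groups": 710_665,
    "releases": 850_853,
    "recordings": 9_138_660,
    "works": 190_103,
    "labels": 42_300,
}

# Pagination size for artist listings, as stated in the post.
PAGE_SIZE = 50

# Entity types that are credited to an artist and so appear in artist listings.
ARTIST_CREDITED = ("release_groups", "releases", "recordings", "works")


def entity_pages(counts):
    """Main page plus /details page for every non-artist entity."""
    return 2 * sum(counts.values())


def artist_listing_pages(counts, page_size=PAGE_SIZE):
    """Paginated artist listings, rounding each entity type up to whole pages.

    Assumption: rounding up per entity type; the post's formula divides by 50
    without saying how fractions are handled.
    """
    return sum(
        (counts[kind] + page_size - 1) // page_size for kind in ARTIST_CREDITED
    )


total_pages = entity_pages(counts) + artist_listing_pages(counts)

print(f"entity pages:         {entity_pages(counts):,}")
print(f"artist listing pages: {artist_listing_pages(counts):,}")
print(f"total:                {total_pages:,}")
```

As the post notes, the artist listing term is a lower bound, since it treats all listings as if they were packed into full pages of 50.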

Wow!! Any more news on this?

I'm presuming this may well point to scalability and resulting performance issues? Or maybe not?

Are you able to provide any more info on the potential implications of this? Sounds like it's an important issue to track.

Cheers, Adrian
JiscEXPO Synthesis Liaison

Cool to have all that data, but perhaps it suggests some need to make sure there's a less redundant view of it. Perhaps a sitemap (see http://sindice.com/developers/publishing ) to help crawlers be more focussed and not grab the same thing many times over?

Very impressive guys, a real achievement for the first half of the project. Question is now, what are you going to do to top this?! I look forward to reading more :) /dff
