At the RSS Industry Night Roundtable (thanks Rok for setting this up), most of the discussion centered on individualized RSS: both truly individual feeds and feeds that contain instrumentation for metrics gathering. Both kinds of feeds exist in the wild today, and both cause problems for hosted aggregators (Yahoo!, NewsGator, and others) and for feed search engines (Technorati, Feedster, NewsGator, and others).
In both cases, we have a number of independent feeds that all contain the same or similar content. Some of the content may not be intended for public consumption (for example, it might have the user’s name in it as part of a personalized message), and other content may be duplicated but slightly different (think per-feed marked-up URLs used for click-through tracking).
We need a way to avoid indexing this data. [aside: actually, other publishers have reasons for not wanting their content indexed as well – this solution will cover this third case also] Today, if you want Yahoo! to stop indexing your feeds, you call them and they mark your domain as such. If you want NewsGator to stop indexing, you call us and we mark your domain. And so on…which turns into a long list of calls you need to make. :-) At the industry meeting the other night, nearly everyone in the room agreed we needed a no-index indicator.
So here’s a proposal. Let’s kick it around and hammer something out quickly.
<rss version="2.0" xmlns:r="urn:anzu-industry-meeting-2005-12">
<channel>
<title>My Feed</title>
…
<r:index allowIndex="false" />
<item>
<title>My article</title>
…
<r:index allowIndex="false" /> *** see note below
</item>
This shows an “index” element at the feed level, which controls index-ability for the entire feed. If the element is not present, allowIndex is implied to be true.
I also show an item-level “index” element (***), which could specify the index-ability settings for a specific item. I’m less sure about this one…but at the meeting, Eric Hayes at Attensa mentioned it, so I put it in for discussion. I’d love to hear some thoughts about this one, including some use cases.
The implied behavior when an aggregator or search engine encounters a feed like this would be to a) not index the content, and potentially b) not archive the content, if it normally does archive content.
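To make the proposed semantics concrete, here is a minimal Python sketch of how a consumer might read the "index" element. The namespace URI and the allowIndex attribute come from the proposal; the function names are made up for illustration. It also assumes that an item-level element overrides the feed-level one, which is exactly the part still open for discussion.

```python
import xml.etree.ElementTree as ET

# Namespace URI from the proposal; all function names here are hypothetical.
R_NS = "urn:anzu-industry-meeting-2005-12"
INDEX_TAG = f"{{{R_NS}}}index"

def allow_index(element, default):
    """Read allowIndex from a direct r:index child, falling back to default."""
    marker = element.find(INDEX_TAG)  # direct children only
    if marker is None:
        return default
    return marker.get("allowIndex", "true").strip().lower() != "false"

def index_settings(feed_xml):
    """Return (feed_allowed, [(item_title, item_allowed), ...])."""
    channel = ET.fromstring(feed_xml).find("channel")
    # A missing feed-level element implies allowIndex="true".
    feed_allowed = allow_index(channel, True)
    # Assumption: an item without its own element inherits the feed setting.
    items = [
        (item.findtext("title"), allow_index(item, feed_allowed))
        for item in channel.findall("item")
    ]
    return feed_allowed, items

SAMPLE_FEED = """\
<rss version="2.0" xmlns:r="urn:anzu-industry-meeting-2005-12">
  <channel>
    <title>My Feed</title>
    <r:index allowIndex="false" />
    <item><title>My article</title><r:index allowIndex="true" /></item>
    <item><title>Another article</title></item>
  </channel>
</rss>
"""

if __name__ == "__main__":
    print(index_settings(SAMPLE_FEED))
```

In this sample the feed as a whole opts out of indexing, one item opts back in, and the other item inherits the feed-level setting.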
So anyway…this is all pretty simple, and it solves an immediate problem that the whole industry is seeing. Please comment, either here or on your own blog (add a trackback), and let’s see if we can agree on something quickly.
Greg,
Looks straightforward and simple to implement. Would it be possible to provide the URL that can be indexed by search engines and used in aggregator directories? So something like:
Thanks for kicking the process off.
Oops – it’s now part of the HTML. So let me try again:
r:index IndexURL=[Insert URL including HTTP]
Thanks
Stuart – not sure I understand. Would the URL you’re talking about be different than the channel URL in the feed? If so, how?
Greg, thanks for starting the discussion.
What do you think about using the META tag, used in HTML for this purpose?
<meta name="robots" content="noindex" />
Or, in order for it to be used within RSS, with a namespace:
<meta xmlns="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
It addresses the same use-case and is already widely used, although not in the RSS/Atom context.
thanks,
alex
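As a rough illustration of the namespaced meta element suggested in the comment above, a crawler could check for it along these lines. This is only a sketch: the function name is hypothetical, and treating the content attribute as a comma-separated list follows the HTML robots convention rather than anything agreed for feeds.

```python
import xml.etree.ElementTree as ET

# XHTML namespace as used in the comment above.
XHTML_META = "{http://www.w3.org/1999/xhtml}meta"

def has_noindex_meta(feed_xml):
    """Return True if any XHTML robots meta element in the feed says noindex."""
    root = ET.fromstring(feed_xml)
    for meta in root.iter(XHTML_META):
        if meta.get("name", "").lower() != "robots":
            continue
        # Assumption: content is a comma-separated directive list, as in HTML.
        directives = {d.strip().lower() for d in meta.get("content", "").split(",")}
        if "noindex" in directives:
            return True
    return False

SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>My Feed</title>
    <meta xmlns="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
  </channel>
</rss>
"""

if __name__ == "__main__":
    print(has_noindex_meta(SAMPLE_FEED))
```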
I know that Yahoo respects robots.txt files for not indexing feeds:
http://publisher.yahoo.com/help.php#optout
robots.txt doesn’t really fall into this discussion too much because that’s simply a file exclusion. It tells the crawler not to look at a specific location at all, rather than to look but not index.
The meta tag is the closest case; however, I don’t think it’s really appropriate. In its existing HTML form it means “don’t index this page” and doesn’t have an item-level version.
The RSS/XML formats really lend themselves better to a proper namespace such as Alex used in the example, though I’d suggest something more identifiable than ‘r’.
‘store’ or ‘storage’ would seem to be more appropriate.
Transposed names – that was supposed to be “as Greg used in the example”.
Greg, why wouldn’t adding a “disallow” for a feed’s URL in the robots.txt file solve the problem for an entire feed? Obviously, that won’t work for partial feeds. But neither your suggestion nor the robots.txt method will work if spiders don’t respect a “no index” directive.
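For comparison, the robots.txt approach mentioned in the comment above can be checked with Python's standard urllib.robotparser module. This is a sketch only: the example.com URLs and the robots.txt contents are made up, and it says nothing about whether a given spider actually honors the file.

```python
from urllib.robotparser import RobotFileParser

def feed_fetch_allowed(robots_txt, feed_url, user_agent="*"):
    """Return True if robots_txt lets user_agent fetch feed_url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, feed_url)

# Hypothetical robots.txt that excludes one feed for every crawler.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /feeds/private.xml
"""

if __name__ == "__main__":
    print(feed_fetch_allowed(SAMPLE_ROBOTS, "http://example.com/feeds/private.xml"))
    print(feed_fetch_allowed(SAMPLE_ROBOTS, "http://example.com/feeds/public.xml"))
```

Note the granularity: the rule applies to the whole feed URL, so it cannot express a per-item setting.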
Greg,
Enjoyed meeting you at the Syndicate dinner, and thanks for taking the lead on this.
We think your no index proposal looks great. We did, however, want to throw a suggestion out for comment.
It seems that there could be benefits to both the RSS community and the consumer if we move from a boolean true/false index flag to one that can take three different values: no index, global index, and personal index.
Consider the scenario of a consumer using a web-based reader that offers the functionality to search for previously viewed items (a la YahooMail, Gmail, etc). Adding the “personal index” flag would enable providers to strengthen personal account searchability without having to publicly index the irss feed. For providers that don’t support personal account search, the personal_index flag could be treated the same as no_index.
This may be something to consider down the road, but we figured it might be worth a quick discussion. Thoughts?
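The three-value flag described in the comment above could be modeled roughly as follows. The value names no_index and personal_index appear in the comment; global_index, the class name, and the function name are assumptions made for this sketch.

```python
from enum import Enum

class IndexPolicy(Enum):
    NO_INDEX = "no_index"
    PERSONAL_INDEX = "personal_index"
    GLOBAL_INDEX = "global_index"  # assumed spelling, by analogy with the others

def effective_policy(declared, supports_personal_search):
    """Downgrade personal_index to no_index for providers without personal search."""
    if declared is IndexPolicy.PERSONAL_INDEX and not supports_personal_search:
        return IndexPolicy.NO_INDEX
    return declared

if __name__ == "__main__":
    # A provider without per-account search treats personal_index as no_index.
    print(effective_policy(IndexPolicy.PERSONAL_INDEX, supports_personal_search=False))
    # A provider with per-account search keeps the personal-only setting.
    print(effective_policy(IndexPolicy.PERSONAL_INDEX, supports_personal_search=True))
```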
Bill Nussey of Silverpop kicked off an interesting debate on Personalized RSS or Individualized RSS. Dick Costolo and Fred Wilson commented on this approach and the potential problems. Brad Feld also posted on the subject a while back – and…