How to customize your full-text RSS feeds in Plone, and discourage spam content harvesters at the same time

by Rudd-O published 2008/12/12 13:56:36 GMT+0, last modified 2013-06-26T03:24:20+00:00
Earlier on, we discussed how to enable full-text RSS feeds. Now we'll discuss how to improve on that by preventing your full-text feeds from being harvested and posted on blogspammers' sites. The nice thing about this trick is that you can use it to include any sort of text on your RSS feeds, while leaving your site content completely unaffected.

Full-text RSS feeds contain the full text of your articles.  But you might also want the feeds to carry some text that does not show in the articles on the site itself.  Using this trick, you can include practically anything you want in your RSS feeds; for the purposes of this article we're going to take on spammers as an example of that.

Suppose there is a spammer harvesting your RSS feeds to fill his own site with content.  You don't want to help him, but you don't want to remove full-text RSS feeds from your site either.

So how do we discourage spammers from harvesting our content?  We undermine their content-reproduction scams by tacking a link to the original article onto the text of each RSS item.  This works against the interests of the spammer: every time they republish content harvested from your site, they give your site a backlink, which can help your search engine rank.

And, oh, this hack is very simple.

Injecting a custom script into the RSS generation template

Remember how we had customized the rss_template?  Here is the snippet we added:

<content:encoded 
  xmlns:content="http://purl.org/rss/1.0/modules/content/" 
  tal:condition="obj_item/getText | nothing"
  tal:content="structure python: '&lt;![CDATA[' + obj_item.getText() + ']]&gt;' ">blah
</content:encoded>

Well, we're going to change that a bit now:

<content:encoded 
  xmlns:content="http://purl.org/rss/1.0/modules/content/" 
  tal:condition="obj_item/getText | nothing"
  tal:content="structure python: '&lt;![CDATA[' + context.rss_antispam(obj_item) + ']]&gt;' ">blah
</content:encoded>

Creating the antispam script that modifies the RSS content on the fly

You'll note that we've now included a call to context.rss_antispam().  This is a Script (Python) that you're going to add to the same folder where your customized rss_template lives.  The contents of the script are straightforward:

# Script (Python) "rss_antispam"; parameter list: object

# Body text: decode to unicode if Plone handed us a byte string.
text = object.getText()
try:
    unicode(text, "utf-8")  # raises TypeError if text is already unicode
    text = text.decode(object.getCharset())  # byte string: convert to unicode
except TypeError:
    pass

# Title: same treatment as the body text.
title = object.pretty_title_or_id()
try:
    unicode(title, "utf-8")  # raises TypeError if title is already unicode
    title = title.decode(object.getCharset())  # byte string: convert to unicode
except TypeError:
    pass

# Build the link back to the original article and prepend it to the text.
link = object.absolute_url()
pattern = u'<p><small>This article was culled from <a href="%s">%s</a></small></p>'
preface = pattern % (link, title)

return u"\n" + preface + u"\n" + text

Once you have added this script and set its ID to rss_antispam, add object to its parameter list. 

You'll note that the script converts your article's fields to Unicode in several places.  This has a rationale behind it: Plone sometimes returns Unicode objects and sometimes plain byte strings, and in Python 2, concatenating a Unicode object with a byte string that contains non-ASCII characters produces a UnicodeDecodeError.  All we do here is decode the byte strings into Unicode objects, using the object's own charset, before joining everything together.  This prevents the error.
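To see the idea in isolation, here is a minimal sketch that decodes byte strings to Unicode text before concatenating them. It is written in Python 3 so it runs outside Plone (there, byte strings are `bytes` and Unicode text is `str`); the helper name `to_text` and the default charset are assumptions for illustration, not part of Plone.

```python
def to_text(value, charset="utf-8"):
    """Return value as Unicode text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(charset)
    return value


# Mixed input: the title arrives as a byte string, the body as Unicode text.
title = to_text("Café".encode("utf-8"))
body = to_text("<p>Full text of the article</p>")

# Both values are Unicode text now, so joining them cannot fail on encoding.
combined = title + "\n" + body
```

Normalizing every field at the point where it enters the script means the rest of the code never has to care which kind of string it received.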

And that's it.

Now, when an RSS feed is accessed on your site, every object to be "feedified" will be run through the rss_antispam script, which will prepend a nice direct link, in a small typeface, to the text of your article.
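If you want to preview the resulting markup without a running Plone site, the following standalone sketch in Python 3 mimics what the script does to one item. The function name `prepend_source_link`, the sample title, and the sample URL are hypothetical; only the pattern string is taken from the script.

```python
PATTERN = '<p><small>This article was culled from <a href="%s">%s</a></small></p>'


def prepend_source_link(text, title, link):
    """Put a small link to the original article in front of the article text."""
    preface = PATTERN % (link, title)
    return "\n" + preface + "\n" + text


item = prepend_source_link(
    text="<p>Article body</p>",
    title="My article",                    # hypothetical title
    link="http://example.com/my-article",  # hypothetical URL
)
print(item)
```

The printed text is what would end up inside the CDATA section of the feed item, so this is a quick way to check the wording of the preface before touching the real script.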

You'll notice that you can add anything in your antispam script, not just the link to the original article.  Original ideas for you to add:

  1. advertising
  2. advertising
  3. a link to the comments anchor on your article
  4. advertising
  5. articles on your site related to the text in question
  6. links to other feeds or topic search results
  7. advertising
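
As an illustration of idea 3, here is a rough sketch of a helper that appends a link to the comments section after the article text. It is standalone Python 3 rather than a Script (Python); the function name `append_comments_link`, the link wording, and the `comments` anchor are all assumptions, so check which anchor your site's templates actually emit before adapting it.

```python
def append_comments_link(text, link, anchor="comments"):
    """Append a small link pointing at an anchor on the original article page."""
    footer = '<p><small><a href="%s#%s">Comment on this article</a></small></p>' % (
        link,
        anchor,  # assumed anchor name; adjust to match your templates
    )
    return text + "\n" + footer


result = append_comments_link("<p>Article body</p>", "http://example.com/my-article")
```

In the real script the same idea would amount to building one more string from object.absolute_url() and adding it to the returned text.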

Nice, huh?