Official Google Reader Blog - News, Tips and Tricks from the Reader team

XML Errors in Feeds

12/23/2005 09:50:00 AM
Posted by Mihai Parparita, Software Engineer

Dealing with the millions of RSS and Atom feeds out there is hard work. We're not trying to make you feel sorry for the Reader team, but as anyone who has attempted to implement a feed parser knows, there are many subtle deviations from the spec that you have to handle if you want to have any hope of satisfying the needs of your users (who shouldn't have to care about such things).

The feed generating/parsing world has debated Postel's Law, as it applies to XML and feeds, several times. We are not here to weigh in on either side of the argument. Instead, we hope to provide some data so that such discussions can rest on more than philosophical grounds. Without further ado, here are the top XML errors that we have encountered when parsing all of the feeds that our users have added to Reader (and there are a lot of them):

% of errors  Error description
15.6%        Input claims to be UTF-8 but contains invalid characters
14.9%        Opening and ending tags mismatch
13.9%        An undefined entity is used (e.g. &nbsp; in an XML document without importing the HTML set)
7.8%         Document expected to begin with a start tag, but no < was found
5.7%         Disallowed control characters present
5.5%         Extra content at the end of the document
4.2%         Unterminated entity reference (missing semi-colon)
4.2%         Unquoted attribute value
3.8%         Premature end of data in tag (truncated feed)
3.3%         Naked ampersand (should be represented as &amp;)
2.1%         XML declaration allowed only at the start of the document
1.8%         Namespace prefix is used but not defined
0.75%        Comment not terminated
0.64%        Attribute without value
0.17%        Unescaped < not allowed in attribute values
0.11%        Malformed numerical entity reference
0.11%        Unsupported/invalid encoding
0.10%        Comment must not contain '--'
0.10%        Attribute defined more than once
0.07%        Char out of allowed range
0.03%        Comment not terminated
0.02%        Sequence ]]> not allowed in content

As a whole, about seven percent of all feeds that we know about have some of these errors (this data is based on a one-day snapshot, so transient errors may be present). Note that these are all XML errors, meaning that the feed is not well-formed. We are not talking about complying with and validating against the RSS or Atom specs - that is an even higher bar than we have set here. In general, our recommendation to feed producers is to use the work that the community has put into the feed validator.
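For feed producers who want a quick local check before reaching for the full validator, here is a minimal sketch of what "not well-formed" looks like to a strict XML parser. It uses Python's standard xml.etree.ElementTree module; the helper name well_formedness_error and the sample documents are made up for illustration, and this is not how Reader itself parses feeds.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(data: bytes):
    """Return None if data parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical, deliberately tiny documents that each reproduce one class of
# error from the list above (they are not real feeds).
SAMPLES = {
    "well-formed": b"<rss><channel><title>ok</title></channel></rss>",
    "mismatched tags": b"<rss><channel><title>oops</channel></title></rss>",
    "undefined entity": b"<rss><title>a&nbsp;b</title></rss>",
    "naked ampersand": b"<rss><title>Tips & Tricks</title></rss>",
    "unquoted attribute": b"<rss version=2.0></rss>",
    "extra content at end": b"<rss></rss><rss></rss>",
    "truncated feed": b"<rss><channel><title>cut off",
    "undefined namespace prefix": b"<rss><media:thumbnail url='x'/></rss>",
    "invalid UTF-8": b"<?xml version='1.0' encoding='UTF-8'?><rss>\xff</rss>",
}

for label, data in SAMPLES.items():
    print(f"{label}: {well_formedness_error(data) or 'no error'}")
```

A strict parser stops at the first problem it finds, so a feed with several errors will only report one of them per run.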

On a related note, we're aware that Reader has some issues with titles. It's great that there are test cases, and we will add this bug to our to-do list.

Why should text have all the fun?

12/15/2005 11:58:00 AM
Posted by Mihai Parparita, Software Engineer

We at the Reader team like to receive some visual stimulation with our reading, so we're subscribed to a bunch of photo feeds. It's great that RSS and Atom can deliver more than just text, but it gets boring to view everything in the exact same fashion.

We've therefore come up with what we call "photo templates," which is a special display mode we have for photo sites. When it's triggered, we try our best to expand thumbnails to full-size photos. Additionally, on the right side of the screen we display a list of clickable thumbnails of other photos from that feed, so that you can cherry-pick the best ones to view. Right now we support the feeds from a few sites; here's a list of them and a sample feed from each one:

This is great if you use one of these photo services, but what about other sites or self-hosted photo blogs? For now we've specifically whitelisted the above five sites for photo template support. This doesn't scale that well - there are thousands of sites and only a few overworked Reader engineers.

Our plan is to support the Media RSS extension to RSS and Atom (the thumbnail and content tags are most relevant to photo feeds). This way, if you include the right tags, Reader will be able to display your feed with the photo template without us having to do any work. The Media RSS spec is pretty thorough, and you can use Flickr's feeds as examples of usage.
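As a rough illustration of what "including the right tags" could look like on the producer side, here is a sketch that builds a single RSS item carrying Media RSS thumbnail and content elements and then reads them back. The namespace URI is the one defined by the Media RSS spec; the photo_item helper, the example.com URLs, and the choice of attributes are assumptions made for this example, and nothing here describes how Reader will actually consume the tags.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Media RSS spec; "media" is the conventional prefix.
MEDIA_NS = "http://search.yahoo.com/mrss/"
ET.register_namespace("media", MEDIA_NS)


def photo_item(title, page_url, photo_url, thumb_url):
    """Build an RSS <item> with media:content and media:thumbnail children."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = page_url
    # Full-size photo.
    ET.SubElement(
        item,
        f"{{{MEDIA_NS}}}content",
        {"url": photo_url, "type": "image/jpeg", "medium": "image"},
    )
    # Small preview image.
    ET.SubElement(item, f"{{{MEDIA_NS}}}thumbnail", {"url": thumb_url})
    return item


# Placeholder URLs, for illustration only.
item = photo_item(
    "Sunset",
    "http://example.com/photos/1",
    "http://example.com/photos/1/full.jpg",
    "http://example.com/photos/1/thumb.jpg",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)

# A consumer can look the tags up by their namespace-qualified names.
parsed = ET.fromstring(xml_text)
full_url = parsed.find(f"{{{MEDIA_NS}}}content").get("url")
thumb_url = parsed.find(f"{{{MEDIA_NS}}}thumbnail").get("url")
print(full_url, thumb_url)
```

Declaring the namespace matters: a media: prefix that is used but never defined is itself one of the XML errors listed in the post above.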