Feed polling and caching
As SparkleMuffin periodically makes HTTP requests to update Atom and RSS feeds, we need to ensure:
- we do not put unnecessary load on the remote servers;
- we do not perform unnecessary database updates if the remote content has not changed.
To this end, we use HTTP conditional requests to benefit from remote server caching, and perform additional checks on the feed content.
HTTP Conditional Requests
When responding to an HTTP request, a remote server may set the following headers:
ETag
: the current entity tag for the selected representation (usually a hash of the feed data);
Last-Modified
: a timestamp indicating the date and time at which the origin server believes the selected representation was last modified.
When present, we store these values in the database, and use them to set the following headers in subsequent requests:
If-None-Match
: the value of the ETag header from the previous response;
If-Modified-Since
: the value of the Last-Modified header from the previous response.
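As a minimal sketch of how the stored validators could be replayed on the next request, the following uses Go's standard net/http package. The cachedValidators type, the newConditionalRequest helper, and the example URL and header values are hypothetical and are not taken from the SparkleMuffin code base.

```go
package main

import (
	"fmt"
	"net/http"
)

// cachedValidators holds the validators saved from the previous response.
// The type and its field names are hypothetical.
type cachedValidators struct {
	ETag         string
	LastModified string
}

// newConditionalRequest builds a GET request for feedURL and sets
// If-None-Match and If-Modified-Since only when a value was stored.
func newConditionalRequest(feedURL string, v cachedValidators) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, feedURL, nil)
	if err != nil {
		return nil, err
	}
	if v.ETag != "" {
		req.Header.Set("If-None-Match", v.ETag)
	}
	if v.LastModified != "" {
		req.Header.Set("If-Modified-Since", v.LastModified)
	}
	return req, nil
}

func main() {
	req, err := newConditionalRequest("https://example.org/feed.xml", cachedValidators{
		ETag:         `"abc123"`,
		LastModified: "Wed, 21 Oct 2015 07:28:00 GMT",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("If-None-Match"))
	fmt.Println(req.Header.Get("If-Modified-Since"))
}
```

Skipping empty values means the first request to a feed, or a request to a server that never sent validators, is a plain unconditional GET.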
Depending on whether the feed has changed since the last request, the remote server will then respond with:
200 OK
: the content has changed, so we update the feed and its entries;
304 Not Modified
: there are no changes, so we only update the feed's stored ETag and Last-Modified values.
Feed content hash
As a remote server may send a different ETag or Last-Modified value without the feed content being modified, or not send any of these headers at all, we:
- compute and store a hash of the feed data using the xxHash non-cryptographic hash function;
- compare the hash of the feed data with what we already have in the database;
- return early if the hashes match, to avoid unnecessary database updates.
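The steps above can be sketched as follows. To keep the example runnable with the standard library alone, it uses FNV-1a from hash/fnv as a stand-in for xxHash; the hashFunc type lets a function with the same shape, such as Sum64 from the cespare/xxhash library, be passed instead. The helper names and sample feed bodies are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashFunc maps feed bytes to a 64-bit digest.
type hashFunc func([]byte) uint64

// fnv1a is a standard-library stand-in for xxHash, used here only so the
// sketch has no third-party dependency.
func fnv1a(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// contentChanged hashes data with sum and compares the result with
// storedHash. It returns the new hash and whether it differs.
func contentChanged(data []byte, storedHash uint64, sum hashFunc) (uint64, bool) {
	newHash := sum(data)
	return newHash, newHash != storedHash
}

func main() {
	body := []byte("<feed><entry>one</entry></feed>")
	stored := fnv1a(body) // hash saved after the previous poll

	// Same bytes: the caller can return early and skip database updates.
	_, changed := contentChanged(body, stored, fnv1a)
	fmt.Println("same body changed:", changed)

	// Different bytes: the feed and its entries need updating.
	updated := []byte("<feed><entry>one</entry><entry>two</entry></feed>")
	_, changed = contentChanged(updated, stored, fnv1a)
	fmt.Println("new body changed:", changed)
}
```

A non-cryptographic hash is sufficient here because the check only detects accidental differences between two fetches of the same feed; it is not a defense against deliberately crafted collisions.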
Reference
Feed caching
- feed reader score project
- A sysadmin's rant about feed readers and crawlers
- Feeds, updates, 200s, 304s, and now 429s
- So many feed readers, so many bizarre behaviors
- The feed reader score service is now online
RFCs
- RFC 7232 - Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests - Validators - Last-Modified
- RFC 7232 - Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests - Validators - ETag
- RFC 9110 - HTTP Semantics
HTTP Conditional Requests
- HTTP Conditional Requests Explained
- Brent Simmons - NetNewsWire and Conditional GET Issues
- John Brayton - Feed Polling for Unread Cloud
- Jeff Kaufman - Looking at RSS User-Agents
- Chris Siebenmann - The case of the very old If-Modified-Since HTTP header
- ETag and HTTP caching
- Caching - What takes precedence: the ETag or Last-Modified HTTP header?
Non-cryptographic hash functions
- xxHash, an extremely fast non-cryptographic hash algorithm
- cespare/xxHash library for Go