Spider a Site With wget Using sitemap.xml

Published on Sun, 17 May 2009

On a number of sites at work we use a static file caching extension, which creates static files that are served until the cache is invalidated. One of the things that invalidates the cache is deploying a new release of the code. This means that many of the requests after a deploy have to be generated from scratch, often causing the full Rails stack to be started (via Passenger) each time. To get around this I came up with the following command, which uses wget to spider each of the URLs listed in the sitemap.xml. This ensures each of the major pages has been cached, so most requests will be cache hits.

wget --quiet http://www.example.com/sitemap.xml --output-document - | grep -Eo "http://www\.example\.com[^<]+" | wget --spider -i - --wait 1

That should all be executed on one line. The --wait 1 option adds a one-second pause between requests to spread them out a bit, but you can remove it if you like.
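
For repeated use, the one-liner could be wrapped in a small script. What follows is only a minimal sketch, not the command I actually run: the function names extract_sitemap_urls and warm_cache are made up for illustration, and it assumes each URL sits on a single line inside a <loc> element, as in a standard sitemap. Matching on <loc> instead of the host name avoids hard-coding the domain, and the extra sed expression turns the XML-escaped &amp; back into a plain & before the URL is handed to wget.

```shell
#!/bin/sh

# Read sitemap XML on stdin and print one URL per line.
# Assumes each <loc>...</loc> element is contained on a single line.
extract_sitemap_urls() {
    grep -o '<loc>[^<]*</loc>' \
        | sed -e 's/^<loc>//' -e 's/<\/loc>$//' -e 's/&amp;/\&/g'
}

# Fetch the sitemap at $1 and request every URL it lists without
# saving the responses, pausing one second between requests.
warm_cache() {
    wget --quiet --output-document=- "$1" \
        | extract_sitemap_urls \
        | wget --spider --quiet --wait=1 --input-file=-
}

# Only touch the network when a sitemap URL is given as an argument.
if [ "$#" -gt 0 ]; then
    warm_cache "$1"
fi
```

Saved under a name such as warm-cache.sh (hypothetical), it could be run after a deploy with: sh warm-cache.sh http://www.example.com/sitemap.xml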
