A number of cloud services hosted by service providers have vanished recently, in some cases without giving me the opportunity to retrieve my data beforehand. Take the Mobile Internet Access Wiki that I started many years ago as an example: it was simply turned off without any notice. I think there is an old saying along the lines that one is allowed to make an error once but not twice. Following that mantra, I started thinking about which other service-provider-hosted cloud services I use and how to back up my data, just in case.
The most important one is Typepad, which has hosted my blog since 2005. They do a good job and I pay an annual fee for their services. But that does not necessarily mean I will have access to my data should something go wrong. Typepad offers an option to export all blog posts to a text file, and I've been making use of this feature from time to time already. There are also WordPress plugins available to import these entries into a self-managed WordPress installation. I haven't tried the latter part so far, but the exported text file is structured simply enough for me to believe that importing into WordPress can be done. The major catch, however, is that the export does not include pictures. And I have quite a lot of them. So what can be done?
At first I searched the net for solutions, and they range from asking Typepad for the images to Firefox plugins that download all images from a site. But none of them offered a straightforward way to retrieve the full content of my blog, including images, to create regular backups. So I recently had a bit of fun creating a Python script that scans the Typepad export file for URLs of images I have uploaded and ignores links to external images. Piped into a text file, that list can then be used with tools such as wget to automatically download all images. As the script could be useful for others out there as well, I've attached it to this post below. Feel free to use and expand it as you like, and please share it back with the community.
Over the years Typepad has changed the way uploaded images are embedded in blog posts and also the directory structure in which images are saved. I have found four different ways, ranging from simple HTML code to links to further HTML pages and JavaScript that generate a popup window with the image. In some cases the Python script just copies the URL straight out of the text file, while in others the URL of the popup HTML page is used to construct the filename of the image, which can then be converted into a URL to download the file. Yes, it required a bit of fiddling around to get this working. This has resulted in a number of "if/elif" decisions in the script with a number of string compare/copy/insert/delete functions. In the end the script gave me close to 900 URLs of images and their thumbnails that I have uploaded over the years.
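To give an idea of the simplest of these cases, here is a minimal sketch of how image URLs could be pulled out of the export text and filtered by blog domain. It is not the attached script: the function name `extract_image_urls`, the regular expression and the list of file extensions are my own assumptions for illustration, and it only handles URLs that appear directly in `href` or `src` attributes. The indirect popup variants would need additional branches.

```python
import re

# Assumption: image URLs appear as href="..." or src="..." attribute values.
ATTR_URL = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Assumption: uploaded images can be recognised by their file extension.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')


def extract_image_urls(export_text, domain):
    """Return image URLs hosted under `domain`, de-duplicated, in order of first appearance."""
    seen = set()
    urls = []
    for url in ATTR_URL.findall(export_text):
        if not url.startswith(domain):
            continue  # link to an external resource, skip it
        if not url.lower().endswith(IMAGE_EXTENSIONS):
            continue  # link on the blog itself, but not an image
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == '__main__':
    sample = (
        '<a href="http://myblogname.typepad.com/photos/pic.jpg">'
        '<img src="http://myblogname.typepad.com/photos/pic-thumb.jpg"></a>'
        '<img src="http://example.com/external.png">'
    )
    for url in extract_image_urls(sample, 'http://myblogname.typepad.com'):
        print(url)
```

Printing one URL per line keeps the output in the form that wget expects for its input file.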
And here's the full procedure for backing up your Typepad blog and images on a Linux machine. It should work similarly on a Windows box, but I leave it to someone else to describe how to install Python and get 'wget' working there:
- Log in to Typepad, go to "Settings – Import/Export" and click on "Export"
- This will start the download of a text file with all blog entries. Save with a filename of your choice, e.g. blog.txt
- Use the Python script as follows to get the image URLs out of the export file: './get_links.py blog.txt domainname > image-urls.txt'. The domainname parameter is the name under which the blog is available (e.g. http://myblogname.typepad.com). This information is required so the script can distinguish between links to uploaded images and links to external resources which are excluded from the result.
- Once done, check the URLs in 'image-urls.txt' and spot-check them against some of your blog posts to get a feeling for whether anything might be missing. The script gets all images from my blog, but that doesn't necessarily mean it works equally well on other blogs: there might be ways to upload and embed images that I have never used and that result in different HTML code in the Typepad export file, which the script would miss.
- Once you are happy with the content of 'image-urls.txt', use wget to retrieve the images: 'wget -i image-urls.txt'.
- Once retrieved, ensure that all downloaded files are actually image files and again perform some spot checks against blog entries.
- Save the images together with the exported text file for future use.
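The check that every downloaded file really is an image can be partly automated. The following is a minimal sketch, assuming the images are JPEG, PNG or GIF files and were all downloaded into one directory. The function names `image_type` and `find_non_images` are hypothetical helpers of mine, not part of the attached script. It looks at the first bytes of each file, so an HTML error page saved under a '.jpg' name is flagged as well.

```python
import os

# Leading "magic" bytes of the image formats assumed here.
SIGNATURES = {
    b'\xff\xd8\xff': 'jpeg',
    b'\x89PNG\r\n\x1a\n': 'png',
    b'GIF87a': 'gif',
    b'GIF89a': 'gif',
}


def image_type(path):
    """Return 'jpeg', 'png' or 'gif' based on the file header, or None if unknown."""
    with open(path, 'rb') as f:
        header = f.read(8)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def find_non_images(directory):
    """Return the sorted names of regular files in `directory` that do not look like images."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and image_type(path) is None:
            suspects.append(name)
    return suspects


if __name__ == '__main__':
    for name in find_non_images('.'):
        print('not an image:', name)
```

Note that files such as 'blog.txt' or 'image-urls.txt' will be reported too if they sit in the same directory, so it is easiest to run wget in a directory of its own.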
Should the day ever come when I need this backup, some further actions are necessary. Before importing the blog entries into another blog, the existing HTML and JavaScript code for embedded images in the Typepad export file needs to be changed. That's trickier than just replacing URLs, because in some cases the filenames of the thumbnails are different, and in other cases indirect links and JavaScript code have to be replaced with HTML code that directly embeds thumbnails and full images into posts. In other words, that's some more Python coding fun.
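For the easy part, plain URL replacement, a minimal sketch in Python 3 could look like the one below. It assumes that each image was saved under the last component of its URL path and will be served from a single new location. The function name `rewrite_image_urls` and the `new_base` parameter are hypothetical. It does not deal with renamed thumbnails, indirect links or the JavaScript popups, which is where the real work would be.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_image_urls(export_text, image_urls, new_base):
    """Replace each URL in `image_urls` with `new_base` plus the URL's filename."""
    base = new_base.rstrip('/')
    # Replace longer URLs first so that a URL which is a prefix of
    # another one does not clobber the longer match.
    for url in sorted(set(image_urls), key=len, reverse=True):
        filename = posixpath.basename(urlsplit(url).path)
        export_text = export_text.replace(url, base + '/' + filename)
    return export_text


if __name__ == '__main__':
    text = '<img src="http://myblogname.typepad.com/photos/2010/pic.jpg">'
    urls = ['http://myblogname.typepad.com/photos/2010/pic.jpg']
    print(rewrite_image_urls(text, urls, '/images'))
```

Two images with the same filename in different directories would collide under this scheme, so the list of URLs should be checked for duplicates in the filename part first.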