What’s the best way to get a 120-terabyte web archive, sitting on around 70 hard drives in the west London suburb of Kew, into the cloud? It’s a pretty specific challenge, admittedly, but it’s one that a team at the UK’s National Archives had to tackle last summer.
Since 2003, the National Archives has been taking regular ‘snapshots’ of UK central government websites on a massive scale, and making them publicly available through the online UK Government Web Archive (UKGWA), although some archived websites go back as far as 1996. In the intervening years, more than 5,000 government websites have been archived as part of the department’s statutory duty to preserve public records. New sites are added to the collection all the time and the UKGWA attracts some 375,000 visits per month.
In fact, the actual process of web archiving has been outsourced to third-party suppliers – and a switch from one provider to another in 2016 represented a good point to move away from running the UKGWA from a physical data centre in Paris belonging to the previous provider, Internet Memory Research (IMR), to the Amazon Web Services (AWS) platform used by the new provider, Manchester-based MirrorWeb.
Under the previous IMR contract, regular transfers of data were made by courier between Paris and Kew, so that the National Archives had a full copy of the archive for long-term preservation purposes. It was the contents of that full copy, based on 72 individual 2TB hard drives, that needed to be sent to AWS, explains Tom Storrar, head of web continuity at the National Archives.
The transition [between suppliers] was quite daunting. I think it’s safe to say that there was potential for a lot to go wrong but, in the event, and with a lot of validation checks, it was a relatively straightforward process.
No fight with Snowball
In part, that’s down to the National Archives’ use of AWS’s Snowball device. Introduced by Amazon in 2015, the Snowball is a secure data storage appliance used to transfer large amounts of data into and out of the AWS cloud. By physically moving a Snowball loaded with data between a customer site and an AWS data centre, or vice versa, customers get around the problems associated with large-scale electronic data transfers: high network costs, long transfer times, security concerns and so on. All data transferred onto a Snowball is encrypted.
The National Archives set itself the goal of completely loading the data in two weeks. The team at MirrorWeb, meanwhile, calculated that with eight USB drives feeding Snowball in parallel, given the expected transfer speeds of these USB-3 drives, two Snowballs would be needed to get the job done in time.
Each Snowball came with a single 10Gb network connection, and MirrorWeb purchased and configured two PCs, customized for high-speed input/output. From there, it was just a case of feeding the system with new hard drives on a regular basis. The data was successfully transferred in the allotted fortnight and the two Snowballs were shipped off to Amazon.
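For readers who want to see the shape of that kind of planning sum, here is a minimal back-of-the-envelope sketch in Python. It is not MirrorWeb’s actual calculation, which the team did not publish. The archive size, the two-week window and the eight parallel drives come from the figures above; the per-drive transfer rate and the usable capacity per appliance are placeholder assumptions, and all function names are hypothetical.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def required_throughput_mb_s(total_tb, days, hours_per_day=24.0):
    """Sustained aggregate rate (MB/s) needed to move total_tb within the window."""
    seconds = days * hours_per_day * 3600
    return total_tb * TB / seconds / MB


def days_to_load(total_tb, parallel_drives, per_drive_mb_s, hours_per_day=24.0):
    """Days needed if parallel_drives each sustain per_drive_mb_s while loading."""
    aggregate_bytes_per_s = parallel_drives * per_drive_mb_s * MB
    seconds = total_tb * TB / aggregate_bytes_per_s
    return seconds / (hours_per_day * 3600)


def appliances_by_capacity(total_tb, usable_tb_per_appliance):
    """Minimum number of appliances needed on storage capacity alone."""
    return math.ceil(total_tb / usable_tb_per_appliance)


if __name__ == "__main__":
    ARCHIVE_TB = 120         # from the article
    WINDOW_DAYS = 14         # from the article
    PARALLEL_DRIVES = 8      # from the article
    PER_DRIVE_MB_S = 100     # ASSUMPTION: placeholder sustained rate per USB drive
    USABLE_TB_PER_UNIT = 80  # ASSUMPTION: placeholder capacity per appliance

    need = required_throughput_mb_s(ARCHIVE_TB, WINDOW_DAYS)
    days = days_to_load(ARCHIVE_TB, PARALLEL_DRIVES, PER_DRIVE_MB_S)
    units = appliances_by_capacity(ARCHIVE_TB, USABLE_TB_PER_UNIT)

    print(f"Rate needed for the window: {need:.1f} MB/s")
    print(f"Days at the assumed drive rate: {days:.1f}")
    print(f"Appliances needed by capacity: {units}")
```

The per-drive rate is the number that matters most and the one least knowable in advance: time spent swapping drives, running validation checks and copying large numbers of small files would all pull the real figure down, so it would need to be measured rather than assumed.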
Massive data, minimized costs
Within a matter of weeks, the UKGWA was being served from the cloud, with no interruption to the service at all, says Storrar. In fact, he adds, being in the cloud means it will be much easier to improve and expand the service.
For example, there’s the issue of search capabilities: these have been dramatically improved in the shift to the cloud, using the Amazon Elasticsearch service, says Storrar.
Then there’s the business of how many sites are indexed. Typically, around 100 government websites are captured every month and for each, a crawl order is sent to MirrorWeb. There, the team spins up multiple AWS S3 instances for storage and points a web crawler at the relevant domain to collect the information on the website. As Storrar explains:
So there’s around 100 of these crawls per month, but we could do 1,000 per month or even 10,000 per month, because you can scale up resources to run a lot more crawlers. If you don’t have to update your data centre or anything like that, it’s very easy to allocate significantly more crawlers to the collection effort.
I think the possibilities of cloud in this kind of archiving are almost infinite when you take into consideration tasks such as indexing, search and OCR [optical character recognition], which used to be very expensive to perform on in-house archives. We can also take advantage of ‘spot instances’, where you can buy cloud compute power for a limited time at 10% of the regular price, and so on.
If you can take away the massive costs that used to be attached to dealing with massive data volumes, all sorts of things become viable. We can start to do more interesting work and provide additional services to the user and, in that way, the UKGWA becomes an even more valuable resource.
Image credit - Image sourced via National Archives