The archive of the UK central government’s online presence has been indexed and digitally archived to the cloud by Manchester-based tech firm MirrorWeb. The cloud-native archiving company devised a new portal to create a more accessible, searchable and user-friendly resource for the public.
Comprising all captured government web-published content, The National Archives’ gigantic 120TB web archive encompasses billions of web pages from 1996 to the present.
It took MirrorWeb just two weeks to copy the data from 72 hard drives at The National Archives onto the internal storage of AWS Snowball devices, which were then shipped to AWS so that more than two decades of government internet history could be transferred to the cloud.
The four-year contract was awarded to MirrorWeb, which was tasked with both moving the data to the cloud using Amazon Web Services (AWS) and indexing it. Indexing the data meant that MirrorWeb had to write a complete replacement for the UK Government Web Archive’s previous search functionality.
As a result, 1.4 billion documents were indexed and are now accessible and searchable by researchers, students and members of the public, who can view websites and social media content in their original form as well as search for content on specific topics.
John Sheridan, Digital Director at The National Archives, said, “We are preserving 1,000 years of British history and a big part of that is preserving the digital record of government today.
“MirrorWeb has brought some outstanding technical capabilities, in particular data migration, cloud computing, search, new ways of harvesting and crawling content and new ways of presenting content and making it available. I have been most impressed by MirrorWeb’s use of cloud computing technologies. For example, to index the entire 120TB collection they were able to spin up 1000 node plus cluster of computers to process the entirety of that collection, and in just a couple of days.”
To carry out the indexing, MirrorWeb built its own software, WarpPipe, which is designed to index very large numbers of small files and indexed all of The National Archives’ documents in just ten hours.
Philip Clegg, Chief Technical Officer at MirrorWeb, explained, “The files within The National Archives are relatively small but in terms of numbers the volume is huge. This posed a problem for the big data processing tools already on the market, which were quoting us a timeframe of six to eight weeks. This is why we built WarpPipe, enabling the documents to be indexed in ten hours.”
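To put the quoted figures in perspective, the short Python sketch below works out the implied indexing rate and average document size. It is illustrative arithmetic only: it assumes the 1.4 billion indexed documents account for the full 120TB, uses decimal units, and the function names are invented for this example rather than taken from WarpPipe.

```python
# Figures quoted in the article.
DOCUMENTS = 1_400_000_000   # documents indexed
ARCHIVE_TB = 120            # size of the web archive in terabytes
INDEX_HOURS = 10            # time WarpPipe took to index the collection


def docs_per_second(documents: int, hours: float) -> float:
    """Average number of documents indexed per second."""
    return documents / (hours * 3600)


def average_doc_kb(total_tb: float, documents: int) -> float:
    """Average document size in kilobytes (decimal units: 1 TB = 1e9 KB).

    Assumption: the indexed documents make up the whole archive.
    """
    return total_tb * 1e9 / documents


rate = docs_per_second(DOCUMENTS, INDEX_HOURS)
size_kb = average_doc_kb(ARCHIVE_TB, DOCUMENTS)

print(f"Average indexing rate: {rate:,.0f} documents per second")
print(f"Average document size: {size_kb:.1f} KB")
```

On these assumptions the rate comes to roughly 39,000 documents per second across the cluster, with an average document of under 100KB, which is consistent with Clegg's description of many relatively small files.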
The history of the UK central government’s online presence can now be searched by any user. The search functionality is provided by Elasticsearch, which was chosen because it improves on The National Archives’ previous search engine in terms of speed, flexibility and reliability. The index will eventually be updated monthly rather than quarterly, giving end users more up-to-date archive content.
MirrorWeb’s Clegg explained, “In under a second the public-facing website can bring up results from every UK government website which has been preserved and can be viewed just as it was for any chosen date.
“In this information age, it is vital that our digital history is preserved and this resource will help educate future generations to come.”
The 120TB of data was backed up in a data centre at The National Archives across 72 USB-3 hard drives. MirrorWeb transferred the data using AWS Snowball devices, which connect to the local network, copy and encrypt the data onto their internal hard drives, and can then be shipped to an AWS data centre for transfer into the cloud. MirrorWeb used two custom-built computers, which allowed it to move data from up to sixteen of the USB-3 hard drives at a time.
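As a rough illustration of what reading sixteen drives at a time means for 72 drives, the Python sketch below groups the drives into batches. It is a simplified model, not MirrorWeb's actual procedure: the drive numbering, the `plan_batches` helper and the assumption that data is spread evenly across the drives are all hypothetical.

```python
# Figures quoted in the article.
DRIVES = 72         # USB-3 hard drives holding the backup
MAX_PARALLEL = 16   # drives that could be read at the same time
ARCHIVE_TB = 120    # total archive size in terabytes


def plan_batches(drives: int, max_parallel: int) -> list[list[int]]:
    """Split drive numbers 1..drives into consecutive batches.

    Hypothetical helper: each batch holds at most max_parallel drives.
    """
    drive_ids = list(range(1, drives + 1))
    return [
        drive_ids[start:start + max_parallel]
        for start in range(0, drives, max_parallel)
    ]


batches = plan_batches(DRIVES, MAX_PARALLEL)

# Assumption: the archive is spread evenly across all drives.
tb_per_drive = ARCHIVE_TB / DRIVES

for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: drives {batch[0]}-{batch[-1]} "
          f"({len(batch)} drives, about {len(batch) * tb_per_drive:.1f} TB)")
```

Under this model the transfer needs five batches: four full batches of sixteen drives and a final batch of eight.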
Key facts:
- 120TB of website data – far bigger than the average consumer hard-drive size of between 500GB and 1TB.
- Every preserved central government website has been indexed to make it searchable.
- Social media archiving is carried out to preserve government digital communications across Twitter and YouTube.
- 1.4 billion documents indexed that can be searched, refined and accessed through a public-facing website.
The web archive can be viewed and searched here: http://www.nationalarchives.gov.uk/webarchive/
About The National Archives
The National Archives is one of the world’s most valuable resources for research. As the official archive and publisher for the UK government, and for England and Wales, they are the guardians of some of the UK's most iconic national documents, dating back over 1,000 years. Their role is to collect and secure the future of the government record, both digital and physical, to preserve it for generations to come, and to make it as accessible and available as possible. The National Archives brings together the skills and specialisms needed to conserve some of the oldest historic documents as well as leading digital archive practices to manage and preserve government information past, present and future.