Cabinet Office applies algorithm for data disposal

The Cabinet Office has begun to use an algorithm to identify data that has no perceived value and could be deleted.

Its head of digital knowledge and information management, David Canning, said it is beginning to share the algorithm with other departments in an effort to deal with the growing “digital heap” of unorganised, unstructured data in government.

He was speaking at the Think Data for Government conference in London yesterday, where he said there is evidence of the algorithm beginning to produce results by halting the continued rise in the number of digital objects stored by the Cabinet Office each year.

Between 2018-19 and 2021-22 the number increased from 305 million to 634 million, but it was reduced slightly to 630 million in 2022-23 and a further reduction is expected for this year.
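To put those figures in proportion, the percentage changes can be worked out directly. The short sketch below uses only the counts quoted above; the helper name `pct_change` is illustrative and not part of any Cabinet Office tooling.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Digital objects held by the Cabinet Office, as quoted in the article.
counts = {"2018-19": 305e6, "2021-22": 634e6, "2022-23": 630e6}

growth = pct_change(counts["2018-19"], counts["2021-22"])
dip = pct_change(counts["2021-22"], counts["2022-23"])

print(f"2018-19 to 2021-22: {growth:+.1f}%")
print(f"2021-22 to 2022-23: {dip:+.2f}%")
```

On these numbers the store roughly doubled over three years and then fell by well under 1 per cent, which fits the description of halting the rise rather than sharply reversing it.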

The algorithm was developed in-house using a tool from software solutions provider Automated Intelligence and a small language model that tells it what to select.

The initiative has arisen from the growing need in government to sift through enormous quantities of data and decide which pieces should be retained and which could be deleted.

Millions of files

“We’re left with millions and millions of files, and we estimated it would take 59 years for someone to read every piece and make decisions about it,” Canning said. “We did some studies to figure out how people make the decisions, and turned the decision making process of a human being into an algorithm.

“We then automated that and found the algorithm outperformed the humans by a considerable margin. We’re now using it with all our datasets to keep control of our data systems.”

Canning said the development has made it possible to restore momentum to a Cabinet Office-led programme for dealing with the digital heap, which had been dropped due to other priorities during the Covid-19 pandemic.

“Departments have been left to do their own thing, but we’ve run with this and I’m beginning to share it around other departments and some are probably about to start on this journey. Anyone can use this algorithm as it is essentially a mathematical model to help you make decisions.”

He added: “This is where we want to be, with a knowledge base that is accessible and understandable and where we can be accountable.”

Disposal methodology

Earlier this year he published a blogpost on the issue of the digital heap, explaining that the data disposal methodology relies on a series of filtration stages to take out redundant, outdated and trivial (ROT) information.

It begins by classifying sections of data that may be ROT and removing them if confirmed, then removes all file formats in which the team does not believe valuable information is stored. Then comes an aggressive reduction through the application of the Lexicon model for coding language in elastic search to metadata, file names, file paths and content. Weighting is applied to this based on a matrix of file format and content to resolve any grey areas.
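The staged filtering described above can be sketched roughly as follows. This is a minimal illustration only, not the Cabinet Office's implementation: the lexicon terms, format weights, discarded formats and threshold are all invented for the example, and a simple substring match stands in for the search engine queries.

```python
from dataclasses import dataclass

# All values below are hypothetical and chosen purely for illustration.
DISCARDED_FORMATS = {".tmp", ".log", ".bak"}       # stage 2: formats assumed to hold no value
LEXICON = {"minister": 3.0, "contract": 2.0,       # stage 3: terms that signal value (+)
           "policy": 2.0, "lunch": -2.0}           #          or triviality (-)
FORMAT_WEIGHTS = {".docx": 1.0, ".pdf": 1.0, ".msg": 0.5}  # weighting by file format
DEFAULT_WEIGHT = 0.25
KEEP_THRESHOLD = 1.0


@dataclass
class FileRecord:
    path: str
    content: str
    flagged_rot: bool = False  # stage 1: area already confirmed as ROT


def extension(path: str) -> str:
    """Return the lower-cased file extension, or '' if the file name has none."""
    name = path.rsplit("/", 1)[-1]
    return "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""


def lexicon_score(record: FileRecord) -> float:
    """Sum lexicon weights found in path or content, scaled by the format weight."""
    text = f"{record.path} {record.content}".lower()
    raw = sum(weight for term, weight in LEXICON.items() if term in text)
    return raw * FORMAT_WEIGHTS.get(extension(record.path), DEFAULT_WEIGHT)


def should_retain(record: FileRecord) -> bool:
    """Apply the three filtration stages in order; retain only what survives all."""
    if record.flagged_rot:
        return False
    if extension(record.path) in DISCARDED_FORMATS:
        return False
    return lexicon_score(record) >= KEEP_THRESHOLD


records = [
    FileRecord("/shared/policy/contract_review.docx", "Draft contract for the minister"),
    FileRecord("/shared/social/lunch_menu.docx", "Friday lunch options"),
    FileRecord("/scratch/cache.tmp", "policy"),
]
retained = [r.path for r in records if should_retain(r)]
print(retained)
```

Ordering the cheap checks first (confirmed ROT areas, then file format) means the more expensive content scoring only runs on files that have survived the earlier stages.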

After the disposal, the records worth retaining are preserved in specialist software that provides rich metadata and search capability, and the system populates a digital catalogue.

“Having control over a known body of information provides the means to introduce better knowledge management, further reducing cost and inefficiency,” the blog said. “Our Lexicon forms the basic programming language required to develop true artificial intelligence solutions in future. 

“These initiatives will form core parts of our development work in the next few years.”
