Patent database of 15 million chemical structures goes public

The internet’s wealth of free chemistry data just got significantly larger. Today, the European Bioinformatics Institute (EBI) has launched a website that allows anyone to search through 15 million chemical structures, extracted automatically by data-mining software from world patents.

The initiative makes public a 4-terabyte database that until now had been sold on a commercial basis by a software firm, SureChem, which is folding. SureChem has agreed to transfer its information over to the EBI — and to allow the institute to use its software to continue extracting data from patents.

“It is the first time a world patent chemistry collection has been made publicly available, marking a significant advance in open data for use in drug discovery,” says a statement from Digital Science — the company that owned SureChem, and which itself is owned by Macmillan Publishers, the parent company of Nature Publishing Group.

Under the agreement, Digital Science retains use of the SureChem software. SureChem itself is being wound up because Macmillan wants to focus on serving researchers, not commercial clients such as drug firms, says SureChem’s co-founder, Nicko Goncharoff.

“We are delighted to take on the stewardship of this resource,” says John Overington, head of computational chemical biology at the EBI in Hinxton, UK, which is part of the European Molecular Biology Laboratory. “Scientists are accustomed to doing literature searches, but the patent literature is often where the real gems lie, especially in translational science,” he adds. Published papers lag the patent literature by about two years, he points out.

Overington says that the EBI plans to interlock information on chemical compounds from different public resources. For example, a search on a compound such as Pfizer’s Viagra (sildenafil) will reveal its presence in patents (from SureChEMBL), as well as its interactions with potential protein drug targets (from databases such as the EBI’s ChEMBL, which catalogues experiments done on compounds).

Later, Overington hopes to apply SureChem software to extract structures mentioned in research papers, starting with open-access papers held in repositories such as Europe PubMed Central. But, he adds, reconstruction of chemical data from papers is harder, because structures are often not named or pictured explicitly, but only alluded to as variants on a common molecular skeleton.

Historically, chemists have not had a wealth of free online data, and have been used to paying to get information from private databases. SureChem released data on 10 million molecules into the public database PubChem last year, but access was restricted, because the links to patents could only be downloaded one molecule at a time. But the web’s resources of searchable public chemical data are fast expanding. “I think it’s a really exciting time for chemistry,” Overington says.

Richard Van Noorden