Predicting the solubility of chemical substances in water: What’s my reference?

August 8th 2019

DIFFER researchers Murat Sorkun, Abhishek Khetan and Süleyman Er have used advanced algorithms to compose a large and reliable database that contains the solubility information of chemicals in water. “Strange as it may seem with respect to the huge importance of employing water as a chemical solvent”, says Süleyman Er, “a reliable, sufficiently large and machine-readable dataset for the development of efficient data-driven models was not available. Until now!” The results are published in Nature Scientific Data.

AqSolDB - the aqueous solubility database

Water covers almost three quarters of the Earth’s surface. This nearly colorless and transparent chemical substance is vital for all known forms of life, although the water molecules themselves provide no calories or organic nutrients. Water also plays an important role in the world economy. Among its many uses, it is an excellent solvent for a wide variety of chemical substances used in chemical and industrial processes, including biochemistry, drug design, agrochemical design, geochemistry, catalysis, and energy storage.

A crucial property of a chemical substance is how well it dissolves in a given volume of water. Predicting the solubility of chemical substances in water is usually not a straightforward task. There are two main reasons: unsystematic errors between different experiments, and the limited predictive power of data-driven methods, which rely heavily on measured data during their training and validation steps. In other words, the development of reliable data-driven prediction models for the solubility of compounds in water has been hindered by uncertainties and disagreements in the underlying measurements, which come from disparate experimental sources and span a long period of time.

Process diagram for data curation and generation of AqSolDB.

To develop generalizable prediction models, we need accurate datasets that contain information on many diverse chemical substances. Here, a small team of DIFFER researchers has developed the largest freely available dataset on the solubility of compounds in water. DIFFER's aqueous solubility database (AqSolDB) consists of almost 10,000 compounds. Their solubility values have been chosen using advanced computer algorithms that select the statistically most reliable experimental values from different publicly available datasets.
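The idea of picking one reference value per compound from several overlapping datasets can be illustrated with a minimal Python sketch. This is not the procedure used to build AqSolDB: the selection rule (keep the measurement closest to the median of all reported values), the function name `select_reference_values`, the dataset labels, and the numbers are all hypothetical and serve only to show the general shape of such a curation step.

```python
from collections import defaultdict
from statistics import median


def select_reference_values(records):
    """Pick one solubility value per compound from several sources.

    `records` is an iterable of (compound_id, source, log_solubility).
    Assumed rule (illustrative only): keep the measurement closest to
    the median of all values reported for that compound.
    """
    grouped = defaultdict(list)
    for compound_id, source, value in records:
        grouped[compound_id].append((source, value))

    curated = {}
    for compound_id, measurements in grouped.items():
        values = [value for _, value in measurements]
        centre = median(values)
        # min() returns the first measurement with the smallest distance.
        source, value = min(measurements, key=lambda m: abs(m[1] - centre))
        curated[compound_id] = {
            "value": value,
            "source": source,
            "n_sources": len(measurements),
            # Range of reported values: a crude indicator of disagreement.
            "spread": max(values) - min(values),
        }
    return curated


# Made-up example data, not taken from any real dataset.
records = [
    ("compound_1", "dataset_A", -2.10),
    ("compound_1", "dataset_B", -2.30),
    ("compound_1", "dataset_C", -3.50),
    ("compound_2", "dataset_A", 0.50),
]

curated = select_reference_values(records)
for compound_id, entry in curated.items():
    print(compound_id, entry["value"], entry["source"], entry["n_sources"])
```

Keeping the number of sources and the spread next to each selected value lets users of such a table judge how well the underlying experiments agree.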

Additionally, AqSolDB contains complementary topological and physico-chemical descriptors of these compounds, such as molecular weight, molar refractivity, number of H acceptors and donors, number of rotatable bonds, and polar surface area. Abhishek Khetan: “We designed the database such that it is an easy-to-use and well-structured database of compounds and we expected it to serve a broad community as a reference database.” It serves as an important reference for benchmarking future experimental and physics-based modelling results. “Moreover”, adds Murat Sorkun, “the AqSolDB dataset will be useful as a machine-readable ancillary resource to improve the prediction capability of new machine learning approaches.”


Murat Cihan Sorkun, Abhishek Khetan and Süleyman Er, AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds, Nature Scientific Data (2019)


Murat Sorkun, Abhishek Khetan and Süleyman Er