The Potential of AI in Chemistry: The Urgent Need for Accurate and Accessible Training Data
January 31, 2024
Despite the Promise, Chemists Struggle to Harness the Full Power of AI Due to Data Limitations
As artificial intelligence (AI) continues to make waves across various fields, chemistry faces a distinctive challenge: a lack of accurate and accessible training data is holding back AI’s potential in the discipline.
While figures like Geoffrey Hinton raise concerns about the broader societal impacts of AI, chemists are quietly expressing frustration over the technology’s slow progress in their field. The crux of the issue lies in the quality and quantity of data available to train machine-learning systems; both are crucial to the success of AI applications in chemistry.
AI systems rely heavily on neural networks, which demand large, reliable, and unbiased training datasets. To revolutionize the way researchers seek and synthesize new substances, generative-AI tools in chemistry require comprehensive and accessible training datasets, including experimental and simulated data, historical records, and even knowledge from unsuccessful experiments.
For example, AI tools specializing in retrosynthesis, a process that works backward from a desired chemical structure to determine the best starting materials and reaction steps, have gained attention. However, their adoption remains limited among chemists due to the lack of exhaustive and accessible training data.
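To make the idea of working backward concrete, here is a minimal sketch in Python. It is not how any real retrosynthesis tool works: the `TOY_REACTIONS` table and the `retrosynthesize` function are hypothetical names invented for illustration, and a hand-written lookup table stands in for the learned model that an AI tool would train on reaction data.

```python
# Toy retrosynthesis sketch. TOY_REACTIONS is a hand-written, illustrative
# lookup table (an assumption), not a trained model or a real database.
# Each entry maps a product to the precursors of one reaction step.
TOY_REACTIONS = {
    "aspirin": ["salicylic acid", "acetic anhydride"],
    "salicylic acid": ["phenol", "carbon dioxide"],
}


def retrosynthesize(target, reactions, route=None):
    """Work backward from `target`.

    Returns (starting_materials, route), where route lists each
    (product, precursors) step in the order it was discovered.
    """
    if route is None:
        route = []
    if target not in reactions:
        # No known reaction produces this compound, so treat it as a
        # starting material that can simply be obtained.
        return [target], route
    precursors = reactions[target]
    route.append((target, precursors))
    starting_materials = []
    for precursor in precursors:
        materials, _ = retrosynthesize(precursor, reactions, route)
        starting_materials.extend(materials)
    return starting_materials, route


materials, steps = retrosynthesize("aspirin", TOY_REACTIONS)
print("Starting materials:", materials)
for product, precursors in steps:
    print(product, "<=", " + ".join(precursors))
```

The sketch only finds a route when every step is already in the table, which mirrors the article's point: the coverage of the underlying reaction data sets the limit on what such a tool can propose.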
The concept of ‘inverse design,’ wherein AI starts with desired physical properties to identify substances, has shown mixed progress. While computational approaches to inverse design are in use, AI will outperform existing tools only if it has sufficient training data linking chemical structures to properties. Generalist generative-AI systems such as ChatGPT are data-hungry: applying this kind of approach to chemistry would require hundreds of thousands to millions of data points.
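The direction of inverse design, from property to substance, can be illustrated with a deliberately simple Python sketch. It is a plain lookup over a tiny table, not a generative model; `BOILING_POINTS_C` and `inverse_design` are hypothetical names, and the boiling points are approximate values included only for illustration.

```python
# Inverse-design sketch as a nearest-match lookup (an assumption made for
# illustration; real systems learn structure-property relationships).
# Approximate boiling points in degrees Celsius at atmospheric pressure.
BOILING_POINTS_C = {
    "water": 100.0,
    "ethanol": 78.4,
    "methanol": 64.7,
    "acetone": 56.0,
}


def inverse_design(target_bp_c, tolerance_c, table=BOILING_POINTS_C):
    """Return substances whose boiling point is within `tolerance_c`
    of `target_bp_c`, ordered from closest match to farthest."""
    matches = [
        (abs(bp - target_bp_c), name)
        for name, bp in table.items()
        if abs(bp - target_bp_c) <= tolerance_c
    ]
    return [name for _, name in sorted(matches)]


print(inverse_design(78.0, 5.0))
print(inverse_design(78.0, 15.0))
```

With four entries the search can only ever return one of four answers, which is why the volume of data linking structures to properties matters so much for this approach.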
The successful AlphaFold protein-structure-prediction tool shows what AI can do when given a sufficiently large, high-quality dataset. It was trained on the Protein Data Bank, which was established in 1971 and contains more than 200,000 experimentally determined protein structures.
To address the data gap in chemistry, researchers are exploring solutions such as algorithms that convert chemical names to structures, automated laboratory systems, and the use of both real and simulated data to train AI models. Calls for open data in scientific publications are also gaining prominence, underscoring that data accessibility is needed to propel AI applications forward.
As chemists strive for AI tools to outperform even the best human scientists, the critical importance of collecting, sharing, and standardizing data cannot be overstated. Without a concerted effort to bridge the data gap, AI in chemistry risks remaining a case of hype over hope.
Source: Nature 617, 438 (2023) doi: https://doi.org/10.1038/d41586-023-01612-x