Blog

Chemistry duel: man versus machine


Man versus machine: Who wins in the chemistry laboratory?

Artificial intelligence (AI) put to the test: a new study from Friedrich Schiller University Jena, recently published in Nature Chemistry, sheds light on the performance of modern language models in chemistry. Led by Dr. Kevin M. Jablonka, the researchers investigated how capable modern language models such as GPT-4 really are at chemistry. The result? In many cases, the machines are faster and more accurate than human experts, but they also have dangerous weaknesses.

In a press release, Dr. Kevin Jablonka, head of the Carl Zeiss Foundation Junior Research Group at Friedrich Schiller University Jena, explains: “The possibilities of artificial intelligence in chemistry are attracting increasing interest, so we wanted to find out how good these models really are.”

The setting: 2,700 questions, 19 chemists, 1 AI

The study focuses on the new ChemBench benchmark system developed by the Jena research team. It comprises over 2,700 tasks from almost all areas of chemistry: organic, inorganic, analytical, physical and technical. The questions range from school knowledge to university teaching material to complex structural analyses.

The research team compared 19 experienced chemists with modern AI models. The human participants were allowed to use tools, though not AI models; the language models, in turn, had to answer from their training alone. The result: in many cases, the best models delivered more correct answers than the best humans.
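The scoring behind such a comparison is simple in principle. The following minimal Python sketch uses invented toy data (not the actual ChemBench questions or results) to show how per-question answers from a model and a human expert reduce to the accuracy figures a benchmark reports:

```python
# Toy illustration of benchmark-style scoring.
# The questions and answers below are hypothetical, for demonstration only.
answer_key = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}

model_answers = {"q1": "B", "q2": "D", "q3": "C", "q4": "C"}
human_answers = {"q1": "B", "q2": "A", "q3": "A", "q4": "B"}

def accuracy(answers: dict, key: dict) -> float:
    """Fraction of questions answered exactly as in the answer key."""
    correct = sum(answers.get(q) == a for q, a in key.items())
    return correct / len(key)

print(f"model: {accuracy(model_answers, answer_key):.0%}")  # prints: model: 75%
print(f"human: {accuracy(human_answers, answer_key):.0%}")  # prints: human: 50%
```

The real study aggregates such per-question scores across more than 2,700 tasks and many topic areas, which is what allows field-by-field comparisons rather than a single headline number.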

“The models were therefore able to draw their knowledge exclusively from training with existing data,” explains Jablonka.

Between genius and error: Where AI is convincing — and where it isn't

The models performed impressively on many classic knowledge questions. On textbook assignments and regulatory questions in particular, they impressed with their speed and accuracy, often more so than human experts. In a test on chemical regulation, GPT-4 achieved a hit rate of 71%, while experienced chemists managed only 3%. In the future, AI models could therefore play an important role as assistance systems in safety assessment, for example when checking substances against regulatory requirements.

The models had particular difficulty predicting NMR spectra and isomer counts, and they gave self-assured but incorrect answers. With NMR spectra especially, the models delivered erroneous results with great conviction.

“A model that provides erroneous answers with a high level of conviction can lead to problems in sensitive areas of research,” warns Jablonka.
Profile photo of Dr. Kevin Jablonka. Source: University of Jena

The counting of isomers also reveals a typical weakness of the models: although they can read molecular formulas, they have difficulty identifying all conceivable structural variants. To determine the number of possible isomers correctly, they would have to work through chemical bonds and spatial arrangements, something that has so far been achieved primarily through experience and structural thinking. The combination of apparent confidence and a lack of structural understanding makes it clear why such tasks pose a particular challenge for AI.
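To see why isomer counting demands structural reasoning, consider the simplest case: constitutional isomers of acyclic alkanes (CnH2n+2) correspond to distinct unlabeled trees of n carbon atoms in which no atom has more than four bonds. The brute-force Python sketch below (my own illustration, not part of the study) enumerates those trees and deduplicates them with a canonical tree encoding, which is exactly the combinatorial bookkeeping a language model would have to carry out implicitly:

```python
# Count constitutional isomers of acyclic alkanes C_nH_{2n+2}.
# These correspond to unlabeled trees on n carbons with max degree 4.
# Strategy: grow every labeled tree shape, deduplicate via a canonical form.

def rooted_canon(adj, node, parent):
    """AHU-style canonical string of the subtree rooted at `node`."""
    subs = sorted(rooted_canon(adj, c, node) for c in adj[node] if c != parent)
    return "(" + "".join(subs) + ")"

def tree_canon(adj):
    """Canonical form of an unrooted tree: minimum over all choices of root."""
    return min(rooted_canon(adj, r, None) for r in adj)

def count_alkane_isomers(n):
    shapes = set()
    def grow(adj, k):
        if k == n:                    # tree complete: record its shape
            shapes.add(tree_canon(adj))
            return
        for v in list(adj):           # attach the next carbon to any atom
            if len(adj[v]) < 4:       # that still has a free valence
                adj[v].append(k)
                adj[k] = [v]
                grow(adj, k + 1)
                del adj[k]
                adj[v].pop()
    grow({0: []}, 1)
    return len(shapes)

for n in range(4, 8):
    print(f"C{n}H{2 * n + 2}: {count_alkane_isomers(n)} isomers")
# prints: C4H10: 2, C5H12: 3, C6H14: 5, C7H16: 9 isomers
```

Even this toy case needs explicit enumeration and valence checks; real questions with heteroatoms, rings and stereochemistry are combinatorially far harder, which is precisely where the models broke down.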

It is therefore no wonder that the models have so far performed barely better than a random number generator on tasks such as drug development or retrosynthetic analysis, where chemical intuition is crucial. This discrepancy points to a weakness of current evaluation approaches: AI's success on standardized questions may say more about the nature of the questions than about real understanding of chemistry. A model can reproduce many facts correctly, but genuine chemical thinking, which interprets structures, sees through mechanisms and develops creative synthesis routes, remains a challenge.

What ChemBench means for education and everyday laboratory work

A key conclusion of the study concerns teaching: if language models can solve exam questions faster and better than students, the education system must change. In the future, it will be less about memorization and more about critical thinking, evaluating uncertainty and creative chemical problem solving. That the models perform better does not necessarily mean that they 'think' chemically, but it does show that we need to rethink teaching and evaluation criteria.

At the same time, ChemBench shows how important it is to develop broader and deeper evaluation standards for AI. Model performance fluctuates significantly depending on the chemical field and the question at hand, and this has a direct impact on practical applicability. Previous tests have often focused on so-called “property prediction” tasks, i.e. the prediction of simple material properties such as melting point or solubility.


But such tasks fall short if AI models are to be more than a calculation aid in the future, working alongside experts and preparing real decisions. This also requires better interfaces through which humans and machines can communicate reliably, i.e. user-friendly systems such as LabV that present results in an understandable way and allow follow-up questions. The authors emphasize that benchmarks such as ChemBench are only a first step; what is needed are user-friendly systems in which AI not only provides answers but also makes uncertainties visible.

A glimpse into the future: What comes after ChemBench?

The study makes it clear that AI can solve certain tasks in chemistry faster and more reliably than humans, but its capacity for structural and intuitive analysis remains limited. The next step is therefore the development of intelligent agent systems that can handle not only text but also chemical formulas, molecular structures and experimental data, that is, the very different types of information that play a role in everyday laboratory work.

In the early phase of materials development, such systems could compare experimental parameters with literature data, suggest alternative synthesis routes or interact directly with laboratory automation systems. AI would then function not only as a store of knowledge but also as an active research partner, with the potential to initiate entirely new innovation processes.

“The real challenge will be to develop models that not only answer correctly, but also assess when they could be wrong,” the study says.
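One standard way to quantify whether a model "knows when it could be wrong" is calibration: comparing a model's stated confidence with its actual accuracy. The sketch below (a generic illustration with made-up numbers, not the study's methodology) computes the expected calibration error (ECE), a common summary of this gap:

```python
# Expected calibration error (ECE): bin predictions by stated confidence
# and compare each bin's average confidence with its actual accuracy.
# A well-calibrated model has ECE near 0; a confidently wrong one does not.

def expected_calibration_error(confidences, correct, n_bins=5):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; confidence 0.0 goes into the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece

# A model that is always 90% confident but right only half the time,
# like the confident-but-wrong NMR answers described above:
conf = [0.9] * 10
right = [True, False] * 5
print(expected_calibration_error(conf, right))  # ~0.4: badly miscalibrated
```

A number like this would let a lab user see at a glance that a model's self-assurance cannot be taken at face value, which is exactly the gap the quoted challenge describes.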

Man, machine — and material intelligence: How chemistry benefits from AI

The ChemBench study makes it clear that artificial intelligence can complement expertise, but it needs context, control and critical judgment. This is exactly where platforms such as LabV come in. As a material intelligence platform, LabV does not aim to replace people but supports decision-making processes through transparent data integration, comprehensible analyses and clear interfaces. A hybrid approach that combines the strengths of both sides, human intuition and machine efficiency, is the key. In the future, this approach will decide whether AI becomes a tool or a black box in the laboratory.

Conclusion: The future is hybrid

ChemBench shows how far AI has come in chemistry, and where its understanding stops. The study is a wake-up call: anyone who uses AI in the laboratory must understand it, control it and use it correctly. Then it can be an unbeatable partner. “Our research shows that AI can be an important complement to human expertise: not as a substitute, but as a valuable tool that supports the work,” summarizes Kevin Jablonka. “Our study thus lays the foundation for closer cooperation between AI and human expertise in chemistry.”

“Although today's systems are still a long way from thinking like a chemist, ChemBench can be a building block on the way there,” comments Nature Chemistry on the publication. The AI has passed the exam, but it is still a long way from earning a doctorate.