Oct 27 – 30, 2024
Achat Hotel Karlsruhe City
Europe/Berlin timezone

Reviving the Perovskite Solar Cell Database: Creating a Living Database with Large Language Models

Oct 29, 2024, 5:50 PM
20m
Kurfürstensaal (Achat Hotel)

Talk: Machine learning (ML) applications using existing data repositories

Speaker

Sherjeel Shabih

Description

Structured data, in which the properties of materials, systems, or devices are tabulated systematically, is a foundation for the methodical optimization and design of novel materials and devices. One of the most widely known databases in materials science is the metal-halide perovskite solar cell database. Although this database has found widespread use, it is difficult to update and extend because it has been curated manually.
Such manual curation demands tremendous, careful labor and therefore does not scale.
Recent advances in large language models (LLMs) indicate that these models might replace at least part of this manual work. One difficulty, however, is that scientific use cases place high demands on robustness. In addition, the relevant information is often dispersed across an article and partly contained in figures and tables, so it requires more reasoning than mere text extraction.
Because the perovskite database already contains a large amount of extracted data, along with instructions for human extractors, we can leverage this information to bootstrap an automatic and robust extraction pipeline based on large language models. The existing resources additionally provide us with labeled information that we leverage for systematic optimization.
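As an illustration of how existing curated entries could serve as labels, the following minimal sketch scores automatically extracted records against manually curated ones, field by field. It is not the authors' evaluation code: the function name `field_accuracy`, the exact-match criterion, and the field names in the usage example are all assumptions made for illustration.

```python
from typing import Any


def field_accuracy(
    extracted: list[dict[str, Any]],
    curated: list[dict[str, Any]],
    fields: list[str],
) -> dict[str, float]:
    """Per-field fraction of records whose extracted value equals the curated one.

    Hypothetical helper: records are paired by position, and a missing field
    in the extracted record counts as a mismatch.
    """
    if len(extracted) != len(curated):
        raise ValueError("extracted and curated records must be paired one-to-one")
    scores: dict[str, float] = {}
    for name in fields:
        matches = sum(
            1
            for pred, gold in zip(extracted, curated)
            if name in pred and pred[name] == gold.get(name)
        )
        scores[name] = matches / len(curated) if curated else 0.0
    return scores


# Toy records with made-up field names and values (not taken from the database).
curated_rows = [
    {"architecture": "nip", "band_gap_eV": 1.60},
    {"architecture": "pin", "band_gap_eV": 1.55},
]
extracted_rows = [
    {"architecture": "nip", "band_gap_eV": 1.60},
    {"architecture": "pin", "band_gap_eV": 1.50},
]
scores = field_accuracy(extracted_rows, curated_rows, ["architecture", "band_gap_eV"])
```

A per-field score of this kind makes it possible to compare prompt or model variants on the same labeled subset. In practice, numeric fields would likely need a tolerance rather than exact equality.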
Here we present an end-to-end pipeline that robustly extracts data, with more context than in the current database, in a scalable way. To develop an autonomous system, we couple our extraction pipeline to a paper crawler. With this, we can identify new relevant papers, extract relevant information, and then commit structured data with confidence scores into a staging area of the NOMAD [1, 2] database.
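The crawl, extract, and stage loop described above can be sketched roughly as follows. This is a schematic under stated assumptions, not the presented system: every name here (`ExtractedRecord`, `StagingArea`, `run_pipeline`, and the `demo_*` stand-ins) is hypothetical, the extractor is a stub where an LLM call would go, and the staging area is an in-memory list rather than the NOMAD interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ExtractedRecord:
    paper_id: str
    data: dict
    confidence: float  # assumed to lie in [0, 1]


@dataclass
class StagingArea:
    """In-memory stand-in for a staging area; not the NOMAD API."""
    records: list = field(default_factory=list)

    def commit(self, record: ExtractedRecord) -> None:
        self.records.append(record)


def run_pipeline(
    crawl: Callable[[], Iterable[tuple[str, str]]],
    is_relevant: Callable[[str], bool],
    extract: Callable[[str], tuple[dict, float]],
    staging: StagingArea,
) -> int:
    """Crawl papers, keep the relevant ones, extract data, and stage it."""
    committed = 0
    for paper_id, text in crawl():
        if not is_relevant(text):
            continue
        data, confidence = extract(text)
        staging.commit(ExtractedRecord(paper_id, data, confidence))
        committed += 1
    return committed


# Hypothetical stand-ins so the sketch runs without network or model access.
def demo_crawl():
    yield "paper-1", "We report a perovskite solar cell."
    yield "paper-2", "We study silicon wafers."


def demo_is_relevant(text: str) -> bool:
    return "perovskite" in text.lower()


def demo_extract(text: str) -> tuple[dict, float]:
    # Placeholder for an LLM-based extractor; the confidence is a dummy value.
    return {"source_text": text}, 0.5


staging = StagingArea()
n_committed = run_pipeline(demo_crawl, demo_is_relevant, demo_extract, staging)
```

Passing the crawler, relevance filter, and extractor in as callables keeps each stage replaceable and testable on its own. Storing the confidence score on every record lets a reviewer decide which staged entries need a manual check.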
Our work provides a blueprint for the autonomous maintenance of datasets, which we believe is a key enabler for harnessing the collective knowledge of materials science that is currently in the dark.
