A comprehensive Python package for extracting composition-property data from scientific articles for building databases
ComProScanner is a multi-agent framework designed to extract composition-property relationships from scientific articles in materials science. It automates the entire workflow from metadata collection to data extraction, evaluation, and visualization.
Key Features:
- Multi-publisher support (Elsevier, Springer, Wiley, IOP, local PDFs)
- Agentic extraction using the CrewAI framework
- RAG-powered context retrieval for accurate, cost-effective automation
- Comprehensive evaluation and visualization tools
- Customizable extraction workflows
- Knowledge graph generation
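The RAG-powered retrieval listed among the features keeps cost down by sending only the most relevant text chunks to the LLM instead of the whole article. The following is a minimal sketch of that idea, not ComProScanner's implementation: it assumes word-count cosine similarity as a stand-in for a real embedding model, and the name `retrieve_top_k` is hypothetical.

```python
import math
from collections import Counter


def _vectorize(text):
    # Stand-in for an embedding: lowercase word counts.
    return Counter(text.lower().split())


def _cosine(u, v):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query, chunks, k=5):
    """Return the k chunks most similar to the query (hypothetical helper)."""
    q = _vectorize(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vectorize(c)), reverse=True)
    return ranked[:k]
```

Only the returned chunks would be placed in the prompt, so a smaller `k` lowers token usage at the risk of missing relevant context.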
Install from PyPI:

    pip install comproscanner

Or install from source:

    git clone https://github.com/slimeslab/ComProScanner.git
    cd ComProScanner
    pip install -e .

Here's a complete example extracting piezoelectric coefficient data:

    from comproscanner import ComProScanner

    # Initialize scanner
    scanner = ComProScanner(main_property_keyword="piezoelectric")

    # Collect metadata
    scanner.collect_metadata(
        base_queries=["piezoelectric", "piezoelectricity"],
        extra_queries=["ceramics", "applications"]
    )

    # Process articles
    property_keywords = {
        "exact_keywords": ["d33"],
        "substring_keywords": [" d 33 "]
    }
    scanner.process_articles(
        property_keywords=property_keywords,
        source_list=["elsevier", "springer"]
    )

    # Extract composition-property data
    scanner.extract_composition_property_data(
        main_extraction_keyword="d33"
    )

The ComProScanner workflow consists of four main stages:
- Metadata Retrieval - Find relevant scientific articles
- Article Collection - Extract full-text from various publishers
- Information Extraction - Use LLM agents to extract structured data
- Post Processing & Dataset Creation - Evaluate, clean, and visualize results
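During article collection, the quick-start example passes both `exact_keywords` and `substring_keywords`. How ComProScanner matches them internally is not described here. The sketch below shows one plausible distinction, assuming exact keywords must appear as whole tokens while substring keywords are searched in the raw text, spaces included. The function `mentions_property` is hypothetical and not part of the package.

```python
import re


def mentions_property(text, exact_keywords, substring_keywords):
    """Return True if the text mentions any keyword (illustrative only)."""
    # Assumption: an exact keyword matches only a whole alphanumeric token,
    # so "d33" does not match inside "d331".
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    if any(keyword in tokens for keyword in exact_keywords):
        return True
    # Assumption: a substring keyword is searched verbatim in the raw text,
    # which is why " d 33 " keeps its surrounding spaces.
    return any(keyword in text for keyword in substring_keywords)
```

Under these assumptions, a padded substring such as `" d 33 "` catches text where a subscript was split by spaces during PDF extraction, a case the whole-token match would miss.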
Full documentation is available at slimeslab.github.io/ComProScanner
Supported sources:
- Elsevier (via TDM API)
- Springer Nature (via TDM API)
- Wiley (via TDM API)
- IOP Publishing (via SFTP bulk access)
- Local PDFs (any publication)
Extracted information:
- Composition-property relationships
- Material families
- Synthesis methods and precursors
- Characterization techniques
- Synthesis steps
Evaluation methods:
- Semantic Evaluation - Using semantic similarity measures
- Agentic Evaluation - LLM-powered contextual analysis
Visualization:
- Data Visualization
- Evaluation Visualization
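Semantic evaluation scores extracted items against a reference by similarity rather than exact string equality. The sketch below illustrates that scoring pattern only and is not ComProScanner's evaluator: it assumes `difflib`'s character-level ratio as a stand-in (a true semantic measure would typically compare embeddings), an arbitrary threshold of 0.8, and the hypothetical function name `score_extraction`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in measure: character-level similarity in [0, 1], not a semantic one.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_extraction(extracted, reference, threshold=0.8):
    """Precision, recall and F1 of extracted items against reference items."""
    # An extracted item counts as correct if it is close to any reference item.
    correct = sum(
        1 for item in extracted
        if any(similarity(item, ref) >= threshold for ref in reference)
    )
    # A reference item counts as found if any extracted item is close to it.
    found = sum(
        1 for ref in reference
        if any(similarity(item, ref) >= threshold for item in extracted)
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = found / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A fuzzy threshold lets "solid-state reaction" count as a match for "solid state reaction", which strict equality would reject.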
Process articles from multiple publishers:

    scanner.process_articles(
        property_keywords=property_keywords,
        source_list=["elsevier", "springer", "wiley"]
    )

Configure the RAG model used during extraction:

    scanner.extract_composition_property_data(
        main_extraction_keyword="d33",
        rag_chat_model="gemini-2.5-pro",
        rag_max_tokens=2048,
        rag_top_k=5
    )

Create visualizations:

    from comproscanner import data_visualizer, eval_visualizer

    # Create knowledge graph
    data_visualizer.create_knowledge_graph(result_file="results.json")

    # Plot evaluation metrics
    eval_visualizer.plot_multiple_radar_charts(
        result_sources=["model1.json", "model2.json"],
        model_names=["GPT-4o", "Claude-3.5"]
    )

Requirements:
- Python 3.12 or 3.13
- TDM API keys for desired publishers (Elsevier, Springer, Wiley)
- LLM API keys (OpenAI, Anthropic, Google, etc.)
- Optional: Neo4j for knowledge graph visualization
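These requirements can be checked before starting a long extraction run. The following is a minimal sketch and not part of ComProScanner: both helper names are hypothetical, and the environment-variable names the package actually reads are not given here, so any name passed to `missing_keys` is a placeholder to be replaced with the one from the documentation.

```python
import os
import sys

# Python versions named in the requirements list.
SUPPORTED_VERSIONS = {(3, 12), (3, 13)}


def python_is_supported(version=None):
    """Return True if the (major, minor) version is 3.12 or 3.13."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_VERSIONS


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

For example, `missing_keys(["EXAMPLE_PUBLISHER_API_KEY"])` returns a non-empty list when that placeholder variable is not set, which a script could report before any API calls are made.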
If you use ComProScanner in your research, please cite:

    @misc{roy2025comproscannermultiagentbasedframework,
      title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
      author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
      year={2025},
      eprint={2510.20362},
      archivePrefix={arXiv},
      primaryClass={physics.comp-ph},
      url={https://arxiv.org/abs/2510.20362},
    }

See the CHANGELOG for details on what has changed in each version.
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright © 2025 SLIMES Lab
Author: Aritra Roy
- Website: aritraroy.live
- Email: contact@aritraroy.live
- GitHub: @aritraroy24
- Twitter: @aritraroy24
Project Links:
- PyPI: pypi.org/project/comproscanner
- Documentation: slimeslab.github.io/ComProScanner
- Issues: github.com/slimeslab/ComProScanner/issues
Made with ❤️ by SLIMES Lab

