Domain-Aware Document Intelligence for Drug Discovery Through Extraction-Driven Retrieval and Agentic Search
Pharmaceutical research and development generate vast volumes of heterogeneous data, ranging from experimental protocols and screening results to decades of scientific knowledge stored in unstructured documents. As multi-institutional collaborations scale, the challenge of interpreting and retrieving relevant information becomes increasingly complex. Traditional document management systems treat these records as static files, while standard Retrieval-Augmented Generation (RAG) approaches lack the domain awareness required to interpret specialized scientific language, molecular diagrams, and quantitative bioactivity measurements.
We present Docu-Store, an open-source document intelligence platform that transforms passive scientific collections into active, queryable knowledge bases. The system is built on a four-layer architectural framework that treats extraction as the foundation for effective retrieval in domain-specific contexts.
Layer 1: Multimodal Extraction. The platform employs purpose-built machine learning libraries to recover information that standard text pipelines miss. structflo-cser detects chemical structure diagrams and their associated labels in PDF pages, links each diagram to its label, and converts the structures into machine-readable SMILES representations. In parallel, structflo-ner extracts domain-specific entities, including compounds, protein targets, genes, diseases, and bioactivity measurements.
Layer 2: Progressive Knowledge Enrichment. Extracted entities are propagated as structured metadata across all representations. The system combines asymmetric dense retrieval with domain-adapted sparse encoding, preserving both semantic meaning and the integrity of scientific identifiers that standard tokenization often fragments.
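To illustrate the identifier fragmentation mentioned above, here is a minimal sketch of a term extractor that keeps joined identifiers whole. It is not Docu-Store's domain-adapted sparse encoder; the regular expression, the function name `sparse_terms`, and the sample sentence (including the compound ID and the measurement) are assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumption: an identifier is a run of letters/digits, optionally joined
# by '-', '_' or '.' (e.g. "CPD-0042", "0.5"). Real corpora need more rules.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*")


def sparse_terms(text: str) -> Counter:
    """Return lowercase term counts, keeping joined identifiers as one term."""
    return Counter(tok.lower() for tok in TOKEN.findall(text))


# Made-up sentence, for illustration only.
terms = sparse_terms("CPD-0042 inhibits InhA (IC50 = 0.5 uM); CPD-0042 was retested.")
```

With this rule, `CPD-0042` is counted as a single term rather than being split into `CPD` and `0042`, so an exact-match query for the identifier can still hit the passage.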
Layer 3: Multi-Stage Hybrid Search. Dense neural embeddings and sparse term vectors are fused using Reciprocal Rank Fusion, followed by cross-encoder reranking. This enables the system to capture both semantic relevance and exact-match precision required for scientific queries, while resolving nuances such as quantitative thresholds, negation, and contextual dependencies.
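The fusion step can be sketched as follows. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the ranked lists in which it appears. This is a generic sketch, not Docu-Store's code: the function name, the passage IDs, and the constant k = 60 (a commonly used default) are assumptions, and the cross-encoder reranking stage is omitted.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical passage IDs returned by the dense and sparse retrievers.
dense_hits = ["p3", "p1", "p7"]
sparse_hits = ["p1", "p9", "p3"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because only ranks are used, the dense and sparse scores do not need to be on a comparable scale, which is one reason this fusion method suits hybrid search.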
Layer 4: Agentic Retrieval-Augmented Generation. On top of this retrieval layer, an LLM-driven agent iteratively analyzes queries, seeds retrieval using domain entities, refines its search strategy, and synthesizes grounded answers from multiple sources. This approach improves reliability for complex questions where provenance and evidence aggregation are essential.
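The control flow of such an agent can be sketched as a bounded loop. Everything below is a hypothetical stand-in: in a real system the entity extraction, sufficiency check, query refinement, and synthesis steps would be LLM or model calls, and `search` would query the hybrid retrieval layer. The corpus, entity list, and function names are made up for illustration.

```python
# Toy in-memory corpus standing in for the retrieval layer (made-up content).
CORPUS = [
    {"id": "doc1#p2", "text": "cpd-0042 inhibits inha"},
    {"id": "doc2#p5", "text": "toxicity observed in hepatocyte assay"},
    {"id": "doc3#p1", "text": "unrelated synthesis route"},
]
KNOWN_ENTITIES = {"cpd-0042", "inha"}  # stand-in for an NER model


def words(text):
    return text.lower().replace("?", "").split()


def extract_entities(question):
    return [w for w in words(question) if w in KNOWN_ENTITIES]


def search(query):
    # Return passages sharing at least one term with the query.
    terms = set(query.split())
    return [p for p in CORPUS if terms & set(p["text"].split())]


def is_sufficient(evidence):
    # Placeholder rule; an LLM would judge coverage of the question.
    return len(evidence) >= 2


def refine(question, query):
    # Placeholder: broaden the query with longer question words not yet used.
    extra = [w for w in words(question) if len(w) > 5 and w not in query.split()]
    return " ".join([query, *extra])


def synthesize(question, evidence):
    # Placeholder: return only provenance; an LLM would write the answer.
    return {"question": question, "sources": [p["id"] for p in evidence]}


def agentic_answer(question, max_rounds=3):
    # Seed retrieval with domain entities, then refine until evidence suffices.
    query = " ".join(extract_entities(question)) or question.lower()
    evidence = {}
    for _ in range(max_rounds):
        for passage in search(query):
            evidence.setdefault(passage["id"], passage)
        if is_sufficient(evidence):
            break
        query = refine(question, query)
    return synthesize(question, list(evidence.values()))


result = agentic_answer("What is known about CPD-0042 toxicity?")
```

Keeping evidence keyed by passage ID means each source is recorded once, and the returned `sources` list gives the provenance an answer would need to cite.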
Docu-Store is model-agnostic and supports both local and cloud-based LLM deployments.
Key Takeaways
1. A clear understanding of why generic AI and RAG systems fail in domain-specific contexts, and how domain-aware extraction enables meaningful retrieval in scientific document collections.
2. A reusable architectural pattern: an enrichment–retrieval feedback loop where structured extraction continuously improves search quality by combining semantic representations with domain-specific metadata.
3. Practical design principles for building reliable, high-stakes RAG systems, including hybrid retrieval (dense + sparse), multi-granularity indexing, reranking, and agentic query refinement with grounded evidence synthesis.
These insights, while demonstrated in drug discovery, are directly transferable to other document-intensive domains such as healthcare, legal, and financial systems.
Siddhant Rath is a researcher and Lead AI Engineer at Texas A&M AgriLife Research. He is the primary architect of DAIKON (an AI-driven drug discovery platform), CAGE-Fusion (an ML/GNN-based molecular property predictor), and Docu-Store. His research focuses on the intersection of deep learning, cheminformatics, and drug discovery informatics.
He has authored several peer-reviewed publications and contributes to the TB Drug Accelerator of the Gates Foundation. GitHub: https://github.com/sidxz LinkedIn: https://www.linkedin.com/in/sidxz
Saswati Panda is an AI Engineer and Researcher at Texas A&M AgriLife Research, focusing on biomedical and life sciences software engineering and research. She specializes in building AI/ML-powered platforms and computational tools that accelerate early drug discovery, transforming complex workflows into actionable insights. She is part of the core development team of the DAIKON platform for the TB Drug Accelerator (TBDA) consortium at the Gates Foundation, supporting global efforts in tuberculosis research. She is also an active contributor to open-source scientific software and has co-authored multiple publications in drug discovery and cheminformatics. LinkedIn: https://www.linkedin.com/in/saswati-panda/