Kjdragan/docsingest

DocsIngest

Turn any document directory into a prompt-friendly text ingest for LLMs, with a focus on compliance and comprehensive context generation.

🚀 Features

  • Multi-Format Document Support

    • Ingests PDF, DOCX, Markdown, TXT files
    • Automatic encoding detection
    • Intelligent file type handling
    • NEW: Extended support for .xlsx, .xls, .pptx, .json, .csv, .xml
  • Compliance-Focused Ingestion

    • Pre-configured Compliance Officer prompt
    • Customizable AI agent context
    • Designed with compliance in mind
  • Smart File Processing

    • Skips system and configuration files
    • Handles temporary and hidden files
    • Supports complex directory structures
  • Metadata and Reporting

    • Generates comprehensive directory structure tree
    • Counts total files and tokens
    • Provides summary statistics
  • Semantic Compression (NEW)

    • Intelligently reduce document size while maintaining core meaning
    • Configurable compression levels
    • Preserves full original content
    • Optional compressed view for AI processing
  • Flexible Usage

    • Command-line interface
    • Importable as a Python package
    • Configurable output options
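
The directory tree and token count listed under Metadata and Reporting can be pictured with a short standalone sketch. This is not DocsIngest's implementation: `render_tree` and `estimate_tokens` are hypothetical names, and the four-characters-per-token ratio is a rough assumption, not the tool's tokenizer.

```python
from pathlib import Path


def render_tree(root: Path, prefix: str = "") -> list[str]:
    """Return one line per entry under root, directories listed first."""
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name.lower()))
    lines = []
    for index, entry in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(f"{prefix}{'└── ' if last else '├── '}{entry.name}")
        if entry.is_dir():
            # Children of the last entry get blank padding instead of a rail.
            lines.extend(render_tree(entry, prefix + ("    " if last else "│   ")))
    return lines


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming about four characters per token (rounded up)."""
    return (len(text) + 3) // 4
```

Calling `"\n".join(render_tree(Path("/path/to/documents")))` would give a printable tree; a real tokenizer would give more accurate counts than the character heuristic.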

📦 Installation

Using pip

pip install docsingest

From Source

# Clone the repository
git clone https://github.com/marc-shade/docsingest.git

# Navigate to the directory
cd docsingest

# Install the package
pip install -e .

💡 Command Line Usage

# Ingest documents with default Compliance Officer prompt
docsingest /path/to/documents

# Enable semantic compression
docsingest /path/to/documents --compress

# Custom compression level
docsingest /path/to/documents --compress --compression-level 0.7

# Custom AI agent prompt
docsingest /path/to/documents --agent "Financial Auditor" -o financial_report.md
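
To make the shape of these options concrete, here is a minimal `argparse` sketch that accepts the same flags as the commands above. It is an illustration only, not DocsIngest's actual parser: `build_parser` and the program name are hypothetical, and the defaults (no output file, no compression level) are assumptions. The "Compliance Officer" default is taken from the description of the default prompt.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="docsingest-sketch")
    parser.add_argument("directory", help="directory of documents to ingest")
    parser.add_argument("--compress", action="store_true",
                        help="enable semantic compression")
    # Default of None is an assumption; the real default is not documented here.
    parser.add_argument("--compression-level", type=float, default=None)
    parser.add_argument("--agent", default="Compliance Officer",
                        help="AI agent prompt to prepend")
    parser.add_argument("-o", dest="output", default=None,
                        help="output file path")
    return parser
```

Parsing `["/path/to/documents", "--compress", "--compression-level", "0.7"]` with this parser yields `compress=True` and `compression_level=0.7`.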

Python Package Usage

from docsingest import ingest

# Basic usage
summary, tree, content = ingest("/path/to/documents")

# Custom agent prompt
summary, tree, content = ingest(
    "/path/to/documents",
    agent_prompt="Specialized Compliance Analyst"
)
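
One way to use the three returned values is to combine them into a single report file. The sketch below assumes that `summary`, `tree`, and `content` are plain strings, which the usage above suggests but does not state. `write_report` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def write_report(summary: str, tree: str, content: str, output_path: str) -> Path:
    """Write the three ingest results to one file under simple headings."""
    sections = [
        ("Summary", summary),
        ("Directory Structure", tree),
        ("Content", content),
    ]
    # Strip only surrounding newlines so tree indentation is preserved.
    body = "\n\n".join(
        f"# {title}\n\n{text.strip(chr(10))}" for title, text in sections
    )
    path = Path(output_path)
    path.write_text(body + "\n", encoding="utf-8")
    return path
```

With the package installed, this could be called as `write_report(*ingest("/path/to/documents"), "report.md")`.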

🛠️ Supported File Types

  • PDF
  • Microsoft Word (.docx)
  • Microsoft Excel (.xlsx, .xls)
  • Microsoft PowerPoint (.pptx)
  • Markdown (.md)
  • Plain Text (.txt)
  • CSV
  • XML
  • JSON

🚫 Automatically Skipped Files

  • .DS_Store
  • Temporary Office files (~$)
  • Temporary files (.tmp)
  • Log files
  • Git-related files and directories
  • IDE configuration directories
  • Python cache and virtual environment files
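
The skip rules above can be expressed as a small path filter. The following is a sketch under stated assumptions, not the tool's actual logic: the specific directory names (`.idea`, `.vscode`, `.venv`, `venv`) and the `.log` and `.pyc` suffixes are guesses at what the categories cover, and `should_skip` is a hypothetical name.

```python
from pathlib import PurePath

# Assumed concrete names for the categories listed above.
SKIP_NAMES = {".DS_Store"}
SKIP_SUFFIXES = {".tmp", ".log", ".pyc"}
SKIP_DIRS = {".git", ".idea", ".vscode", "__pycache__", ".venv", "venv"}


def should_skip(path: str) -> bool:
    """Return True if the path matches one of the skip rules."""
    p = PurePath(path)
    # Anything located inside a skipped directory is skipped.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return True
    name = p.name
    return (
        name in SKIP_NAMES
        or name in SKIP_DIRS
        or name.startswith("~$")    # temporary Office lock files
        or name.startswith(".git")  # .gitignore, .gitattributes, ...
        or p.suffix.lower() in SKIP_SUFFIXES
    )
```

For example, `should_skip("docs/~$report.docx")` is true while `should_skip("docs/report.docx")` is false.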

Regulatory Compliance Framework

DocsIngest takes a multi-layered approach to regulatory compliance and document risk management.

🛡️ Comprehensive Compliance Features

Regulatory Compliance Overview

  • Multi-Jurisdiction Support: Designed to handle compliance requirements across various regulatory landscapes
  • Adaptive Compliance Scanning: Intelligent detection of sensitive information and potential regulatory risks
  • Configurable Compliance Profiles: Customizable settings for different industry standards and regulations

Risk Assessment Workflow

  1. Document Ingestion Analysis

    • Automatic classification of document types
    • Identification of sensitive and regulated content
    • Contextual risk scoring
  2. Compliance Risk Evaluation

    • Detect potential regulatory violations
    • Flag documents with high-risk content
    • Generate detailed compliance reports
  3. Proactive Monitoring

    • Continuous document scanning
    • Real-time alerts for compliance breaches
    • Audit trail generation
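
As an illustration of the risk scoring step in this workflow, the sketch below turns per-category finding counts into a capped score and a coarse level. The weights, the 0 to 100 scale, and the thresholds are all hypothetical choices made for the example; the README does not specify how DocsIngest scores risk.

```python
# Hypothetical weights per finding category (higher means more sensitive).
WEIGHTS = {"ssn": 40, "credit_card": 40, "email": 10, "phone": 10, "name": 5}


def risk_score(findings: dict[str, int]) -> int:
    """Weighted sum of finding counts, capped at 100."""
    raw = sum(WEIGHTS.get(kind, 0) * count for kind, count in findings.items())
    return min(raw, 100)


def risk_level(score: int) -> str:
    """Map a score to a coarse level using assumed thresholds."""
    if score >= 70:
        return "high"
    if score >= 30:
        return "medium"
    return "low"
```

A document with one Social Security Number and one email address would score 50 under these weights and be labeled "medium".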

🔒 Supported Compliance Domains

  • GDPR (General Data Protection Regulation)
  • HIPAA (Health Insurance Portability and Accountability Act)
  • CCPA (California Consumer Privacy Act)
  • SOX (Sarbanes-Oxley Act)
  • PCI DSS (Payment Card Industry Data Security Standard)
  • NIST Framework
  • ISO 27001 Information Security Management

🚨 Key Compliance Capabilities

  • Advanced PII Detection
    • Identify sensitive personal information
    • Support for multiple PII categories:
      • Names
      • Email addresses
      • Phone numbers
      • Social Security Numbers
      • Credit card numbers
  • Intelligent Redaction
    • Automatic masking of sensitive information
    • Configurable redaction levels
  • Comprehensive Compliance Reporting
    • Detailed risk assessment
    • Actionable compliance recommendations
  • Multi-Regulation Support
    • Compliance checks for GDPR, FERPA, COPPA
    • Proactive regulatory alignment
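
Pattern-based detection and masking of several of these PII categories can be sketched with regular expressions. This is a simplified illustration, not DocsIngest's detector: the patterns assume US-style phone numbers and Social Security Numbers, card numbers are filtered with a Luhn checksum to cut false positives, and names are left out because they generally need more than pattern matching. `find_pii`, `redact`, and `luhn_valid` are hypothetical names.

```python
import re

# Simplified, assumed patterns; real-world formats vary widely.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum over the digits in number."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> dict[str, list[str]]:
    """Return matches per category, omitting categories with no matches."""
    findings = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if kind == "credit_card":
            matches = [m for m in matches if luhn_valid(m)]
        if matches:
            findings[kind] = matches
    return findings


def redact(text: str) -> str:
    """Replace each detected value with a [REDACTED:KIND] placeholder."""
    for kind, pattern in PII_PATTERNS.items():
        def mask(match, kind=kind):
            if kind == "credit_card" and not luhn_valid(match.group()):
                return match.group()  # leave non-card digit runs untouched
            return f"[REDACTED:{kind.upper()}]"
        text = pattern.sub(mask, text)
    return text
```

Configurable redaction levels could be layered on top by choosing which categories are passed through `redact`.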

Compliance Verification Process

  1. Document Ingestion
  2. Automated PII Scanning
  3. Risk Assessment and Scoring
  4. Compliance Reporting
  5. Optional Redaction

Note: While DocsIngest provides powerful compliance tools, it is not a substitute for professional legal or compliance advice. Always consult with compliance experts for your specific regulatory requirements.

🔧 Development

# Clone the repository
git clone https://github.com/marc-shade/docsingest.git
cd docsingest

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest tests/

🤝 Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

📄 License

MIT License

🚀 Roadmap

  • Support more file types
  • Enhanced token estimation
  • Web interface
  • Cloud storage integration
  • Advanced AI prompt customization

About

AI-powered document ingestion tool with compliance features
