A powerful AI-driven web crawling and data extraction toolkit.
Thothy is a modern, flexible toolkit for web crawling, data extraction, and web automation. It provides an easy-to-use API for scraping websites, with powerful features like:
- Headless browser automation via Chrome/Chromium
- Smart extraction for structured data
- Rate-limiting and polite crawling built-in
- Asynchronous support for high-performance operations
- Caching capabilities to reduce bandwidth usage
- Customizable agents for specialized tasks
    # Clone the repository
    git clone https://github.com/ai.thothy/thothy.git
    cd thothy

    # Install dependencies
    pip install -r requirements.txt

Here's a command to get you started:

    langgraph dev --config langgraph-develop.json

Thothy uses LangGraph for orchestrating agents. You can run LangGraph in both development and production modes:
Development mode uses the langgraph-develop.json configuration file with authentication enabled:
    langgraph dev --config langgraph-develop.json

Production mode uses the default langgraph.json configuration file (with authentication disabled):

    # Using the default config file
    langgraph dev

    # Or explicitly specifying the config file
    langgraph dev --config langgraph.json

The main difference between development and production configurations is the disable_studio_auth option:
- Development: Authentication is enabled (disable_studio_auth: false)
- Production: Authentication is disabled (disable_studio_auth: true)
Choose the appropriate configuration based on your security requirements and deployment environment.
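To confirm which mode a given configuration file selects, the disable_studio_auth value can be read directly from the JSON. The following is a minimal sketch, not part of Thothy: the helper names `find_setting` and `studio_auth_enabled` are hypothetical, and because the README does not show where the option sits inside the file, the sketch searches the whole JSON tree instead of assuming a location. The sample config strings, including the nested "auth" object, are illustrative assumptions.

```python
import json


def find_setting(node, key):
    """Return the first value stored under `key` anywhere in a parsed JSON tree.

    Returns None when the key does not occur.
    """
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_setting(child, key)
        if found is not None:
            return found
    return None


def studio_auth_enabled(config_text):
    """Report whether Studio authentication is enabled for a config.

    Returns True or False when disable_studio_auth is present,
    and None when the option is missing (no default is assumed).
    """
    disabled = find_setting(json.loads(config_text), "disable_studio_auth")
    if disabled is None:
        return None
    return not disabled


if __name__ == "__main__":
    # Hypothetical file contents; the real layout may differ.
    develop = '{"auth": {"disable_studio_auth": false}}'
    production = '{"auth": {"disable_studio_auth": true}}'
    print("develop auth enabled:", studio_auth_enabled(develop))
    print("production auth enabled:", studio_auth_enabled(production))
```

To check real files, pass the text of langgraph-develop.json or langgraph.json (for example via `pathlib.Path(...).read_text()`) to `studio_auth_enabled`.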
Auth File Not Covered by Dependencies
If you encounter this error:
    ValueError: Auth file '/workspace/thothy/agents/security/auth.py' not covered by dependencies. Add its parent directory to the 'dependencies' array in your config.

Make sure to include the security module in your dependencies:

    "dependencies": [
      "./agents/chat_agent/src/chat_graph",
      "./agents/research_agent/src/research_graph",
      "./agents/open_deep_research_agent/src/open_deep_research_graph",
      "./agents/security"
    ]

Any directory referenced in the configuration must be included in the dependencies array.
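The coverage rule can be checked before starting the server. This is a rough sketch of the idea only, not LangGraph's actual validation logic: the function `is_covered` is a hypothetical helper, and it assumes all paths are written relative to the directory that holds the config file.

```python
from pathlib import PurePosixPath


def is_covered(file_path, dependencies):
    """Return True if file_path lies inside one of the dependency directories.

    Both arguments are assumed to be relative to the config file's directory.
    """
    target = PurePosixPath(file_path)
    for dep in dependencies:
        try:
            # relative_to raises ValueError when target is not under dep
            target.relative_to(PurePosixPath(dep))
            return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    deps = [
        "./agents/chat_agent/src/chat_graph",
        "./agents/research_agent/src/research_graph",
    ]
    auth_file = "./agents/security/auth.py"
    print(is_covered(auth_file, deps))                          # not yet covered
    print(is_covered(auth_file, deps + ["./agents/security"]))  # covered
```

Running a check like this against every file path referenced in the config is one way to catch the error above early.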
Thothy comes with several example scripts that demonstrate its capabilities:
A script that crawls Amazon's clothing section and extracts product information including names, prices, ratings, reviews, and image URLs.
    # Navigate to the examples directory
    cd examples

    # Run the Amazon clothes crawler
    python amazon_clothes_crawler.py

See the examples directory for more detailed examples and documentation.
Comprehensive documentation is available in the docs directory.
    thothy/
    ├── agents/               # Agent implementations
    │   ├── chat_agent/       # Interactive chat agent
    │   └── research_agent/   # Research and data collection agent
    ├── docs/                 # Documentation
    ├── examples/             # Example scripts and use cases
    ├── frontend/             # Web interface components
    ├── outputs/              # Default output directory for crawled data
    ├── tests/                # Test suite
    └── tools/                # Utility tools and helpers

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Our repository uses develop as the default branch instead of main because the project is currently under active development and not yet ready for production use. This follows the Git Flow branching model where:
- develop: Contains the latest development changes
- main: Will be used for production-ready releases in the future
When contributing, please:
- Create your feature branches from develop
- Submit PRs targeting the develop branch
- Ensure your changes are up-to-date with develop before submitting
We use Git Flow for branch management. Here are the essential commands:
    # Initialize Git Flow in your repository
    git flow init

    # Start a new feature
    git flow feature start feature-name

    # Finish a feature (merges back to develop)
    git flow feature finish feature-name

    # Start a bugfix
    git flow bugfix start bugfix-name

    # Finish a bugfix
    git flow bugfix finish bugfix-name

    # Start a release
    git flow release start 1.0.0

    # Finish a release (merges to main and develop)
    git flow release finish 1.0.0

    # Start a hotfix
    git flow hotfix start hotfix-name

    # Finish a hotfix (merges to main and develop)
    git flow hotfix finish hotfix-name

- Title: Use a clear, descriptive title that explains the purpose of the PR
- Description: Provide a detailed description of your changes, including:
- What changes were made
- Why these changes are necessary
- Any related issues or PRs
- Code Quality:
- Ensure all tests pass
- Follow the project's coding style
- Keep PRs focused and small (ideally under 400 lines)
- Include tests for new features
- Review Process:
- Address all review comments
- Keep the PR up to date with the base branch
- Mark the PR as "Ready for Review" when complete
We follow a commit message format based on the Conventional Commits specification, with the issue number placed after the type:
    <type> #<issue_number> <description>

    [optional body]

    [optional footer(s)]

Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- style: Code style changes (formatting, etc.)
- refactor: Code refactoring
- perf: Performance improvements
- test: Adding or modifying tests
- chore: Maintenance tasks
Examples:
    feat #123 add OAuth2 login support
    fix #123 resolve memory leak in long-running sessions
    docs #123 update installation instructions

Issues will be automatically closed when:
- A PR is merged with the commit message containing fixes #123 or closes #123
- The PR description includes Fixes #123 or Closes #123
- The PR is linked to an issue and marked as "merged"
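The commit header format and the closing keywords described above can be checked locally before pushing. The sketch below is an illustration rather than project tooling: `parse_header` and `closing_issues` are hypothetical helper names, the header pattern is inferred from the format and examples in this README, and only the two keywords named here (fixes, closes) are recognized.

```python
import re

# Commit types listed in this README
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "perf", "test", "chore")

# Header shape: <type> #<issue_number> <description>
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r") #(?P<issue>\d+) (?P<description>\S.*)$"
)

# Closing keywords followed by an issue reference, matched case-insensitively
CLOSING_RE = re.compile(r"\b(?:fixes|closes)\s+#(\d+)", re.IGNORECASE)


def parse_header(line):
    """Split a commit header into (type, issue_number, description).

    Returns None when the line does not follow the expected format.
    """
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return match["type"], int(match["issue"]), match["description"]


def closing_issues(text):
    """Return the issue numbers referenced with a closing keyword, in order."""
    return [int(number) for number in CLOSING_RE.findall(text)]


if __name__ == "__main__":
    print(parse_header("feat #123 add OAuth2 login support"))
    print(parse_header("feature #123 add OAuth2 login support"))
    print(closing_issues("Fixes #123 and closes #45"))
```

A function like `parse_header` could be called from a commit-msg hook on the first line of the message; wiring that up is left out here because the repository's hook setup is not described.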
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Selenium - WebDriver automation
- Playwright - Browser automation
- Beautiful Soup - HTML parsing
- All contributors who have helped shape this project