Convert HTML Table (with no rowspan/colspan) to JSON using Python
✅ Smart Header Detection - Automatically detects headers in <thead> or first row with <th> elements
✅ Flexible Output Format - Returns list of dictionaries (when headers exist) or list of lists
✅ Robust HTML Parsing - Handles complex nested HTML content within cells
✅ Error Handling - Comprehensive input validation and error reporting
✅ Type Safety - Full type hints for better development experience
✅ Multiple Utility Functions - Various conversion methods for different use cases
pip install beautifulsoup4from tabletojson import html_to_json # Convert HTML table to JSON html_content = "<table>...</table>" json_result = html_to_json(html_content, indent=2) print(json_result)from tabletojson import ( html_to_json, html_table_to_dict_list, html_table_to_list_of_lists, validate_html_table ) # Validate table first if validate_html_table(html_content): # Force dictionary output (with auto-generated headers if needed) dict_result = html_table_to_dict_list(html_content) # Force list of lists output (ignores headers) list_result = html_table_to_list_of_lists(html_content)Input:
| ID | Vendor | Product |
|---|---|---|
| 1 | Intel | Processor |
| 2 | AMD | GPU |
| 3 | Gigabyte | Mainboard |
[ { "product": "Processor", "vendor": "Intel", "id": "1" }, { "product": "GPU", "vendor": "AMD", "id": "2" }, { "product": "Mainboard", "vendor": "Gigabyte", "id": "3" } ]Input:
| 1 | Intel | Processor |
| 2 | AMD | GPU |
| 3 | Gigabyte | Mainboard |
Output:
[ [ "1", "Intel", "Processor" ], [ "2", "AMD", "GPU" ], [ "3", "Gigabyte", "Mainboard" ] ]Input:
Output:
[ { "date": "13 Jul 2022", "transaction description": "SOME WORKPLACESalary", "debit/cheque": "", "credit/deposit": "$3,509.30", "balance": "OD $1,725.53" }, { "date": "12 Jul 2022", "transaction description": "ATM DEPOSITCARD 1605", "debit/cheque": "", "credit/deposit": "$400.00", "balance": "OD $5,234.83" }, { "date": "11 Jul 2022", "transaction description": "Another TransactionAnother Transaction", "debit/cheque": "", "credit/deposit": "$104.00", "balance": "OD $5,634.83" }, { "date": "11 Jul 2022", "transaction description": "MB TRANSFERTO XX-XXXX-XXXXXXX-51", "debit/cheque": "$4.50", "credit/deposit": "", "balance": "OD $5,738.83" } ]- Fixed header detection logic that was incorrectly searching for
thelements - Fixed data structure handling for consistent output format
- Fixed handling of tables with
tbodyelements - Fixed text extraction from nested HTML elements (preserves spaces)
- Type Safety: Added comprehensive type hints for better IDE support
- Input Validation: Robust error handling with descriptive error messages
- Multiple Output Formats: New utility functions for different use cases
- Smart Header Detection: Detects headers in
<thead>or first row with<th>elements - Empty Row Handling: Automatically skips empty rows
- Flexible Cell Handling: Handles mismatched column counts gracefully
- Modular Design: Split functionality into focused helper functions
- Better Error Handling: Comprehensive exception handling with context
- Documentation: Added detailed docstrings and usage examples
- Test Coverage: Enhanced test cases covering edge cases and error scenarios
html_to_json()- Main conversion function (improved)html_table_to_dict_list()- Force dictionary output with auto-generated headershtml_table_to_list_of_lists()- Force list of lists outputvalidate_html_table()- Validate HTML table structure
- Support for nested table
-
Support for buggy HTML table (ie.✅ Fixedtdinstead ofthinthead) -
Improve error handling✅ Added comprehensive error handling -
Add type hints✅ Added full type safety -
Better text extraction from nested elements✅ Improved HTML parsing - Support for rowspan/colspan attributes
- Add support for custom header mappings
- Add CSV export functionality