Skip to content

mdminhazulhaque/html-table-to-json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

html-table-to-json

Convert HTML Table (with no rowspan/colspan) to JSON using Python

Features

Smart Header Detection - Automatically detects headers in <thead> or first row with <th> elements
Flexible Output Format - Returns list of dictionaries (when headers exist) or list of lists
Robust HTML Parsing - Handles complex nested HTML content within cells
Error Handling - Comprehensive input validation and error reporting
Type Safety - Full type hints for better development experience
Multiple Utility Functions - Various conversion methods for different use cases

Installation

pip install beautifulsoup4

Usage

Basic Usage

from tabletojson import html_to_json # Convert HTML table to JSON html_content = "<table>...</table>" json_result = html_to_json(html_content, indent=2) print(json_result)

Advanced Usage

from tabletojson import ( html_to_json, html_table_to_dict_list, html_table_to_list_of_lists, validate_html_table ) # Validate table first if validate_html_table(html_content): # Force dictionary output (with auto-generated headers if needed) dict_result = html_table_to_dict_list(html_content) # Force list of lists output (ignores headers) list_result = html_table_to_list_of_lists(html_content)

Examples

Example 1: Table with Headers

Input:

ID Vendor Product
1 Intel Processor
2 AMD GPU
3 Gigabyte Mainboard

Output

[ { "product": "Processor", "vendor": "Intel", "id": "1" }, { "product": "GPU", "vendor": "AMD", "id": "2" }, { "product": "Mainboard", "vendor": "Gigabyte", "id": "3" } ]

Example 2: Table without Headers

Input:

1 Intel Processor
2 AMD GPU
3 Gigabyte Mainboard

Output:

[ [ "1", "Intel", "Processor" ], [ "2", "AMD", "GPU" ], [ "3", "Gigabyte", "Mainboard" ] ]

Example 3: Complex Table with tbody and Nested Content

Input:

Date Transaction Description Debit/Cheque Credit/Deposit Balance
13 Jul 2022 SOME WORKPLACE
Salary
$3,509.30 OD $1,725.53
12 Jul 2022 ATM DEPOSIT
CARD 1605
$400.00 OD $5,234.83
11 Jul 2022 Another Transaction
Another Transaction
$104.00 OD $5,634.83
11 Jul 2022 MB TRANSFER
TO XX-XXXX-XXXXXXX-51
$4.50 OD $5,738.83

Output:

[ { "date": "13 Jul 2022", "transaction description": "SOME WORKPLACESalary", "debit/cheque": "", "credit/deposit": "$3,509.30", "balance": "OD $1,725.53" }, { "date": "12 Jul 2022", "transaction description": "ATM DEPOSITCARD 1605", "debit/cheque": "", "credit/deposit": "$400.00", "balance": "OD $5,234.83" }, { "date": "11 Jul 2022", "transaction description": "Another TransactionAnother Transaction", "debit/cheque": "", "credit/deposit": "$104.00", "balance": "OD $5,634.83" }, { "date": "11 Jul 2022", "transaction description": "MB TRANSFERTO XX-XXXX-XXXXXXX-51", "debit/cheque": "$4.50", "credit/deposit": "", "balance": "OD $5,738.83" } ]

Improvements Made

🐛 Bug Fixes

  • Fixed header detection logic that was incorrectly searching for th elements
  • Fixed data structure handling for consistent output format
  • Fixed handling of tables with tbody elements
  • Fixed text extraction from nested HTML elements (preserves spaces)

New Features

  • Type Safety: Added comprehensive type hints for better IDE support
  • Input Validation: Robust error handling with descriptive error messages
  • Multiple Output Formats: New utility functions for different use cases
  • Smart Header Detection: Detects headers in <thead> or first row with <th> elements
  • Empty Row Handling: Automatically skips empty rows
  • Flexible Cell Handling: Handles mismatched column counts gracefully

🚀 Performance & Code Quality

  • Modular Design: Split functionality into focused helper functions
  • Better Error Handling: Comprehensive exception handling with context
  • Documentation: Added detailed docstrings and usage examples
  • Test Coverage: Enhanced test cases covering edge cases and error scenarios

📋 New API Functions

  • html_to_json() - Main conversion function (improved)
  • html_table_to_dict_list() - Force dictionary output with auto-generated headers
  • html_table_to_list_of_lists() - Force list of lists output
  • validate_html_table() - Validate HTML table structure

TODO

  • Support for nested table
  • Support for buggy HTML table (ie. td instead of th in thead)Fixed
  • Improve error handlingAdded comprehensive error handling
  • Add type hintsAdded full type safety
  • Better text extraction from nested elementsImproved HTML parsing
  • Support for rowspan/colspan attributes
  • Add support for custom header mappings
  • Add CSV export functionality

About

🔢 Convert HTML Table to JSON using BeautifulSoup

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages