High-performance HTML to Markdown converter with full GitHub Flavored Markdown support. Written in Rust, available for Node.js and as a native Rust crate.
- Fast - Written in Rust with O(n) algorithms, significantly faster than JavaScript alternatives
- Full GFM Support - Tables with alignment, strikethrough, autolinks, fenced code blocks
- Accurate - Handles malformed HTML gracefully via html5ever
- Configurable - Multiple heading styles, link styles, custom selectors
- Zero Dependencies - Single native binary, no JavaScript runtime overhead
- Cross-Platform - Pre-built binaries for Windows, macOS, and Linux (x64 & ARM64)
- TypeScript Ready - Full type definitions included
- Async Support - Non-blocking conversion for large documents
Install the Node.js package:

```bash
npm install @vakra-dev/supermarkdown
```

Add the Rust crate:

```bash
cargo add supermarkdown
```

Install the CLI binary via cargo:

```bash
cargo install supermarkdown-cli
```

The CLI allows you to convert HTML files from the command line or via stdin:

```bash
# Convert a file
supermarkdown page.html > page.md

# Pipe HTML from curl
curl -s https://example.com | supermarkdown

# Exclude navigation and ads
supermarkdown --exclude "nav,.ad,#sidebar" page.html

# Use setext-style headings and referenced links
supermarkdown --heading-style setext --link-style referenced page.html
```

| Option | Description |
|---|---|
| `-h, --help` | Print help message |
| `-v, --version` | Print version |
| `--heading-style <STYLE>` | `atx` (default) or `setext` |
| `--link-style <STYLE>` | `inline` (default) or `referenced` |
| `--code-fence <CHAR>` | `` ` `` (default) or `~` |
| `--bullet <CHAR>` | `-` (default), `*`, or `+` |
| `--exclude <SELECTORS>` | CSS selectors to exclude (comma-separated) |
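
When the CLI is driven from a script, the flags in the table map onto an argument list. The sketch below assembles one. `CliOptionsSketch` and `buildCliArgs` are hypothetical names used for illustration only and are not part of supermarkdown; only the flags documented above are used.

```typescript
// Hypothetical helper (not part of supermarkdown): builds an argv array
// for the CLI using only the flags listed in the options table.
interface CliOptionsSketch {
  headingStyle?: "atx" | "setext";
  linkStyle?: "inline" | "referenced";
  codeFence?: "`" | "~";
  bullet?: "-" | "*" | "+";
  exclude?: string[];
}

function buildCliArgs(file: string, opts: CliOptionsSketch = {}): string[] {
  const args: string[] = [];
  if (opts.headingStyle) args.push("--heading-style", opts.headingStyle);
  if (opts.linkStyle) args.push("--link-style", opts.linkStyle);
  if (opts.codeFence) args.push("--code-fence", opts.codeFence);
  if (opts.bullet) args.push("--bullet", opts.bullet);
  if (opts.exclude && opts.exclude.length > 0) {
    // The CLI takes a single comma-separated selector list.
    args.push("--exclude", opts.exclude.join(","));
  }
  // Flags first, input file last, as in the examples above.
  args.push(file);
  return args;
}
```

The resulting array could then be handed to Node's `child_process.execFile("supermarkdown", args, callback)`, assuming the binary is on your `PATH`.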

```javascript
import { convert } from "@vakra-dev/supermarkdown";

const html = `
  <h1>Hello World</h1>
  <p>This is a <strong>test</strong> with a <a href="https://example.com">link</a>.</p>
`;

const markdown = convert(html);
console.log(markdown);
// # Hello World
//
// This is a **test** with a [link](https://example.com).
```

When scraping websites, HTML often contains navigation, ads, and other non-content elements. Use selectors to extract only what you need:

```javascript
import { convert } from "@vakra-dev/supermarkdown";

// Raw HTML from a web scrape
const scrapedHtml = await fetchPage("https://example.com/article");

// Clean conversion - remove nav, ads, sidebars
const markdown = convert(scrapedHtml, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    ".sidebar",
    ".advertisement",
    ".cookie-banner",
    ".social-share",
    ".comments",
    "script",
    "style",
  ],
});
```

When feeding web content to LLMs, you want clean, focused text without HTML artifacts:

```javascript
import { convert } from "@vakra-dev/supermarkdown";

// Extract just the article content for RAG pipelines
const markdown = convert(html, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    "aside",
    ".related-posts",
    ".author-bio",
  ],
  includeSelectors: ["article", ".post-content", "main"],
});

// Now feed to your LLM
const response = await llm.chat({
  messages: [
    {
      role: "user",
      content: `Summarize this article:\n\n${markdown}`,
    },
  ],
});
```

Convert blog HTML while preserving code blocks and formatting:

```javascript
import { convert } from "@vakra-dev/supermarkdown";

const blogHtml = `
  <article>
    <h1>Getting Started with Rust</h1>
    <p>Rust is a systems programming language focused on safety.</p>
    <pre><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre>
    <p>The <code>println!</code> macro prints to stdout.</p>
  </article>
`;

const markdown = convert(blogHtml);
// Output:
// # Getting Started with Rust
//
// Rust is a systems programming language focused on safety.
//
// ```rust
// fn main() {
//     println!("Hello, world!");
// }
// ```
//
// The `println!` macro prints to stdout.
```

Handle tables, definition lists, and nested structures common in docs:

```javascript
import { convert } from "@vakra-dev/supermarkdown";

const docsHtml = `
  <h2>API Reference</h2>
  <table>
    <tr><th>Method</th><th>Description</th></tr>
    <tr><td><code>convert()</code></td><td>Sync conversion</td></tr>
    <tr><td><code>convertAsync()</code></td><td>Async conversion</td></tr>
  </table>
  <dl>
    <dt>headingStyle</dt>
    <dd>ATX (#) or Setext (underlines)</dd>
  </dl>
`;

const markdown = convert(docsHtml);
// Output:
// ## API Reference
//
// | Method | Description |
// | --- | --- |
// | `convert()` | Sync conversion |
// | `convertAsync()` | Async conversion |
//
// headingStyle
// : ATX (#) or Setext (underlines)
```

Process multiple documents efficiently with async conversion:

```javascript
import { convertAsync } from "@vakra-dev/supermarkdown";

const urls = [
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3",
];

// Fetch and convert in parallel
const markdownDocs = await Promise.all(
  urls.map(async (url) => {
    const html = await fetch(url).then((r) => r.text());
    return convertAsync(html, {
      excludeSelectors: ["nav", "footer"],
    });
  })
);
```

```javascript
import { convert } from "@vakra-dev/supermarkdown";

const markdown = convert("<h1>Title</h1><p>Paragraph</p>");
```

```javascript
import { convert } from "@vakra-dev/supermarkdown";

const markdown = convert(html, {
  headingStyle: "setext", // 'atx' (default) or 'setext'
  linkStyle: "referenced", // 'inline' (default) or 'referenced'
  excludeSelectors: ["nav", ".sidebar", "#ads"],
  includeSelectors: [".important"], // Override excludes for specific elements
});
```

For large documents, use `convertAsync` to avoid blocking the main thread:

```javascript
import { convertAsync } from "@vakra-dev/supermarkdown";

const markdown = await convertAsync(largeHtml);

// Process multiple documents in parallel
const results = await Promise.all([
  convertAsync(html1),
  convertAsync(html2),
  convertAsync(html3),
]);
```

`convert(html, options?)` converts HTML to Markdown synchronously.

Parameters:

- `html` (string) - The HTML string to convert
- `options` (object, optional) - Conversion options

Returns: `string` - The converted Markdown

`convertAsync(html, options?)` converts HTML to Markdown asynchronously.

Parameters:

- `html` (string) - The HTML string to convert
- `options` (object, optional) - Conversion options

Returns: `Promise<string>` - A promise that resolves to the converted Markdown

| Option | Type | Default | Description |
|---|---|---|---|
| `headingStyle` | `'atx' \| 'setext'` | `'atx'` | ATX uses `#` prefix, Setext uses underlines |
| `linkStyle` | `'inline' \| 'referenced'` | `'inline'` | Inline: `[text](url)`, Referenced: `[text][1]` |
| `codeFence` | `` '`' \| '~' `` | `` '`' `` | Character for fenced code blocks |
| `bulletMarker` | `'-' \| '*' \| '+'` | `'-'` | Character for unordered list items |
| `baseUrl` | `string` | `undefined` | Base URL for resolving relative links |
| `excludeSelectors` | `string[]` | `[]` | CSS selectors for elements to exclude |
| `includeSelectors` | `string[]` | `[]` | CSS selectors to force keep (overrides excludes) |
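
Read as a TypeScript shape, the option table looks roughly like the sketch below. `ConvertOptionsSketch` and `applyOptionDefaults` are hypothetical names introduced here for illustration; the type definitions bundled with the package remain the authority on the real names and types.

```typescript
// Hypothetical mirror of the documented options; not the package's own type.
interface ConvertOptionsSketch {
  headingStyle?: "atx" | "setext";
  linkStyle?: "inline" | "referenced";
  codeFence?: "`" | "~";
  bulletMarker?: "-" | "*" | "+";
  baseUrl?: string;
  excludeSelectors?: string[];
  includeSelectors?: string[];
}

// Fills in the defaults listed in the table for any option left unset.
function applyOptionDefaults(opts: ConvertOptionsSketch = {}) {
  return {
    headingStyle: opts.headingStyle ?? "atx",
    linkStyle: opts.linkStyle ?? "inline",
    codeFence: opts.codeFence ?? "`",
    bulletMarker: opts.bulletMarker ?? "-",
    baseUrl: opts.baseUrl, // stays undefined unless provided
    excludeSelectors: opts.excludeSelectors ?? [],
    includeSelectors: opts.includeSelectors ?? [],
  };
}
```

Spelling the defaults out this way makes it easy to see which settings a given call actually overrides.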

| HTML | Markdown |
|---|---|
| `<h1>` - `<h6>` | `#` headings or setext underlines |
| `<p>` | Paragraphs with blank lines |
| `<blockquote>` | `>` quoted blocks (supports nesting) |
| `<ul>`, `<ol>` | `-` or `1.` lists (supports `start` attribute) |
| `<pre><code>` | Fenced code blocks with language detection |
| `<table>` | GFM tables with alignment and captions |
| `<hr>` | `---` horizontal rules |
| `<dl>`, `<dt>`, `<dd>` | Definition lists |
| `<details>`, `<summary>` | Collapsible sections |
| `<figure>`, `<figcaption>` | Images with captions |

| HTML | Markdown |
|---|---|
| `<a>` | `[text](url)`, `[text][ref]`, or `<url>` (autolink) |
| `<img>` | `![alt](src "title")` |
| `<strong>`, `<b>` | `**bold**` |
| `<em>`, `<i>` | `*italic*` |
| `<code>` | `` `code` `` (handles nested backticks) |
| `<del>`, `<s>`, `<strike>` | `~~strikethrough~~` |
| `<sub>` | `<sub>subscript</sub>` |
| `<sup>` | `<sup>superscript</sup>` |
| `<br>` | Line breaks |

Elements without Markdown equivalents are preserved as HTML:

- `<kbd>` - Keyboard input
- `<mark>` - Highlighted text
- `<abbr>` - Abbreviations (preserves `title` attribute)
- `<samp>` - Sample output
- `<var>` - Variables

Extracts alignment from align attribute or text-align style:

```html
<table>
  <tr>
    <th align="left">Left</th>
    <th align="center">Center</th>
    <th align="right">Right</th>
  </tr>
</table>
```

Output:

```markdown
| Left | Center | Right |
| :--- | :----: | ----: |
```

Respects the `start` attribute on ordered lists:

```html
<ol start="5">
  <li>Fifth item</li>
  <li>Sixth item</li>
</ol>
```

Output:

```markdown
5. Fifth item
6. Sixth item
```

When a link's text matches its URL or email, autolink syntax is used:

```html
<a href="https://example.com">https://example.com</a>
<a href="mailto:test@example.com">test@example.com</a>
```

Output:

```markdown
<https://example.com>
<test@example.com>
```

Automatically detects language from class names:

- `language-*` (e.g., `language-rust`)
- `lang-*` (e.g., `lang-python`)
- `highlight-*` (e.g., `highlight-go`)
- `hljs-*` (highlight.js classes, excluding token classes like `hljs-keyword`)
- Bare language names (e.g., `javascript`, `python`) as fallback
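
The class-name rules listed above can be approximated in a few lines. This is a sketch, not supermarkdown's implementation: the function name `detectFenceLanguage`, the precedence between the rules, and the two small lookup lists (`KNOWN_LANGUAGES`, `HLJS_TOKEN_CLASSES`) are all assumptions made for illustration.

```typescript
// Assumed, deliberately tiny lists; a real detector would carry longer ones.
const KNOWN_LANGUAGES = ["javascript", "typescript", "python", "rust", "go"];
const HLJS_TOKEN_CLASSES = ["keyword", "string", "comment", "number", "title"];
const LANGUAGE_PREFIXES = ["language-", "lang-", "highlight-"];

function detectFenceLanguage(className: string): string | undefined {
  const classes = className.split(/\s+/).filter((c) => c.length > 0);

  // 1. Explicit prefixes: language-*, lang-*, highlight-*
  for (const cls of classes) {
    for (const prefix of LANGUAGE_PREFIXES) {
      if (cls.startsWith(prefix) && cls.length > prefix.length) {
        return cls.slice(prefix.length);
      }
    }
  }

  // 2. hljs-* classes, skipping token classes such as hljs-keyword
  for (const cls of classes) {
    if (cls.startsWith("hljs-")) {
      const rest = cls.slice("hljs-".length);
      if (rest.length > 0 && HLJS_TOKEN_CLASSES.indexOf(rest) === -1) {
        return rest;
      }
    }
  }

  // 3. Fallback: a bare class that is itself a known language name
  return classes.find((c) => KNOWN_LANGUAGES.indexOf(c) !== -1);
}
```

Under these assumptions, `detectFenceLanguage("hljs lang-python")` yields `"python"`, while a non-standard class such as `python-code` yields `undefined`.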

```html
<pre><code class="language-rust">fn main() {}</code></pre>
```

Output:

````markdown
```rust
fn main() {}
```
````

Code blocks containing backticks automatically use more backticks as delimiters.
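
One way to choose a safe delimiter is to scan the code for its longest run of backticks and make the fence longer than that. The sketch below is an assumption about how such a rule could look, not supermarkdown's actual logic. `pickBacktickFence` is a hypothetical name, and the rule (three backticks when the code has none, otherwise one more than the larger of three and the longest run) was picked because it agrees with the four-backtick example shown elsewhere in this README.

```typescript
// Hypothetical sketch: pick a backtick fence that cannot be closed
// by anything inside the code block itself.
function pickBacktickFence(code: string): string {
  let longestRun = 0;
  let currentRun = 0;
  for (let i = 0; i < code.length; i++) {
    currentRun = code[i] === "`" ? currentRun + 1 : 0;
    if (currentRun > longestRun) longestRun = currentRun;
  }
  // No backticks at all: the default three-character fence is enough.
  if (longestRun === 0) return "`".repeat(3);
  // Otherwise use a fence strictly longer than both the default and the longest run.
  return "`".repeat(Math.max(3, longestRun) + 1);
}
```

Because the fence is always longer than any backtick run in the content, no line inside the block can terminate it early.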
Line number gutters are automatically stripped from code blocks. Elements with these class patterns are skipped:

- `gutter`
- `line-number`
- `line-numbers`
- `lineno`
- `linenumber`

Spaces and parentheses in URLs are automatically percent-encoded:

```javascript
// <a href="https://example.com/path (1)">link</a>
// → [link](https://example.com/path%20%281%29)
```

Remove unwanted elements like navigation, ads, or sidebars:

```javascript
const markdown = convert(html, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    ".sidebar",
    ".advertisement",
    "#cookie-banner",
  ],
  includeSelectors: [".main-content"],
});
```

Some HTML features cannot be fully represented in Markdown:

| Feature | Behavior |
|---|---|
| Table colspan/rowspan | Content placed in first cell |
| Nested tables | Inner tables converted inline |
| Form elements | Skipped |
| iframe/video/audio | Skipped (no standard Markdown equivalent) |
| CSS styling | Ignored (except text-align for tables) |
| Empty elements | Removed from output |
supermarkdown handles many edge cases gracefully:
Invalid or malformed HTML is parsed via html5ever, which applies browser-like error recovery:

```javascript
// Missing closing tags, nested issues - all handled
const html = "<p>Unclosed paragraph<div>Mixed<p>nesting</div>";
const markdown = convert(html); // Produces sensible output
```

Nested lists maintain proper indentation:

```javascript
const html = `
<ul>
  <li>Level 1
    <ul>
      <li>Level 2
        <ul>
          <li>Level 3</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>`;

// Output:
// - Level 1
//   - Level 2
//     - Level 3
```

When code contains backticks, the fence automatically uses more backticks:

```javascript
const html = "<pre><code>Use `backticks` for code</code></pre>";
// Output uses 4 backticks as fence:
// ````
// Use `backticks` for code
// ````
```

Empty paragraphs, divs, and spans are stripped to avoid blank lines:

```javascript
const html = "<p></p><p>Real content</p><p> </p>";
const markdown = convert(html);
// Output: "Real content" (empty paragraphs removed)
```

Spaces, parentheses, and other special characters in URLs are percent-encoded:

```javascript
const html = '<a href="https://example.com/file (1).pdf">Download</a>';
// Output: [Download](https://example.com/file%20%281%29.pdf)
```

Tables missing `<thead>` use the first row as header:

```javascript
const html = `
<table>
  <tr><td>A</td><td>B</td></tr>
  <tr><td>1</td><td>2</td></tr>
</table>`;

// Output:
// | A | B |
// | --- | --- |
// | 1 | 2 |
```

List items with mixed block/inline content are handled:

```javascript
const html = `
<ul>
  <li>Simple item</li>
  <li>
    <p>Paragraph in list</p>
    <pre><code>code block</code></pre>
  </li>
</ul>`;
// Outputs proper markdown with preserved formatting
```

Problem: `convert()` returns empty string or very little content.

Causes & Solutions:

- Content is in excluded elements - Check if your content is inside `nav`, `header`, etc. that might match default patterns

  ```javascript
  // Try without selectors first
  const markdown = convert(html);
  ```

- JavaScript-rendered content - supermarkdown converts static HTML only. If the page uses client-side rendering, you need to render it first (e.g., with Puppeteer or Playwright)

- Content in iframes - iframe content is not extracted. Fetch iframe src separately if needed

Problem: Code blocks don't have language annotation.
Solution: supermarkdown looks for language-*, lang-*, or highlight-* class patterns. Ensure your HTML uses standard class naming:

```html
<!-- Detected -->
<pre><code class="language-python">...</code></pre>
<pre><code class="lang-js">...</code></pre>

<!-- Not detected -->
<pre><code class="python-code">...</code></pre>
```

Problem: Tables appear as plain text or are malformed.

Causes & Solutions:

- Missing table structure - Ensure proper `<table>`, `<tr>`, `<td>` structure
- Nested tables - GFM doesn't support nested tables; inner tables are flattened
- colspan/rowspan - These are not supported in GFM; content goes in first cell

Problem: Links don't appear or have wrong URLs.

Solutions:

- Relative URLs - Use the `baseUrl` option to resolve relative links:

  ```javascript
  convert(html, { baseUrl: "https://example.com" });
  ```

- Links in excluded elements - Navigation links are often in `<nav>`, which may be excluded
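
To see what resolving against a base URL means in practice, the sketch below uses the standard WHATWG `URL` constructor. It illustrates the general resolution rules only; how supermarkdown applies `baseUrl` internally is not documented here, and `resolveHrefSketch` is a hypothetical helper name.

```typescript
// Hypothetical helper: resolves an href against an optional base URL
// using standard URL resolution. Not supermarkdown's own code.
function resolveHrefSketch(href: string, baseUrl?: string): string {
  if (!baseUrl) return href; // nothing to resolve against
  try {
    return new URL(href, baseUrl).href;
  } catch {
    // Unparseable input: leave the original href untouched.
    return href;
  }
}
```

Under standard rules, `/docs/intro` resolved against `https://example.com` becomes `https://example.com/docs/intro`, and an href that is already absolute is returned unchanged.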
Problem: Conversion is slow for very large HTML files.

Solutions:

- Use async - `convertAsync()` won't block the event loop
- Pre-filter HTML - Remove obvious non-content before conversion
- Stream processing - For very large docs, consider splitting into sections
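
When many large documents are converted at once, it can also help to cap how many conversions are in flight. The helper below is a generic sketch under that assumption: `mapWithConcurrency` is a hypothetical name, not something supermarkdown ships, and it accepts any async worker function.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight, preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(runLane());
  await Promise.all(lanes);
  return results;
}
```

With the async API from this README, a call might look like `await mapWithConcurrency(htmlPages, 4, (html) => convertAsync(html))`, where `htmlPages` is your own array of HTML strings.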
Problem: Characters like `<`, `>`, `&` appear as entities.

Solution: This is usually correct behavior - these characters need escaping in markdown. If you're seeing `&amp;` where you expect `&`, the source HTML may have double-encoded entities.
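
If the source really is double-encoded (for example `&amp;amp;` or `&amp;lt;` in the raw HTML), one option is to undo a single level of encoding before conversion. The sketch below is a hypothetical pre-processing step, not a supermarkdown feature; `undoDoubleEncoding` is an illustrative name.

```typescript
// Hypothetical pre-processing step: collapses one level of double-encoding,
// e.g. "&amp;lt;" -> "&lt;". A lone "&amp;" (not followed by an entity body)
// is left untouched.
function undoDoubleEncoding(html: string): string {
  return html.replace(
    /&amp;((?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g,
    "&$1",
  );
}
```

Apply this only when you know the input is double-encoded: a page that deliberately displays entity source, such as an HTML tutorial, would be altered by it.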
Add to your Cargo.toml:

```toml
[dependencies]
supermarkdown = "0.0.2"
```

```rust
use supermarkdown::{convert, convert_with_options, Options, HeadingStyle};

// Basic conversion
let markdown = convert("<h1>Hello</h1>");

// With options
let options = Options::new()
    .heading_style(HeadingStyle::Setext)
    .exclude_selectors(vec!["nav".to_string()]);

let markdown = convert_with_options("<h1>Hello</h1>", &options);
```

supermarkdown is designed for high performance:

- Single-pass parsing - O(n) HTML traversal
- Pre-computed metadata - List indices and CSS selectors computed in one pass
- Zero-copy where possible - Minimal string allocations
- Native code - No JavaScript runtime overhead
Contributions are welcome! Please feel free to submit a Pull Request.

```bash
# Clone the repository
git clone https://github.com/vakra-dev/supermarkdown.git
cd supermarkdown

# Run tests
cargo test

# Build Node.js bindings
cd crates/supermarkdown-napi
npm install
npm run build
```

MIT License - see LICENSE for details.