Releases: kepano/defuddle

0.19.1

kepano released this 24 Jun 14:11
  • Sanitize extractor HTML #326
  • Update deps

0.19.0

kepano released this 16 Jun 16:12
  • Bilibili: video subtitle/transcript extraction (#271)
  • ChatGPT: preserve citation footnotes (#311), fix content after Thought sections (#302)
  • CLI: --frontmatter flag for YAML metadata output (#313), --user-agent flag for fetched URLs, read HTML from stdin (#192)
  • Math: reconstruct MathML from MathJax CHTML render tree (#250), fix Markdown conversion for complex and aligned math (#301), flatten tagged single-equation tables so they render horizontally
  • Tables: preserve trailing columns in ragged tables (#299)
  • X/Twitter: prefer async extraction in Node for X articles (#318), fix non-English aria-label reply extraction (#290), fix FxTwitter facet indices for surrogate-pair emoji (#281)
  • YouTube: fetch videoData from a more reliable source (#211)
  • Fix: don't mutate the live document in parse()
  • Fix: don't strip sections with delimiter-less anchor ids (#303)
  • Fix: author/date byline heuristic no longer matches day-of-week as a date (#291)
  • Markdown: encapsulate links with spaces (#278)
  • Remove subhead (#316); remove <style> from SVGs for security
  • Add CI; dependency upgrades (esbuild, mathml-to-latex, temml)
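
The "encapsulate links with spaces" item (#278) can be illustrated with a minimal sketch. It assumes the fix wraps the link destination in angle brackets, which CommonMark permits for destinations containing spaces. `formatMarkdownLink` is a hypothetical helper written for this note, not defuddle's actual code.

```typescript
// Hypothetical helper (not part of defuddle): formats a Markdown link and
// wraps the destination in <...> when it contains a space or tab.
function formatMarkdownLink(text: string, url: string): string {
  const hasSpace = /[ \t]/.test(url);
  if (!hasSpace) {
    return `[${text}](${url})`;
  }
  // Inside an angle-bracket destination, literal < and > must be escaped.
  const escaped = url.replace(/[<>]/g, (ch) => "\\" + ch);
  return `[${text}](<${escaped}>)`;
}

console.log(formatMarkdownLink("Notes", "files/my notes.pdf"));
console.log(formatMarkdownLink("Home", "https://example.com/"));
```

Without the brackets, a Markdown parser would stop reading the destination at the first space and the link would not render.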

0.18.1

kepano released this 22 Apr 00:00
  • Fix Wikipedia footnotes
  • Fix legitimate link with anchor being removed in headings
  • Don't remove headings based on data attributes

0.18.0

kepano released this 21 Apr 20:19
  • New extractors for LinkedIn, Threads, Bluesky, Discourse, Medium
  • Footnotes refactor with sidenote support and more patterns
  • Content boundary detection and eyebrow removals
  • H1 fallback, title normalization
  • Code blocks: fix duplicate language name (#235)
  • Metadata: rel=author fallback, date deduplication, author name cleanup
  • Keep content grids with lots of content
  • Audio/video source parity, empty video placeholder removal
  • X extractor refactored to use comments for replies

0.17.0

kepano released this 15 Apr 03:28
  • YouTube: Improved transcript with better break points, CJK support
  • Wikipedia: New minimal extractor, keep phonetic pronunciations, detect math in tables
  • NYT: Improved extractor and additional removals
  • HN: Home page support
  • Math: Extract LaTeX from data-math attributes and from images
  • Footnotes: Fix duplicate backrefs, WordPress fixes (#237)
  • ChatGPT: Update extractor for changed DOM structure (#236)
  • General: Configurable fetch option, remove unnecessary <br> between paragraphs, remove tables with no text or media, replace custom elements with divs during standardization (#247), fix dismiss buttons surviving hidden-content retry (#234), retain ULs against overly aggressive removal
  • Removals: ieee.org, ToC content patterns, breadcrumbs, sidebar/menu checkboxes

0.16.0

kepano released this 09 Apr 23:39
  • X/Twitter: Fix inline images in articles, get header images
  • Callouts: Improved Obsidian callout support
  • SVG: Improved conversion to not rely on stylesheets
  • Tailwind: Author pattern, more date/reading time pattern removals, convert block spans to paragraphs
  • Substack: Better handling of Likes at end of post, more removal patterns
  • NYT: Fix authors, additional removals
  • FT.com: New removals
  • General: Remove parentheses around authors, remove URLs in authors, keep initial media in articles, secondary pass cleanup

0.15.0

kepano released this 31 Mar 17:18

Improved

  • Add profiler for performance debugging
  • Performance optimizations for math and content patterns (#212)
  • Footnotes: Alternate aside style, inline improvements, false positive fixes, loose footnotes, and HTML named anchor footnotes
  • Code blocks: More syntax highlighting patterns, Chroma and CodeMirror support
  • Improve unique author filtering and deduplication
  • Extract authors and dates from cover elements
  • YouTube: Add timeouts, fallbacks, fix stale metadata after SPA navigation (#174)
  • YouTube Shorts handling (#206)
  • YouTube: Respect preferred transcript language (#202)
  • Reddit: Remove duplication, fix author extraction when comments haven't loaded (#204)
  • Tailwind: Improve patterns for footnotes, metadata, and removals
  • Content pattern removals for newsletters, related posts, breadcrumbs
  • Extract BBcode formatting
  • Substack extractor (#216)
  • Honor proxy settings (#165)

Fixes

  • Fix og:title brand name used as article title (#196)
  • Fix MathJax SVG / MathML-only math rendering (#201)
  • Fix main content embedded into figure elements
  • Fix flex-row line gutters and invalid code>pre nesting in code blocks
  • Fix buttons appearing in code blocks
  • Fix X author fallback on status URLs (#208)
  • Fix charset parsing for quoted and trailing-comma values
  • Fix proper description metadata extraction (#198)
  • Fix bad author strings from broken CMS templates (#207)
  • Preserve text when footnote reference is wrapped around reference
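
As a rough illustration of the charset fix above, the sketch below parses a charset from a Content-Type-style value while tolerating quotes and a trailing comma. `parseCharset` and its regular expression are assumptions made for this example and are not taken from defuddle's source.

```typescript
// Hypothetical helper: extracts a charset label from values such as
// 'text/html; charset="UTF-8"' or 'text/html; charset=utf-8,'.
function parseCharset(contentType: string): string | null {
  // Accept a double-quoted, single-quoted, or bare token after "charset=".
  const match = /charset\s*=\s*("[^"]*"|'[^']*'|[^\s;,]+)/i.exec(contentType);
  if (!match) return null;
  const raw = match[1] ?? "";
  const cleaned = raw
    .replace(/^["']|["']$/g, "") // strip surrounding quotes
    .replace(/,+$/, "")          // strip trailing commas
    .trim();
  return cleaned ? cleaned.toLowerCase() : null;
}

console.log(parseCharset('text/html; charset="UTF-8"'));
console.log(parseCharset("text/html; charset=utf-8,"));
```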

0.14.0

kepano released this 17 Mar 00:09
b1983b2

New

  • Add includeReplies option to exclude replies from extractors (Reddit, HN, GitHub, Twitter/X)
  • Standardize callouts (Obsidian Publish, GitHub, Bootstrap) (#182)

Improved

  • YouTube: Improve mobile extraction
  • Truncate descriptions to 300 words
  • Use <base href> to resolve relative URLs (#179)
  • Pass separateMarkdown in CLI when --markdown, --md, or --json is used (#164)
  • Prefer highest resolution image (rel #177)
  • Remove <wbr> tags to prevent unwanted spaces in markdown (rel #172)
  • Standardize removing anchors from headings
  • Remove boundary patterns (#184)
  • Exit with error when no content is extracted from CLI (rel #170)
  • Use last match for metadata tags (#183)
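
The `<base href>` item (#179) comes down to standard URL resolution: relative links resolve against the base element's URL when one is present, otherwise against the document URL. The sketch below shows that rule with the built-in `URL` class. `resolveWithBase` is a hypothetical name and does not reflect how defuddle implements it.

```typescript
// Hypothetical helper: resolves a relative URL the way a browser would
// when the page declares <base href="...">.
function resolveWithBase(
  relative: string,
  documentUrl: string,
  baseHref?: string
): string {
  // The base href may itself be relative to the document URL.
  const base = baseHref ? new URL(baseHref, documentUrl) : new URL(documentUrl);
  return new URL(relative, base).href;
}

console.log(resolveWithBase("img/a.png", "https://example.com/posts/1", "/static/"));
console.log(resolveWithBase("img/a.png", "https://example.com/posts/1"));
```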

Fixes

  • Fix content extraction for pages without semantic entry points, e.g. Oxygen Builder
  • Fix el.className for SVG elements where className is an SVGAnimatedString (#169)
  • Unwrap <a> tags inside <code> elements to plain text before Markdown conversion (#168)
  • Protect child elements inside code blocks from partial selector removal (#167)
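
For context on the `el.className` fix (#169): on HTML elements `className` is a string, but on SVG elements it is an `SVGAnimatedString` object whose text lives in `baseVal`, so string methods called on it fail. The sketch below uses minimal stand-in types so it runs without a DOM. `classNameOf` is a hypothetical helper, not defuddle's code.

```typescript
// Stand-in types for this sketch: HTML elements expose a string,
// SVG elements expose an object with a baseVal string.
type ClassNameLike = string | { baseVal: string };
interface ElementLike {
  className: ClassNameLike;
}

// Hypothetical helper: always returns the class list as a plain string.
function classNameOf(el: ElementLike): string {
  return typeof el.className === "string" ? el.className : el.className.baseVal;
}

console.log(classNameOf({ className: "article body" }));
console.log(classNameOf({ className: { baseVal: "icon" } }));
```

In a real DOM, `el.getAttribute('class')` is another way to read a plain string for both element kinds.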

0.13.0

kepano released this 13 Mar 17:20

Breaking changes

  • defuddle/node now accepts any DOM Document (linkedom, happy-dom, JSDOM, etc.), not just JSDOM.
  • JSDOM is no longer a peer dependency. linkedom is now the recommended DOM parser.
  • Passing a raw HTML string or JSDOM instance to defuddle/node is deprecated and will be removed in the next major version.

Recommended usage

import { parseHTML } from 'linkedom';
import { Defuddle } from 'defuddle/node';

// `html` is the page's HTML source as a string
const { document } = parseHTML(html);
// The second argument is the URL of the page
const result = await Defuddle(document, 'https://example.com/article');

Passing a JSDOM instance still works but is deprecated:

// Deprecated: passing the JSDOM instance itself
const legacyResult = await Defuddle(dom, url);
// Preferred: pass dom.window.document directly
const result = await Defuddle(dom.window.document, url);

Improvements

  • Generic document support for non-HTML content (#166)
  • YouTube: Use existing page transcript before fetching via API
  • YouTube: Improved transcript grouping, sentence merging, and cross-environment support
  • YouTube: Fix diarization stripping - speaker markers from auto-captions
  • Add .post-body entry point for Ghost CMS sites
  • Smarter retry for hidden content (#163)
  • CJK word count support (#158)
  • Precompile partial selector regex for faster parsing (#157)
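
The CJK word count item (#158) addresses a general problem: Chinese and Japanese text has no spaces between words, so splitting on whitespace undercounts badly. The sketch below shows one common approach and assumes each CJK character counts as one word. Both that rule and the character ranges are illustrative assumptions, and `countWordsCjkAware` is not defuddle's function.

```typescript
// Hypothetical helper: counts whitespace-separated words, plus one word
// per CJK character (Hiragana, Katakana, CJK Unified Ideographs, Ext. A).
function countWordsCjkAware(text: string): number {
  const cjk = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/g;
  const cjkCount = (text.match(cjk) ?? []).length;
  // Replace CJK characters with spaces so the remaining words split cleanly.
  const rest = text.replace(cjk, " ").trim();
  const otherCount = rest ? rest.split(/\s+/).length : 0;
  return cjkCount + otherCount;
}

console.log(countWordsCjkAware("hello world"));
console.log(countWordsCjkAware("你好世界"));
console.log(countWordsCjkAware("Obsidian 笔记 app"));
```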

Fixes

  • Fix syntax highlighting for Lean (#159, #160)
  • Fix filenames on Windows (#155)
  • Fix spacing between exclamation and image in markdown
  • Fix newlines in Verso

0.12.0

kepano released this 10 Mar 20:41

Features

  • YouTube transcripts with diarization (#149, #153)
  • Standardized comment extraction (#153)
  • Speed up Node.js parsing (#150)

Fixes

  • Fix scoring for tables (#148)
  • More footnote backref removal patterns
  • Timeout error handling
  • Scoring pattern improvements (#152)
  • Small content removal improvements
  • Language variable fix

Docs

  • Add instructions to install globally