Releases · kepano/defuddle

Release list

0.19.1 Latest

kepano released this 24 Jun 14:11

6c39f09

Sanitize extractor HTML #326
Update deps

0.19.0

kepano released this 16 Jun 16:12

19520ad

Bilibili: video subtitle/transcript extraction (#271)
ChatGPT: preserve citation footnotes (#311), fix content after Thought sections (#302)
CLI: --frontmatter flag for YAML metadata output (#313), --user-agent flag for fetched URLs, read HTML from stdin (#192)
Math: reconstruct MathML from MathJax CHTML render tree (#250), fix Markdown conversion for complex and aligned math (#301), flatten tagged single-equation tables so they render horizontally
Tables: preserve trailing columns in ragged tables (#299)
X/Twitter: prefer async extraction in Node for X articles (#318), fix non-English aria-label reply extraction (#290), fix FxTwitter facet indices for surrogate-pair emoji (#281)
YouTube: fetch videoData from a more reliable source (#211)
Fix: don't mutate the live document in parse()
Fix: don't strip sections with delimiter-less anchor ids (#303)
Fix: author/date byline heuristic no longer matches day-of-week as a date (#291)
Markdown: encapsulate links with spaces (#278)
Remove subhead (#316); remove <style> from SVGs for security
Add CI; dependency upgrades (esbuild, mathml-to-latex, temml)
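
The FxTwitter facet fix (#281) concerns text offsets that drift when an emoji occupies two UTF-16 code units. As a rough illustration only, and assuming the facet indices count Unicode code points (an assumption, not something these notes confirm), a conversion to JavaScript string offsets might look like the sketch below. The helper names are hypothetical and this is not defuddle's actual code.

```typescript
// Hypothetical helper: convert a code-point offset into a UTF-16 code-unit offset.
function codePointToUtf16Index(text: string, codePointIndex: number): number {
  let unit = 0; // position in UTF-16 code units
  let seen = 0; // code points consumed so far
  while (unit < text.length && seen < codePointIndex) {
    const code = text.charCodeAt(unit);
    const next = unit + 1 < text.length ? text.charCodeAt(unit + 1) : 0;
    // A high surrogate followed by a low surrogate forms one code point.
    const isPair =
      code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    unit += isPair ? 2 : 1;
    seen++;
  }
  return unit;
}

// Hypothetical helper: slice a string using code-point offsets.
function sliceByCodePoints(text: string, start: number, end: number): string {
  return text.slice(
    codePointToUtf16Index(text, start),
    codePointToUtf16Index(text, end)
  );
}
```

With the input "😀 #tag", sliceByCodePoints(text, 2, 6) yields "#tag", whereas a naive text.slice(2, 6) yields " #ta" because the emoji takes two code units.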

0.18.1

kepano released this 22 Apr 00:00

7504b6c

Fix Wikipedia footnotes
Fix legitimate link with anchor being removed in headings
Don't remove headings based on data attributes

0.18.0

kepano released this 21 Apr 20:19

c11b9bb

New extractors for LinkedIn, Threads, Bluesky, Discourse, Medium
Footnotes refactor with sidenote support and more patterns
Content boundary detection and eyebrow removals
H1 fallback, title normalization
Code blocks: fix duplicate language name (#235)
Metadata: rel=author fallback, date deduplication, author name cleanup
Keep content grids with lots of content
Audio/video source parity, empty video placeholder removal
X extractor refactored to use comments for replies

0.17.0

kepano released this 15 Apr 03:28

fd92139

YouTube: Improved transcript with better break points, CJK support
Wikipedia: New minimal extractor, keep phonetic pronunciations, detect math in tables
NYT: Improved extractor and additional removals
HN: Home page support
Math: Extract LaTeX from data-math attributes and from images
Footnotes: Fix duplicate backrefs, WordPress fixes (#237)
ChatGPT: Update extractor for changed DOM structure (#236)
General: Configurable fetch option, remove unnecessary <br> between paragraphs, remove tables with no text or media, replace custom elements with divs during standardization (#247), fix dismiss buttons surviving hidden-content retry (#234), retain ULs against overly aggressive removal
Removals: ieee.org, ToC content patterns, breadcrumbs, sidebar/menu checkboxes

0.16.0

kepano released this 09 Apr 23:39

66049af

X/Twitter: Fix inline images in articles, get header images
Callouts: Improved Obsidian callout support
SVG: Improved conversion to not rely on stylesheets
Tailwind: Author pattern, more date/reading time pattern removals, convert block spans to paragraphs
Substack: Better handling of Likes at end of post, more removal patterns
NYT: Fix authors, additional removals
FT.com: New removals
General: Remove parentheses around authors, remove URLs in authors, keep initial media in articles, secondary pass cleanup

0.15.0

kepano released this 31 Mar 17:18

e29efd1

Improved

Add profiler for performance debugging
Performance optimizations for math and content patterns (#212)
Footnotes: Alternate aside style, inline improvements, false positive fixes, loose footnotes, and HTML named anchor footnotes
Code blocks: More syntax highlighting patterns, Chroma and CodeMirror support
Improve unique author filtering and deduplication
Extract authors and dates from cover elements
YouTube: Add timeouts, fallbacks, fix stale metadata after SPA navigation (#174)
YouTube Shorts handling (#206)
YouTube: Respect preferred transcript language (#202)
Reddit: Remove duplication, fix author extraction when comments haven't loaded (#204)
Tailwind: Improve patterns for footnotes, metadata, and removals
Content pattern removals for newsletters, related posts, breadcrumbs
Extract BBcode formatting
Substack extractor (#216)
Honor proxy settings (#165)

Fixes

Fix og:title brand name used as article title (#196)
Fix MathJax SVG / MathML-only math rendering (#201)
Fix main content embedded into figure elements
Fix flex-row line gutters and invalid code>pre nesting in code blocks
Fix buttons appearing in code blocks
Fix X author fallback on status URLs (#208)
Fix charset parsing for quoted and trailing-comma values
Fix proper description metadata extraction (#198)
Fix bad author strings from broken CMS templates (#207)
Preserve text when footnote reference is wrapped around reference
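
For the charset parsing fix above, a minimal sketch of tolerant parsing could look like this. It assumes the charset arrives in a Content-Type style string; the function name and regular expression are illustrative and not taken from defuddle's source.

```typescript
// Hypothetical helper: extract a charset from a Content-Type style value,
// tolerating quotes around the value and a trailing comma or semicolon.
function parseCharset(contentType: string): string | null {
  const match = /charset\s*=\s*["']?([^"'\s;,]+)/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}
```

Under this sketch, both 'text/html; charset="UTF-8"' and 'text/html; charset=utf-8,' resolve to "utf-8", and a value with no charset returns null.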

0.14.0

kepano released this 17 Mar 00:09

b1983b2

New

Add includeReplies option to exclude replies from extractors (Reddit, HN, GitHub, Twitter/X)
Standardize callouts (Obsidian Publish, GitHub, Bootstrap) (#182)

Improved

YouTube: Improve mobile extraction
Truncate descriptions to 300 words
Use <base href> to resolve relative URLs (#179)
Pass separateMarkdown in CLI when --markdown, --md, or --json is used (#164)
Prefer highest resolution image (rel #177)
Remove <wbr> tags to prevent unwanted spaces in markdown (rel #172)
Standardize removing anchors from headings
Remove boundary patterns (#184)
Exit with error when no content is extracted from CLI (rel #170)
Use last match for metadata tags (#183)
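
To illustrate the <base href> item (#179), here is a small sketch of relative URL resolution using the standard URL constructor. The function name and parameters are hypothetical, and defuddle's own implementation may differ.

```typescript
// Hypothetical helper: resolve a relative URL, honoring an optional <base href>.
function resolveUrl(relative: string, pageUrl: string, baseHref?: string): string {
  // The <base href> value may itself be relative to the page URL.
  const base = baseHref ? new URL(baseHref, pageUrl).href : pageUrl;
  return new URL(relative, base).href;
}
```

For a page at https://example.com/blog/post with a base href of /assets/, the relative path img/a.png resolves to https://example.com/assets/img/a.png; without the base it resolves to https://example.com/blog/img/a.png.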

Fixes

Fix content extraction for pages without semantic entry points, e.g. Oxygen Builder
Fix el.className for SVG elements where className is an SVGAnimatedString (#169)
Unwrap <a> tags inside <code> elements to plain text before Markdown conversion (#168)
Protect child elements inside code blocks from partial selector removal (#167)
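
The className fix (#169) reflects a DOM quirk: on SVG elements, className is an SVGAnimatedString object and not a plain string. A defensive read might look like the sketch below; the interface and function name are hypothetical stand-ins so the example runs without a DOM, and this is not defuddle's actual code.

```typescript
// Minimal stand-in for the parts of Element this sketch relies on.
interface ClassNameCarrier {
  className: unknown;
  getAttribute(name: string): string | null;
}

// Hypothetical helper: always return the class list as a string.
function getClassName(el: ClassNameCarrier): string {
  // HTML elements expose className as a string.
  if (typeof el.className === "string") return el.className;
  // SVG elements expose an object, so fall back to the raw attribute.
  return el.getAttribute("class") || "";
}
```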

0.13.0

kepano released this 13 Mar 17:20

8b00d99

Breaking changes

defuddle/node now accepts any DOM Document (linkedom, happy-dom, JSDOM, etc.), not just JSDOM.
JSDOM is no longer a peer dependency. linkedom is now the recommended DOM parser.
Passing a raw HTML string or JSDOM instance to defuddle/node is deprecated and will be removed in the next major version.

Recommended usage

import { parseHTML } from 'linkedom';
import { Defuddle } from 'defuddle/node';

const { document } = parseHTML(html);
const result = await Defuddle(document, 'https://example.com/article');

Passing a JSDOM instance still works but is deprecated:

// Deprecated: passing the JSDOM instance itself
const deprecatedResult = await Defuddle(dom, url);
// Preferred: pass dom.window.document directly
const result = await Defuddle(dom.window.document, url);

Improvements

Generic document support for non-HTML content (#166)
YouTube: Use existing page transcript before fetching via API
YouTube: Improved transcript grouping, sentence merging, and cross-environment support
YouTube: Fix diarization stripping - speaker markers from auto-captions
Add .post-body entry point for Ghost CMS sites
Smarter retry for hidden content (#163)
CJK word count support (#158)
Precompile partial selector regex for faster parsing (#157)
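
As a rough idea of what CJK word counting (#158) involves: whitespace splitting undercounts text in scripts that do not separate words with spaces. One common approximation, assumed here and not confirmed as defuddle's approach, counts each CJK character as one word and splits the rest on whitespace. The character ranges and function name are illustrative.

```typescript
// Assumed ranges: kana, CJK ideographs (with Extension A), and Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

// Hypothetical helper: approximate word count for mixed CJK and spaced text.
function countWords(text: string): number {
  // Each CJK character counts as one word in this approximation.
  const cjkCount = (text.match(CJK_CHAR) || []).length;
  // Remaining text is split on whitespace as usual.
  const rest = text.replace(CJK_CHAR, " ").trim();
  const spacedCount = rest === "" ? 0 : rest.split(/\s+/).length;
  return cjkCount + spacedCount;
}
```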

Fixes

Fix syntax highlighting for Lean (#159, #160)
Fix filenames on Windows (#155)
Fix spacing between exclamation and image in markdown
Fix newlines in Verso

0.12.0

kepano released this 10 Mar 20:41

0f00e4d

Features

YouTube transcripts with diarization (#149, #153)
Standardized comment extraction (#153)
Speed up Node.js parsing (#150)

Fixes

Fix scoring for tables (#148)
More footnote backref removal patterns
Timeout error handling
Scoring pattern improvements (#152)
Small content removal improvements
Language variable fix

Docs

Add instructions to install globally
