Releases: kepano/defuddle
0.19.1
0.19.0
- Bilibili: video subtitle/transcript extraction (#271)
- ChatGPT: preserve citation footnotes (#311), fix content after Thought sections (#302)
- CLI: `--frontmatter` flag for YAML metadata output (#313), `--user-agent` flag for fetched URLs, read HTML from stdin (#192)
- Math: reconstruct MathML from MathJax CHTML render tree (#250), fix Markdown conversion for complex and aligned math (#301), flatten tagged single-equation tables so they render horizontally
- Tables: preserve trailing columns in ragged tables (#299)
- X/Twitter: prefer async extraction in Node for X articles (#318), fix non-English aria-label reply extraction (#290), fix FxTwitter facet indices for surrogate-pair emoji (#281)
- YouTube: fetch videoData from a more reliable source (#211)
- Fix: don't mutate the live document in `parse()`
- Fix: don't strip sections with delimiter-less anchor ids (#303)
- Fix: author/date byline heuristic no longer matches day-of-week as a date (#291)
- Markdown: encapsulate links with spaces (#278)
- Remove subhead (#316); remove `<style>` from SVGs for security
- Add CI; dependency upgrades (esbuild, mathml-to-latex, temml)
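The surrogate-pair fix listed above (#281) touches a general JavaScript pitfall: strings are indexed in UTF-16 code units, so an emoji outside the Basic Multilingual Plane occupies two positions. The sketch below is not Defuddle's code. It assumes, purely for illustration, that facet offsets count Unicode code points; `sliceByCodePoints` and the sample facet are hypothetical.

```javascript
// Hypothetical helper (not part of Defuddle): slice by code-point offsets.
// Assumption: the facet's start/end offsets count Unicode code points.
function sliceByCodePoints(text, start, end) {
  // Array.from iterates a string by code point, so a surrogate pair
  // such as "😀" becomes a single array element.
  return Array.from(text).slice(start, end).join("");
}

const text = "😀 hi @user";
const facet = { start: 5, end: 10 }; // made-up offsets covering "@user"

// Naive slicing counts UTF-16 code units and lands one position early.
const naive = text.slice(facet.start, facet.end);
const fixed = sliceByCodePoints(text, facet.start, facet.end);

console.log(JSON.stringify(naive)); // " @use"
console.log(JSON.stringify(fixed)); // "@user"
```

Converting the string to an array of code points is simple but allocates; for long texts, translating each offset once into a code-unit index would be cheaper.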
0.18.1
0.18.0
- New extractors for LinkedIn, Threads, Bluesky, Discourse, Medium
- Footnotes refactor with sidenote support and more patterns
- Content boundary detection and eyebrow removals
- H1 fallback, title normalization
- Code blocks: fix duplicate language name (#235)
- Metadata: `rel=author` fallback, date deduplication, author name cleanup
- Keep content grids with lots of content
- Audio/video source parity, empty video placeholder removal
- X extractor refactored to use comments for replies
0.17.0
- YouTube: Improved transcript with better break points, CJK support
- Wikipedia: New minimal extractor, keep phonetic pronunciations, detect math in tables
- NYT: Improved extractor and additional removals
- HN: Home page support
- Math: Extract LaTeX from data-math attributes and from images
- Footnotes: Fix duplicate backrefs, WordPress fixes (#237)
- ChatGPT: Update extractor for changed DOM structure (#236)
- General: Configurable fetch option, remove unnecessary `<br>` between paragraphs, remove tables with no text or media, replace custom elements with divs during standardization (#247), fix dismiss buttons surviving hidden-content retry (#234), retain ULs against overly aggressive removal
- Removals: ieee.org, ToC content patterns, breadcrumbs, sidebar/menu checkboxes
0.16.0
- X/Twitter: Fix inline images in articles, get header images
- Callouts: Improved Obsidian callout support
- SVG: Improved conversion to not rely on stylesheets
- Tailwind: Author pattern, more date/reading time pattern removals, convert block spans to paragraphs
- Substack: Better handling of Likes at end of post, more removal patterns
- NYT: Fix authors, additional removals
- FT.com: New removals
- General: Remove parentheses around authors, remove URLs in authors, keep initial media in articles, secondary pass cleanup
0.15.0
Improved
- Add profiler for performance debugging
- Performance optimizations for math and content patterns (#212)
- Footnotes: Alternate aside style, inline improvements, false positive fixes, loose footnotes, and HTML named anchor footnotes
- Code blocks: More syntax highlighting patterns, Chroma and CodeMirror support
- Improve unique author filtering and deduplication
- Extract authors and dates from cover elements
- YouTube: Add timeouts, fallbacks, fix stale metadata after SPA navigation (#174)
- YouTube Shorts handling (#206)
- YouTube: Respect preferred transcript language (#202)
- Reddit: Remove duplication, fix author extraction when comments haven't loaded (#204)
- Tailwind: Improve patterns for footnotes, metadata, and removals
- Content pattern removals for newsletters, related posts, breadcrumbs
- Extract BBcode formatting
- Substack extractor (#216)
- Honor proxy settings (#165)
Fixes
- Fix og:title brand name used as article title (#196)
- Fix MathJax SVG / MathML-only math rendering (#201)
- Fix main content embedded into figure elements
- Fix flex-row line gutters and invalid `code>pre` nesting in code blocks
- Fix buttons appearing in code blocks
- Fix X author fallback on status URLs (#208)
- Fix charset parsing for quoted and trailing-comma values
- Fix proper description metadata extraction (#198)
- Fix bad author strings from broken CMS templates (#207)
- Preserve text when footnote reference is wrapped around reference
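To make the charset fix above concrete, here is a minimal sketch of tolerant charset parsing. It is not Defuddle's implementation; `parseCharset` is a hypothetical helper, and the handling shown (strip surrounding quotes and trailing commas) is an assumption about what "quoted and trailing-comma values" covers.

```javascript
// Hypothetical helper (not Defuddle's implementation): extract a charset
// label from a Content-Type style string.
function parseCharset(contentType) {
  // Capture everything after "charset=" up to the next ";" or whitespace.
  const match = /charset\s*=\s*([^;\s]+)/i.exec(contentType);
  if (!match) return null;
  // Drop leading quotes and trailing quotes or commas, then lowercase.
  const label = match[1].replace(/^["']+|["',]+$/g, "").toLowerCase();
  return label || null;
}

console.log(parseCharset('text/html; charset="UTF-8"'));
console.log(parseCharset("text/html; charset=utf-8,"));
console.log(parseCharset("text/html"));
```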
0.14.0
New
- Add `includeReplies` option to exclude replies from extractors (Reddit, HN, GitHub, Twitter/X)
- Standardize callouts (Obsidian Publish, GitHub, Bootstrap) (#182)
Improved
- YouTube: Improve mobile extraction
- Truncate descriptions to 300 words
- Use `<base href>` to resolve relative URLs (#179)
- Pass separateMarkdown in CLI when --markdown, --md, or --json is used (#164)
- Prefer highest resolution image (rel #177)
- Remove `<wbr>` tags to prevent unwanted spaces in markdown (rel #172)
- Standardize removing anchors from headings
- Remove boundary patterns (#184)
- Exit with error when no content is extracted from CLI (rel #170)
- Use last match for metadata tags (#183)
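The `<base href>` item above (#179) can be illustrated with the standard `URL` constructor. This is a sketch of the general technique rather than Defuddle's code; `resolveUrl` and the example URLs are hypothetical.

```javascript
// Hypothetical helper (not Defuddle's implementation): resolve a link
// against the page's <base href> when present, else against the page URL.
function resolveUrl(href, pageUrl, baseHref) {
  // A <base href> may itself be relative, so resolve it against the page URL first.
  const base = baseHref ? new URL(baseHref, pageUrl) : new URL(pageUrl);
  return new URL(href, base).href;
}

const pageUrl = "https://example.com/blog/post/index.html";

console.log(resolveUrl("img/a.png", pageUrl, null));
// https://example.com/blog/post/img/a.png
console.log(resolveUrl("img/a.png", pageUrl, "/assets/"));
// https://example.com/assets/img/a.png
console.log(resolveUrl("img/a.png", pageUrl, "https://cdn.example.com/v1/"));
// https://cdn.example.com/v1/img/a.png
```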
Fixes
- Fix content extraction for pages without semantic entry points, e.g. Oxygen Builder
- Fix `el.className` for SVG elements where `className` is an `SVGAnimatedString` (#169)
- Unwrap `<a>` tags inside `<code>` elements to plain text before Markdown conversion (#168)
- Protect child elements inside code blocks from partial selector removal (#167)
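The `className` fix above (#169) reflects a DOM quirk: on SVG elements, `className` is an `SVGAnimatedString` object with `baseVal` and `animVal` properties, so string methods called on it fail. A minimal sketch of a defensive read follows; `getClassName` is a hypothetical helper, not Defuddle's code, and plain objects stand in for DOM nodes so that it runs without a DOM.

```javascript
// Hypothetical helper (not Defuddle's implementation): read an element's
// class list as a plain string for both HTML and SVG elements.
function getClassName(el) {
  const cn = el.className;
  if (typeof cn === "string") return cn; // HTML elements
  if (cn && typeof cn.baseVal === "string") return cn.baseVal; // SVG elements
  return "";
}

// Plain objects standing in for DOM nodes.
const htmlLike = { className: "post-body wide" };
const svgLike = { className: { baseVal: "icon icon-small", animVal: "icon icon-small" } };

console.log(getClassName(htmlLike)); // post-body wide
console.log(getClassName(svgLike)); // icon icon-small
```

In a real DOM, `el.getAttribute('class')` is another way to get a string for either element type.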
0.13.0
Breaking changes
- `defuddle/node` now accepts any DOM `Document` (linkedom, happy-dom, JSDOM, etc.), not just JSDOM.
- JSDOM is no longer a peer dependency. linkedom is now the recommended DOM parser.
- Passing a raw HTML string or JSDOM instance to `defuddle/node` is deprecated and will be removed in the next major version.
Recommended usage
```javascript
import { parseHTML } from 'linkedom';
import { Defuddle } from 'defuddle/node';
const { document } = parseHTML(html);
const result = await Defuddle(document, 'https://example.com/article');
```
Passing a JSDOM instance still works but is deprecated:
```javascript
// @deprecated: pass dom.window.document directly instead
const result = await Defuddle(dom, url);
// Preferred
const result = await Defuddle(dom.window.document, url);
```
Improvements
- Generic document support for non-HTML content (#166)
- YouTube: Use existing page transcript before fetching via API
- YouTube: Improved transcript grouping, sentence merging, and cross-environment support
- YouTube: Fix diarization stripping `-` speaker markers from auto-captions
- Add `.post-body` entry point for Ghost CMS sites
- Smarter retry for hidden content (#163)
- CJK word count support (#158)
- Precompile partial selector regex for faster parsing (#157)
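CJK word counting (#158) matters because CJK text does not separate words with whitespace, so splitting on spaces undercounts badly. The sketch below shows one common approach and is not Defuddle's implementation: `countWords` is a hypothetical helper, and counting each Han, Hiragana, Katakana, or Hangul character as one word is an assumption.

```javascript
// Hypothetical helper (not Defuddle's implementation): count words in mixed text.
// Assumption: each Han, Hiragana, Katakana, or Hangul character counts as one word.
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;

function countWords(text) {
  const cjkCount = (text.match(CJK) || []).length;
  // Replace CJK characters with spaces, then count the remaining
  // whitespace-separated tokens (punctuation stays attached to tokens).
  const rest = text.replace(CJK, " ").trim();
  const otherCount = rest ? rest.split(/\s+/).length : 0;
  return cjkCount + otherCount;
}

console.log(countWords("hello world")); // 2
console.log(countWords("你好世界")); // 4
console.log(countWords("Defuddle 支持 CJK")); // 4
```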
Fixes
0.12.0
Features
- YouTube transcripts with diarization (#149, #153)
- Standardized comment extraction (#153)
- Speed up Node.js parsing (#150)
Fixes
- Fix scoring for tables (#148)
- More footnote backref removal patterns
- Timeout error handling
- Scoring pattern improvements (#152)
- Small content removal improvements
- Language variable fix
Docs
- Add instructions to install globally