Skip to content

Conversation

@OGuggenbuehl
Copy link

@OGuggenbuehl OGuggenbuehl commented Jul 29, 2025

Proposed Changes:

Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue
@CLAassistant
Copy link

CLAassistant commented Jul 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 29, 2025
@OGuggenbuehl OGuggenbuehl changed the title Feature/md header splitter Jul 29, 2025
@sjrl sjrl self-assigned this Aug 19, 2025
@sjrl
Copy link
Contributor

sjrl commented Aug 19, 2025

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the DocumentSplitter https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples.

@sjrl sjrl changed the title feat:MarkdownHeaderSplitter Aug 27, 2025
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from 61a8396 to bcbbf9a Compare September 16, 2025 13:57
@sjrl
Copy link
Contributor

sjrl commented Sep 18, 2025

Thanks for your continued work on this @OGuggenbuehl!

Some general comments. Could you:

  • Add a release note for this PR following the instructions here
  • Could you make sure to include our license header to the beginning of each file you've added. You can find an example of the license header here
  • Please make sure to sign the CLA agreement (docs about it here) from this comment
  • If you haven't already please also set up pre-commit hooks using pre-commit install. You can find more info about that in this section of our contribution guidelines.
  • Also in the future feel free to open branches directly in Haystack instead of using a fork. This makes it slightly easier to pull down your code to review locally.
@coveralls
Copy link
Collaborator

coveralls commented Sep 19, 2025

Pull Request Test Coverage Report for Build 19816136481

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 92.189%

Files with Coverage Reduction New Missed Lines %
core/pipeline/async_pipeline.py 3 65.88%
Totals Coverage Status
Change from base Build 19775495543: 0.0%
Covered Lines: 14174
Relevant Lines: 15375

💛 - Coveralls
@OGuggenbuehl OGuggenbuehl marked this pull request as ready for review September 19, 2025 16:05
@OGuggenbuehl OGuggenbuehl requested review from a team as code owners September 19, 2025 16:05
@OGuggenbuehl OGuggenbuehl removed the request for review from a team September 19, 2025 16:05
OGuggenbuehl and others added 28 commits December 1, 2025 09:28
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
minor commenting
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from c7264e6 to ad155cc Compare December 1, 2025 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment