feat: locale aware chunking in smoothStream via Intl.Segmenter #7423

Open: wants to merge 2 commits into main

Rajaniraiyn

Background

Currently, smoothStream uses regex-based chunking. For many languages that do not use spaces as word delimiters, this does not work well, and chunks have to be provided manually. Intl.Segmenter is often more performant and robust than regex-based approaches, and it unlocks new chunking strategies (grapheme and sentence). The API has also been widely supported for some time now.
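To make the gap concrete, here is a minimal sketch comparing a whitespace-based regex with Intl.Segmenter on Japanese text. The pattern `/\S+\s+/m` is an assumption standing in for regex-based word chunking, not a copy of the smoothStream internals, and the exact word boundaries the segmenter returns depend on the runtime's ICU data.

```typescript
// Assumption: a whitespace-delimited word pattern, used here only for illustration.
const wordRegex = /\S+\s+/m;

const english = 'Hello world, this is a stream. ';
const japanese = '今日は天気がいいですね。';

// English: the regex finds a first "word plus trailing whitespace" chunk.
const englishMatch = wordRegex.exec(english);
console.log(englishMatch?.[0]); // "Hello "

// Japanese: there is no whitespace, so the regex never matches and
// nothing would be emitted until the whole text has arrived.
const japaneseMatch = wordRegex.exec(japanese);
console.log(japaneseMatch); // null

// Intl.Segmenter finds word boundaries without relying on spaces.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const japaneseWords = Array.from(segmenter.segment(japanese), s => s.segment);
console.log(japaneseWords); // several segments; exact split depends on ICU
```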

Summary

Added new chunking options (word-intl, grapheme, and sentence) based on Intl.Segmenter, plus locale-specific configuration via the segmenterOptions option.

smoothStream({
  delayInMs: 10,
  chunking: 'word-intl',
  segmenterOptions: { locale: 'ja' },
})

smoothStream({
  chunking: 'grapheme',
})
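For reference, the three granularities behave as follows when Intl.Segmenter is used directly. This is a sketch only: `segmentBy` is a hypothetical helper, not part of the AI SDK, and the assumption is that the proposed word-intl, grapheme, and sentence options map onto the `word`, `grapheme`, and `sentence` granularities.

```typescript
type Granularity = 'grapheme' | 'word' | 'sentence';

// Hypothetical helper: split text into segments at the given granularity.
function segmentBy(text: string, granularity: Granularity, locale?: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text), s => s.segment);
}

const sample = 'Hello there. How are you?';

// Word granularity returns words, spaces, and punctuation as separate segments.
console.log(segmentBy(sample, 'word', 'en'));

// Sentence granularity keeps trailing whitespace with the preceding sentence.
console.log(segmentBy(sample, 'sentence', 'en'));

// Grapheme granularity keeps user-perceived characters intact: this family
// emoji is three code points joined by zero-width joiners (8 UTF-16 units),
// yet it is reported as a single segment.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(family.length); // 8
console.log(segmentBy(family, 'grapheme').length); // 1
```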

Tasks

  • Tests have been added / updated (for bug fixes / features)
  • Documentation has been added / updated (for bug fixes / features)
  • A patch changeset for relevant packages has been added (for bug fixes / features - run pnpm changeset in the project root)
  • Formatting issues have been fixed (run pnpm prettier-fix in the project root)

Future Work

Intl.Segmenter could become the default chunking strategy instead of regex-based chunking, and the overall API could be improved with a focus on DX.

…nd word-intl chunking

- Updated chunking options to include 'grapheme' and 'word-intl' for improved text segmentation.
- Added segmenterOptions for locale-specific chunking configurations.
- Enhanced documentation to reflect new chunking capabilities and usage examples.
- Introduced tests for grapheme and word-intl chunking to ensure correct functionality.
@Rajaniraiyn Rajaniraiyn changed the title feat: enhance smoothStream with Intl.Segmenter support for grapheme and word-intl chunking Jul 20, 2025
@lgrammel
Collaborator

Thanks. I am very wary of mixing regexp and Intl.Segmenter and of having options such as word-intl.

smoothStream supports a chunk detector:

/**
 * Detects the first chunk in a buffer.
 *
 * @param buffer - The buffer to detect the first chunk in.
 *
 * @returns The first detected chunk, or `undefined` if no chunk was detected.
 */
export type ChunkDetector = (buffer: string) => string | undefined | null;

It should be possible to use Intl.Segmenter that way without changing smoothStream.

Can you confirm? Ideally, the change would then be limited to the documentation for now.

@Rajaniraiyn
Author

Yes, we could do that. I felt it would be better for the built-in word and line based chunking to use the language-agnostic Intl.Segmenter API.

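// Word-granularity segmenter for Japanese, created once and reused across calls.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });

// Chunk detector: returns the first segment in the buffer, or null if the buffer is empty.
const intlChunker = (buffer: string) => {
    // Read only the first segment instead of materializing every segment in the buffer.
    const first = segmenter.segment(buffer)[Symbol.iterator]().next();
    if (first.done) return null;
    const { segment } = first.value;
    return segment.length ? segment : null;
};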

I can create a separate PR that only updates the docs, or make a PR that uses Intl.Segmenter as the default chunk detector for the line and word options.
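One caveat with a detector that returns the first segment immediately: while text is still streaming in, the last segment in the buffer may be an incomplete word. A possible variant is sketched below. It emits a segment only once another segment follows it, assuming the detector is called repeatedly as the buffer grows and that whatever remains in the buffer is flushed when the stream ends. The `ChunkDetector` type is copied from the comment above; `createSegmenterChunker` is a hypothetical name, not part of the AI SDK.

```typescript
type ChunkDetector = (buffer: string) => string | undefined | null;

// Hypothetical factory: builds a chunk detector backed by Intl.Segmenter.
function createSegmenterChunker(
  locale: string,
  granularity: 'grapheme' | 'word' | 'sentence' = 'word',
): ChunkDetector {
  const segmenter = new Intl.Segmenter(locale, { granularity });

  return (buffer) => {
    const iterator = segmenter.segment(buffer)[Symbol.iterator]();
    const first = iterator.next();
    if (first.done) return null; // empty buffer

    // Hold back the trailing segment: it may still grow as more text arrives.
    const second = iterator.next();
    if (second.done) return null;

    return first.value.segment;
  };
}

const detect = createSegmenterChunker('en');

console.log(detect('Hel')); // null ("Hel" may still become "Hello")

// Drain a buffer the way a streaming consumer might.
let buffer = 'Hello wor';
const chunks: string[] = [];
let chunk: string | null | undefined;
while ((chunk = detect(buffer))) {
  chunks.push(chunk);
  buffer = buffer.slice(chunk.length);
}
console.log(chunks); // ["Hello", " "]
console.log(buffer); // "wor" stays buffered until more text arrives
```

Since the returned segment is always a prefix of the buffer, a detector like this could presumably be passed to smoothStream as the chunking option, which would keep the change limited to documentation.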
