langflow/docs/docs/components/text-splitters.mdx
2024-06-07 13:44:50 -04:00

42 lines
2.6 KiB
Text

import Admonition from "@theme/Admonition";
# Text Splitters
A text splitter is a tool that divides a document or text into smaller chunks or segments. This helps make large texts more manageable for analysis or processing.
---
## CharacterTextSplitter
The `CharacterTextSplitter` splits a long text into smaller chunks based on a specified character. It aims to keep paragraphs, sentences, and words intact as much as possible since these are semantically related elements of text.
**Parameters**
- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks. This setting ensures a smoother transition between chunks and prevents information loss. For example, with a `chunk_overlap` of 20 and a `chunk_size` of 100, each chunk will have the last 20 characters overlap with the next chunk's first 20 characters. The default is `200`.
- **chunk_size:** The maximum number of characters in each chunk. If the text exceeds the specified `chunk_size`, it will be divided into multiple chunks of equal size, with the possible exception of the last chunk, which may be smaller if fewer characters remain. The default is `1000`.
- **separator:** The character used to split the text into chunks. The default is `.`.
---
## RecursiveCharacterTextSplitter
The `RecursiveCharacterTextSplitter` functions similarly to the `CharacterTextSplitter` by trying to keep paragraphs, sentences, and words together. It also recursively splits the text into smaller chunks if the initial chunk size exceeds a specified threshold.
**Parameters**
- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separators:** A list of characters used to split the text into chunks. The splitter first tries to split text using the first character in the `separators` list. If any chunk exceeds the maximum size, it proceeds to the next character in the list and continues splitting. The defaults are ["\n\n", "\n", " ", ""].
## LanguageRecursiveTextSplitter
The `LanguageRecursiveTextSplitter` divides text into smaller chunks based on the programming language of the text.
**Parameters**
- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separator_type:** This parameter allows splitting text across multiple programming languages such as Ruby, Python, Solidity, Java, and more. The default is `Python`.