import Admonition from "@theme/Admonition";

# Text Splitters

A text splitter divides a document or other text into smaller chunks, which makes large texts more manageable for analysis or processing.

---

## CharacterTextSplitter

The `CharacterTextSplitter` splits a long text into smaller chunks at each occurrence of a specified separator character. It aims to keep paragraphs, sentences, and words intact as much as possible, since these are semantically related units of text.

**Parameters**

- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters shared between consecutive chunks. Overlap smooths the transition between chunks and reduces the risk of losing context at chunk boundaries. For example, with a `chunk_size` of 100 and a `chunk_overlap` of 20, about the last 20 characters of each chunk also appear at the start of the next chunk. The default is `200`.
- **chunk_size:** The maximum number of characters in each chunk. Text longer than `chunk_size` is divided into multiple chunks. Because splits happen only at the separator, chunks are not necessarily equal in size, and the last chunk is often smaller. The default is `1000`.
- **separator:** The character used to split the text into chunks. The default is `.`.
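
The interaction between `separator`, `chunk_size`, and `chunk_overlap` can be sketched in plain Python. This is a simplified illustration, not the component's actual implementation: `split_on_separator` is a hypothetical helper that assumes the text is cut only at the separator and that adjacent pieces are merged greedily until the next piece would exceed `chunk_size`. The tiny sizes in the usage line are chosen only to keep the output readable.

```python
def split_on_separator(text, separator=".", chunk_size=1000, chunk_overlap=200):
    """Split `text` at `separator`, then merge pieces into overlapping chunks.

    Assumes `separator` is a non-empty string.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def joined(parts):
        return separator.join(parts)

    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(joined(current + [piece])) > chunk_size:
            chunks.append(joined(current))
            # Keep only as many trailing pieces as fit in the overlap budget
            # and still leave room for the next piece.
            while current and (
                len(joined(current)) > chunk_overlap
                or len(joined(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(joined(current))
    return chunks


chunks = split_on_separator(
    "aaaa.bbbb.cccc.dddd", separator=".", chunk_size=9, chunk_overlap=4
)
print(chunks)  # ['aaaa.bbbb', 'bbbb.cccc', 'cccc.dddd']
```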

---

## RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` works like the `CharacterTextSplitter` in that it tries to keep paragraphs, sentences, and words together. In addition, it recursively splits any chunk that still exceeds `chunk_size`, using progressively finer separators.

**Parameters**

- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separators:** A list of separators used to split the text into chunks. The splitter first tries to split the text using the first separator in the list. If any resulting chunk exceeds the maximum size, it splits that chunk with the next separator in the list, and so on. The defaults are `["\n\n", "\n", " ", ""]`.
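
The fallback through the `separators` list can be sketched as a short recursive function. This is a minimal sketch under stated assumptions, not the component's real code: `recursive_split` is a hypothetical name, overlap is left out for brevity, and an empty-string separator is treated as "cut into fixed-width character slices".

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=1000):
    """Split `text` with the first separator, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; return the oversized text
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut into fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small neighbours back together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(text, chunk_size=12))
# ['alpha beta', 'gamma delta', 'epsilon']
```

In this example the first paragraph fits within 12 characters and is kept whole, while the second paragraph is too long and is split again on spaces.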

## LanguageRecursiveTextSplitter

The `LanguageRecursiveTextSplitter` divides source code into smaller chunks using separators chosen for the programming language of the text.

**Parameters**

- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separator_type:** The programming language of the text, which determines how it is split. Supported languages include Ruby, Python, Solidity, Java, and more. The default is `Python`.
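
To illustrate what language-aware splitting means, the sketch below cuts source code just before top-level structural keywords. Everything here is an assumption made for illustration: the `STRUCTURAL_SEPARATORS` table and the `split_top_level` helper are hypothetical, and the separator lists the component actually uses for each language may differ.

```python
import re

# Hypothetical, illustrative separator table; the component's real lists may differ.
STRUCTURAL_SEPARATORS = {
    "python": ["\nclass ", "\ndef "],
    "ruby": ["\nclass ", "\nmodule ", "\ndef "],
}


def split_top_level(source, separator_type="python"):
    """Split source code just before each top-level structural keyword."""
    separators = STRUCTURAL_SEPARATORS[separator_type]
    # A lookahead splits before each separator, so the keyword stays with its block.
    pattern = "(?=" + "|".join(re.escape(s) for s in separators) + ")"
    return [part.strip("\n") for part in re.split(pattern, source) if part.strip()]


source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    pass\n"
for chunk in split_top_level(source):
    print(repr(chunk))
```

This sketch ignores `chunk_size` and `chunk_overlap`. It only shows how the choice of language changes where the text is cut.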