langflow/docs/docs/components/text-splitters.mdx
Mendon Kissling ba59f077a2
[Docs] - Cleanup Components Folder (#1852)
2024-05-07 18:39:40 -03:00


import Admonition from "@theme/Admonition";
# Text Splitters
<Admonition type="caution" icon="🚧" title="ZONE UNDER CONSTRUCTION">
<p>
Thank you for your patience as we enhance our documentation. It might
currently have some rough edges. Please share your feedback or report any
issues to assist us in improving! 🛠️📝
</p>
</Admonition>
A text splitter is a tool that divides a document or text into smaller chunks or segments. This helps make large texts more manageable for analysis or processing.
---
### CharacterTextSplitter
The `CharacterTextSplitter` splits a long text into smaller chunks based on a specified character. It aims to keep paragraphs, sentences, and words intact as much as possible since these are semantically related elements of text.
**Parameters**
- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks. Overlap preserves context across chunk boundaries and reduces the risk of losing information that straddles two chunks. For example, with a `chunk_overlap` of 20 and a `chunk_size` of 100, the last 20 characters of each chunk are repeated as the first 20 characters of the next chunk. The default is `200`.
- **chunk_size:** The maximum number of characters in each chunk. If the text exceeds the specified `chunk_size`, it is divided into multiple chunks. Because splits occur only at the separator, chunk lengths can vary, and the last chunk may be smaller if fewer characters remain. The default is `1000`.
- **separator:** The character used to split the text into chunks. The default is `.`.
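
The interaction between `separator`, `chunk_size`, and `chunk_overlap` can be sketched in plain Python. This is a simplified model for illustration only, not the code the component runs: the function name `split_by_separator` is hypothetical, and the overlap here is carried over in whole separator-delimited pieces rather than exact character counts.

```python
def split_by_separator(text, separator=".", chunk_size=1000, chunk_overlap=200):
    """Simplified sketch: split on a separator, then merge pieces into chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Break the text at every separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks, current = [], []
    for piece in pieces:
        # Close the current chunk if adding this piece would exceed chunk_size.
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Small values make the overlap easy to see.
print(split_by_separator("aaaa.bbbb.cccc.dddd", chunk_size=9, chunk_overlap=4))
# → ['aaaa.bbbb', 'bbbb.cccc', 'cccc.dddd']
```

In this sketch, a single piece longer than `chunk_size` is emitted as one oversized chunk, because there is no separator inside it to split on.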
---
### RecursiveCharacterTextSplitter
The `RecursiveCharacterTextSplitter` functions similarly to the `CharacterTextSplitter` by trying to keep paragraphs, sentences, and words together. It also recursively splits any chunk that still exceeds `chunk_size`, working through a list of separators in order.
**Parameters**
- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separators:** A list of characters used to split the text into chunks. The splitter first tries to split text using the first character in the `separators` list. If any chunk exceeds the maximum size, it proceeds to the next character in the list and continues splitting. The defaults are `["\n\n", "\n", " ", ""]`.
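
The fallback through the `separators` list can be illustrated with a short recursive function. This is a minimal sketch under stated assumptions: `recursive_split` is a hypothetical name, and the sketch leaves out `chunk_overlap` and the merging of small neighboring pieces, so it shows only the order in which separators are tried.

```python
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(text, chunk_size=1000, separators=DEFAULT_SEPARATORS):
    """Simplified sketch: try each separator in turn until pieces fit."""
    # Text that already fits needs no further splitting.
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left to try: return the oversized text unchanged.
    if not separators:
        return [text]

    sep, rest = separators[0], separators[1:]

    # The empty separator means "cut anywhere", so slice by length.
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks = []
    for piece in text.split(sep):
        # Pieces that are still too long fall through to the next separator.
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


text = "alpha beta\n\ngamma delta epsilon\nzeta"
print(recursive_split(text, chunk_size=12))
# → ['alpha beta', 'gamma', 'delta', 'epsilon', 'zeta']
```

Here the first paragraph fits after splitting on `"\n\n"`, while the second paragraph is still too long and is split again on `"\n"` and then on `" "`.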
---
### LanguageRecursiveTextSplitter
The `LanguageRecursiveTextSplitter` divides text into smaller chunks based on the programming language of the text.
**Parameters**
- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separator_type:** The programming language of the input text, which determines the language-specific separators used for splitting. Supported languages include Ruby, Python, Solidity, Java, and more. The default is `Python`.
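
One way to picture language-based splitting is as a recursive splitter whose separator list is chosen by `separator_type`. The sketch below is illustrative only: the `ILLUSTRATIVE_SEPARATORS` table and the `split_code` function are hypothetical, the separator lists are assumptions rather than the lists the component actually uses, and `chunk_overlap` is omitted.

```python
# Hypothetical separator lists; the real component may use different ones.
ILLUSTRATIVE_SEPARATORS = {
    "Python": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
    "Ruby": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
}


def split_code(code, separator_type="Python", chunk_size=1000, _separators=None):
    """Simplified sketch: recursive split using language-specific separators."""
    if _separators is None:
        _separators = ILLUSTRATIVE_SEPARATORS[separator_type]
    if len(code) <= chunk_size or not _separators:
        return [code] if code else []

    sep, rest = _separators[0], _separators[1:]
    if sep == "":
        return [code[i:i + chunk_size] for i in range(0, len(code), chunk_size)]

    # Keep each separator attached to the piece that follows it, so keywords
    # such as "def" stay with the definition they introduce.
    parts = code.split(sep)
    pieces = [parts[0]] + [sep + part for part in parts[1:]]

    chunks = []
    for piece in pieces:
        chunks.extend(split_code(piece, separator_type, chunk_size, rest))
    return chunks


source = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
for chunk in split_code(source, separator_type="Python", chunk_size=30):
    print(repr(chunk))
```

Because separators are kept rather than discarded, joining the chunks from this sketch reproduces the original source, and each function definition lands in its own chunk when it fits within `chunk_size`.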