
import Admonition from "@theme/Admonition";

# Text Splitters

<Admonition type="caution" icon="🚧" title="ZONE UNDER CONSTRUCTION">
  <p>
    Thank you for your patience as we enhance our documentation. It might
    currently have some rough edges. Please share your feedback or report any
    issues to assist us in improving! 🛠️📝
  </p>
</Admonition>

A text splitter is a tool that divides a document or text into smaller chunks or segments. This helps make large texts more manageable for analysis or processing.

---

### CharacterTextSplitter

The `CharacterTextSplitter` splits a long text into smaller chunks based on a specified character. It aims to keep paragraphs, sentences, and words intact as much as possible since these are semantically related elements of text.

**Parameters**

- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters shared between consecutive chunks. Overlap preserves context across chunk boundaries, so information near a boundary is not lost. For example, with a `chunk_size` of 100 and a `chunk_overlap` of 20, the last 20 characters of each chunk are repeated as the first 20 characters of the next chunk. The default is `200`.
- **chunk_size:** The maximum number of characters in each chunk. Text longer than `chunk_size` is divided into multiple chunks at separator boundaries, so chunk lengths can vary and the last chunk may be shorter than the rest. The default is `1000`.
- **separator:** The character used to split the text into chunks. The default is `.`.
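
The interaction between `separator`, `chunk_size`, and `chunk_overlap` can be sketched in plain Python. This is a simplified illustration, not Langflow's implementation: the function name `split_on_separator` is hypothetical, the separator is assumed to be non-empty, overlap is carried over in whole pieces rather than exact character counts, and a single piece longer than `chunk_size` is emitted as-is.

```python
def split_on_separator(
    text: str,
    separator: str = ".",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> list[str]:
    """Split on `separator`, then greedily merge pieces into chunks (sketch)."""
    # Assumption: `separator` is a non-empty string.
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []  # pieces that make up the chunk being built

    for piece in pieces:
        # Would adding this piece push the chunk past chunk_size?
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces into the next chunk as overlap: drop pieces
            # from the front until what remains fits within chunk_overlap and
            # leaves room for the new piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Small sizes keep the output readable.
print(split_on_separator("aaaa.bbbb.cccc.dddd", ".", chunk_size=10, chunk_overlap=4))
# ['aaaa.bbbb', 'bbbb.cccc', 'cccc.dddd']
```

With `chunk_overlap=0`, the same input yields `['aaaa.bbbb', 'cccc.dddd']`, because no pieces are carried over between chunks.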

---

### RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` works like the `CharacterTextSplitter` in that it tries to keep paragraphs, sentences, and words together. In addition, it recursively splits any chunk that still exceeds `chunk_size`, using progressively finer separators.

**Parameters**

- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separators:** A list of strings used to split the text into chunks. The splitter first tries to split the text using the first separator in the list. If any resulting chunk exceeds `chunk_size`, it moves on to the next separator in the list and splits that chunk again. The default is `["\n\n", "\n", " ", ""]`.
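
The fallback through the `separators` list can be sketched as a short recursive function. This is a minimal illustration of the behavior described above, under stated assumptions: the name `recursive_split` is hypothetical, separators are discarded rather than kept, and the sketch omits `chunk_overlap` and the step that merges small pieces back together.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse on oversized pieces (sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # An empty separator (or an exhausted list) means "cut anywhere":
    # fall back to fixed-width slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(separator):
        # Pieces that are still too long are retried with the next separator.
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


text = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=12))
# ['alpha beta', 'gamma', 'delta', 'epsilon']
```

The first paragraph fits within 12 characters and is kept whole, while the second is too long and falls through to the space separator.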

---

### LanguageRecursiveTextSplitter

The `LanguageRecursiveTextSplitter` divides source code into smaller chunks using separators chosen for the programming language of the text.

**Parameters**

- **Documents:** The input documents to split.
- **chunk_overlap:** The number of characters that overlap between consecutive chunks.
- **chunk_size:** The maximum number of characters in each chunk.
- **separator_type:** The programming language of the input text, which determines the separators used for splitting. Supported languages include Ruby, Python, Solidity, Java, and more. The default is `Python`.
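
Language-aware splitting can be pictured as recursive splitting with a separator list picked per language. The sketch below is illustrative only: the `LANGUAGE_SEPARATORS` table, its contents, and the `split_code` function are assumptions made for this example and are not read from Langflow. Unlike plain splitting, each separator stays attached to the piece that follows it, so a chunk still begins with its `def` or `class` keyword.

```python
import re

# Hypothetical separator table for this sketch (not Langflow's actual lists).
LANGUAGE_SEPARATORS: dict[str, list[str]] = {
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
    "ruby": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
}


def _split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    # A zero-width lookahead splits *before* each separator, so the separator
    # stays at the start of the piece that follows it.
    pieces = [p for p in re.split(f"(?={re.escape(separator)})", text) if p]
    chunks: list[str] = []
    for piece in pieces:
        chunks.extend(_split(piece, remaining, chunk_size))
    return chunks


def split_code(code: str, language: str = "python", chunk_size: int = 1000) -> list[str]:
    """Split source code with separators chosen for `language` (sketch)."""
    return _split(code, LANGUAGE_SEPARATORS[language.lower()], chunk_size)


sample = "import os\ndef a():\n    return 1\ndef b():\n    return 2"
for chunk in split_code(sample, "Python", chunk_size=30):
    print(repr(chunk))
```

In this example the import line and each function land in separate chunks, and joining the chunks reproduces the original source because no characters are discarded.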