diff --git a/docs/docs/Integrations/Nvidia/integrations-nvidia-ingest.md b/docs/docs/Integrations/Nvidia/integrations-nvidia-ingest.md new file mode 100644 index 000000000..1a57da344 --- /dev/null +++ b/docs/docs/Integrations/Nvidia/integrations-nvidia-ingest.md @@ -0,0 +1,97 @@ +--- +title: Integrate Nvidia Ingest with Langflow +slug: /integrations-nvidia-ingest +--- + +The **NVIDIA Ingest** component integrates with the [NVIDIA nv-ingest](https://github.com/NVIDIA/nv-ingest) microservice for data ingestion, processing, and extraction of text files. + +The `nv-ingest` service supports multiple extraction methods for PDF, DOCX, and PPTX file types, and includes pre- and post-processing services like splitting, chunking, and embedding generation. + +The **NVIDIA Ingest** component imports the NVIDIA `Ingestor` client, ingests files with requests to the NVIDIA ingest endpoint, and outputs the processed content as a list of [Data](/concepts-objects#data-object) objects. `Ingestor` accepts additional configuration options for data extraction from other text formats. To configure these options, see the [component parameters](/integrations-nvidia-ingest#parameters). + +## Prerequisites + +* An NVIDIA Ingest endpoint. For more information on setting up an NVIDIA Ingest endpoint, see the [NVIDIA Ingest quickstart](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart). + +* The **NVIDIA Ingest** component requires the installation of additional dependencies to your Langflow environment. To install the dependencies in a virtual environment, run the following commands: +```bash +source **YOUR_LANGFLOW_VENV**/bin/activate +uv sync --extra nv-ingest +uv run langflow run +``` + +## Use the NVIDIA Ingest component in a flow + +The **NVIDIA Ingest** component accepts **Message** inputs and outputs **Data**. The component calls a NVIDIA Ingest microservice's endpoint to ingest a local file and extract the text. + +To use the NVIDIA Ingest component in your flow, follow these steps: +1. In the component library, click the **NVIDIA Ingest** component, and then drag it onto the canvas. +2. In the **NVIDIA Ingestion URL** field, enter the URL of the NVIDIA Ingest endpoint. +Optionally, add the endpoint URL as a **Global variable**: + 1. Click **Settings**, and then click **Global Variables**. + 2. Click **Add New**. + 3. Name your variable. Paste your endpoint in the **Value** field. + 4. In the **Apply To Fields** field, select the field you want to globally apply this variable to. In this case, select **NVIDIA Ingestion URL**. + 5. Click **Save Variable**. +3. In the **Path** field, enter the path to the file you want to ingest. +4. Select which text type to extract from the file. +The component supports text, charts, and tables. +5. Select whether to split the text into chunks. +Modify the splitting parameters in the component's **Configuration** tab. +7. Click **Run** to ingest the file. +8. To confirm the component is ingesting the file, open the **Logs** pane to view the output of the flow. +9. To store the processed data in a vector database, add an **AstraDB Vector** component to your flow, and connect the **NVIDIA Ingest** component to the **AstraDB Vector** component with a **Data** output. + +![NVIDIA Ingest component flow](nvidia-component-ingest-astra.png) + +10. Run the flow. +Inspect your Astra DB vector database to view the processed data. + +## NVIDIA Ingest component parameters {#parameters} + +The **NVIDIA Ingest** component has the following parameters. + +For more information, see the [NV-Ingest documentation](https://nvidia.github.io/nv-ingest/user-guide/). + +### Inputs + +| Name | Display Name | Info | +|------|--------------|------| +| base_url | NVIDIA Ingestion URL | The URL of the NVIDIA Ingestion API. | +| path | Path | File path to process. | +| extract_text | Extract Text | Extract text from documents. Default: `True`. | +| extract_charts | Extract Charts | Extract text from charts. Default: `False`. | +| extract_tables | Extract Tables | Extract text from tables. Default: `True`. | +| text_depth | Text Depth | The level at which text is extracted. Support for 'block', 'line', and 'span' varies by document type. Default: `document`. | +| split_text | Split Text | Split text into smaller chunks. Default: `True`. | +| split_by | Split By | How to split into chunks. 'size' splits by number of characters. Default: `word`. | +| split_length | Split Length | The size of each chunk based on the 'split_by' method. Default: `200`. | +| split_overlap | Split Overlap | The number of segments to overlap from the previous chunk. Default: `20`. | +| max_character_length | Max Character Length | The maximum number of characters in each chunk. Default: `1000`. | +| sentence_window_size | Sentence Window Size | The number of sentences to include from previous and following chunks when `split_by=sentence`. Default: `0`. | + +### Outputs + +The **NVIDIA Ingest** component outputs a list of [Data](/concepts-objects#data-object) objects where each object contains: +- `text`: The extracted content. + - For text documents: The extracted text content. + - For tables and charts: The extracted table/chart content. +- `file_path`: The source file name and path. +- `document_type`: The type of the document ("text" or "structured"). +- `description`: Additional description of the content. + +The output varies based on the `document_type`: + +- Documents with `document_type: "text"` contain: + - Raw text content extracted from documents, for example, paragraphs from PDFs or DOCX files. + - Content stored directly in the `text` field. + - Content extracted using the `extract_text` parameter. + +- Documents with `document_type: "structured"` contain: + - Text extracted from tables and charts and processed to preserve structural information. + - Content extracted using the `extract_tables` and `extract_charts` parameters. + - Content stored in the `text` field after being processed from the `table_content` metadata. + +:::note +Images are currently not supported and will be skipped during processing. +::: \ No newline at end of file diff --git a/docs/docs/Integrations/Nvidia/nvidia-component-ingest-astra.png b/docs/docs/Integrations/Nvidia/nvidia-component-ingest-astra.png new file mode 100644 index 000000000..1d0af51cb Binary files /dev/null and b/docs/docs/Integrations/Nvidia/nvidia-component-ingest-astra.png differ diff --git a/docs/sidebars.js b/docs/sidebars.js index 40d1001fc..049e261de 100644 --- a/docs/sidebars.js +++ b/docs/sidebars.js @@ -123,6 +123,13 @@ module.exports = { "Integrations/Notion/notion-agent-meeting-notes", ], }, + { + type: "category", + label: "NVIDIA", + items: [ + "Integrations/Nvidia/integrations-nvidia-ingest", + ], + }, ], }, {