Knowledge

The Knowledge node lets your workflow process documents by embedding their content into a searchable vector store, then querying that content to extract structured information using AI. It’s how you build document understanding into your automations.

How it works

The Knowledge node has two modes: Embed and Query. You typically use both in a workflow:

Embed a document (e.g., a PDF invoice fetched from an SFTP server); the node parses it, splits it into chunks, and stores the embeddings in a vector database.
Query the embedded content; the node searches for relevant chunks and uses an AI model to extract or summarise the information you need.
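
The two steps above can be pictured with a small, self-contained sketch. It is not the platform's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the function names, chunk size, and sample text are all assumptions chosen for illustration. It only shows the general shape of chunking, embedding, and similarity search.

```python
import math
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 8) -> list[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(word.strip(".,:;?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def embed_document(text: str, chunk_size: int = 8) -> list[tuple[str, Counter]]:
    """Embed step: chunk the text and store each chunk with its vector."""
    return [(chunk, toy_embed(chunk)) for chunk in split_into_chunks(text, chunk_size)]


def query(store: list[tuple[str, Counter]], prompt: str, result_count: int = 5) -> list[str]:
    """Query step: return the `result_count` chunks most similar to the prompt."""
    prompt_vector = toy_embed(prompt)
    ranked = sorted(store, key=lambda item: cosine(prompt_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:result_count]]


document = (
    "Invoice 1001 issued to Acme Ltd. "
    "Line item: 3 widgets at 4.50 each. "
    "Payment is due within 30 days of the invoice date."
)
store = embed_document(document, chunk_size=6)
top = query(store, "When is payment due?", result_count=1)
print(top[0])
```

In the real node, the retrieved chunks are then passed to an AI model for extraction or summarisation; that step is omitted here.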

Embedded documents are stored in collections scoped to your workspace. Any workflow in your workspace can query a collection, so you can embed documents once and use them across multiple workflows.

Modes

Embed mode

In embed mode, the node takes a document as input, parses it using the appropriate document loader, splits it into chunks, and stores the vector embeddings. Configuration:

Field	Required	Description
Collection	Yes	The collection to store embeddings in. Choose a persistent collection from your workspace, or Transient for single-run processing.
File Name	Yes*	The name to register the document under (e.g., `invoice.pdf` or `{{ trigger.fileName }}`). Supports variable references. If a file with this name already exists, it will be overwritten. Not required for transient collections.
Document Type	Yes	The format of the input document (for example PDF, CSV, JSON, HTML, or Plain Text; see Supported document types below).
Document Input	Yes	A reference to the base64-encoded document from a previous step (e.g., `{{ sftp_result.data.content }}`, where `sftp_result.data` is the file payload `{ path, content, encoding, size }`).
Embedding Model	No	Which embedding model to use (hidden for transient; uses default).
Output Variable	No	Variable name to store the result (chunk count, collection metadata).
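
To make the Document Input field more concrete, here is a minimal sketch of how a reference such as `{{ sftp_result.data.content }}` could resolve to base64 text and then to raw bytes. The `resolve` helper, the sample path, and the `"base64"` encoding value are assumptions for illustration only; the platform's actual variable resolution is not documented here.

```python
import base64
import re

# Hypothetical run context: a previous SFTP step stored its result as `sftp_result`.
pdf_bytes = b"%PDF-1.4 example bytes"
context = {
    "sftp_result": {
        "data": {
            "path": "/inbox/invoice.pdf",  # illustrative path
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "encoding": "base64",  # assumed value
            "size": len(pdf_bytes),
        }
    }
}

# Matches a string that consists of exactly one `{{ dotted.path }}` reference.
REFERENCE = re.compile(r"^\{\{\s*([A-Za-z_][\w.]*)\s*\}\}$")


def resolve(reference: str, ctx: dict):
    """Resolve a whole-string `{{ a.b.c }}` reference by walking nested dicts."""
    match = REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"not a variable reference: {reference!r}")
    value = ctx
    for key in match.group(1).split("."):
        value = value[key]
    return value


encoded = resolve("{{ sftp_result.data.content }}", context)
decoded = base64.b64decode(encoded, validate=True)
print(len(decoded), "bytes decoded")
```

The point of the sketch is that the field expects the `content` string itself, not the whole `sftp_result.data` payload.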

Supported document types:

Type	Description
PDF	PDF documents, parsed page by page
Word (DOCX/DOC)	Microsoft Word documents
Excel (XLSX/XLS)	Spreadsheets, with rows and cells extracted as text
PowerPoint (PPTX/PPT)	Presentations, with text extracted from slides
CSV / TSV	Delimited data files
JSON	JSON files, with content extracted as text
XML	XML documents
HTML	HTML pages, with text extracted and tags stripped
Plain Text	Raw text files
Markdown	Markdown (.md) files
RTF	Rich Text Format documents
Email (EML/MSG)	Email messages including headers and body
EPUB	E-book format
OpenDocument (ODT/ODS/ODP)	LibreOffice / OpenOffice documents
Images (PNG/JPG/TIFF/BMP)	Images, with text extracted via OCR
Web Page (URL)	Fetches and parses a web page by URL (workflow only)

Query mode

In query mode, the node searches the vector store for chunks relevant to your prompt, then uses an AI model to synthesise or extract information from those chunks. Configuration:

Field	Required	Description
Collection	Yes	The collection to search. Choose a collection that already contains embedded documents, or Transient to query documents embedded earlier in the same run.
Prompt	Yes	A natural-language description of what you want to extract or answer.
Model	No	Which AI model to use for synthesis; see Models for options.
Embedding Model	No	Which embedding model to use for the search query (should match the model used during embedding). Hidden for transient collections.
Result Count	No	Number of document chunks to retrieve (default: 5).
Response Schema	No	Define a JSON schema to force the AI to return structured data. See Response Schema below.
Output Variable	No	Variable name to store the extracted result.

For best results, use the same embedding model for both the Embed and Query steps. Mixing models will produce poor search results because the vector spaces won’t align.

Collections

Collections are workspace-scoped, so every workflow in your workspace shares the same collections. This means you can:

Embed a company policy document once and query it from any workflow
Build up a collection over time by embedding new documents in each workflow run
Share knowledge across different automations

You can manage your collections (view documents, upload files, delete) from the Knowledge section in the platform sidebar.
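
As a mental model, a persistent collection behaves like a store keyed by file name: embedding a second file adds to it, while embedding under an existing name replaces the earlier version. The `Collection` class below is a hypothetical toy, not a platform API, and the file names and chunk text are made up.

```python
class Collection:
    """Toy model of a persistent collection keyed by file name."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.documents: dict[str, list[str]] = {}

    def embed(self, file_name: str, chunks: list[str]) -> int:
        # Re-embedding under an existing file name replaces the old chunks.
        self.documents[file_name] = list(chunks)
        return len(chunks)

    def all_chunks(self) -> list[str]:
        """Everything a query against this collection could search."""
        return [chunk for chunks in self.documents.values() for chunk in chunks]


policies = Collection("company-policies")
policies.embed("leave-policy.pdf", ["Employees get 20 days.", "Carry-over is capped."])
policies.embed("travel-policy.pdf", ["Book economy class."])
# Same file name again: the earlier leave policy is overwritten, not duplicated.
policies.embed("leave-policy.pdf", ["Employees get 25 days."])
print(policies.all_chunks())
```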

Transient collections

For one-off document processing (where you embed, query, and discard), select Transient from the collection dropdown. Transient collections:

Are automatically created for each workflow run
Are shared across all nodes in the same run that select “Transient”
Are automatically cleaned up when the workflow completes (success or failure)
Don’t require a file name or embedding model selection

This is ideal for use cases like invoice extraction, where a new document arrives regularly and you only need to process it once.
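
The lifecycle described above (created per run, shared within the run, removed on success or failure) resembles a context manager. The sketch below is an analogy under that assumption; `transient_collection`, the `collections` registry, and the run IDs are hypothetical names, not part of the platform.

```python
from contextlib import contextmanager

# Hypothetical in-memory registry of collections, keyed by name.
collections: dict[str, list[str]] = {}


@contextmanager
def transient_collection(run_id: str):
    """Create a collection for one run and always remove it afterwards."""
    name = f"transient-{run_id}"
    collections[name] = []
    try:
        yield collections[name]
    finally:
        # Runs whether the body succeeded or raised.
        del collections[name]


# Successful run: the collection exists only inside the block.
with transient_collection("run-1") as chunks:
    chunks.append("Invoice 1001: 3 widgets")
    seen_during_run = list(collections)

# Failed run: cleanup still happens.
failed = False
try:
    with transient_collection("run-2") as chunks:
        chunks.append("Invoice 1002")
        raise RuntimeError("downstream step failed")
except RuntimeError:
    failed = True

print(seen_during_run, failed, collections)
```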

Response Schema

In query mode, you can define a response schema to force the AI to return structured data instead of free-form text. This is useful when downstream nodes need to work with specific fields. The schema editor has two modes:

Visual: add properties with name, type, description, and required flags
JSON: paste or edit raw JSON schema directly

Supported property types: String, Number, Boolean, Array, and Object. Arrays and objects can contain nested properties.

Example: extract invoice line items as structured data:

{
  "type": "object",
  "description": "Invoice line items",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string", "description": "Product SKU" },
          "quantity": { "type": "number", "description": "Quantity ordered" },
          "price": { "type": "number", "description": "Unit price" }
        }
      }
    },
    "total": { "type": "number", "description": "Invoice total" }
  },
  "required": ["products", "total"]
}

Response schema is also available on Connector nodes in Agent mode. The same visual/JSON editor is used.
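
If you want to double-check a structured result in your own code, a small validator can help. The sketch below re-states the invoice schema above (without descriptions) and checks only `type`, `properties`, `items`, and `required`; it is a deliberately minimal subset written for illustration, not a full JSON Schema implementation and not how the platform enforces the schema. The sample responses are invented.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "quantity": {"type": "number"},
                    "price": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
    },
    "required": ["products", "total"],
}


def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Return a list of problems; an empty list means the value conforms."""
    errors: list[str] = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: missing required property")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub_schema, f"{path}.{name}"))
    elif kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for index, item in enumerate(value):
            errors.extend(validate(item, schema.get("items", {}), f"{path}[{index}]"))
    elif kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{path}: expected number")
    elif kind == "string":
        if not isinstance(value, str):
            errors.append(f"{path}: expected string")
    elif kind == "boolean":
        if not isinstance(value, bool):
            errors.append(f"{path}: expected boolean")
    return errors


good = {"products": [{"sku": "A-1", "quantity": 3, "price": 4.5}], "total": 13.5}
bad = {"products": [{"sku": "A-1", "quantity": "three"}]}

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```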

Embedding models

Two embedding models are available:

Model	Dimensions	Best for
Gemini Embedding 001 (default)	3,072	Higher accuracy, complex documents
Text Embedding 004	768	Faster, lighter, good for most use cases

You can select the embedding model in the properties panel. If not set, the default model is used.
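
One way to see why the Embed and Query steps should share a model: the two models produce vectors of different lengths, and a similarity score between vectors of different lengths is not defined. The sketch below uses the dimensions from the table; the dictionary keys are illustrative labels, and how the platform itself reacts to a mismatch is not specified here.

```python
import math

# Dimensions taken from the table above; the keys are illustrative labels.
DIMENSIONS = {"gemini-embedding-001": 3072, "text-embedding-004": 768}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; refuses vectors of different lengths."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


stored = [1.0] * DIMENSIONS["gemini-embedding-001"]
same_model_query = [1.0] * DIMENSIONS["gemini-embedding-001"]
other_model_query = [1.0] * DIMENSIONS["text-embedding-004"]

print(cosine_similarity(stored, same_model_query))
try:
    cosine_similarity(stored, other_model_query)
except ValueError as error:
    print(error)
```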

Examples

Extract invoice line items (transient)

A workflow that fetches a PDF invoice and extracts structured data without polluting a persistent collection:

Trigger (Manual or Webhook)
Connector (SFTP, Direct mode): fetch the PDF file, output: sftp_result
Knowledge (Embed mode)
- Collection: Transient
- Document Type: PDF
- Document Input: {{ sftp_result.data.content }}
Knowledge (Query mode)
- Collection: Transient
- Prompt: Extract all line items with SKU, quantity, and unit price
- Response Schema: define products array with sku, quantity, price fields
- Output Variable: invoice_data

After the workflow completes, the transient collection is automatically cleaned up.
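
Once `invoice_data` follows a schema, downstream logic can rely on its fields. The sketch below assumes the result has the `products` and `total` shape from the Response Schema example; the values are invented, and the reconciliation check is just one possible use, written as plain Python rather than as a platform node.

```python
import math

# Hypothetical value of the `invoice_data` output variable.
invoice_data = {
    "products": [
        {"sku": "WID-001", "quantity": 3, "price": 4.50},
        {"sku": "GAD-002", "quantity": 1, "price": 20.00},
    ],
    "total": 33.50,
}


def line_total(item: dict) -> float:
    """Quantity multiplied by unit price for one line item."""
    return item["quantity"] * item["price"]


# Recompute the total from the line items and compare it with the extracted total.
computed = round(sum(line_total(item) for item in invoice_data["products"]), 2)
matches = math.isclose(computed, invoice_data["total"], abs_tol=0.005)
print(computed, matches)
```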

Build a persistent knowledge base

Embed company documents once, then query them from any workflow.

Workflow 1: Index documents (run once or on schedule):

Trigger (Schedule, weekly)
Connector (SFTP/HTTP, Direct mode): fetch updated policy documents, output: sftp_result
Knowledge (Embed mode)
- Collection: company-policies
- File Name: {{ sftp_result.data.path }}
- Document Type: PDF
- Document Input: {{ sftp_result.data.content }}

Workflow 2: Answer policy questions (run on demand):

Trigger (Webhook, receives a question)
Knowledge (Query mode)
- Collection: company-policies
- Prompt: {{ trigger.question }}
- Output Variable: answer
Connector (Slack, Agent mode): send the answer back

Tips

Use transient collections for one-off processing. If you’re extracting data from a single document per run (invoices, receipts, forms), transient mode keeps your workspace clean.
Use descriptive file names for persistent collections. Names like {{ trigger.fileName }} or invoice-2024-001.pdf help you identify documents in the collection grid.
Match your embedding models. Always use the same embedding model for embedding and querying the same collection.
Start with specific prompts. Instead of “what’s in this document?”, try “extract all line items with quantities and prices as a JSON array”.
Use response schemas for reliable downstream processing. When a Transform or Condition node needs specific fields, define a response schema to guarantee the structure.
