The Knowledge node lets your workflow process documents — embedding their content into a searchable vector store, then querying that content to extract structured information using AI. It’s how you build document understanding into your automations.
How it works
The Knowledge node has two modes: Embed and Query. You typically use both in a workflow:
- Embed a document (e.g., a PDF invoice fetched from an SFTP server) — the node parses it, splits it into chunks, and stores the embeddings in a vector database.
- Query the embedded content — the node searches for relevant chunks and uses an AI model to extract or summarise the information you need.
Embedded documents are stored in collections scoped to your workspace. Any workflow in your workspace can query a collection, so you can embed documents once and use them across multiple workflows.
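The embed-then-query flow can be illustrated with a toy, in-memory sketch. This is not how the platform is implemented: `ToyCollection`, `chunk_text`, and `embed` are hypothetical names, the "embedding" is a plain bag-of-words count rather than a learned model, and the AI synthesis step is left out. It only shows the shape of the idea: chunk, vectorise, store, then rank chunks by similarity to a prompt.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 8) -> list[str]:
    """Split text into fixed-size word windows (a stand-in for real chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyCollection:
    """Hypothetical in-memory collection of (file_name, chunk, vector) entries."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str, Counter]] = []

    def embed_document(self, file_name: str, text: str) -> int:
        # Re-embedding under an existing file name replaces the old chunks.
        self.entries = [e for e in self.entries if e[0] != file_name]
        chunks = chunk_text(text)
        self.entries.extend((file_name, chunk, embed(chunk)) for chunk in chunks)
        return len(chunks)

    def query(self, prompt: str, result_count: int = 5) -> list[str]:
        # Rank every stored chunk against the prompt and keep the best matches.
        prompt_vector = embed(prompt)
        ranked = sorted(self.entries, key=lambda e: cosine(prompt_vector, e[2]), reverse=True)
        return [chunk for _, chunk, _ in ranked[:result_count]]


DOCUMENT = (
    "Invoice 1001 issued to Acme Corp on Monday. "
    "Line item SKU A-17 quantity 3 price 9.50. "
    "Payment is due within thirty days of receipt."
)

collection = ToyCollection()
chunk_count = collection.embed_document("invoice.pdf", DOCUMENT)
top = collection.query("SKU quantity price", result_count=1)
```

In the real node, the retrieved chunks would then be passed to an AI model together with the prompt; here `top` simply holds the best-matching chunk.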
Modes
Embed mode
In embed mode, the node takes a document as input, parses it using the appropriate document loader, splits it into chunks, and stores the vector embeddings.
Configuration:
| Field | Required | Description |
|---|---|---|
| Collection | Yes | The collection to store embeddings in. Choose a persistent collection from your workspace, or Transient for single-run processing. |
| File Name | Yes* | The name to register the document under (e.g., invoice.pdf or {{ trigger.fileName }}). Supports variable references. If a file with this name already exists, it will be overwritten. Not required for transient collections. |
| Document Type | Yes | The format of the input document, for example PDF, CSV, JSON, HTML, or Plain Text. The full list is under Supported document types. |
| Document Input | Yes | A reference to the base64-encoded document from a previous step (e.g., {{ sftp_result.data.content }} — sftp_result.data is the file payload { path, content, encoding, size }). |
| Embedding Model | No | Which embedding model to use (hidden for transient — uses default). |
| Output Variable | No | Variable name to store the result (chunk count, collection metadata). |
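Because Document Input expects base64-encoded content, it can help to see what that encoding looks like. The sketch below builds a payload shaped like the `{ path, content, encoding, size }` object mentioned above and decodes it again. The function names are hypothetical, and the `"base64"` value for `encoding` and the meaning of `size` (raw byte count) are assumptions, since the source only names the fields.

```python
import base64


def build_file_payload(path: str, raw: bytes) -> dict:
    """Build a payload shaped like { path, content, encoding, size }."""
    return {
        "path": path,
        "content": base64.b64encode(raw).decode("ascii"),
        "encoding": "base64",  # assumed value; the source only names the field
        "size": len(raw),      # assumed to be the size of the raw bytes
    }


def decode_document_input(content: str) -> bytes:
    """Recover the original bytes from a base64 document input."""
    return base64.b64decode(content, validate=True)


payload = build_file_payload("/inbox/invoice.pdf", b"%PDF-1.7 sample bytes")
restored = decode_document_input(payload["content"])
```

Passing `validate=True` makes the decoder reject characters outside the base64 alphabet, which is a quick way to notice when a reference points at raw text instead of encoded content.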
Supported document types:
| Type | Description |
|---|---|
| PDF | PDF documents — parsed page by page |
| Word (DOCX/DOC) | Microsoft Word documents |
| Excel (XLSX/XLS) | Spreadsheets — rows and cells extracted as text |
| PowerPoint (PPTX/PPT) | Presentations — text extracted from slides |
| CSV / TSV | Delimited data files |
| JSON | JSON files — content extracted as text |
| XML | XML documents |
| HTML | HTML pages — text extracted, tags stripped |
| Plain Text | Raw text files |
| Markdown | Markdown (.md) files |
| RTF | Rich Text Format documents |
| Email (EML/MSG) | Email messages including headers and body |
| EPUB | E-book format |
| OpenDocument (ODT/ODS/ODP) | LibreOffice / OpenOffice documents |
| Images (PNG/JPG/TIFF/BMP) | Images — text extracted via OCR |
| Web Page (URL) | Fetches and parses a web page by URL (workflow only) |
Query mode
In query mode, the node searches the vector store for chunks relevant to your prompt, then uses an AI model to synthesise or extract information from those chunks.
Configuration:
| Field | Required | Description |
|---|---|---|
| Collection | Yes | The collection to search. Must match a collection with embedded documents, or Transient to query documents embedded earlier in the same run. |
| Prompt | Yes | A natural-language description of what you want to extract or answer. |
| Model | No | Which AI model to use for synthesis — see Models for options. |
| Embedding Model | No | Which embedding model to use for the search query (should match the model used during embedding). Hidden for transient collections. |
| Result Count | No | Number of document chunks to retrieve (default: 5). |
| Response Schema | No | Define a JSON schema to force the AI to return structured data. See Response Schema below. |
| Output Variable | No | Variable name to store the extracted result. |
For best results, use the same embedding model for both the Embed and Query steps. Mixing models produces poor search results because each model maps text into its own vector space, and the two available models do not even share a dimensionality (3,072 versus 768).
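A short sketch of why mismatched models cannot be compared: cosine similarity is only defined for vectors of the same length. The vectors below are placeholders, and how the platform itself reacts to a mismatch is not documented here; the sketch only shows the underlying arithmetic problem.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; refuses vectors of different dimensionality."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


stored_chunk = [0.1] * 3072  # placeholder for a chunk embedded with the 3,072-dimension model
query_vector = [0.1] * 768   # placeholder for a query embedded with the 768-dimension model

try:
    cosine_similarity(stored_chunk, query_vector)
    comparable = True
except ValueError:
    comparable = False
```

Even when two models happen to share a dimensionality, their coordinates mean different things, so a numerically valid score would still not reflect semantic closeness.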
Collections
Collections are workspace-scoped — every workflow in your workspace shares the same collections. This means you can:
- Embed a company policy document once and query it from any workflow
- Build up a collection over time by embedding new documents in each workflow run
- Share knowledge across different automations
You can manage your collections (view documents, upload files, delete) from the Knowledge section in the platform sidebar.
Transient collections
For one-off document processing — where you embed, query, and discard — select Transient from the collection dropdown. Transient collections:
- Are automatically created for each workflow run
- Are shared across all nodes in the same run that select “Transient”
- Are automatically cleaned up when the workflow completes (success or failure)
- Don’t require a file name or embedding model selection
This is ideal for use cases like invoice extraction, where a new document arrives regularly and you only need to process it once.
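The lifecycle described above (created per run, shared by the nodes in that run, removed on success or failure) resembles a context manager. The following is a conceptual sketch only: `live_collections`, `transient_collection`, and `run_workflow` are hypothetical names, not platform APIs.

```python
from contextlib import contextmanager

# Hypothetical in-memory registry: run_id -> list of embedded chunks.
live_collections: dict[str, list[str]] = {}


@contextmanager
def transient_collection(run_id: str):
    """Create a collection for one run and always remove it afterwards."""
    live_collections[run_id] = []
    try:
        yield live_collections[run_id]
    finally:
        # Runs whether the workflow body succeeded or raised.
        live_collections.pop(run_id, None)


def run_workflow(run_id: str, fail: bool = False) -> int:
    with transient_collection(run_id) as chunks:
        chunks.append("invoice chunk")  # stands in for the Embed node
        if fail:
            raise RuntimeError("downstream step failed")
        return len(chunks)              # the Query node sees the same collection
```

The `finally` clause is what mirrors the "success or failure" guarantee: cleanup does not depend on the run finishing normally.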
Response Schema
In query mode, you can define a response schema to force the AI to return structured data instead of free-form text. This is useful when downstream nodes need to work with specific fields.
The schema editor has two modes:
- Visual — add properties with name, type, description, and required flags
- JSON — paste or edit raw JSON schema directly
Supported property types: String, Number, Boolean, Array, and Object. Arrays and objects can contain nested properties.
Example: Extract invoice line items as structured data:
{
  "type": "object",
  "description": "Invoice line items",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string", "description": "Product SKU" },
          "quantity": { "type": "number", "description": "Quantity ordered" },
          "price": { "type": "number", "description": "Unit price" }
        }
      }
    },
    "total": { "type": "number", "description": "Invoice total" }
  },
  "required": ["products", "total"]
}
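To make the schema's meaning concrete, here is a minimal checker for the subset of JSON Schema used in the example (`type`, `properties`, `required`, `items`). It is a hand-rolled illustration, not a full JSON Schema validator and not the platform's own enforcement logic; `validate` and the sample values are hypothetical.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}

SCHEMA = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "quantity": {"type": "number"},
                    "price": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
    },
    "required": ["products", "total"],
}


def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Return a list of error strings; an empty list means the value conforms."""
    expected = schema.get("type")
    if expected and not TYPE_CHECKS[expected](value):
        return [f"{path}: expected {expected}"]
    errors: list[str] = []
    if expected == "object":
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: missing required property")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub_schema, f"{path}.{name}"))
    elif expected == "array" and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


good = {"products": [{"sku": "A-17", "quantity": 3, "price": 9.5}], "total": 28.5}
bad = {"products": [{"sku": 17, "quantity": "3"}]}

good_errors = validate(good, SCHEMA)
bad_errors = validate(bad, SCHEMA)
```

A check like this can be useful in a downstream step as a defensive guard, since it reports exactly which field is missing or has the wrong type.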
Response schema is also available on Connector nodes in Agent mode. The same visual/JSON editor is used.
Embedding models
Two embedding models are available:
| Model | Dimensions | Best for |
|---|---|---|
| Gemini Embedding 001 (default) | 3,072 | Higher accuracy, complex documents |
| Text Embedding 004 | 768 | Faster, lighter, good for most use cases |
You can select the embedding model in the properties panel. If not set, the default model is used.
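One practical consequence of the dimension counts is raw vector size. The arithmetic below assumes each dimension is stored as a 32-bit float, which the source does not state; actual storage on the platform may differ because of compression, indexes, or metadata.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

MODEL_DIMENSIONS = {
    "Gemini Embedding 001": 3072,
    "Text Embedding 004": 768,
}


def raw_vector_bytes(model: str, chunk_count: int) -> int:
    """Raw bytes needed for chunk_count vectors, ignoring any index overhead."""
    return MODEL_DIMENSIONS[model] * BYTES_PER_FLOAT32 * chunk_count


gemini_kib = raw_vector_bytes("Gemini Embedding 001", 1000) / 1024
text_kib = raw_vector_bytes("Text Embedding 004", 1000) / 1024
size_ratio = gemini_kib / text_kib
```

Under that assumption, the larger model needs four times the raw vector space per chunk, which is the trade-off behind the "faster, lighter" description of the smaller model.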
Examples
A workflow that fetches a PDF invoice and extracts structured data without polluting a persistent collection:
- Trigger (Manual or Webhook)
- Connector (SFTP, Direct mode): fetch the PDF file into the output variable sftp_result
- Knowledge (Embed mode)
  - Collection: Transient
  - Document Type: PDF
  - Document Input: {{ sftp_result.data.content }}
- Knowledge (Query mode)
  - Collection: Transient
  - Prompt: Extract all line items with SKU, quantity, and unit price
  - Response Schema: define a products array with sku, quantity, and price fields
  - Output Variable: invoice_data
After the workflow completes, the transient collection is automatically cleaned up.
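Once `invoice_data` holds structured output, a downstream step can work with it like any other object. The sketch below uses a hypothetical sample value shaped like the invoice response schema and checks whether the line items add up to the stated total. The sample data and helper names are illustrative assumptions.

```python
import math

# Hypothetical value of the invoice_data output variable.
invoice_data = {
    "products": [
        {"sku": "A-17", "quantity": 3, "price": 9.5},
        {"sku": "B-02", "quantity": 1, "price": 20.0},
    ],
    "total": 48.5,
}


def line_items_total(data: dict) -> float:
    """Sum quantity * unit price over all extracted line items."""
    return sum(item["quantity"] * item["price"] for item in data["products"])


def totals_match(data: dict, tolerance: float = 0.01) -> bool:
    """Compare the computed sum with the extracted total, within a tolerance."""
    return math.isclose(line_items_total(data), data["total"], abs_tol=tolerance)


computed = line_items_total(invoice_data)
consistent = totals_match(invoice_data)
```

A mismatch does not necessarily mean the extraction was wrong, since an invoice total may include tax or shipping, but it is a cheap signal for routing a document to manual review.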
Build a persistent knowledge base
Embed company documents once, then query them from any workflow:
Workflow 1 — Index documents (run once or on schedule):
- Trigger (Schedule, weekly)
- Connector (SFTP/HTTP, Direct mode): fetch updated policy documents
- Knowledge (Embed mode)
  - Collection: company-policies
  - File Name: {{ sftp_result.data.path }}
  - Document Type: PDF
Workflow 2 — Answer policy questions (run on demand):
- Trigger (Webhook, receives a question)
- Knowledge (Query mode)
  - Collection: company-policies
  - Prompt: {{ trigger.question }}
  - Output Variable: answer
- Connector (Slack, Agent mode): send the answer back
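Both workflows rely on variable references such as `{{ trigger.question }}` and `{{ sftp_result.data.path }}`. As a mental model only, a reference can be read as a dotted path into the outputs of earlier steps. The resolver below is a hypothetical illustration of that idea, not the platform's actual template engine, and the context values are made up.

```python
import re

# Matches {{ dotted.path }} with optional whitespace inside the braces.
REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(template: str, context: dict) -> str:
    """Replace each {{ a.b.c }} reference with the value found in context."""

    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the reference does not exist
        return str(value)

    return REFERENCE.sub(lookup, template)


# Hypothetical outputs of earlier steps in a run.
context = {
    "trigger": {"question": "How many days of annual leave do I get?"},
    "sftp_result": {"data": {"path": "/policies/leave.pdf"}},
}

prompt = resolve("{{ trigger.question }}", context)
file_name = resolve("{{ sftp_result.data.path }}", context)
```

Reading references this way makes it easier to spot a common mistake: pointing at a step whose output variable was never set, which in this sketch surfaces as a `KeyError`.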
Tips
- Use transient collections for one-off processing. If you’re extracting data from a single document per run (invoices, receipts, forms), transient mode keeps your workspace clean.
- Use descriptive file names for persistent collections. Names like {{ trigger.fileName }} or invoice-2024-001.pdf help you identify documents in the collection grid.
- Match your embedding models. Always use the same embedding model for embedding and querying the same collection.
- Start with specific prompts. Instead of “what’s in this document?”, try “extract all line items with quantities and prices as a JSON array”.
- Use response schemas for reliable downstream processing. When a Transform or Condition node needs specific fields, define a response schema to guarantee the structure.