The Knowledge node lets your workflow process documents — embedding their content into a searchable vector store, then querying that content to extract structured information using AI. It’s how you build document understanding into your automations.

How it works

The Knowledge node has two modes: Embed and Query. You typically use both in a workflow:
  1. Embed a document (e.g., a PDF invoice fetched from an SFTP server) — the node parses it, splits it into chunks, and stores the embeddings in a vector database.
  2. Query the embedded content — the node searches for relevant chunks and uses an AI model to extract or summarise the information you need.
Embedded documents are stored in collections scoped to your workspace. Any workflow in your workspace can query a collection, so you can embed documents once and use them across multiple workflows.
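
The embed-then-query flow described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the platform's implementation: `ToyCollection`, `chunk_text`, and `toy_embed` are hypothetical names, the "embedding" is a simple bag-of-words count rather than a real embedding model, and the final AI synthesis step is left out.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyCollection:
    """Hypothetical in-memory collection holding (file_name, chunk, vector) entries."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str, Counter]] = []

    def embed(self, file_name: str, text: str, max_words: int = 40) -> int:
        # Embedding under an existing file name replaces the earlier chunks.
        self.entries = [e for e in self.entries if e[0] != file_name]
        chunks = chunk_text(text, max_words)
        self.entries.extend((file_name, c, toy_embed(c)) for c in chunks)
        return len(chunks)

    def query(self, prompt: str, result_count: int = 5) -> list[str]:
        # Rank stored chunks by similarity to the prompt and keep the top few.
        q = toy_embed(prompt)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [chunk for _, chunk, _ in ranked[:result_count]]


collection = ToyCollection()
collection.embed(
    "invoice.txt",
    "Invoice 1001. Widget SKU A-1 quantity 3 unit price 9.50. "
    "Payment is due within thirty days of the invoice date.",
    max_words=10,
)
top = collection.query("unit price of the widget", result_count=1)
print(top)
```

In a real query, the retrieved chunks would then be passed to an AI model together with the prompt; here the sketch stops at retrieval.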

Modes

Embed mode

In embed mode, the node takes a document as input, parses it using the appropriate document loader, splits it into chunks, and stores the vector embeddings.

Configuration:

| Field | Required | Description |
| --- | --- | --- |
| Collection | Yes | The collection to store embeddings in. Choose a persistent collection from your workspace, or Transient for single-run processing. |
| File Name | Yes* | The name to register the document under (e.g., invoice.pdf or {{ trigger.fileName }}). Supports variable references. If a file with this name already exists, it will be overwritten. *Not required for transient collections. |
| Document Type | Yes | The format of the input document, for example PDF, CSV, JSON, HTML, or Plain Text. See the supported document types below. |
| Document Input | Yes | A reference to the base64-encoded document from a previous step (e.g., {{ sftp_result.data.content }}, where sftp_result.data is the file payload { path, content, encoding, size }). |
| Embedding Model | No | Which embedding model to use. Hidden for transient collections, which use the default model. |
| Output Variable | No | Variable name to store the result (chunk count, collection metadata). |
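
Because the Document Input is base64-encoded, it can help to see what a file payload of the shape { path, content, encoding, size } looks like once decoded. The sketch below is a hedged illustration: the function name `decode_file_payload` is hypothetical, and it assumes (this page does not confirm it) that `encoding` holds the string "base64" and that `size` is the decoded byte count.

```python
import base64


def decode_file_payload(payload: dict) -> bytes:
    """Return the raw bytes of a { path, content, encoding, size } payload."""
    # Assumption: the encoding field names the content encoding as "base64".
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    raw = base64.b64decode(payload["content"], validate=True)
    # Assumption: size, when present, is the length of the decoded bytes.
    if payload.get("size") is not None and payload["size"] != len(raw):
        raise ValueError("decoded size does not match payload size")
    return raw


# Invented sample payload for illustration only.
sample = {
    "path": "/inbox/invoice.pdf",
    "content": base64.b64encode(b"%PDF-1.7 sample").decode("ascii"),
    "encoding": "base64",
    "size": 15,
}
document_bytes = decode_file_payload(sample)
```

Inside a workflow you would pass the `content` field straight to the node; decoding like this is mainly useful when debugging a payload outside the platform.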
Supported document types:
| Type | Description |
| --- | --- |
| PDF | PDF documents, parsed page by page |
| Word (DOCX/DOC) | Microsoft Word documents |
| Excel (XLSX/XLS) | Spreadsheets, with rows and cells extracted as text |
| PowerPoint (PPTX/PPT) | Presentations, with text extracted from slides |
| CSV / TSV | Delimited data files |
| JSON | JSON files, with content extracted as text |
| XML | XML documents |
| HTML | HTML pages, with text extracted and tags stripped |
| Plain Text | Raw text files |
| Markdown | Markdown (.md) files |
| RTF | Rich Text Format documents |
| Email (EML/MSG) | Email messages, including headers and body |
| EPUB | E-book format |
| OpenDocument (ODT/ODS/ODP) | LibreOffice / OpenOffice documents |
| Images (PNG/JPG/TIFF/BMP) | Images, with text extracted via OCR |
| Web Page (URL) | Fetches and parses a web page by URL (workflow only) |

Query mode

In query mode, the node searches the vector store for chunks relevant to your prompt, then uses an AI model to synthesise or extract information from those chunks.

Configuration:

| Field | Required | Description |
| --- | --- | --- |
| Collection | Yes | The collection to search. Must match a collection with embedded documents, or Transient to query documents embedded earlier in the same run. |
| Prompt | Yes | A natural-language description of what you want to extract or answer. |
| Model | No | Which AI model to use for synthesis; see Models for options. |
| Embedding Model | No | Which embedding model to use for the search query (should match the model used during embedding). Hidden for transient collections. |
| Result Count | No | Number of document chunks to retrieve (default: 5). |
| Response Schema | No | Define a JSON schema to force the AI to return structured data. See Response Schema below. |
| Output Variable | No | Variable name to store the extracted result. |
For best results, use the same embedding model for both the Embed and Query steps. Mixing models will produce poor search results because the vector spaces won’t align.
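
One concrete reason mixed models misalign: the two available embedding models produce vectors of different lengths (3,072 and 768 dimensions, per the Embedding models table), so a query vector from one cannot be meaningfully compared with stored vectors from the other. The following minimal sketch makes that visible; `cosine_similarity` is a generic helper written for this example, and how the platform itself reacts to a mismatch is not stated here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; refuses to compare vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


stored_chunk = [0.1] * 3072  # placeholder vector with the Gemini Embedding 001 dimension
query_vector = [0.1] * 768   # placeholder vector with the Text Embedding 004 dimension

try:
    cosine_similarity(stored_chunk, query_vector)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Even when two models happen to share a dimension, their vector spaces are trained independently, so similarity scores across them carry no meaning.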

Collections

Collections are workspace-scoped — every workflow in your workspace shares the same collections. This means you can:
  • Embed a company policy document once and query it from any workflow
  • Build up a collection over time by embedding new documents in each workflow run
  • Share knowledge across different automations
You can manage your collections (view documents, upload files, delete) from the Knowledge section in the platform sidebar.

Transient collections

For one-off document processing — where you embed, query, and discard — select Transient from the collection dropdown. Transient collections:
  • Are automatically created for each workflow run
  • Are shared across all nodes in the same run that select “Transient”
  • Are automatically cleaned up when the workflow completes (success or failure)
  • Don’t require a file name or embedding model selection
This is ideal for use cases like invoice extraction, where a new document arrives regularly and you only need to process it once.
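
The lifecycle listed above (created per run, shared by nodes in that run, cleaned up on success or failure) resembles a scoped resource. A rough Python analogy, using a hypothetical in-memory `STORE` rather than anything from the platform, might look like this:

```python
import uuid
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a vector store, keyed by collection name.
STORE: dict[str, list[str]] = {}


@contextmanager
def transient_collection():
    """Create a per-run collection and always remove it afterwards."""
    name = f"transient-{uuid.uuid4().hex}"
    STORE[name] = []
    try:
        yield name
    finally:
        # Runs whether the body completed normally or raised.
        STORE.pop(name, None)


def run_workflow(fail: bool = False) -> int:
    """Simulate one run in which an embed step and a query step share a collection."""
    with transient_collection() as name:
        STORE[name].append("chunk from embed step")
        if fail:
            raise RuntimeError("simulated node failure")
        return len(STORE[name])  # the query step sees what the embed step stored
```

Calling `run_workflow()` returns 1 and leaves `STORE` empty; calling `run_workflow(fail=True)` raises, and `STORE` is still empty afterwards.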

Response Schema

In query mode, you can define a response schema to force the AI to return structured data instead of free-form text. This is useful when downstream nodes need to work with specific fields. The schema editor has two modes:
  • Visual — add properties with name, type, description, and required flags
  • JSON — paste or edit raw JSON schema directly
Supported property types: String, Number, Boolean, Array, and Object. Arrays and objects can contain nested properties.

Example: Extract invoice line items as structured data:
{
  "type": "object",
  "description": "Invoice line items",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string", "description": "Product SKU" },
          "quantity": { "type": "number", "description": "Quantity ordered" },
          "price": { "type": "number", "description": "Unit price" }
        }
      }
    },
    "total": { "type": "number", "description": "Invoice total" }
  },
  "required": ["products", "total"]
}
Response schema is also available on Connector nodes in Agent mode. The same visual/JSON editor is used.
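
If you want to double-check a structured result outside the platform, a small validator covering only the five property types listed above is enough for schemas like the invoice example. This is a minimal sketch, not a full JSON Schema implementation: `validate` is a hypothetical helper, it handles only `type`, `properties`, `required`, and `items`, and the sample values are invented.

```python
from typing import Any

_TYPES = {"string": str, "number": (int, float), "boolean": bool, "array": list, "object": dict}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of human-readable errors; an empty list means the value fits."""
    errors: list[str] = []
    expected = schema.get("type")
    if expected in _TYPES:
        ok = isinstance(value, _TYPES[expected])
        if expected == "number" and isinstance(value, bool):
            ok = False  # bool is a subclass of int in Python, but not a JSON number
        if not ok:
            return [f"{path}: expected {expected}"]
    if expected == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required property")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    elif expected == "array" and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


# The invoice schema from the example above, with descriptions omitted.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "quantity": {"type": "number"},
                    "price": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
    },
    "required": ["products", "total"],
}

good = {"products": [{"sku": "A-1", "quantity": 3, "price": 9.5}], "total": 28.5}
bad = {"products": [{"sku": "A-1", "quantity": "three", "price": 9.5}]}
good_errors = validate(good, INVOICE_SCHEMA)
bad_errors = validate(bad, INVOICE_SCHEMA)
```

For production use, an established JSON Schema library is a better choice than a hand-rolled checker like this one.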

Embedding models

Two embedding models are available:
| Model | Dimensions | Best for |
| --- | --- | --- |
| Gemini Embedding 001 (default) | 3,072 | Higher accuracy, complex documents |
| Text Embedding 004 | 768 | Faster, lighter, good for most use cases |
You can select the embedding model in the properties panel. If not set, the default model is used.
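
To get a feel for what "lighter" means, the dimension counts translate directly into raw vector size. The arithmetic below assumes each dimension is stored as a 32-bit float and ignores any index or metadata overhead; the actual storage format is not documented here, so treat the numbers as a rough estimate only.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def vector_bytes(dimensions: int, chunks: int = 1) -> int:
    """Raw bytes needed to store `chunks` vectors of the given dimension."""
    return dimensions * BYTES_PER_FLOAT32 * chunks


gemini_bytes = vector_bytes(3072, chunks=1000)   # 12,288,000 bytes for 1,000 chunks
text004_bytes = vector_bytes(768, chunks=1000)   # 3,072,000 bytes for 1,000 chunks
ratio = gemini_bytes // text004_bytes            # 4
```

Under that assumption, the 3,072-dimension model needs four times the raw vector storage of the 768-dimension model for the same number of chunks.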

Examples

Extract invoice line items (transient)

A workflow that fetches a PDF invoice and extracts structured data without polluting a persistent collection:
  1. Trigger (Manual or Webhook)
  2. Connector (SFTP, Direct mode) — fetch the PDF file, output: sftp_result
  3. Knowledge (Embed mode)
    • Collection: Transient
    • Document Type: PDF
    • Document Input: {{ sftp_result.data.content }}
  4. Knowledge (Query mode)
    • Collection: Transient
    • Prompt: Extract all line items with SKU, quantity, and unit price
    • Response Schema: define products array with sku, quantity, price fields
    • Output Variable: invoice_data
After the workflow completes, the transient collection is automatically cleaned up.
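
Once `invoice_data` is available, a downstream step can sanity-check the extraction before acting on it. The sketch below assumes the response schema also includes a `total` field (as in the Response Schema example) and that the total is simply the sum of quantity times unit price, with no tax or shipping; the function names and sample values are invented for illustration.

```python
import math


def line_items_total(invoice_data: dict) -> float:
    """Sum quantity * unit price over all extracted line items."""
    return sum(item["quantity"] * item["price"] for item in invoice_data["products"])


def totals_match(invoice_data: dict, tolerance: float = 0.01) -> bool:
    """True if the extracted total agrees with the line items within a small tolerance."""
    return math.isclose(line_items_total(invoice_data), invoice_data["total"], abs_tol=tolerance)


# Invented sample shaped like the response schema example.
invoice_data = {
    "products": [
        {"sku": "A-1", "quantity": 3, "price": 9.5},
        {"sku": "B-2", "quantity": 1, "price": 20.0},
    ],
    "total": 48.5,
}
is_consistent = totals_match(invoice_data)
```

A mismatch does not necessarily mean the extraction failed (the invoice may include tax or discounts), but it is a cheap signal that a result deserves a second look.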

Build a persistent knowledge base

Embed company documents once, then query them from any workflow.

Workflow 1: Index documents (run once or on a schedule):
  1. Trigger (Schedule — weekly)
  2. Connector (SFTP/HTTP, Direct mode) — fetch updated policy documents
  3. Knowledge (Embed mode)
    • Collection: company-policies
    • File Name: {{ sftp_result.data.path }}
    • Document Type: PDF
    • Document Input: {{ sftp_result.data.content }}
Workflow 2: Answer policy questions (run on demand):
  1. Trigger (Webhook — receives a question)
  2. Knowledge (Query mode)
    • Collection: company-policies
    • Prompt: {{ trigger.question }}
    • Output Variable: answer
  3. Connector (Slack, Agent mode) — send the answer back

Tips

  • Use transient collections for one-off processing. If you’re extracting data from a single document per run (invoices, receipts, forms), transient mode keeps your workspace clean.
  • Use descriptive file names for persistent collections. Names like {{ trigger.fileName }} or invoice-2024-001.pdf help you identify documents in the collection grid.
  • Match your embedding models. Always use the same embedding model for embedding and querying the same collection.
  • Start with specific prompts. Instead of “what’s in this document?”, try “extract all line items with quantities and prices as a JSON array”.
  • Use response schemas for reliable downstream processing. When a Transform or Condition node needs specific fields, define a response schema to guarantee the structure.