April 26, 2026 • Version: 2026.3.8

Multimodal Memory Indexing Not Available in OpenClaw Memory Search Pipeline

When using Gemini Embedding 2 with memory search, non-text media files (images, video, audio, PDFs) are not indexed, causing the multimodal embedding capability to remain unused and preventing semantic search across media types.


🔍 Symptoms

Technical Manifestations

When a user configures OpenClaw with memorySearch.provider = "gemini" and model = "gemini-embedding-2-preview", the following behavior occurs:

1. Media files are silently excluded from indexing

# User stores a photo with the memory system
$ openclaw memory store --file receipt.jpg --note "Expense from vendor meeting"

# Expected: Photo embedded directly into vector index
# Actual: File is either skipped or requires manual Markdown summarization

$ openclaw memory search "expense receipts"
# Only returns text-based results
# Image of the receipt is not retrievable via semantic search

2. Current indexer filters block non-Markdown files

The internal MemoryIndexer class contains a file extension filter:

// Pseudocode representing current behavior
const SUPPORTED_EXTENSIONS = ['.md', '.markdown'];

function indexFile(filePath) {
  const ext = path.extname(filePath);
  if (!SUPPORTED_EXTENSIONS.includes(ext)) {
    logger.debug(`Skipping non-markdown file: ${filePath}`);
    return; // <-- Multimodal media never reaches embedding API
  }
  // ... text extraction and embedding logic
}

3. Vector store receives no media embeddings

# Checking the SQLite memory database directly
$ sqlite3 ~/.openclaw/memory.db "SELECT id, memory_type, metadata FROM memory_vectors LIMIT 10;"

id          memory_type  metadata
----------  -----------  --------------------------------------------------
vec_001     text         {"source": "note.md", "chunk": 0}
vec_002     text         {"source": "note.md", "chunk": 1}
# No rows for image, video, audio, or PDF content

4. API calls to Gemini Embedding 2 only include text chunks

When memory_search queries the embedding model for retrieval:

# Debug log showing current API payload
{
  "model": "gemini-embedding-2-preview",
  "contents": [{ "text": "query text" }],
  // Missing: multimodal content support
}

🧠 Root Cause

Architectural Gap Analysis

1. File Discovery Phase Limitation

The FileSystemMemorySource class (or equivalent) performs recursive directory scanning with a hardcoded extension filter:

// Current implementation in src/memory/sources/file-system.ts
async scanForMemoryFiles(directory: string): Promise<MemoryFile[]> {
  const files = await glob('**/*', { 
    cwd: directory,
    absolute: true 
  });
  
  return files
    .filter(f => f.endsWith('.md') || f.endsWith('.markdown'))
    .map(f => new MemoryFile(f));
}

This design predates the multimodal embedding capability and assumes all memory content is text-based.

2. Text Extraction Layer

Even if media files were discovered, the current pipeline routes all content through a Markdown parser:

// Current pipeline: media file → skipped OR requires external conversion
async extractContent(file: MemoryFile): Promise<string> {
  switch (file.mimeType) {
    case 'text/markdown':
    case 'text/plain':
      return fs.readFile(file.path, 'utf-8');
    default:
      // No handler for image/*, video/*, audio/*, application/pdf
      throw new UnsupportedMediaTypeError(file.mimeType);
  }
}

3. Single-Modal Embedding Request Construction

The Gemini embedding request builder only constructs text payloads:

// Current: text-only content parts
function buildEmbeddingRequest(chunks: string[]): EmbeddingRequest {
  return {
    model: this.config.model,
    contents: chunks.map(text => ({
      parts: [{ text }]  // <-- Only text part type supported
    }))
  };
}

4. SQLite Vector Store Schema Constraints

The current schema does not anticipate non-text vector types:

CREATE TABLE memory_vectors (
  id TEXT PRIMARY KEY,
  vector BLOB NOT NULL,
  metadata JSON,           -- Limited metadata for text chunks
  memory_type TEXT,        -- Only 'text' or null expected
  created_at INTEGER
);

No media_type or mime_type column exists to distinguish image vs. video vs. audio embeddings.

5. Missing Multimodal API Integration

Gemini Embedding 2 requires specific request formatting for multimodal inputs:

// Required for multimodal (but not implemented):
{
  "model": "gemini-embedding-2-preview",
  "contents": [{
    "parts": [
      { "text": "Image showing vendor receipt" },
      { "inline_data": {
        "mime_type": "image/jpeg",
        "data": "<base64-encoded-image>"
      }}
    ]
  }]
}

This capability exists in the Gemini API but has no handler in OpenClaw’s memory pipeline.

Summary of Failure Sequence

User uploads media → FileSystemMemorySource.scan() filters by .md → Media file skipped → Gemini API never called with media → Vector store empty for media → memory_search returns no media results

🛠️ Step-by-Step Fix

Implementation Guide: Adding Multimodal Memory Indexing

Prerequisites

  • OpenClaw version: 2026.3.8 or later (when implementing)
  • Gemini API key with access to gemini-embedding-2-preview
  • Configuration: memorySearch.provider = "gemini" and model = "gemini-embedding-2-preview"

Step 1: Extend File Discovery to Include Media Types

Modify src/memory/sources/file-system.ts:

// BEFORE (current behavior)
const SUPPORTED_EXTENSIONS = ['.md', '.markdown'];

// AFTER (proposed)
const SUPPORTED_EXTENSIONS = ['.md', '.markdown'];
const SUPPORTED_MEDIA_TYPES = [
  'image/jpeg', 'image/png', 'image/gif', 'image/webp',
  'video/mp4', 'video/webm',
  'audio/mpeg', 'audio/wav', 'audio/ogg',
  'application/pdf'
];

Update the file scanning logic:

async scanForMemoryFiles(directory: string): Promise<MemoryFile[]> {
  const files = await glob('**/*', { cwd: directory, absolute: true });

  return files
    .filter(f => {
      const ext = path.extname(f).toLowerCase();
      return SUPPORTED_EXTENSIONS.includes(ext) ||
        SUPPORTED_MEDIA_TYPES.includes(this.getMimeType(ext));
    })
    .map(f => new MemoryFile(f));
}
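The scanning logic above calls a `getMimeType` helper that the guide does not show. The following is a minimal sketch of what it might look like, assuming a plain extension lookup is enough; `MIME_BY_EXTENSION`, `extensionOf`, and `isIndexable` are hypothetical names, and the table simply mirrors the media types listed in Step 1.

```typescript
// Hypothetical lookup table: extensions mapped to the MIME types from Step 1.
const MIME_BY_EXTENSION: Record<string, string> = {
  '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg', '.png': 'image/png',
  '.gif': 'image/gif', '.webp': 'image/webp',
  '.mp4': 'video/mp4', '.webm': 'video/webm',
  '.mp3': 'audio/mpeg', '.wav': 'audio/wav', '.ogg': 'audio/ogg',
  '.pdf': 'application/pdf',
};

const TEXT_EXTENSIONS = ['.md', '.markdown'];

// Returns the lowercase extension including the dot, or '' when there is none.
// Only handles '/' separators; dotfiles such as '.hidden' have no extension.
function extensionOf(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf('/') + 1);
  const dot = base.lastIndexOf('.');
  return dot > 0 ? base.slice(dot).toLowerCase() : '';
}

// Returns the MIME type for a known media extension, or undefined otherwise.
function getMimeType(ext: string): string | undefined {
  return MIME_BY_EXTENSION[ext.toLowerCase()];
}

// A file is indexable when it is Markdown or one of the known media types.
function isIndexable(filePath: string): boolean {
  const ext = extensionOf(filePath);
  return TEXT_EXTENSIONS.includes(ext) || getMimeType(ext) !== undefined;
}
```

Returning `undefined` for unknown extensions keeps the filter decision explicit; a real implementation might prefer content sniffing over extensions.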


Step 2: Implement Media Content Extraction

Create src/memory/extractors/media-extractor.ts:

import { readFile, stat } from 'fs/promises';
import * as path from 'path';

export interface MediaContent {
  mimeType: string;
  data: string; // base64 encoded
  sizeBytes: number;
}

// Extension-to-MIME lookup for the media types listed in Step 1
const MIME_MAP: Record<string, string> = {
  '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg', '.png': 'image/png',
  '.gif': 'image/gif', '.webp': 'image/webp',
  '.mp4': 'video/mp4', '.webm': 'video/webm',
  '.mp3': 'audio/mpeg', '.wav': 'audio/wav', '.ogg': 'audio/ogg',
  '.pdf': 'application/pdf'
};

export class MediaExtractor {
  // Gemini Embedding 2 has a 20MB input limit per content
  private readonly MAX_FILE_SIZE = 20 * 1024 * 1024;

  // Used by the content router (Step 3) to decide whether a file is media
  isSupported(filePath: string): boolean {
    return MIME_MAP[path.extname(filePath).toLowerCase()] !== undefined;
  }

  async extract(filePath: string): Promise<MediaContent> {
    const stats = await stat(filePath);

    if (stats.size > this.MAX_FILE_SIZE) {
      throw new Error(
        `File ${filePath} exceeds ${this.MAX_FILE_SIZE} bytes limit for embedding`
      );
    }

    const buffer = await readFile(filePath);
    const mimeType = this.detectMimeType(filePath);

    return {
      mimeType,
      data: buffer.toString('base64'),
      sizeBytes: stats.size
    };
  }

  private detectMimeType(filePath: string): string {
    const ext = path.extname(filePath).toLowerCase();
    return MIME_MAP[ext] || 'application/octet-stream';
  }
}


Step 3: Update Content Router for Multimodal Handling

Modify src/memory/content-router.ts:

// BEFORE
async extractContent(file: MemoryFile): Promise<string> {
  const content = await fs.readFile(file.path, 'utf-8');
  return this.markdownParser.parse(content);
}

// AFTER
async extractContent(file: MemoryFile): Promise<string | MediaContent> {
  const isMediaFile = this.mediaExtractor.isSupported(file.path);

  if (isMediaFile) {
    return this.mediaExtractor.extract(file.path);
  }

  const text = await fs.readFile(file.path, 'utf-8');
  return this.markdownParser.parse(text);
}


Step 4: Build Multimodal Embedding Requests

Update src/memory/providers/gemini-embedding.ts:

interface TextPart {
  text: string;
}

interface InlineDataPart {
  inline_data: {
    mime_type: string;
    data: string;
  };
}

type ContentPart = TextPart | InlineDataPart;

interface EmbeddingTask {
  parts: ContentPart[];
  mediaType?: string;
}

export class GeminiEmbeddingProvider {
  async embed(task: EmbeddingTask): Promise<number[]> {
    const request = {
      model: this.config.model,
      content: { parts: task.parts },
      taskType: 'SEMANTIC_SIMILARITY'
    };

    const response = await this.geminiClient.embedContent(request);
    return response.embedding.values;
  }

  // Build multimodal embedding request with optional text context
  buildMultimodalRequest(mediaData: MediaContent, textContext?: string): EmbeddingTask {
    const parts: ContentPart[] = [];

    // Optional text caption/context
    if (textContext) {
      parts.push({ text: textContext });
    }

    // Inline media data
    parts.push({
      inline_data: {
        mime_type: mediaData.mimeType,
        data: mediaData.data
      }
    });

    return { parts, mediaType: mediaData.mimeType };
  }
}


Step 5: Update SQLite Schema for Media Vectors

-- Migration: add columns for media vector metadata
ALTER TABLE memory_vectors ADD COLUMN media_type TEXT;
ALTER TABLE memory_vectors ADD COLUMN mime_type TEXT;
ALTER TABLE memory_vectors ADD COLUMN original_file_path TEXT;

-- Resulting schema (for reference; existing databases only need the ALTER statements above)
CREATE TABLE memory_vectors (
  id TEXT PRIMARY KEY,
  vector BLOB NOT NULL,
  metadata JSON,
  memory_type TEXT,         -- 'text' or 'media'
  media_type TEXT,          -- 'image', 'video', 'audio', 'document'
  mime_type TEXT,           -- 'image/jpeg', 'application/pdf', etc.
  original_file_path TEXT,
  created_at INTEGER
);

-- Index for media-specific queries
CREATE INDEX idx_memory_vectors_media ON memory_vectors(memory_type, media_type);
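The schema stores each embedding in a `vector BLOB` column, but the guide does not say how the numbers are serialized. One common choice is packed little-endian float32 values; the sketch below assumes that layout, and `encodeVector` / `decodeVector` are hypothetical helper names rather than existing OpenClaw functions.

```typescript
// Pack an embedding into little-endian float32 bytes for the BLOB column.
function encodeVector(values: number[]): Uint8Array {
  const bytes = new Uint8Array(values.length * 4);
  const view = new DataView(bytes.buffer);
  values.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return bytes;
}

// Unpack a BLOB written by encodeVector back into a number array.
function decodeVector(blob: Uint8Array): number[] {
  if (blob.byteLength % 4 !== 0) {
    throw new Error(`BLOB length ${blob.byteLength} is not a multiple of 4`);
  }
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const values: number[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += 4) {
    values.push(view.getFloat32(offset, true));
  }
  return values;
}
```

Float32 halves storage compared with float64 at the cost of some precision, so values read back may differ slightly from what the API returned unless they are exactly representable in 32 bits.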


Step 6: Integrate into Memory Indexer Pipeline

Modify src/memory/indexer.ts:

async indexFile(file: MemoryFile): Promise<void> {
  const content = await this.contentRouter.extractContent(file);

  if (typeof content === 'string') {
    // Existing text embedding logic
    await this.indexText(content, file.path);
  } else {
    // New multimodal embedding logic
    await this.indexMedia(content, file.path);
  }
}

async indexMedia(media: MediaContent, sourcePath: string): Promise<void> {
  // Check if provider supports multimodal
  if (!this.isMultimodalProvider()) {
    throw new Error(
      `Multimodal embedding requires gemini-embedding-2-preview. ` +
      `Current provider: ${this.config.provider}`
    );
  }

  const task = this.geminiProvider.buildMultimodalRequest(
    media,
    `Source file: ${path.basename(sourcePath)}`
  );

  const vector = await this.geminiProvider.embed(task);

  await this.vectorStore.insert({
    id: this.generateId(),
    vector,
    memory_type: 'media',
    media_type: this.categorizeMedia(media.mimeType),
    mime_type: media.mimeType,
    original_file_path: sourcePath,
    metadata: {
      size_bytes: media.sizeBytes,
      indexed_at: Date.now()
    }
  });
}

private categorizeMedia(mimeType: string): string {
  if (mimeType.startsWith('image/')) return 'image';
  if (mimeType.startsWith('video/')) return 'video';
  if (mimeType.startsWith('audio/')) return 'audio';
  return 'document';
}
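Step 6 calls `this.isMultimodalProvider()` without showing it. A minimal sketch follows, assuming the check needs only the provider name and model string. The `ProviderConfig` shape is hypothetical, and the substring test mirrors the one used in pitfall 10 later in this guide.

```typescript
// Hypothetical subset of the memory search configuration.
interface ProviderConfig {
  provider: string;
  model?: string;
}

// True only for the Gemini provider configured with a Gemini Embedding 2 model.
function isMultimodalProvider(config: ProviderConfig): boolean {
  const model = config.model ?? '';
  return config.provider === 'gemini' && model.includes('gemini-embedding-2');
}
```

Checking both fields avoids treating another provider as multimodal just because its model name happens to contain a similar string.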


Step 7: Update Configuration Schema

Modify src/config/schema.ts:

export interface MemorySearchConfig {
  provider: 'gemini' | 'openai' | 'local';
  model?: string;

  // New configuration options
  multimodal?: {
    enabled: boolean;           // Default: true when using gemini-embedding-2-preview
    maxFileSizeBytes: number;   // Default: 20MB
    supportedTypes?: string[];  // Optional filter
  };

  // Indexing control
  indexMedia?: boolean;         // Default: false (opt-in until GA)
  mediaIndexPath?: string;      // Separate directory for media index
}


Configuration Example

# ~/.openclaw/config.yaml
memorySearch:
  provider: gemini
  model: gemini-embedding-2-preview
  multimodal:
    enabled: true
    maxFileSizeBytes: 20971520  # 20MB
  indexMedia: true
  mediaIndexPath: ~/.openclaw/memory/media
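The defaults described in the Step 7 comments could be applied when the configuration is loaded. The sketch below is one way to do that under those stated defaults; `resolveMemorySearchConfig` and the `RawMemorySearchConfig` type are hypothetical names, not part of the OpenClaw codebase.

```typescript
// Hypothetical raw config shape: every multimodal field may be omitted.
interface RawMemorySearchConfig {
  provider: 'gemini' | 'openai' | 'local';
  model?: string;
  multimodal?: {
    enabled?: boolean;
    maxFileSizeBytes?: number;
    supportedTypes?: string[];
  };
  indexMedia?: boolean;
  mediaIndexPath?: string;
}

const DEFAULT_MAX_FILE_SIZE_BYTES = 20 * 1024 * 1024; // 20MB, per Step 7

// Fill in the documented defaults without overriding explicit user settings.
function resolveMemorySearchConfig(config: RawMemorySearchConfig) {
  const usesGeminiEmbedding2 =
    config.provider === 'gemini' && config.model === 'gemini-embedding-2-preview';

  return {
    ...config,
    multimodal: {
      enabled: config.multimodal?.enabled ?? usesGeminiEmbedding2,
      maxFileSizeBytes:
        config.multimodal?.maxFileSizeBytes ?? DEFAULT_MAX_FILE_SIZE_BYTES,
      supportedTypes: config.multimodal?.supportedTypes,
    },
    indexMedia: config.indexMedia ?? false, // opt-in until GA
  };
}
```

Using `??` rather than `||` matters here: an explicit `enabled: false` must survive even when the model would otherwise default it to `true`.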

🧪 Verification

Verification Procedures

After implementing the multimodal indexing feature, verify correct operation using the following test scenarios:


Test 1: Media File Discovery

# Place a test image in the memory directory
$ cp /path/to/receipt.jpg ~/.openclaw/memory/

# Verify file is discovered
$ openclaw memory scan --verbose

[DEBUG] Discovered file: /home/user/.openclaw/memory/receipt.jpg
[DEBUG] Mime type: image/jpeg
[DEBUG] Size: 245KB
[INFO] Found 1 text files, 1 media files

Expected output should show media files being discovered, not filtered out.


Test 2: Direct Media Embedding

# Store the image with semantic context
$ openclaw memory store --file ~/.openclaw/memory/receipt.jpg \
    --note "Vendor receipt from Acme Corp meeting"

# Verify embedding was created
$ sqlite3 ~/.openclaw/memory.db \
    "SELECT id, memory_type, mime_type, original_file_path FROM memory_vectors
     WHERE original_file_path LIKE '%receipt.jpg%';"

id         memory_type  mime_type   original_file_path
---------  -----------  ----------  ---------------------------------------
vec_m_001  media        image/jpeg  /home/user/.openclaw/memory/receipt.jpg

Exit code 0 and a row with memory_type 'media' indicate successful embedding.


Test 3: Semantic Search for Media Content

# Search using text that should match the embedded image
$ openclaw memory search "Acme Corp vendor receipt"

# Expected output format:
# --- Text Results ---
# [Score: 0.87] note.md: "Meeting notes with Acme Corp vendor..."
#
# --- Media Results ---
# [Score: 0.82] receipt.jpg (image/jpeg)
#   Context: "Vendor receipt from Acme Corp meeting"
#   File: ~/.openclaw/memory/receipt.jpg

Exit code 0 confirms mixed text+media results are returned.


Test 4: Multimodal Provider Guard

# Test with unsupported provider configuration
$ openclaw config set memorySearch.provider openai
$ openclaw config set memorySearch.model text-embedding-3-small
$ openclaw memory store --file ~/.openclaw/memory/receipt.jpg

# Expected error:
# [ERROR] Multimodal embedding requires gemini-embedding-2-preview. Current provider: openai
# Exit code: 1


Test 5: File Size Limit Enforcement

# Create a test file exceeding 20MB limit
$ dd if=/dev/urandom of=large_video.mp4 bs=1M count=25

$ openclaw memory store --file ./large_video.mp4

# Expected output:
# [ERROR] File ./large_video.mp4 exceeds 20971520 bytes limit for embedding
# Exit code: 1


Test 6: End-to-End Video Embedding

# Store a video file
$ openclaw memory store --file ./meeting-recording.mp4 \
    --note "Product demo showing new dashboard feature"

# Query by semantic concept
$ openclaw memory search "dashboard feature presentation"

# Expected: Video file should appear in media results
# With relevance score above configured threshold (default: 0.7)
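The threshold check in Test 6 can be reasoned about with a small scoring sketch. It assumes relevance scores are cosine similarities between the query embedding and each stored vector, which this guide does not state explicitly; `cosineSimilarity` and `rankAboveThreshold` are hypothetical helpers, and the 0.7 default is taken from the test description.

```typescript
// Cosine similarity of two equal-length vectors; 0 when either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface StoredVector {
  id: string;
  vector: number[];
}

// Score every candidate, drop those under the threshold, best match first.
function rankAboveThreshold(
  query: number[],
  candidates: StoredVector[],
  threshold = 0.7
): { id: string; score: number }[] {
  return candidates
    .map(c => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .filter(result => result.score >= threshold)
    .sort((x, y) => y.score - x.score);
}
```

The dimension check is also where a mismatch between text and media vectors would surface if the two were embedded with different output sizes.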


Regression Test: Text-Only Mode Unaffected

# Ensure existing text indexing still works
$ printf '# Meeting Notes\n\nDiscussed Q4 roadmap.\n' > /tmp/notes.md
$ openclaw memory store --file /tmp/notes.md

$ openclaw memory search "quarterly roadmap"

# Should return text result with score >= 0.75
# Should NOT attempt multimodal embedding for .md files

⚠️ Common Pitfalls

Implementation and Usage Traps

1. API Quota Consumption

Multimodal embeddings consume significantly more API quota than text-only embeddings.

  • Risk: A directory with 100 images embedded individually could exhaust daily API limits within minutes.
  • Mitigation: Implement batch embedding with rate limiting:
    // Rate limit: max 10 media embeddings per minute
    const EMBEDDING_RATE_LIMIT = {
      maxPerMinute: 10,
      maxConcurrent: 2
    };
    
  • Detection: Watch for HTTP 429 (RESOURCE_EXHAUSTED) responses from the Gemini API and review quota usage in the Google Cloud console.
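The `EMBEDDING_RATE_LIMIT` settings above could be enforced with a small sliding-window gate. This is a sketch under the assumption that callers ask for permission before each embedding call and report completion afterwards; `MediaEmbeddingLimiter` is a hypothetical class, and the clock is passed in so the behavior is easy to test.

```typescript
const WINDOW_MS = 60000; // one minute

class MediaEmbeddingLimiter {
  private readonly startedAt: number[] = []; // start times inside the window
  private inFlight = 0;
  private readonly maxPerMinute: number;
  private readonly maxConcurrent: number;

  constructor(maxPerMinute = 10, maxConcurrent = 2) {
    this.maxPerMinute = maxPerMinute;
    this.maxConcurrent = maxConcurrent;
  }

  // Returns true and records the start if a new embedding may begin now.
  tryStart(nowMs: number): boolean {
    // Forget requests that started a full window or more ago.
    while (
      this.startedAt.length > 0 &&
      nowMs - (this.startedAt[0] as number) >= WINDOW_MS
    ) {
      this.startedAt.shift();
    }
    if (
      this.inFlight >= this.maxConcurrent ||
      this.startedAt.length >= this.maxPerMinute
    ) {
      return false;
    }
    this.startedAt.push(nowMs);
    this.inFlight += 1;
    return true;
  }

  // Call when an embedding request completes, successfully or not.
  finish(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}
```

A caller that receives `false` would typically wait and retry; queueing is left out to keep the sketch short.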

2. File Size vs. Token Budget

Gemini Embedding 2 has a 20MB per-request limit, but large images may exceed token context after processing.

  • Risk: A 19MB PDF might still fail if it contains complex layouts exceeding token limits.
  • Mitigation: Pre-check file complexity:
    async validateForEmbedding(file: MemoryFile): Promise<ValidationResult> {
      const stats = await stat(file.path);

      if (stats.size > 20 * 1024 * 1024) {
        return { valid: false, reason: 'exceeds_20mb_limit' };
      }

      if (file.mimeType === 'application/pdf') {
        const pageCount = await this.countPdfPages(file.path);
        if (pageCount > 100) {
          return { valid: false, reason: 'pdf_exceeds_page_limit' };
        }
      }

      return { valid: true };
    }

3. Base64 Encoding Overhead

Base64 encoding increases file size by ~33%.

  • Risk: A 15MB image becomes ~20MB when base64-encoded, potentially hitting limits mid-transfer.
  • Mitigation: Set practical file size limits at 15MB for raw files to account for encoding overhead.
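The 15MB figure follows directly from how base64 works: every 3 raw bytes become 4 encoded characters, with padding up to a multiple of 4. The sketch below works through that arithmetic; it assumes the 20MB limit applies to the encoded payload, which the guide does not confirm, and both function names are hypothetical.

```typescript
// Size in bytes of the padded base64 encoding of rawBytes bytes of input.
function base64EncodedSize(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Largest raw size whose base64 encoding still fits in encodedLimitBytes.
function maxRawBytesFor(encodedLimitBytes: number): number {
  return Math.floor(encodedLimitBytes / 4) * 3;
}

const MB = 1024 * 1024;
const encodedSizeOf15Mb = base64EncodedSize(15 * MB); // 20 * MB exactly
const rawBudgetFor20Mb = maxRawBytesFor(20 * MB);     // 15 * MB exactly
```

Because 15MB encodes to exactly 20MB, a raw limit of 15MB leaves no headroom for the rest of the request body, so a slightly lower cap may be safer in practice.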

4. SQLite BLOB Storage Growth

Storing vector blobs alongside base64 media data can rapidly inflate the database.

  • Risk: 1,000 images × 5MB average = 5GB database growth.
  • Mitigation: Store vectors in database, keep media files on disk with reference paths:
    // Store only vector + metadata, not full base64
    interface MediaVectorRecord {
      id: string;
      vector: Float32Array;        // Length depends on the model and configured output dimensionality
      metadata: {
        fileHash: string;          // SHA-256 for integrity
        filePath: string;          // Reference to original file
        thumbnailPath?: string;    // Optional preview
      };
    }
    

5. Context Window Truncation

When providing text context alongside media, long captions may be truncated by the embedding model’s context window.

  • Risk: A 2,000-character description gets truncated, losing semantic nuance.
  • Mitigation: Implement intelligent truncation that preserves key entities and actions.
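As a starting point for caption truncation, the sketch below cuts at sentence boundaries so that the retained text stays grammatical. It is a simpler stand-in for the entity-aware truncation suggested above, not an implementation of it, and it budgets by characters rather than model tokens; `truncateCaption` is a hypothetical helper.

```typescript
// Keep whole leading sentences that fit in maxChars; fall back to a word cut.
function truncateCaption(text: string, maxChars: number): string {
  const normalized = text.trim().replace(/\s+/g, ' ');
  if (normalized.length <= maxChars) return normalized;

  // Split into sentences, keeping each sentence's trailing punctuation.
  const sentences = normalized.match(/[^.!?]+[.!?]*/g) ?? [normalized];
  let kept = '';
  for (const sentence of sentences) {
    if ((kept + sentence).length > maxChars) break;
    kept += sentence;
  }
  if (kept.trim().length > 0) return kept.trim();

  // Even the first sentence is too long: cut at the last full word.
  const cut = normalized.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut).trim();
}
```

Putting the most important facts in the first sentence of a caption makes this kind of truncation lose the least.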

6. macOS Spotlight Integration Conflict

On macOS, the .openclaw directory might be indexed by Spotlight, causing duplicate embeddings.

  • Detection: Check for `com.apple.metadata:*` extended attributes on indexed files.
  • Mitigation: Exclude `~/.openclaw/memory` from Spotlight by adding it to the Spotlight Privacy list in System Settings.

7. Docker Volume Mount Permissions

When running OpenClaw in Docker, mounted volumes may have different permission models.

  • Symptom: `EACCES: permission denied` when reading media files from mounted paths.
  • Mitigation: Ensure volume mounts use consistent UID/GID:
    docker run --user 1000:1000 -v /host/media:/container/media:ro \
      openclaw/memory indexer
    

8. Network Timeout on Large Files

High-resolution images may time out during base64 transfer to the Gemini API.

  • Configuration: Set extended timeouts:
    const GEMINI_TIMEOUT = {
      connectTimeout: 10000,   // 10s
      socketTimeout: 120000    // 2min for large media
    };
    

9. Memory Pressure During Batch Indexing

Loading multiple large images into memory simultaneously can exhaust the process heap.

  • Risk: Out-of-memory errors when indexing entire media directories.
  • Mitigation: Process in streaming fashion with garbage collection hints:
    async function* streamMediaFiles(directory: string) {
      for await (const file of globStream('**/*.{jpg,png,mp4}', { cwd: directory })) {
        // Release memory after each embedding
        const result = await embedMedia(file);
        yield result;
        if (global.gc) global.gc(); // Only defined when Node runs with --expose-gc
      }
    }
    

10. Mixed Provider Configuration

Users may configure text search with OpenAI but expect Gemini for media.

  • Risk: Partial functionality or silent failures.
  • Enforcement: Validate provider consistency:
    function validateMultimodalConfig(config: MemorySearchConfig): void {
      if (config.multimodal?.enabled && 
          !(config.model?.includes('gemini-embedding-2'))) {
        throw new ConfigurationError(
          'multimodal.enabled requires model containing "gemini-embedding-2"'
        );
      }
    }
    

Contextual References

  • UNSUPPORTED_MEDIA_TYPE: Currently thrown when non-Markdown files are encountered during indexing. Will transition to supported media types after implementation.
  • EMBEDDING_PROVIDER_MISMATCH: Raised when `indexMedia: true` is set but provider is not Gemini. Aligns with provider validation in Step 7.
  • FILE_SIZE_EXCEEDED: Proposed error code for files exceeding the 20MB limit. Includes file path and actual size in error metadata.
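  • VECTOR_DIMENSION_MISMATCH: Potential error if Gemini Embedding 2 dimensions differ from the existing vector store schema. The dimension count depends on the model and the requested output dimensionality, so confirm it against the existing index before mixing vectors.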
  • QUOTA_EXCEEDED: API quota exhaustion from rapid multimodal indexing. Should include `retryAfter` header value for exponential backoff.
  • MEDIA_EXTRACTION_FAILED: Corruption or unsupported encoding within otherwise valid media files (e.g., corrupted JPEG).
  • GitHub Issue #1842: Original request for multimodal memory search, predating this implementation guide.
  • GitHub Issue #2107: Discussion on optimal chunking strategy for video frames when embedding into vector space.
  • GitHub Discussion #892: Community workaround using external OCR + text embedding, loses fidelity compared to native multimodal.
  • Gemini Embedding 2 Documentation: Official reference for multimodal embedding request format and supported input types.
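The QUOTA_EXCEEDED entry above mentions honoring a `retryAfter` value with exponential backoff. A minimal sketch of the delay calculation follows, assuming `retryAfter` is expressed in seconds and that the server hint should win when it is longer than the computed delay; `backoffDelayMs`, the 1-second base, and the 60-second cap are all hypothetical choices.

```typescript
// Delay before retry number `attempt` (0-based), in milliseconds.
// Doubles from baseMs up to capMs, but never undercuts the server's hint.
function backoffDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  baseMs = 1000,
  capMs = 60000
): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  const serverHintMs =
    retryAfterSeconds !== undefined ? retryAfterSeconds * 1000 : 0;
  return Math.max(exponential, serverHintMs);
}
```

Production code would usually add random jitter so that many clients do not retry at the same instant; it is omitted here to keep the result deterministic.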

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.