Multimodal Memory Indexing Not Available in OpenClaw Memory Search Pipeline
When using Gemini Embedding 2 with memory search, non-text media files (images, video, audio, PDFs) are not indexed, causing the multimodal embedding capability to remain unused and preventing semantic search across media types.
Symptoms
Technical Manifestations
When a user configures OpenClaw with memorySearch.provider = "gemini" and model = "gemini-embedding-2-preview", the following behavior occurs:
1. Media files are silently excluded from indexing
    # User stores a photo with the memory system
    $ openclaw memory store --file receipt.jpg --note "Expense from vendor meeting"
    # Expected: Photo embedded directly into vector index
    # Actual: File is either skipped or requires manual Markdown summarization

    $ openclaw memory search "expense receipts"
    # Only returns text-based results
    # Image of the receipt is not retrievable via semantic search
2. Current indexer filters block non-Markdown files
The internal MemoryIndexer class contains a file extension filter:
    // Pseudocode representing current behavior
    const SUPPORTED_EXTENSIONS = ['.md', '.markdown'];

    function indexFile(filePath) {
      const ext = path.extname(filePath);
      if (!SUPPORTED_EXTENSIONS.includes(ext)) {
        logger.debug(`Skipping non-markdown file: ${filePath}`);
        return; // <-- Multimodal media never reaches embedding API
      }
      // ... text extraction and embedding logic
    }
3. Vector store receives no media embeddings
    # Checking the SQLite memory database directly
    $ sqlite3 ~/.openclaw/memory.db "SELECT id, type, metadata FROM memory_vectors LIMIT 10;"
    id          type        metadata
    ----------  ----------  --------------------------------------------------
    vec_001     text        {"source": "note.md", "chunk": 0}
    vec_002     text        {"source": "note.md", "chunk": 1}
    # No entries with type='image', 'video', 'audio', or 'pdf'
4. API calls to Gemini Embedding 2 only include text chunks
When memory_search queries the embedding model for retrieval:
    # Debug log showing current API payload
    {
      "model": "gemini-embedding-2-preview",
      "contents": [{ "text": "query text" }],
      // Missing: multimodal content support
    }
Root Cause
Architectural Gap Analysis
1. File Discovery Phase Limitation
The FileSystemMemorySource class (or equivalent) performs recursive directory scanning with a hardcoded extension filter:
    // Current implementation in src/memory/sources/file-system.ts
    async scanForMemoryFiles(directory: string): Promise<MemoryFile[]> {
      const files = await glob('**/*', {
        cwd: directory,
        absolute: true
      });

      return files
        .filter(f => f.endsWith('.md') || f.endsWith('.markdown'))
        .map(f => new MemoryFile(f));
    }
This design predates the multimodal embedding capability and assumes all memory content is text-based.
2. Text Extraction Layer
Even if media files were discovered, the current pipeline routes all content through a Markdown parser:
    // Current pipeline: media file → skipped OR requires external conversion
    async extractContent(file: MemoryFile): Promise<string> {
      switch (file.mimeType) {
        case 'text/markdown':
        case 'text/plain':
          return fs.readFile(file.path, 'utf-8');
        default:
          // No handler for image/*, video/*, audio/*, application/pdf
          throw new UnsupportedMediaTypeError(file.mimeType);
      }
    }
3. Single-Modal Embedding Request Construction
The Gemini embedding request builder only constructs text payloads:
    // Current: text-only content parts
    function buildEmbeddingRequest(chunks: string[]): EmbeddingRequest {
      return {
        model: this.config.model,
        contents: chunks.map(text => ({
          parts: [{ text }] // <-- Only text part type supported
        }))
      };
    }
4. SQLite Vector Store Schema Constraints
The current schema does not anticipate non-text vector types:
    CREATE TABLE memory_vectors (
      id TEXT PRIMARY KEY,
      vector BLOB NOT NULL,
      metadata JSON,        -- Limited metadata for text chunks
      memory_type TEXT,     -- Only 'text' or null expected
      created_at INTEGER
    );
No media_type or mime_type column exists to distinguish image vs. video vs. audio embeddings.
5. Missing Multimodal API Integration
Gemini Embedding 2 requires specific request formatting for multimodal inputs:
    // Required for multimodal (but not implemented):
    {
      "model": "gemini-embedding-2-preview",
      "contents": [{
        "parts": [
          { "text": "Image showing vendor receipt" },
          { "inline_data": {
            "mime_type": "image/jpeg",
            "data": "<base64-encoded-image>"
          }}
        ]
      }]
    }
This capability exists in the Gemini API but has no handler in OpenClaw’s memory pipeline.
Summary of Failure Sequence
User uploads media → FileSystemMemorySource.scanForMemoryFiles() keeps only .md/.markdown files → media file skipped → Gemini API never called with media → vector store holds no media vectors → memory_search returns no media results
Step-by-Step Fix
Implementation Guide: Adding Multimodal Memory Indexing
Prerequisites
- OpenClaw version: 2026.3.8 or later (when implementing)
- Gemini API key with access to gemini-embedding-2-preview
- Configuration: memorySearch.provider = "gemini" and model = "gemini-embedding-2-preview"
Step 1: Extend File Discovery to Include Media Types
Modify src/memory/sources/file-system.ts:
    // BEFORE (current behavior)
    const SUPPORTED_EXTENSIONS = ['.md', '.markdown'];

    // AFTER (proposed)
    const SUPPORTED_EXTENSIONS = ['.md', '.markdown'];
    const SUPPORTED_MEDIA_TYPES = [
      'image/jpeg', 'image/png', 'image/gif', 'image/webp',
      'video/mp4', 'video/webm',
      'audio/mpeg', 'audio/wav', 'audio/ogg',
      'application/pdf'
    ];
Update the file scanning logic:
    async scanForMemoryFiles(directory: string): Promise<MemoryFile[]> {
      const files = await glob('**/*', { cwd: directory, absolute: true });

      return files
        .filter(f => {
          const ext = path.extname(f).toLowerCase();
          return SUPPORTED_EXTENSIONS.includes(ext) ||
            SUPPORTED_MEDIA_TYPES.includes(this.getMimeType(ext));
        })
        .map(f => new MemoryFile(f));
    }
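The scanning logic above calls this.getMimeType(ext), which the guide never defines. A minimal sketch of what such a helper might look like follows; the names getMimeType, extensionOf, and isIndexable, and the lookup table itself, are assumptions for illustration rather than existing OpenClaw code. It returns undefined for unknown extensions so the filter can reject them explicitly, and it avoids Node's path module so it runs anywhere.

```typescript
// Hypothetical lookup table; the entries mirror SUPPORTED_MEDIA_TYPES from Step 1.
const MIME_BY_EXTENSION = new Map<string, string>([
  [".jpg", "image/jpeg"],
  [".jpeg", "image/jpeg"],
  [".png", "image/png"],
  [".gif", "image/gif"],
  [".webp", "image/webp"],
  [".mp4", "video/mp4"],
  [".webm", "video/webm"],
  [".mp3", "audio/mpeg"],
  [".wav", "audio/wav"],
  [".ogg", "audio/ogg"],
  [".pdf", "application/pdf"],
]);

const TEXT_EXTENSIONS = new Set<string>([".md", ".markdown"]);

// Maps a lowercase extension such as ".jpg" to a MIME type, or undefined if unknown.
function getMimeType(ext: string): string | undefined {
  return MIME_BY_EXTENSION.get(ext);
}

// Returns the lowercase extension of a path, or "" for dotfiles and extensionless names.
function extensionOf(filePath: string): string {
  const lastSeparator = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  const base = filePath.slice(lastSeparator + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// True when the file is Markdown or one of the supported media types.
function isIndexable(filePath: string): boolean {
  const ext = extensionOf(filePath);
  return TEXT_EXTENSIONS.has(ext) || getMimeType(ext) !== undefined;
}
```

Lowercasing the extension before the lookup matters in practice, since cameras and phones often produce names such as RECEIPT.JPG.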
Step 2: Implement Media Content Extraction
Create src/memory/extractors/media-extractor.ts:
    import { readFile, stat } from 'fs/promises';
    import * as path from 'path';

    export interface MediaContent {
      mimeType: string;
      data: string; // base64 encoded
      sizeBytes: number;
    }

    export class MediaExtractor {
      // Gemini Embedding 2 has a 20MB input limit per content
      private readonly MAX_FILE_SIZE = 20 * 1024 * 1024;

      // True when the extension maps to a known media MIME type (called from Step 3)
      isSupported(filePath: string): boolean {
        return this.detectMimeType(filePath) !== 'application/octet-stream';
      }

      async extract(filePath: string): Promise<MediaContent> {
        const stats = await stat(filePath);

        if (stats.size > this.MAX_FILE_SIZE) {
          throw new Error(
            `File ${filePath} exceeds ${this.MAX_FILE_SIZE} bytes limit for embedding`
          );
        }

        const buffer = await readFile(filePath);
        const mimeType = this.detectMimeType(filePath);

        return {
          mimeType,
          data: buffer.toString('base64'),
          sizeBytes: stats.size
        };
      }

      // Falls back to application/octet-stream for unknown extensions
      private detectMimeType(filePath: string): string {
        const ext = path.extname(filePath).toLowerCase();
        const mimeMap: Record<string, string> = {
          '.jpg': 'image/jpeg',
          '.jpeg': 'image/jpeg',
          '.png': 'image/png',
          '.gif': 'image/gif',
          '.webp': 'image/webp',
          '.mp4': 'video/mp4',
          '.webm': 'video/webm',
          '.mp3': 'audio/mpeg',
          '.wav': 'audio/wav',
          '.ogg': 'audio/ogg',
          '.pdf': 'application/pdf'
        };
        return mimeMap[ext] || 'application/octet-stream';
      }
    }
Step 3: Update Content Router for Multimodal Handling
Modify src/memory/content-router.ts:
    // BEFORE
    async extractContent(file: MemoryFile): Promise<string> {
      // ... text-only handling (see Root Cause, item 2)
    }

    // AFTER
    async extractContent(file: MemoryFile): Promise<string | MediaContent> {
      const isMediaFile = this.mediaExtractor.isSupported(file.path);

      if (isMediaFile) {
        return this.mediaExtractor.extract(file.path);
      }

      const text = await fs.readFile(file.path, 'utf-8');
      return this.markdownParser.parse(text);
    }
Step 4: Build Multimodal Embedding Requests
Update src/memory/providers/gemini-embedding.ts:
    interface TextPart {
      text: string;
    }

    interface InlineDataPart {
      inline_data: {
        mime_type: string;
        data: string;
      };
    }

    type ContentPart = TextPart | InlineDataPart;

    interface EmbeddingTask {
      parts: ContentPart[];
      mediaType?: string;
    }

    export class GeminiEmbeddingProvider {
      async embed(task: EmbeddingTask): Promise<number[]> {
        const request = {
          model: this.config.model,
          content: { parts: task.parts },
          taskType: 'SEMANTIC_SIMILARITY'
        };

        const response = await this.geminiClient.embedContent(request);
        return response.embedding.values;
      }

      // Build multimodal embedding request with optional text context
      buildMultimodalRequest(mediaData: MediaContent, textContext?: string): EmbeddingTask {
        const parts: ContentPart[] = [];

        // Optional text caption/context
        if (textContext) {
          parts.push({ text: textContext });
        }

        // Inline media data
        parts.push({
          inline_data: {
            mime_type: mediaData.mimeType,
            data: mediaData.data
          }
        });

        return { parts, mediaType: mediaData.mimeType };
      }
    }
Step 5: Update SQLite Schema for Media Vectors
    -- Migration: add columns for media vector metadata
    ALTER TABLE memory_vectors ADD COLUMN media_type TEXT;
    ALTER TABLE memory_vectors ADD COLUMN mime_type TEXT;
    ALTER TABLE memory_vectors ADD COLUMN original_file_path TEXT;

    -- Updated schema (for reference; matches the table after the migration)
    CREATE TABLE memory_vectors (
      id TEXT PRIMARY KEY,
      vector BLOB NOT NULL,
      metadata JSON,
      memory_type TEXT,          -- 'text' or 'media'
      media_type TEXT,           -- 'image', 'video', 'audio', 'document'
      mime_type TEXT,            -- 'image/jpeg', 'application/pdf', etc.
      original_file_path TEXT,
      created_at INTEGER
    );

    -- Index for media-specific queries
    CREATE INDEX idx_memory_vectors_media ON memory_vectors(memory_type, media_type);
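The schema stores each embedding in a `vector BLOB` column, but the guide does not say how the numbers are serialized. One common choice, assumed here purely for illustration, is packed little-endian float32. The helper names vectorToBlob and blobToVector are hypothetical.

```typescript
const BYTES_PER_FLOAT32 = 4;

// Packs an embedding into little-endian float32 bytes (assumed BLOB encoding).
function vectorToBlob(vector: readonly number[]): Uint8Array {
  const buffer = new ArrayBuffer(vector.length * BYTES_PER_FLOAT32);
  const view = new DataView(buffer);
  vector.forEach((value, i) => view.setFloat32(i * BYTES_PER_FLOAT32, value, true));
  return new Uint8Array(buffer);
}

// Unpacks bytes written by vectorToBlob back into a number array.
function blobToVector(blob: Uint8Array): number[] {
  if (blob.byteLength % BYTES_PER_FLOAT32 !== 0) {
    throw new Error(`BLOB length ${blob.byteLength} is not a multiple of 4`);
  }
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const vector: number[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += BYTES_PER_FLOAT32) {
    vector.push(view.getFloat32(offset, true));
  }
  return vector;
}
```

Float32 halves storage compared with JavaScript's native 64-bit numbers at the cost of rounding each component, which is usually acceptable for similarity search. The length check also gives a cheap early signal for the dimension mismatches discussed under Related Errors.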
Step 6: Integrate into Memory Indexer Pipeline
Modify src/memory/indexer.ts:
    async indexFile(file: MemoryFile): Promise<void> {
      // Receiver name (contentRouter) is assumed; this line was lost in the source
      const content = await this.contentRouter.extractContent(file);

      if (typeof content === 'string') {
        // Existing text embedding logic
        await this.indexText(content, file.path);
      } else {
        // New multimodal embedding logic
        await this.indexMedia(content, file.path);
      }
    }

    async indexMedia(media: MediaContent, sourcePath: string): Promise<void> {
      // Guard condition reconstructed from the error message and Test 4 (assumption)
      if (this.config.provider !== 'gemini' ||
          !this.config.model?.includes('gemini-embedding-2')) {
        throw new Error(
          `Multimodal embedding requires gemini-embedding-2-preview. ` +
          `Current provider: ${this.config.provider}`
        );
      }

      const task = this.geminiProvider.buildMultimodalRequest(
        media,
        `Source file: ${path.basename(sourcePath)}`
      );

      const vector = await this.geminiProvider.embed(task);

      await this.vectorStore.insert({
        id: this.generateId(),
        vector,
        memory_type: 'media',
        media_type: this.categorizeMedia(media.mimeType),
        mime_type: media.mimeType,
        original_file_path: sourcePath,
        metadata: {
          size_bytes: media.sizeBytes,
          indexed_at: Date.now()
        }
      });
    }

    // Maps a MIME type to the coarse media_type column value
    private categorizeMedia(mimeType: string): string {
      if (mimeType.startsWith('image/')) return 'image';
      if (mimeType.startsWith('video/')) return 'video';
      if (mimeType.startsWith('audio/')) return 'audio';
      return 'document';
    }
Step 7: Update Configuration Schema
Modify src/config/schema.ts:
    export interface MemorySearchConfig {
      provider: 'gemini' | 'openai' | 'local';
      model?: string;

      // New configuration options
      multimodal?: {
        enabled: boolean;            // Default: true when using gemini-embedding-2-preview
        maxFileSizeBytes: number;    // Default: 20MB
        supportedTypes?: string[];   // Optional filter
      };

      // Indexing control
      indexMedia?: boolean;          // Default: false (opt-in until GA)
      mediaIndexPath?: string;       // Separate directory for media index
    }
Configuration Example
    # ~/.openclaw/config.yaml
    memorySearch:
      provider: gemini
      model: gemini-embedding-2-preview
      multimodal:
        enabled: true
        maxFileSizeBytes: 20971520 # 20MB
      indexMedia: true
      mediaIndexPath: ~/.openclaw/memory/media
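The "Default:" comments in the schema imply a resolution step that turns a partial user configuration into concrete settings. A minimal sketch of that step follows. It assumes the multimodal fields may be omitted so that defaults can apply, and the names resolveMediaSettings and ResolvedMediaSettings are hypothetical.

```typescript
// Mirrors the Step 7 fields; multimodal members are optional here (an assumption)
// so that the documented defaults have something to fill in.
interface MemorySearchConfig {
  provider: "gemini" | "openai" | "local";
  model?: string;
  multimodal?: {
    enabled?: boolean;
    maxFileSizeBytes?: number;
    supportedTypes?: string[];
  };
  indexMedia?: boolean;
  mediaIndexPath?: string;
}

interface ResolvedMediaSettings {
  enabled: boolean;
  maxFileSizeBytes: number;
  indexMedia: boolean;
}

const DEFAULT_MAX_FILE_SIZE_BYTES = 20 * 1024 * 1024; // 20MB, per the schema comment
const MULTIMODAL_MODEL = "gemini-embedding-2-preview";

// Applies the defaults described in the schema comments to a user configuration.
function resolveMediaSettings(config: MemorySearchConfig): ResolvedMediaSettings {
  const modelIsMultimodal =
    config.provider === "gemini" && config.model === MULTIMODAL_MODEL;
  return {
    // Default: true only when the multimodal model is selected
    enabled: config.multimodal?.enabled ?? modelIsMultimodal,
    maxFileSizeBytes: config.multimodal?.maxFileSizeBytes ?? DEFAULT_MAX_FILE_SIZE_BYTES,
    // Default: false (opt-in)
    indexMedia: config.indexMedia ?? false,
  };
}
```

Using `??` rather than `||` keeps an explicit `enabled: false` from being overridden by the model-based default.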
Verification
Verification Procedures
After implementing the multimodal indexing feature, verify correct operation using the following test scenarios:
Test 1: Media File Discovery
    # Place a test image in the memory directory
    $ cp /path/to/receipt.jpg ~/.openclaw/memory/

    # Verify file is discovered
    $ openclaw memory scan --verbose
    [DEBUG] Discovered file: /home/user/.openclaw/memory/receipt.jpg
    [DEBUG] Mime type: image/jpeg
    [DEBUG] Size: 245KB
    [INFO] Found 1 text files, 1 media files
Expected output should show media files being discovered, not filtered out.
Test 2: Direct Media Embedding
    # Store the image with semantic context
    $ openclaw memory store --file ~/.openclaw/memory/receipt.jpg \
        --note "Vendor receipt from Acme Corp meeting"

    # Verify embedding was created
    $ sqlite3 ~/.openclaw/memory.db \
        "SELECT id, memory_type, mime_type, original_file_path FROM memory_vectors
         WHERE original_file_path LIKE '%receipt.jpg%';"
    id         memory_type  mime_type   original_file_path
    vec_m_001  media        image/jpeg  /home/user/.openclaw/memory/receipt.jpg
Exit code 0 and a row with memory_type = media indicate successful embedding.
Test 3: Semantic Search for Media Content
    # Search using text that should match the embedded image
    $ openclaw memory search "Acme Corp vendor receipt"

    # Expected output format:
    # --- Text Results ---
    # [Score: 0.87] note.md: "Meeting notes with Acme Corp vendor…"
    # --- Media Results ---
    # [Score: 0.82] receipt.jpg (image/jpeg)
    #   Context: "Vendor receipt from Acme Corp meeting"
    #   File: ~/.openclaw/memory/receipt.jpg
Exit code 0 confirms mixed text+media results are returned.
Test 4: Multimodal Provider Guard
    # Test with unsupported provider configuration
    $ openclaw config set memorySearch.provider openai
    $ openclaw config set memorySearch.model text-embedding-3-small
    $ openclaw memory store --file ~/.openclaw/memory/receipt.jpg

    # Expected error:
    # [ERROR] Multimodal embedding requires gemini-embedding-2-preview. Current provider: openai
    # Exit code: 1
Test 5: File Size Limit Enforcement
    # Create a test file exceeding 20MB limit
    $ dd if=/dev/urandom of=large_video.mp4 bs=1M count=25
    $ openclaw memory store --file ./large_video.mp4

    # Expected output:
    # [ERROR] File ./large_video.mp4 exceeds 20971520 bytes limit for embedding
    # Exit code: 1
Test 6: End-to-End Video Embedding
    # Store a video file
    $ openclaw memory store --file ./meeting-recording.mp4 \
        --note "Product demo showing new dashboard feature"

    # Query by semantic concept
    $ openclaw memory search "dashboard feature presentation"

    # Expected: Video file should appear in media results
    # with relevance score above configured threshold (default: 0.7)
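Because text and media are embedded by the same model, both kinds of vectors can be ranked against a query in one pass and then filtered by the relevance threshold. The sketch below assumes cosine similarity as the scoring function, which the guide does not state; the IndexedVector shape and the rankByRelevance name are hypothetical. The 0.7 default comes from the test description above.

```typescript
interface IndexedVector {
  id: string;
  memoryType: "text" | "media";
  vector: number[];
}

interface ScoredResult {
  id: string;
  memoryType: "text" | "media";
  score: number;
}

// Cosine similarity of two equal-length vectors; returns 0 if either has zero norm.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Scores every indexed vector, drops those below the threshold, sorts best first.
function rankByRelevance(
  query: readonly number[],
  index: readonly IndexedVector[],
  threshold = 0.7
): ScoredResult[] {
  return index
    .map((entry) => ({
      id: entry.id,
      memoryType: entry.memoryType,
      score: cosineSimilarity(query, entry.vector),
    }))
    .filter((result) => result.score >= threshold)
    .sort((left, right) => right.score - left.score);
}
```

A linear scan like this is adequate for a personal memory store; larger indexes would normally use an approximate nearest-neighbour structure instead.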
Regression Test: Text-Only Mode Unaffected
    # Ensure existing text indexing still works
    $ printf '# Meeting Notes\n\nDiscussed Q4 roadmap.\n' > /tmp/notes.md
    $ openclaw memory store --file /tmp/notes.md
    $ openclaw memory search "quarterly roadmap"

    # Should return text result with score >= 0.75
    # Should NOT attempt multimodal embedding for .md files
Common Pitfalls
Implementation and Usage Traps
1. API Quota Consumption
Multimodal embeddings consume significantly more API quota than text-only embeddings.
- Risk: A directory with 100 images embedded individually could exhaust daily API limits within minutes.
- Mitigation: Implement batch embedding with rate limiting:
    // Rate limit: max 10 media embeddings per minute
    const EMBEDDING_RATE_LIMIT = {
      maxPerMinute: 10,
      maxConcurrent: 2
    };
- Detection: Watch for HTTP 429 (`RESOURCE_EXHAUSTED`) responses from the Gemini API and check quota usage in the Google Cloud console.
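The two limits in EMBEDDING_RATE_LIMIT could be enforced with a small sliding-window limiter. The following is a sketch under stated assumptions: the class name MediaEmbeddingLimiter is hypothetical, and the clock is passed in explicitly so the behaviour can be tested without waiting.

```typescript
const WINDOW_MS = 60_000;

// Sliding-window limiter: caps embedding starts per minute and requests in flight.
class MediaEmbeddingLimiter {
  private readonly maxPerMinute: number;
  private readonly maxConcurrent: number;
  private startTimes: number[] = [];
  private inFlight = 0;

  constructor(maxPerMinute: number, maxConcurrent: number) {
    this.maxPerMinute = maxPerMinute;
    this.maxConcurrent = maxConcurrent;
  }

  // Returns true and records a start if both limits allow another request now.
  tryAcquire(nowMs: number): boolean {
    // Forget starts that have left the 60-second window
    this.startTimes = this.startTimes.filter((t) => nowMs - t < WINDOW_MS);
    if (this.inFlight >= this.maxConcurrent) return false;
    if (this.startTimes.length >= this.maxPerMinute) return false;
    this.startTimes.push(nowMs);
    this.inFlight += 1;
    return true;
  }

  // Call when a request finishes, whether it succeeded or failed.
  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

A caller would invoke `tryAcquire(Date.now())` before each embedding request, wait and retry when it returns false, and call `release()` in a `finally` block.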
2. File Size vs. Token Budget
Gemini Embedding 2 has a 20MB per-request limit, but a file under that size may still exceed the model's token limit once it has been processed.
- Risk: A 19MB PDF might still fail if it contains complex layouts exceeding token limits.
- Mitigation: Pre-check file complexity:
    async validateForEmbedding(file: MemoryFile): Promise<ValidationResult> {
      const stats = await stat(file.path);

      if (stats.size > 20 * 1024 * 1024) {
        return { valid: false, reason: 'exceeds_20mb_limit' };
      }

      if (file.mimeType === 'application/pdf') {
        const pageCount = await this.countPdfPages(file.path);
        if (pageCount > 100) {
          return { valid: false, reason: 'pdf_exceeds_page_limit' };
        }
      }

      return { valid: true };
    }
3. Base64 Encoding Overhead
Base64 encoding increases file size by ~33%.
- Risk: A 15MB image becomes ~20MB when base64-encoded, potentially hitting limits mid-transfer.
- Mitigation: Cap raw files slightly below 15MB so that the encoded payload plus the surrounding JSON stays under the limit.
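The arithmetic behind that cap can be checked directly. Base64 emits 4 output characters for every 3 input bytes, rounding up to a whole group. The sketch below assumes the 20MB limit applies to the encoded request body, and the 1KB envelope reserve is an illustrative figure, not a documented one.

```typescript
const REQUEST_LIMIT_BYTES = 20 * 1024 * 1024; // 20MB limit quoted in this guide
const ENVELOPE_RESERVE_BYTES = 1024;          // assumed room for JSON keys and caption

// Size in bytes of the padded base64 encoding of rawBytes bytes of input.
function base64EncodedSize(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Largest raw input whose padded base64 encoding fits within encodedBudget bytes.
function maxRawBytes(encodedBudget: number): number {
  return Math.floor(encodedBudget / 4) * 3;
}

// A 15MB file encodes to exactly the 20MB limit, leaving no room for the envelope.
const encodedSizeOf15Mb = base64EncodedSize(15 * 1024 * 1024);

// Practical cap for raw files once the envelope reserve is subtracted.
const practicalRawCap = maxRawBytes(REQUEST_LIMIT_BYTES - ENVELOPE_RESERVE_BYTES);

console.log({ encodedSizeOf15Mb, practicalRawCap });
```

With these numbers the practical raw cap comes out 768 bytes under 15MB, which is why a flat 15MB threshold is too generous by itself.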
4. SQLite BLOB Storage Growth
Storing vector blobs alongside base64 media data can rapidly inflate the database.
- Risk: 1,000 images × 5MB average = 5GB of raw media, or roughly 6.7GB once base64-encoded.
- Mitigation: Store only vectors and metadata in the database; keep media files on disk and reference them by path:
    // Store only vector + metadata, not full base64
    interface MediaVectorRecord {
      id: string;
      vector: Float32Array;      // size depends on the model's output dimensionality (1536 × 4 bytes ≈ 6KB)
      metadata: {
        fileHash: string;        // SHA-256 for integrity
        filePath: string;        // Reference to original file
        thumbnailPath?: string;  // Optional preview
      };
    }
5. Context Window Truncation
When providing text context alongside media, long captions may be truncated by the embedding model’s context window.
- Risk: A 2,000-character description gets truncated, losing semantic nuance.
- Mitigation: Implement intelligent truncation that preserves key entities and actions.
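Entity-aware truncation needs language tooling that is out of scope here, but a simpler stand-in already avoids cutting a caption mid-sentence. The sketch below keeps whole sentences while they fit and falls back to a word boundary otherwise. The function name truncateCaption is hypothetical, and the character budget is only a proxy, since the model's real limit is measured in tokens.

```typescript
// Truncates a caption to at most maxChars, preferring sentence boundaries.
// This is a simple stand-in, not the entity-preserving truncation described above.
function truncateCaption(caption: string, maxChars: number): string {
  const text = caption.trim();
  if (text.length <= maxChars) return text;

  // Split into sentences ending in ., ! or ?, plus any unterminated tail
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [text];

  let kept = "";
  for (const sentence of sentences) {
    if ((kept + sentence).trimEnd().length > maxChars) break;
    kept += sentence;
  }
  kept = kept.trim();
  if (kept.length > 0) return kept;

  // No whole sentence fits: cut at the last space inside the budget, else hard cut
  const hardCut = text.slice(0, maxChars);
  const lastSpace = hardCut.lastIndexOf(" ");
  return (lastSpace > 0 ? hardCut.slice(0, lastSpace) : hardCut).trimEnd();
}
```

Putting the most important facts in the first sentence of a caption makes this kind of truncation far less lossy.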
6. macOS Spotlight Integration Conflict
On macOS, the .openclaw directory might be indexed by Spotlight, causing duplicate embeddings.
- Detection: Check for `com.apple.metadata:*` extended attributes on indexed files.
- Mitigation: Add `~/.openclaw/memory` to the Spotlight privacy list in System Settings so that Spotlight skips it.
7. Docker Volume Mount Permissions
When running OpenClaw in Docker, mounted volumes may have different permission models.
- Symptom: `EACCES: permission denied` when reading media files from mounted paths.
- Mitigation: Run the container as a UID/GID that can read the mounted files:
    docker run --user 1000:1000 \
      -v /host/media:/container/media:ro \
      openclaw/memory indexer
8. Network Timeout on Large Files
High-resolution images may timeout during base64 transfer to Gemini API.
- Configuration: Set extended timeouts:
    const GEMINI_TIMEOUT = {
      connectTimeout: 10000,  // 10s
      socketTimeout: 120000   // 2min for large media
    };
9. Memory Pressure During Batch Indexing
Loading multiple large images into memory simultaneously can exhaust the process heap.
- Risk: Out-of-memory errors when indexing entire media directories.
- Mitigation: Process in streaming fashion with garbage collection hints:
    async function* streamMediaFiles(directory: string) {
      for await (const file of globStream('**/*.{jpg,png,mp4}', { cwd: directory })) {
        // Embed one file at a time so only one buffer is live
        const result = await embedMedia(file);
        yield result;
        // global.gc is only defined when Node runs with --expose-gc
        if (global.gc) global.gc();
      }
    }
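When some concurrency is still wanted, a complementary approach is to group files into batches by total size so that the bytes held in memory at once stay under a budget. This is a sketch under stated assumptions: the SizedFile shape and the batchByBytes name are hypothetical, and file sizes are taken from a prior `stat` call.

```typescript
interface SizedFile {
  path: string;
  sizeBytes: number;
}

// Groups files, in order, into batches whose total size stays within budgetBytes.
// A single file larger than the budget is placed in a batch of its own.
function batchByBytes(files: readonly SizedFile[], budgetBytes: number): SizedFile[][] {
  const batches: SizedFile[][] = [];
  let current: SizedFile[] = [];
  let usedBytes = 0;

  for (const file of files) {
    // Close the current batch if adding this file would exceed the budget
    if (current.length > 0 && usedBytes + file.sizeBytes > budgetBytes) {
      batches.push(current);
      current = [];
      usedBytes = 0;
    }
    current.push(file);
    usedBytes += file.sizeBytes;
  }

  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each batch can then be embedded concurrently and released before the next one is loaded. Note that base64 encoding adds roughly a third on top of the raw sizes counted here.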
10. Mixed Provider Configuration
Users may configure text search with OpenAI but expect Gemini for media.
- Risk: Partial functionality or silent failures.
- Enforcement: Validate provider consistency:
    function validateMultimodalConfig(config: MemorySearchConfig): void {
      if (config.multimodal?.enabled &&
          !(config.model?.includes('gemini-embedding-2'))) {
        throw new ConfigurationError(
          'multimodal.enabled requires model containing "gemini-embedding-2"'
        );
      }
    }
Related Errors
Contextual References
- `UNSUPPORTED_MEDIA_TYPE`: Currently thrown when non-Markdown files are encountered during indexing. After implementation it should be thrown only for media types outside the supported list.
- `EMBEDDING_PROVIDER_MISMATCH`: Raised when `indexMedia: true` is set but provider is not Gemini. Aligns with provider validation in Step 7.
- `FILE_SIZE_EXCEEDED`: Proposed error code for files exceeding the 20MB limit. Includes file path and actual size in error metadata.
- `VECTOR_DIMENSION_MISMATCH`: Potential error if the dimensions returned by Gemini Embedding 2 differ from those of the existing vector store. This guide assumes 1536 dimensions; confirm the model's configured output dimensionality before indexing.
- `QUOTA_EXCEEDED`: API quota exhaustion from rapid multimodal indexing. Should include the `retryAfter` header value for exponential backoff.
- `MEDIA_EXTRACTION_FAILED`: Corruption or unsupported encoding within otherwise valid media files (e.g., corrupted JPEG).
- GitHub Issue #1842: Original request for multimodal memory search, predating this implementation guide.
- GitHub Issue #2107: Discussion on optimal chunking strategy for video frames when embedding into vector space.
- GitHub Discussion #892: Community workaround using external OCR plus text embedding, which loses fidelity compared to native multimodal embedding.
- Gemini Embedding 2 Documentation: Official reference for multimodal embedding request format and supported input types.