TrackIt

End-to-End Multimodal Video Search on AWS with Bedrock and OpenSearch

Author

Antoine Berger

Date Published

Building a video search engine that responds to natural-language queries and jumps directly to the right moment doesn’t require advanced research or a complex interface. The approach outlined below offers a practical, implementation-ready workflow that runs entirely on AWS. It uses TwelveLabs models on Amazon Bedrock to generate multimodal embeddings and Amazon OpenSearch Service to deliver fast vector and text-based retrieval. 

TL;DR

Videos are processed by a TwelveLabs model on Amazon Bedrock, which automatically slices them into short segments and converts each segment into a multimodal embedding. The embeddings are then indexed in OpenSearch.

At query time, the text input is embedded and used for two parallel lookups: a vector similarity search to find semantically related moments and a keyword search to capture direct textual matches. The two result lists are merged and re-ranked, producing a clean set of time-coded segments that can be jumped to immediately.

Key Terminology

  • Embedding: a numeric vector that captures the meaning of a video segment.
  • Vector search (k-NN): finds the k nearest vectors, meaning the most similar moments.
  • BM25: a classic keyword ranker that rewards exact, informative terms and downplays very common ones.
  • Fusion / Hybrid: combines two ranked lists (vector + keywords) into a single final ranking.
  • RRF (Reciprocal Rank Fusion): a simple fusion method that sums position-based votes from each list.
  • Cosine similarity: measures how aligned two vectors are, on a scale from -1 to 1 (1 means very similar, 0 means unrelated, negative values mean opposed).
  • HNSW: a fast, approximate nearest-neighbor index used under the hood for vector search.
  • Recall@10: the fraction of all relevant segments that appear in the top 10 (0 to 1).
  • nDCG@10: evaluates how well relevant results are ordered within the top 10 (closer to 1 means better).
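To make the first few terms concrete, here is a minimal sketch of cosine similarity and an exact (brute-force) k-NN lookup. It only illustrates the idea: OpenSearch uses an approximate HNSW index rather than a full scan, and the names cosineSimilarity and nearestSegments are hypothetical helpers, not part of any AWS or OpenSearch API.

```typescript
// Cosine similarity between two equal-length vectors (range: -1 to 1).
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Convention chosen for this sketch: a zero vector is similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact k-NN: score every segment against the query and keep the k best.
// A real index (HNSW) approximates this without scanning every vector.
export function nearestSegments(
  query: number[],
  segments: { id: string; vec: number[] }[],
  k: number
): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With toy two-dimensional vectors, a query of [1, 0] ranks a segment at [1, 0] ahead of one at [1, 1], and both ahead of one at [0, 1].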

From Raw Video to Embeddings

Indexing entire videos typically leads to poor search granularity—queries resolve to the full file rather than specific moments. To solve this, we rely on the native capabilities of the TwelveLabs Marengo model on Amazon Bedrock.

Instead of building a complex pre-processing pipeline to manually slice video files, we simply submit the full video asset to Bedrock. The model automatically segments the video into fixed windows (defaulting to roughly 5 seconds) and generates multimodal embeddings for each segment. This captures visual content, audio context, and underlying semantics in a single step, keeping the architecture lightweight.

Bedrock Integration

The integration follows standard AWS conventions using StartAsyncInvoke for video processing and InvokeModel for text queries.

import {
  BedrockRuntimeClient,
  InvokeModelCommand,
  StartAsyncInvokeCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({ region: process.env.AWS_REGION! });

// 1) Text query → synchronous (InvokeModel)
export async function embedTextQuery(query: string): Promise<number[]> {
  const body = JSON.stringify({ inputType: "text", inputText: query });
  const res = await bedrock.send(new InvokeModelCommand({
    modelId: process.env.MARENGO_INFERENCE_PROFILE_ID!, // e.g., "us.twelvelabs.marengo-embed-2-7-v1:0"
    contentType: "application/json",
    accept: "application/json",
    body,
  }));
  // res.body is a byte array: decode it to text before parsing the JSON payload
  return JSON.parse(new TextDecoder().decode(res.body)).embedding as number[];
}

// 2) Video clip from S3 → asynchronous (StartAsyncInvoke)
export async function startVideoEmbedding(s3Uri: string) {
  return bedrock.send(new StartAsyncInvokeCommand({
    modelId: process.env.MARENGO_MODEL_ID!, // e.g., "twelvelabs.marengo-embed-2-7-v1:0"
    modelInput: {
      inputType: "video",
      mediaSource: { s3Location: { uri: s3Uri } },
      embeddingOption: ["visual-text", "audio"],
    },
    // The embedding output is written as JSON under this S3 prefix
    outputDataConfig: { s3OutputDataConfig: { s3Uri: process.env.BEDROCK_OUTPUT_S3! } },
  }));
}

According to AWS guidance, InvokeModel is appropriate for text and image queries, while StartAsyncInvoke is intended for video, audio, or larger jobs. Two practices help avoid issues later: use the same embedding model for both indexing and querying, and ensure the vector dimension matches the OpenSearch index mapping.

Index Design in OpenSearch (Vector and Text)

For this architecture, we utilize Amazon OpenSearch Serverless to remove the operational overhead of node management. While a provisioned cluster offers granular control over sharding, the Serverless option is preferred here for its 'deploy-and-forget' simplicity.


Why not just store vectors in S3?

A common question is why we don't use a lighter solution, like storing embeddings in S3 or using a basic vector store. The answer is hybrid search. To build a truly robust video search, we need to combine semantic understanding (vectors) with exact text matching (BM25 for transcripts/OCR). OpenSearch provides a single engine that handles both effectively, allowing us to execute the custom RRF fusion logic described below. That is difficult or impossible to achieve with simple S3-based retrieval or abstracted "Knowledge Base" wrappers.


Below is the target document structure (what we store) and the corresponding index mapping (how we define the fields).


Document Structure: The raw output from Bedrock is transformed into this compact record for each segment. The embedding itself is stored alongside these fields in mm_vec and is omitted here for brevity:

{
  "video_id": "vid_123",
  "segment_id": "vid_123_s0042",
  "start_ms": 40000,
  "end_ms": 45000
}


Index Mapping: We configure the index to handle both exact keywords (for IDs) and vector similarity (for the embedding). A minimal index mapping looks like this:

{
  "settings": { "index": { "knn": true } },
  "mappings": {
    "properties": {
      "video_id": { "type": "keyword" },
      "segment_id": { "type": "keyword" },
      "start_ms": { "type": "integer" },
      "end_ms": { "type": "integer" },
      "mm_vec": { "type": "knn_vector", "dimension": 1024 }
    }
  }
}

The dimension value (1024) must match the embedding size produced by the chosen model. Optional attributes such as title, tags, transcript, or ocr_text can be added whenever they become available.
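A dimension mismatch usually only shows up as an indexing error, so it can help to validate each embedding before it leaves the worker. The guard below is a minimal sketch under that assumption. INDEX_DIMENSION and assertEmbeddingShape are hypothetical names, and the constant should be kept in sync with the "dimension" value in the mapping.

```typescript
// Hypothetical constant: must equal "dimension" in the index mapping.
const INDEX_DIMENSION = 1024;

// Returns the vector unchanged if it is a well-formed embedding, throws otherwise.
export function assertEmbeddingShape(
  vec: unknown,
  expected: number = INDEX_DIMENSION
): number[] {
  if (!Array.isArray(vec)) {
    throw new Error("Embedding is not an array");
  }
  if (vec.length !== expected) {
    throw new Error(`Expected ${expected} dimensions, got ${vec.length}`);
  }
  if (!vec.every((x) => typeof x === "number" && Number.isFinite(x))) {
    throw new Error("Embedding contains non-numeric or non-finite values");
  }
  return vec as number[];
}
```

Calling the guard on both paths (query embeddings and segment embeddings) also catches the case where the query side was accidentally switched to a different model.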

Filling OpenSearch (The Handoff)

This is the point where embeddings become searchable documents. The flow follows three straightforward steps:

  1. StartAsyncInvoke writes the embedding output to S3.
    The output is a JSON file containing a list of embeddings (one per segment) with associated timecodes.
  2. A lightweight worker on Lambda or ECS reads that JSON and converts each entry into a document.
  3. The worker uses the Bulk API to upsert the entire batch into OpenSearch.

Short example:

// Read Bedrock output from S3 → bulk index to OpenSearch (short & robust)
// Assumes `s3` (S3Client), `opensearch` (OpenSearch client), `Bucket`, `Key`,
// and `index` are already defined by the surrounding worker.
import { GetObjectCommand } from "@aws-sdk/client-s3";

const { Body } = await s3.send(new GetObjectCommand({ Bucket, Key }));
if (!Body) throw new Error(`Empty S3 object: s3://${Bucket}/${Key}`);
const out = JSON.parse(await Body.transformToString());

// The output contains an array of embeddings generated from the auto-segmentation
const items = out.embeddings ?? [];
const videoId = out.videoId ?? Key.replace(/\.[^.]+$/, "");

const body = items.flatMap((e: any, i: number) => {
  // Create a stable ID for each segment
  const segId = `${videoId}_s${String(i).padStart(5, "0")}`;

  // Each document is preceded by its bulk action line
  return [
    { index: { _index: index, _id: segId } },
    {
      video_id: videoId,
      segment_id: segId,
      start_ms: Math.round((e.startSec ?? 0) * 1000),
      // Handle potential casing differences in output keys
      end_ms: Math.round(((e.endSec ?? e.endsec) ?? 0) * 1000),
      mm_vec: e.embedding
    }
  ];
});

if (body.length > 0) {
  const { body: result } = await opensearch.bulk({ body });
  // The Bulk API reports per-item failures in the response rather than throwing
  if (result.errors) throw new Error(`Bulk indexing had failures for ${videoId}`);
}

Trigger the worker through S3 Events or EventBridge so indexing runs as soon as the embedding output arrives. Additional metadata such as title, tags, transcript, or OCR text can be incorporated later and should be stored as text fields to support keyword search.

Indexing through the bulk API is straightforward. Keep document IDs stable, for example by reusing segment_id as the document _id, so that reprocessing a video overwrites its existing segments instead of creating duplicates.
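For long videos, a single bulk request can grow large, so one option is to split the action/document pairs into smaller batches before sending them. The helper below is a minimal sketch of that idea. chunkBulkBody and the default of 500 documents per batch are assumptions for illustration, not values taken from AWS or OpenSearch guidance.

```typescript
type BulkLine = Record<string, unknown>;

// Splits a bulk body (alternating action line, document line) into batches
// that each hold at most `docsPerBatch` documents, never separating a pair.
export function chunkBulkBody(
  body: BulkLine[],
  docsPerBatch = 500 // hypothetical default, tune for your payload size
): BulkLine[][] {
  if (body.length % 2 !== 0) {
    throw new Error("Bulk body must contain action/document pairs");
  }
  if (!Number.isInteger(docsPerBatch) || docsPerBatch < 1) {
    throw new Error("docsPerBatch must be a positive integer");
  }
  const linesPerBatch = docsPerBatch * 2;
  const batches: BulkLine[][] = [];
  for (let i = 0; i < body.length; i += linesPerBatch) {
    batches.push(body.slice(i, i + linesPerBatch));
  }
  return batches;
}
```

Each batch can then be sent with its own opensearch.bulk({ body: batch }) call. Because the document IDs are stable, a batch that fails can be retried without creating duplicates.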

Searching: Query Embeddings, Hybrid Retrieval, and Fusion

A natural-language query such as “where the drone flies over the bridge at sunset” benefits from two complementary signals:

  1. Semantic retrieval: Embed the query with the same TwelveLabs model used during indexing and run a k-NN search on the mm_vec field. This surfaces segments that match the meaning of the query, even without explicit keyword overlap.
  2. Lexical retrieval: Run BM25 across available text fields such as title, tags, transcript, or OCR text. This highlights exact matches, named entities, and other literal cues.

Combining both of these methods is called hybrid retrieval. Each method produces its own ranked list of results, and these lists need to be merged into one. A simple way to do that is RRF (Reciprocal Rank Fusion). It works by giving each result points based on how high it appears in each list, then adding those points together so the strongest combined candidates rise to the top.
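Written out, the usual form of the RRF score for a segment d is shown here, where rank_i(d) is the 1-based position of d in list i and K is a smoothing constant. A segment that is missing from a list simply contributes nothing for that list.

```latex
\mathrm{RRF}(d) \;=\; \sum_{i \,\in\, \{\text{vector},\ \text{keyword}\}} \frac{1}{K + \operatorname{rank}_i(d)}
```

As a worked example with K = 60: a segment ranked 1st in the vector list and 3rd in the keyword list scores 1/61 + 1/63 ≈ 0.0323, while a segment ranked 2nd in the vector list only scores 1/62 ≈ 0.0161. Appearing in both lists therefore counts for more than a slightly better position in one.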

Example flow:

// 1) Turn the user's text into a vector in the SAME space as your video segments.
const qvec = await embedTextQuery(query);

// 2) Get two candidate lists in parallel: semantic (vectors) and lexical (keywords).
const [{ body: v }, { body: k }] = await Promise.all([
  opensearch.search({
    index,
    body: { size: 100, query: { knn: { mm_vec: { vector: qvec, k: 100 } } } }
  }),
  opensearch.search({
    index,
    body: { size: 200, query: { multi_match: { query, fields: ["title^2", "tags", "transcript", "ocr_text"] } } }
  })
]);

// 3) Fuse (combine) the two ranked lists with RRF (simple and strong).
type Hit = { _id: string };

function rrfFuse(vecHits: Hit[], kwHits: Hit[], K = 60): string[] {
  const score = new Map<string, number>();
  // Each list votes 1 / (K + rank), where rank is the 1-based position (i + 1)
  const add = (hits: Hit[]) => hits.forEach((h, i) =>
    score.set(h._id, (score.get(h._id) ?? 0) + 1 / (K + i + 1)));
  add(vecHits);
  add(kwHits);
  return [...score.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id); // top segment IDs
}

const topSegmentIds = rrfFuse(v.hits.hits, k.hits.hits);

Tip: Always return time-coded segments so the user can jump directly to the right moment. Semantic search captures intent. Keyword search captures literal matches. Hybrid retrieval, combined with a simple merge method like RRF, generally gives a more reliable ranking than either method alone.
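The fused list only contains segment IDs, so one more step is needed to turn it into something a player can use. The sketch below assumes each search hit carries the indexed fields in _source, as in the document structure above. SegmentHit, formatTimecode, and toTimecodedResults are hypothetical names introduced for illustration.

```typescript
interface SegmentSource {
  video_id: string;
  segment_id: string;
  start_ms: number;
  end_ms: number;
}

interface SegmentHit {
  _id: string;
  _source: SegmentSource;
}

// Formats milliseconds as HH:MM:SS (sub-second precision is dropped).
export function formatTimecode(ms: number): string {
  const totalSec = Math.floor(ms / 1000);
  const h = Math.floor(totalSec / 3600);
  const m = Math.floor((totalSec % 3600) / 60);
  const s = totalSec % 60;
  return [h, m, s].map((n) => String(n).padStart(2, "0")).join(":");
}

// Resolves fused segment IDs back to time-coded results, preserving fused order.
export function toTimecodedResults(
  fusedIds: string[],
  vecHits: SegmentHit[],
  kwHits: SegmentHit[],
  limit = 10
) {
  const byId = new Map<string, SegmentSource>();
  for (const h of [...vecHits, ...kwHits]) byId.set(h._id, h._source);
  return fusedIds
    .filter((id) => byId.has(id)) // skip IDs with no matching hit
    .slice(0, limit)
    .map((id) => {
      const src = byId.get(id)!;
      return {
        videoId: src.video_id,
        startMs: src.start_ms,
        endMs: src.end_ms,
        label: `${formatTimecode(src.start_ms)}-${formatTimecode(src.end_ms)}`,
      };
    });
}
```

A front end can then seek the player to startMs for the chosen result. For the sample document shown earlier (40000 to 45000 ms), the label is 00:00:40-00:00:45.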

Architecture Recap

The infrastructure runs entirely on AWS, and the core workflow stays intact as new components are introduced. Transcripts from ASR, OCR output, or scene detection can be added later without changing the underlying design.

Evaluation, Scaling, and Operations

Evaluation should stay simple but grounded. A few practical guidelines:

  • Testing: Build a small query set of 20 to 50 examples. Track Recall@10 and nDCG@10, and compare vector-only, BM25-only, and hybrid retrieval.
  • Latency: Run vector search and BM25 in parallel. Tune k (around 100 works well), adjust ef_search if needed, keep segments in the 3 to 5 second range, and cache embeddings for frequent queries.
  • Observability: Log timings for each stage, monitor P95 latency, and alert on error rates.
  • Upgrades: Tag the embedding_version, reindex into a shadow index, and switch via an alias.
  • Fallback: If embedding generation fails, return BM25-only results.
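The two metrics in the testing guideline are easy to compute by hand for a small query set. The sketch below assumes binary relevance (a segment is either relevant to a query or not) and unique IDs in each ranked list. recallAtK and ndcgAtK are hypothetical helper names.

```typescript
// Recall@k: fraction of the relevant segments that appear in the top k.
export function recallAtK(ranked: string[], relevant: Set<string>, k = 10): number {
  if (relevant.size === 0) return 0;
  const found = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return found / relevant.size;
}

// nDCG@k with binary gains: each relevant hit at 0-based position i adds
// 1 / log2(i + 2), normalized by the score of an ideal ordering.
export function ndcgAtK(ranked: string[], relevant: Set<string>, k = 10): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

Running both functions over the same query set for the vector-only, BM25-only, and fused rankings, then averaging per method, gives the comparison described in the testing guideline.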


Conclusion

This workflow establishes a clear and reproducible path: raw video → auto-segmented embeddings (via Bedrock) → OpenSearch vectors + optional text → hybrid search → jump-to-the-moment.

Everything runs entirely on AWS and scales naturally as the catalog grows, without relying on patterns taken from other products. Additional components such as ASR, OCR, scene detection, or a generative layer for summaries and Q&A can be added when needed without altering the core design.