TrackIt

Imagine building a video search engine for a streaming platform. Users may enter queries such as "show me all the penalty kicks" or "find scenes where characters argue." But how should one go about choosing the right AI model to power that search? And how can its performance be measured in a reliable way?

This is the challenge TrackIt faced when evaluating Twelve Labs' video understanding models on Amazon Bedrock. The goal was to compare two fundamentally different approaches: Pegasus, which generates textual descriptions from videos, and Marengo, which converts videos and queries into mathematical representations (embeddings) for similarity search.

Two Approaches to Video Understanding

Approach 1: Text Generation (Pegasus)

Pegasus works like a storyteller. Given a video, it generates a detailed textual description, for example: “In this scene, Jim places Dwight’s stapler in Jell-O as part of an elaborate office prank.” It is an excellent choice for summaries and narrative descriptions. The question, however, is whether this type of model can reliably identify the specific scenes requested in a search query.

Approach 2: Embedding Search (Marengo)

Marengo takes a different approach. It converts both videos and text queries into numerical vectors known as embeddings. Search then becomes a similarity problem: a query like “penalty kick” is converted into an embedding and compared with embeddings generated from video segments to find the closest matches. Two versions were evaluated: the stable 2.7 and the newer 3.0.
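
The similarity comparison described above can be illustrated with a minimal sketch. It assumes cosine similarity as the comparison metric (the article does not name the metric) and uses made-up three-dimensional vectors in place of real embeddings; the clip names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real embeddings have far more dimensions.
query = [0.9, 0.1, 0.0]  # stands in for the embedding of "penalty kick"
segments = {
    "clip_penalty": [0.8, 0.2, 0.1],
    "clip_interview": [0.1, 0.9, 0.3],
}

scores = {name: cosine_similarity(query, vec) for name, vec in segments.items()}
best = max(scores, key=scores.get)  # the segment closest to the query
```

Search then reduces to sorting segments by this score and returning the top matches.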

Building the Evaluation Framework

The Pegasus Pipeline: Judge-Based Evaluation

To evaluate Pegasus, a five-step pipeline was implemented:

Dataset Preparation: Evaluation videos are converted into a format compatible with Amazon Bedrock.
Inference Generation: Pegasus processes each video and generates scene descriptions.
Automatic Upload: Results are formatted and uploaded to S3 storage.
Judge Evaluation: Claude AI (another language model) acts as an impartial judge, scoring how well Pegasus identified the requested scenes.
Results Analysis: Scores are aggregated and performance metrics are calculated.

The architecture follows a modular design. Each step runs as a separate Python script that reads from and writes to predefined locations, making the pipeline easier to debug, modify, and extend.

The Marengo Pipeline: Three-Stage Retrieval

Evaluating embedding models requires a different approach. Instead of generating descriptions directly, the system retrieves video segments based on similarity between embeddings:

Video Embedding (Asynchronous): Embeddings are precomputed for all video clips. This step runs once and the results are stored for later retrieval.
Query Processing (Synchronous): Text queries are converted into embeddings and compared with the stored video embeddings to identify the most similar clips.
Quality Assessment: Pegasus generates descriptions for the retrieved clips, and Claude AI evaluates their relevance to the original query.
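
The three stages above can be sketched end to end as follows. This is an illustration under stated assumptions, not the framework's actual code: the stored embeddings are a plain dictionary, cosine similarity is assumed as the metric, and the `describe` and `judge` callables are hypothetical stand-ins for the Pegasus and Claude calls.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def retrieve_top_k(query_vec, stored, k=2):
    """Stage 2: rank precomputed clip embeddings against the query embedding."""
    scored = ((cosine(query_vec, vec), clip_id) for clip_id, vec in stored.items())
    return heapq.nlargest(k, scored)  # list of (similarity, clip_id), best first

def assess(retrieved, describe, judge, query):
    """Stage 3: describe each retrieved clip, then ask a judge whether it is relevant."""
    return [
        {"clip": clip_id, "similarity": sim, "relevant": judge(query, describe(clip_id))}
        for sim, clip_id in retrieved
    ]

# Stage 1 stand-in: embeddings computed once and stored (toy 2-D values).
stored = {"clip_a": [1.0, 0.0], "clip_b": [0.7, 0.7], "clip_c": [0.0, 1.0]}

# Hypothetical descriptions and a keyword "judge", used here instead of model calls.
descriptions = {
    "clip_a": "a penalty kick is taken",
    "clip_b": "players warm up",
    "clip_c": "a press conference",
}
top = retrieve_top_k([1.0, 0.1], stored, k=2)
report = assess(top, descriptions.get, lambda q, d: "penalty" in d, "penalty kick")
```

Keeping the describer and the judge as injected callables means the retrieval logic can be tested without invoking any model.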

This pipeline uses one model (Pegasus) to help evaluate another (Marengo), so each retrieved clip is judged on its described content rather than on its similarity score alone.

The Results: Surprising Findings

The Percentile Problem

Our initial results showed something puzzling: Marengo 3.0 scored 0% on sports queries while Marengo 2.7 scored perfectly. Was the newer model broken?

Figure: Marengo percentile-based analysis

Not quite. The result turned out to be a measurement artifact. Marengo 3.0 uses a different scale for similarity scores, like measuring temperature in Celsius versus Fahrenheit. When we switched to percentile-based evaluation (looking at relative rankings rather than absolute scores), both versions performed perfectly.
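
The difference between the two evaluation modes can be shown with a small sketch. The scores below are hypothetical values chosen to sit on two different scales, the 0.3 cutoff is an assumed example, and the nearest-rank percentile used here is one of several reasonable definitions; none of these are taken from the actual evaluation.

```python
import math

def passes_threshold(scores, threshold):
    """Absolute filter: keep clips whose raw score meets a fixed cutoff."""
    return [clip for clip, s in scores.items() if s >= threshold]

def top_percentile(scores, percentile):
    """Relative filter: keep clips at or above a percentile of this model's own scores."""
    ranked = sorted(scores.values())
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))  # nearest-rank method
    cutoff = ranked[rank - 1]
    return [clip for clip, s in scores.items() if s >= cutoff]

# Hypothetical scores: same ordering of clips, different numeric scale.
model_27 = {"goal": 0.41, "penalty": 0.38, "crowd": 0.22, "interview": 0.18}
model_30 = {"goal": 0.19, "penalty": 0.17, "crowd": 0.09, "interview": 0.06}

abs_27 = passes_threshold(model_27, 0.3)  # two clips pass
abs_30 = passes_threshold(model_30, 0.3)  # nothing passes on the lower scale
rel_27 = top_percentile(model_27, 75)
rel_30 = top_percentile(model_30, 75)     # same clips as rel_27
```

With a fixed cutoff the lower-scale model appears to return nothing, while the percentile filter selects the same clips from both models because it depends only on their relative order.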

Key Performance Metrics

The models performed as follows:

Pegasus 1.2 (Text Generation)

Generates detailed, coherent descriptions
Only 25% accuracy in identifying specific requested scenes
20% failure rate on longer videos
Best for: General video summaries, not precision search

Marengo 2.7 (Embeddings)

Perfect retrieval accuracy (100%)
Higher confidence scores (0.34 average similarity)
Zero failures
Best for: Production video search systems

Marengo 3.0 (Embeddings)

Perfect ranking ability (when measured correctly)
Lower absolute scores (0.14 average)
Slightly faster (3% improvement)
Trade-off: Speed vs. confidence calibration

Technical Architecture: Building for Scale

The evaluation framework incorporated several design patterns worth highlighting:

Configuration-Driven Design

Each script starts with a CONFIG dictionary containing model IDs, S3 buckets, and file paths. This makes it easy to switch between models or datasets without modifying the code.
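
A configuration block of this kind might look like the sketch below. The key names, bucket name, and paths are illustrative assumptions, and the model ID is a placeholder rather than a real Bedrock identifier.

```python
# Hypothetical CONFIG layout; every value here is a placeholder.
CONFIG = {
    "model_id": "PLACEHOLDER_MODEL_ID",
    "s3_bucket": "example-eval-bucket",
    "input_prefix": "videos/",
    "output_path": "results/pegasus.jsonl",
}

def output_uri(config):
    """Build the S3 URI that a script would write its results to."""
    return f"s3://{config['s3_bucket']}/{config['output_path']}"

def with_overrides(config, **overrides):
    """Return a copy of config with some keys replaced, rejecting unknown keys."""
    unknown = set(overrides) - set(config)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**config, **overrides}

# Switching datasets or models becomes a data change, not a code change.
marengo_config = with_overrides(CONFIG, output_path="results/marengo.jsonl")
```

Rejecting unknown keys catches typos such as `ouput_path` early, before a long-running job writes to the wrong place.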

Asynchronous Processing

Video embedding is computationally expensive. The framework uses Amazon Bedrock's async invocation pattern: submit jobs, poll for completion, then download results. This allows multiple videos to be processed in parallel.
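
The submit-and-poll pattern can be sketched as below. The method and field names (`start_async_invoke`, `get_async_invoke`, `invocationArn`, `status`) follow the boto3 `bedrock-runtime` client as best understood and should be verified against the current documentation; the model input payload is model-specific and deliberately left to the caller. A stub client is used so the sketch runs without AWS access, and the polling defaults are arbitrary.

```python
import time

def run_async_job(client, model_id, model_input, s3_uri,
                  poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Submit one async job, poll until it finishes, and return the final job record."""
    started = client.start_async_invoke(
        modelId=model_id,
        modelInput=model_input,  # payload shape depends on the model
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": s3_uri}},
    )
    arn = started["invocationArn"]
    for _ in range(max_polls):
        job = client.get_async_invoke(invocationArn=arn)
        if job["status"] == "Completed":
            return job  # results are now available under s3_uri
        if job["status"] == "Failed":
            raise RuntimeError(job.get("failureMessage", "async invocation failed"))
        sleep(poll_seconds)
    raise TimeoutError(f"job {arn} did not finish after {max_polls} polls")

class StubClient:
    """Offline stand-in for boto3.client("bedrock-runtime"); completes on the third poll."""
    def __init__(self):
        self.polls = 0

    def start_async_invoke(self, **kwargs):
        return {"invocationArn": "arn:stub:job-1"}

    def get_async_invoke(self, invocationArn):
        self.polls += 1
        status = "Completed" if self.polls >= 3 else "InProgress"
        return {"invocationArn": invocationArn, "status": status}

stub = StubClient()
job = run_async_job(stub, "PLACEHOLDER_MODEL_ID", {}, "s3://example-bucket/out/",
                    sleep=lambda seconds: None)  # skip real waiting in the demo
```

To process several videos in parallel, one option is to submit every job first and collect the invocation ARNs, then poll them in a single loop rather than waiting on each job in turn.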

Unified Reporting

A master script (generate-report.py) combines results from all models using Claude Opus 4.5 to create comprehensive reports. It reads multiple JSONL files, aggregates metrics, and produces both detailed analysis and CSV summaries.
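
The aggregation part of such a report can be sketched as follows. The record fields (`model`, `score`) and the CSV columns are assumptions, since the article does not show the JSONL schema, and the Claude-written narrative step is omitted. In-memory streams stand in for the JSONL files.

```python
import csv
import io
import json
from collections import defaultdict

def aggregate(jsonl_streams):
    """Read JSON Lines from several streams and compute per-model count and mean score."""
    scores = defaultdict(list)
    for stream in jsonl_streams:
        for line in stream:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            scores[record["model"]].append(record["score"])
    return {m: {"n": len(s), "mean_score": sum(s) / len(s)} for m, s in scores.items()}

def to_csv(summary):
    """Render the summary as CSV text, one row per model, sorted by model name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["model", "n", "mean_score"])
    for model in sorted(summary):
        row = summary[model]
        writer.writerow([model, row["n"], f"{row['mean_score']:.2f}"])
    return buffer.getvalue()

# In-memory stand-ins for two result files (hypothetical records).
pegasus_file = io.StringIO('{"model": "pegasus", "score": 1}\n{"model": "pegasus", "score": 0}\n')
marengo_file = io.StringIO('{"model": "marengo-2.7", "score": 1}\n')

summary = aggregate([pegasus_file, marengo_file])
csv_text = to_csv(summary)
```

With real files, the same functions accept the objects returned by `open(path)`, since those also iterate line by line.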

Cost Tracking

Every script includes cost estimation based on video duration and token usage. This helps teams budget their AI expenses, which is crucial when processing hours of video content.
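
A cost estimate of this kind is simple arithmetic over duration and token counts. In the sketch below, the rate names and all prices are invented for illustration and are not actual Amazon Bedrock pricing; real rates should be taken from the provider's pricing page.

```python
# Made-up rates for illustration only; NOT real pricing.
HYPOTHETICAL_RATES = {
    "per_video_minute": 0.05,
    "per_1k_input_tokens": 0.003,
    "per_1k_output_tokens": 0.015,
}

def estimate_cost(video_seconds, input_tokens, output_tokens, rates):
    """Estimate one invocation's cost from video duration and token usage."""
    video_cost = video_seconds / 60 * rates["per_video_minute"]
    token_cost = (
        input_tokens / 1000 * rates["per_1k_input_tokens"]
        + output_tokens / 1000 * rates["per_1k_output_tokens"]
    )
    return round(video_cost + token_cost, 4)

# A 10-minute video with 2,000 input tokens and 1,000 output tokens.
cost = estimate_cost(600, 2000, 1000, HYPOTHETICAL_RATES)
```

Summing these estimates across a dataset before launching a run gives a rough budget figure ahead of time.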

Lessons Learned

1. Metrics Matter

The mAP (mean Average Precision) score of 0 for Marengo 3.0 was misleading. After implementing percentile-based evaluation, it became clear that both models were equally effective at ranking results; they simply used different score scales.
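
For reference, average precision and mAP can be computed as in the sketch below. It also shows one way a score of 0 can arise despite a perfect ordering, under the assumption that a fixed similarity cutoff is applied before scoring; the article does not spell out the exact mechanism, and the scores and the 0.3 cutoff are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, clip in enumerate(ranked_ids, start=1):
        if clip in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

def mean_average_precision(runs):
    """Average the AP over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def ranked_clips(scores, threshold=None):
    """Sort clips by score, optionally dropping those below a fixed cutoff first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [clip for clip, s in ordered if threshold is None or s >= threshold]

# Hypothetical low-scale scores with a perfect ordering of the relevant clips.
scores = {"goal": 0.19, "crowd": 0.09, "penalty": 0.17}
relevant = {"goal", "penalty"}

ap_ranking_only = average_precision(ranked_clips(scores), relevant)
ap_with_cutoff = average_precision(ranked_clips(scores, threshold=0.3), relevant)
```

The ranking-only value is 1.0 because both relevant clips come first, while the cutoff version is 0.0 because no clip survives the filter. The ordering is identical in both cases; only the score scale relative to the cutoff differs.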

2. Different Models, Different Strengths

Text generation models excel at creating human-readable content but struggle with precision tasks. Embedding models are fantastic for search but require careful threshold tuning.

3. Production Considerations

Higher similarity scores are not just numbers. They affect:

API confidence thresholds
Vector database performance
User trust in results

4. Evaluation is Iterative

Our framework evolved from simple metrics to dual evaluation approaches as we discovered edge cases and measurement artifacts.

The Verdict

For production video search systems, Marengo 2.7 emerged as the winner. Although both Marengo versions have perfect ranking ability, 2.7's better-calibrated confidence scores make it more suitable for real-world applications where results must be filtered.

Figure: Choosing between Pegasus and Marengo

Extensible Evaluation Framework

By separating evaluation into modular components and using multiple metrics, the framework provides a foundation for continuous model assessment. The code architecture, with its configuration-driven design and clear separation of concerns, makes it easy to add new models or evaluation criteria.

Technical Note

For readers interested in the implementation details, the framework consists of the following:

Figure: Evaluation framework implementation details

Pegasus Pipeline: 5 Python scripts for end-to-end evaluation
Marengo Pipeline: 3-stage evaluation with embedding comparison
Unified Reporting: Automated report generation using Claude Opus 4.5
Cost Tracking: Built-in estimation for all model invocations
Dual Metrics: Both threshold-based and percentile-based evaluation

The modular architecture allows teams to adapt the framework for their own model evaluation needs, whether for video, text, or other AI applications.

Readers can access the code here.

Evaluating Video AI for Search: TwelveLabs Pegasus vs. Marengo