TrackIt

Sony Ci: Client Case Study

Author

Chris Marchitelli

Enabling Semantic Video Search and AI-Driven Metadata for Sony’s Ci Media Cloud

Customer Challenge

Sony sought to introduce semantic intelligence into its Ci Media Cloud asset management and collaboration service to improve discoverability across its customers’ expanding media libraries. As content volumes increased to millions of assets, traditional keyword search and manual tagging approaches created friction in locating relevant assets and specific moments within long-form videos. This slowed editorial and downstream monetization workflows and made it difficult for teams to quickly identify and reuse the right content.

The objective was to build a proof-of-concept enabling users of Sony’s Ci Media Cloud to search across video assets by meaning rather than by manually entered tags and structured metadata. Content teams needed to retrieve precise segments based on natural language intent, visual similarity, or contextual cues, without relying on time-consuming and error-prone human annotation.

At the same time, the solution had to:

  • Automatically generate consistent, high-quality metadata at scale
  • Support multimodal search using text and image-based queries
  • Maintain strict workspace-level access controls across users and networks
  • Integrate seamlessly into existing ingestion workflows
  • Scale without introducing infrastructure management overhead

Sony Ci required an intelligent, automated layer capable of deeply interpreting video content, indexing it efficiently, and returning precise results with enterprise-grade reliability and security.

Implementation

TrackIt designed and deployed a fully serverless, AI-powered video intelligence workflow on AWS.

At the core of the solution:

  • Video understanding models
    • TwelveLabs Marengo generated multimodal embeddings capturing visual, audio, and semantic meaning of video content
    • TwelveLabs Pegasus analyzed video content to generate structured metadata including titles, descriptions, mood, genre, OCR text, and scene-level context
  • Orchestration layer
    • AWS Step Functions coordinated long-running asynchronous AI inference jobs, including invocations of video understanding models through Amazon Bedrock
    • A callback pattern using task tokens enabled reliable handling of operations exceeding Lambda time limits
  • Serverless compute and API layer
    • AWS Lambda powered ingestion, search, metadata management, and webhook handling
    • Amazon API Gateway exposed REST endpoints for video retrieval, metadata updates, deletion, and semantic search
    • A Lambda authorizer validated Sony Ci OAuth tokens and enforced workspace-level access control
  • Search and indexing
    • Amazon OpenSearch Serverless indexed embeddings and metadata for low-latency retrieval
    • Semantic search supported:
      • Natural language queries
      • Image-based queries (base64 or URL)
      • Combined text + image search
      • Filtering on AI-generated metadata (genre, mood, title, description, name)
      • Technical metadata filtering
      • Time-based segment-level results
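
The task-token callback pattern described above can be sketched in Python. The event shape, field names, and job bookkeeping here are illustrative assumptions, not Sony Ci's actual schema: Step Functions invokes a worker state with `.waitForTaskToken`, the worker records the token alongside the AI job, and a separate completion handler later resumes the workflow.

```python
import json

def start_inference_job(event):
    """Worker invoked by a Step Functions task state using .waitForTaskToken.

    Step Functions injects the task token through the state's Parameters
    (e.g. "TaskToken.$": "$$.Task.Token"). We persist the token alongside
    the job so the completion handler can resume the workflow later.
    The "TaskToken"/"VideoAssetId" keys are hypothetical examples.
    """
    token = event["TaskToken"]
    job_id = event["VideoAssetId"]  # asset to analyze
    # In production: submit the asynchronous Bedrock inference job here
    # and store {job_id: token} in a durable store such as DynamoDB.
    return {"job_id": job_id, "task_token": token}

def build_callback(job_record, inference_output):
    """Build the SendTaskSuccess parameters the completion handler would
    pass to the Step Functions API once the AI job finishes."""
    return {
        "taskToken": job_record["task_token"],
        "output": json.dumps(inference_output),
    }
```

In production the completion handler would call `boto3.client("stepfunctions").send_task_success(**params)`; because the workflow waits on the token rather than on a running Lambda, inference jobs can safely exceed Lambda's execution time limit.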

OpenSearch Serverless served as the core retrieval engine of the platform. Multimodal embeddings generated by the TwelveLabs Marengo model were stored as vector representations, enabling high-dimensional similarity search across large video catalogs.

In addition to vector search, OpenSearch handled structured metadata indexing, hybrid search combining vector similarity with field-based filters, and relevance scoring to rank results. This allowed users to refine results by genre, mood, title, or description while preserving semantic ranking.

Segment-level indexing enabled precise time-based matches, returning start and end timestamps directly from the OpenSearch index. This transformed search from asset-level discovery to moment-level retrieval.
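
A hybrid query of the kind described above can be sketched as a plain request body. The index field names (`embedding`, `genre`, `mood`, `start_ms`, `end_ms`, `asset_id`, `title`) are illustrative assumptions rather than the actual Sony Ci index mapping.

```python
def build_hybrid_query(query_vector, k=10, genre=None, mood=None):
    """Build an OpenSearch k-NN query combining vector similarity with
    optional field-based filters, returning segment-level fields so
    results carry start/end timestamps. Field names are hypothetical.
    """
    knn = {"knn": {"embedding": {"vector": query_vector, "k": k}}}
    filters = []
    if genre:
        filters.append({"term": {"genre": genre}})
    if mood:
        filters.append({"term": {"mood": mood}})
    # With no filters, run pure vector search; otherwise wrap the k-NN
    # clause in a bool query so filters narrow the candidate set while
    # semantic similarity still drives relevance scoring.
    query = knn if not filters else {"bool": {"must": [knn], "filter": filters}}
    return {
        "size": k,
        "query": query,
        # Return only what moment-level retrieval needs.
        "_source": ["asset_id", "start_ms", "end_ms", "title"],
    }
```

Because each indexed document represents a time-bounded segment rather than a whole asset, the hits themselves carry the start and end timestamps, which is what turns asset-level discovery into moment-level retrieval.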

  • Integration with Sony Ci
    • Custom actions triggered ingestion and deletion workflows
    • Webhooks automatically processed newly added or removed assets
    • Authentication credentials (including client ID and secret) stored securely in AWS Secrets Manager
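
The webhook-driven synchronization above might look like the following dispatcher. The payload shape (`type` and `asset` keys, `asset.created`/`asset.deleted` event names) is a hypothetical example, not the real Ci webhook contract.

```python
def handle_ci_webhook(event):
    """Route a Sony Ci webhook notification to the matching workflow.

    The event shape used here is an illustrative assumption; the real
    Ci webhook payload and event names may differ.
    """
    kind = event.get("type")
    asset_id = event.get("asset", {}).get("id")
    if not asset_id:
        return {"action": "ignored", "reason": "no asset id"}
    if kind == "asset.created":
        # In production: start the Step Functions ingestion workflow
        # that generates embeddings and metadata for the new asset.
        return {"action": "ingest", "asset_id": asset_id}
    if kind == "asset.deleted":
        # In production: remove the asset's embeddings and metadata
        # from the OpenSearch index to keep it in sync.
        return {"action": "delete", "asset_id": asset_id}
    return {"action": "ignored", "reason": f"unhandled type {kind}"}
```

A Lambda behind API Gateway would receive these notifications, with the Ci client credentials fetched from AWS Secrets Manager rather than stored in code.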

The architecture followed hexagonal (ports-and-adapters) principles to separate business logic from infrastructure, improving maintainability and extensibility.
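
A minimal sketch of that separation, assuming a hypothetical search-index port: the ingestion logic depends only on an interface, so the OpenSearch-backed adapter can be swapped for an in-memory one in tests.

```python
from typing import Protocol

class SearchIndexPort(Protocol):
    """Port: the only surface the business logic needs from an index."""
    def index_segment(self, asset_id: str, segment: dict) -> None: ...
    def search(self, text: str) -> list: ...

class InMemoryIndexAdapter:
    """Test adapter. A production adapter would wrap the OpenSearch
    Serverless client behind this same interface."""
    def __init__(self):
        self._docs = []
    def index_segment(self, asset_id, segment):
        self._docs.append({"asset_id": asset_id, **segment})
    def search(self, text):
        return [d for d in self._docs if text in d.get("title", "")]

def ingest_segments(index: SearchIndexPort, asset_id: str, segments: list):
    """Core ingestion logic: depends on the port, never on OpenSearch."""
    for seg in segments:
        index.index_segment(asset_id, seg)
```

The port names and methods here are illustrative; the point is that infrastructure (OpenSearch, Bedrock, Secrets Manager) sits behind adapters at the edges while domain logic stays framework-free.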

Outcome

Sony Ci gained an intelligent semantic search layer embedded directly into its existing workflow.

Video assets are automatically analyzed upon ingestion. Embeddings and structured metadata are indexed without manual intervention. Users can query content based on intent and meaning rather than filename conventions or manually applied tags.

With OpenSearch Serverless acting as the unified indexing and retrieval layer, semantic queries, metadata filtering, and segment-level search operate within a single scalable search infrastructure.

This integration enables support for:

  • AI-enriched metadata generation powered by video understanding models invoked through Amazon Bedrock
  • Segment-level semantic retrieval with start and end timestamps
  • Visual similarity search
  • Workspace-aware authorization
  • Fully automated ingestion and deletion synchronization

The system scales elastically using serverless services, without infrastructure management overhead.

Benefits

  • Faster content discovery for editors and media teams working with large video libraries
  • Unlocked revenue opportunities from existing archives
  • Reduced time spent manually reviewing footage to locate relevant clips
  • Improved reuse of existing video assets across productions and campaigns
  • More consistent metadata across video catalogs through automated AI analysis
  • Scalable search capabilities that support growing media libraries without increasing manual tagging effort