TrackIt

Sony Ci: Client Case Study

Author

Chris Marchitelli

Enabling Semantic Video Search and AI-Driven Metadata for Sony’s Ci Media Cloud

Customer Challenge

Sony sought to introduce semantic intelligence into its Ci Media Cloud asset management and collaboration service to improve discoverability across its customers’ expanding media libraries. As content volumes increased to millions of assets, traditional keyword search and manual tagging approaches created friction in locating relevant assets and specific moments within long-form videos. This slowed editorial and downstream monetization workflows and made it difficult for teams to quickly identify and reuse the right content.

The objective was to build a proof-of-concept enabling users of Sony’s Ci Media Cloud to search across video assets by meaning rather than by manually entered tags and structured metadata. Content teams needed to retrieve precise segments based on natural language intent, visual similarity, or contextual cues, without relying on time-consuming and error-prone human annotation.

At the same time, the solution had to:

  • Automatically generate consistent, high-quality metadata at scale
  • Support multimodal search using text and image-based queries
  • Maintain strict workspace-level access controls across users and networks
  • Integrate seamlessly into existing ingestion workflows
  • Scale without introducing infrastructure management overhead

Sony Ci required an intelligent, automated layer capable of deeply interpreting video content, indexing it efficiently, and returning precise results with enterprise-grade reliability and security.

Implementation

TrackIt designed and deployed a fully serverless, AI-powered video intelligence workflow on AWS.

At the core of the solution:

  • Video understanding models
    • TwelveLabs Marengo generated multimodal embeddings capturing visual, audio, and semantic meaning of video content
    • TwelveLabs Pegasus analyzed video content to generate structured metadata including titles, descriptions, mood, genre, OCR text, and scene-level context
  • Orchestration layer
    • AWS Step Functions coordinated long-running asynchronous AI inference jobs, including invocations of video understanding models through Amazon Bedrock
    • A callback pattern using task tokens enabled reliable handling of operations exceeding Lambda time limits
  • Serverless compute and API layer
    • AWS Lambda powered ingestion, search, metadata management, and webhook handling
    • Amazon API Gateway exposed REST endpoints for video retrieval, metadata updates, deletion, and semantic search
    • A Lambda authorizer validated Sony Ci OAuth tokens and enforced workspace-level access control
  • Search and indexing
    • Amazon OpenSearch Serverless indexed embeddings and metadata for low-latency retrieval
    • Semantic search supported:
      • Natural language queries
      • Image-based queries (base64 or URL)
      • Combined text + image search
      • Filtering on AI-generated metadata (genre, mood, title, description, name)
      • Technical metadata filtering
      • Time-based segment-level results
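
The task-token callback pattern described above can be sketched in Python. The event shape, field names, and job bookkeeping here are illustrative assumptions, not Sony Ci's actual schema: Step Functions invokes a worker state with `.waitForTaskToken`, the worker records the token alongside the AI job, and a separate completion handler later resumes the workflow.

```python
import json

def start_inference_job(event):
    """Worker invoked by a Step Functions task state using .waitForTaskToken.

    Step Functions injects the task token through the state's Parameters
    (e.g. "TaskToken.$": "$$.Task.Token"). We persist the token alongside
    the job so the completion handler can resume the workflow later.
    The "TaskToken"/"VideoAssetId" keys are hypothetical examples.
    """
    token = event["TaskToken"]
    job_id = event["VideoAssetId"]  # asset to analyze
    # In production: submit the asynchronous Bedrock inference job here
    # and store {job_id: token} in a durable store such as DynamoDB.
    return {"job_id": job_id, "task_token": token}

def build_callback(job_record, inference_output):
    """Build the SendTaskSuccess parameters the completion handler would
    pass to the Step Functions API once the AI job finishes."""
    return {
        "taskToken": job_record["task_token"],
        "output": json.dumps(inference_output),
    }
```

In production the completion handler would call `boto3.client("stepfunctions").send_task_success(**params)`; because the workflow waits on the token rather than on a running Lambda, inference jobs can safely exceed Lambda's execution time limit.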

OpenSearch Serverless served as the core retrieval engine of the platform. Multimodal embeddings generated by the TwelveLabs Marengo model were stored as vector representations, enabling high-dimensional similarity search across large video catalogs.

In addition to vector search, OpenSearch handled structured metadata indexing, hybrid search combining vector similarity with field-based filters, and relevance scoring to rank results. This allowed users to refine results by genre, mood, title, or description while preserving semantic ranking.

Segment-level indexing enabled precise time-based matches, returning start and end timestamps directly from the OpenSearch index. This transformed search from asset-level discovery to moment-level retrieval.
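
A hybrid query of the kind described above can be sketched as a plain request body. The index field names (`embedding`, `genre`, `mood`, `start_ms`, `end_ms`, `asset_id`, `title`) are illustrative assumptions rather than the actual Sony Ci index mapping.

```python
def build_hybrid_query(query_vector, k=10, genre=None, mood=None):
    """Build an OpenSearch k-NN query combining vector similarity with
    optional field-based filters, returning segment-level fields so
    results carry start/end timestamps. Field names are hypothetical.
    """
    knn = {"knn": {"embedding": {"vector": query_vector, "k": k}}}
    filters = []
    if genre:
        filters.append({"term": {"genre": genre}})
    if mood:
        filters.append({"term": {"mood": mood}})
    # With no filters, run pure vector search; otherwise wrap the k-NN
    # clause in a bool query so filters narrow the candidate set while
    # semantic similarity still drives relevance scoring.
    query = knn if not filters else {"bool": {"must": [knn], "filter": filters}}
    return {
        "size": k,
        "query": query,
        # Return only what moment-level retrieval needs.
        "_source": ["asset_id", "start_ms", "end_ms", "title"],
    }
```

Because each indexed document represents a time-bounded segment rather than a whole asset, the hits themselves carry the start and end timestamps, which is what turns asset-level discovery into moment-level retrieval.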

  • Integration with Sony Ci
    • Custom actions triggered ingestion and deletion workflows
    • Webhooks automatically processed newly added or removed assets
    • Authentication credentials (including client ID and secret) stored securely in AWS Secrets Manager
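
The webhook-driven synchronization above might look like the following dispatcher. The payload shape (`type` and `asset` keys, `asset.created`/`asset.deleted` event names) is a hypothetical example, not the real Ci webhook contract.

```python
def handle_ci_webhook(event):
    """Route a Sony Ci webhook notification to the matching workflow.

    The event shape used here is an illustrative assumption; the real
    Ci webhook payload and event names may differ.
    """
    kind = event.get("type")
    asset_id = event.get("asset", {}).get("id")
    if not asset_id:
        return {"action": "ignored", "reason": "no asset id"}
    if kind == "asset.created":
        # In production: start the Step Functions ingestion workflow
        # that generates embeddings and metadata for the new asset.
        return {"action": "ingest", "asset_id": asset_id}
    if kind == "asset.deleted":
        # In production: remove the asset's embeddings and metadata
        # from the OpenSearch index to keep it in sync.
        return {"action": "delete", "asset_id": asset_id}
    return {"action": "ignored", "reason": f"unhandled type {kind}"}
```

A Lambda behind API Gateway would receive these notifications, with the Ci client credentials fetched from AWS Secrets Manager rather than stored in code.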

The architecture followed hexagonal (ports-and-adapters) principles to separate business logic from infrastructure, improving maintainability and extensibility.
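
A minimal sketch of that separation, assuming a hypothetical search-index port: the ingestion logic depends only on an interface, so the OpenSearch-backed adapter can be swapped for an in-memory one in tests.

```python
from typing import Protocol

class SearchIndexPort(Protocol):
    """Port: the only surface the business logic needs from an index."""
    def index_segment(self, asset_id: str, segment: dict) -> None: ...
    def search(self, text: str) -> list: ...

class InMemoryIndexAdapter:
    """Test adapter. A production adapter would wrap the OpenSearch
    Serverless client behind this same interface."""
    def __init__(self):
        self._docs = []
    def index_segment(self, asset_id, segment):
        self._docs.append({"asset_id": asset_id, **segment})
    def search(self, text):
        return [d for d in self._docs if text in d.get("title", "")]

def ingest_segments(index: SearchIndexPort, asset_id: str, segments: list):
    """Core ingestion logic: depends on the port, never on OpenSearch."""
    for seg in segments:
        index.index_segment(asset_id, seg)
```

The port names and methods here are illustrative; the point is that infrastructure (OpenSearch, Bedrock, Secrets Manager) sits behind adapters at the edges while domain logic stays framework-free.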

Outcome

Sony Ci gained an intelligent semantic search layer embedded directly into its existing workflow.

Video assets are automatically analyzed upon ingestion. Embeddings and structured metadata are indexed without manual intervention. Users can query content based on intent and meaning rather than filename conventions or manually applied tags.

With OpenSearch Serverless acting as the unified indexing and retrieval layer, semantic queries, metadata filtering, and segment-level search operate within a single scalable search infrastructure.

This integration enables support for:

  • AI-enriched metadata generation powered by video understanding models invoked through Amazon Bedrock
  • Segment-level semantic retrieval with start and end timestamps
  • Visual similarity search
  • Workspace-aware authorization
  • Fully automated ingestion and deletion synchronization

The system scales elastically using serverless services, without infrastructure management overhead.

Benefits

  • Faster content discovery for editors and media teams working with large video libraries
  • Unlocked revenue opportunities from existing archives
  • Reduced time spent manually reviewing footage to locate relevant clips
  • Improved reuse of existing video assets across productions and campaigns
  • More consistent metadata across video catalogs through automated AI analysis
  • Scalable search capabilities that support growing media libraries without increasing manual tagging effort