Written by Maram Ayari, Software Engineer at TrackIt
Migrating databases is a critical process for organizations seeking to modernize infrastructure, improve scalability, or adopt cloud-native platforms. It often involves not only transferring data but also transforming and validating it to align with the target environment’s requirements. This becomes especially complex when metadata structures, access methods, and schema definitions differ significantly between legacy and modern systems.
Successful database migrations require careful planning, reliable tooling, and clear strategies for minimizing downtime and ensuring data integrity. The following example outlines a metadata migration project from an on-premises environment to a cloud-based Dalet Flex deployment, designed and implemented for a media customer.
Contents
- Project Overview
- Architecture Overview
- Step 1: Discovery – Evaluating the Existing Infrastructure
- Step 2: Defining Metadata Requirements in Dalet Flex
- Step 3: Migration from the On-Prem Database to Amazon RDS
- Step 4: Transform and Validate the Data (ETL)
- Step 5: Load into Dalet Flex
- Conclusion
- How TrackIt Can Help
- About TrackIt
Project Overview
The objective of the project was to migrate transformed metadata from an on-premises Reach Engine environment to Dalet Flex hosted on AWS. TrackIt addressed specific data quality issues identified in the client-provided documentation to ensure that the transformed data aligned with the requirements of Dalet Flex.
While Reach Engine provides a RESTful API for data access, limitations related to API rate throttling and time constraints from the client made it impractical for use in this migration. As a result, an alternative pipeline was designed to authenticate directly against the underlying database. This approach enabled the extraction of metadata from the relevant database tables, followed by the reconstruction and validation of entities as part of the ETL process.
Steps
- Evaluate the existing infrastructure and current Reach Engine setup.
- Define the metadata fields, asset types, and any configurations that need to be adapted in Dalet Flex.
- Create scripts using the API to extract data from Reach Engine and move it to an intermediate database.
- Use a structured ETL (Extract, Transform, Load) process to prepare data for Dalet Flex.
- Move the newly formatted data into Dalet Flex using the Dalet Flex API.
Architecture Overview
Amazon RDS (Relational Database Service)
RDS is used to host the intermediate database that temporarily stores data copied from the source Reach Engine. At this stage, the data remains untransformed but serves as a stable and secure staging area for analysis, testing, and verification to ensure that all necessary data is accurately captured for the migration.
AWS Direct Connect
Direct Connect establishes a dedicated and secure network connection between the on-premises Reach Engine and the AWS infrastructure. This enables smooth data transfer and low latency throughout the migration process, even with the large data volumes involved.
AWS Glue
AWS Glue handles the transformation of metadata from the original database format to the new structure. Glue jobs extract data from the intermediate Amazon RDS database, transform it as required, and prepare it for loading into the Dalet Flex environment. Glue’s serverless architecture and automatic scaling capabilities make it well-suited for this large-scale migration.
Amazon S3 (Simple Storage Service)
S3 plays a critical role throughout the migration, beyond serving as the final destination for transformed data. During both the extraction and transformation phases, S3 provides temporary storage for in-process data. Its scalability and reliability are essential for handling large datasets. S3 also enables secure storage and flexible access at various stages, supporting workflow management and minimizing the risk of data loss.
Step 1: Discovery – Evaluating the Existing Infrastructure
The first step in the migration process involved evaluating the existing Reach Engine environment. This included an assessment of the infrastructure and database to understand the scale and accessibility of the data.
While Reach Engine offers a RESTful API for interacting with metadata, it was subject to high rate limiting. Given the volume of data and project time constraints, using the API would have significantly prolonged the extraction process. As a result, the team opted for a more efficient approach: accessing the database directly to extract and reconstruct metadata for the migration pipeline.
As part of the discovery, the following details were gathered:
- Database Size: 77GB
- Total Tables: 227
- Database Version: PostgreSQL 13.3
- Connectivity: Internet access is routed through a proxy; access to RDS is functional via this proxy.
Step 2: Defining Metadata Requirements in Dalet Flex
In coordination with the client and the Dalet Flex team, the metadata fields and asset types that needed to be supported were defined. The necessary transformations and field mappings were also documented to ensure alignment with Dalet Flex’s schema and ingestion requirements.
Authentication
Dalet Flex’s API uses Basic Authentication (illustrated in the sketch below):
- A username and password are provided.
- These credentials are sent in the Authorization header, encoded in base64 format.
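The snippet below is a minimal sketch of this authentication flow using Python’s requests library; the endpoint, username, and password are placeholders rather than real project values.

```python
import requests

# Placeholder endpoint and credentials; substitute the values for the target Flex environment.
BASE_URL = "https://development.flex-dev.cloud.example.com/api"
USERNAME = "api-user"       # hypothetical service account
PASSWORD = "api-password"

# requests builds the base64-encoded Authorization header from the (username, password) tuple.
response = requests.get(
    f"{BASE_URL}/assets",
    auth=(USERNAME, PASSWORD),
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```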
API Endpoints
- Development Environment: https://development.flex-dev.cloud.example.com/api/
- Production Environment: https://production.flex-prod.cloud.example.com/api/
Matrix Parameters Format
Unlike typical REST APIs that use query strings (?param=value), Dalet Flex employs matrix URIs. For example: /api/assets;limit=10;page=2;sortBy=id;sort=desc
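As an illustration, a small helper such as the one below can assemble matrix-style URIs from keyword arguments; it is a sketch rather than part of the production migration tooling.

```python
def build_matrix_uri(resource: str, **params) -> str:
    """Append matrix-style parameters (;key=value) to a resource path."""
    suffix = "".join(f";{key}={value}" for key, value in params.items())
    return f"{resource}{suffix}"

# Produces: /api/assets;limit=10;page=2;sortBy=id;sort=desc
print(build_matrix_uri("/api/assets", limit=10, page=2, sortBy="id", sort="desc"))
```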
Common Parameters
- limit: Maximum number of items per page (default: 100)
- page: Pagination control
- createdFrom, createdTo: Date filtering (format: 24 Jun 2021 10:48:41)
- deleted: Include deleted assets (default: false)
- sort, sortBy: Sorting results
Available Metadata Types
Metadata types can be viewed directly from the Dalet Flex interface or retrieved via the API (a sketch of these calls follows the list):
- Get all metadata definitions: GET /api/metadataDefinitions
- Filter by name (e.g., asset types): GET /api/metadataDefinitions;name=assetType
- Retrieve the full definition of a metadata type: GET /api/metadataDefinitions/{metadataId}/definition
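The sketch below strings these three calls together with Python’s requests library; the credentials are placeholders, and the response shape used to pull the metadata ID is an assumption about the deployment’s API output.

```python
import requests

BASE_URL = "https://development.flex-dev.cloud.example.com/api"
AUTH = ("api-user", "api-password")  # placeholder Basic Auth credentials

# List all metadata definitions, then filter by name using a matrix parameter.
all_definitions = requests.get(f"{BASE_URL}/metadataDefinitions", auth=AUTH, timeout=30).json()
asset_types = requests.get(f"{BASE_URL}/metadataDefinitions;name=assetType", auth=AUTH, timeout=30).json()

# Retrieve the full definition of one metadata type (response shape assumed for illustration).
metadata_id = asset_types["metadataDefinitions"][0]["id"]
full_definition = requests.get(
    f"{BASE_URL}/metadataDefinitions/{metadata_id}/definition", auth=AUTH, timeout=30
).json()
print(full_definition)
```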
Common Asset Metadata Types
Examples of metadata types used for assets in this deployment include:
- assetType_Project
- assetType_Promo
- assetType_Collection
- assetType_DeliverableImage
Creating a New Asset with Metadata
- Create the asset: POST /assets
  Include the metadatadefinitionId field to indicate the metadata type to apply.
- Set metadata attributes (annotations): POST /assets/{assetId}/annotations
  This is a separate request used to populate the asset’s metadata fields after creation. Metadata and asset creation cannot be performed in the same API call (see the sketch below).
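The following sketch illustrates the two-step sequence with Python’s requests library; the payload fields, metadata definition ID, and response shape are hypothetical and depend on the actual Flex configuration.

```python
import requests

BASE_URL = "https://development.flex-dev.cloud.example.com/api"
AUTH = ("api-user", "api-password")  # placeholder credentials

# Step 1: create the asset and reference the metadata definition to apply.
asset_payload = {
    "name": "promo_clip_001",        # hypothetical asset name
    "metadatadefinitionId": 1234,    # e.g., the ID of assetType_Promo (placeholder value)
}
created = requests.post(f"{BASE_URL}/assets", json=asset_payload, auth=AUTH, timeout=30)
created.raise_for_status()
asset_id = created.json()["id"]      # response shape assumed for illustration

# Step 2: populate the asset's metadata fields in a separate annotations request.
annotations_payload = {"title": "Promo Clip 001", "description": "Migrated from Reach Engine"}
annotated = requests.post(
    f"{BASE_URL}/assets/{asset_id}/annotations", json=annotations_payload, auth=AUTH, timeout=30
)
annotated.raise_for_status()
```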
Step 3: Migration from the On-Prem Database to Amazon RDS
Connect to the On-Premises Environment
Connect to Jump Station: Use SSH or your preferred method to connect to the jump station in the on-premises environment.
Connect to the PostgreSQL Server Instance
From the jump station, connect to the server instance hosting the PostgreSQL database.
Create a backup with pg_dump
Using pg_dump, a full logical backup of the PostgreSQL database was created and stored in a shared location on the local SAN. This ensured a reliable snapshot of the data for migration.
Restore Backup to Amazon RDS Instance
- Initial Test with Selected Tables:
- Perform a test restore with a backup containing only two or three tables (totaling approximately 8 GB).
- Use pg_restore to restore the backup to the AWS RDS instance.
- pg_restore -U <username> -h <rds_endpoint> -d <database_name> -v /path/to/backup_file_name.backup
- Full Database Restore:
- Once the test restore is successful, proceed with a full backup and restore it to the RDS instance.
Post-Migration Verification
- Test the Tables:
- After the transfer is complete, thoroughly test the tables in the AWS RDS instance to ensure data integrity and consistency.
- Handle Inconsistencies:
- If any inconsistencies are detected, troubleshoot the issues and repeat the migration process as necessary.
Ongoing Replication with pglogical
- Install pglogical:
- Install pglogical on both the source (on-premises) and target (AWS RDS) PostgreSQL databases.
- Set Up Provider Node (Source Database):
- Configure the on-premises PostgreSQL database as a provider node (a setup sketch follows this list).
- Set Up Subscriber Node (Target Database):
- Configure the AWS RDS PostgreSQL database as a subscriber node.
- Monitor and Verify Replication:
- Monitor the replication task with CloudWatch Logs to ensure data is continuously synchronized.
- Perform data consistency checks periodically to verify the integrity of the replicated data.
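The sketch below shows one way the provider and subscriber nodes might be configured by running the pglogical SQL functions through psycopg2; the connection strings, node names, and replicated schema are assumptions.

```python
import psycopg2

# Placeholder DSNs; the RDS subscriber must be able to reach the on-premises provider.
PROVIDER_DSN = "host=onprem-postgres port=5432 dbname=reachengine user=replicator password=***"
SUBSCRIBER_DSN = "host=my-rds-endpoint port=5432 dbname=reachengine user=replicator password=***"

def run(dsn: str, statements: list) -> None:
    """Execute a list of SQL statements against the given database."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for statement in statements:
            cur.execute(statement)

# Provider (on-premises): create the node and publish the tables in the public schema.
run(PROVIDER_DSN, [
    f"SELECT pglogical.create_node(node_name := 'provider', dsn := '{PROVIDER_DSN}');",
    "SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);",
])

# Subscriber (Amazon RDS): create the node and subscribe to the provider.
run(SUBSCRIBER_DSN, [
    f"SELECT pglogical.create_node(node_name := 'subscriber', dsn := '{SUBSCRIBER_DSN}');",
    f"SELECT pglogical.create_subscription(subscription_name := 'reachengine_sub', provider_dsn := '{PROVIDER_DSN}');",
])
```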
Alternative Ongoing Replication with AWS DMS through VPN
- Configure DMS:
- Create a replication instance in AWS DMS.
- Configure source and target endpoints (requires a Site-to-Site VPN connection):
- Source: On-premises PostgreSQL database.
- Target: AWS RDS for PostgreSQL instance.
- Create a Migration Task:
- Set up a DMS task to perform ongoing replication (a configuration sketch follows this list).
- Ensure the task is configured for continuous data replication to keep the Amazon RDS instance synchronized with the on-premises database.
- Monitor and Verify Replication:
- Monitor the replication task with CloudWatch Logs to ensure data is continuously synchronized.
- Perform data consistency checks periodically to verify the integrity of the replicated data.
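As a rough illustration, a DMS task for full load plus ongoing replication could be created with boto3 as sketched below; the region, ARNs, and table mappings are placeholders, and the replication instance and endpoints are assumed to exist already.

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-west-2")  # region is an assumption

# Table mapping rule: replicate every table in the public schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-public-schema",
        "object-locator": {"schema-name": "public", "table-name": "%"},
        "rule-action": "include",
    }]
}

# Placeholder ARNs for resources created beforehand in DMS.
dms.create_replication_task(
    ReplicationTaskIdentifier="reachengine-to-rds",
    SourceEndpointArn="arn:aws:dms:us-west-2:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-west-2:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-west-2:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial load followed by ongoing change data capture
    TableMappings=json.dumps(table_mappings),
)
```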
Step 4: Transform and Validate the Data (ETL)
Extracting and Filtering Asset IDs
The initial step involved extracting asset IDs from the Reach Engine database. A Python script applied specific filters to capture only the relevant asset types (e.g., videos, images, documents) required for the migration. The extracted asset IDs were written to an S3 bucket for further processing, and an AWS Glue job was triggered to filter out assets missing mandatory data, ensuring that only complete records were considered for migration and establishing a roadmap for subsequent phases.
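A simplified sketch of this extraction step is shown below, assuming psycopg2 against the RDS copy of the database and boto3 for S3; the table, column, and bucket names are hypothetical, since the actual Reach Engine schema is not reproduced here.

```python
import csv
import io
import boto3
import psycopg2

RDS_DSN = "host=my-rds-endpoint dbname=reachengine user=etl password=***"  # placeholder
BUCKET = "migration-staging-bucket"                                        # hypothetical bucket

# Table and column names are illustrative; the real Reach Engine schema differs.
QUERY = """
    SELECT id, asset_type
    FROM assets
    WHERE asset_type IN ('video', 'image', 'document')
"""

with psycopg2.connect(RDS_DSN) as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    rows = cur.fetchall()

# Write the filtered asset IDs to S3 as CSV for the downstream Glue job.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["asset_id", "asset_type"])
writer.writerows(rows)

boto3.client("s3").put_object(
    Bucket=BUCKET,
    Key="extracted/asset_ids.csv",
    Body=buffer.getvalue().encode("utf-8"),
)
```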
Fetching Metadata for Each Asset
Following the extraction of asset IDs, the next step was to fetch the associated metadata. This metadata, including descriptive information, media attributes, and relationships to other assets, was stored in various tables within the database.
ETL jobs were designed to join asset IDs with these metadata tables, ensuring that all relevant details for each asset were gathered (see the PySpark sketch after this list). The logic was to:
- Query the metadata tables using the asset IDs as the primary filter.
- Extract the metadata related to each asset, including technical information (file formats, resolutions), descriptive attributes (titles, tags, descriptions), and relational data (such as links to other assets or categories).
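The PySpark sketch below illustrates this join inside an AWS Glue job; the Glue Data Catalog database, table names, and selected columns are assumptions standing in for the real Reach Engine tables.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Catalog database/table names are placeholders for crawled copies of the RDS tables.
asset_ids = glue_context.create_dynamic_frame.from_catalog(
    database="reachengine_staging", table_name="asset_ids"
).toDF()
metadata = glue_context.create_dynamic_frame.from_catalog(
    database="reachengine_staging", table_name="asset_metadata"
).toDF()

# Join metadata onto the filtered asset IDs and keep the fields needed downstream.
joined = (
    asset_ids.join(metadata, asset_ids.asset_id == metadata.asset_id, "left")
    .select(asset_ids.asset_id, "title", "description", "file_format", "resolution")
)

joined.write.mode("overwrite").json("s3://migration-staging-bucket/joined-metadata/")
```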
Transforming the Data for the Target Database (Dalet Flex)
After extracting the necessary metadata, the data was transformed into a format compatible with Dalet Flex. This transformation process included:
- Data normalization: Standardizing metadata fields to align with the target database’s structure.
- Field mapping: Ensuring that fields from the source database were correctly mapped to the corresponding fields in the new database. For instance, technical attributes such as resolution or file format might need to be renamed or reformatted to meet the requirements of the new platform (see the mapping sketch after this list).
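A minimal sketch of this kind of mapping is shown below; both the source keys and the Dalet Flex field names are illustrative, since the real mapping came from the documented field requirements.

```python
# Illustrative mapping from Reach Engine metadata keys to Dalet Flex field names.
FIELD_MAP = {
    "asset_title": "title",
    "asset_description": "description",
    "file_format": "mediaFormat",
    "resolution": "videoResolution",
}

def transform_record(source: dict) -> dict:
    """Rename source fields and apply simple normalization for Dalet Flex ingestion."""
    target = {flex_key: source.get(source_key) for source_key, flex_key in FIELD_MAP.items()}
    # Example normalization rule (assumed): lowercase format identifiers.
    if target.get("mediaFormat"):
        target["mediaFormat"] = target["mediaFormat"].lower()
    return target

print(transform_record({"asset_title": "Promo Clip 001", "file_format": "MXF"}))
```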
Loading the Data into Amazon S3
Once the transformation was complete, the final step involved loading the transformed data into Amazon S3. This staging area provided a secure and scalable storage solution where the data could be temporarily held before being migrated to Dalet Flex. The transformed data in Amazon S3 was organized and prepared for the final migration phase, ensuring that all metadata was properly validated and ready for transfer.
Step 5: Load into Dalet Flex
This workflow outlines a data transformation pipeline designed to prepare JSON files for ingestion into Dalet Flex. It uses AWS services such as S3, DataSync, and Glue to automate and scale the process efficiently:
- Source of Truth S3 Bucket: Raw JSON files are stored in a centralized S3 bucket acting as the source of truth.
- AWS DataSync to Transformation Bucket: DataSync transfers the contents of the source bucket to a transformation S3 bucket for staging and processing.
- Sanitize Filenames with AWS Glue: A Glue script standardizes file names to ensure Spark compatibility during downstream processing (a simplified sketch follows this list).
- Purge Unwanted Files: A second Glue job identifies and removes unwanted or unnecessary JSON files based on defined criteria.
- Conflict Detection: Another Glue script scans for content conflicts (e.g., schema mismatches) to determine whether transformation rules need updates.
- Data Transformation: A final Glue job applies the latest transformation logic to reformat JSON files according to Dalet Flex requirements.
- DataSync to Dalet Flex Buckets: The transformed files are moved to the final S3 buckets (prod and dev) using DataSync, where they are ready for ingestion into the Dalet Flex platform.
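The snippet below is a simplified, boto3-based sketch of the filename sanitization step; the bucket name, prefix, and character rules are assumptions rather than the production Glue script.

```python
import re
import boto3

s3 = boto3.client("s3")
BUCKET = "transformation-bucket"  # hypothetical staging bucket

def sanitize(key: str) -> str:
    """Replace characters that commonly break Spark paths with underscores (assumed rule set)."""
    return re.sub(r"[^A-Za-z0-9/._-]", "_", key)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="incoming/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        clean_key = sanitize(key)
        if clean_key != key:
            # Copy to the sanitized key, then remove the original object.
            s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": key}, Key=clean_key)
            s3.delete_object(Bucket=BUCKET, Key=key)
```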
Conclusion
Database migrations, particularly those involving legacy systems and evolving metadata requirements, are inherently complex and demand careful orchestration across infrastructure, data integrity, and workflow compatibility. The transition from Reach Engine to Dalet Flex highlights the importance of tailored pipelines that balance technical constraints with business objectives. In media environments, where metadata accuracy and content accessibility are critical, aligning migration strategies with platform-specific schemas is essential.
How TrackIt Can Help
TrackIt, an AWS Advanced Tier Services Partner with deep expertise in media workflows, brings the technical capability and industry understanding required to execute such transformations efficiently. By leveraging AWS-native services and a media-focused approach, TrackIt can help media organizations modernize their infrastructure while ensuring operational continuity and data fidelity.
About TrackIt
TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.
We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.
Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.