Written by Maram Ayari and Leandro Mota, Software & DevOps Engineers at TrackIt
As global demand for multilingual content continues to rise, so does the need for fast and accurate subtitle translation. To address this, TrackIt conducted a proof of concept to evaluate the effectiveness of generative AI (GenAI) models in translating subtitles from English to Japanese, with the goal of improving upon the performance of XL8, the machine translation system currently used in production. Outlined below are the approach, key findings, and considerations for integrating the solution into on-premises environments or within a private VPC setup.
Mistral NeMo
Mistral NeMo is a compact, open-weight generative model developed by Mistral AI, designed for efficient deployment in self-hosted environments. It supports long-context processing and can be integrated with popular inference engines such as vLLM, TensorRT-LLM, and TGI.
Requirements: For inference, an additional memory overhead of up to 20% beyond the base model size should be accounted for. For the 12B-parameter Mistral NeMo, base weight memory is roughly 24 GB at float16/bfloat16 precision (2 bytes per parameter) and roughly 48 GB at float32 precision. A practical method for estimating memory use is available here; a simplified version is sketched below.
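As a rough rule of thumb, the required memory can be approximated from the parameter count, the bytes per parameter for the chosen precision, and the overhead margin mentioned above. The sketch below is illustrative only; the 12B parameter count is the published size of Mistral NeMo, and the 20% overhead is the figure assumed in this PoC rather than a universal constant.

```python
# Rough estimate of the memory needed to serve a model for inference.
# Assumption: weights dominate, and KV cache / activations are folded into
# a flat ~20% overhead factor (as assumed above).
def estimate_model_memory_gb(params_billions: float,
                             bytes_per_param: int = 2,  # 2 = fp16/bf16, 4 = fp32
                             overhead: float = 0.20) -> float:
    base_gb = params_billions * bytes_per_param  # ~1 GB per billion params per byte
    return base_gb * (1 + overhead)

# Mistral NeMo (~12B parameters)
print(estimate_model_memory_gb(12, 2))  # ~28.8 GB at bf16/fp16
print(estimate_model_memory_gb(12, 4))  # ~57.6 GB at fp32
```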
vLLM, in particular, offers a highly optimized Python-based serving framework that can expose an OpenAI-compatible API, making it well-suited for integration in controlled environments.
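As a brief illustration of that compatibility, the sketch below queries a self-hosted vLLM endpoint with the official openai client. The base URL, model identifier, and prompt are example values rather than the exact PoC configuration; it assumes a vLLM OpenAI-compatible server (e.g. `vllm serve mistralai/Mistral-Nemo-Instruct-2407`) is already running locally.

```python
# Minimal sketch: calling a self-hosted vLLM server through its
# OpenAI-compatible API. Endpoint and model name are example values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key by default

response = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    messages=[
        {"role": "system", "content": "You are a professional subtitle translator."},
        {"role": "user", "content": "Translate to Japanese: Do you know who I am?"},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```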
Performance Testing and Observations
1. Initial Issues on g5.2xlarge
- The default context length of ~128K tokens caused excessive memory usage.
- The 24 GB of VRAM on the instance's single A10G GPU was insufficient, leading to memory fragmentation and performance degradation.
- The model frequently ran out of memory, forcing a reduction in context length.
2. Workarounds Implemented
- Limited the context length to 4,096 tokens to prevent memory exhaustion.
- Enabled CPU offloading (4 GB initially, later increased to 12 GB) to relieve GPU memory pressure; a configuration sketch follows this list.
- Performance remained suboptimal, with slow execution and limited translation coherence.
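For reference, the sketch below shows roughly what this constrained configuration looks like with vLLM's offline engine; the same settings map to the `--max-model-len` and `--cpu-offload-gb` flags of the OpenAI-compatible server. The model identifier and exact values are assumptions mirroring the workarounds above, not a recommended setup.

```python
# Minimal sketch: vLLM constrained for a single 24 GB GPU (e.g. g5.2xlarge).
# max_model_len caps the context window; cpu_offload_gb moves part of the
# weights to system RAM at the cost of slower inference.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # example identifier
    max_model_len=4096,          # reduced from the default ~128K context
    cpu_offload_gb=12,           # offload up to 12 GB of weights to CPU RAM
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.3, top_p=0.95, max_tokens=100)
outputs = llm.generate(["Translate to Japanese: Have a seat."], params)
print(outputs[0].outputs[0].text)
```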
3. Upgrading to g5.4xlarge
- Increased vCPUs and RAM improved processing, but GPU memory remained a bottleneck.
- The model was operational but still slow, and translations suffered from loss of coherence due to reduced context length.
4. vLLM for Real-Time Subtitling
- vLLM was not an effective approach for real-time subtitle generation.
- The inference speed was too slow for handling dynamic, streaming subtitle flows.
- The model required significant preloading and batching, making it unsuitable for low-latency applications.
5. Upgrading to g5.48xlarge
- Successfully ran Mistral NeMo on a g5.48xlarge instance without imposing any context-length limitation.
- Used tensor parallelism across the instance's eight A10G GPUs, together with CPU offloading, to optimize performance.
- Achieved significantly better results, though translation issues still persist.
- The model was able to handle the full 128K token context length without performance degradation.
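The sketch below illustrates what this final configuration could look like: the eight-way tensor-parallel split matches the eight A10G GPUs on a g5.48xlarge, and the context length reflects the ~128K tokens handled in the tests. The model identifier and offload size are assumptions; the equivalent server flags are `--tensor-parallel-size` and `--cpu-offload-gb`.

```python
# Minimal sketch: sharding Mistral NeMo across all eight A10G GPUs of a
# g5.48xlarge so the full ~128K-token context fits without OOM errors.
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # example identifier
    tensor_parallel_size=8,   # one shard per A10G GPU
    max_model_len=128000,     # ~128K-token context, as used in the tests
    cpu_offload_gb=8,         # optional extra headroom in system RAM
)
```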
6. Inference Usage
```python
import time
import requests

# Configuration: mirrors the full script shown in the Karasu section;
# MODEL_NAME must match the model identifier served by vLLM (example below).
VLLM_API_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = ""
MODEL_NAME = "mistralai/Mistral-Nemo-Instruct-2407"
DELAY_BETWEEN_REQUESTS = 1


def translate_subtitles(srt_data):
    """Translates English subtitles to Japanese using Mistral NeMo served with vLLM."""
    headers = {"Content-Type": "application/json"}
    if API_KEY:
        headers["Authorization"] = f"Bearer {API_KEY}"  # Adds API key if provided

    for sub in srt_data:
        prompt = (
            f"Translate the following English subtitle to Japanese in a natural, conversational tone. "
            f"Do not provide explanations, commentary, or additional text—only the translated Japanese text.\n\n"
            f"English: {sub['text']}"
        )

        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a professional subtitle translator. "
                                              "Your task is to translate subtitles from English to Japanese "
                                              "accurately while keeping a natural tone."},
                {"role": "user", "content": prompt},
            ],
            "response_format": {"type": "text"},
            "temperature": 0.3,
            "top_p": 0.95,
            "max_completion_tokens": 100,
            "seed": 42,
            "stop": ["\n\n"],
            "n": 1,
        }

        try:
            response = requests.post(VLLM_API_URL, json=payload, headers=headers)
            response.raise_for_status()
            translated_text = response.json()["choices"][0]["message"]["content"].strip()
            sub["translated_text"] = translated_text
        except (requests.RequestException, KeyError, IndexError) as e:
            print(f"Translation failed for: {sub['text']} | Error: {e}")
            sub["translated_text"] = sub["text"]  # Fall back to the original English line

        time.sleep(DELAY_BETWEEN_REQUESTS)

    return srt_data
```
7. Challenges with Instance Availability
- Encountered difficulties when stopping and restarting the instance.
- AWS resource allocation for g5.48xlarge depends on available capacity within the chosen Availability Zone (AZ).
- Due to high demand, multiple attempts were required before the instance could be launched again; a simplified retry sketch is shown below.
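For illustration only, the sketch below shows how such capacity-constrained launches can be retried programmatically with boto3. This is not the Ansible automation used in the PoC (described in the next section); the AMI, region, and retry values are placeholders.

```python
# Simplified sketch: retrying a g5.48xlarge launch when the Availability Zone
# has no free capacity. Placeholder values; not the PoC's Ansible playbook.
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region

def launch_with_retries(max_attempts: int = 5, wait_seconds: int = 300) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            response = ec2.run_instances(
                ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder AMI
                InstanceType="g5.48xlarge",
                MinCount=1,
                MaxCount=1,
            )
            return response["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                print(f"Attempt {attempt}: no g5.48xlarge capacity, retrying in {wait_seconds}s...")
                time.sleep(wait_seconds)
            else:
                raise
    raise RuntimeError("g5.48xlarge capacity not available after retries")
```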
8. Automation Enhancements
- Created an Ansible automation script to deploy/redeploy instances more quickly.
- Developed a translation script utilizing Mistral NeMo for optimized execution.
- Both outputs (g5.4xlarge and g5.48xlarge) have been shared on the Slack channel for reference.
Output
English subtitles: https://drive.google.com/file/d/1zlMVivdY7PZi__5_FNc1A_k13xlXONa-/view?usp=drive_link
Mistral Japanese translation subtitles: https://drive.google.com/file/d/1rS5JwQtYoz2GEy7kpAiGXatWalGPnQUl/view?usp=drive_link
Verdict
The deployment of Mistral NeMo with vLLM on AWS EC2 initially revealed substantial resource limitations, particularly concerning GPU memory and real-time processing capabilities. The default 128K token context length proved impractical on lower-tier GPU instances, and performance remained below expectations even after implementing various optimizations.
Upgrading to a g5.48xlarge instance and applying tensor parallelism along with CPU offloading enabled full utilization of the 128K token context length without encountering memory-related issues. While this configuration delivered notable performance gains, the quality of translations still requires further refinement.
In addition, AWS availability constraints for high-tier GPU instances pose challenges in ensuring consistent access to the required infrastructure for stable and scalable deployment.
Karasu
Karasu is a Japanese-focused large language model optimized for high-accuracy translation and content generation. Developed specifically for the Japanese language, it performs well in capturing nuance and tone but demands considerable computational resources for real-time inference.
Model: lightblue/ao-karasu-72B
The ao-karasu-72B model requires a minimum of four A100-class GPUs for optimal performance; at 16-bit precision, its 72 billion parameters alone occupy roughly 144 GB of VRAM. On AWS, this level of GPU capacity was only accessible to the team through p5 instances, which cost approximately $100 per hour.
Inference was successfully performed using the lightblue/ao-karasu-72B model with vLLM on a g5 instance; however, performance was significantly limited:
- Processing one-sixth (1/6) of the total subtitles took nearly three hours.
- Processing 100 subtitles took approximately one hour.
Inference Usage
```python
import re
import requests
import time
import json

# === CONFIGURATION ===
SRT_FILE = "HappyDeathDay2U_original.srt"
TRANSLATED_SRT_FILE = "HappyDeathDay2U_japanese.srt"
VLLM_API_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = ""
MODEL_NAME = "lightblue/ao-karasu-72B"
DELAY_BETWEEN_REQUESTS = 1


# === EXTRACT SUBTITLES ===
def extract_srt_content(file_path):
    """Reads an SRT file and extracts subtitle data."""
    with open(file_path, "r", encoding="utf-8") as file:
        srt_content = file.readlines()

    timestamp_pattern = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")
    srt_data = []
    current_sub = {"index": None, "start": None, "end": None, "text": ""}

    for line in srt_content:
        line = line.strip()
        if line.isdigit():
            if current_sub["index"] is not None:
                srt_data.append(current_sub)
                print(f"Extracted Subtitle: {current_sub}")  # Debugging print
            current_sub = {"index": int(line), "start": None, "end": None, "text": ""}
        elif timestamp_pattern.match(line):
            match = timestamp_pattern.match(line)
            current_sub["start"], current_sub["end"] = match.groups()
        elif line:
            current_sub["text"] += (" " if current_sub["text"] else "") + line
        else:
            if current_sub["index"] is not None:
                srt_data.append(current_sub)
                print(f"Extracted Subtitle: {current_sub}")  # Debugging print
            current_sub = {"index": None, "start": None, "end": None, "text": ""}

    if current_sub["index"] is not None:
        srt_data.append(current_sub)
        print(f"Extracted Subtitle: {current_sub}")  # Debugging print

    return srt_data


# === CLEAN TRANSLATED OUTPUT ===
def clean_translation(text):
    """Removes unwanted explanations and keeps only the Japanese translation."""
    return re.sub(r"[^\u3040-\u30FF\u4E00-\u9FFF\uFF66-\uFF9F\s.,!?]", "", text).strip()


# === TRANSLATE SUBTITLES TO JAPANESE ===
def translate_subtitles(srt_data):
    """Translates English subtitles to Japanese using ao-karasu-72B served with vLLM."""
    headers = {"Content-Type": "application/json"}
    if API_KEY:
        headers["Authorization"] = f"Bearer {API_KEY}"  # Adds API key if provided

    for sub in srt_data:
        prompt = (
            f"Translate the following English subtitle to Japanese in a natural, conversational tone. "
            f"Do not provide explanations, commentary, or additional text—only the translated Japanese text.\n\n"
            f"English: {sub['text']}"
        )

        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a professional subtitle translator. "
                                              "Your task is to translate subtitles from English to Japanese "
                                              "accurately while keeping a natural tone."},
                {"role": "user", "content": prompt},
            ],
            "response_format": {"type": "text"},
            "temperature": 0.3,
            "top_p": 0.95,
            "max_completion_tokens": 100,
            "seed": 42,
            "stop": ["\n\n"],
            "n": 1,
        }

        try:
            print(f"\nSending request for: {sub['text']}")
            response = requests.post(VLLM_API_URL, json=payload, headers=headers)
            response.raise_for_status()
            response_json = response.json()
            translated_text = response_json["choices"][0]["message"]["content"].strip()
            sub["translated_text"] = clean_translation(translated_text)
            print(f"Translated: {sub['translated_text']}")
        except (requests.RequestException, KeyError, IndexError, json.JSONDecodeError) as e:
            print(f"Translation failed for: {sub['text']} | Error: {e}")
            sub["translated_text"] = sub["text"]  # Fall back to the original English line

        time.sleep(DELAY_BETWEEN_REQUESTS)

    return srt_data


# === SAVE TRANSLATED SRT ===
def save_translated_srt(srt_data, output_path):
    """Saves the translated subtitles back into an SRT file."""
    with open(output_path, "w", encoding="utf-8") as file:
        for sub in srt_data:
            file.write(f"{sub['index']}\n")
            file.write(f"{sub['start']} --> {sub['end']}\n")
            file.write(f"{sub['translated_text']}\n\n")
    print(f"Translated SRT file saved: {output_path}")


# === RUN SCRIPT ===
if __name__ == "__main__":
    print("Extracting subtitles...")
    srt_data = extract_srt_content(SRT_FILE)
    print("Translating subtitles to Japanese...")
    translated_srt_data = translate_subtitles(srt_data)
    save_translated_srt(translated_srt_data, TRANSLATED_SRT_FILE)
```
Output
English subtitles: https://drive.google.com/file/d/1zlMVivdY7PZi__5_FNc1A_k13xlXONa-/view?usp=drive_link
Karasu Japanese translation subtitles: https://drive.google.com/file/d/13YSLg0HtkaayecSixtDB8dic51m8LrH7/view?usp=drive_link
Verdict
Karasu runs reliably on p5 instances; however, the high operating cost, approximately $100 per hour, renders it impractical for production use. More cost-effective g5 instances do not offer the inference speed necessary for efficient internal content processing. Given the current cost-performance tradeoff, Karasu is not recommended for production workloads in its present state.
Gemma 2 JPN
Gemma 2 JPN is a lightweight, instruction-tuned Japanese language model from Google, targeting low-cost deployments with competitive performance. Its design emphasizes accessibility and efficiency, making it suitable for constrained environments where resource usage is a priority.
Model: google/gemma-2-2b-jpn-it
- Model Size: 2B parameters
- Required VRAM: ~5.2 GB
- Compatible GPUs:
- NVIDIA T4 (g4dn.xlarge – ~$0.526/hour)
- NVIDIA A10G (g5.xlarge – ~$1.006/hour)
- Can run comfortably on a single-GPU instance, making it lightweight and affordable for inference, including in private VPCs or on-premises environments.
Inference Usage
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-jpn-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

subtitles = """1
00:00:38,205 --> 00:00:39,039
Hmm.

2
00:00:42,793 --> 00:00:43,627
Mmm.

3
00:03:50,814 --> 00:03:51,648
Hey.

4
00:04:39,530 --> 00:04:41,615
Colonel, give us the room, please.

5
00:04:49,331 --> 00:04:51,959
Four years retired yet two salutes?

6
00:04:52,042 --> 00:04:54,336
- Shows respect.
- Have a seat.

7
00:04:58,715 --> 00:04:59,967
Do you know who I am?

8
00:05:00,717 --> 00:05:02,010
No."""

translation_input_text = f"""You are an expert translator with 10 years of experience specializing in English-to-Japanese translation, particularly for subtitles and dialogue.
Translate the following subtitles accurately while maintaining the nuances, tone, and context of the original dialogue.
Match the original structure to ensure correct timing and readability.
For complex phrases or cultural references, provide the most natural Japanese equivalent. If necessary, include one alternative translation in parentheses.
Output only the translated Japanese subtitles, without explanations or commentary.

Text to translate:
{subtitles}"""

messages = [
    {"role": "user", "content": translation_input_text},
]

outputs = pipe(messages, return_full_text=False, max_new_tokens=1024)
translated_response = outputs[0]["generated_text"].strip()
print(translated_response)
```
Output
```
1
00:00:38,205 --> 00:00:39,039
Hmm。

2
00:00:42,793 --> 00:00:43,627
Mmm。

3
00:03:50,814 --> 00:03:51,648
おい。

4
00:04:39,530 --> 00:04:41,615
Colonel、部屋を貸して。

5
00:04:49,331 --> 00:04:51,959
4年引退なのに2つの敬礼?

6
00:04:52,042 --> 00:04:54,336
- 敬意を表す。
- どうぞお座りください。

7
00:04:58,715 --> 00:04:59,967
君を知ってる?

8
00:05:00,717 --> 00:05:02,010
いいえ。
```
Verdict
The google/gemma-2-2b-jpn-it model supports efficient deployment with minimal hardware requirements and lower operational costs, running on widely available GPU instances such as the NVIDIA T4 or A10G. While the model is lightweight and performs reliably, translation accuracy remains a limitation. Back-translation using DeepL frequently reveals discrepancies from the original English input, suggesting that the output lacks fidelity. As a result, despite its cost-effectiveness and ease of deployment, Gemma may not yet be suitable for use cases requiring high translation accuracy, particularly in sensitive or complex contexts.
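A back-translation spot check of the kind described above can be scripted in a few lines. The sketch below assumes the official deepl Python client and a placeholder API key; the sample line is taken from the Gemma output shown earlier.

```python
# Minimal sketch: back-translate Japanese output to English with DeepL and
# compare it against the source line. Assumes the official `deepl` client.
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key

def back_translate(japanese_text: str) -> str:
    result = translator.translate_text(japanese_text, source_lang="JA", target_lang="EN-US")
    return result.text

source_en = "Colonel, give us the room, please."
model_ja = "Colonel、部屋を貸して。"  # Gemma output from the sample above
print("Original   :", source_en)
print("Round-trip :", back_translate(model_ja))  # manual review flags fidelity gaps
```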
Subtitle Translation Model Comparison
| | XL8 (Existing) | Mistral NeMo | Karasu | Gemma 2 JPN |
|---|---|---|---|---|
| Translation Quality | Literal, poor quality | Good contextual understanding | Good, but with occasional language errors | Clean and fluent in Japanese, but poor semantic fidelity when back-translated |
| Formality Handling | No | Yes (preserves formality & tone) | No | Partial (basic honorifics handled, not nuanced) |
| Character Voice/Trait Preservation | No | Yes | Limited | Limited, tone remains consistent |
| Error Rate | High | Low | ~5% of outputs were in Chinese | Low (~1–2% hallucinations or back-translation drift) |
| Inference Speed | Fast | Moderate (better on g5.48xlarge) | Very slow (e.g., 3 hours for 1/6 of subtitles) | Fast (optimized 2B model, runs well on g5) |
| Hardware Requirements | Existing Setup | g5.4xlarge to g5.48xlarge | p5 with 4×A100 GPUs | g5.xlarge and above (A10G with 24 GB VRAM) |
| On-Prem Deployment Feasibility | Already deployed | Feasible with vLLM or TensorRT-LLM | Hard to deploy due to hardware constraints | Easy, lightweight for on-prem (even with vLLM) |
| Cost | Low | Moderate to High (depending on instance) | Very High (~$100/hr on AWS p5) | Low (~$1.01/hr on AWS g5.xlarge) |
Conclusion
In comparison to the client’s existing machine translation system (XL8), both Mistral and Karasu delivered noticeably higher-quality translations. XL8 often produced rigid, low-fidelity output that failed to reflect the intended meaning or tone, whereas Mistral and Karasu demonstrated stronger contextual understanding and more natural phrasing. Mistral, in particular, preserved appropriate levels of formality and retained key character traits—critical elements for accurate subtitle generation.
Karasu, however, showed occasional reliability issues, including the unexpected return of Chinese text instead of Japanese in roughly 5% of test cases. It also poses greater challenges for on-premises deployment. Considering both translation quality and ease of integration, Mistral currently stands out as the most practical option for continued evaluation.
About TrackIt
TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.
We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.
Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code, and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings, which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.