Written by Maram Ayari and Leandro Mota, Software & DevOps Engineers at TrackIt
As global demand for multilingual content continues to rise, so does the need for fast and accurate subtitle translation. To address this, TrackIt conducted a proof of concept to evaluate the effectiveness of generative AI (GenAI) models in translating subtitles from English to Japanese, with the goal of improving upon the performance of XL8, the machine translation system currently used in production. Outlined below are the approach, key findings, and considerations for integrating the solution into on-premises environments or within a private VPC setup.
Mistral NeMo
Mistral NeMo is a compact, open-weight generative model developed by Mistral AI, designed for efficient deployment in self-hosted environments. It supports long-context processing and can be integrated with popular inference engines such as vLLM, TensorRT-LLM, and TGI.
Requirements: For inference, an additional memory overhead of up to 20% beyond the base model size should be accounted for. For the 12B-parameter Mistral NeMo, base weight memory is roughly 24 GB at float16/bfloat16 precision (2 bytes per parameter) and roughly 48 GB at float32 precision. A practical method for estimating memory use is available here; a simplified version is sketched below.
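As a rough rule of thumb, the required memory can be approximated from the parameter count, the bytes per parameter for the chosen precision, and the overhead margin mentioned above. The sketch below is illustrative only; the 12B parameter count is the published size of Mistral NeMo, and the 20% overhead is the figure assumed in this PoC rather than a universal constant.

```python
# Rough estimate of the memory needed to serve a model for inference.
# Assumption: weights dominate, and KV cache / activations are folded into
# a flat ~20% overhead factor (as assumed above).
def estimate_model_memory_gb(params_billions: float,
                             bytes_per_param: int = 2,  # 2 = fp16/bf16, 4 = fp32
                             overhead: float = 0.20) -> float:
    base_gb = params_billions * bytes_per_param  # ~1 GB per billion params per byte
    return base_gb * (1 + overhead)

# Mistral NeMo (~12B parameters)
print(estimate_model_memory_gb(12, 2))  # ~28.8 GB at bf16/fp16
print(estimate_model_memory_gb(12, 4))  # ~57.6 GB at fp32
```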
vLLM, in particular, offers a highly optimized Python-based serving framework that can expose an OpenAI-compatible API, making it well-suited for integration in controlled environments.
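As a brief illustration of that compatibility, the sketch below queries a self-hosted vLLM endpoint with the official openai client. The base URL, model identifier, and prompt are example values rather than the exact PoC configuration; it assumes a vLLM OpenAI-compatible server (e.g. `vllm serve mistralai/Mistral-Nemo-Instruct-2407`) is already running locally.

```python
# Minimal sketch: calling a self-hosted vLLM server through its
# OpenAI-compatible API. Endpoint and model name are example values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key by default

response = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    messages=[
        {"role": "system", "content": "You are a professional subtitle translator."},
        {"role": "user", "content": "Translate to Japanese: Do you know who I am?"},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```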
Performance Testing and Observations
1. Initial Issues on g5.2xlarge
- The default context length of ~128K tokens caused excessive memory usage.
- The 24 GB of VRAM on the instance's single A10G GPU was insufficient, leading to memory fragmentation and performance degradation.
- The model frequently ran out of memory, forcing a reduction in context length.
2. Workarounds Implemented
- Limited the context length to 4,096 tokens to prevent memory exhaustion.
- Enabled CPU offloading (4 GB initially, later increased to 12 GB) to relieve GPU memory pressure; a configuration sketch follows this list.
- Performance remained suboptimal, with slow execution and limited translation coherence.
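For reference, the sketch below shows roughly what this constrained configuration looks like with vLLM's offline engine; the same settings map to the `--max-model-len` and `--cpu-offload-gb` flags of the OpenAI-compatible server. The model identifier and exact values are assumptions mirroring the workarounds above, not a recommended setup.

```python
# Minimal sketch: vLLM constrained for a single 24 GB GPU (e.g. g5.2xlarge).
# max_model_len caps the context window; cpu_offload_gb moves part of the
# weights to system RAM at the cost of slower inference.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # example identifier
    max_model_len=4096,          # reduced from the default ~128K context
    cpu_offload_gb=12,           # offload up to 12 GB of weights to CPU RAM
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.3, top_p=0.95, max_tokens=100)
outputs = llm.generate(["Translate to Japanese: Have a seat."], params)
print(outputs[0].outputs[0].text)
```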
3. Upgrading to g5.4xlarge
- Increased vCPUs and RAM improved processing, but GPU memory remained a bottleneck.
- The model was operational but still slow, and translations suffered from loss of coherence due to reduced context length.
4. vLLM for Real-Time Subtitling
- vLLM was not an effective approach for real-time subtitle generation.
- The inference speed was too slow for handling dynamic, streaming subtitle flows.
- The model required significant preloading and batching, making it unsuitable for low-latency applications.
5. Upgrading to g5.48xlarge
- Successfully ran Mistral NeMo on a g5.48xlarge instance without imposing any context-length limitation.
- Used tensor parallelism across the instance's eight A10G GPUs, together with CPU offloading, to optimize performance.
- Achieved significantly better results, though translation issues still persist.
- The model was able to handle the full 128K token context length without performance degradation.
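The sketch below illustrates what this final configuration could look like: the eight-way tensor-parallel split matches the eight A10G GPUs on a g5.48xlarge, and the context length reflects the ~128K tokens handled in the tests. The model identifier and offload size are assumptions; the equivalent server flags are `--tensor-parallel-size` and `--cpu-offload-gb`.

```python
# Minimal sketch: sharding Mistral NeMo across all eight A10G GPUs of a
# g5.48xlarge so the full ~128K-token context fits without OOM errors.
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # example identifier
    tensor_parallel_size=8,   # one shard per A10G GPU
    max_model_len=128000,     # ~128K-token context, as used in the tests
    cpu_offload_gb=8,         # optional extra headroom in system RAM
)
```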
6. Inference Usage
```python
import time
import requests

# Configuration: mirrors the full script shown in the Karasu section;
# MODEL_NAME must match the model identifier served by vLLM (example below).
VLLM_API_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = ""
MODEL_NAME = "mistralai/Mistral-Nemo-Instruct-2407"
DELAY_BETWEEN_REQUESTS = 1


def translate_subtitles(srt_data):
    """Translates English subtitles to Japanese using Mistral NeMo served with vLLM."""
    headers = {"Content-Type": "application/json"}
    if API_KEY:
        headers["Authorization"] = f"Bearer {API_KEY}"  # Adds API key if provided

    for sub in srt_data:
        prompt = (
            f"Translate the following English subtitle to Japanese in a natural, conversational tone. "
            f"Do not provide explanations, commentary, or additional text—only the translated Japanese text.\n\n"
            f"English: {sub['text']}"
        )

        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a professional subtitle translator. "
                                              "Your task is to translate subtitles from English to Japanese "
                                              "accurately while keeping a natural tone."},
                {"role": "user", "content": prompt},
            ],
            "response_format": {"type": "text"},
            "temperature": 0.3,
            "top_p": 0.95,
            "max_completion_tokens": 100,
            "seed": 42,
            "stop": ["\n\n"],
            "n": 1,
        }

        try:
            response = requests.post(VLLM_API_URL, json=payload, headers=headers)
            response.raise_for_status()
            translated_text = response.json()["choices"][0]["message"]["content"].strip()
            sub["translated_text"] = translated_text
        except (requests.RequestException, KeyError, IndexError) as e:
            print(f"Translation failed for: {sub['text']} | Error: {e}")
            sub["translated_text"] = sub["text"]  # Fall back to the original English line

        time.sleep(DELAY_BETWEEN_REQUESTS)

    return srt_data
```
7. Challenges with Instance Availability
- Encountered difficulties when stopping and restarting the instance.
- AWS resource allocation for g5.48xlarge depends on available capacity within the chosen Availability Zone (AZ).
- Due to high demand, multiple attempts were required before the instance could be launched again; a simplified retry sketch is shown below.
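For illustration only, the sketch below shows how such capacity-constrained launches can be retried programmatically with boto3. This is not the Ansible automation used in the PoC (described in the next section); the AMI, region, and retry values are placeholders.

```python
# Simplified sketch: retrying a g5.48xlarge launch when the Availability Zone
# has no free capacity. Placeholder values; not the PoC's Ansible playbook.
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region

def launch_with_retries(max_attempts: int = 5, wait_seconds: int = 300) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            response = ec2.run_instances(
                ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder AMI
                InstanceType="g5.48xlarge",
                MinCount=1,
                MaxCount=1,
            )
            return response["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                print(f"Attempt {attempt}: no g5.48xlarge capacity, retrying in {wait_seconds}s...")
                time.sleep(wait_seconds)
            else:
                raise
    raise RuntimeError("g5.48xlarge capacity not available after retries")
```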
8. Automation Enhancements
- Created an Ansible automation script to deploy/redeploy instances more quickly.
- Developed a translation script utilizing Mistral NeMo for optimized execution.
- Both outputs (g5.4xlarge and g5.48xlarge) have been shared on the Slack channel for reference.
Output
English subtitles: https://drive.google.com/file/d/1zlMVivdY7PZi__5_FNc1A_k13xlXONa-/view?usp=drive_link
Mistral Japanese translation subtitles: https://drive.google.com/file/d/1rS5JwQtYoz2GEy7kpAiGXatWalGPnQUl/view?usp=drive_link
Verdict
The deployment of Mistral NeMo with vLLM on AWS EC2 initially revealed substantial resource limitations, particularly concerning GPU memory and real-time processing capabilities. The default 128K token context length proved impractical on lower-tier GPU instances, and performance remained below expectations even after implementing various optimizations.
Upgrading to a g5.48xlarge instance and applying tensor parallelism along with CPU offloading enabled full utilization of the 128K token context length without encountering memory-related issues. While this configuration delivered notable performance gains, the quality of translations still requires further refinement.
In addition, AWS availability constraints for high-tier GPU instances pose challenges in ensuring consistent access to the required infrastructure for stable and scalable deployment.
Karasu
Karasu is a Japanese-focused large language model optimized for high-accuracy translation and content generation. Developed specifically for the Japanese language, it performs well in capturing nuance and tone but demands considerable computational resources for real-time inference.
Model: lightblue/ao-karasu-72B
The ao-karasu-72B model requires a minimum of four A100-class GPUs for optimal performance; at 16-bit precision, its 72 billion parameters alone occupy roughly 144 GB of VRAM. On AWS, this level of GPU capacity was only accessible to the team through p5 instances, which cost approximately $100 per hour.
Inference was successfully performed using the lightblue/ao-karasu-72B model with vLLM on a g5 instance; however, performance was significantly limited:
- Processing one-sixth (1/6) of the total subtitles took nearly three hours.
- Processing 100 subtitles took approximately one hour.
Inference Usage
```python
import re
import requests
import time
import json

# === CONFIGURATION ===
SRT_FILE = "HappyDeathDay2U_original.srt"
TRANSLATED_SRT_FILE = "HappyDeathDay2U_japanese.srt"
VLLM_API_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = ""
MODEL_NAME = "lightblue/ao-karasu-72B"
DELAY_BETWEEN_REQUESTS = 1


# === EXTRACT SUBTITLES ===
def extract_srt_content(file_path):
    """Reads an SRT file and extracts subtitle data."""
    with open(file_path, "r", encoding="utf-8") as file:
        srt_content = file.readlines()

    timestamp_pattern = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")
    srt_data = []
    current_sub = {"index": None, "start": None, "end": None, "text": ""}

    for line in srt_content:
        line = line.strip()
        if line.isdigit():
            if current_sub["index"] is not None:
                srt_data.append(current_sub)
                print(f"Extracted Subtitle: {current_sub}")  # Debugging print
            current_sub = {"index": int(line), "start": None, "end": None, "text": ""}
        elif timestamp_pattern.match(line):
            match = timestamp_pattern.match(line)
            current_sub["start"], current_sub["end"] = match.groups()
        elif line:
            current_sub["text"] += (" " if current_sub["text"] else "") + line
        else:
            if current_sub["index"] is not None:
                srt_data.append(current_sub)
                print(f"Extracted Subtitle: {current_sub}")  # Debugging print
            current_sub = {"index": None, "start": None, "end": None, "text": ""}

    if current_sub["index"] is not None:
        srt_data.append(current_sub)
        print(f"Extracted Subtitle: {current_sub}")  # Debugging print

    return srt_data


# === CLEAN TRANSLATED OUTPUT ===
def clean_translation(text):
    """Removes unwanted explanations and keeps only the Japanese translation."""
    return re.sub(r"[^\u3040-\u30FF\u4E00-\u9FFF\uFF66-\uFF9F\s.,!?]", "", text).strip()


# === TRANSLATE SUBTITLES TO JAPANESE ===
def translate_subtitles(srt_data):
    """Translates English subtitles to Japanese using ao-karasu-72B served with vLLM."""
    headers = {"Content-Type": "application/json"}
    if API_KEY:
        headers["Authorization"] = f"Bearer {API_KEY}"  # Adds API key if provided

    for sub in srt_data:
        prompt = (
            f"Translate the following English subtitle to Japanese in a natural, conversational tone. "
            f"Do not provide explanations, commentary, or additional text—only the translated Japanese text.\n\n"
            f"English: {sub['text']}"
        )

        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a professional subtitle translator. "
                                              "Your task is to translate subtitles from English to Japanese "
                                              "accurately while keeping a natural tone."},
                {"role": "user", "content": prompt},
            ],
            "response_format": {"type": "text"},
            "temperature": 0.3,
            "top_p": 0.95,
            "max_completion_tokens": 100,
            "seed": 42,
            "stop": ["\n\n"],
            "n": 1,
        }

        try:
            print(f"\nSending request for: {sub['text']}")
            response = requests.post(VLLM_API_URL, json=payload, headers=headers)
            response.raise_for_status()
            response_json = response.json()
            translated_text = response_json["choices"][0]["message"]["content"].strip()
            sub["translated_text"] = clean_translation(translated_text)
            print(f"Translated: {sub['translated_text']}")
        except (requests.RequestException, KeyError, IndexError, json.JSONDecodeError) as e:
            print(f"Translation failed for: {sub['text']} | Error: {e}")
            sub["translated_text"] = sub["text"]  # Fall back to the original English line

        time.sleep(DELAY_BETWEEN_REQUESTS)

    return srt_data


# === SAVE TRANSLATED SRT ===
def save_translated_srt(srt_data, output_path):
    """Saves the translated subtitles back into an SRT file."""
    with open(output_path, "w", encoding="utf-8") as file:
        for sub in srt_data:
            file.write(f"{sub['index']}\n")
            file.write(f"{sub['start']} --> {sub['end']}\n")
            file.write(f"{sub['translated_text']}\n\n")
    print(f"Translated SRT file saved: {output_path}")


# === RUN SCRIPT ===
if __name__ == "__main__":
    print("Extracting subtitles...")
    srt_data = extract_srt_content(SRT_FILE)
    print("Translating subtitles to Japanese...")
    translated_srt_data = translate_subtitles(srt_data)
    save_translated_srt(translated_srt_data, TRANSLATED_SRT_FILE)
```
Output
English subtitles: https://drive.google.com/file/d/1zlMVivdY7PZi__5_FNc1A_k13xlXONa-/view?usp=drive_link
Karasu Japanese translation subtitles: https://drive.google.com/file/d/13YSLg0HtkaayecSixtDB8dic51m8LrH7/view?usp=drive_link
Verdict
Karasu runs reliably on p5 instances; however, the high operating cost, approximately $100 per hour, renders it impractical for production use. More cost-effective g5 instances do not offer the inference speed necessary for efficient internal content processing. Given the current cost-performance tradeoff, Karasu is not recommended for production workloads in its present state.
Gemma 2 JPN
Gemma 2 JPN is a lightweight, instruction-tuned Japanese language model from Google, targeting low-cost deployments with competitive performance. Its design emphasizes accessibility and efficiency, making it suitable for constrained environments where resource usage is a priority.
Model: google/gemma-2-2b-jpn-it
- Model Size: 2B parameters
- Required VRAM: ~5.2 GB
- Compatible GPUs:
- NVIDIA T4 (g4dn.xlarge – ~$0.526/hour)
- NVIDIA A10G (g5.xlarge – ~$1.006/hour)
- Can run comfortably on a single-GPU instance, making it lightweight and affordable for inference, including in private VPCs or on-premises environments.
Inference Usage
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-jpn-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

subtitles = """1
00:00:38,205 --> 00:00:39,039
Hmm.

2
00:00:42,793 --> 00:00:43,627
Mmm.

3
00:03:50,814 --> 00:03:51,648
Hey.

4
00:04:39,530 --> 00:04:41,615
Colonel, give us the room, please.

5
00:04:49,331 --> 00:04:51,959
Four years retired yet two salutes?

6
00:04:52,042 --> 00:04:54,336
- Shows respect.
- Have a seat.

7
00:04:58,715 --> 00:04:59,967
Do you know who I am?

8
00:05:00,717 --> 00:05:02,010
No."""

translation_input_text = f"""You are an expert translator with 10 years of experience specializing in English-to-Japanese translation, particularly for subtitles and dialogue.
Translate the following subtitles accurately while maintaining the nuances, tone, and context of the original dialogue.
Match the original structure to ensure correct timing and readability.
For complex phrases or cultural references, provide the most natural Japanese equivalent. If necessary, include one alternative translation in parentheses.
Output only the translated Japanese subtitles, without explanations or commentary.

Text to translate:
{subtitles}"""

messages = [
    {"role": "user", "content": translation_input_text},
]

outputs = pipe(messages, return_full_text=False, max_new_tokens=1024)
translated_response = outputs[0]["generated_text"].strip()
print(translated_response)
```
Output
```
1
00:00:38,205 --> 00:00:39,039
Hmm。

2
00:00:42,793 --> 00:00:43,627
Mmm。

3
00:03:50,814 --> 00:03:51,648
おい。

4
00:04:39,530 --> 00:04:41,615
Colonel、部屋を貸して。

5
00:04:49,331 --> 00:04:51,959
4年引退なのに2つの敬礼?

6
00:04:52,042 --> 00:04:54,336
- 敬意を表す。
- どうぞお座りください。

7
00:04:58,715 --> 00:04:59,967
君を知ってる?

8
00:05:00,717 --> 00:05:02,010
いいえ。
```
Verdict
The google/gemma-2-2b-jpn-it model supports efficient deployment with minimal hardware requirements and lower operational costs, running on widely available GPU instances such as the NVIDIA T4 or A10G. While the model is lightweight and performs reliably, translation accuracy remains a limitation. Back-translation using DeepL frequently reveals discrepancies from the original English input, suggesting that the output lacks fidelity. As a result, despite its cost-effectiveness and ease of deployment, Gemma may not yet be suitable for use cases requiring high translation accuracy, particularly in sensitive or complex contexts.
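A back-translation spot check of the kind described above can be scripted in a few lines. The sketch below assumes the official deepl Python client and a placeholder API key; the sample line is taken from the Gemma output shown earlier.

```python
# Minimal sketch: back-translate Japanese output to English with DeepL and
# compare it against the source line. Assumes the official `deepl` client.
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key

def back_translate(japanese_text: str) -> str:
    result = translator.translate_text(japanese_text, source_lang="JA", target_lang="EN-US")
    return result.text

source_en = "Colonel, give us the room, please."
model_ja = "Colonel、部屋を貸して。"  # Gemma output from the sample above
print("Original   :", source_en)
print("Round-trip :", back_translate(model_ja))  # manual review flags fidelity gaps
```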
Subtitle Translation Model Comparison
| | XL8 (Existing) | Mistral NeMo | Karasu | Gemma 2 JPN |
|---|---|---|---|---|
| Translation Quality | Literal, poor quality | Good contextual understanding | Good, but with occasional language errors | Clean and fluent in Japanese, but poor semantic fidelity when back-translated |
| Formality Handling | No | Yes (preserves formality & tone) | No | Partial (basic honorifics handled, not nuanced) |
| Character Voice/Trait Preservation | No | Yes | Limited | Limited, tone remains consistent |
| Error Rate | High | Low | ~5% of outputs were in Chinese | Low (~1–2% hallucinations or back-translation drift) |
| Inference Speed | Fast | Moderate (better on g5.48xlarge) | Very slow (e.g., 3 hours for 1/6 of subtitles) | Fast (optimized 2B model, runs well on g5) |
| Hardware Requirements | Existing Setup | g5.4xlarge to g5.48xlarge | p5 with 4×A100 GPUs | g5.xlarge and above (A10G with 24 GB VRAM) |
| On-Prem Deployment Feasibility | Already deployed | Feasible with vLLM or TensorRT-LLM | Hard to deploy due to hardware constraints | Easy, lightweight for on-prem (even with vLLM) |
| Cost | Low | Moderate to High (depending on instance) | Very High (~$100/hr on AWS p5) | Low (~$1.01/hr on AWS g5.xlarge) |
Conclusion
In comparison to the client’s existing machine translation system (XL8), both Mistral and Karasu delivered noticeably higher-quality translations. XL8 often produced rigid, low-fidelity output that failed to reflect the intended meaning or tone, whereas Mistral and Karasu demonstrated stronger contextual understanding and more natural phrasing. Mistral, in particular, preserved appropriate levels of formality and retained key character traits—critical elements for accurate subtitle generation.
Karasu, however, showed occasional reliability issues, including the unexpected return of Chinese text instead of Japanese in roughly 5% of test cases. It also poses greater challenges for on-premises deployment. Considering both translation quality and ease of integration, Mistral currently stands out as the most practical option for continued evaluation.
About TrackIt
TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.
We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.
Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code, and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings, which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.