Back to work

Translation / 2026-05-25

Translation Model Benchmark for Multilingual Video Transcripts

This benchmark translated roughly 500,000 characters of real multilingual transcript text from 15 languages into English. It compares Google Translate, DeepL, and Llama Maverick 4 across quality, latency, reliability, and practical operating tradeoffs.

Published Benchmark Technical writeup
Segments 293

Sampled from a production transcript table

Languages 15

Top non-English languages by transcript volume

Best quality Llama

COMET-QE -0.4481, 10 of 15 language wins

Fastest DeepL

851 ms average latency

Problem

Choose a production-ready translation path for noisy multilingual ASR transcript text without relying on clean benchmark corpora.

Approach

Translated 293 transcript segments across 15 languages and compared quality, latency, reliability, throughput, and length behavior across three providers.

Result

Llama produced the best COMET-QE quality and reliability, while DeepL was the strongest low-latency option for real-time paths.

Technologies
  • Python
  • Google Translate
  • DeepL
  • Llama Maverick 4
  • Together AI
  • COMET-QE
  • BLEU
  • BERTScore

Context

The test used ASR-style video transcripts rather than clean written text, which makes the benchmark closer to production transcript translation. The source table contained 84 million total transcripts, and the sampled workload covered about 33,333 characters per language.

The compared systems were Google Translate through the unofficial googletrans library, DeepL through the official API, and Meta Llama 4 Maverick 17B-128E through Together AI.

At a glance

DimensionWinnerWhy
Translation qualityLlama Maverick 4Highest overall COMET-QE score and strongest language coverage.
SpeedDeepLAbout 7x faster than Llama and about 1.5x faster than Google.
ReliabilityLlama Maverick 4One timeout across 293 segments.
Free-tier costTieAll systems were free for this roughly 500k character benchmark volume.

Overall results

All attempted translations

ModelSuccessError rateAvg latencyThroughputCOMET-QELength ratio
Google241/29317.7%1,299 ms477 chars/s-0.48131.253
DeepL279/2934.8%851 ms1,341 chars/s-0.48631.248
Llama292/2930.3%6,088 ms201 chars/s-0.44811.641

Fair comparison excluding integration failures

ModelSuccessError rateAvg latencyThroughputCOMET-QELength ratio
Google241/2410.0%1,299 ms477 chars/s-0.48131.253
DeepL279/2790.0%851 ms1,341 chars/s-0.48631.248
Llama292/2930.3%6,088 ms201 chars/s-0.44811.641

Language-level findings

Llama won 10 of the 15 languages by COMET-QE, while Google won five. DeepL was rarely the quality winner in this transcript-heavy sample, but it remained the strongest option when latency mattered.

LanguageGoogle COMETDeepL COMETLlama COMETWinner
Spanish-0.4340-0.4439-0.4017Llama
Portuguese-0.5881No data-0.5871Llama
Russian-0.3498-0.4469-0.3611Google
Japanese-0.6601-0.6506-0.5692Llama
Hindi-0.3107-0.3489-0.2503Llama
French-0.7363-0.6186-0.5516Llama
German-0.1751-0.2135-0.2205Google
ChineseNo data-0.3471-0.1618Llama

Recommendation

  • Use Llama Maverick 4 when quality is the primary goal and batch latency is acceptable.
  • Use DeepL for real-time or high-volume product paths where sub-second latency matters.
  • Use Google through the official API, not googletrans, for cost-sensitive batch processing where moderate reliability and quality are acceptable.
  • Run a follow-up benchmark on low-resource languages because the current sample focused on common languages only.

Detailed methodology and results

Supporting methodology, figures, and tables are rendered here as native page content with the same visual system as the rest of this website.

Comparing Google Translate, DeepL, and Llama Maverick 4 across 15 languages

Generated from ~500,000 characters of video transcript data (293 segments)

Executive Summary

The benchmark compares three translation services by translating ~500,000 characters of real video transcript data from 15 languages into English. The corpus uses Iceberg-backed automatic speech recognition (ASR) transcript data - making this a realistic test of how these models handle noisy, informal, spoken-language content.

Key finding: Llama Maverick 4 produced the highest-quality translations (COMET-QE: -0.4481) with near-perfect reliability (99.7% success rate), winning 10 out of 15 languages. DeepL was the fastest (avg 851ms, median ~635ms) and most cost-efficient for real-time use. Google Translate was competitive on quality - winning 5 out of 15 languages - but suffered from reliability issues due to the unofficial API library used.

At a glance

CategoryWinnerDetails
Translation QualityLlamaHighest COMET-QE score across most languages
SpeedDeepL~7x faster than Llama, ~1.5x faster than Google
ReliabilityLlamaOnly 1 failure out of 293 segments (timeout on long text)
Cost (free tier)All tiedAll three were free for this volume (~500K chars)

Methodology

This benchmark was designed to evaluate translation quality on real-world, messy content rather than curated test sets.

Data source

  • Source: large Iceberg-backed transcript table (84M total transcripts)
  • Selection: Top 15 non-English languages by volume, ~33,333 characters per language
  • Content type: ASR video transcripts - informal, spoken language with typical ASR noise
  • Total: 293 segments, ~500,000 characters

Models tested

  • Google Translate - via googletrans library (unofficial free API)
  • DeepL - via official DeepL API (free tier, 500K char/month)
  • Llama Maverick 4 - Meta's Llama 4 Maverick 17B-128E via Together AI API

Metrics

  • COMET-QE - Reference-free translation quality score (Unbabel/wmt20-comet-qe-da). Higher is better. Does not need human reference translations.
  • Latency - Wall-clock time per API call in milliseconds
  • Throughput - Characters translated per second
  • Length ratio - Ratio of translation length to source length (1.0 = same length)
  • Error rate - Percentage of segments that failed to translate
  • Cross-model BLEU/chrF/BERTScore - Pairwise similarity between translations from different providers

Note on COMET-QE scores: The wmt20-comet-qe-da model outputs scores that can be negative. These are relative quality indicators - what matters is the comparison between providers, not the absolute values. Negative scores are expected for ASR transcripts which are inherently noisy.

Results - All Data (Including Errors)

This section shows the raw results from all 293 segments sent to each provider. Some errors were caused by benchmark integration issues, not by the translation services themselves. Those cases are separated in the error analysis below.

ProviderSuccessError RateAvg Latency (ms)Avg Throughput (chars/s)COMET-QELength Ratio
Google241/29317.7%1299477-0.48131.253
Deepl279/2934.8%8511341-0.48631.248
Llama292/2930.3%6088201-0.44811.641
Summary comparison - all data
Summary comparison - all data

The four bar charts above compare average latency, throughput, COMET-QE quality, and error rate. Gold borders highlight the best performer in each category. Llama leads on quality (COMET-QE: -0.4481) and error rate (0.3%). DeepL leads on speed (851ms avg, 1341 chars/sec). Note that Google's 17.7% error rate and DeepL's 14 Portuguese failures are caused by integration issues, not translation quality issues.

Error Analysis

Out of 879 total translation attempts (293 segments x 3 providers), 67 failed. The vast majority were caused by issues in the benchmark harness, not by the translation services.

ProviderError TypeCountRoot CauseIntegration Issue?
GoogleInvalid source language18The harness sent zh but googletrans requires zh-cnYes
GoogleJSON parse error / rate limit34The unofficial googletrans library was rate-limited by Google's anti-bot protectionPartial - library choice
DeepLUnsupported source_lang14The harness sent PT-BR as source language - valid only as target in DeepL APIYes
LlamaRead timeout1A 15,028-char Korean segment exceeded the 120s timeoutPartial - timeout too low
Error analysis
Error analysis

The left chart shows overall error rates: Google at 17.7%, DeepL at 4.8%, and Llama at just 0.3%. The right chart breaks down errors by language - notice that Google's 18 Chinese errors (code bug sending wrong language code) and 34 errors spread across other languages (rate limiting from the unofficial library), while DeepL's 14 errors are concentrated entirely in Portuguese (a source-language mapping issue). These patterns confirm the errors are not quality-related.

Impact on results: To ensure a fair comparison, the next section excludes all 52 Google error segments (since the unofficial googletrans library is responsible for all failures - both the Chinese language code bug and rate limiting) and all 14 DeepL Portuguese segments (a source-language mapping issue). This gives each provider a level playing field, with only Llama's single timeout remaining as a genuine operational failure.

Results - Fair Comparison (Excluding Integration Issues)

This comparison removes all 52 Google error segments - both the Chinese language code bug (18) and rate limiting (34) are caused by using the unofficial googletrans library - and all 14 DeepL Portuguese segments affected by a source-language mapping issue. After exclusion, Google and DeepL both show 0% error rate, with only Llama's single timeout (0.3%) remaining.

ProviderSuccessError RateAvg Latency (ms)Avg Throughput (chars/s)COMET-QELength Ratio
Google241/2410.0%1299477-0.48131.253
Deepl279/2790.0%8511341-0.48631.248
Llama292/2930.3%6088201-0.44811.641
COMET-QE fair comparison
COMET-QE fair comparison
Latency fair comparison
Latency fair comparison

Left chart (Quality): COMET-QE score distribution per provider. The box shows the middle 50% of scores, with the line inside marking the median. Llama Maverick 4 has the highest median (-0.4481 avg) and the tightest spread, meaning it produces consistently good translations. Google (-0.4813 avg) and DeepL (-0.4863 avg) are very close to each other but both trail Llama.

Right chart (Speed): Latency distribution per provider. DeepL is clearly the fastest - its box is tightly clustered with a median around 635ms. Google sits in the middle at ~1,300ms average. Llama is significantly slower with an average of ~6,088ms and a long upper tail, with some segments exceeding 60 seconds for very long texts. This is the trade-off for running a large language model per translation request.

Translation Quality Deep Dive

COMET-QE (Quality Estimation) evaluates translations without needing human references. It uses a trained neural model to assess how well the translation captures the meaning of the source text.

COMET-QE all data
COMET-QE all data
Length ratio
Length ratio

Left chart (COMET-QE - All Data): Quality distribution across all successful translations. Each box shows the middle 50% of scores with the median line inside. Llama's median (-0.4481 avg) is noticeably higher than Google (-0.4813) and DeepL (-0.4863), and its box is narrower - meaning it produces more consistently good translations with fewer low-quality outliers.

Right chart (Length Ratio): How long are the translations compared to the source text? A ratio of 1.0 (the red dashed line) means the translation is exactly the same length as the source. Google (1.253 avg) and DeepL (1.248 avg) produce translations roughly 25% longer than the source - typical when translating from compact languages like Chinese, Japanese, and Korean into English. Llama (1.641 avg) produces translations ~64% longer - it tends to be more verbose, sometimes adding context or rephrasing more extensively. Note the wide spread in all boxes: length ratio varies significantly by source language.

COMET-QE heatmap by language
COMET-QE heatmap by language

This heatmap shows COMET-QE quality broken down by language (rows) and provider (columns). Each cell contains the average score for that combination. Green cells indicate higher quality; red/yellow cells indicate lower quality. Key observations:

  • Llama wins 10 out of 15 languages - it is the best provider for Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Turkish, and Chinese
  • Google wins 5 languages - Arabic, German, Dutch, Russian, and Vietnamese, where it slightly edges out the competition
  • European languages (German, Dutch, Russian) tend to score higher across all providers - these are well-resourced languages with clean ASR output
  • Asian languages (Japanese, Korean) and Indonesian show lower scores across the board, likely reflecting noisier ASR transcripts and harder translation pairs
  • Missing cells (e.g., Google for Chinese) indicate where a provider had no successful translations due to benchmark integration issues

Cross-Model Agreement

Beyond evaluating each provider individually, the benchmark measured how much the three providers agree with each other . When two independent translation systems produce similar output for the same input, it increases confidence that both are correct. The benchmark used three pairwise metrics:

  • BLEU - Measures word-level overlap between two translations (0 - 100). Higher means more similar phrasing.
  • chrF - Character-level overlap, less sensitive to exact word choice (0 - 100). Good for morphologically rich languages.
  • BERTScore F1 - Semantic similarity using neural embeddings (0 - 1). Two translations can score high even if they use completely different words, as long as the meaning is the same.
Cross-model pairwise agreement
Cross-model pairwise agreement

The three panels above show average pairwise BLEU, chrF, and BERTScore F1 for each provider pair. Each bar represents the average agreement between two providers across all segments where both succeeded. Key takeaways:

  • DeepL and Google agree the most - BLEU 38.3, chrF 60.2, BERTScore 0.917. Both are dedicated translation engines optimized for fluent, concise output, so they tend to produce similar phrasing.
  • Google and Llama show moderate agreement - BLEU 33.8, chrF 57.1, BERTScore 0.910. Despite Llama's different approach (LLM vs translation engine), it often converges on similar output to Google.
  • DeepL and Llama diverge the most - BLEU 30.8, chrF 54.4, BERTScore 0.897. Llama's higher length ratio (1.64x) means it rephrases more freely, producing longer translations that lower word-level overlap even when meaning is preserved.
  • BERTScore F1 is high across all pairs (0.897 - 0.917), confirming that despite surface-level wording differences, all three providers capture the same core meaning. The translations are semantically equivalent.
BERTScore heatmap by language
BERTScore heatmap by language

This heatmap breaks down BERTScore F1 by language (rows) and provider pair (columns). Each cell shows the average semantic similarity between two providers' translations for that language. Greener cells = higher agreement, yellower cells = more divergence. Key patterns:

  • European languages (Spanish, French, German, Italian, Dutch) show the highest cross-model agreement - all providers translate these well and produce semantically similar output.
  • Asian languages (Japanese, Korean, Chinese) and Arabic show slightly lower agreement, reflecting genuine differences in how each model handles more distant language pairs with different scripts and grammar.
  • DeepL vs Google (left column) is consistently the greenest - these two dedicated translation engines agree the most across all languages.
  • Overall, the uniformly green heatmap (values mostly above 0.88) confirms that all three providers produce semantically equivalent translations across all 15 languages tested.

Speed Throughput

For production use, translation speed matters - especially for real-time applications or high-volume batch processing.

Latency all data
Latency all data
Throughput vs text length
Throughput vs text length

Left chart (Latency - All Data): Box plots showing response time per provider. DeepL consistently responds under 2 seconds (median ~635ms, max ~5.6s). Google averages ~1,299ms with moderate variance (range 131 - 4,108ms). Llama averages ~6,088ms with an extremely long tail - while the median is ~2,072ms, the worst case reached 64 seconds for a long Korean segment.

Right chart (Throughput vs Text Length): Each dot represents one translated segment, plotting source text length (x-axis) against characters processed per second (y-axis). DeepL (green, top) maintains the highest throughput (~1341 chars/sec avg) regardless of text size, with some large segments exceeding 5,000 chars/sec. Google (blue, middle) averages ~477 chars/sec. Llama (purple, bottom) processes at ~201 chars/sec - the cost of running each segment through a full LLM inference pass.

Speed takeaway: If the use case requires real-time translation (e.g., live subtitles, chat), DeepL is the clear choice at ~1341 chars/sec. For batch processing where quality matters more than speed, Llama's ~201 chars/sec is acceptable.

Per-Language Breakdown

This table shows COMET-QE quality scores and average latency for each of the 15 languages tested, broken down by provider. Each cell shows the quality score (top) and the average latency plus segment count (bottom). Only successful translations are included - blank cells mean the provider failed all segments for that language.

LanguageGoogle COMET-QE / LatencyDeepL COMET-QE / LatencyLlama COMET-QE / Latency
es (Spanish)-0.4340 882ms - 15 seg-0.4439 643ms - 18 seg-0.4017 7079ms - 18 seg
pt (Portuguese)-0.5881 917ms - 11 segNo data-0.5871 8730ms - 14 seg
ru (Russian)-0.3498 1118ms - 10 seg-0.4469 598ms - 13 seg-0.3611 8950ms - 13 seg
ja (Japanese)-0.6601 1055ms - 17 seg-0.6506 962ms - 18 seg-0.5692 9092ms - 18 seg
id (Indonesian)-0.5898 1249ms - 24 seg-0.6000 795ms - 25 seg-0.5630 4692ms - 25 seg
hi (Hindi)-0.3107 1425ms - 19 seg-0.3489 743ms - 22 seg-0.2503 4442ms - 22 seg
vi (Vietnamese)-0.4714 1124ms - 17 seg-0.5198 585ms - 20 seg-0.4898 4766ms - 20 seg
ko (Korean)-0.5908 1583ms - 24 seg-0.6157 1183ms - 25 seg-0.5541 4316ms - 24 seg
tr (Turkish)-0.5138 1442ms - 15 seg-0.4866 1096ms - 19 seg-0.4769 5144ms - 19 seg
fr (French)-0.7363 1316ms - 6 seg-0.6186 1100ms - 8 seg-0.5516 13233ms - 8 seg
de (German)-0.1751 955ms - 4 seg-0.2135 665ms - 6 seg-0.2205 6068ms - 6 seg
it (Italian)-0.6072 1708ms - 12 seg-0.6149 950ms - 14 seg-0.5672 8274ms - 14 seg
ar (Arabic)-0.4274 1305ms - 47 seg-0.4779 606ms - 51 seg-0.4525 3282ms - 51 seg
nl (Dutch)-0.2919 1563ms - 20 seg-0.3477 565ms - 22 seg-0.4110 3960ms - 22 seg
zh (Chinese)No data-0.3471 1834ms - 18 seg-0.1618 12430ms - 18 seg

Recommendations

For highest translation quality

Use Llama Maverick 4 - It won 10 out of 15 languages on COMET-QE, producing the highest overall quality score (-0.4481). It handles noisy ASR input well and had the lowest error rate (0.3% - a single timeout). The trade-off is speed: at ~6,088ms average it's ~7x slower than DeepL.

For real-time or high-volume translation

Use DeepL - At ~851ms average and ~1341 chars/sec, it's the fastest by a significant margin. Translation quality (-0.4863) is competitive with Google Translate and only slightly behind Llama. The official API is reliable and well-documented. Note: the free tier is limited to 500K chars/month.

For cost-sensitive batch processing

Use Google Translate (with the official API) - While this benchmark used the unofficial googletrans library (which caused all 52 Google errors), the official Google Cloud Translation API is production-grade. Quality (-0.4813) is competitive with DeepL, and Google won 5 out of 15 languages (Arabic, German, Dutch, Russian, Vietnamese). At scale, Google's pricing may be the most competitive.

Important caveats

Language coverage and speed warning: The 15 languages tested here (Spanish, Portuguese, Russian, Japanese, Indonesian, Hindi, Vietnamese, Korean, Turkish, French, German, Italian, Arabic, Dutch, Chinese) are among the most widely spoken languages in the world, with large amounts of training data available for all models. The broader corpus contains 180 languages , including many low-resource languages with limited training data (e.g., Kinyarwanda, Lao, Khmer, Amharic, Kyrgyz). While Llama Maverick 4 performed best on these common languages (10/15 wins), it should not be assumed it will maintain this advantage on rarer languages where Google Translate and DeepL - which have been specifically optimized for broad multilingual coverage - may hold an edge. Additionally, Llama's average latency of ~6,088ms per segment (with peaks reaching 64 seconds for longer texts) makes it impractical for real-time or high-throughput production use cases - it is roughly 7x slower than DeepL and ~5x slower than Google Translate. A follow-up benchmark on low-resource languages is recommended before making a final decision for the full language set.

  • This benchmark used ASR transcripts - results may differ for clean, edited text
  • Google Translate results are affected by the unreliable googletrans library; the official API would likely perform better
  • COMET-QE is a model-based metric, not a human evaluation - scores are relative indicators, not absolute measures of quality
  • Llama's higher length ratio (1.64x) means it produces more verbose translations, which may or may not be desirable depending on use case
  • All three services were free at this volume; cost becomes a factor at scale

Translation Model Benchmark for Multilingual Video Transcripts - Generated from 293 segments across 15 languages (~500K characters)

Models: Google Translate - DeepL - Llama Maverick 4 (via Together AI) - Quality metric: COMET-QE (wmt20-comet-qe-da)