ML/NLP / 2026-03

IAB 3.0 Classification and Language Detection Pipeline Proposal

This proposal lays out how to replace an external classification API with in-house language detection and IAB 3.0 classification pipelines while preserving the existing data lakehouse architecture. The goal is to scale classification coverage across a 1.75 billion item media corpus.

  • Core items: 1.75B
  • Publishers: 13.7M
  • Transcripts: 83.7M
  • IAB 3.0 gap: 65.6% (items without IAB 3.0 coverage)

Problem

Close language-detection and IAB 3.0 classification coverage gaps across a large media corpus without depending on external API calls.

Approach

Proposed two staged pipelines: a language detection cascade and a multilingual IAB 3.0 classifier trained on historical labels.

Result

The plan maps 1.6B additional language detections and 1.14B additional IAB classifications into the existing lakehouse pattern.

Technologies
  • AWS Glue
  • Iceberg
  • S3
  • Celery
  • GCLD3
  • FastText
  • XLM-RoBERTa
  • IAB 3.0

Current architecture

The data lakehouse follows a staging, core, and curated pattern. Staging tables hold raw ingested data and may contain duplicates. Core tables are deduplicated through AWS Glue and contain one row per entity. Curated tables provide enriched and filtered downstream views.

  • The core content table contains 1.75B rows and 57 columns.
  • The core publisher table contains 13.7M publisher entities.
  • The transcript table contains 83.7M transcript rows.
  • The curated content view contains roughly 960M rows.

Coverage gap

Field | Coverage | Source
Legacy IAB categories | 1.739B items / 99.6% | Existing NLP pipeline
IAB 3.0 categories | 600.951M items / 34.4% | External classification API
Primary IAB 3.0 category | 600.944M items / 34.4% | External classification API
NLP topics | 604.563M items / 34.6% | External classification API
Detected text language | 141M items / 8.1% | Current detection

Proposed pipelines

  • Language Detection Scheduler: a Celery-based cascade to detect text language for 1.6B additional content items.
  • In-House IAB 3.0 Classification: train a multilingual model on 601M historical labels and serve local inference.
  • Topic Extraction Model: separate topic prediction for the most frequent topics first.
  • Publisher Rollup at Scale: run existing rollup logic across all 13.7M publisher entities.

Expected impact

  • Language detection coverage moves from roughly 8% toward 95%.
  • IAB 3.0 coverage moves from 34.4% toward full corpus coverage.
  • About 1.14B additional content items gain IAB 3.0 classification.
  • External classification API cost and dependency are removed for future classification.

Detailed methodology and results

Data Engineering Team

Data Lakehouse Architecture

Medallion Architecture: Staging → Core → Curated

  • Staging: raw ingested data. May contain duplicates. One table per pipeline source.
  • Core: deduplicated via AWS Glue. One row per entity. Merges fields from all staging sources.
  • Curated: enriched, filtered views for downstream consumers. Subset of core with renamed fields.

Data flow:

  • Celery workers: fetch from APIs
  • S3 Parquet: .append() writes
  • Accumulator: append to Iceberg
  • Glue merge: dedupe into core
  • Curated views: filtered + renamed
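The staging-to-core step can be sketched in plain Python. This is only an illustration of the "one row per entity, merge fields from all staging sources" idea, not the Glue job itself: the `entity_id` and `ingested_at` column names and the "newest non-null value wins" rule are assumptions that the source does not specify.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def merge_to_core(staging_rows: Iterable[Row]) -> List[Row]:
    """Collapse staging rows into one core row per entity.

    Assumed rule (hypothetical): rows are applied oldest-first, and a newer
    non-null value overwrites an older one; nulls never erase existing data.
    """
    core: Dict[Any, Row] = {}
    for row in sorted(staging_rows, key=lambda r: r["ingested_at"]):
        merged = core.setdefault(row["entity_id"], {"entity_id": row["entity_id"]})
        for field, value in row.items():
            if value is not None:
                merged[field] = value
    return list(core.values())
```

Under this rule, duplicates in staging are harmless: replaying the same row leaves the core row unchanged.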

Current Content Classification Coverage

Two generations of IAB categories exist in the content corpus

Corpus at a glance: 1.75B content items, 13.7M publishers, 83.7M transcripts, 101 detected languages.

Two IAB classification systems on the content corpus

Field | Taxonomy | Content items covered | Coverage | Source
iab_categories | Legacy IAB | 1,738,628,314 | 99.6% | Existing NLP pipeline
iab_categories_3 | IAB 3.0 | 600,950,789 | 34.4% | External classification API
primary_category_3 | IAB 3.0 | 600,944,111 | 34.4% | External classification API
nlp_topics | Topics | 604,562,957 | 34.6% | External classification API

[Figure: Legacy IAB example vs. IAB 3.0 example (external classification API)]

The Two Gaps to Close

Gap 1: Language Detection

Field: nlp_text_lang_code on content_items. Only about 8% of the 1.75B content items have a detected text language.

Gap 2: IAB 3.0 Classification

Field: iab_categories_3 on content_items. The external classification API has classified 34.4% of the corpus. The remaining 1.14B items need coverage.

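These gaps are linked: IAB classification needs to know the language the text is actually written in, so closing the language gap unblocks classification.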

Why Does the Language Gap Exist?

Three compounding factors

1. Language Detection is Coupled to Content Suitability

2. No Fallback for Text Detection

3. Short Text Problem

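Current detection flow:

  • Content item text (title + description) is passed to GCLD3.
  • If GCLD3 reaches 90% confidence, nlp_text_lang_code is set.
  • Otherwise the field stays NULL forever: there is no fallback (no FastText, no langid).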

GCLD3 Failures: Fixable or Not?

Text length comparison for the external classification API subset (948K content items)

Comparing text length (title + description) where GCLD3 succeeded (Detected) vs. failed (Undetected):

  • 61% of undetected content items have 90+ chars
  • 39% have fewer than 90 chars
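A split like the one above can be computed with a simple length bucket. This is a minimal sketch: the `title` and `description` field names and the single-space join are assumptions, and the 90-character cut-off is taken from the comparison.

```python
from typing import Dict, Iterable, Mapping, Optional


def length_split(
    items: Iterable[Mapping[str, Optional[str]]], cutoff: int = 90
) -> Dict[str, float]:
    """Return the share of items at or above / below `cutoff` characters."""
    long_count = short_count = 0
    for item in items:
        # Treat missing fields as empty strings before measuring length.
        text = f"{item.get('title') or ''} {item.get('description') or ''}".strip()
        if len(text) >= cutoff:
            long_count += 1
        else:
            short_count += 1
    total = long_count + short_count
    if total == 0:
        return {"long": 0.0, "short": 0.0}
    return {"long": long_count / total, "short": short_count / total}
```

Running this over the undetected rows only would give the two shares reported above.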

The Language Mismatch Problem

Creator-declared language often differs from actual text language

Two language fields exist on content_items:

Field | Source | Coverage
primary_lang_code | content item platform API / GCLD3 / FastText | 94.8% (1.65B)
nlp_text_lang_code | GCLD3 only (no fallback) | 8.1% (141M)

  • primary_lang_code answers: "What language is this content item meant to be in?"
  • nlp_text_lang_code answers: "What language is the actual text written in?"

For IAB classification, the system needs to know what language the text is actually written in.

When they disagree (labeled subset: 119K content items)

Actual Text | Declared | Count
English | Hindi | 36,709
English | Telugu | 9,731
English | Urdu | 6,282
English | Tamil | 6,043
English | Bengali | 5,155
Spanish | English | 3,183
Arabic | English | 2,793

Pattern: Indian creators write titles/descriptions in English but set their publisher language to Hindi, Telugu, etc.
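A disagreement table like the one above can be produced by counting (actual, declared) pairs. The sketch below assumes each row is a mapping carrying the two fields named in this section and that both hold comparable language codes such as "en" or "hi". The code format is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Optional, Tuple


def language_mismatches(
    items: Iterable[Mapping[str, Optional[str]]], top_n: int = 10
) -> List[Tuple[Tuple[str, str], int]]:
    """Count (actual text language, declared language) pairs that disagree."""
    pairs: Counter = Counter()
    for item in items:
        actual = item.get("nlp_text_lang_code")
        declared = item.get("primary_lang_code")
        # Rows missing either field cannot disagree, so they are skipped.
        if actual and declared and actual != declared:
            pairs[(actual, declared)] += 1
    return pairs.most_common(top_n)
```

Because rows with a NULL nlp_text_lang_code are skipped, this only measures disagreement inside the subset that already has a detected text language.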

How the External Classification API Works Today

The pipeline that produced 601M IAB 3.0 classifications

Content item (title + description + tags) → external classification API (external NLP service) → IAB 3.0 categories + topics + confidence scores

What the external classification API produces per content item

  • Primary IAB 3.0 category: e.g. "Food Drink Cooking"
  • All IAB 3.0 categories: multi-label, hierarchical
  • Topics: Wikipedia-linked concepts with scores
  • Confidence scores: per category and topic

DLH pipeline flow

  • content_classification_scheduler: find unprocessed content items
  • content_classification_worker: call the external classification API
  • Accumulator: merge into content_items
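The scheduler, worker, and accumulator roles above can be illustrated with three plain functions. This is a sketch of the pattern only: in production each step would be a Celery task, the `item_id` field is a hypothetical key, the `classify` callable stands in for whichever classifier is plugged in, and tags are left out for brevity.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def schedule(items: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Scheduler: yield batches of items that have no IAB 3.0 labels yet."""
    batch: List[Row] = []
    for item in items:
        if item.get("iab_categories_3") is None:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def work(batch: List[Row], classify: Callable[[str], List[str]]) -> List[Row]:
    """Worker: classify each item's text and emit staging rows."""
    out: List[Row] = []
    for item in batch:
        text = " ".join(filter(None, [item.get("title"), item.get("description")]))
        out.append({"item_id": item["item_id"], "iab_categories_3": classify(text)})
    return out


def accumulate(staging: List[Row], rows: List[Row]) -> None:
    """Accumulator: append worker output to staging (a list stands in for Iceberg)."""
    staging.extend(rows)
```

Keeping the classifier behind a callable is what would let the external API call be swapped for local inference without touching the scheduler or accumulator.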

Current staging tables

Table | Rows
staging_classifier_labels | 947,774
staging_classifier_labels_confidence | 947,774
staging_publisher_classifier | 1

948K in staging

Training Data for an In-House Model

External classification has produced a large labeled dataset

Training data available: 601M labeled content items, 948K rows in staging, 101 languages.

Top IAB 3.0 categories (by content item count)

  • Entertainment Music
  • Sports
  • Religion & Spirituality
  • Video Gaming
  • Technology & Computing
  • Food & Drink
  • Automotive

Language Distribution (content corpus)

101 languages detected across 141M content items with nlp_text_lang_code

Languages shown, in chart order: English, Hindi, Spanish, Arabic, Portuguese, Bengali, Chinese, Russian, Japanese, Vietnamese, Korean, Thai, French, Turkish, Telugu, Tamil, German, Marathi, Italian, and Other (81 languages).

Only 35% is English

The Solution: Two New Pipelines

Built on the same DLH patterns already in production

Pipeline 1

Language Detection Scheduler

Detect text language for 1.6B content items with NULL nlp_text_lang_code. Multi-tier cascade: GCLD3 → FastText → langid.

Pipeline 2

In-House IAB 3.0 Classification Model

Train a multilingual transformer on 601M content items labeled by the external classification API. Replace the external API with local inference.

Reuse

Publisher Rollup (Existing Logic)

Weighted aggregation of content item-level scores to publisher level. Already built. Just swap the input source.
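A weighted rollup of this kind can be sketched as a weight-normalized mean of item-level scores. The existing rollup code is not shown in this proposal, so the input shape and the meaning of `weight` (for example, a popularity signal) are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def rollup_publisher(items: Iterable[Mapping]) -> Dict[str, float]:
    """Aggregate item-level category scores into publisher-level scores.

    Each item is {"weight": float, "scores": {category: confidence}}.
    The result is the weight-normalized mean score per category; a category
    missing from an item contributes 0 for that item.
    """
    weighted_sum: Dict[str, float] = defaultdict(float)
    total_weight = 0.0
    for item in items:
        weight = float(item.get("weight", 1.0))
        total_weight += weight
        for category, score in item["scores"].items():
            weighted_sum[category] += weight * score
    if total_weight == 0:
        return {}
    return {c: s / total_weight for c, s in weighted_sum.items()}
```

Because the function only consumes item-level scores, "swapping the input source" amounts to feeding it rows from the in-house model instead of the external API.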

Dependency:

Pipeline 1: Language Detection Scheduler

Same Celery patterns, new multi-tier detection cascade

3-Tier Detection Cascade

  • TIER 1: GCLD3. Same detector, lowered threshold (50%).
  • TIER 2: FastText. lid.176.bin works on 10+ chars.
  • TIER 3: langid. Lightweight final fallback.
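The cascade logic can be sketched independently of the detector libraries. In the sketch below each tier is a plain callable returning a language code and a confidence. Real tiers would be thin wrappers around GCLD3, FastText, and langid, and those wrappers are not shown. The `Tier` and `Detection` names are hypothetical.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple

# A detector takes text and returns (language code or None, confidence in [0, 1]).
Detector = Callable[[str], Tuple[Optional[str], float]]


class Tier(NamedTuple):
    name: str
    detect: Detector
    min_confidence: float
    min_chars: int = 0


class Detection(NamedTuple):
    lang_code: str
    confidence: float
    method: str


def detect_language(text: str, tiers: Sequence[Tier]) -> Optional[Detection]:
    """Try each tier in order; return the first result that clears its threshold."""
    text = text.strip()
    for tier in tiers:
        if len(text) < tier.min_chars:
            continue  # text too short for this tier, fall through to the next
        lang, confidence = tier.detect(text)
        if lang is not None and confidence >= tier.min_confidence:
            return Detection(lang, confidence, tier.name)
    return None  # every tier declined; the field would stay NULL
```

A configuration matching the tiers above would use a 0.50 threshold for GCLD3 and `min_chars=10` for FastText. The thresholds for FastText and langid are not given in this proposal and would need to be chosen during evaluation.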

New staging table schema

Each detection is tagged with its method and confidence
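The schema itself is not reproduced here, so the following is only a guess at what a row tagged with method and confidence might look like. All field names except nlp_text_lang_code are hypothetical.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LanguageDetectionRow:
    """One row of the proposed staging table (field names are hypothetical)."""

    item_id: str
    nlp_text_lang_code: str
    detection_method: str  # e.g. "gcld3", "fasttext", "langid"
    confidence: float      # detector-reported score in [0, 1]
    detected_at: str       # ISO 8601 UTC timestamp

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")


def make_row(item_id: str, lang: str, method: str, confidence: float) -> dict:
    """Build a plain dict that could be written as a Parquet record."""
    now = datetime.now(timezone.utc).isoformat()
    return asdict(LanguageDetectionRow(item_id, lang, method, confidence, now))
```

Storing the method next to the code makes it possible to filter or re-run detections from a single tier later, for example if one tier's threshold is revised.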

Pipeline 2: In-House IAB 3.0 Classification

Train on external classification API labels, serve via local inference

Model approach

  • Multilingual transformer (e.g. XLM-RoBERTa) as base model
  • Multi-label classification for IAB 3.0 categories
  • 601M labeled samples from the external classification API for training
  • 101 languages in training data
  • Confidence scores per label (same as the external classification API output)
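Multi-label classification with per-label confidence scores is commonly set up as multi-hot targets with an independent sigmoid per label. The sketch below shows only that encoding and decoding step in plain Python. The transformer that would produce the logits is not shown, and the 0.5 decision threshold is an assumption.

```python
import math
from typing import Dict, List, Sequence


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign a stable column index to every category seen in the training labels."""
    labels = sorted({label for label_set in label_sets for label in label_set})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(labels: Sequence[str], index: Dict[str, int]) -> List[float]:
    """Encode one item's categories as a multi-hot target vector."""
    target = [0.0] * len(index)
    for label in labels:
        target[index[label]] = 1.0
    return target


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode(
    logits: Sequence[float], index: Dict[str, int], threshold: float = 0.5
) -> Dict[str, float]:
    """Score each label independently and keep those at or above the threshold."""
    names = sorted(index, key=index.get)
    scores = {name: _sigmoid(z) for name, z in zip(names, logits)}
    return {name: s for name, s in scores.items() if s >= threshold}
```

Independent sigmoids, unlike a softmax, allow several categories to be confident at once, which fits the multi-label, hierarchical output described for the external API.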

Pipeline integration (same DLH pattern)

What changes vs. the current external classification API pipeline

Aspect | External classification API (current) | In-House (new)
Inference | External API call | Local model
Cost | Per-call API fee | $0 per call
Latency | Network round-trip | Local GPU
Languages | 19 supported | 101 detected
Control | External dependency | Full ownership

Key point:

End-to-End: Current vs. New Pipeline

Current flow (external classification API)

content_items (title + desc + tags) → external classification API (external paid service) → staging_classifier_labels → Glue merge

New flow (In-House two pipelines)

  • content_items (1.6B with NULL lang) → Pipeline 1 (GCLD3 → FastText → langid) → staging_language_detection → Glue merge (nlp_text_lang_code)
  • content_items (1.14B with no IAB 3.0) → Pipeline 2 (local model inference) → staging_iab_labels → Glue merge (iab_categories_3)
  • content_items (content item IAB 3.0 data) → Publisher Rollup (existing weighted logic) → staging_publisher_iab → Glue merge (publishers)

Every step uses the same DLH patterns

[Figure: Before / After comparison. Panels: Current content corpus; After Both Pipelines]

Implementation Roadmap

Phase | Deliverable | Impact | DLH Components
Phase 1 | Language Detection Scheduler | Detect text language for 1.6B content items. 3-tier cascade. Unblocks classification. | New IcebergTable, scheduler task, worker task, accumulator, Glue job
Phase 2 | IAB 3.0 Classification Model | Train multilingual transformer on 601M external classification API labels. Multi-label IAB 3.0. | Model training (offline), new auditor class, replaces classifier worker task
Phase 3 | Topic Extraction Model | Separate model for topic prediction. Scope to top-N frequent topics initially. | New model, extends Phase 2 worker or separate worker task
Phase 4 | Publisher Rollup at Scale | Run existing rollup across all 13.7M publishers (currently only 1 done). | Existing publisher classifier task + auditor. No code change needed.

Each phase is independently deployable

Expected Gains

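  • 1.6B additional language detections
  • 1.14B additional IAB 3.0 classifications
  • 13.7M publishers in scope for rollup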

What to build

  • Language detection scheduler (Celery + 3-tier cascade)
  • In-house IAB 3.0 model (multilingual transformer)
  • Topic extraction model (Phase 3)
  • New staging tables + Glue merge jobs

What to retain

  • Entire DLH Medallion Architecture (Staging → Core → Curated)
  • Celery task patterns (scheduler → worker → accumulator)
  • S3 + Iceberg + Glue pipeline
  • Publisher rollup logic (existing code, no changes)
  • Redis locking, queue management, all infrastructure

Bottom line: