If you've worked with neural machine translation for a while, you've likely crossed paths with fairseq's WMT19 model. It's robust, heavily optimized, and has set impressive benchmarks. But despite its accuracy, using it in production comes with trade-offs: its training loop and decoding structure aren't always friendly when scaling across newer infrastructure. Transformers, on the other hand, has become the go-to library for everything NLP.
Now, the tricky part: these two systems weren’t exactly built to speak the same language. Porting WMT19 isn’t just about loading weights. You’ll need to untangle model internals, align tokenization, and make sure your outputs match. Below is a practical guide that takes you through the process without the fluff. Just what you need to get it done.
How to Port the fairseq WMT19 Translation System to Transformers (the Right Way)
1. Understand the Core Differences
Before doing anything, you’ll need to compare what each system does under the hood.
Model Architecture:
fairseq’s WMT19 uses a transformer-big architecture but with some fairseq-specific tuning, like layer normalization placement, tied embeddings, dropout values, and relative positional bias in newer forks.
Tokenization:
WMT19 uses a byte pair encoding (BPE) model trained on joint source-target data. The vocabulary lives in fairseq's dict.*.txt files and the merge rules in the bpecodes file.
Checkpoint Format:
fairseq saves weights using PyTorch's native format, but the state dict is keyed differently from Hugging Face models. So, to migrate successfully, you'll need a mapping function between parameter names.
There’s no point in writing a single line of conversion code before you’ve studied both the model structure and the tokenizer behavior. This saves you from scrambling to fix mismatches later.
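Before writing any conversion code, it helps to dump the parameter names from both sides and eyeball the gap. Here's a minimal sketch, assuming your fairseq checkpoint is named model.pt and you're targeting a Bart-style class; the tiny config exists only to expose the naming scheme, not to match WMT19 dimensions.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

# fairseq side: the checkpoint bundles args, optimizer state, etc.;
# the actual weights live under the "model" key
fairseq_state = torch.load("model.pt", map_location="cpu")["model"]
print(sorted(fairseq_state)[:10])

# Transformers side: a tiny randomly initialized model is enough to see
# how the target class names its parameters
tiny = BartConfig(d_model=16, encoder_layers=2, decoder_layers=2,
                  encoder_attention_heads=2, decoder_attention_heads=2,
                  encoder_ffn_dim=32, decoder_ffn_dim=32, vocab_size=100)
print(sorted(BartForConditionalGeneration(tiny).state_dict())[:10])
```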
2. Convert the Tokenizer and Vocabulary
Step 1: Extract fairseq BPE settings
fairseq BPEs are usually found in bpecodes and vocabulary files like dict.en.txt. You can write a short script that reads these files and reconstructs a Hugging Face-compatible tokenizer using the tokenizers library.
```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# fairseq reserves ids 0-3 for <s>, <pad>, </s>, <unk>; dict.en.txt lists the
# remaining tokens as "token count", so each token's id is its line position
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
with open("dict.en.txt", encoding="utf-8") as f:
    for line in f:
        vocab[line.split()[0]] = len(vocab)

# bpecodes lines look like "left right [count]"; skip the header line and
# keep only the merge pair itself
with open("bpecodes", encoding="utf-8") as f:
    merges = [tuple(line.split()[:2]) for line in f
              if line.strip() and not line.startswith("#")]

# Build tokenizer
tokenizer = Tokenizer(models.BPE(vocab, merges, unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.BPEDecoder()
tokenizer.save("wmt19_tokenizer.json")
```
Step 2: Wrap in a Hugging Face tokenizer class
Once you've built the tokenizer, wrap it using PreTrainedTokenizerFast and register the special tokens explicitly so padding and EOS handling behave downstream. This lets it work seamlessly with Hugging Face models.

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="wmt19_tokenizer.json",
    bos_token="<s>", eos_token="</s>", unk_token="<unk>", pad_token="<pad>")
hf_tokenizer.save_pretrained("./tokenizer/")
```
Test a few sentence encodings before moving on. If your tokens don’t match, stop and fix that first. Output mismatch at the tokenizer level will cascade into garbage translations.
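A round-trip like the one below catches most problems; compare the printed pieces and ids against what fairseq's preprocessing produces for the same sentence. The sample sentence is arbitrary.

```python
sample = "Machine translation is hard ."

enc = hf_tokenizer(sample)
print(enc.tokens())                            # subword pieces after BPE
print(enc["input_ids"])                        # ids the model will see
print(hf_tokenizer.decode(enc["input_ids"]))   # compare with the original sentence
```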
3. Port the Weights from fairseq to Transformers
This is where things get slightly technical. fairseq and Hugging Face name their parameters differently, and some operations are structured differently as well.
Step 1: Load fairseq checkpoint
You’ll need to extract the state_dict from fairseq:
```python
import torch

# fairseq checkpoints bundle training args and optimizer state as well;
# the weights themselves live under the "model" key
checkpoint = torch.load("model.pt", map_location="cpu")
state_dict = checkpoint["model"]
```
Step 2: Map parameter names
Create a renaming script that maps each fairseq parameter to its Transformers counterpart. For example:
- encoder.layers.0.self_attn.q_proj.weight → model.encoder.layers.0.self_attn.q_proj.weight
- decoder.layers.0.fc1.weight → model.decoder.layers.0.fc1.weight
You can automate this mapping using regular expressions or a hardcoded dictionary if needed. It won’t be pretty, but it only needs to run once.
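Here's a minimal renaming sketch under the assumption that you're targeting a Bart-style class, where most fairseq names just gain a model. prefix. A real checkpoint will contain bookkeeping keys and a few genuinely different names that need explicit rules, so treat this as a starting point rather than a complete mapping.

```python
import re

def rename_key(name):
    # Bookkeeping keys with no Transformers counterpart
    if name.endswith(".version"):
        return None
    # Genuinely different names need explicit rules; this one is an example
    # and may not exist in your checkpoint (tied embeddings omit it)
    if name == "decoder.output_projection.weight":
        return "lm_head.weight"
    # Most encoder/decoder parameters just gain a "model." prefix
    return re.sub(r"^(encoder|decoder)\.", r"model.\1.", name)

new_state_dict = {}
for old_name, tensor in state_dict.items():
    new_name = rename_key(old_name)
    if new_name is not None:
        new_state_dict[new_name] = tensor
```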
Step 3: Create a compatible Transformers model
You can use MBartForConditionalGeneration if you're working with multilingual models, or stick to BartForConditionalGeneration for en-de translation. If needed, borrow the structure from Hugging Face's convert_bart_original_pytorch_checkpoint_to_pytorch.py conversion script.
Set the config manually to match WMT19 values:
```python
from transformers import BartConfig

config = BartConfig(
    vocab_size=len(hf_tokenizer),   # must match the dictionary you extracted
    d_model=1024,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=16,
    decoder_attention_heads=16,
    encoder_ffn_dim=4096,
    decoder_ffn_dim=4096,
    activation_function="relu",     # fairseq's transformer-big uses ReLU
    max_position_embeddings=1024,
    dropout=0.3,
    attention_dropout=0.1,
    init_std=0.02,
    scale_embedding=True,
)
```
Then, instantiate the model and load weights:
```python
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration(config)
# new_state_dict is the renamed dict from Step 2; a clean load with no
# missing or unexpected keys means the mapping is complete
model.load_state_dict(new_state_dict)
model.save_pretrained("./wmt19_transformers/")
```
Run a quick check: encode a sentence, generate a translation, and compare it with fairseq's output. It won't always be token-for-token identical, since beam search tie-breaking and small numerical differences can cause slight shifts, but it should be close.
4. Reproduce fairseq's Generation Behavior
Getting the model loaded isn’t enough—you’ll need to mimic fairseq's generation parameters if you want identical performance.
Beam Search Settings:
fairseq often uses:
- beam = 5 or 6
- length_penalty = 1.0 or 0.6
- no_repeat_ngram_size = 3
Set them during generation:

```python
output = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=200,
    num_beams=5,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)
```
Special Tokens and EOS Behavior:
Be sure that the special tokens (like </s>) are defined and consistent. Otherwise, the output may never terminate properly.
Also, keep early stopping turned off if you're comparing to fairseq's generation, which doesn't halt as soon as a single beam hits EOS.
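One way to wire that up, reusing the special tokens registered on hf_tokenizer earlier; the decoder_start_token_id line reflects fairseq's convention of seeding the decoder with </s>, so double-check it against your checkpoint.

```python
model.config.bos_token_id = hf_tokenizer.bos_token_id
model.config.eos_token_id = hf_tokenizer.eos_token_id
model.config.pad_token_id = hf_tokenizer.pad_token_id
# fairseq conventionally starts decoding from </s>; verify for your model
model.config.decoder_start_token_id = hf_tokenizer.eos_token_id
```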
5. Validate Translation Quality After Porting
Once you've ported everything, don’t assume it's working just because it runs without errors. You need to validate the translation quality, both quantitatively and qualitatively.
Run BLEU Score Comparisons:
Use the same dataset that fairseq used for evaluation (e.g., WMT newstest sets) and calculate BLEU scores for both the original and ported model. The score difference should be negligible (typically ±0.1–0.3). If you see a bigger drop, revisit the tokenization or decoding parameters.
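For instance, with sacrebleu, where hypotheses_ported, hypotheses_fairseq, and references are placeholder names for lists of detokenized output strings and reference translations:

```python
import sacrebleu

# Same reference set and test set for both systems
bleu_fairseq = sacrebleu.corpus_bleu(hypotheses_fairseq, [references])
bleu_ported = sacrebleu.corpus_bleu(hypotheses_ported, [references])
print(f"fairseq: {bleu_fairseq.score:.2f}  ported: {bleu_ported.score:.2f}")
```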
Manually Compare Output Samples:
Pick a few sample sentences and compare fairseq’s output against your Transformers-based model. Look for systematic differences—does one model consistently shorten outputs or mistranslate proper nouns? If so, the issue likely lies in token alignment or generation settings.
Check for Deterministic Behavior:
Set seeds and disable dropout during inference: calling model.eval() and torch.manual_seed(seed) covers both, and is critical for verifying consistency across runs.
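Putting it together for a reproducible spot check, using the tokenizer and model built above (the input sentence is arbitrary and assumes an en-de direction):

```python
import torch

torch.manual_seed(0)
model.eval()                        # switches off dropout
with torch.no_grad():               # no gradients needed at inference time
    inputs = hf_tokenizer("The weather is nice today.", return_tensors="pt")
    output = model.generate(input_ids=inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            num_beams=5, max_length=200)
print(hf_tokenizer.batch_decode(output, skip_special_tokens=True))
```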
Wrapping Up
Moving fairseq’s WMT19 system to Hugging Face Transformers isn't about blindly copying weights—it's a step-by-step reconstruction. You align vocabularies, match every parameter, and recreate decoding behavior one piece at a time. When done right, you'll end up with a fully working, production-friendly translation model that performs just as well, minus the rigidity of the original framework.
And once it’s there, the possibilities open up: batching becomes easier, you can push to Hugging Face Hub, integrate with transformers pipelines, or even distill the model for faster inference. It's a bit of work, but totally worth it.