Reproducibility Report of ModernBERT Models for Retrieval Tasks Using DPR
Cross-posted here: https://api.wandb.ai/links/joe32140/zqs87nz3
Before We Start
Researchers from LightOn.AI and Answer.AI recently released the ModernBERT models (https://huggingface.co/papers/2412.13663), BERT-style models for 2024, and I was interested in their performance on retrieval tasks as described in their paper, specifically with DPR. However, they have not released the model checkpoints for all experiments, so I decided to fine-tune ModernBERT on the MS MARCO dataset myself based on the provided training scripts.
Experiments
I ran experiments with the official training scripts, modifying the `mini_batch_size` of `CachedMultipleNegativesRankingLoss` to accelerate training. Following the hyperparameter suggestions, I chose a learning rate of 8e-5 for the base model and 1e-4 for the large model. The `per_device_batch_size` was set to 512, which differs from the batch size of 16 mentioned in the paper. On a single RTX 4090 24GB GPU, one epoch took 1 hour for the base model and 2 hours for the large model. More training logs are shown in the panels below.
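For reference, here is a minimal sketch of this setup using Sentence Transformers, which the official scripts build on. The local triplet file and argument values other than the ones discussed above are placeholders, not the authors' exact configuration; see the ModernBERT repository for the actual script.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Base model to fine-tune for DPR-style retrieval (use ModernBERT-large with lr=1e-4).
model = SentenceTransformer("answerdotai/ModernBERT-base")

# Placeholder path: (query, positive, negative) triplets from MS MARCO.
# The official script defines the actual data; I trained on ~1.25M instances.
train_dataset = load_dataset("json", data_files="msmarco_triplets.jsonl", split="train")

# Cached MNRL keeps the large effective batch (512) while controlling GPU memory
# through mini_batch_size, which is the knob I changed to speed up training.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-base-msmarco",
    num_train_epochs=1,
    per_device_train_batch_size=512,  # paper reports 16; the official script uses 512
    learning_rate=8e-5,               # 8e-5 for base, 1e-4 for large
    warmup_ratio=0.05,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```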
In the end, I fine-tuned ModernBERT-base and ModernBERT-large on a subset of 1.25M training instances from the MS MARCO dataset, following the paper's experimental setup. The fine-tuned checkpoints are on the Hugging Face Hub:
https://huggingface.co/collections/joe32140/modernbert-for-retrieval-6764ff19edb01fb6c69538f0
Reproduced Results
As shown in the table below, my fine-tuned models outperform the original models reported in the paper on NDCG@10. On the ArguAna dataset, they improve by more than 10 points for the base model and more than 9 points for the large model. I hypothesize that these gains come from the much larger batch size of 512 in the official training script, compared to the reported batch size of 16 in the paper, but this still needs verification by the authors.
For the out-of-domain (OOD) evaluation on the MLDR dataset, my models also show a significant improvement over the original numbers.
It is worth noting that fine-tuning on the entire MS MARCO dataset with the same suggested learning rate noticeably degrades performance, which could be due to overfitting.
Since my fine-tuned versions consistently outperform the results reported in the paper, I also tested the `gte-en-mlm-base` model with the same training setup. It still shows substantial improvements, suggesting that the difference in batch size may be a key factor behind these gains.
| Model | NFCorpus | SciFact | TREC-Covid | FiQA | ArguAna | SciDocs | FEVER | HotpotQA | Climate-FEVER | MLDR (OOD) |
|---|---|---|---|---|---|---|---|---|---|---|
| gte-en-mlm-base | 26.3 | 54.1 | 49.7 | 30.1 | 35.7 | 14.1 | 65.0 | 49.9 | 22.9 | 34.3 |
| ModernBERT-base | 23.7 | 57.0 | 72.1 | 28.8 | 35.7 | 12.5 | 59.9 | 46.1 | 23.6 | 27.4 |
| ModernBERT-large | 26.2 | 60.4 | 74.1 | 33.1 | 38.2 | 13.8 | 62.7 | 49.2 | 20.5 | 34.3 |
| gte-en-mlm-base (ours) | 29.7 | 60.2 | 57.2 | 31.9 | 48.7 | 15.2 | 67.7 | 50.8 | 24.9 | 35.0 |
| ModernBERT-base (ours) | 26.6 | 61.6 | 71.4 | 30.7 | 46.3 | 13.6 | 65.7 | 47.8 | 22.6 | 30.5 |
| ModernBERT-large (ours) | 28.4 | 63.6 | 77.4 | 34.3 | 47.7 | 15.7 | 68.2 | 51.8 | 22.9 | 38.9 |
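The BEIR-subset numbers above are NDCG@10. As a reference, below is a minimal sketch of how this kind of evaluation can be run with the `mteb` package; the model id is illustrative, so check the collection linked above for the exact checkpoint names.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint name; see the Hugging Face collection for the released models.
model = SentenceTransformer("joe32140/ModernBERT-base-msmarco")

# A few of the BEIR retrieval tasks from the table; NDCG@10 is the reported metric.
tasks = mteb.get_tasks(tasks=["NFCorpus", "SciFact", "ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/modernbert-base-msmarco")
```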
Bottom Line
Overall, my attempt to reproduce the experimental results of the newly proposed ModernBERT on retrieval tasks was successful, with my fine-tuned models outperforming the numbers reported in the original paper by a large margin. This gap may come from the batch size discrepancy between the provided training script and the paper, i.e., 512 in the script vs. 16 in the paper. Even though the numbers for ModernBERT and gte-en-mlm are close or mixed on retrieval tasks with DPR, training and inference are much faster with ModernBERT, so I would still recommend ModernBERT in most cases.
Please try the fine-tuned retrieval models yourself from my Hugging Face model hub!
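A minimal usage sketch with Sentence Transformers is below; the model id is again illustrative, and the query and passages are just example strings.

```python
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint name; see the Hugging Face collection for the released models.
model = SentenceTransformer("joe32140/ModernBERT-base-msmarco")

query = "What is dense passage retrieval?"
passages = [
    "Dense Passage Retrieval (DPR) encodes queries and passages into dense vectors and ranks passages by similarity.",
    "BM25 is a sparse lexical retrieval method based on term matching.",
]

# Encode and rank passages by similarity to the query.
query_emb = model.encode(query)
passage_embs = model.encode(passages)
scores = model.similarity(query_emb, passage_embs)
print(scores)
```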