Project Description:
I am looking to hire an experienced NLP/ML engineer to train high-quality machine translation
models for Indic languages. The goal is to develop single language-pair models, such as:
● English → Telugu
● English → Hindi
(and additional language pairs, if needed)
You may choose the most suitable model architecture based on your expertise (e.g., mBART,
mT5, NLLB fine-tuning, Transformer variants, etc.), as long as the final models deliver strong translation quality.
Dataset:
You may use openly available Indic parallel corpora, including:
● Samanantar (AI4Bharat)
● BPCC (AI4Bharat)
● Other open Indic parallel corpora
Scope of Work:
The freelancer will be responsible for:
1. Data Handling
● Cleaning, filtering, and preprocessing datasets
● Sentence alignment (if needed)
● Tokenization and vocabulary preparation (SentencePiece/BPE/etc.)
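To illustrate the kind of cleaning and filtering meant here, below is a minimal sketch using only the Python standard library. It is not a required implementation. The function names, thresholds (`max_len_ratio`, `min_script_ratio`, `max_tokens`), and the two-language script table are assumptions chosen for illustration, and a real pipeline would likely add language identification and alignment scoring on top.

```python
import unicodedata

# Unicode blocks for the target scripts (illustrative; extend per language pair).
SCRIPT_RANGES = {
    "te": (0x0C00, 0x0C7F),  # Telugu
    "hi": (0x0900, 0x097F),  # Devanagari
}


def script_ratio(text, lang):
    """Fraction of base letters in `text` that belong to the expected script.

    Only characters for which str.isalpha() is true are counted, so digits,
    punctuation, and combining vowel signs do not affect the ratio.
    """
    lo, hi = SCRIPT_RANGES[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in letters) / len(letters)


def filter_pairs(pairs, tgt_lang, max_len_ratio=3.0,
                 min_script_ratio=0.5, max_tokens=200):
    """Keep (source, target) pairs that pass simple heuristic checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalize Unicode so visually identical strings compare equal.
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        if not src or not tgt:
            continue  # drop pairs with an empty side
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # drop overly long sentences
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue  # drop pairs with implausible length ratios
        if script_ratio(tgt, tgt_lang) < min_script_ratio:
            continue  # drop targets that are mostly in the wrong script
        key = (src.lower(), tgt)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

For example, `filter_pairs(pairs, tgt_lang="te")` would drop untranslated English copies on the Telugu side, empty lines, and repeated pairs before tokenization.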
2. Model Training
● Selecting an appropriate model architecture
● Training single language-pair translation models
● Implementing best practices for training efficiency (FP16, gradient accumulation, etc.)
● Hyperparameter tuning
● Checkpoint management and monitoring
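As a small illustration of the gradient accumulation point above, the sketch below works out how many micro-batches to accumulate to reach a desired batch size per optimizer update on limited GPU memory. The function name and the example numbers are hypothetical. The appropriate target batch size depends on the chosen architecture and is left to the engineer.

```python
import math


def plan_accumulation(target_tokens_per_update, tokens_per_micro_batch,
                      num_devices=1):
    """Return (accumulation_steps, effective_tokens_per_update).

    Effective batch = micro-batch tokens x devices x accumulation steps.
    """
    if min(target_tokens_per_update, tokens_per_micro_batch, num_devices) <= 0:
        raise ValueError("all arguments must be positive")
    tokens_per_step = tokens_per_micro_batch * num_devices
    # Round up so the effective batch is never smaller than the target.
    steps = max(1, math.ceil(target_tokens_per_update / tokens_per_step))
    return steps, steps * tokens_per_step


# Example: two GPUs that each fit 4096 tokens, aiming for 32768 tokens per update.
steps, effective = plan_accumulation(32768, 4096, num_devices=2)
print(steps, effective)
```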
3. Evaluation
● Compute BLEU with SacreBLEU, plus other relevant metrics
● Provide side-by-side qualitative translation samples
● Benchmark against baseline models
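For the side-by-side samples, one possible deliverable format is a tab-separated file with the source, reference, baseline output, and trained model output on each row. The sketch below shows one way to produce it with the standard library. The function name and column names are assumptions, and any equivalent format is acceptable.

```python
import csv
import io


def side_by_side_tsv(sources, references, baseline_outputs, model_outputs):
    """Return a TSV string with one row per sentence for manual review."""
    columns = [sources, references, baseline_outputs, model_outputs]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all inputs must have the same number of sentences")
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "source", "reference", "baseline", "model"])
    for i, row in enumerate(zip(*columns), start=1):
        writer.writerow([i, *row])
    return buf.getvalue()
```

The resulting file opens directly in a spreadsheet, which makes it easy to compare the baseline and the trained model sentence by sentence.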
4. Delivery
● Final trained model weights
● Inference scripts (Python) for quick testing
● Instructions for running and continuing training
● Documentation of preprocessing and training pipeline
● Optional: Dockerfile or virtual environment setup
Requirements:
The ideal candidate should have:
● Strong experience in NLP, Transformers, and neural MT models
● Prior work with Indic languages (big plus)
● Experience with training libraries such as PyTorch, Hugging Face Transformers, Fairseq, OpenNMT, or similar
● Ability to handle large-scale training and dataset preprocessing
● Familiarity with SentencePiece, tokenization strategies, and MT evaluation metrics
● Ability to deliver clean, well-documented code
Additional Notes:
● Compute resources can be discussed (I can provide compute, or you can use yours).
● More language pairs may be added later as separate follow-up projects.
● Quality of translation is the highest priority.