Building FinLLM-M: Training Language Models for Regulated Finance
Large language models are impressive almost everywhere except where precision, traceability, and governance matter most. Financial services sits squarely in that exception. The work behind the FinLLM-M family reflects a simple but demanding premise: if you want AI that performs reliably in regulated environments, you have to build it differently from the start.
Over the last few months, FinLLM-M has evolved into a suite of domain-specialised models (3B, 8B, 14B, and 24B parameters) designed specifically for financial reasoning, long-context understanding, structured data tasks and safe deployment. This post summarises what changed, why it mattered, and what the results actually show.

FinLLM model suite performance on finance benchmarks (new models in bold)
Why general models fall short in finance
Most foundation models are trained on vast, general internet data. That breadth is useful for general language understanding, but it introduces predictable weaknesses in finance: inconsistent terminology, shallow numerical reasoning, fragile performance on tables, and a tendency to hallucinate when precision matters.
Financial workflows are also structurally different: advisers, analysts and compliance teams work with call transcripts, long documents, and tightly structured forms. Key information is often separated by thousands of tokens, so optimising for short, conversational prompts is not a suitable solution.
FinLLM-M was designed to close this gap by rethinking three things simultaneously: the data mix, the training pipeline, and the evaluation criteria.
A redesigned continued pre-training strategy
At the core of the new M-family is an updated continued pre-training (CPT) mixture built to support downstream financial tasks rather than generic language fluency alone. The corpus totals roughly 16 billion tokens, with a 40/60 split between finance-specific and general data.
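As a quick sanity check on those proportions, the sketch below turns the stated 16 billion token total and 40/60 split into per-subset token budgets. It assumes the split reads as 40% finance-specific and 60% general, following the order in the sentence above; the function name is illustrative and not part of the FinLLM tooling.

```python
def split_token_budget(total_tokens: int, finance_share: float) -> dict[str, int]:
    """Split a total token budget into finance and general subsets."""
    if not 0.0 <= finance_share <= 1.0:
        raise ValueError("finance_share must be between 0 and 1")
    finance_tokens = round(total_tokens * finance_share)
    # Give the remainder to the general subset so the parts always sum to the total.
    return {"finance": finance_tokens, "general": total_tokens - finance_tokens}


# Assumption: "roughly 16 billion" is treated as exactly 16e9 for illustration.
budget = split_token_budget(16_000_000_000, finance_share=0.40)
print(budget)
```

Under these assumptions the finance subset comes to about 6.4 billion tokens and the general subset to about 9.6 billion.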
Long-context financial data
A key change was the introduction of high-quality, long-form financial documents filtered from the FinePDFs corpus using our internal finance classifier. This matters because financial reasoning often depends on context spread across entire documents, not isolated paragraphs. Training on these sources helps preserve and strengthen long-context capabilities inherited from the external base models.
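The filtering step can be pictured roughly as follows. This is a minimal sketch, not the internal pipeline: the `Document` type, the `keyword_score` stand-in for the finance classifier, the threshold, and the minimum length are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    doc_id: str
    text: str


def filter_finance_documents(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],
    threshold: float,
    min_words: int,
) -> Iterator[Document]:
    """Yield documents that are long enough and score at or above the threshold."""
    for doc in docs:
        if len(doc.text.split()) < min_words:
            continue  # too short to count as long-form
        if score_fn(doc.text) >= threshold:
            yield doc


def keyword_score(text: str) -> float:
    """Toy stand-in for a finance classifier: scaled density of finance terms."""
    terms = {"dividend", "liability", "prospectus", "pension"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.strip(".,;:") in terms)
    return min(1.0, 20 * hits / len(words))


docs = [
    Document("a", "The pension fund reports a dividend and a liability in the prospectus"),
    Document("b", "The weather was mild and the garden looked lovely today"),
    Document("c", "pension dividend"),
]
# Tiny limits so the toy corpus exercises both filters.
kept = [d.doc_id for d in filter_finance_documents(docs, keyword_score, threshold=0.5, min_words=5)]
print(kept)
```

In a real pipeline the scoring function would be a trained classifier and the length filter would count tokens rather than whitespace-separated words.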
General, math and code improvements
To avoid over-specialisation, the general data subset draws from the OLMo3 mid-training mix, maintaining linguistic breadth and quality.
Math data was completely replaced with two more fit-for-purpose datasets: TinyMath-PoT and TinyMath-MIND. TinyMath-PoT teaches the model to solve complex problems by breaking them down into logical dependencies expressed in programming syntax, an approach that has been shown to enhance performance on downstream financial tasks. TinyMath-MIND restructures raw math data into structured, multi-turn dialogues. This conversational format significantly enhances performance on specialised knowledge tasks by teaching the model to provide deeper layers of explanation and to solve problems collaboratively. Additional high-quality synthetic datasets (Dolmino-Math, CraneMath, MegaMath) further improved numerical robustness.
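To make the program-of-thought idea concrete, here is a hand-written illustration of the general style. It is not a sample from TinyMath-PoT; the question and variable names are invented for this example. The question is: an investor deposits £10,000 at 5% annual interest, compounded yearly, for 3 years. How much interest is earned?

```python
def solve() -> float:
    """Answer the question by writing each reasoning step as an assignment."""
    # Step 1: state the known quantities.
    principal = 10_000.0
    annual_rate = 0.05
    years = 3

    # Step 2: the final value depends on principal, rate and years.
    final_value = principal * (1 + annual_rate) ** years

    # Step 3: the interest earned depends on the final value and the principal.
    interest_earned = final_value - principal

    # Step 4: report the result to the nearest penny.
    return round(interest_earned, 2)


answer = solve()
print(answer)
```

Each intermediate quantity is named and computed from earlier ones, so the dependency chain is explicit and the arithmetic is delegated to the interpreter rather than left to free-text reasoning.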
Code data was reweighted to favour languages that actually matter in finance: Python for quantitative work, Markdown for reporting, and JavaScript/TypeScript for structured data handling.
Instruction following before instruction tuning
Unlike traditional pipelines, high-quality instruction data was introduced during CPT itself. Older FLAN data was replaced with the more recent Tulu-3 mix, alongside specialised QA datasets and curated reasoning traces distilled from state-of-the-art models. The goal was to smooth the transition into supervised fine-tuning rather than treating instruction following as an afterthought.
Training for scale and context
Supporting this data strategy required meaningful pipeline changes. The training stack was extended to support new external models within NVIDIA NeMo, alongside long-context training using YaRN-scaled RoPE.
All M-family models were trained with a sequence length of 16,384 tokens and configured for extrapolated inference contexts of up to 256k tokens, which is critical for document-heavy financial use cases. The 14B model underwent CPT and annealing over 10B tokens, while the 3B and 8B variants followed a longer 15B-token CPT phase to compensate for their smaller size.
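The relationship between the training length and the extrapolated context can be sketched as a scaling factor. The snippet below is an illustration under stated assumptions: it takes 256k to mean 256 × 1024 tokens, measures the factor relative to the 16,384-token training length rather than any base model's original context, and uses dictionary keys that mirror a common model-config convention. None of this is taken from the actual NeMo configuration.

```python
def yarn_scaling_factor(train_len: int, target_len: int) -> float:
    """Return the context extension ratio used as a YaRN-style scaling factor."""
    if train_len <= 0:
        raise ValueError("train_len must be positive")
    if target_len < train_len:
        raise ValueError("target_len must be at least train_len")
    return target_len / train_len


TRAIN_SEQ_LEN = 16_384          # stated training sequence length
TARGET_CONTEXT = 256 * 1024     # assumption: 256k means 262,144 tokens

factor = yarn_scaling_factor(TRAIN_SEQ_LEN, TARGET_CONTEXT)

# Hypothetical config fragment; key names are illustrative, not the NeMo settings.
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": TRAIN_SEQ_LEN,
}
print(rope_scaling)
```

Under these assumptions the extension ratio is 16, which gives a sense of how far inference contexts are stretched beyond what the model saw during training.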
What the base model results show
The updated 14B M base model shows consistent gains across finance, UK-specific finance and general benchmarks. Most notably, it closes much of the gap to the 24B M model, matching it on general benchmarks and coming within a narrow margin on finance. This matters operationally: mid-sized models are far easier to deploy, fine-tune and serve.
The 8B M model similarly outperforms the previous 7B Q base across all categories, while the new 3B M variant establishes a genuinely viable lightweight option. Despite its size, the 3B M model approaches 7B Q-level finance performance, making it a strong candidate for task-specific fine-tuning where latency or cost constraints dominate.

CPT, annealing and what they’re actually worth
A critical question is whether continued pre-training meaningfully improves downstream instruction-tuned performance. The answer is nuanced.
When models were fine-tuned on the FinLLM instruction dataset, CPT and annealing consistently outperformed straight supervised fine-tuning of the base models. However, the average gains were modest, typically under 1%.
That said, the benefits are not evenly distributed. Niche but important capabilities like tabular reasoning and multi-turn financial dialogue showed more consistent improvements. In other words, CPT is not a silver bullet, but it is a worthwhile investment when optimising for specific financial behaviours rather than leaderboard averages.
A substantially stronger instruction dataset
Alongside the base model work, the FinLLM instruction dataset was expanded from roughly 250k to 425k samples by lowering the financial classifier threshold and scaling across multiple open datasets. Instruction-following performance improved by more than nine percentage points over the previous version of the data mix, with gains across nearly every evaluated category except summarisation. This reinforces a recurring theme: data quality and relevance matter more than sheer model size.

Comparison of different SFT mixes on the 14B M model
Safety as a first-class objective
Safety work was not bolted on at the end; it was embedded into both the data and the training strategy. The added safety training datasets improved measured performance across toxicity, bias and harmful instruction categories without degrading financial task accuracy. While some bias categories still require further work (likely via preference-based fine-tuning and human feedback), the results demonstrate that safety and task performance are not inherently in tension when handled carefully.
How the M-family compares in practice
When instruction-tuned, all FinLLM-M variants outperform equivalent external models by a wide margin (often 10+ percentage points) on finance-specific benchmarks, while remaining competitive on general tasks.
The comparison also highlights an important deployment insight: bigger is not always better. Mid-sized models perform exceptionally well on classification and summarisation, while larger models are preferable for user-facing, conversational tasks. This supports a model suite approach rather than a single "one-model-fits-all" strategy.
The FinLLM-M work reinforces a few important lessons:
- Domain-specific performance comes from data and evaluation, not just architecture.
- Long-context capability is essential for real financial documents.
- Mid-sized models can deliver flagship-level performance when trained correctly.
- Safety and governance are easier to achieve when incorporated from day one.
Most importantly, the results show steady, explainable progress rather than flashy but brittle gains. As FinLLM continues to evolve into an integrated intelligence layer for agentic financial workflows, the M-family provides a solid, evidence-backed foundation built for how financial services actually work.