Aligning LLMs to Financial Services: The technical procedures behind Aveni’s FinLLM

AI is now a core component in many organisations, but it often relies on general-purpose models that are not aligned to a sector or use case. Financial services is a particularly sensitive and regulated industry, where misalignment can lead to overly generalised outputs, unreliable responses, and breaches of regulatory compliance. Tackling this requires purpose-built AI systems, developed by domain experts, that address the industry’s unique requirements. That’s why we created FinLLM, our suite of vertically aligned language models tailored to meet the complex demands of the UK financial services sector.

Through continued pretraining and supervised fine-tuning (SFT), we’ve observed that careful curation of the training data mix is essential for optimising performance across our evaluation benchmarks. 

To align the models with the unique language, regulations, and use cases of UK finance, our SFT pipeline incorporates a tailored blend of finance-specific data, instruction-following datasets, and NLP task-specific data relevant to real-world financial applications.

Data mixes for finance adaptation

Our SFT data mixes are created by:

  1. setting a fixed maximum number of examples,
  2. creating mixes using different ratios of data within that budget, and
  3. performing SFT with each mix on the base model to see which performs best on our finance benchmarks.

We apply this curated data mix across all our models. The data is grouped into four key categories: finance, instruction following, tabular, and mathematics – reflecting the core capabilities we currently target. Finance data is fixed at 25% of the total mix to maintain strong domain alignment, while the proportions of the other three categories are adjusted between 0% and 50% depending on the model’s focus, with the remaining ratios scaled accordingly.
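As a minimal sketch of this construction (the category labels, the equal split of the leftover budget, and all numbers here are illustrative assumptions, not the exact FinLLM recipe):

```python
def build_mix(total_examples, focus, focus_share):
    """Build a data mix: finance is fixed at 25% of the total, one 'focus'
    category gets focus_share (between 0% and 50%), and the remaining two
    categories split what is left over (assumed equally here)."""
    variable = ["instruction", "tabular", "mathematics"]
    assert focus in variable and 0.0 <= focus_share <= 0.5
    shares = {"finance": 0.25, focus: focus_share}
    leftover = 1.0 - 0.25 - focus_share
    rest = [c for c in variable if c != focus]
    for c in rest:
        shares[c] = leftover / len(rest)
    # Convert shares into example counts.
    return {c: round(total_examples * s) for c, s in shares.items()}

# A mathematics-focused mix over a budget of 100k examples:
mix = build_mix(100_000, focus="mathematics", focus_share=0.5)
# → {'finance': 25000, 'mathematics': 50000,
#    'instruction': 12500, 'tabular': 12500}
```

Sweeping `focus` and `focus_share` over a grid, then fine-tuning and evaluating each resulting mix, reproduces the three-step procedure above.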

Safety & values alignment

Incorporating safety-focused data during supervised fine-tuning enables the model to learn how to respond appropriately to malicious or toxic inputs. However, an overemphasis on safety can lead to excessive refusal behaviour, where the model declines to respond even to legitimate queries. While this may produce a highly cautious model well suited to content moderation, it risks undermining the model’s effectiveness in real-world applications such as financial services, where informative and context-aware responses are critical. To address this, it’s important that the model not only learns to avoid unsafe outputs but also gains a clear understanding of what constitutes an unsafe input, enabling it to generate suitable responses.

We emphasised bias and toxicity reduction using the ToxiGen dataset, and value alignment using the ETHICS dataset, which evaluates whether models can predict basic human ethical judgements. We found that these considerably boosted performance across all evaluation benchmarks (finance, general, and safety), suggesting that improving safety performance also improves performance on general and domain-specific tasks alike.
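In practice, this means converting labelled safety data into instruction–response pairs before it enters the SFT mix. A minimal sketch of that conversion (the chat schema and field names are assumptions, not the exact FinLLM format):

```python
def to_sft_example(prompt, is_toxic, refusal_response, helpful_response):
    """Turn a labelled safety example into a chat-style SFT record.
    Toxic prompts are paired with a refusal that explains *why* the input
    is unsafe, so the model learns to recognise unsafe inputs rather than
    refusing indiscriminately."""
    response = refusal_response if is_toxic else helpful_response
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }

record = to_sft_example(
    prompt="Write an insult about my colleague.",
    is_toxic=True,
    refusal_response="I can't help with that: it targets a person with abuse. "
                     "I'm happy to help draft constructive feedback instead.",
    helpful_response="",
)
```

Benign prompts flow through the same function with `is_toxic=False`, which is what keeps the refusal rate on legitimate queries in check.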

Technical training details

We perform SFT using the Axolotl framework. We chose hyperparameters by starting from common values in the research literature and then performing a hyperparameter sweep. Specifically, we use a sequence length of 4096 tokens, an effective batch size of 128, and train for 2 epochs with the AdamW optimiser, a 5% warm-up, and a cosine-decaying learning rate peaking at 5e-5. We withhold 10% of the SFT data mix as a validation set, measure cross-entropy loss every quarter epoch, and take the checkpoint with the lowest validation loss as the final model. Each 2-epoch training run takes approximately 90-120 minutes on 4xH100 GPUs. All of our supervised fine-tuning runs use full-parameter optimisation, as we observed that parameter-efficient fine-tuning techniques such as LoRA introduced training instability.
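The numbers above imply a simple training schedule. A back-of-the-envelope sketch (the per-device batch size, gradient-accumulation factor, and dataset size are illustrative assumptions; only the effective batch size of 128, the 2 epochs, the 5% warm-up, and the quarter-epoch validation come from our setup):

```python
def schedule(num_examples, per_device_batch=4, grad_accum=8, gpus=4, epochs=2):
    """Derive step counts for an SFT run like the one described above."""
    effective_batch = per_device_batch * grad_accum * gpus  # 4 * 8 * 4 = 128
    steps_per_epoch = num_examples // effective_batch
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": round(0.05 * total_steps),   # 5% warm-up
        "eval_every": max(1, steps_per_epoch // 4),  # quarter-epoch validation
    }

# e.g. a hypothetical 128k-example mix:
schedule(128_000)
# → {'effective_batch': 128, 'total_steps': 2000,
#    'warmup_steps': 100, 'eval_every': 250}
```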

Real-world alignment

These data-mix experiments helped us identify the most effective configurations for balancing performance, domain relevance, and safety. Our preliminary results show improved performance on our General and UK Finance evaluation datasets (the latter made up of in-house datasets constructed using questions from exam papers for the Diploma in Financial Planning from the Chartered Insurance Institute (CII) and study guides from the Chartered Financial Analyst (CFA) exam). Compared to FinLLM v1 (released March 2025), our general finance benchmark scores increased from 0.48 to 0.51, while UK finance-specific benchmarks rose from 0.61 to 0.62.

Our data mix strategy and supervised fine-tuning (SFT) approach enable us to align our models with real-world financial services use cases, beginning with practical applications in our own Aveni products. This includes tasks such as summarising adviser-client call transcripts, generating suitability reports, and detecting vulnerability or dissatisfaction in client conversations.

These tasks are directly tied to the needs of the financial advice sector and play a crucial role in aligning FinLLM with industry requirements. Looking ahead, we plan to enhance this alignment through post-training techniques, such as Direct Preference Optimisation (DPO), which uses ranked response pairs based on human preference, and Reinforcement Learning from Human Feedback (RLHF), which involves training a reward model to incorporate human judgement into the fine-tuning process.
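To make the DPO idea concrete, here is its per-pair loss in a standard formulation (the log-probability values in the example are made up; this is a sketch of the objective, not our training code):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one ranked response pair, given total sequence
    log-probabilities under the policy being trained and a frozen
    reference model. Minimising it increases the policy's relative
    preference for the chosen (human-preferred) response."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already favours the chosen response more than the
# reference does, the margin is positive and the loss falls below log 2:
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Unlike RLHF, this needs no separate reward model: the ranked pairs and the frozen reference stand in for one, which is why DPO is often the simpler first step.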

These methods will allow us to shape FinLLM’s responses for factual accuracy, regulatory compliance, appropriate tone, and structured format, ensuring it continues to evolve as a purpose-built AI solution for the financial services industry.
