How Aveni Labs Is Enhancing FinLLM with Purpose-Built Synthetic Financial Data

Why This Research Matters

When the Aveni Labs team set out to build FinLLM, our large language model designed specifically for the financial services sector, we knew we had to do something different. Finance is built on numerical and structured data, not just paragraphs of text. But when we looked closely at the datasets available for training large models, we quickly spotted a big problem: there was hardly any financial tabular data.

That is a real issue, because tables are everywhere in finance. Whether it is a client’s investment portfolio, an insurance premium breakdown or a regulatory report, understanding tables is absolutely essential. Without enough exposure to this kind of structured information during training, a model simply cannot perform at the level the industry needs.

So, our Labs team got to work. We developed a way to generate realistic, high-quality synthetic financial data, with a strong focus on tables and structured reasoning. This is a big deal for FinLLM, and for the financial services sector more broadly, because it helps create AI models that are genuinely capable of understanding and working with the kind of data finance professionals rely on every day.

Key Takeaways

  • Tabular Data Gap Identified: Financial services data is heavily structured, but existing model training data lacked enough financial tables and numerical content.
  • Synthetic Data Created at Scale: Using a mix of financial web data, we generated over 1 billion tokens of high-quality synthetic documents, including financial tables and reasoning examples, which we used for continued pretraining of FinLLM.
  • FinLLM Strengthened: This synthetic data significantly improves FinLLM’s ability to understand, reason over and generate structured financial information.
  • Direct Value for the Industry: By training FinLLM with richer, finance-specific data, we are helping to build AI that is far better suited to real-world financial services use cases.

Let us take you through how we did it.

Our Approach to Generating Synthetic Financial Data

1. Finding the Right Starting Points – Seeds

First, we needed good material to work from. We call these ‘seeds’: short snippets of content that capture key financial ideas or topics. We gathered them in three ways:

a) We extracted finance-related seeds from trusted sources using our in-house finance classifier. After thorough cleaning, including anonymising sensitive content and removing anything off-topic, we ended up with around 140,000 high-quality seed documents. To make sense of them, we used machine learning to group them into clusters such as ‘Investment Fees’, ‘Insurance Premiums’ and ‘Inflation and Indexes’. This gave us a broad, well-organised base to work from and allowed us to remove noisy seed data that we found irrelevant.

b) We also wanted FinLLM to be especially strong when it comes to UK financial regulations. So, we turned to the FCA’s standards that outline what financial advisers and professionals need to know. We expanded on these standards and used them as additional seeds that are specific to UK financial services. This means FinLLM will be well-versed in the rules and standards that matter most to the UK market.


c) Finally, we tapped into existing web-scale corpora containing millions of tables from across the internet. Not all of them are about finance, so we extracted 750,000 finance-related examples using our in-house finance classifier. Rather than using the raw tables directly (they were often messy), we used them as inspiration: we asked AI models to write new analytical articles based on each table, complete with their own tables, calculations and summaries.
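
The table-as-inspiration step in (c) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for FinLLM: the function names, the prompt wording and the sample table are all hypothetical, and the finance classifier that selects the tables is not shown.

```python
def table_to_markdown(header, rows):
    """Render a header and a list of rows as a Markdown table string."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def build_table_prompt(header, rows):
    """Wrap a raw table in a prompt asking for a new analytical article.

    The wording is a hypothetical example, not the prompt used for FinLLM.
    """
    table = table_to_markdown(header, rows)
    return (
        "The table below is only inspiration; do not copy it verbatim.\n\n"
        f"{table}\n\n"
        "Write a new analytical article on the same financial topic. "
        "Include your own tables, step-by-step calculations and a short summary."
    )


# Hypothetical seed table (illustrative values only).
header = ["Fund", "Annual fee (%)", "Balance (GBP)"]
rows = [["Fund A", 0.75, 10000], ["Fund B", 1.20, 10000]]
prompt = build_table_prompt(header, rows)
print(prompt)
```

Passing the cleaned-up table as context, rather than training on it directly, keeps the messy source out of the final dataset while still anchoring the generated article to a realistic financial topic.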

2. Writing Prompts to Create New Content

Next, we needed to turn those seeds into full pieces of pretraining data. To do that, we designed a set of clever prompts for the AI models.
To get diverse documents, we prompted the LLM to generate data for different personas (e.g. some aimed at finance professionals, some aimed at school students) and in different styles (e.g. textbook chapters or step-by-step guides). This resulted in five different prompt templates, which we applied to all the types of financial topics we had identified.
For example, a prompt might ask the model to write a detailed university-level course unit on “Types of Subprime Mortgages – Dignity Mortgages”, complete with tables and calculations. This way, we were not just getting surface-level content – we were encouraging the model to really dive deep into each topic.
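
As a rough sketch of how personas and styles can be combined with a seed, consider the snippet below. The article reports five templates; this example uses a single hypothetical template crossed with two personas and two styles purely for illustration, and none of the names or wording come from the actual FinLLM prompts.

```python
from itertools import product

# Hypothetical personas and styles, taken from the examples mentioned above.
PERSONAS = ["finance professionals", "school students"]
STYLES = ["textbook chapter", "step-by-step guide"]

# Hypothetical template; the real prompt wording is not published here.
TEMPLATE = (
    'Write a {style} for {persona} on the topic: "{seed}". '
    "Include at least one table and a calculation based on it."
)


def build_prompts(seed):
    """Return one prompt per (persona, style) combination for a seed."""
    return [
        TEMPLATE.format(style=style, persona=persona, seed=seed)
        for persona, style in product(PERSONAS, STYLES)
    ]


prompts = build_prompts("Types of Subprime Mortgages - Dignity Mortgages")
for p in prompts:
    print(p)
```

Each seed yields several prompts, so the same topic is covered for different audiences and in different formats.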

Below, we showcase a generated article where we set the target audience as “middle school students”. The document seed was: “The purpose and structure of the UK financial services industry, specifically on Understanding UK Taxation System”. The generated document explains the seed’s concept in a way that suits the target audience. Moreover, it not only contains tables but also presents additional calculations based on the table’s content. The generation is reproduced in the example at the end of this article.

3. Generating and Cleaning the Data

We then used top-performing open models to generate the actual synthetic data. After generating text and tabular data, we conducted another round of cleaning to make sure everything was accurate, relevant and well-structured, as not every data sample was as well-structured as the example above. This gave us a large, diverse and high-quality dataset of 2 billion tokens that covers a huge range of financial topics and, crucially, includes lots of tables and numerical reasoning.
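
The cleaning rules themselves are not described here, so the snippet below shows just one plausible structural check: rejecting generated documents whose Markdown tables are missing or have rows of differing widths. The function name and the heuristic are assumptions for illustration, not the filter used for FinLLM.

```python
def markdown_tables_are_consistent(text):
    """Heuristic check that every Markdown table has a uniform column count.

    A table is taken to be a run of consecutive lines starting with '|'.
    Each table needs at least a header, a separator and one data row.
    Documents with no table at all return False, since tables are the goal.
    """
    tables, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            current.append(stripped)
        elif current:
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    if not tables:
        return False
    for table in tables:
        # Count cells by counting the inner pipe separators of each row.
        widths = {row.strip("|").count("|") + 1 for row in table}
        if len(widths) != 1 or len(table) < 3:
            return False
    return True


# Hypothetical generated documents.
good_doc = "Intro\n\n| Tax | Rate |\n| --- | --- |\n| VAT | 20% |\n\nSummary"
bad_doc = "| Tax | Rate |\n| --- | --- |\n| VAT |"
kept = [d for d in [good_doc, bad_doc] if markdown_tables_are_consistent(d)]
print(len(kept))
```

A structural check like this only catches malformed tables; verifying that the numbers and calculations are correct would need a separate step.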

Results

The figure below illustrates the performance of FinLLM before and after incorporating synthetic data, evaluated across both financial and general-purpose tabular reasoning datasets. We also show the base model, Qwen 2.5, on which FinLLM was trained. Our approach not only delivers clear improvements over the previous version of FinLLM on financial tabular benchmarks (TAT-QA and TAT-HQA) but also achieves marginal gains or maintains performance on general-purpose tabular tasks (TableBench).

 

What This Means for FinLLM – and for Financial Services

Thanks to all of this work, FinLLM will be much better at understanding and reasoning over financial tables, calculations and structured information – areas where most general-purpose models struggle.
For the financial services industry, this is hugely important. Whether it is for improving regulatory reporting, enhancing client communications, supporting investment analysis or automating compliance checks, models like FinLLM can offer smarter, more accurate support if they truly grasp the structured data that underpins the sector.
We’ve made solid progress at Aveni, and our latest synthetic data capability marks an important step forward. It is not just a technical milestone; it is a foundation for building AI that truly understands finance, and that is something the whole industry can benefit from.

Example: 

Seed:

The purpose and structure of the UK financial services industry, specifically on Understanding UK Taxation System

Target audience:

Middle school students

Generation:

Imagine you’ve got a piggy bank where you keep all your money safe. You might use that money to buy toys, video games, or even save up for something really special, like a new bike. But have you ever thought about how that money is used to help everyone in your community? Well, that’s where taxes come into play!

In the UK, the government collects a certain amount of money from people and businesses through something called taxes. This money is used to pay for things that help everyone, like schools, hospitals, and parks. Let’s break down the basics of the UK taxation system so you can understand how it all works!

### What are Taxes?

Taxes are amounts of money that people and businesses must pay to the government. This money is collected in various forms and serves different purposes. There are many kinds of taxes, but we’ll focus on a few primary ones:

1. **Income Tax**: This is the tax you pay on money you earn. For example, if you earn £1,000, the government will take a part of that money to invest in community projects like schools and roads.

2. **Value Added Tax (VAT)**: This is a tax on goods and services. When you buy a toy for £10, a small amount of that goes to the government.

3. **National Insurance**: This is a tax that helps pay for social security benefits, including pensions and unemployment benefits.

### Why Do We Pay Taxes?

Taxes play a vital role in maintaining the services and infrastructure that we often take for granted. Here are a few important reasons why paying taxes is necessary:

- **Education**: Taxes help pay for teachers, textbooks, and school buildings so that students can receive a good education.

- **Healthcare**: Taxes fund the National Health Service (NHS), which provides healthcare services like hospitals and clinics for everyone.

- **Public Transportation**: Taxes help keep trains, buses, and other public transport running, making it easier for people to get around.

### Types of Taxes in the UK

Let’s take a closer look at the various types of taxes and how they impact our lives. Here’s a table showing some basic information about a few common taxes in the UK.

 
