Everybody likes clean data, especially dev and data teams working on projects under strict deadlines. In reality, data preparation takes significant effort, and open banking data perhaps demands more of it than other data sources. In this article we will highlight some of the reasons why open banking data is so raw and what developers and data scientists can do about it.
What is open banking data?
Open banking is an efficient way for bank customers to become the masters of their financial data. The Open Banking Implementation Entity describes open banking as a solution to “open the way to new products and services that could help customers and small to medium-sized businesses get a better deal.” Creating new value for the end-user of the bank is at the very heart of open banking: it empowers the end-user to control their data in a way that we’ve never seen before. It also levels the playing field between the incumbent financial institutions and newcomers such as challenger banks and fintechs.
One of the most exciting things about open banking is that it allows any institution (even those not directly related to finance) to leverage open banking data and build new powerful experiences for their customers. It’s an opportunity to build smarter banking accounts, cheaper credit cards, more affordable insurance products and less intrusive credit checks across many industries.
Open banking is quietly making the world a better place, with plenty of success stories along the way. However, it also has a dark side, known only to the developers and data scientists who use open banking to build the next generation of fintech applications: the data comes in a variety of formats, is often incredibly unstructured, and needs a lot of work before it can be used to build applications.
It starts with access to open banking data
Today, accessing open banking data is becoming increasingly simple and inexpensive: account aggregation is rapidly becoming common practice, and regulators in several markets are pushing banks to create open APIs and enable the free movement of account information. Many banks, lenders and fintechs choose to work with an account aggregation provider (or account information service provider in Europe), and there are plenty of providers to choose from.
From a technology perspective, the three most common ways to access account data are:
- Using official bank APIs — these are typically well-documented interfaces that banks have created themselves or built by following certain standards (e.g. Open Banking in the UK or PSD2 in Europe).
- Unofficial bank APIs — these are publicly facing interfaces that banks might use to power their own applications (e.g. mobile banking apps). They are typically not documented, as they are mostly used by the banks’ internal development teams, and are sometimes referred to as “reverse-engineered bank APIs”.
- Screen-scraping online banking interfaces — in cases when banks have customer-facing online bank interfaces, a common technology to use is the automated acquisition of data directly from the online bank website interface.
This is where the challenges with open banking data start. To get good coverage of bank connections, developers are often required to use multiple account aggregators and banking APIs. Because each access point structures its output differently, the contents of the API responses can vary. To make things even more complicated, you might get different data points from each of the three access methods, even for the same bank.
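As a minimal sketch of what this means in practice, imagine two hypothetical aggregator payloads describing the same card purchase (the field names, formats and values below are invented for illustration). Each one has to be mapped onto a common internal schema before any analysis can begin:

```python
from datetime import datetime

# Hypothetical payloads: two aggregators describing the same transaction
# with different field names, date formats and amount conventions.
aggregator_a = {
    "bookingDate": "2021-03-14",
    "amount": "-12.50",
    "currency": "EUR",
    "remittanceInformation": "CARD 1234 STARBUCKS RIGA",
}
aggregator_b = {
    "date": "14/03/2021",
    "value": {"amount": -1250, "scale": 2, "currency": "EUR"},
    "description": "STARBUCKS*RIGA LV",
}

def normalise_a(tx):
    """Map aggregator A's schema onto a common internal schema."""
    return {
        "booked_at": datetime.strptime(tx["bookingDate"], "%Y-%m-%d").date(),
        "amount": float(tx["amount"]),
        "currency": tx["currency"],
        "raw_description": tx["remittanceInformation"],
    }

def normalise_b(tx):
    """Map aggregator B's schema (minor units plus scale) onto the same schema."""
    value = tx["value"]
    return {
        "booked_at": datetime.strptime(tx["date"], "%d/%m/%Y").date(),
        "amount": value["amount"] / (10 ** value["scale"]),
        "currency": value["currency"],
        "raw_description": tx["description"],
    }

print(normalise_a(aggregator_a))
print(normalise_b(aggregator_b))
```

Multiply this by every aggregator, every bank and every access method in your coverage list, and the amount of mapping work adds up quickly.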
Contents of open banking data
When thinking about open banking data, picture your own bank account: it contains information about your purchases, subscriptions, income sources and loans. Open banking data is the same account information that regular bank users can see in their bank accounts. From a bank’s perspective, the customer’s view is often already an aggregated view of data coming from multiple databases.
Here are a few key points which contribute to how account information is sourced and stored, and why transaction text strings are often so unstructured.
- There are numerous ways to make payments, for example, bank transfers, mobile transfers, credit and debit cards (physical or digital), prepaid cards, electronic cheques. A bank might handle each transaction type separately and hence it might add different metadata or information to the transaction fields.
- There are countless payment gateways, schemes and intermediaries, through which the transactions flow and each adds their information to transactions (e.g. Paypal, Mastercard, Venmo). Each institution that handles a payment might use different taxonomies even for similar purchases.
- Bank account transfers can have user-generated texts. These types of transactions can have high value from the information point of view, but they are highly unstructured, which makes them hard to use in any automated analysis. The most challenging are bank account transfers that are made between two people, as they often contain references recognisable only by the people themselves.
- The banks have different approaches to storing data based on the way each bank sets up their core banking systems and what products they offer to their customers.
Transactional data quality depends heavily on these variables: each of the approaches above tends to produce a different data structure, which can make the data cleansing process relatively simple or, quite the opposite, highly complex and sometimes even problematic. If the data is “dirty”, it takes significantly more time and effort to use it in applications or for analysis, and valuable information often gets lost or goes unused. The more efficiently data preparation is carried out, the better the chances that the data can be successfully analysed and productively used. The more coherent the data, the more valuable the insights.
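To make this concrete, here are a few invented raw descriptor strings for broadly the same kind of spending, arriving via different payment rails and intermediaries. Even a naive keyword rule shows how unevenly they behave:

```python
# Illustrative (made-up) raw descriptors for similar spending,
# arriving via different payment rails and intermediaries.
raw_descriptions = [
    "POS PURCHASE 1234 SPOTIFY P07A4B STOCKHOLM SE",   # debit card via a card scheme
    "PAYPAL *SPOTIFYAB 35314369001 LUX",               # card payment routed through PayPal
    "DIRECT DEBIT SPOTIFY AB REF 0098776612",          # direct debit from the account
    "thanks for the concert tickets :)",               # person-to-person transfer, user-written text
]

# A naive keyword rule misses the last entry entirely, even though
# it is arguably the most informative one for the end user.
for text in raw_descriptions:
    matched = "spotify" in text.lower()
    print(f"{matched!s:>5}  {text}")
```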
How to approach open banking data?
According to Business Insider, it does not matter what type of organization you are running: all industries share the same problem with data, and that is data preparation. “A model is only as good as the data used to build it”, and for open banking this means that data preparation must be a critical element because of the unstructured nature of the data.
Before engaging with open banking data cleansing and preparation, a good question to ask ourselves and our teams is: how much time do we want to spend on data preparation? Building internal processes for data cleansing is quite time consuming, since data preparation can take up to 80% of a data scientist’s time. According to this survey, 76% of the participants stated that data cleansing, and preparation in general, are undoubtedly challenging tasks. After all, data can only be successfully analysed and categorised after it has been sufficiently prepared.
A growing number of development and data science teams choose to use pre-built transaction categorisation or account-based insights tools to have more time for data modelling and analysis. However, if data preparation is something you choose to handle by yourself, a great place to start is by performing data cleansing.
Data cleansing (or data cleaning) is a process during which errors in data are identified and removed, and inconsistencies are made coherent to improve the quality of data. Clean data is the type of structured data that holds information valuable to the business. Dirty data, on the other hand, is the type of data that is unstructured and does not add any value.
Some of the most popular data cleansing approaches include:
- Parsing, which means looking for recognizable data patterns;
- Standardization, meaning the data is transformed into a standard construction;
- Abbreviation expansion, where abbreviations are expanded into the full form. Sometimes this can also be done the other way around;
- Correction, meaning that data values not recognized by the model are repaired so that they can be used in further data analysis.
Typically, data cleansing consists of multiple approaches and is done by combining automatic and manual methods that tend to vary from team to team.
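As a rough sketch of how these approaches can be combined, the snippet below runs a single transaction description through standardisation, abbreviation expansion, parsing and a simple correction step. The abbreviation map and merchant list are illustrative placeholders, not a real taxonomy:

```python
import re

# Illustrative placeholders, not a real taxonomy.
ABBREVIATIONS = {"intl": "international", "pmt": "payment", "trf": "transfer"}
KNOWN_MERCHANTS = {"amazon", "spotify", "tesco"}

def cleanse(description: str) -> dict:
    # Standardisation: one case, no punctuation noise, single spacing.
    text = re.sub(r"[^a-z0-9\s]", " ", description.lower().strip())
    text = re.sub(r"\s+", " ", text).strip()

    # Abbreviation expansion: replace known short forms with their full form.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]

    # Parsing: look for a recognisable pattern, here a four-digit card reference.
    card_ref = next((tok for tok in tokens if re.fullmatch(r"\d{4}", tok)), None)

    # Correction: repair values that would otherwise go unrecognised,
    # e.g. a merchant name with a single missing trailing character.
    merchant = None
    for tok in tokens:
        if tok in KNOWN_MERCHANTS:
            merchant = tok
        elif merchant is None:
            candidates = [m for m in KNOWN_MERCHANTS
                          if m.startswith(tok) and len(m) - len(tok) == 1]
            if candidates:
                merchant = candidates[0]

    return {"clean_text": " ".join(tokens), "card_ref": card_ref, "merchant": merchant}

print(cleanse("INTL PMT AMAZO*N 1234 LUX"))
# {'clean_text': 'international payment amazo n 1234 lux', 'card_ref': '1234', 'merchant': 'amazon'}
```

A production pipeline would of course be far more elaborate, with merchant dictionaries, language handling and manual review loops, but the division of labour between the four steps stays roughly the same.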
Once the data has been cleansed of errors and inconsistencies, to create additional value to the data set, data enrichment should be completed. Data enrichment is “the process of enhancing existing information by supplementing missing or incomplete data.”
This process is typically done by matching the existing data against another source. Data enrichment can be applied to any type of raw data; one example is locations found in transactional data, where the city, street or postal code may be missing because the data is incomplete. There are business cases where it is important to know the exact address of the place where a purchase or transfer was made.
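A minimal sketch of that idea, assuming a hand-maintained merchant-to-location table standing in for an external data source (the merchant name and address below are invented):

```python
# Hypothetical lookup table standing in for an external location dataset.
MERCHANT_LOCATIONS = {
    "kiosk 24": {"city": "Riga", "street": "Brivibas iela 1", "postal_code": "LV-1010"},
}

def enrich(transaction: dict) -> dict:
    """Supplement missing location fields by matching the merchant against the lookup table."""
    enriched = dict(transaction)
    location = MERCHANT_LOCATIONS.get(transaction.get("merchant", "").lower())
    if location:
        for field, value in location.items():
            enriched.setdefault(field, value)  # only fill fields that are missing
    return enriched

print(enrich({"merchant": "Kiosk 24", "amount": -3.20}))
# {'merchant': 'Kiosk 24', 'amount': -3.2, 'city': 'Riga',
#  'street': 'Brivibas iela 1', 'postal_code': 'LV-1010'}
```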
Data enrichment matters because it lets data scientists add extra features to every observation they work with. Many different types of features can be added: age, address, education, income, and more. The more features that can be added, the more valuable the data set.
So — is data preparation the hardest thing?
The truth is: it’s quite hard. If you are planning to build an application or process that uses open banking data, make sure not to overlook the complexity of the data. The quality of the insights depends on the quality of the raw data, and data preparation is the step that can make or break any further analysis or modelling. If the raw data is not sufficiently cleaned, it cannot be sufficiently enriched; and if the data cannot be enriched, the modelling might produce faulty results. Only by taking this complexity into account and paying attention to all of the data preparation procedures is it possible to reach the best results.
Since data preparation is an essential part of data analysis but takes a lot of time, it is important either to allocate enough time for your team to work with the data or to consider a pre-built transaction cleansing, enrichment and categorisation solution. If you are thinking of giving such a solution a go, note that there is no one-size-fits-all option for open banking; the right choice depends on the kind of application you are building. For example, there are at least two major tracks for transaction categorisation: categorisation for personal financial management (PFM) and categorisation for risk use-cases. Learning what separates good categorisation from great is also useful when choosing the solution that is best for you.
Get Free Access to Account Information today or learn more about Nordigen’s Account Information product.
Connect to bank accounts and get raw transaction data. Free access to regulated banking data in Europe.