Transaction Fraud Detection 🕵️‍♂️| Automating money laundering alerts

Published in Analytics Vidhya · 9 min read · Aug 26, 2021

Ever since the advent of the internet, the digital revolution has been on the rise and has crept into all aspects of our lives. One of the most important changes took place in the financial system, especially the ability to send money digitally to anyone in any part of the world. Digital transactions have become part of daily life: purchasing a product online, sending money to friends, depositing cash in a bank account, investing and so on. They bring many benefits, but they have also paved the way for fraudulent activity. People started using digital money transfers to launder money and make it look as if it comes from a legal source.

💰Understanding the stages in process

The money laundering process has three stages, which help launderers clean their dirty money.

Placement: Dirty money is introduced into the financial system, where the funds are placed in financial instruments.
Layering: Complex financial transactions are carried out to camouflage the illegal source. This is the stage where transaction fraud happens, as launderers perform many wire transfers and create a complex loop in which it becomes difficult to identify the origin of the money.
Integration: Now that the origin of the money is concealed, the launderers bring it back into the economy by purchasing luxury assets and making financial and commercial investments.

☠️ Problem in the Anti-Money Laundering (AML) industry

The banking and financial industries provide AML support by following OFAC guidelines and FATF recommendations. These procedures are used to identify and block fraudulent activity in the system. A rule based algorithm is implemented in the system to raise an alert whenever a suspicious transaction is identified. These alerts are then investigated by an analyst, who decides whether the transaction is fraud (true positive) or not fraud (false positive). In the current scenario the algorithm produces a lot of false positive alerts, and investigators have to close each alert within a stipulated time. Because false positives are so numerous, organizations spend a lot of time clearing alerts and need a lot of human resources to investigate them. A great deal of cost and time is therefore wasted just because the rule based algorithm isn’t intelligent enough to recognize non-fraudulent transactions.

💡Solution

For example, a rule based algorithm might flag a high value transaction between a husband and wife. Here the algorithm treats the parties as customer A and customer B and doesn’t understand the underlying motive of the transaction or the relationship between them. The rule based algorithm is also fixed: it doesn’t learn or adapt to trends. Money launderers, meanwhile, keep adapting their strategy to avoid tripping the rules and getting caught. The need of the hour is therefore a system that learns the fraudulent patterns in transactions, flags only the transactions that are actually fraudulent and avoids false positive hits. This can be achieved by training a machine learning model on past data that contains the transaction details, whether the algorithm flagged the transaction as fraud, and the investigator’s outcome on those flagged alerts (actually fraud or not).

Dataset

Publicly available transaction datasets are scarce because of data privacy. Hence we use simulated data from PaySim, which simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. It uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions, and injects malicious behavior so that the performance of fraud detection methods can be evaluated later.

Find the dataset here

Dataset description

1. step: maps a unit of time in the real world. In this case 1 step is 1 hour. There are 744 steps in total (744 hours, or 31 days, although the dataset description calls it a 30-day simulation).
2. type: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
3. amount: amount of the transaction in local currency.
4. nameOrig: customer who started the transaction.
5. oldbalanceOrg: initial balance before the transaction.
6. newbalanceOrig: new balance after the transaction.
7. nameDest: customer who is the recipient of the transaction.
8. oldbalanceDest: initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).
9. newbalanceDest: new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).
10. isFraud: marks the transactions made by the fraudulent agents inside the simulation. In this dataset the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.
11. isFlaggedFraud: the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
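To make the flagging rule concrete, here is a minimal pandas sketch of it. The rows are invented for illustration and are not PaySim records, the `ruleFlag` column name is hypothetical, and writing the type labels with underscores is an assumption about how they appear in the CSV.

```python
import pandas as pd

# Invented rows that reuse the column names from the description above.
df = pd.DataFrame({
    "step": [1, 1, 2, 3],
    "type": ["TRANSFER", "CASH_OUT", "TRANSFER", "PAYMENT"],
    "amount": [250_000.0, 250_000.0, 1_500.0, 90.0],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["C9", "C8", "C7", "M1"],
})

FLAG_THRESHOLD = 200_000  # threshold quoted in the dataset description

# Rule as described: a single TRANSFER of more than 200,000 is flagged.
is_transfer = df["type"] == "TRANSFER"
is_large = df["amount"] > FLAG_THRESHOLD
df["ruleFlag"] = (is_transfer & is_large).astype(int)

print(df[["type", "amount", "ruleFlag"]])
```

Only the first row is flagged. The cash-out of the same size passes unnoticed, because the rule looks at transfers only.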

📋 Pivot table analysis

🔍Inference:
Under the current rule based algorithm, not a single fraudulent cash_out transaction was flagged, which is a serious concern for the anti money laundering system. Also, only 16 transactions are flagged as fraud, whereas around 4k transactions are actually fraud. Our mission now is to build an efficient algorithm that reduces the risk of letting fraudulent transactions go unblocked.
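A pivot table like the one discussed here can be sketched with `pandas.pivot_table`. The toy rows below are invented and only the column names follow the dataset description, so the counts are not the article's numbers.

```python
import pandas as pd

# Invented toy rows; underscore type labels are an assumption about the CSV.
df = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "amount": [500_000.0, 1_000.0, 300_000.0, 2_000.0, 50.0, 900_000.0],
    "isFraud": [1, 0, 1, 0, 0, 1],
    "isFlaggedFraud": [1, 0, 0, 0, 0, 0],
})

# Count transactions per type, split by actual fraud and by the rule-based flag.
pivot = pd.pivot_table(
    df,
    index="type",
    columns=["isFraud", "isFlaggedFraud"],
    values="amount",
    aggfunc="count",
    fill_value=0,
)
print(pivot)

# Fraudulent transactions that the rule never flagged, per type.
missed = df[(df["isFraud"] == 1) & (df["isFlaggedFraud"] == 0)].groupby("type").size()
print(missed)
```

The column pair `(1, 0)` holds the cases that matter most here: transactions that are fraud but were never flagged.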

🔍Inference:
From the table we can see that most customers use the system for transferring money, and that we have relatively little data for payments. The difference between the new and old balance is also interesting, as it tells us a story. This table shows only the origin account, and the balance has fallen for every transaction type except cash_in. Even for transfers the balance has fallen, which shows that the origin account holds the sender's information.

🔍Inference:
This table holds the information for the destination account. For transfers we can see the new balance increase, so this is the receiver’s information. No payment amounts are available on the destination side.
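The two balance tables can be approximated with one more pivot that averages the old and new balances per transaction type. This is a sketch on invented numbers, and the `origChange` and `destChange` columns are hypothetical helpers added for readability.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "TRANSFER", "PAYMENT"],
    "oldbalanceOrg": [1_000.0, 5_000.0, 8_000.0, 300.0],
    "newbalanceOrig": [1_500.0, 3_000.0, 2_000.0, 250.0],
    "oldbalanceDest": [4_000.0, 0.0, 1_000.0, 0.0],
    "newbalanceDest": [3_500.0, 2_000.0, 7_000.0, 0.0],
})

# Mean balance before and after the transaction, per type.
balances = pd.pivot_table(
    df,
    index="type",
    values=["oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"],
    aggfunc="mean",
)

# Average change on each side of the transaction.
balances["origChange"] = balances["newbalanceOrig"] - balances["oldbalanceOrg"]
balances["destChange"] = balances["newbalanceDest"] - balances["oldbalanceDest"]
print(balances[["origChange", "destChange"]])
```

A negative `origChange` together with a positive `destChange` is the sender-to-receiver pattern described in the inferences above.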

📊 Distribution of Amount

It is important to understand the distribution of our data, since it can play a major role in model building and in understanding the data itself. Going forward we use only 50k rows, as processing all the records for visualization and model building takes a lot of time. Here we check the distribution of the amounts transacted through the application.

🔍Inference:
From the plot we can see that the amounts are heavily right skewed: there are a lot of outliers, which go up to 10M, against a median of 33k. The 75th percentile sits at about 450k.
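The sampling and the summary numbers behind such a plot can be sketched as follows. The amounts are generated from a log-normal distribution purely as a stand-in for the real column, so the printed figures will not match the 33k and 450k quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for the real table: log-normal amounts are right skewed by construction.
df = pd.DataFrame({"amount": rng.lognormal(mean=10, sigma=1.5, size=200_000)})

# Work on a fixed-size random sample (50k rows, as in the article).
sample = df.sample(n=50_000, random_state=42)

# Quartiles, extremes and skewness of the sampled amounts.
summary = sample["amount"].describe(percentiles=[0.25, 0.5, 0.75])
skewness = sample["amount"].skew()

print(summary)
print("skewness:", skewness)
```

A positive skewness and a mean well above the median are the numeric counterparts of the long right tail seen in the plot.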

🔧Feature engineering

For readers who are not familiar with feature engineering: it is the process of using existing features, such as ‘type of transaction’ or ‘amount transferred’, to derive new features or change existing ones. Deriving more features can help the algorithm learn better and understand the underlying pattern. With only the available information it is hard to train the model and get good results, so we create new features from the existing ones. We create four functions, each producing a feature that is highly relevant to the domain:

Difference in balance: It is a universal truth that the amount debited from the sender's account is credited to the receiver's account without any deviation, down to the cent. But what if the amounts debited and credited differ? Some differences may be due to charges levied by service providers, yet we still need to flag such unusual instances.
Surge indicator: We also need to raise a flag when a large amount is involved in a transaction. From the distribution of amount we saw that there are many outliers with high amounts. Hence we take the 75th percentile (450k) as our threshold, and any amount greater than 450k is flagged.
Frequency indicator: Here we flag the user and not the transaction. A receiver who gets money from a lot of people could be a trigger, as it might point to illegal games of chance or luck. Hence a receiver is flagged when they receive money more than 20 times.
Merchant indicator: Receiver customer ids that start with ‘M’ belong to merchants, who will obviously have a lot of incoming transactions. So we also flag every transaction with a merchant receiver.
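The four indicators above could be sketched in pandas roughly as follows. This is one possible reading of the descriptions, not the code from the notebook: the function names, the output column names and the toy rows are all hypothetical, and the balance check uses a small tolerance to sidestep floating point noise.

```python
import pandas as pd

SURGE_THRESHOLD = 450_000  # 75th percentile quoted in the article
FREQ_THRESHOLD = 20        # receiver flagged above this many incoming transactions

def balance_diff(df):
    """1 where the amount debited from the sender differs from the amount credited."""
    debited = df["oldbalanceOrg"] - df["newbalanceOrig"]
    credited = df["newbalanceDest"] - df["oldbalanceDest"]
    return ((debited - credited).abs() > 1e-6).astype(int)

def surge_indicator(df, threshold=SURGE_THRESHOLD):
    """1 where the transaction amount exceeds the threshold."""
    return (df["amount"] > threshold).astype(int)

def frequency_receiver(df, threshold=FREQ_THRESHOLD):
    """1 on every row whose receiver appears more than `threshold` times."""
    counts = df.groupby("nameDest")["nameDest"].transform("count")
    return (counts > threshold).astype(int)

def merchant(df):
    """1 where the receiver id starts with 'M'."""
    return df["nameDest"].str.startswith("M").astype(int)

# Invented rows: nameDest, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
rows = [
    ("C100", 500_000.0, 600_000.0, 100_000.0, 0.0, 500_000.0),  # large amount, balances agree
    ("C200", 1_000.0, 5_000.0, 4_000.0, 0.0, 900.0),            # 100 goes missing on the way
    ("M300", 50.0, 200.0, 150.0, 0.0, 0.0),                     # merchant receiver
]
rows += [("C999", 10.0, 100.0, 90.0, 0.0, 10.0)] * 21           # one receiver, 21 payments
columns = ["nameDest", "amount", "oldbalanceOrg", "newbalanceOrig",
           "oldbalanceDest", "newbalanceDest"]
df = pd.DataFrame(rows, columns=columns)

df["balanceDiff"] = balance_diff(df)
df["surge"] = surge_indicator(df)
df["freqDest"] = frequency_receiver(df)
df["merchant"] = merchant(df)

print(df.head(4))
```

Note that the merchant row also trips the balance check: merchant balances are not recorded in the dataset, so the credited amount always looks like zero.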

👨🏻‍🔧Building the model and evaluation of results

We have a classification problem at hand: the algorithm has to decide whether a transaction is fraud (1) or not fraud (0). There are many different algorithms for classification problems. Here we run the data through several of them and select the one with the best accuracy.
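Looping over several classifiers and comparing their accuracy can be sketched with scikit-learn as below. The data is a synthetic, imbalanced stand-in generated with `make_classification`, not the engineered PaySim features, and the three models are simply the ones named in this article with assumed hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: about 5% of the rows belong to the positive (fraud) class.
X, y = make_classification(
    n_samples=2_000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model on the training split and record accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best = max(scores, key=scores.get)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
print("best by accuracy:", best)
```

With so few fraud cases, a model that predicts "not fraud" for everything already reaches about 95% accuracy on data like this, so accuracy is best read together with recall on the fraud class.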

💭Thoughts:
We can see who won the prize: it is Naive Bayes. Other algorithms performed on par with NB, especially Random Forest and KNN. Accuracy close to 100% suggests the models may be overfitted, which we can check using the test data.

🧪 Evaluation of model

After evaluating the model on the test dataset, the confusion matrix looks like this:

💭Thoughts:
The model produces some false positives but not a single false negative, and false negatives matter more than false positives. We cannot afford to miss a fraudulent transaction, whereas false positives can be handled by investigating them.
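For readers who want to see where these four counts come from, here is a small sketch using `sklearn.metrics.confusion_matrix`. The labels and predictions are invented to mirror the situation described (some false positives, no false negatives) and are not the model's actual output.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = fraud, 0 = not fraud.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("true negatives :", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives :", tp)
```

Here two legitimate transactions are wrongly flagged (false positives), while every fraudulent one is caught (no false negatives).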

Precision and Recall metrics:

💭Thoughts:
Since false negatives matter more to us than false positives, recall is the number to watch. We have 100% recall on the fraud transactions and 100% precision on the non-fraud transactions, and on average the model scores above 70%, which is pretty good. There is still room to improve the performance of this model.
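Precision and recall follow directly from the confusion matrix counts. The sketch below uses invented counts with zero false negatives. It shows why 100% recall on the fraud class and 100% precision on the non-fraud class go hand in hand: both become 1.0 exactly when FN = 0.

```python
def precision_recall(tp, fp, fn, tn):
    """Per-class precision and recall from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty classes to avoid dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "fraud": {
            "precision": ratio(tp, tp + fp),  # flagged as fraud and really fraud
            "recall": ratio(tp, tp + fn),     # fraud cases that were caught
        },
        "non_fraud": {
            "precision": ratio(tn, tn + fn),  # cleared and really clean
            "recall": ratio(tn, tn + fp),     # clean cases that were cleared
        },
    }

# Invented counts: some false positives, no false negatives.
metrics = precision_recall(tp=3, fp=2, fn=0, tn=5)

for label, values in metrics.items():
    print(label, {name: round(value, 3) for name, value in values.items()})
```

With these counts the fraud class has recall 1.0 but precision 0.6, which is the trade-off discussed above: nothing is missed, at the price of extra alerts to investigate.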

🍃 Conclusion

With the advent of digital transactions, the opportunities for money laundering through technology have soared as well. Large numbers of investigators are in the field fighting fraudulent transactions. The industry currently faces a large inflow of false positive hits, and clearing them takes a long time, while customers across the world using fintech platforms demand lightning fast services. Hence our aim is to automate the handling of hits with machine learning and reduce the false positives, but not at the cost of letting false negatives through. We need to stay mindful of false negatives whenever we try to reduce false positives.

Please share your comments on my work and do check out my other articles.

Code available at my Kaggle notebook