Predicting Account Receivables with Machine Learning

This project was certified by the IBM Data Science Board.

Presented at the KDD 2020 Workshop on Machine Learning in Finance.

The invoice-to-cash process involves several steps, from invoice creation to settlement or reconciliation of the customer's debt (payment). One key step of this process is the collection of accounts receivable. Accounts receivable (AR) refers to the invoices issued by a company for products or services already delivered but not yet paid for by its customers. Properly managing AR is a core accounting activity and a concern for any company, since it directly affects cash flow.

We present a case study carried out in partnership with a multinational bank. The bank sought innovative ways to proactively identify overdue ARs with a high probability of being paid, so that its managers and executives could take appropriate actions (such as reaching out to those customers and collecting those ARs).

Predicting an invoice payment is a typical supervised classification problem: given the original client dataset, we need to extract features that characterize each invoice with respect to the labeled classes, and then build a machine learning model to classify new invoices. An invoice is considered overdue if the payment occurred more than 5 days after the due date. The main reason for this time window is the time required to process a payment in the client's system.
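The labeling rule above can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the invoice identifiers, and the dates are hypothetical, and only the 5-day threshold comes from the text.

```python
from datetime import date

# Grace period taken from the text: a payment made up to 5 days after the
# due date is still treated as on time.
GRACE_DAYS = 5


def is_overdue(due_date: date, payment_date: date, grace_days: int = GRACE_DAYS) -> bool:
    """Return True when the payment occurred more than `grace_days` after the due date."""
    return (payment_date - due_date).days > grace_days


# Hypothetical invoices: (invoice_id, due_date, payment_date)
invoices = [
    ("INV-1", date(2018, 3, 1), date(2018, 3, 4)),  # 3 days after due date
    ("INV-2", date(2018, 3, 1), date(2018, 3, 6)),  # exactly 5 days after due date
    ("INV-3", date(2018, 3, 1), date(2018, 3, 7)),  # 6 days after due date
]

# Class label per invoice: True = overdue, False = on time.
labels = {inv_id: is_overdue(due, paid) for inv_id, due, paid in invoices}
```

Under this reading, only INV-3 is labeled overdue, since a payment exactly 5 days after the due date still falls inside the processing window.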

The dataset we received has 164,043 invoices from 8 Latin American countries, covering 3,779 customers and a total of 15,437 different contracts, ranging from January 2017 to August 2018. The dataset only contains information about payments, i.e., only about invoices: for instance, invoice value, country code, customer credit rating, contract number, customer number, etc. One of the biggest problems with our data is that our client has no relevant information about the customers themselves, such as industry sector, balance sheets, etc.

Transforming this sparse invoice information into relevant features for a machine learning model that predicts payment is a challenging problem. We used historical data to create aggregate features that bring more meaning to our set of invoice-level features, such as: paid invoices, total paid invoices, sum amount of paid invoices, total outstanding invoices, total outstanding late, sum total outstanding, sum late outstanding, average days late, average days outstanding late, pay frequency, and number of contracts the invoice is associated with. For these historical features, we needed to define how far back in time to look, i.e., a time window of w months.
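To make the idea of window-based historical features concrete, here is a small sketch that computes a few of the aggregates listed above for one invoice. It rests on several assumptions that are not in the original: the `Invoice` fields, the feature names, the use of the invoice's issue date as the reference point, and the approximation of a month as 30 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Invoice:
    # Hypothetical schema; the real dataset columns may differ.
    customer_id: str
    issue_date: date
    due_date: date
    amount: float
    payment_date: Optional[date] = None  # None means not yet paid


def historical_features(target: Invoice, history: list[Invoice], w: int) -> dict:
    """Aggregate the same customer's invoices issued in the w months before `target`."""
    ref = target.issue_date
    start = ref - timedelta(days=30 * w)  # assumption: one month is about 30 days
    past = [
        inv for inv in history
        if inv.customer_id == target.customer_id and start <= inv.issue_date < ref
    ]
    # Only payments already known at the reference date count as paid,
    # so no future information leaks into the features.
    paid = [inv for inv in past if inv.payment_date is not None and inv.payment_date <= ref]
    outstanding = [inv for inv in past if inv.payment_date is None or inv.payment_date > ref]
    late_outstanding = [inv for inv in outstanding if inv.due_date < ref]
    days_late = [max(0, (inv.payment_date - inv.due_date).days) for inv in paid]
    return {
        "paid_invoices": len(paid),
        "sum_paid_amount": sum(inv.amount for inv in paid),
        "outstanding_invoices": len(outstanding),
        "outstanding_late": len(late_outstanding),
        "sum_outstanding": sum(inv.amount for inv in outstanding),
        "sum_late_outstanding": sum(inv.amount for inv in late_outstanding),
        "avg_days_late": sum(days_late) / len(days_late) if days_late else 0.0,
    }


# Hypothetical history for illustration only.
history = [
    Invoice("C1", date(2018, 1, 10), date(2018, 2, 10), 100.0, date(2018, 2, 20)),
    Invoice("C1", date(2018, 3, 15), date(2018, 4, 15), 200.0, date(2018, 4, 15)),
    Invoice("C1", date(2018, 4, 15), date(2018, 5, 15), 50.0),
    Invoice("C2", date(2018, 4, 1), date(2018, 5, 1), 75.0, date(2018, 5, 1)),
]
target = Invoice("C1", date(2018, 6, 1), date(2018, 7, 1), 300.0)

features_w3 = historical_features(target, history, w=3)
features_w6 = historical_features(target, history, w=6)
```

Changing w changes which past invoices contribute to the aggregates, which is why each window produces a different dataset.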

We present the accuracy results (Figure 1) for all the datasets generated with w=2,..,12 months and all the classifiers that we tested: Deep Neural Network, Naive Bayes, Logistic Regression, k-Nearest Neighbors, Random Forest, and Gradient Boosted Decision Trees. We highlight the best accuracy, which is obtained with 5 months (w=4). This result shows that using all the available data is not always the best option, especially because the concept drift we observed favors using less data.

Figure 1. Experiments with different time windows and several machine learning methods
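The experiment behind Figure 1 amounts to a sweep over window sizes and classifiers. The sketch below shows only the bookkeeping of such a sweep. The `evaluate` callback is an assumption standing in for "build the dataset for window w, train the named classifier, and return its validation accuracy", and `toy_evaluate` returns placeholder scores that have no relation to the study's results.

```python
from typing import Callable, Iterable


def sweep_windows(
    windows: Iterable[int],
    classifiers: Iterable[str],
    evaluate: Callable[[int, str], float],
) -> tuple[dict, tuple[int, str]]:
    """Evaluate every (window, classifier) pair and return all scores plus the best pair."""
    names = list(classifiers)
    results = {(w, name): evaluate(w, name) for w in windows for name in names}
    best = max(results, key=results.get)
    return results, best


def toy_evaluate(w: int, name: str) -> float:
    # Placeholder scoring rule for illustration only; NOT the study's accuracies.
    return round(0.5 + 0.01 * len(name) - 0.005 * abs(w - 6), 4)


# Hypothetical classifier labels; windows follow the w=2,..,12 range in the text.
classifier_names = ["naive_bayes", "logistic_regression", "knn", "random_forest", "gbdt"]
results, best = sweep_windows(range(2, 13), classifier_names, toy_evaluate)
```

In practice `evaluate` would wrap the actual training and validation code, and the returned `results` table is what a plot like Figure 1 would be drawn from.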


Data distribution for training, validation and test sets. The classes are balanced and our baseline is close to 61%. We split the data by time, since we cannot use future data to make predictions. We use around 70% of the available data to train the model and split the remaining 30% into test and validation sets, as presented in Figure 2.

Figure 2. Data split into train and test sets
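A time-ordered split of this kind can be sketched as below. The 70% training share comes from the text. The even division of the remaining 30% between validation and test, the record layout, and the function name are assumptions made for illustration.

```python
from datetime import date, timedelta


def time_split(records: list[dict], train_frac: float = 0.70, val_frac: float = 0.15):
    """Split records chronologically: oldest for training, newest for testing."""
    ordered = sorted(records, key=lambda r: r["date"])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical records, deliberately out of chronological order.
records = [
    {"invoice_id": i, "date": date(2017, 1, 1) + timedelta(days=30 * i)}
    for i in reversed(range(20))
]

train, val, test = time_split(records)
```

Because the records are sorted before slicing, every training invoice is dated no later than any validation invoice, and every validation invoice no later than any test invoice, so the model is never trained on data from the future.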

We deployed our model and the prioritization ranking list using a RESTful API, which makes maintaining the model (retraining it, for instance) and adjusting other services easier.

Finally, as real-world environments are in continuous flux, the distributions of our features are shifting as well. We noted that the current AR process has been modified over the last year and, as it evolves, our model should keep pace with it.


Ana Paula Appel, Gabriel Louzada Malfatti, Renato Luiz de Freitas Cunha, Bruno Lima, : Predicting Account Receivables with Machine Learning. CoRR abs/2008.07363 (2020)

Ana Paula Appel, Victor Oliveira, Bruno Lima, Gabriel Louzada Malfatti, Vagner Figueredo de Santana, Rogério de Paula: Optimize Cash Collection: Use Machine learning to Predicting Invoice Payment. CoRR abs/1912.10828 (2019)