Francis Analytics
Actuarial Data Mining Services

See Also
"Detecting Fraud Using Benford's Law"

"Data Mining Intro for Students"

and other Student Papers

Contact us:
215-923-1567

louise.francis@data-mines.com

Resources for Learning and Exploration - Papers


Louise Francis and Virginia R. Prevosto
"Data and Disaster: The Role of Data in the Financial Crisis"
Casualty Actuarial Society E-Forum, Spring 2010

<download>

Abstract

Motivation. Since 2007 a global financial crisis has been unfolding. The crisis was initially caused by defaults on subprime loans, aided and abetted by pools of asset-backed securities and credit derivatives, but corporate defaults, such as that of Lehman Brothers, and outright fraud have also contributed to the crisis. Little research has been published investigating the role of data issues in various aspects of the financial crisis. In this paper we illustrate how data that was available to underwriters, credit agencies, the Securities and Exchange Commission
(SEC), and fund managers could have been used to detect the problems that led to the financial crisis.

Method. In this paper we show that data quality played a significant role in the mispricing and business intelligence errors that caused the crisis. We utilize a number of relatively simple statistics to illustrate the due diligence that should have been, but was not, performed. We use the Madoff fraud and the mortgage meltdown as data quality case studies. We apply exploratory procedures to illustrate simple techniques that could have been used to detect problems. We also illustrate some modeling methods that could have been used to help underwrite mortgages and find indications of fraud.

Conclusions. Data quality issues made a significant contribution to the global financial crisis.
Keywords. Data, data quality, financial crisis
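The paper's "relatively simple statistics" include first-digit (Benford's law) screens of the kind also covered in "Detecting Fraud Using Benford's Law" above. A minimal sketch of such a screen follows; the helper function and chi-square comparison are our own illustration, not code from the paper:

```python
import math
from collections import Counter

def benford_screen(values):
    """Compare observed first-digit frequencies with Benford's law.

    Returns (observed_freq, expected_freq, chi_square) over digits 1-9.
    A large chi-square suggests the data's leading digits deviate from
    the Benford pattern and may warrant further due diligence.
    """
    # Leading significant digit: strip any leading zeros / decimal point.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    observed = {d: counts.get(d, 0) / n for d in range(1, 10)}
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    chi_sq = sum((counts.get(d, 0) - n * expected[d]) ** 2 / (n * expected[d])
                 for d in range(1, 10))
    return observed, expected, chi_sq
```

Naturally occurring financial amounts tend to track the expected frequencies (about 30.1% of values leading with 1); invented figures often do not.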


Louise Francis FCAS, MAAA
"
The Financial Crisis: An Actuary’s View" December 2008
<download>

Risk Management:
The Current Financial Crisis, Lessons Learned and Future Implications

Presented by the Society of Actuaries, the Casualty Actuarial Society and the Canadian
Institute of Actuaries, 2008

 

Louise Francis FCAS, MAAA and Richard Derrig, PhD.
"
Distinguishing the Forest from the TREES: A Comparison of Tree Based Data Mining Methods" September 2005
<download>

Abstract
In recent years a number of "data mining" approaches for modeling data containing nonlinear and other complex dependencies have appeared in the literature. One of the key data mining techniques is decision trees, also referred to as classification and regression trees or CART (Breiman et al., 1993). That method results in relatively easy-to-apply decision rules that partition data and model many of the complexities in insurance data. In recent years considerable effort has been expended to improve the quality of the fit of regression trees. These new methods are based on ensembles or networks of trees and carry names like TREENET and Random Forest. Viaene et al. (2002) compared several data mining procedures, including tree methods and logistic regression, for prediction accuracy on a small fixed data set of fraud indicators or "red flags". They found simple logistic regression did as well at predicting expert opinion as the more sophisticated procedures. In this paper we will introduce some available regression tree approaches and explain how they are used to model nonlinear dependencies in insurance claim data. We investigate the relative performance of several software products in predicting the key claim variables for the decision to investigate for excessive and/or fraudulent practices, and the expectation of favorable results from the investigation, in a large claim database. Among the software programs we will investigate are CART, S-PLUS, TREENET, Random Forest and Insightful Miner Tree procedures. The data used for this analysis are the approximately 500,000 auto injury claims reported to the Detailed Claim Database (DCD) of the Automobile Insurers Bureau of Massachusetts from accident years 1995 through 1997. The decision to order an independent medical examination or a special investigation for fraud, and the favorable outcomes of such decisions, are the modeling targets.
We find that the methods all provide some predictive value or lift from the available DCD variables with significant differences among the methods and the four targets. All modeling outcomes are compared to logistic regression as in Viaene et al. with some model/software combinations doing significantly better than the logistic model.

Keywords: Fraud, Data Mining, ROC Curve, Variable Importance, Decision Trees
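The DCD data and the commercial packages above are not freely available, but the paper's comparison, tree-based methods versus a logistic regression baseline, scored by ROC AUC, can be sketched with scikit-learn on synthetic data. Everything below (features, the nonlinear target, parameter choices) is illustrative:

```python
# Toy comparison in the spirit of the paper: CART-style tree and a
# random forest versus logistic regression, compared by ROC AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))
# Nonlinear target: an interaction and a threshold effect, which tree
# methods can pick up but a linear score cannot represent directly.
logit = 1.5 * (X[:, 0] * X[:, 1]) + 2.0 * (X[:, 2] > 1) - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "cart": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
auc = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

On data with genuine interactions, the ensemble's lift over the logistic baseline mirrors the paper's finding that some model/software combinations beat the linear model.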

 

Louise Francis FCAS, MAAA and Richard Derrig, PhD.
" Comparison of Methods and Software for Modeling Nonlinear Dependencies: A Fraud Application" September 2005

<download>

Abstract
In recent years a number of approaches for modeling data containing nonlinear and other complex dependencies have appeared in the literature. These procedures include classification and regression trees, neural networks, regression splines and naïve Bayes. Viaene et al. (2002) compared several of these procedures, as well as a classical linear model, logistic regression, for prediction accuracy on a small fixed data set of fraud indicators or "red flags". They found simple logistic regression did as well at predicting expert opinion as the more sophisticated procedures. In this paper we will introduce some available common data mining approaches and explain how they are used to model nonlinear dependencies in insurance claim data. We investigate the relative performance of several software products in predicting the key claim variables for the decision to investigate for excessive and/or fraudulent practices in a large claim database. Among the software programs we will investigate are MARS, CART, S-PLUS, TREENET and Insightful Miner. The data used for this analysis are the approximately 500,000 auto injury claims reported to the Detailed Claim Database (DCD) of the Automobile Insurers Bureau of Massachusetts from accident years 1995 through 1997. The decision to order an independent medical examination or a special investigation for fraud are the modeling targets. We find that the methods all provide some predictive value or lift from the available DCD variables with significant differences among the methods and the two targets. All modeling outcomes are compared to logistic regression as in Viaene et al. with some model/software combinations doing significantly better than the logistic model.

Keywords: Fraud, Data Mining, ROC Curve, Variable Importance
International Congress of Actuaries - Paris - May 28-June 2, 2006

 

Louise Francis FCAS, MAAA. "Taming Text, An Introduction to Text Mining" September 2005
<download>

Abstract
Motivation. One of the newest areas of data mining is text mining. Text mining is used to extract information from free form text data such as that in claim description fields. This paper introduces the methods used to do text mining and applies the method to a simple example.

Method. The paper will describe the methods used to parse data into vectors of terms for analysis. It will then show how information can be extracted from the vectorized data that can be used to create new features for use in analysis. Focus will be placed on the method of clustering for finding patterns in unstructured text information.

Results. The paper shows how feature variables can be created from unstructured text information and used for prediction.

Conclusions. Text mining has significant potential to expand the amount of information that is available to insurance analysts for exploring and modeling data.

Availability. Free software that can be used to perform some of the analyses in this paper is described in the appendix.

Keywords. Predictive modeling, data mining, text mining, statistical analysis
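The parse-then-cluster workflow the abstract describes, turning free-form claim descriptions into term vectors and clustering them, can be sketched with scikit-learn. The claim snippets below are invented for illustration:

```python
# Vectorize free-form claim descriptions into tf-idf term weights,
# then cluster them to create a new feature for downstream modeling.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

claims = [
    "rear end collision neck strain",
    "rear ended whiplash neck strain",
    "slip and fall wet floor back injury",
    "fall from ladder back injury",
]
X = TfidfVectorizer().fit_transform(claims)  # documents -> term vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster id per claim; usable as a feature variable
```

Here the collision-related and fall-related descriptions share vocabulary, so they land in separate clusters; on real data the cluster labels become candidate predictors.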

 

Louise Francis FCAS, MAAA. "Dancing with Dirty Data" March 2005
<download>

Abstract
Much of the data that actuaries work with is dirty. That is, the data contain errors, miscodings, missing values and other flaws that affect the validity of analyses performed with such data. This paper will give an overview of methods that can be used to detect errors and remediate data problems. The methods will include outlier detection procedures from the exploratory data analysis and data mining literature as well as methods from research on coping with missing values. The paper will also address the need for accurate and comprehensive metadata.

Conclusions. A number of graphical tools such as histograms and box and whisker plots are useful in highlighting unusual values in data. A new tool based on data spheres appears to have the potential to screen multiple variables simultaneously for outliers. For remediating missing data problems, imputation is a straightforward and frequently used approach.

Availability. The R statistical language can be used to perform the exploratory and cleaning methods described in this paper. It can be downloaded for free at http://cran.r-project.org/.
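The paper works in R; the same two ideas, the box-and-whisker (Tukey "IQR fence") outlier screen and a simple mean imputation, can be sketched in Python's standard library. The helper names are our own:

```python
# Two basic data-cleaning screens: Tukey fences for outliers and
# mean imputation for missing values.
import statistics

def iqr_fences(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot whisker rule)."""
    q = statistics.quantiles(values, n=4)  # Q1, median, Q3
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

def impute_mean(values):
    """Replace missing entries (None) with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]
```

Mean imputation is the "straightforward and frequently used approach" the abstract mentions; more careful remediation (e.g. model-based imputation) is covered in the paper itself.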

 

Louise Francis FCAS, MAAA. "Martian Chronicles: Is MARS better than Neural Networks?" March 2003.
<download>

Abstract
This paper will introduce the neural network technique of analyzing data as a
generalization of more familiar linear models such as linear regression. The reader is introduced to the traditional explanation of neural networks as being modeled on the functioning of neurons in the brain. Then a comparison is made of the structure and function of neural networks to that of linear models that the reader is more familiar with.

The paper will then show that backpropagation neural networks with a single hidden layer are universal function approximators. The paper will also compare neural networks to procedures such as Factor Analysis which perform dimension reduction. The application of both the neural network method and classical statistical procedures to insurance problems such as the prediction of frequencies and severities is illustrated.

One key criticism of neural networks is that they are a "black box". Data goes into the "black box" and a prediction comes out of it, but the nature of the relationship between independent and dependent variables is usually not revealed. Several methods for interpreting the results of a neural network analysis, including a procedure for visualizing the form of the fitted function, will be presented.
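The abstract's central point, that a one-hidden-layer network generalizes a linear model and can approximate nonlinear functions, can be illustrated with scikit-learn. The data and settings below are our own synthetic example, not from the paper:

```python
# A single-hidden-layer network versus linear regression on a curve
# a straight line cannot fit (the universal-approximation theme).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=(500, 1))
y = x[:, 0] ** 2 + 0.05 * rng.normal(size=500)  # U-shaped target

linear = LinearRegression().fit(x, y)
net = MLPRegressor(hidden_layer_sizes=(20,), activation="tanh",
                   solver="lbfgs", max_iter=5000,
                   random_state=0).fit(x, y)

r2_linear = linear.score(x, y)  # near zero: x and x^2 are uncorrelated here
r2_net = net.score(x, y)        # hidden layer recovers the curve
```

With zero hidden units the network collapses to exactly the linear model, which is the "generalization of more familiar linear models" the paper develops.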


Louise Francis, FCAS, MAAA. "Neural Networks Demystified." March 2001.
<download>

Abstract:
This paper will introduce the neural network technique of analyzing data as a
generalization of more familiar linear models such as linear regression. The reader is introduced to the traditional explanation of neural networks as being modeled on the functioning of neurons in the brain. Then a comparison is made of the structure and function of neural networks to that of linear models that the reader is more familiar with.

The paper will then show that backpropagation neural networks with a single hidden layer are universal function approximators. The paper will also compare neural networks to procedures such as Factor Analysis which perform dimension reduction. The application of both the neural network method and classical statistical procedures to insurance problems such as the prediction of frequencies and severities is illustrated.

One key criticism of neural networks is that they are a "black box". Data goes into the "black box" and a prediction comes out of it, but the nature of the relationship between independent and dependent variables is usually not revealed. Several methods for interpreting the results of a neural network analysis, including a procedure for visualizing the form of the fitted function, will be presented.

 

Louise Francis FCAS, MAAA. "A Model for Combining Timing, Interest Rate and Aggregate Loss Risk." 1998.
<download>

Abstract:
The purpose of this paper is to develop a simple model for determining distributions of present value estimates of aggregate losses. Three random components of the model that will be described are aggregate losses, payout patterns, and interest rates. In addition, this paper addresses the impact of timing and investment variability on risk margin/solvency requirements.
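The three random components the abstract names, aggregate losses, payout patterns, and interest rates, lend themselves to a small Monte Carlo sketch. The distributions and parameters below are purely illustrative, not the paper's model specification:

```python
# Simulate present values of aggregate losses under random frequency,
# severity, and interest rates, discounted along a fixed payout pattern.
import numpy as np

rng = np.random.default_rng(3)
n_sims = 10_000
pattern = np.array([0.4, 0.3, 0.2, 0.1])  # share of losses paid in years 1-4

# Aggregate losses: Poisson claim counts, lognormal severities.
counts = rng.poisson(lam=50, size=n_sims)
agg = np.array([rng.lognormal(mean=8.0, sigma=1.0, size=c).sum()
                for c in counts])

# Stochastic interest rates: one simulated flat rate per scenario.
rates = rng.normal(loc=0.05, scale=0.01, size=n_sims).clip(min=0.0)

# Discount each year's payment back to time zero.
years = np.arange(1, len(pattern) + 1)
discount = (1 + rates[:, None]) ** -years  # shape: n_sims x 4
pv = (agg[:, None] * pattern * discount).sum(axis=1)
```

The spread of `pv` across scenarios is the distribution of present value estimates the paper studies; tail percentiles of that distribution speak to the risk margin/solvency question raised above.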