Resources for Learning and Exploration - Papers
Louise Francis and Virginia R. Prevosto
"Data and Disaster: The
Role of Data in the Financial Crisis"
Casualty Actuarial Society E-Forum, Spring 2010
Abstract
Motivation. Since 2007 a global financial crisis has
been unfolding. The crisis was initially caused by defaults
on subprime loans, aided and abetted by pools of asset-backed
securities and credit derivatives, but corporate defaults, such
as that of Lehman Brothers, and outright fraud have also contributed
to the crisis. Little research has been published investigating
the role of data issues in various aspects of the financial
crisis. In this paper we illustrate how data that was available
to underwriters, credit agencies, the Securities and Exchange
Commission
(SEC), and fund managers could have been used to detect the
problems that led to the financial crisis.
Method. In this paper we show that data quality played
a significant role in the mispricing and business intelligence
errors that caused the crisis. We utilize a number of relatively
simple statistics to illustrate the due diligence that should
have been, but was not, performed. We use the Madoff fraud and the
mortgage meltdown as data quality case studies. We apply simple
exploratory procedures to illustrate techniques that could have been
used to detect problems. We also illustrate some modeling methods
that could have been used to help underwrite mortgages and find
indications of fraud.
Conclusions. Data quality issues made a significant contribution
to the global financial crisis.
Keywords. Data, data quality, financial crisis
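As an illustration of the kind of simple due-diligence statistics the paper has in mind, the sketch below screens a hypothetical loan-level data set for implausible values. The records, column names, and thresholds are invented for illustration; they are not the paper's data.

```python
# Minimal sketch of simple due-diligence screening on hypothetical loan data.
# All records, column names, and thresholds are invented for illustration.
import pandas as pd

loans = pd.DataFrame({
    "loan_to_value":   [0.80, 0.95, 1.60, 0.75, 2.40],
    "reported_income": [62000, 0, 85000, 54000, 1200000],
    "fico":            [710, 645, 900, 688, 590],
})

# Basic descriptive statistics often expose impossible or implausible values.
print(loans.describe())

# Flag records whose values fall outside plausible ranges.
suspect = loans[
    (loans["loan_to_value"] > 1.25)        # loan far exceeds property value
    | (loans["reported_income"] <= 0)      # missing or nonsensical income
    | (~loans["fico"].between(300, 850))   # FICO outside its defined range
]
print(f"{len(suspect)} of {len(loans)} records fail simple plausibility checks")
```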
Louise Francis FCAS, MAAA
" The Financial Crisis:
An Actuarys View" December 2008
Risk Management: The Current Financial Crisis, Lessons Learned
and Future Implications. Presented by the Society of Actuaries,
the Casualty Actuarial Society and the Canadian Institute of Actuaries, 2008
Louise Francis FCAS, MAAA and Richard
Derrig, PhD.
" Distinguishing the Forest
from the TREES: A Comparison of Tree Based Data Mining Methods"
September 2005
Abstract
In recent years a number of "data mining" approaches
for modeling data containing nonlinear and other complex dependencies
have appeared in the literature. One of the key data mining
techniques is decision trees, also referred to as classification
and regression trees or CART (Breiman et al., 1993). That method
results in relatively easy-to-apply decision rules that partition
data and model many of the complexities in insurance data. In
recent years considerable effort has been expended to improve
the quality of the fit of regression trees. These new methods
are based on ensembles or networks of trees and carry names
like TREENET and Random Forest. Viaene et al. (2002) compared
several data mining procedures, including tree methods and logistic
regression, for prediction accuracy on a small fixed data set
of fraud indicators or "red flags". They found simple
logistic regression did as well at predicting expert opinion
as the more sophisticated procedures. In this paper we will
introduce some available regression tree approaches and explain
how they are used to model nonlinear dependencies in insurance
claim data. We investigate the relative performance of several
software products in predicting the key claim variables for
the decision to investigate for excessive and/or fraudulent
practices, and the expectation of favorable results from the
investigation, in a large claim database. Among the software
programs we will investigate are CART, S-PLUS, TREENET, Random
Forest and Insightful Miner Tree procedures. The data used for
this analysis are the approximately 500,000 auto injury claims
reported to the Detailed Claim Database (DCD) of the Automobile
Insurers Bureau of Massachusetts from accident years 1995 through
1997. The decision to order an independent medical examination
or a special investigation for fraud, and the favorable outcomes
of such decisions, are the modeling targets. We find that the
methods all provide some predictive value or lift from the available
DCD variables with significant differences among the methods
and the four targets. All modeling outcomes are compared to
logistic regression as in Viaene et al. with some model/software
combinations doing significantly better than the logistic model.
Keywords: Fraud, Data Mining, ROC Curve, Variable Importance,
Decision Trees
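For readers who want to experiment with the comparison the paper describes, here is a minimal sketch that scores tree ensembles against logistic regression by ROC AUC. Open-source scikit-learn models stand in for the commercial CART, TREENET, Random Forest, and Insightful Miner products, and the data are synthetic, since the Massachusetts DCD claims are not public.

```python
# Hedged sketch: compare tree ensembles with logistic regression by ROC AUC.
# scikit-learn models stand in for the commercial products; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic target, loosely mimicking a rare "investigate" flag.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "boosted trees (TREENET-like)": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```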
Louise Francis FCAS, MAAA and Richard
Derrig, PhD.
" Comparison of Methods and Software for Modeling Nonlinear
Dependencies: A Fraud Application" September 2005
Abstract
In recent years a number of approaches for modeling data containing
nonlinear and other complex dependencies have appeared in the
literature. These procedures include classification and regression
trees, neural networks, regression splines and naïve Bayes.
Viaene et al. (2002) compared several of these procedures, as
well as a classical linear model, logistic regression, for prediction
accuracy on a small fixed data set of fraud indicators or "red
flags". They found simple logistic regression did as well
at predicting expert opinion as the more sophisticated procedures.
In this paper we will introduce some available common data mining
approaches and explain how they are used to model nonlinear
dependencies in insurance claim data. We investigate the relative
performance of several software products in predicting the key
claim variables for the decision to investigate for excessive
and/or fraudulent practices in a large claim database. Among
the software programs we will investigate are MARS, CART, S-PLUS,
TREENET and Insightful Miner. The data used for this analysis
are the approximately 500,000 auto injury claims reported to
the Detailed Claim Database (DCD) of the Automobile Insurers
Bureau of Massachusetts from accident years 1995 through 1997.
The decision to order an independent medical examination or
a special investigation for fraud are the modeling targets.
We find that the methods all provide some predictive value or
lift from the available DCD variables with significant differences
among the methods and the two targets. All modeling outcomes
are compared to logistic regression as in Viaene et al. with
some model/software combinations doing significantly better
than the logistic model.
Keywords: Fraud, Data Mining, ROC Curve, Variable Importance
International Congress of Actuaries - Paris - May 28-June 2,
2006
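A rough open-source sketch of the core idea, capturing a nonlinear dependency with regression splines, follows. It uses scikit-learn's spline basis expansion on simulated data as a stand-in for MARS, which additionally selects knots and interactions automatically; it is not the paper's own analysis.

```python
# Rough sketch of modeling a nonlinear dependency with spline basis functions,
# in the spirit of MARS-style regression splines. Data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(2000, 1))
y = np.sin(x[:, 0]) + 0.1 * x[:, 0] + rng.normal(scale=0.2, size=2000)

linear = LinearRegression().fit(x, y)
spline = make_pipeline(SplineTransformer(degree=3, n_knots=8),
                       LinearRegression()).fit(x, y)

print("R^2, plain linear model: %.3f" % linear.score(x, y))
print("R^2, spline expansion:   %.3f" % spline.score(x, y))
```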
Louise Francis FCAS, MAAA. "Taming
Text, An Introduction to Text Mining" September 2005
Abstract
Motivation. One of the newest areas of data mining is
text mining. Text mining is used to extract information from
free form text data such as that in claim description fields.
This paper introduces the methods used to do text mining and
applies the method to a simple example.
Method. The paper will describe the methods used to parse
data into vectors of terms for analysis. It will then show how
information can be extracted from the vectorized data that can
be used to create new features for use in analysis. Focus will
be placed on the method of clustering for finding patterns in
unstructured text information.
Results. The paper shows how feature variables can be
created from unstructured text information and used for prediction.
Conclusions. Text mining has significant potential to
expand the amount of information that is available to insurance
analysts for exploring and modeling data.
Availability. Free software that can be used to perform
some of the analyses in this paper is described in
the appendix.
Keywords. Predictive modeling, data mining, text mining,
statistical analysis
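The workflow the abstract outlines, parsing free-form text into term vectors and clustering the result to create new features, can be sketched in a few lines. The claim descriptions below are invented examples, and scikit-learn stands in for the free software discussed in the paper's appendix.

```python
# Sketch of the text-mining workflow: free-form text -> term vectors -> clusters.
# The claim descriptions are invented examples, not real claim data.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

claims = [
    "rear end collision neck strain",
    "slip and fall wet floor back injury",
    "whiplash from rear impact at stop light",
    "fall on icy sidewalk fractured wrist",
]

tfidf = TfidfVectorizer(stop_words="english")   # parse terms into weighted vectors
X = tfidf.fit_transform(claims)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster membership can serve as a new feature variable
```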
Louise Francis FCAS, MAAA. "Dancing
with Dirty Data" March 2005
Abstract
Much of the data that actuaries work with is dirty. That is,
the data contain errors, miscodings, missing values and other
flaws that affect the validity of analyses performed with such
data. This paper will give an overview of methods that can be
used to detect errors and remediate data problems. The methods
will include outlier detection procedures from the exploratory
data analysis and data mining literature as well as methods
from research on coping with missing values. The paper will
also address the need for accurate and comprehensive metadata.
Conclusions. A number of graphical tools such as histograms
and box and whisker plots are useful in highlighting unusual
values in data. A new tool based on data spheres appears to
have the potential to screen multiple variables simultaneously
for outliers. For remediating missing data problems, imputation
is a straightforward and frequently used approach.
Availability. The R statistical language can be used to perform
the exploratory and cleaning methods described in this paper.
It can be downloaded for free at http://cran.r-project.org/.
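Although the paper points to R, the box-and-whisker outlier rule and simple median imputation it discusses can also be sketched with pandas; the small severity series below is invented for illustration.

```python
# Sketch of two of the paper's themes: flagging outliers with the box-and-whisker
# (interquartile range) rule and filling missing values by simple imputation.
# The small severity series is invented for illustration.
import numpy as np
import pandas as pd

severity = pd.Series([1200, 950, 1100, np.nan, 1050, 980, 250000, np.nan, 1300])

# Box-and-whisker rule: values beyond 1.5 * IQR from the quartiles are suspects.
q1, q3 = severity.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = severity[(severity < q1 - 1.5 * iqr) | (severity > q3 + 1.5 * iqr)]
print("flagged outliers:\n", outliers)

# Simple imputation: replace missing values with the median of the observed data.
cleaned = severity.fillna(severity.median())
print("after imputation:\n", cleaned)
```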
Louise Francis FCAS, MAAA. "Martian
Chronicles: Is MARS better than Neural Networks?" March 2003.
Louise Francis, FCAS, MAAA. "Neural
Networks Demystified." March 2001.
Abstract:
This paper will introduce the neural network technique of analyzing
data as a
generalization of more familiar linear models such as linear
regression. The reader is introduced to the traditional explanation
of neural networks as being modeled on the functioning of neurons
in the brain. Then a comparison is made of the structure and
function of neural networks to that of linear models that the
reader is more familiar with.
The paper will then show that backpropagation neural networks
with a single hidden layer are universal function approximators.
The paper will also compare neural networks to procedures such
as Factor Analysis which perform dimension reduction. The application
of both the neural network method and classical statistical
procedures to insurance problems such as the prediction of frequencies
and severities is illustrated.
One key criticism of neural networks is that they are a "black
box". Data goes into the "black box" and a prediction
comes out of it, but the nature of the relationship between
independent and dependent variables is usually not revealed.
Several methods for interpreting the results of a neural network
analysis, including a procedure for visualizing the form of
the fitted function, will be presented.
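A minimal sketch of the paper's central idea, a single-hidden-layer backpropagation network as a flexible generalization of linear regression, is given below using scikit-learn on simulated data; it is not the paper's own implementation.

```python
# Sketch: a one-hidden-layer neural network as a generalization of linear
# regression, fit to simulated nonlinear data (not insurance data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3000, 2))
y = np.exp(-X[:, 0] ** 2) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=3000)

linear = LinearRegression().fit(X, y)
net = MLPRegressor(hidden_layer_sizes=(10,),  # a single hidden layer of 10 units
                   max_iter=5000, random_state=0).fit(X, y)

print("R^2, linear regression:    %.3f" % linear.score(X, y))
print("R^2, one-hidden-layer net: %.3f" % net.score(X, y))
```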
Louise Francis FCAS, MAAA. "A Model
for Combining Timing, Interest Rate and Aggregate Risk Loss."
1998.
Abstract:
The purpose of this paper is to develop a simple model for determining
distributions of present value estimates of aggregate losses.
Three random components of the model that will be described
are aggregate losses, payout patterns, and interest rates. In
addition, this paper addresses the impact of timing and investment
variability on risk margin/solvency requirements.
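A hedged Monte Carlo sketch of the kind of model the abstract describes follows: aggregate losses, payout patterns, and interest rates are each drawn at random and combined into a distribution of present values. All distributions and parameters are illustrative assumptions, not those of the paper.

```python
# Hedged Monte Carlo sketch: present value of aggregate losses when the aggregate
# amount, the payout pattern, and interest rates are all random.
# All distributions and parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_years = 10_000, 5

# Aggregate losses: Poisson claim counts with lognormal severities.
counts = rng.poisson(lam=100, size=n_sims)
agg_loss = np.array([rng.lognormal(mean=9.0, sigma=1.0, size=n).sum() for n in counts])

# Random payout pattern: Dirichlet draws around an assumed average pattern.
avg_pattern = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
patterns = rng.dirichlet(avg_pattern * 50, size=n_sims)

# Random interest rates: lognormal around 5%, one path per simulation.
rates = rng.lognormal(mean=np.log(0.05), sigma=0.15, size=(n_sims, n_years))
discount = np.cumprod(1.0 / (1.0 + rates), axis=1)

# Distribution of present values of the aggregate losses.
pv = (agg_loss[:, None] * patterns * discount).sum(axis=1)
print("mean PV: %.0f   95th percentile: %.0f" % (pv.mean(), np.percentile(pv, 95)))
```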