Identify Fraud From Enron Email

by Adrian Liaw

Background (Question #1)

Enron Corporation was an American energy, commodities, and services company, and one of the world's major electricity, natural gas, communications, and pulp and paper companies. In 2001 it suffered one of the largest bankruptcies in history, following the revelation of the Enron scandal, an institutionalised, systematic, and creatively planned accounting fraud. Key people involved in the scandal included Kenneth Lay, Jeffrey Skilling, and Andrew Fastow.

The goal of this project is to build a person-of-interest (POI) identifier using machine learning, one which flags people who might have been involved in the scandal. The dataset used in this project contains financial data on Enron employees together with their e-mails. Machine learning is useful here because there are far too many e-mails to analyse by hand, and human intuition alone can't turn this kind of data into a systematic, predictive model.

Dataset Overview

Summary

This table gives us the total number of data points, number of POIs/non-POIs, and number of features.

Characteristic Quantity
data points 146
POIs 18
non-POIs 128
features 21

The allocation across the two classes (POI / non-POI) is highly skewed: only about 12% of the data points are labelled as POIs.
There are 21 features in the dataset. One of them, poi, is the label for our machine learning algorithms. Another feature, email_address, is not an actual feature the algorithms will use, but a reference to the e-mails; I'm going to add a large number of "word features" generated by parsing those e-mails.

Missing Values

This table gives us the percentage of data points without value for each feature.

Feature Percentage of data points without value
loan_advances 0.97
restricted_stock_deferred 0.88
director_fees 0.88
deferral_payments 0.73
deferred_income 0.66
long_term_incentive 0.55
bonus 0.44
from_messages 0.41
from_this_person_to_poi 0.41
shared_receipt_with_poi 0.41
to_messages 0.41
from_poi_to_this_person 0.41
other 0.36
expenses 0.35
salary 0.35
exercised_stock_options 0.30
restricted_stock 0.25
email_address 0.24
total_stock_value 0.14
total_payments 0.14
poi 0.00

Some features have many missing values; the table above is sorted in descending order of missing percentage.
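
For reference, here is a minimal sketch of how such a table can be reproduced, assuming the dataset dictionary marks missing values with the string "NaN" (as final_project_dataset.pkl does):

```python
import pickle

# Load the project dataset: a dict mapping person names to feature dicts,
# where missing values are encoded as the string "NaN"
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

n_points = len(data_dict)
feature_names = next(iter(data_dict.values())).keys()

# Fraction of data points with a missing value, per feature
missing = {
    name: sum(row[name] == "NaN" for row in data_dict.values()) / n_points
    for name in feature_names
}

# Print in descending order, like the table above
for name, fraction in sorted(missing.items(), key=lambda kv: -kv[1]):
    print("%-28s %.2f" % (name, fraction))
```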

Outliers

There are 2 outliers in the dataset, TOTAL and THE TRAVEL AGENCY IN THE PARK. These two data points are clearly not people, so they're definitely outliers and I removed them from the dataset. The dataset now has 144 data points.

Features (Question #2)

Feature Engineering

I created 2 new features based on the original dataset: fraction_from_poi and fraction_to_poi. fraction_from_poi is the fraction of all messages to this person that come from POIs; fraction_to_poi is the fraction of all messages from this person that are sent to POIs. I created these two features because a person who exchanges a high proportion of their e-mail with POIs may well be a POI themselves. fraction_from_poi depends on from_poi_to_this_person and to_messages, so if either of those is NaN, fraction_from_poi will be NaN; the same goes for fraction_to_poi, which depends on from_this_person_to_poi and from_messages.
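
Here is a minimal sketch of these two features as described above (poi_id.py at the end of this report implements the same idea, defaulting missing values to 0 instead); it assumes the usual dict-of-dicts layout with "NaN" marking missing values:

```python
import pickle

with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

def compute_fraction(poi_messages, all_messages):
    """Fraction of messages exchanged with POIs; "NaN" if either count is missing."""
    if "NaN" in (poi_messages, all_messages):
        return "NaN"
    return float(poi_messages) / all_messages

for person in data_dict.values():
    person["fraction_from_poi"] = compute_fraction(
        person["from_poi_to_this_person"], person["to_messages"])
    person["fraction_to_poi"] = compute_fraction(
        person["from_this_person_to_poi"], person["from_messages"])
```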

I also parsed the content of all applicable e-mails and vectorised them into numerical values, which became features in the dataset. I used TfidfVectorizer from scikit-learn; it generated a 144x73618 matrix, which means there are 73618 word features in total. Summing everything up (financial, e-mail and word features), we have 73639 features. Of course we're not going to use them all in the final model; we'll do some feature selection first.
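
A minimal sketch of this vectorisation step (the actual code lives in poi_id.py below); it assumes word_data.pkl, built by emails.py, maps each person's name to a single string of stemmed e-mail text:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

with open("word_data.pkl", "rb") as f:
    word_data = pickle.load(f)

vectorizer = TfidfVectorizer(stop_words="english")
# Sort by name so the rows line up with the financial / e-mail features
tf = vectorizer.fit_transform([text for name, text in sorted(word_data.items())])

print(tf.shape)  # (144, 73618) in my run
```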

Feature Selection

I didn't select the features by hand; I used SelectKBest from scikit-learn instead, which selected them automatically. I kept the 9 best features. The value k=9 is the result of running GridSearchCV on a Pipeline of SelectKBest plus LinearSVC; it gave the best $F_1$ score. Note that LinearSVC isn't my final model, it's just there to help choose k.
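
A condensed sketch of that search (the full version is in poi_id.py below); features and labels are assumed to be the scaled feature matrix and POI labels built earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.svm import LinearSVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# LinearSVC is only a scoring proxy here, not the final model
pipeline = Pipeline([
    ("select", SelectKBest()),
    ("svm", LinearSVC()),
])

searcher = GridSearchCV(pipeline,
                        {"select__k": list(range(2, 20))},
                        scoring="f1",
                        cv=StratifiedKFold(labels, 10))
searcher.fit(features, labels)

print(searcher.best_params_)  # {'select__k': 9} in my run
```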

The final 9 features are: [total_stock_value exercised_stock_options _audit _behavior _breakfast _embed _ene _merchant _prestohouectect] (feature names starting with "_" are word features; the others are e-mail or financial features). 2 financial features and 7 word features were selected; no e-mail features made the cut, and neither did fraction_from_poi or fraction_to_poi.

Here's a table of the selected features, ordered by score. "Score" is the score SelectKBest computed for that feature (bigger is better); "p-value" is the p-value of that score (smaller is better).

Feature Score p-value
_behavior 32.15506493 7.68150329e-08
_prestohouectect 29.87398197 2.01474034e-07
_ene 28.11011323 4.29039611e-07
_breakfast 25.89880301 1.12148645e-06
exercised_stock_options 25.09754153 1.59454385e-06
_merchant 24.84351812 1.78352260e-06
total_stock_value 24.46765405 2.10580665e-06
_audit 23.97568818 2.61906165e-06
_embed 23.1934391 3.71096211e-06

Let's also see the scores of two features I created:

Feature Score p-value
fraction_to_poi 16.6417070705 7.49415402503e-05
fraction_from_poi 3.21076191697 0.0752849005991

Apparently, they aren't strong enough compared to the 9 features above.

Feature Scaling

At first, I applied feature scaling to all the features universally, because that makes it easier to test out different classifiers. Feature scaling can make a huge difference to the performance of some algorithms, e.g. SVM and K-means clustering; other algorithms, e.g. decision trees and linear regression, are not affected by it.

In my final model, I applied feature scaling with MinMaxScaler, which maps each feature into the range 0 to 1: the smallest value becomes 0, the largest becomes 1. I did this because the final model involves an SVM, which is sensitive to feature scaling.
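
A tiny illustration of what MinMaxScaler does: each column is rescaled independently so that its minimum maps to 0 and its maximum maps to 1 (the numbers below are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[200000.0, 10.0],
              [800000.0, 40.0],
              [1400000.0, 25.0]])

# Each column is scaled on its own:
# [[0.   0. ]
#  [0.5  1. ]
#  [1.   0.5]]
print(MinMaxScaler().fit_transform(X))
```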

Algorithm Selection (Question #3)

Investigation

I tried a variety of classifiers provided by scikit-learn.
LinearSVC's performance first caught my eye: it hit a precision score of 1.0 using our testing script, and about 0.5 for recall, which means that whenever the classifier labelled a person as a POI, we could be almost certain the person really was a POI.
GaussianNB isn't outstanding on precision, but it is pretty good at recall, hitting 0.83.
DecisionTreeClassifier isn't particularly outstanding on either precision or recall; both are around 0.5.
LogisticRegression turned out to behave similarly to LinearSVC, but since we're not expecting probabilities as outputs, I didn't use it in my final model.
I also tried some ensemble methods such as RandomForestClassifier and AdaBoostClassifier; using AdaBoostClassifier with a DecisionTreeClassifier as its base_estimator gave better results than a plain DecisionTreeClassifier.
Here's a table of model performances (using default parameters without tuning them):

Algorithm Accuracy Precision Recall $F_1$ $F_2$ $F_{0.5}$
LinearSVC 0.935 1.000 0.514 0.679 0.570 0.841
GaussianNB 0.919 0.655 0.832 0.733 0.789 0.684
DecisionTreeClassifier 0.876 0.540 0.466 0.500 0.479 0.524
LogisticRegression 0.885 1.000 0.135 0.238 0.163 0.438
RandomForestClassifier 0.902 0.741 0.409 0.527 0.449 0.637
AdaBoostClassifier 0.886 0.573 0.564 0.569 0.566 0.571
GradientBoostingClassifier 0.886 0.609 0.414 0.493 0.443 0.557
ExtraTreesClassifier 0.899 0.679 0.464 0.551 0.495 0.621

Final Algorithm

I ended up using an ensemble method, VotingClassifier, with 3 member classifiers: LinearSVC, GaussianNB and AdaBoostClassifier. Of course this involves parameter tuning for each classifier. VotingClassifier is exactly what it sounds like: it takes a list of classifiers, then determines the result by majority rule.
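
A minimal sketch of the ensemble (tuned parameters omitted here; see poi_id.py below). With hard voting and three members, a person is labelled a POI only when at least two of the three classifiers agree:

```python
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier

clf = VotingClassifier([
    ("svm", LinearSVC(C=1.3)),
    ("nb", GaussianNB()),
    ("ada", AdaBoostClassifier(DecisionTreeClassifier())),
])
# clf.fit(features_train, labels_train) / clf.predict(features_test) as usual
```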

Parameter Tuning (Question #4)

Parameter tuning means tweaking an algorithm's settings, and it's an important step when building machine learning models. Depending on the situation and your goal, you might want to tune the parameters to different values, which makes the classifier behave differently. A common mistake when tuning is done poorly is overfitting, which makes the classifier overly sensitive. An overfitted model tries to correctly label every training data point, but some of those points are just exceptional cases, and the model doesn't need to be that sensitive. As a result, an overfitted model has a high accuracy score on the training dataset but a low score on the testing dataset. Tuning the model's sensitivity in the wrong direction is exactly how overfitting happens.

Tuning My Model

I tuned LinearSVC by hand, trying out different values and observing how the performance varied under validation and evaluation. I found that with C=1.3 the classifier gives the best performance, with precision and recall of 0.99 and 0.52.
I tuned AdaBoostClassifier using GridSearchCV, which exhaustively searches over a specified parameter grid and automatically figures out the best set of parameters (a condensed sketch follows this paragraph).
For GaussianNB there was nothing to tune, because it takes no parameters.
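
Here is that condensed sketch of the AdaBoostClassifier search (the full parameter grid is in poi_id.py below); features and labels are assumed to be the selected, scaled arrays from the previous steps:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import fbeta_score, make_scorer

param_grid = {
    "base_estimator__max_depth": [None, 3, 5, 8, 10],
    "learning_rate": [0.01, 0.1, 0.5, 0.8, 1.0],
}

# F2 weights recall more heavily than precision
searcher = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier()),
                        param_grid,
                        scoring=make_scorer(fbeta_score, beta=2),
                        cv=StratifiedKFold(labels, 10))
searcher.fit(features, labels)

print(searcher.best_params_)
```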

Validation (Question #5)

Validation is the step of testing and evaluating machine learning models; we usually validate when trying out different algorithms or parameters, to see whether the model generalises to unseen data. A classic mistake, if you do it wrong, is overfitting, as described in the parameter tuning section above. Overfitting usually happens when you evaluate a classifier's performance on data it has already seen, instead of on an independent dataset. So the most basic form of validation is to split the dataset into a "training set" and a "testing set", where the training set is for training the model and the testing set is for testing and validating it.
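
The most basic split, sketched with scikit-learn's train_test_split (the project itself uses the cross-validation scheme described in the next section); features and labels are assumed to be the arrays built in poi_id.py, and GaussianNB stands in for the model:

```python
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import GaussianNB

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

clf = GaussianNB()
clf.fit(features_train, labels_train)         # learn only from the training set
print(clf.score(features_test, labels_test))  # evaluate on unseen data
```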

Validate My Model

In my project I used a very common validation technique called cross-validation. The core idea of cross-validation is to generate a collection of training/testing splits and then, for each split, train the model on the training set and validate it on the testing set. The most basic approach is k-fold, but because of our small dataset I used StratifiedShuffleSplit.
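
A minimal sketch of this scheme (tester.py does essentially the same thing, with precision/recall bookkeeping on top). StratifiedShuffleSplit keeps the POI / non-POI ratio roughly constant in every split, which matters with only 18 POIs; features and labels are assumed to be the arrays built as in poi_id.py, and GaussianNB stands in for the model:

```python
import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

features = np.asarray(features)
labels = np.asarray(labels)

clf = GaussianNB()
cv = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.1, random_state=42)

scores = []
for train_idx, test_idx in cv:
    clf.fit(features[train_idx], labels[train_idx])
    scores.append(clf.score(features[test_idx], labels[test_idx]))

print(np.mean(scores))  # average accuracy over all splits
```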

Evaluation Metrics (Question #6)

There are many metrics for evaluating performance. Accuracy is the most basic one, but since our dataset is highly skewed (the number of non-POIs is far greater than the number of POIs), it isn't an ideal evaluation metric here. Precision and recall are more appropriate, or the $F_{\beta}$ score if we want to consider both at once. Depending on your situation or goal, you might favour different metrics: if the POI identifier is meant to identify real criminals, we care more about precision, and it should be very high; if it's meant to flag people who might potentially be POIs for deeper investigation later, we care more about recall. The $F_{\beta}$ score is a kind of weighted average of precision and recall, where $\beta$ controls the weight given to recall; common values for $\beta$ are 1, 2 and 0.5.
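
For reference, the $F_{\beta}$ score is defined in terms of precision and recall as

$$F_{\beta} = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

so $\beta = 1$ weights them equally, $\beta = 2$ favours recall, and $\beta = 0.5$ favours precision.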

Model Performance

Here are the scores of my final model using different metrics:

metric score
Accuracy 0.93067
Precision 0.77273
Recall 0.68000
$F_1$ 0.72340
$F_2$ 0.69672
$F_{0.5}$ 0.75221

We've got an accuracy score of 0.93, which means the identifier correctly classifies whether a person is a POI 93% of the time.
The precision score of 0.77 means that when the identifier labels a person as a POI, that person actually is a POI 77% of the time.
The recall score of 0.68 means that if a person actually is a POI, the identifier flags them 68% of the time.
The $F_1$ score is the harmonic mean of precision and recall, which is basically a way of averaging them; here we get 0.72.
We can see that $F_2$ is more affected by recall, since it weights recall more, and $F_{0.5}$ is more affected by precision.
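
As a sanity check, these metrics can be recovered from the confusion counts reported by the tester.py run at the end of this report (1359 true positives, 401 false positives, 641 false negatives, 12599 true negatives out of 15000 predictions):

$$\text{precision} = \frac{1359}{1359 + 401} \approx 0.772, \quad \text{recall} = \frac{1359}{1359 + 641} \approx 0.680, \quad \text{accuracy} = \frac{1359 + 12599}{15000} \approx 0.931$$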

In [1]:
#############
# emails.py #
#############

import string
import pickle
from pathlib import Path
from nltk.stem import SnowballStemmer
from progress.bar import Bar


# Copied from text learning mini-project
def parse_out_text(path):
    """Get the content of an email"""

    words = ""
    stemmer = SnowballStemmer("english")

    with path.open() as f:
        all_text = f.read()
        content = all_text.split("X-FileName:")

        if len(content) > 1:
            # Remove punctuations and numbers
            text_string = content[1].translate({
                ord(char): None for char in (string.punctuation + string.digits)
            })

            # Stem each word, then combine into single string
            words = " ".join([stemmer.stem(x) for x in text_string.split()])

    return words


def get_all_words_by_author(email):
    """Get all the email content sent from a specific email address"""

    list_path = Path("emails_by_address/from_%s.txt" % email)
    contents = []

    # The corpus doesn't contain some of the email data
    if list_path.exists():

        with list_path.open() as f:
            # Every line of the reference file refers to an actual email
            for line in f:

                # The reference path is actually wrong in our case,
                # they all have a prefix of "enron_mail_20110402/maildir/",
                # but the maildir/ is actually located in our parent folder
                incorrect_path = Path(line.strip())
                real_path = Path("..", *incorrect_path.parts[1:]).resolve()

                contents.append(parse_out_text(real_path))

    # Return a single string which contains all of the content
    return " ".join(contents)


if __name__ == "__main__":
    with open("final_project_dataset.pkl", "rb") as f:
        dataset = pickle.load(f)

    word_data = {}

    # Output format: {"NAME": "CONTENT", ...}

    for name, row in Bar("Processing").iter(sorted(dataset.items())):
        # These two are outliers
        if name not in ("TOTAL", "THE TRAVEL AGENCY IN THE PARK"):
            if row["email_address"] != "NaN":
                word_data[name] = get_all_words_by_author(row["email_address"])
            else:
                word_data[name] = ""

    with open("word_data.pkl", "wb") as f:
        pickle.dump(word_data, f)
In [2]:
#############
# poi_id.py #
#############

import sys
import pickle
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.cross_validation import StratifiedKFold
from sklearn.pipeline import Pipeline

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

# Warning messages from GridSearchCV can be annoying
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".

fin_features = ["salary", "deferral_payments", "total_payments",
                "loan_advances", "bonus", "restricted_stock_deferred",
                "deferred_income", "total_stock_value", "expenses",
                "exercised_stock_options", "other", "long_term_incentive",
                "restricted_stock", "director_fees"]

eml_features = ["to_messages", "from_poi_to_this_person", "from_messages",
                "from_this_person_to_poi", "shared_receipt_with_poi",
                "fraction_to_poi", "fraction_from_poi"]

# This is not the final features_list, we'll build it with SelectKBest
all_features = ["poi"] + fin_features + eml_features

sys.stdout.write("Loading dataset...   ")
sys.stdout.flush()

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

sys.stdout.write("Done\n")

sys.stdout.write("Removing outliers...   ")
sys.stdout.flush()

### Task 2: Remove outliers
data_dict.pop("TOTAL")
data_dict.pop("THE TRAVEL AGENCY IN THE PARK")

sys.stdout.write("Done\n")

sys.stdout.write("Creating new features...   ")
sys.stdout.flush()

### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.

# New features:
#     1. fraction_from_poi: from_poi_to_this_person / to_messages
#     2. fraction_to_poi: from_this_person_to_poi / from_messages

for point in data_dict.values():
    # Default values for those having missing values
    point.update({"fraction_from_poi": 0, "fraction_to_poi": 0})

    if "NaN" not in (point["from_poi_to_this_person"], point["to_messages"]):
        point["fraction_from_poi"] = \
            point["from_poi_to_this_person"] / point["to_messages"]

    if "NaN" not in (point["from_this_person_to_poi"], point["from_messages"]):
        point["fraction_to_poi"] = \
            point["from_this_person_to_poi"] / point["from_messages"]

sys.stdout.write("Done\n")

sys.stdout.write("Loading word data...   ")
sys.stdout.flush()

# Pre-built e-mail content dict generated by emails.py
with open("word_data.pkl", "rb") as f:
    word_data = pickle.load(f)

sys.stdout.write("Done\n")

sys.stdout.write("Vectorizing email data...   ")
sys.stdout.flush()

vectorizer = TfidfVectorizer(stop_words="english")
# Sort by names, make it compatible with featureFormat
tf = vectorizer.fit_transform([x[1] for x in sorted(word_data.items())])

word_features = vectorizer.get_feature_names()

sys.stdout.write("Done\n")

sys.stdout.write("Preprocessing data...   ")
sys.stdout.flush()

# Here I'm not adding all the word features into data_dict or my_dataset,
# that will waste too much time and they'll be extremely large. Instead,
# I preprocess the data_dict with featureFormat, then concatenate with
# the matrix generated by TfidfVectorizer.
# After that, do feature scaling and selection, then transform the
# final numpy array into original dict format
data = featureFormat(data_dict, all_features,
                     remove_all_zeroes=False, sort_keys=True)

# Concatenate the two feature matrices column-wise with np.hstack
data = np.hstack((data, tf.toarray()))

labels, features = targetFeatureSplit(data)

# Feature scaling
# Note that this applies to the final dataset used by tester.py,
# see L110~L115 and L178~L185
features = MinMaxScaler().fit_transform(features)

# Add an underscore before every word feature name to avoid ambiguity with
# original features
for feature in word_features:
    all_features.append("_" + feature)

sys.stdout.write("Done\n")

sys.stdout.write("Selecting features...   ")
sys.stdout.flush()

# Feature selection
selector = SelectKBest()

# Switch to True to perform grid search
search = False

if search:
    sys.stdout.write("\n")

    estimator = Pipeline([
        ("select", SelectKBest()),
        # Note that this SVM isn't the final model
        ("svm", LinearSVC()),
    ])
    # More than 20 is too much
    params = {"select__k": list(range(2, 20))}

    # Run 2 jobs at the same time, also print the progress into console
    # Here I use StratifiedKFold with 10 folds as CV for searching
    searcher = GridSearchCV(estimator, params, scoring="f1", n_jobs=2,
                            cv=StratifiedKFold(labels, 10), verbose=1)
    searcher.fit(features, labels)

    selector.k = searcher.best_params_["select__k"]

else:
    # The result I got is 9
    selector.k = 9

features = selector.fit_transform(features, labels)
# Get selected features using numpy array indexing
# all_features contains "poi" which isn't a feature
selected_features = np.array(all_features[1:])[selector.get_support()]

sys.stdout.write("Done\n")

sys.stdout.write("Generating final dataset...   ")
sys.stdout.flush()

# Generate my_dataset for testing script
my_dataset = {}
# Use sorting to make sure the features are mapping to the correct person
for name, point, label in zip(sorted(data_dict), features, labels):
    my_dataset[name] = {"poi": label}
    # point is a 1D array: one row of the 2D array `features`
    for feature, value in zip(selected_features, point):
        my_dataset[name][feature] = value

# Generate features_list for testing script
features_list = ["poi"] + list(selected_features)

sys.stdout.write("Done\n")

### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

sys.stdout.write("Building model...   ")
sys.stdout.flush()

# C Parameter for this LinearSVC is tuned by hand
svm = LinearSVC(C=1.3)

nb = GaussianNB()

# Tune parameters in the next task
adabst = AdaBoostClassifier(DecisionTreeClassifier())

clf = VotingClassifier([
    ("svm", svm),
    ("nb", nb),
    ("ada", adabst),
])

sys.stdout.write("Done\n")

### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

sys.stdout.write("Tuning model...   ")
sys.stdout.flush()

# Switch to True to perform grid search
search = False

if search:

    sys.stdout.write("\n")

    # Here I use F2 score, which weights recall score more
    f2_score = make_scorer(fbeta_score, beta=2)

    param_grid = {
        "base_estimator__max_features": [None, "sqrt", "log2"],
        "base_estimator__max_depth": [None, 3, 5, 8, 10],
        "base_estimator__min_samples_leaf": [1, 2, 3, 5, 8],
        "learning_rate": [0.01, 0.1, 0.5, 0.8, 1.0],
    }

    searcher = GridSearchCV(adabst, param_grid, f2_score, n_jobs=2, verbose=1,
                            cv=StratifiedKFold(labels, 10))

    searcher.fit(features, labels)

    # Apply tuned parameters to the model
    adabst.set_params(**searcher.best_params_)

else:

    # Result I got when I ran the above searching
    adabst.set_params(base_estimator__max_features="sqrt",
                      base_estimator__min_samples_leaf=1,
                      base_estimator__max_depth=3,
                      learning_rate=0.01)

sys.stdout.write("Done\n")

### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

sys.stdout.write("Dumping classifier and data...   ")
sys.stdout.flush()

dump_classifier_and_data(clf, my_dataset, features_list)

sys.stdout.write("Done\n")
Loading dataset...   Done
Removing outliers...   Done
Creating new features...   Done
Loading word data...   Done
Vectorizing email data...   Done
Preprocessing data...   Done
Selecting features...   Done
Generating final dataset...   Done
Building model...   Done
Tuning model...   Done
Dumping classifier and data...   Done
In [3]:
!python tester.py
VotingClassifier(estimators=[('svm', LinearSVC(C=1.3, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)), ('nb', GaussianNB()), ('ada', AdaBoostClassifier(algorithm='SAM...om_state=None, splitter='best'),
          learning_rate=0.01, n_estimators=50, random_state=None))],
         voting='hard', weights=None)
	Accuracy: 0.93053	Precision: 0.77216	Recall: 0.67950	F1: 0.72287	F2: 0.69621
	Total predictions: 15000	True positives: 1359	False positives:  401	False negatives:  641	True negatives: 12599