Enron Fraud: Email/Text Analytics

By: Srisai Sivakumar

Previously we saw how Analytics and Machine Learning can be used to predict the judgements of cases by the US Supreme Court. In this section, we will again discuss a problem with a legal dimension to it, but from a different context. In this study, we confine ourselves to tree-based models.

Introduction

Enron Corporation was an American energy, commodities, and services company based in Houston, Texas. Before its bankruptcy on December 2, 2001, Enron employed approximately 20,000 staff and was one of the world’s major electricity, natural gas, communications, and pulp and paper companies, with claimed revenues of nearly $111 billion during 2000. Fortune named Enron “America’s Most Innovative Company” for six consecutive years.

At the end of 2001, it was revealed that its reported financial condition was sustained substantially by an institutionalized, systematic, and creatively planned accounting fraud, known since as the Enron scandal. Enron has since become a well-known example of willful corporate fraud and corruption.

California Energy Crisis

The California electricity crisis, also known as the Western U.S. Energy Crisis of 2000 and 2001, was a situation in which the U.S. state of California had a shortage of electricity supply caused by market manipulations, illegal shutdowns of pipelines by the Texas energy consortium Enron, and capped retail electricity prices. The state suffered from multiple large-scale blackouts.

California had an installed generating capacity of 45 GW. At the time of the blackouts, demand was 28 GW. A demand-supply gap was created by energy companies, mainly Enron, to create an artificial shortage. Energy traders took power plants offline for maintenance on days of peak demand to increase the price. Traders were thus able to sell power at premium prices, sometimes up to 20 times its normal value, making a profit from the market instability.

The Federal Energy Regulatory Commission (FERC) investigated Enron’s involvement. The investigation led to a $1.52 billion settlement with a group of California agencies and private utilities on July 16, 2005. However, due to its other bankruptcy obligations, only US$202 million of this was expected to be paid.

The dataset

As a company of Enron’s (then) stature, it had millions of electronic files. In this study, we will analyze the so called Enron Corpus. The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation. The corpus is “unique” in that it is one of the only publicly available mass collections of “real” emails easily available for study, as such collections are typically bound by numerous privacy and legal restrictions which render them prohibitively difficult to access.

FERC publicly released emails from Enron: over 600,000 emails from 158 users, mostly senior management officials. We will use labeled emails from the 2010 Text Retrieval Conference (TREC) Legal Track. Each record consists of:

  • email - text of the message
  • responsive - does the email relate to energy schedules or bids?

The ‘responsive’ label reflects the opinions of legal experts. We will use the model to predict whether a selected email is likely to indicate involvement in the bidding.

The set consists of 855 emails, which, after cleaning and transforming, will be split into training and test sets.

Approach

We will use Text Analytics, which according to Wikipedia “describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation”.

It has to be kept in mind that Text Analytics comes with its own set of caveats. Texts such as emails and tweets are often ‘loosely’ structured. It's almost certain that two different individuals will structure their emails or tweets differently. Such texts also tend to have poor spelling, non-traditional grammar and are, at times, multilingual. I used to work for a Danish organization and I can vouch for the number of emails that had Danish content in them :)

Our text analytics process will involve exploring the corpus, or body of text; cleaning it to be fit for analysis; transforming the text data into a ‘sparse’ matrix using techniques such as ‘Bag of Words’, stop words, and stemming; clustering the transformed data to visually inspect the patterns in it; other visual ways of exploring the data; and analytically modelling the corpus to predict certain outcomes.

Analysis

The data

We begin by loading the data and inspecting its structure.

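The code behind this step might look like the following minimal sketch; the file name energy_bids.csv and the data frame name emails are assumptions.

# load the labeled emails (hypothetical file name) and inspect the structure
emails <- read.csv("energy_bids.csv", stringsAsFactors = FALSE)
str(emails)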
## 'data.frame':    855 obs. of  2 variables:
##  $ email     : chr  "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
##  $ responsive: int  0 1 0 1 0 0 1 0 0 0 ...

Let's take a look at the first 2 emails. Let's also check whether these emails are responsive.

# display first email

emails$email[1]
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. 
\"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. \"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# The display style is not easy to read, so let's wrap the text (similar to text wrap in Excel, to fit the contents into one cell, or the screen in our case)
strwrap(emails$email[1])
##  [1] "North America's integrated electricity market requires cooperation"                         
##  [2] "on environmental policies Commission for Environmental Cooperation"                         
##  [3] "releases working paper on North America's electricity market"                               
##  [4] "Montreal, 27 November 2001 -- The North American Commission for"                            
##  [5] "Environmental Cooperation (CEC) is releasing a working paper"                               
##  [6] "highlighting the trend towards increasing trade, competition and"                           
##  [7] "cross-border investment in electricity between Canada, Mexico and"                          
##  [8] "the United States. It is hoped that the working paper,"                                     
##  [9] "Environmental Challenges and Opportunities in the Evolving North"                           
## [10] "American Electricity Market, will stimulate public discussion"                              
## [11] "around a CEC symposium of the same title about the need to"                                 
## [12] "coordinate environmental policies trinationally as a North"                                 
## [13] "America-wide electricity market develops. The CEC symposium will"                           
## [14] "take place in San Diego on 29-30 November, and will bring together"                         
## [15] "leading experts from industry, academia, NGOs and the governments"                          
## [16] "of Canada, Mexico and the United States to consider the impact of"                          
## [17] "the evolving continental electricity market on human health and"                            
## [18] "the environment. \"Our goal [with the working paper and the"                                
## [19] "symposium] is to highlight key environmental issues that must be"                           
## [20] "addressed as the electricity markets in North America become more"                          
## [21] "and more integrated,\" said Janine Ferretti, executive director of"                         
## [22] "the CEC. \"We want to stimulate discussion around the important"                            
## [23] "policy questions being raised so that countries can cooperate in"                           
## [24] "their approach to energy and the environment.\" The CEC, an"                                
## [25] "international organization created under an environmental side"                             
## [26] "agreement to NAFTA known as the North American Agreement on"                                
## [27] "Environmental Cooperation, was established to address regional"                             
## [28] "environmental concerns, help prevent potential trade and"                                   
## [29] "environmental conflicts, and promote the effective enforcement of"                          
## [30] "environmental law. The CEC Secretariat believes that greater North"                         
## [31] "American cooperation on environmental policies regarding the"                               
## [32] "continental electricity market is necessary to: * protect air"                              
## [33] "quality and mitigate climate change, * minimize the possibility of"                         
## [34] "environment-based trade disputes, * ensure a dependable supply of"                          
## [35] "reasonably priced electricity across North America * avoid"                                 
## [36] "creation of pollution havens, and * ensure local and national"                              
## [37] "environmental measures remain effective. The Changing Market The"                           
## [38] "working paper profiles the rapid changing North American"                                   
## [39] "electricity market. For example, in 2001, the US is projected to"                           
## [40] "export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada"                         
## [41] "and Mexico. By 2007, this number is projected to grow to 16.9"                              
## [42] "thousand GWh of electricity. \"Over the past few decades, the North"                        
## [43] "American electricity market has developed into a complex array of"                          
## [44] "cross-border transactions and relationships,\" said Phil Sharp,"                            
## [45] "former US congressman and chairman of the CEC's Electricity"                                
## [46] "Advisory Board. \"We need to achieve this new level of cooperation"                         
## [47] "in our environmental approaches as well.\" The Environmental"                               
## [48] "Profile of the Electricity Sector The electricity sector is the"                            
## [49] "single largest source of nationally reported toxins in the United"                          
## [50] "States and Canada and a large source in Mexico. In the US, the"                             
## [51] "electricity sector emits approximately 25 percent of all NOx"                               
## [52] "emissions, roughly 35 percent of all CO2 emissions, 25 percent of"                          
## [53] "all mercury emissions and almost 70 percent of SO2 emissions."                              
## [54] "These emissions have a large impact on airsheds, watersheds and"                            
## [55] "migratory species corridors that are often shared between the"                              
## [56] "three North American countries. \"We want to discuss the possible"                          
## [57] "outcomes from greater efforts to coordinate federal, state or"                              
## [58] "provincial environmental laws and policies that relate to the"                              
## [59] "electricity sector,\" said Ferretti. \"How can we develop more"                             
## [60] "compatible environmental approaches to help make domestic"                                  
## [61] "environmental policies more effective?\" The Effects of an"                                 
## [62] "Integrated Electricity Market One key issue raised in the paper is"                         
## [63] "the effect of market integration on the competitiveness of"                                 
## [64] "particular fuels such as coal, natural gas or renewables. Fuel"                             
## [65] "choice largely determines environmental impacts from a specific"                            
## [66] "facility, along with pollution control technologies, performance"                           
## [67] "standards and regulations. The paper highlights other impacts of a"                         
## [68] "highly competitive market as well. For example, concerns about so"                          
## [69] "called \"pollution havens\" arise when significant differences in"                          
## [70] "environmental laws or enforcement practices induce power companies"                         
## [71] "to locate their operations in jurisdictions with lower standards."                          
## [72] "\"The CEC Secretariat is exploring what additional environmental"                           
## [73] "policies will work in this restructured market and how these"                               
## [74] "policies can be adapted to ensure that they enhance"                                        
## [75] "competitiveness and benefit the entire region,\" said Sharp."                               
## [76] "Because trade rules and policy measures directly influence the"                             
## [77] "variables that drive a successfully integrated North American"                              
## [78] "electricity market, the working paper also addresses fuel choice,"                          
## [79] "technology, pollution control strategies and subsidies. The CEC"                            
## [80] "will use the information gathered during the discussion period to"                          
## [81] "develop a final report that will be submitted to the Council in"                            
## [82] "early 2002. For more information or to view the live video webcast"                         
## [83] "of the symposium, please go to: http://www.cec.org/electricity."                            
## [84] "You may download the working paper and other supporting documents"                          
## [85] "from:"                                                                                      
## [86] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [87] "Commission for Environmental Cooperation 393, rue St-Jacques"                               
## [88] "Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514)"                            
## [89] "350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# responsive?
emails$responsive[1]
## [1] 0
# Let's do it for the second email too.

emails$email[2]
## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the  California State Auditor. I look forward to seeing you at The Aspen  Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
emails$responsive[2]
## [1] 1
# To get an understanding of the total number of responsive emails, let's tabulate the values of the responsive feature of the data frame.

table(emails$responsive)
## 
##   0   1 
## 716 139

The second email has turned out to be anticlimactic. It's a forwarded email with no inherent content in it. Such emails are beyond the scope of this study.

Transformations

We now begin the use of the ‘tm’ package.

We begin by cleaning and tidying the data. This involves multiple steps.

It is important to know that R treats ‘help’, ‘Help’, ‘hElP’, and ‘HELP’ differently, i.e. R treats upper- and lower-case representations of the same word as different words. So, as part of the cleaning, we convert all the contents of the emails to lower case.

First we define the data as a corpus and then apply the lower-case transformation. Once that is done, we define the corpus to be a plain text document. We then proceed to remove parts of the text data that don't add any value to the analysis, like punctuation.

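A minimal sketch of these steps with the tm package is shown below; note that on newer tm versions the lower-case step needs to be wrapped in content_transformer(), as done here.

library(tm)

# build the corpus from the email text
corpus <- Corpus(VectorSource(emails$email))
# convert everything to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# define the documents as plain text
corpus <- tm_map(corpus, PlainTextDocument)
# remove punctuation, which adds no value to the analysis
corpus <- tm_map(corpus, removePunctuation)
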
An example of a partly cleaned email is shown below.

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 5607
## [1] "north americas integrated electricity market requires cooperation on environmental policies commission for environmental cooperation releases working paper on north americas electricity market montreal 27 november 2001  the north american commission for environmental cooperation cec is releasing a working paper highlighting the trend towards increasing trade competition and crossborder investment in electricity between canada mexico and the united states it is hoped that the working paper environmental challenges and opportunities in the evolving north american electricity market will stimulate public discussion around a cec symposium of the same title about the need to coordinate environmental policies trinationally as a north americawide electricity market develops the cec symposium will take place in san diego on 2930 november and will bring together leading experts from industry academia ngos and the governments of canada mexico and the united states to consider the impact of the evolving continental electricity market on human health and the environment our goal with the working paper and the symposium is to highlight key environmental issues that must be addressed as the electricity markets in north america become more and more integrated said janine ferretti executive director of the cec we want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment the cec an international organization created under an environmental side agreement to nafta known as the north american agreement on environmental cooperation was established to address regional environmental concerns help prevent potential trade and environmental conflicts and promote the effective enforcement of environmental law the cec secretariat believes that greater north american cooperation on environmental policies regarding the continental electricity market is necessary to   protect air quality and mitigate climate change   minimize the possibility of environmentbased trade disputes   ensure a dependable supply of reasonably priced electricity across north america   avoid creation of pollution havens and   ensure local and national environmental measures remain effective the changing market the working paper profiles the rapid changing north american electricity market for example in 2001 the us is projected to export 131 thousand gigawatthours gwh of electricity to canada and mexico by 2007 this number is projected to grow to 169 thousand gwh of electricity over the past few decades the north american electricity market has developed into a complex array of crossborder transactions and relationships said phil sharp former us congressman and chairman of the cecs electricity advisory board we need to achieve this new level of cooperation in our environmental approaches as well the environmental profile of the electricity sector the electricity sector is the single largest source of nationally reported toxins in the united states and canada and a large source in mexico in the us the electricity sector emits approximately 25 percent of all nox emissions roughly 35 percent of all co2 emissions 25 percent of all mercury emissions and almost 70 percent of so2 emissions these emissions have a large impact on airsheds watersheds and migratory species corridors that are often shared between the three north american countries we want to discuss the possible outcomes from greater efforts to coordinate federal state or provincial 
environmental laws and policies that relate to the electricity sector said ferretti how can we develop more compatible environmental approaches to help make domestic environmental policies more effective the effects of an integrated electricity market one key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal natural gas or renewables fuel choice largely determines environmental impacts from a specific facility along with pollution control technologies performance standards and regulations the paper highlights other impacts of a highly competitive market as well for example concerns about so called pollution havens arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards the cec secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region said sharp because trade rules and policy measures directly influence the variables that drive a successfully integrated north american electricity market the working paper also addresses fuel choice technology pollution control strategies and subsidies the cec will use the information gathered during the discussion period to develop a final report that will be submitted to the council in early 2002 for more information or to view the live video webcast of the symposium please go to httpwwwcecorgelectricity you may download the working paper and other supporting documents from httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commission for environmental cooperation 393 rue stjacques ouest bureau 200 montrãal quãbec canada h2y 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg "

Stop Words

Stop words are a set of commonly used words in a language, in our case English. Removing stop words is important to many applications: it helps eliminate words that are unlikely to contribute to better prediction, and thus helps reduce the size of the data. It also helps the algorithm focus on the ‘important’ words in the context of the problem at hand.

A simple analogy can be drawn using a Google search. Say you want to learn R programming, so you google ‘how do I learn R programming’. It would give you lots of pages that match your search. But one must be mindful of the fact that it contains search terms like ‘how’, ‘do’ and ‘I’, which do not contribute meaningfully over the search phrase ‘learn r programming’, and instead bring up some pages that are relevant only to the words ‘I’, ‘do’, etc. This is the basic idea of stop words. They can be used in a whole range of tasks, but we are using them for:

  • Supervised machine learning: removing stop words from the feature space

  • Clustering: removing stop words prior to generating clusters

Stop words can be thought of as a single set of words, but in reality they mean different things to different applications. For example, in some applications an appropriate stop word list can remove everything from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to some adjectives (e.g. good, nice). This is expected to work well in our case. However, for some applications this can be detrimental. In sentiment analysis, for instance, removing adjective terms such as ‘good’ and ‘nice’, or ‘not’, can cause algorithms to misinterpret the data. In such cases, more thought needs to be given to the choice of stop words.

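For our purposes, a minimal sketch of the removal step, using tm's built-in English stop word list, is:

# remove common English stop words from the corpus
corpus <- tm_map(corpus, removeWords, stopwords("english"))
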
Stemming

Let's ask ourselves a question: do we need to draw a distinction between the following words, in the context of the task at hand?

argue argued argues arguing test tests testing

The answer is: not necessarily. Distinctions between these words are not expected to contribute to the performance of our prediction. A common ‘stem’ like argu or test can represent each of these groups of words effectively for our analysis. The algorithmic process of performing this reduction is called stemming. There are many ways to approach the problem.

One approach is to build a database of words and their stems

  • Pro: handles exceptions

  • Con: won’t handle new words, bad for the Internet!

Another approach is developing a rule-based algorithm, where, for example, if a word ends in “ed”, “ing”, or “ly”, we remove the ending.

  • Pro: handles new/unknown words well

  • Con: many exceptions, misses words like child and children (but would get other plurals: dog and dogs)

The second option is widely popular. The “Porter Stemmer” algorithm was developed by Martin Porter in 1980, and is still widely used! We use the default stemming algorithm in the tm package.

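A minimal sketch of the stemming step (tm's default stemmer, which requires the SnowballC package):

# reduce each word to its stem using the Porter stemmer
corpus <- tm_map(corpus, stemDocument)
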
An example of an email with the stop words removed and stemming applied is shown below.

## [1] "north america integr electr market requir cooper  environment polici commiss  environment cooper releas work paper  north america electr market montreal 27 novemb 2001   north american commiss  environment cooper cec  releas  work paper highlight  trend toward increas trade competit  crossbord invest  electr  canada mexico   unit state   hope   work paper environment challeng  opportun   evolv north american electr market will stimul public discuss around  cec symposium    titl   need  coordin environment polici trinat   north americawid electr market develop  cec symposium will take place  san diego  2930 novemb  will bring togeth lead expert  industri academia ngos   govern  canada mexico   unit state  consid  impact   evolv continent electr market  human health   environ  goal   work paper   symposium   highlight key environment issu  must  address   electr market  north america becom    integr said janin ferretti execut director   cec  want  stimul discuss around  import polici question  rais   countri can cooper   approach  energi   environ  cec  intern organ creat   environment side agreement  nafta known   north american agreement  environment cooper  establish  address region environment concern help prevent potenti trade  environment conflict  promot  effect enforc  environment law  cec secretariat believ  greater north american cooper  environment polici regard  continent electr market  necessari    protect air qualiti  mitig climat chang   minim  possibl  environmentbas trade disput   ensur  depend suppli  reason price electr across north america   avoid creation  pollut haven    ensur local  nation environment measur remain effect  chang market  work paper profil  rapid chang north american electr market  exampl  2001  us  project  export 131 thousand gigawatthour gwh  electr  canada  mexico  2007  number  project  grow  169 thousand gwh  electr   past  decad  north american electr market  develop   complex array  crossbord transact  relationship said phil sharp former us congressman  chairman   cec electr advisori board  need  achiev  new level  cooper   environment approach  well  environment profil   electr sector  electr sector   singl largest sourc  nation report toxin   unit state  canada   larg sourc  mexico   us  electr sector emit approxim 25 percent   nox emiss rough 35 percent   co2 emiss 25 percent   mercuri emiss  almost 70 percent  so2 emiss  emiss   larg impact  airsh watersh  migratori speci corridor   often share   three north american countri  want  discuss  possibl outcom  greater effort  coordin feder state  provinci environment law  polici  relat   electr sector said ferretti  can  develop  compat environment approach  help make domest environment polici  effect  effect   integr electr market one key issu rais   paper   effect  market integr   competit  particular fuel   coal natur gas  renew fuel choic larg determin environment impact   specif facil along  pollut control technolog perform standard  regul  paper highlight  impact   high competit market  well  exampl concern   call pollut haven aris  signific differ  environment law  enforc practic induc power compani  locat  oper  jurisdict  lower standard  cec secretariat  explor  addit environment polici will work   restructur market    polici can  adapt  ensur   enhanc competit  benefit  entir region said sharp  trade rule  polici measur direct influenc  variabl  drive  success integr north american electr market  work paper also address fuel choic technolog pollut control strategi  subsidi  cec 
will use  inform gather   discuss period  develop  final report  will  submit   council  earli 2002   inform   view  live video webcast   symposium pleas go  httpwwwcecorgelectr  may download  work paper   support document  httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss  environment cooper 393 rue stjacqu ouest bureau 200 montrãal quãbec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg"

Sparsity

We now transform the corpus into a matrix and proceed to make it sparse. A sparse matrix is, as the name suggests, mostly zeros. Processing such a matrix is cheap and quick in terms of both time and memory.

There needs to be a threshold for sparsity. In this study, we will use a threshold of 3%: all terms that appear in fewer than 3% of the documents are eliminated. This is a way of finding the vital few.

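A sketch of building the document-term matrix and applying the sparsity threshold; the 0.97 argument keeps terms that appear in at least roughly 3% of the documents.

# build the document-term matrix and remove sparse terms
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.97)
dtm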
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51612/622128
## Sparsity           : 92%
## Maximal term length: 19
## Weighting          : term frequency (tf)

Visualization of frequent words

Now that we have our final matrix needed for our analysis, let's explore it to understand it better.

Let's look at the most and least frequently occurring words and see if they make sense in this context.

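A sketch of the call; the name freq.terms is reused later in the correlation plot footnote.

# terms that occur at least 200 times in total across the corpus
freq.terms <- findFreqTerms(dtm, lowfreq = 200)
freq.terms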
##   [1] "2000"       "2001"       "agreement"  "also"       "attach"    
##   [6] "bill"       "busi"       "california" "call"       "can"       
##  [11] "cap"        "chang"      "comment"    "commiss"    "compani"   
##  [16] "contract"   "corp"       "cost"       "credit"     "current"   
##  [21] "custom"     "david"      "day"        "deal"       "demand"    
##  [26] "discuss"    "document"   "draft"      "electr"     "email"     
##  [31] "energi"     "enron"      "fax"        "ferc"       "file"      
##  [36] "first"      "follow"     "forward"    "gas"        "generat"   
##  [41] "get"        "group"      "houston"    "includ"     "increas"   
##  [46] "inform"     "iso"        "issu"       "jeff"       "john"      
##  [51] "just"       "know"       "last"       "legal"      "let"       
##  [56] "like"       "look"       "make"       "manag"      "mark"      
##  [61] "market"     "may"        "meet"       "messag"     "month"     
##  [66] "natur"      "need"       "new"        "now"        "one"       
##  [71] "oper"       "order"      "origin"     "per"        "plan"      
##  [76] "plant"      "pleas"      "point"      "power"      "price"     
##  [81] "product"    "project"    "propos"     "provid"     "purchas"   
##  [86] "question"   "rate"       "receiv"     "regard"     "report"    
##  [91] "request"    "requir"     "respons"    "review"     "risk"      
##  [96] "said"       "say"        "see"        "sent"       "servic"    
## [101] "state"      "subject"    "suppli"     "system"     "take"      
## [106] "term"       "thank"      "time"       "trade"      "transact"  
## [111] "transmiss"  "two"        "use"        "util"       "want"      
## [116] "week"       "will"       "work"       "year"

The above list shows the words that occur at least 200 times in the data set. It does have a few words one would associate with an energy fraud endeavour. Let's increase the threshold to 400 and see if the words that occur at least 400 times lend more insight into the scam.

##  [1] "2001"       "agreement"  "also"       "attach"     "california"
##  [6] "call"       "can"        "chang"      "compani"    "contract"  
## [11] "electr"     "email"      "energi"     "enron"      "forward"   
## [16] "gas"        "generat"    "inform"     "know"       "market"    
## [21] "may"        "need"       "new"        "pleas"      "power"     
## [26] "price"      "said"       "state"      "subject"    "thank"     
## [31] "time"       "trade"      "use"        "util"       "will"

The list does contain terms one would associate with an energy company, but it's hard to gain more insight from it.

Instead of setting arbitrary thresholds like 200 and 400 to measure word frequencies, let's find the most and least occurring words and display them.

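One way to do this is to sum the term counts over all documents and sort them, as in this sketch:

# total occurrences of each term, sorted in increasing order
freq <- sort(colSums(as.matrix(dtm)))
length(freq)    # total number of terms
head(freq, 20)  # least frequent terms
tail(freq, 20)  # most frequent terms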
## [1] 788
##  dasovichnaenron           havent            sorri keannaenronenron 
##               26               26               26               27 
##   kaminskihouect            readi         therefor           andrew 
##               28               28               29               30 
##           attent         consider         dissemin          instead 
##               30               30               31               31 
##        afternoon            anyon           format             dear 
##               32               32               32               33 
##            delay              els          explain           extend 
##               33               33               33               33
##        may        new        can       said    compani     attach 
##        509        527        541        543        547        593 
##      state california     electr    forward     energi      pleas 
##        670        758        769        776        787        793 
##        gas    subject      price      email      enron     market 
##        835        993        997       1001       1047       1170 
##      power       will 
##       1199       1580

We see that there are 788 words in the list in total. The first table shows the least occurring words, and the second the most occurring ones.

Let's make some graphical representations of the word occurrences using ggplot.

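A sketch of one such plot, a horizontal bar chart of the 20 most frequent terms, using the freq vector from above:

library(ggplot2)

top <- data.frame(term = names(tail(freq, 20)), count = unname(tail(freq, 20)))
# bar chart of term counts, most frequent at the top
ggplot(top, aes(x = reorder(term, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Term", y = "Occurrences")
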
The word cloud option gives a colorful way of plotting the frequently occurring words.

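A sketch using the wordcloud package; the min.freq cutoff of 200 is an arbitrary choice.

library(wordcloud)
library(RColorBrewer)

# plot terms occurring at least 200 times, sized by frequency
wordcloud(names(freq), freq, min.freq = 200, colors = brewer.pal(8, "Dark2"))
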
Clustering of words

Getting a hierarchical clustering of the data will help us understand the patterns in the email data.

This shows a clear pattern. Let's draw boxes to delineate the 2 clusters.

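A sketch of how such a dendrogram and the cluster boxes might be produced; the 0.9 sparsity cutoff for the term-document matrix is an assumption.

# cluster terms by their document-occurrence profiles
tdm <- removeSparseTerms(TermDocumentMatrix(corpus), sparse = 0.9)
d <- dist(scale(as.matrix(tdm)))
fit <- hclust(d, method = "ward.D")
plot(fit)
rect.hclust(fit, k = 2)  # draw boxes around the 2 clusters
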
The pattern very clearly reveals 2 clusters. This is possibly a manifestation of the differences between the responsive and non-responsive emails.

If you have an R version that is NOT version 3.2.0, you may try the following visualization of the correlation plot:

library(graph)
library(Rgraphviz)
plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)

These packages are not available for the R version 3.2.0 that I have, so these plots are omitted from this analysis.

Models

Let's start building models to enable the prediction of an email being responsive.

Before that, we need to join the cleaned and processed dtm data with our response variable, responsive. We also see that some of the column names are not legitimate R column names, so we have to change them; we use make.names on the colnames to turn them into legitimate R column names. Let's then take a look at the structure of the first 10 variables of the dat data frame.

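A sketch of assembling the modelling data frame; the object name dat follows the text above.

# convert the dtm to a data frame with legitimate R column names
dat <- as.data.frame(as.matrix(dtm))
colnames(dat) <- make.names(colnames(dat))
# attach the response as a factor with levels 'no'/'yes'
dat$responsive <- factor(emails$responsive, labels = c("no", "yes"))
str(dat[, 1:10])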
## 'data.frame':    855 obs. of  10 variables:
##  $ X100  : num  0 0 0 0 0 0 5 0 0 0 ...
##  $ X1400 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1999 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2000 : num  0 0 1 0 1 0 6 0 1 0 ...
##  $ X2001 : num  2 1 0 0 0 0 7 0 0 0 ...
##  $ X713  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X77002: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl   : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ accept: num  0 0 0 0 0 0 1 0 0 0 ...
##  $ access: num  0 0 0 0 0 0 0 0 0 0 ...

We now proceed to split the data into training and test sets. As always, we use the caret package for this and do a 70/30 split, as sketched below.

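A sketch of the split; the seed is an arbitrary choice for reproducibility.

library(caret)

set.seed(100)
inTrain <- createDataPartition(dat$responsive, p = 0.7, list = FALSE)
training <- dat[inTrain, ]
testing  <- dat[-inTrain, ]
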
Trees

Let's start with a simple tree model.

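A sketch of the CART fit with caret's default bootstrap resampling (this needs the rpart package); the evaluation on the test set follows the same pattern throughout.

set.seed(100)
tree <- train(responsive ~ ., data = training, method = "rpart")
tree
# evaluate on the held-out test set
confusionMatrix(predict(tree, newdata = testing), testing$responsive)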
## CART 
## 
## 600 samples
## 788 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 600, 600, 600, 600, 600, 600, ... 
## 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.05102041  0.8659458  0.4305292  0.01938817   0.07318537
##   0.08163265  0.8688866  0.4263985  0.01447305   0.06547316
##   0.22448980  0.8607768  0.3597811  0.02150239   0.16329071
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.08163265.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  196  18
##        yes  18  23
##                                           
##                Accuracy : 0.8588          
##                  95% CI : (0.8099, 0.8991)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.2239          
##                                           
##                   Kappa : 0.4769          
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.9159          
##             Specificity : 0.5610          
##          Pos Pred Value : 0.9159          
##          Neg Pred Value : 0.5610          
##              Prevalence : 0.8392          
##          Detection Rate : 0.7686          
##    Detection Prevalence : 0.8392          
##       Balanced Accuracy : 0.7384          
##                                           
##        'Positive' Class : no              
## 
## [1] "The untuned Tree model gives an accuracy of 85.88 %"

Let's see if we can improve this model any further by using 10-fold CV.

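A sketch of the same model with 10-fold cross-validation; the control object ctrl is reused for the later models.

ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
tree1 <- train(responsive ~ ., data = training, method = "rpart", trControl = ctrl)
tree1
confusionMatrix(predict(tree1, newdata = testing), testing$responsive)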
## CART 
## 
## 600 samples
## 788 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 541, 539, 540, 540, 540, 540, ... 
## 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      Accuracy SD  Kappa SD 
##   0.05102041  0.8733269  0.4185940  0.01407933   0.1108604
##   0.08163265  0.8582421  0.3797495  0.02687094   0.1162744
##   0.22448980  0.8533241  0.3854441  0.02586106   0.1054384
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.05102041.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  208   6
##        yes  25  16
##                                           
##                Accuracy : 0.8784          
##                  95% CI : (0.8319, 0.9159)
##     No Information Rate : 0.9137          
##     P-Value [Acc > NIR] : 0.978828        
##                                           
##                   Kappa : 0.4457          
##  Mcnemar's Test P-Value : 0.001225        
##                                           
##             Sensitivity : 0.8927          
##             Specificity : 0.7273          
##          Pos Pred Value : 0.9720          
##          Neg Pred Value : 0.3902          
##              Prevalence : 0.9137          
##          Detection Rate : 0.8157          
##    Detection Prevalence : 0.8392          
##       Balanced Accuracy : 0.8100          
##                                           
##        'Positive' Class : no              
## 
## [1] "The Tree model resampled with 10-fold CV gives an accuracy of 87.84 %"
## [1] "This tree1 model provides an improvement of 2.28 % over the tree model with default resampling, tree"

We enable parallel processing to shorten the computing times.

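A sketch using the doParallel package; the author's machine had 4 cores.

library(doParallel)

cl <- makeCluster(detectCores())  # 4 cores here
registerDoParallel(cl)
paste("Number of registered cores is", getDoParWorkers())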
## [1] "Number of registered cores is 4"

But parallel processing may give slightly non-reproducible results.

Let's try a random forest model.

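A sketch of the random forest fit (this needs the randomForest package):

set.seed(100)
rf <- train(responsive ~ ., data = training, method = "rf")
rf
confusionMatrix(predict(rf, newdata = testing), testing$responsive)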
## Random Forest 
## 
## 600 samples
## 788 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 600, 600, 600, 600, 600, 600, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##     2   0.8548692  0.2135687  0.02628675   0.08815368
##    39   0.8837965  0.4882833  0.02070043   0.07354551
##   787   0.8750118  0.5008818  0.02112316   0.05908473
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 39.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  207   7
##        yes  23  18
##                                           
##                Accuracy : 0.8824          
##                  95% CI : (0.8363, 0.9192)
##     No Information Rate : 0.902           
##     P-Value [Acc > NIR] : 0.87508         
##                                           
##                   Kappa : 0.4824          
##  Mcnemar's Test P-Value : 0.00617         
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.9673          
##          Neg Pred Value : 0.4390          
##              Prevalence : 0.9020          
##          Detection Rate : 0.8118          
##    Detection Prevalence : 0.8392          
##       Balanced Accuracy : 0.8100          
##                                           
##        'Positive' Class : no              
## 
## [1] "The untuned rf model gives an accuracy of 88.24 %"

Let's continue with RF, but use 10-fold CV as the resampling method instead of the default bootstrapping.

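A sketch, reusing the 10-fold CV control object from before:

set.seed(100)
rf1 <- train(responsive ~ ., data = training, method = "rf", trControl = ctrl)
rf1
confusionMatrix(predict(rf1, newdata = testing), testing$responsive)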
## Random Forest 
## 
## 600 samples
## 788 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 540, 540, 540, 540, 540, 541, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD 
##     2   0.8583250  0.2336371  0.01804490   0.1748317
##    39   0.8899917  0.5215636  0.03165645   0.1334932
##   787   0.8767412  0.4927091  0.03301471   0.1158294
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 39.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  207   7
##        yes  22  19
##                                           
##                Accuracy : 0.8863          
##                  95% CI : (0.8408, 0.9225)
##     No Information Rate : 0.898           
##     P-Value [Acc > NIR] : 0.76960         
##                                           
##                   Kappa : 0.5055          
##  Mcnemar's Test P-Value : 0.00933         
##                                           
##             Sensitivity : 0.9039          
##             Specificity : 0.7308          
##          Pos Pred Value : 0.9673          
##          Neg Pred Value : 0.4634          
##              Prevalence : 0.8980          
##          Detection Rate : 0.8118          
##    Detection Prevalence : 0.8392          
##       Balanced Accuracy : 0.8173          
##                                           
##        'Positive' Class : no              
## 
## [1] "The 10-fold cross validates rf1 model gives an accuracy of 88.63 %"

Let's look at boosting models with 10-fold CV.

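A sketch of the boosted model (method "gbm" needs the gbm package); the model name gbm1 is referenced in the variable-importance section below.

set.seed(100)
gbm1 <- train(responsive ~ ., data = training, method = "gbm", trControl = ctrl)
gbm1
confusionMatrix(predict(gbm1, newdata = testing), testing$responsive)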
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8246             nan     0.1000    0.0287
##      2        0.7780             nan     0.1000    0.0196
##      3        0.7403             nan     0.1000    0.0175
##      4        0.7148             nan     0.1000    0.0109
##      5        0.6935             nan     0.1000    0.0069
##      6        0.6687             nan     0.1000    0.0085
##      7        0.6504             nan     0.1000    0.0067
##      8        0.6282             nan     0.1000    0.0078
##      9        0.6093             nan     0.1000    0.0058
##     10        0.5956             nan     0.1000    0.0047
##     20        0.4986             nan     0.1000    0.0022
##     40        0.4069             nan     0.1000   -0.0006
##     50        0.3706             nan     0.1000   -0.0014
## Stochastic Gradient Boosting 
## 
## 600 samples
## 788 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 540, 540, 540, 540, 540, 541, ... 
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD
##   1                   50      0.8866593  0.5106398  0.02333747 
##   1                  100      0.8883259  0.5250801  0.03773983 
##   1                  150      0.8833532  0.5184176  0.03756524 
##   2                   50      0.8900199  0.5442236  0.03148172 
##   2                  100      0.8817130  0.5221985  0.03605928 
##   2                  150      0.8834635  0.5357421  0.04306360 
##   3                   50      0.8867148  0.5345120  0.02779668 
##   3                  100      0.8867695  0.5465803  0.03856169 
##   3                  150      0.8851311  0.5435639  0.03392098 
##   Kappa SD  
##   0.09843233
##   0.15345884
##   0.14016282
##   0.12121627
##   0.12685022
##   0.15091245
##   0.12487392
##   0.13934852
##   0.13191483
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 2, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  208   6
##        yes  23  18
##                                           
##                Accuracy : 0.8863          
##                  95% CI : (0.8408, 0.9225)
##     No Information Rate : 0.9059          
##     P-Value [Acc > NIR] : 0.879122        
##                                           
##                   Kappa : 0.4937          
##  Mcnemar's Test P-Value : 0.002967        
##                                           
##             Sensitivity : 0.9004          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.9720          
##          Neg Pred Value : 0.4390          
##              Prevalence : 0.9059          
##          Detection Rate : 0.8157          
##    Detection Prevalence : 0.8392          
##       Balanced Accuracy : 0.8252          
##                                           
##        'Positive' Class : no              
## 
## [1] "The untuned gbm model gives an accuracy of 88.63 %"

Let's try boosting with repeated (10 times) 10-fold CV.

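A sketch with repeated cross-validation; the model name gbm2 is an assumption.

ctrlRep <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
set.seed(100)
gbm2 <- train(responsive ~ ., data = training, method = "gbm", trControl = ctrlRep)
gbm2
confusionMatrix(predict(gbm2, newdata = testing), testing$responsive)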
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8129             nan     0.1000    0.0329
##      2        0.7641             nan     0.1000    0.0242
##      3        0.7222             nan     0.1000    0.0140
##      4        0.6901             nan     0.1000    0.0165
##      5        0.6660             nan     0.1000    0.0082
##      6        0.6379             nan     0.1000    0.0101
##      7        0.6134             nan     0.1000    0.0103
##      8        0.5940             nan     0.1000    0.0058
##      9        0.5749             nan     0.1000    0.0046
##     10        0.5585             nan     0.1000    0.0052
##     20        0.4497             nan     0.1000    0.0000
##     40        0.3448             nan     0.1000   -0.0014
##     60        0.2772             nan     0.1000   -0.0000
##     80        0.2248             nan     0.1000   -0.0006
##    100        0.1891             nan     0.1000   -0.0007
## Stochastic Gradient Boosting 
## 
## 600 samples
## 788 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## 
## Summary of sample sizes: 540, 540, 540, 540, 540, 541, ... 
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD 
##   1                   50      0.8896813  0.5042790  0.02687093   0.1364611
##   1                  100      0.8875508  0.5105705  0.02901190   0.1374983
##   1                  150      0.8890342  0.5237410  0.02760239   0.1335777
##   2                   50      0.8871923  0.5185103  0.03187558   0.1435163
##   2                  100      0.8876807  0.5330914  0.02902559   0.1315474
##   2                  150      0.8895197  0.5496173  0.03089993   0.1260713
##   3                   50      0.8875200  0.5286328  0.02802356   0.1216371
##   3                  100      0.8898530  0.5491060  0.03333461   0.1458003
##   3                  150      0.8877024  0.5531663  0.03474165   0.1362639
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  205   9
##        yes  20  21
##                                           
##                Accuracy : 0.8863          
##                  95% CI : (0.8408, 0.9225)
##     No Information Rate : 0.8824          
##     P-Value [Acc > NIR] : 0.47112         
##                                           
##                   Kappa : 0.5273          
##  Mcnemar's Test P-Value : 0.06332         
##                                           
##             Sensitivity : 0.9111          
##             Specificity : 0.7000          
##          Pos Pred Value : 0.9579          
##          Neg Pred Value : 0.5122          
##              Prevalence : 0.8824          
##          Detection Rate : 0.8039          
##    Detection Prevalence : 0.8392          
##       Balanced Accuracy : 0.8056          
##                                           
##        'Positive' Class : no              
## 
## [1] "The untuned gbm model gives an accuracy of 88.63 %"

Let's look at the 20 most important contributors to the gbm1 model's predictions.

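A sketch of the importance plot using caret's varImp:

# 20 most important predictor words in the boosted model
plot(varImp(gbm1), top = 20)
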
The plot makes sense in the context of the problem at hand. We see that the most important predictor words, in decreasing order of importance, are:

  • California
  • price
  • load
  • capac (capacity)
  • demand

which fits well within the context of an energy company and demand manipulation.

Since we find that the word ‘california’ is the most important predictor, it would be insightful to know which words are closely associated, in statistical terms correlated, with it. Let's look at the set of words correlated with the occurrence of the word ‘california’, with the correlation threshold set to 0.7.

Let's repeat the process for the 5 most important predictor words. The correlation threshold for the second and subsequent predictor words is set to 0.6.

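A sketch of these lookups with tm's findAssocs; the correlation limits follow the text above.

# words correlated with 'california' at 0.7 or more
findAssocs(dtm, "california", 0.7)
# words correlated with the other top predictors at 0.6 or more
findAssocs(dtm, c("price", "load", "capac", "demand"), 0.6)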
## $california
##   electr   consum   profit   public     caus wholesal     time     busi 
##     0.87     0.84     0.80     0.80     0.78     0.78     0.77     0.76 
##    everi    price   action   energi    found     paid    state  practic 
##     0.75     0.75     0.73     0.73     0.73     0.73     0.73     0.72 
##      act   dollar      use    power 
##     0.71     0.71     0.71     0.70
## $price
## california     electr   wholesal      among     consum      level 
##       0.75       0.73       0.72       0.71       0.71       0.71 
##     market     profit     action       caus       part     demand 
##       0.69       0.68       0.67       0.67       0.67       0.65 
##     public     period    practic     suppli     trader        act 
##       0.65       0.64       0.64       0.64       0.64       0.63 
##     result       2000       base      power       real       time 
##       0.63       0.62       0.62       0.62       0.62       0.62 
##       fact      plant     includ 
##       0.61       0.61       0.60
## $load
## staff 
##  0.62
## $capac
##  bid 
## 0.67
## $demand
##   suppli   electr  generat     part    price   market    among   custom 
##     0.79     0.69     0.67     0.65     0.65     0.64     0.62     0.62 
##    plant     rais     caus competit   energi    power 
##     0.62     0.62     0.61     0.61     0.60     0.60

Not surprisingly, we see that electr(icity), consum(er) and profit are most correlated with california. It's worth keeping in mind that the letters within brackets indicate the parts of words removed by stemming.

Result

## [1] "The least accurate prediction was 85.88 %, given by tree model"
## [1] "The most accurate prediction was 88.63 %, given by rf1 model"
## [1] "The most accurate model rf1 gives an improvement of 3.202% over the least accurate model of tree"
## [1] "With this model, we can expect close to 88.63% of correct predictions of responsiveness on a similar email datset that the model hasnt 'seen' yet."

Conclusion

We have used text analytics to decipher and evaluate a set of emails with respect to a response variable that characterizes the emails into 2 mutually exclusive categories.

In the process of setting up the model, we explored various preprocessing and modelling options:

  • converting the text to lower case
  • removing punctuation
  • the definition and use of stop words
  • an introduction to stemming, its variants, and its use in text analytics
  • converting the data into a sparse matrix form
  • several representations of frequently occurring words using ggplot and word cloud
  • hierarchical clustering of the data to reveal clear patterns in the emails
  • setting up parallel processing
  • CART models
  • Random Forest models
  • Boosting models
  • important predictor words from the model
  • correlations of the most important predictors

This analysis forms part of what's called ‘predictive coding’. In essence, predictive coding takes input from a human, who reviews samples of documents and marks them according to the needs of the task (responsive or benign, in our context), and uses these human decisions to predict or generalize the classification across a larger collection of documents. This allows computer programs to at least partly, if not predominantly, replace the expensive and tedious manual investigation of a large set of documents.

This is of great importance: in April 2012, a state judge in Virginia issued the first state court ruling allowing the use of predictive coding in e-discovery, in the case Global Aerospace, Inc. With such a ruling, it's not difficult to see the potential and scope for predictive coding.

Footnote: If the file gives any error while being converted to HTML using the ‘Knit HTML’ option in RStudio, the following code snippet can be used:

library(knitr)
knit2html("enron.Rmd")