Previously we saw the use of Analytics and Machine Learning to predict the judgements of cases by the US Supreme Court. In this section, we will discuss another problem with a legal dimension to it, but from a different context. In this study, we confine ourselves to tree-based models.
Enron Corporation was an American energy, commodities, and services company based in Houston, Texas. Before its bankruptcy on December 2, 2001, Enron employed approximately 20,000 staff and was one of the world’s major electricity, natural gas, communications, and pulp and paper companies, with claimed revenues of nearly $111 billion during 2000. Fortune named Enron “America’s Most Innovative Company” for six consecutive years.
At the end of 2001, it was revealed that its reported financial condition was sustained substantially by an institutionalized, systematic, and creatively planned accounting fraud, known since as the Enron scandal. Enron has since become a well-known example of willful corporate fraud and corruption.
The California electricity crisis, also known as the Western U.S. Energy Crisis of 2000 and 2001, was a situation in which the U.S. state of California had a shortage of electricity supply caused by market manipulations, illegal shutdowns of pipelines by the Texas energy consortium Enron, and capped retail electricity prices. The state suffered from multiple large-scale blackouts.
California had an installed generating capacity of 45 GW. At the time of the blackouts, demand was 28 GW. A demand-supply gap was created by energy companies, mainly Enron, to create an artificial shortage. Energy traders took power plants offline for maintenance on days of peak demand to increase the price. Traders were thus able to sell power at premium prices, sometimes up to 20 times its normal value, thus making a profit from the market instability.
The Federal Energy Regulatory Commission (FERC) investigated Enron’s involvement. The investigation led to a $1.52 billion settlement with a group of California agencies and private utilities on July 16, 2005. However, due to its other bankruptcy obligations, only US$202 million of this was expected to be paid.
As a company of Enron’s (then) stature, it had millions of electronic files. In this study, we will analyze the so-called Enron Corpus. The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation. The corpus is “unique” in that it is one of the only publicly available mass collections of “real” emails easily available for study, as such collections are typically bound by numerous privacy and legal restrictions which render them prohibitively difficult to access.
FERC publicly released emails from Enron: over 600,000 emails from 158 users, consisting mostly of senior management officials. We will use labeled emails from the 2010 Text Retrieval Conference Legal Track.
The ‘responsive’ label reflects the opinions of legal experts. We will use our model to predict whether a given email is likely to be indicative of involvement in the bidding.
The set consists of around 860 emails, which, after cleaning and transforming, will be split into training and test sets.
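The split itself is not shown in the text; a minimal base-R sketch (the 70/30 ratio, the seed, and the toy data frame below are all assumptions) might look like this:

```r
# Hypothetical 70/30 train/test split; the actual ratio and seed used
# in this study are not stated in the text.
set.seed(144)

# Toy stand-in for the real emails data frame
emails <- data.frame(
  email      = c("bid for plants", "agenda attached", "energy prices", "lunch?"),
  responsive = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Draw 70% of the row indices at random for training; the rest is the test set
train_idx <- sample(nrow(emails), size = round(0.7 * nrow(emails)))
train     <- emails[train_idx, ]
test      <- emails[-train_idx, ]
```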
We will use Text Analytics, which according to Wikipedia “describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation”.
It has to be kept in mind that Text Analytics comes with its own set of caveats. Texts such as emails and tweets are often only ‘loosely’ structured. It’s almost certain that two different individuals will structure their emails or tweets differently. And such texts tend to have poor spelling, non-traditional grammar and, at times, multilingual content. I used to work for a Danish organization and I can vouch for the number of emails that had Danish content in them :)
Our text analytics process involves exploring the corpus, or body of text; cleaning it to be fit for analysis; transforming the text data into a ‘sparse’, matrix-like form using techniques such as ‘Bag of Words’, stop-word removal and stemming; clustering the transformed data to visually inspect the patterns in it, along with other visual ways of exploring the data; and analytically modelling the corpus to predict certain outcomes.
We begin by loading the data and inspecting its structure.
## 'data.frame': 855 obs. of 2 variables:
## $ email : chr "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
## $ responsive: int 0 1 0 1 0 0 1 0 0 0 ...
Let’s take a look at the first two emails. Let’s also check whether these emails are responsive.
# display first email
emails$email[1]
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across North America * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. \"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. 
\"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# The display style is not easy to read, so let's wrap the text (similar to text wrap in Excel, to fit the contents into one cell, or screen in our case)
strwrap(emails$email[1])
## [1] "North America's integrated electricity market requires cooperation"
## [2] "on environmental policies Commission for Environmental Cooperation"
## [3] "releases working paper on North America's electricity market"
## [4] "Montreal, 27 November 2001 -- The North American Commission for"
## [5] "Environmental Cooperation (CEC) is releasing a working paper"
## [6] "highlighting the trend towards increasing trade, competition and"
## [7] "cross-border investment in electricity between Canada, Mexico and"
## [8] "the United States. It is hoped that the working paper,"
## [9] "Environmental Challenges and Opportunities in the Evolving North"
## [10] "American Electricity Market, will stimulate public discussion"
## [11] "around a CEC symposium of the same title about the need to"
## [12] "coordinate environmental policies trinationally as a North"
## [13] "America-wide electricity market develops. The CEC symposium will"
## [14] "take place in San Diego on 29-30 November, and will bring together"
## [15] "leading experts from industry, academia, NGOs and the governments"
## [16] "of Canada, Mexico and the United States to consider the impact of"
## [17] "the evolving continental electricity market on human health and"
## [18] "the environment. \"Our goal [with the working paper and the"
## [19] "symposium] is to highlight key environmental issues that must be"
## [20] "addressed as the electricity markets in North America become more"
## [21] "and more integrated,\" said Janine Ferretti, executive director of"
## [22] "the CEC. \"We want to stimulate discussion around the important"
## [23] "policy questions being raised so that countries can cooperate in"
## [24] "their approach to energy and the environment.\" The CEC, an"
## [25] "international organization created under an environmental side"
## [26] "agreement to NAFTA known as the North American Agreement on"
## [27] "Environmental Cooperation, was established to address regional"
## [28] "environmental concerns, help prevent potential trade and"
## [29] "environmental conflicts, and promote the effective enforcement of"
## [30] "environmental law. The CEC Secretariat believes that greater North"
## [31] "American cooperation on environmental policies regarding the"
## [32] "continental electricity market is necessary to: * protect air"
## [33] "quality and mitigate climate change, * minimize the possibility of"
## [34] "environment-based trade disputes, * ensure a dependable supply of"
## [35] "reasonably priced electricity across North America * avoid"
## [36] "creation of pollution havens, and * ensure local and national"
## [37] "environmental measures remain effective. The Changing Market The"
## [38] "working paper profiles the rapid changing North American"
## [39] "electricity market. For example, in 2001, the US is projected to"
## [40] "export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada"
## [41] "and Mexico. By 2007, this number is projected to grow to 16.9"
## [42] "thousand GWh of electricity. \"Over the past few decades, the North"
## [43] "American electricity market has developed into a complex array of"
## [44] "cross-border transactions and relationships,\" said Phil Sharp,"
## [45] "former US congressman and chairman of the CEC's Electricity"
## [46] "Advisory Board. \"We need to achieve this new level of cooperation"
## [47] "in our environmental approaches as well.\" The Environmental"
## [48] "Profile of the Electricity Sector The electricity sector is the"
## [49] "single largest source of nationally reported toxins in the United"
## [50] "States and Canada and a large source in Mexico. In the US, the"
## [51] "electricity sector emits approximately 25 percent of all NOx"
## [52] "emissions, roughly 35 percent of all CO2 emissions, 25 percent of"
## [53] "all mercury emissions and almost 70 percent of SO2 emissions."
## [54] "These emissions have a large impact on airsheds, watersheds and"
## [55] "migratory species corridors that are often shared between the"
## [56] "three North American countries. \"We want to discuss the possible"
## [57] "outcomes from greater efforts to coordinate federal, state or"
## [58] "provincial environmental laws and policies that relate to the"
## [59] "electricity sector,\" said Ferretti. \"How can we develop more"
## [60] "compatible environmental approaches to help make domestic"
## [61] "environmental policies more effective?\" The Effects of an"
## [62] "Integrated Electricity Market One key issue raised in the paper is"
## [63] "the effect of market integration on the competitiveness of"
## [64] "particular fuels such as coal, natural gas or renewables. Fuel"
## [65] "choice largely determines environmental impacts from a specific"
## [66] "facility, along with pollution control technologies, performance"
## [67] "standards and regulations. The paper highlights other impacts of a"
## [68] "highly competitive market as well. For example, concerns about so"
## [69] "called \"pollution havens\" arise when significant differences in"
## [70] "environmental laws or enforcement practices induce power companies"
## [71] "to locate their operations in jurisdictions with lower standards."
## [72] "\"The CEC Secretariat is exploring what additional environmental"
## [73] "policies will work in this restructured market and how these"
## [74] "policies can be adapted to ensure that they enhance"
## [75] "competitiveness and benefit the entire region,\" said Sharp."
## [76] "Because trade rules and policy measures directly influence the"
## [77] "variables that drive a successfully integrated North American"
## [78] "electricity market, the working paper also addresses fuel choice,"
## [79] "technology, pollution control strategies and subsidies. The CEC"
## [80] "will use the information gathered during the discussion period to"
## [81] "develop a final report that will be submitted to the Council in"
## [82] "early 2002. For more information or to view the live video webcast"
## [83] "of the symposium, please go to: http://www.cec.org/electricity."
## [84] "You may download the working paper and other supporting documents"
## [85] "from:"
## [86] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [87] "Commission for Environmental Cooperation 393, rue St-Jacques"
## [88] "Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514)"
## [89] "350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# responsive?
emails$responsive[1]
## [1] 0
# Let's do it for the second email too.
emails$email[2]
## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the California State Auditor. 
I look forward to seeing you at The Aspen Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
emails$responsive[2]
## [1] 1
# To get an understanding of the total number of responsive emails, let's make a table of values for the responsive feature of the data frame.
table(emails$responsive)
##
## 0 1
## 716 139
The second email has turned out to be anticlimactic. It’s a forwarded email with no inherent content of its own. Such emails are beyond the scope of this study.
We now begin the use of the ‘tm’ package.
We begin by cleaning and tidying the data. This involves multiple steps.
It is important to know that R treats ‘help’, ‘Help’, ‘hElP’, and ‘HELP’ differently, i.e. R treats upper- and lower-case representations of the same word as different words. So, as part of the cleaning, we convert all the contents of the emails to lower case.
First we define the data as a corpus and then apply the lower-case transformation. Once that is done, we define the corpus to be a plain text document. We then proceed to remove parts of the text data that don’t add any value to the analysis, such as punctuation.
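As a sketch of these steps with the tm package (using a toy character vector in place of the real emails):

```r
library(tm)

# Build a corpus from the raw text and apply the cleaning steps
# described above: lower-casing, then punctuation removal.
corpus <- Corpus(VectorSource(c("Help HELP hElP!", "Some, punctuation; here.")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

as.character(corpus[[1]])  # "help help help"
```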
An example of a partly cleaned email is shown below.
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 5607
## [1] "north americas integrated electricity market requires cooperation on environmental policies commission for environmental cooperation releases working paper on north americas electricity market montreal 27 november 2001 the north american commission for environmental cooperation cec is releasing a working paper highlighting the trend towards increasing trade competition and crossborder investment in electricity between canada mexico and the united states it is hoped that the working paper environmental challenges and opportunities in the evolving north american electricity market will stimulate public discussion around a cec symposium of the same title about the need to coordinate environmental policies trinationally as a north americawide electricity market develops the cec symposium will take place in san diego on 2930 november and will bring together leading experts from industry academia ngos and the governments of canada mexico and the united states to consider the impact of the evolving continental electricity market on human health and the environment our goal with the working paper and the symposium is to highlight key environmental issues that must be addressed as the electricity markets in north america become more and more integrated said janine ferretti executive director of the cec we want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment the cec an international organization created under an environmental side agreement to nafta known as the north american agreement on environmental cooperation was established to address regional environmental concerns help prevent potential trade and environmental conflicts and promote the effective enforcement of environmental law the cec secretariat believes that greater north american cooperation on environmental policies regarding the continental electricity market is necessary to protect air quality 
and mitigate climate change minimize the possibility of environmentbased trade disputes ensure a dependable supply of reasonably priced electricity across north america avoid creation of pollution havens and ensure local and national environmental measures remain effective the changing market the working paper profiles the rapid changing north american electricity market for example in 2001 the us is projected to export 131 thousand gigawatthours gwh of electricity to canada and mexico by 2007 this number is projected to grow to 169 thousand gwh of electricity over the past few decades the north american electricity market has developed into a complex array of crossborder transactions and relationships said phil sharp former us congressman and chairman of the cecs electricity advisory board we need to achieve this new level of cooperation in our environmental approaches as well the environmental profile of the electricity sector the electricity sector is the single largest source of nationally reported toxins in the united states and canada and a large source in mexico in the us the electricity sector emits approximately 25 percent of all nox emissions roughly 35 percent of all co2 emissions 25 percent of all mercury emissions and almost 70 percent of so2 emissions these emissions have a large impact on airsheds watersheds and migratory species corridors that are often shared between the three north american countries we want to discuss the possible outcomes from greater efforts to coordinate federal state or provincial environmental laws and policies that relate to the electricity sector said ferretti how can we develop more compatible environmental approaches to help make domestic environmental policies more effective the effects of an integrated electricity market one key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal natural gas or renewables fuel choice largely determines environmental 
impacts from a specific facility along with pollution control technologies performance standards and regulations the paper highlights other impacts of a highly competitive market as well for example concerns about so called pollution havens arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards the cec secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region said sharp because trade rules and policy measures directly influence the variables that drive a successfully integrated north american electricity market the working paper also addresses fuel choice technology pollution control strategies and subsidies the cec will use the information gathered during the discussion period to develop a final report that will be submitted to the council in early 2002 for more information or to view the live video webcast of the symposium please go to httpwwwcecorgelectricity you may download the working paper and other supporting documents from httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commission for environmental cooperation 393 rue stjacques ouest bureau 200 montrãal quãbec canada h2y 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg "
Stop words are a set of commonly used words in any language, in our case, English. Removing stop words is important to many applications: it eliminates words that are unlikely to contribute to better prediction, thus reducing the size of the data. It also helps the algorithm focus on the ‘important’ words in the context of the problem at hand.
A simple analogy can be drawn using a Google search. Say you want to learn R programming. You google ‘how do I learn R programming’. It would give you lots of pages that match your search. But one must be mindful of the fact that it contains search terms like ‘how’, ‘do’ and ‘I’, which contribute nothing meaningful over the search phrase ‘learn r programming’, and instead surface some pages that are relevant only to the words ‘I’, ‘do’, etc. This is the basic idea of stop words. They can be used in a whole range of tasks, but we are using them for:
Supervised machine learning: removing stop words from the feature space
Clustering: removing stop words prior to generating clusters
Stop words can be thought of as a single, fixed set of words. But in reality, they mean different things to different applications. For example, in some applications removing all stop words, from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to some adjectives (e.g. good, nice), can be an appropriate stop-word list. This is expected to work well in our case. However, for some applications this can be detrimental. In sentiment analysis, for instance, removing adjective terms such as ‘good’ and ‘nice’, or the word ‘not’, can cause algorithms to misinterpret the data. In such cases, more thought needs to be given to the choice of stop words.
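A minimal sketch of stop-word removal using tm's built-in English list, applied to the Google-search example from above:

```r
library(tm)

corpus <- Corpus(VectorSource("how do i learn r programming"))

# Remove English stop words, then collapse the gaps they leave behind
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

as.character(corpus[[1]])  # the stop words 'how', 'do' and 'i' are gone
```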
Let’s ask ourselves a question: do we need to draw a distinction between the following words, in the context of the task at hand?
argue argued argues arguing test tests testing
The answer is: not necessarily. The distinction between these words is not expected to contribute to the performance of our prediction. A common ‘stem’ word like argu or test, respectively, can represent these words effectively for our analysis. The algorithmic process of performing this reduction is called stemming. There are many ways to approach the problem.
One approach is to build a database of words and their stems
Pro: handles exceptions
Con: won’t handle new words, bad for the Internet!
Another approach is to develop a rule-based algorithm where, for example, if a word ends in “ed”, “ing”, or “ly”, we remove the ending.
Pro: handles new/unknown words well
Con: many exceptions, misses words like child and children (but would get other plurals: dog and dogs)
The second option is widely popular. The “Porter Stemmer” algorithm was developed by Martin Porter in the 1980s, and is still widely used! We use the default stemming algorithm in the tm package.
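tm's stemDocument function wraps the Porter stemmer (via the SnowballC package); on the example words from above it behaves exactly as described:

```r
library(tm)

# Porter-stem the example words; each group collapses to a common stem.
stemDocument(c("argue", "argued", "argues", "arguing"))  # all become "argu"
stemDocument(c("test", "tests", "testing"))              # all become "test"
```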
An example of an email that has been removed of the stop words and stemmed is shown below.
## [1] "north america integr electr market requir cooper environment polici commiss environment cooper releas work paper north america electr market montreal 27 novemb 2001 north american commiss environment cooper cec releas work paper highlight trend toward increas trade competit crossbord invest electr canada mexico unit state hope work paper environment challeng opportun evolv north american electr market will stimul public discuss around cec symposium titl need coordin environment polici trinat north americawid electr market develop cec symposium will take place san diego 2930 novemb will bring togeth lead expert industri academia ngos govern canada mexico unit state consid impact evolv continent electr market human health environ goal work paper symposium highlight key environment issu must address electr market north america becom integr said janin ferretti execut director cec want stimul discuss around import polici question rais countri can cooper approach energi environ cec intern organ creat environment side agreement nafta known north american agreement environment cooper establish address region environment concern help prevent potenti trade environment conflict promot effect enforc environment law cec secretariat believ greater north american cooper environment polici regard continent electr market necessari protect air qualiti mitig climat chang minim possibl environmentbas trade disput ensur depend suppli reason price electr across north america avoid creation pollut haven ensur local nation environment measur remain effect chang market work paper profil rapid chang north american electr market exampl 2001 us project export 131 thousand gigawatthour gwh electr canada mexico 2007 number project grow 169 thousand gwh electr past decad north american electr market develop complex array crossbord transact relationship said phil sharp former us congressman chairman cec electr advisori board need achiev new level cooper environment approach well 
environment profil electr sector electr sector singl largest sourc nation report toxin unit state canada larg sourc mexico us electr sector emit approxim 25 percent nox emiss rough 35 percent co2 emiss 25 percent mercuri emiss almost 70 percent so2 emiss emiss larg impact airsh watersh migratori speci corridor often share three north american countri want discuss possibl outcom greater effort coordin feder state provinci environment law polici relat electr sector said ferretti can develop compat environment approach help make domest environment polici effect effect integr electr market one key issu rais paper effect market integr competit particular fuel coal natur gas renew fuel choic larg determin environment impact specif facil along pollut control technolog perform standard regul paper highlight impact high competit market well exampl concern call pollut haven aris signific differ environment law enforc practic induc power compani locat oper jurisdict lower standard cec secretariat explor addit environment polici will work restructur market polici can adapt ensur enhanc competit benefit entir region said sharp trade rule polici measur direct influenc variabl drive success integr north american electr market work paper also address fuel choic technolog pollut control strategi subsidi cec will use inform gather discuss period develop final report will submit council earli 2002 inform view live video webcast symposium pleas go httpwwwcecorgelectr may download work paper support document httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss environment cooper 393 rue stjacqu ouest bureau 200 montrãal quãbec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg"
We now transform the corpus into a document-term matrix and proceed to make it sparse. A sparse matrix is, as the name suggests, mostly zeros. Processing such a matrix is quick and economical in terms of both time and memory.
There needs to be a threshold for sparsity. In this study, we will use a threshold of 3%: all terms that appear in fewer than 3% of the documents will be eliminated. This is a way of finding the vital few.
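In the tm package this thresholding is done with removeSparseTerms(); a minimal sketch, assuming the cleaned corpus from the preprocessing steps is stored in an object named corpus:

```r
# A minimal sketch of the sparsity step, assuming the tm package and that
# the cleaned corpus from the preprocessing above is named 'corpus'.
library(tm)
dtm <- DocumentTermMatrix(corpus)
# removeSparseTerms() keeps terms whose sparsity is at most the given
# value; 0.97 drops words appearing in fewer than 3% of the documents.
sparse_dtm <- removeSparseTerms(dtm, sparse = 0.97)
inspect(sparse_dtm)
```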
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51612/622128
## Sparsity : 92%
## Maximal term length: 19
## Weighting : term frequency (tf)
Now that we have the final matrix needed for our analysis, let's explore it to understand it better.
Let's look at the most and least frequently occurring words and see whether they make sense in this context.
## [1] "2000" "2001" "agreement" "also" "attach"
## [6] "bill" "busi" "california" "call" "can"
## [11] "cap" "chang" "comment" "commiss" "compani"
## [16] "contract" "corp" "cost" "credit" "current"
## [21] "custom" "david" "day" "deal" "demand"
## [26] "discuss" "document" "draft" "electr" "email"
## [31] "energi" "enron" "fax" "ferc" "file"
## [36] "first" "follow" "forward" "gas" "generat"
## [41] "get" "group" "houston" "includ" "increas"
## [46] "inform" "iso" "issu" "jeff" "john"
## [51] "just" "know" "last" "legal" "let"
## [56] "like" "look" "make" "manag" "mark"
## [61] "market" "may" "meet" "messag" "month"
## [66] "natur" "need" "new" "now" "one"
## [71] "oper" "order" "origin" "per" "plan"
## [76] "plant" "pleas" "point" "power" "price"
## [81] "product" "project" "propos" "provid" "purchas"
## [86] "question" "rate" "receiv" "regard" "report"
## [91] "request" "requir" "respons" "review" "risk"
## [96] "said" "say" "see" "sent" "servic"
## [101] "state" "subject" "suppli" "system" "take"
## [106] "term" "thank" "time" "trade" "transact"
## [111] "transmiss" "two" "use" "util" "want"
## [116] "week" "will" "work" "year"
The above list shows the words that occur at least 200 times in the data set. It does have a few words one would associate with an energy fraud endeavour. Let's increase the threshold to 400 and see if the words that occur at least 400 times lend more insight into the scam.
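Lists like these can be produced with tm's findFreqTerms(); the matrix name sparse_dtm is an assumption carried over from the sparsity step:

```r
# Hedged sketch: list words whose total count meets a frequency floor.
# 'sparse_dtm' is the assumed name of the sparse document-term matrix.
library(tm)
findFreqTerms(sparse_dtm, lowfreq = 200)  # words occurring at least 200 times
findFreqTerms(sparse_dtm, lowfreq = 400)  # the stricter 400-occurrence list
```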
## [1] "2001" "agreement" "also" "attach" "california"
## [6] "call" "can" "chang" "compani" "contract"
## [11] "electr" "email" "energi" "enron" "forward"
## [16] "gas" "generat" "inform" "know" "market"
## [21] "may" "need" "new" "pleas" "power"
## [26] "price" "said" "state" "subject" "thank"
## [31] "time" "trade" "use" "util" "will"
The words do include terms one would associate with an energy company, but it's hard to gain more insight from them.
Instead of setting arbitrary thresholds like 200 and 400 to measure word frequencies, let's find the most and least frequently occurring words and display them.
## [1] 788
## dasovichnaenron havent sorri keannaenronenron
## 26 26 26 27
## kaminskihouect readi therefor andrew
## 28 28 29 30
## attent consider dissemin instead
## 30 30 31 31
## afternoon anyon format dear
## 32 32 32 33
## delay els explain extend
## 33 33 33 33
## may new can said compani attach
## 509 527 541 543 547 593
## state california electr forward energi pleas
## 670 758 769 776 787 793
## gas subject price email enron market
## 835 993 997 1001 1047 1170
## power will
## 1199 1580
We see that there are 788 words in the list in total. The first table shows the least frequently occurring words and the second the most frequently occurring ones.
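The tables above can be obtained by summing the columns of the term matrix and sorting; in this self-contained sketch a toy matrix stands in for the real 855 × 788 data:

```r
# Toy stand-in for the real document-term matrix (rows = emails,
# columns = stemmed terms); the values here are made up.
m <- matrix(c(3, 0, 1,
              0, 2, 5),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("power", "gas", "market")))
freq <- sort(colSums(m))  # total occurrences per term, ascending
head(freq)                # least frequently occurring terms
tail(freq)                # most frequently occurring terms
length(freq)              # number of distinct terms (788 in the study)
```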
Let's make some graphical representations of the word occurrences using ggplot.
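One way the bar chart might be drawn; 'freq' is the assumed sorted vector of term counts from the frequency step above:

```r
# Hedged sketch of a frequency bar chart with ggplot2; 'freq' is assumed.
library(ggplot2)
top <- tail(freq, 20)
df <- data.frame(term = names(top), count = as.numeric(top))
ggplot(df, aes(x = reorder(term, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "term", y = "occurrences", title = "20 most frequent terms")
```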
The word cloud option gives a colorful way of plotting the frequently occurring words.
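A sketch of the word cloud call, assuming the wordcloud package (which pulls in RColorBrewer) and the 'freq' vector of term counts from above:

```r
# Hedged sketch; 'freq' and the 200-occurrence floor are assumptions
# consistent with the frequency threshold used earlier.
library(wordcloud)
set.seed(42)  # word placement is random, so fix a seed for reproducibility
wordcloud(words = names(freq), freq = freq, min.freq = 200,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)
```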
A hierarchical clustering of the data will help us understand the patterns in the email data.
This shows a clear pattern. Let's draw boxes to delineate the 2 clusters.
The pattern very clearly reveals 2 clusters. This is possibly a manifestation of the difference between the responsive and non-responsive emails.
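The dendrogram and the two boxes can be produced with base R's hclust() and rect.hclust(); here a toy random matrix stands in for the term matrix used in the analysis, and the linkage method is an assumption:

```r
# Toy stand-in for the document-term matrix.
set.seed(1)
m <- matrix(rnorm(40), nrow = 8)
hc <- hclust(dist(m), method = "ward.D")  # linkage method is an assumption
plot(hc, main = "Cluster dendrogram of emails")
rect.hclust(hc, k = 2, border = "red")    # boxes delineating the 2 clusters
```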
If you have an R version other than 3.2.0, you may try the following visualization of the correlation plot: library(graph); library(Rgraphviz); plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T). These packages are not available for the R version 3.2.0 that I have, so these plots are omitted from this analysis.
Let's start building models to enable the prediction of an email being responsive.
Before that, we need to join the cleaned and processed dtm data with our response variable, responsive. We also see that some of the column names are not legitimate R column names, so we apply make.names to the colnames to convert them. Let's take a look at the structure of the first 10 columns of the dat data frame.
## 'data.frame': 855 obs. of 10 variables:
## $ X100 : num 0 0 0 0 0 0 5 0 0 0 ...
## $ X1400 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X1999 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X2000 : num 0 0 1 0 1 0 6 0 1 0 ...
## $ X2001 : num 2 1 0 0 0 0 7 0 0 0 ...
## $ X713 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X77002: num 0 0 0 0 0 0 0 0 0 0 ...
## $ abl : num 0 0 0 0 0 0 2 0 0 0 ...
## $ accept: num 0 0 0 0 0 0 1 0 0 0 ...
## $ access: num 0 0 0 0 0 0 0 0 0 0 ...
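The renaming step above can be sketched with base R's make.names(); a toy data frame stands in for the real matrix:

```r
# Toy frame with column names that are not legitimate R names
# (they begin with digits, like the counts for terms "2000" or "77002").
dat <- data.frame(check.names = FALSE,
                  `2000` = c(0, 1), `77002` = c(0, 0), abl = c(0, 2))
colnames(dat) <- make.names(colnames(dat))  # prefixes an "X": "2000" -> "X2000"
colnames(dat)
```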
We now proceed to split the data into training and test sets. As always, we use the caret package for this and do a 70/30 split.
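A hedged sketch of the 70/30 split with caret; 'dat' and its 'responsive' column are the assumed names, and the seed value is arbitrary:

```r
# createDataPartition() stratifies the split on the response variable.
library(caret)
set.seed(123)  # seed value is an assumption, for reproducibility
inTrain  <- createDataPartition(dat$responsive, p = 0.7, list = FALSE)
training <- dat[inTrain, ]
testing  <- dat[-inTrain, ]
```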
Let's start with a simple tree model.
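A sketch of the CART fit and its test-set evaluation via caret; the object names ('tree', 'training', 'testing') are assumptions consistent with the text:

```r
# Fit a CART model with caret's default bootstrap resampling, then
# evaluate on the held-out test set.
library(caret)
set.seed(123)
tree <- train(responsive ~ ., data = training, method = "rpart")
print(tree)
confusionMatrix(predict(tree, newdata = testing), testing$responsive)
```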
## CART
##
## 600 samples
## 788 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 600, 600, 600, 600, 600, 600, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.05102041 0.8659458 0.4305292 0.01938817 0.07318537
## 0.08163265 0.8688866 0.4263985 0.01447305 0.06547316
## 0.22448980 0.8607768 0.3597811 0.02150239 0.16329071
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.08163265.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 196 18
## yes 18 23
##
## Accuracy : 0.8588
## 95% CI : (0.8099, 0.8991)
## No Information Rate : 0.8392
## P-Value [Acc > NIR] : 0.2239
##
## Kappa : 0.4769
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9159
## Specificity : 0.5610
## Pos Pred Value : 0.9159
## Neg Pred Value : 0.5610
## Prevalence : 0.8392
## Detection Rate : 0.7686
## Detection Prevalence : 0.8392
## Balanced Accuracy : 0.7384
##
## 'Positive' Class : no
##
## [1] "The untuned Tree model gives an accuracy of 85.88 %"
Let's see if we can improve this model any further by using 10-fold CV.
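Swapping the default bootstrap for 10-fold CV only requires a trainControl() argument; object names are assumptions consistent with the text:

```r
# Refit the CART model, resampling with 10-fold cross-validation.
library(caret)
set.seed(123)
tree1 <- train(responsive ~ ., data = training, method = "rpart",
               trControl = trainControl(method = "cv", number = 10))
```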
## CART
##
## 600 samples
## 788 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 541, 539, 540, 540, 540, 540, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.05102041 0.8733269 0.4185940 0.01407933 0.1108604
## 0.08163265 0.8582421 0.3797495 0.02687094 0.1162744
## 0.22448980 0.8533241 0.3854441 0.02586106 0.1054384
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.05102041.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 208 6
## yes 25 16
##
## Accuracy : 0.8784
## 95% CI : (0.8319, 0.9159)
## No Information Rate : 0.9137
## P-Value [Acc > NIR] : 0.978828
##
## Kappa : 0.4457
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.8927
## Specificity : 0.7273
## Pos Pred Value : 0.9720
## Neg Pred Value : 0.3902
## Prevalence : 0.9137
## Detection Rate : 0.8157
## Detection Prevalence : 0.8392
## Balanced Accuracy : 0.8100
##
## 'Positive' Class : no
##
## [1] "The Tree model resampled with 10-fold CV gives an accuracy of 87.84 %"
## [1] "This tree1 model provides an improvement of 2.28 % over the tree model with default resampling, tree"
We enable parallel processing to shorten the computing times.
## [1] "Number of registered cores is 4"
But parallel processing may give slightly non-reproducible results.
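One common way to register the workers, assuming the doParallel package; caret then uses the registered back-end automatically:

```r
# Hedged sketch of the parallel back-end; doParallel is an assumption
# (other back-ends such as doMC work the same way with caret).
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
paste("Number of registered cores is", getDoParWorkers())
# ... fit the models ...
stopCluster(cl)
```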
Let's try a random forest model.
## Random Forest
##
## 600 samples
## 788 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 600, 600, 600, 600, 600, 600, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.8548692 0.2135687 0.02628675 0.08815368
## 39 0.8837965 0.4882833 0.02070043 0.07354551
## 787 0.8750118 0.5008818 0.02112316 0.05908473
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 39.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 207 7
## yes 23 18
##
## Accuracy : 0.8824
## 95% CI : (0.8363, 0.9192)
## No Information Rate : 0.902
## P-Value [Acc > NIR] : 0.87508
##
## Kappa : 0.4824
## Mcnemar's Test P-Value : 0.00617
##
## Sensitivity : 0.9000
## Specificity : 0.7200
## Pos Pred Value : 0.9673
## Neg Pred Value : 0.4390
## Prevalence : 0.9020
## Detection Rate : 0.8118
## Detection Prevalence : 0.8392
## Balanced Accuracy : 0.8100
##
## 'Positive' Class : no
##
## [1] "The untuned rf model gives an accuracy of 88.24 %"
Let's continue with RF, but use 10-fold CV as the resampling method instead of the default bootstrapping.
## Random Forest
##
## 600 samples
## 788 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 540, 540, 540, 540, 540, 541, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.8583250 0.2336371 0.01804490 0.1748317
## 39 0.8899917 0.5215636 0.03165645 0.1334932
## 787 0.8767412 0.4927091 0.03301471 0.1158294
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 39.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 207 7
## yes 22 19
##
## Accuracy : 0.8863
## 95% CI : (0.8408, 0.9225)
## No Information Rate : 0.898
## P-Value [Acc > NIR] : 0.76960
##
## Kappa : 0.5055
## Mcnemar's Test P-Value : 0.00933
##
## Sensitivity : 0.9039
## Specificity : 0.7308
## Pos Pred Value : 0.9673
## Neg Pred Value : 0.4634
## Prevalence : 0.8980
## Detection Rate : 0.8118
## Detection Prevalence : 0.8392
## Balanced Accuracy : 0.8173
##
## 'Positive' Class : no
##
## [1] "The 10-fold cross-validated rf1 model gives an accuracy of 88.63 %"
Let's look at boosting models with 10-fold CV.
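A sketch of the boosting fit over caret's default gbm tuning grid; the object name 'gbm_fit' is an assumption:

```r
# Fit a stochastic gradient boosting model with 10-fold CV resampling.
library(caret)
set.seed(123)
gbm_fit <- train(responsive ~ ., data = training, method = "gbm",
                 trControl = trainControl(method = "cv", number = 10))
```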
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8246 nan 0.1000 0.0287
## 2 0.7780 nan 0.1000 0.0196
## 3 0.7403 nan 0.1000 0.0175
## 4 0.7148 nan 0.1000 0.0109
## 5 0.6935 nan 0.1000 0.0069
## 6 0.6687 nan 0.1000 0.0085
## 7 0.6504 nan 0.1000 0.0067
## 8 0.6282 nan 0.1000 0.0078
## 9 0.6093 nan 0.1000 0.0058
## 10 0.5956 nan 0.1000 0.0047
## 20 0.4986 nan 0.1000 0.0022
## 40 0.4069 nan 0.1000 -0.0006
## 50 0.3706 nan 0.1000 -0.0014
## Stochastic Gradient Boosting
##
## 600 samples
## 788 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 540, 540, 540, 540, 540, 541, ...
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD
## 1 50 0.8866593 0.5106398 0.02333747
## 1 100 0.8883259 0.5250801 0.03773983
## 1 150 0.8833532 0.5184176 0.03756524
## 2 50 0.8900199 0.5442236 0.03148172
## 2 100 0.8817130 0.5221985 0.03605928
## 2 150 0.8834635 0.5357421 0.04306360
## 3 50 0.8867148 0.5345120 0.02779668
## 3 100 0.8867695 0.5465803 0.03856169
## 3 150 0.8851311 0.5435639 0.03392098
## Kappa SD
## 0.09843233
## 0.15345884
## 0.14016282
## 0.12121627
## 0.12685022
## 0.15091245
## 0.12487392
## 0.13934852
## 0.13191483
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 2, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 208 6
## yes 23 18
##
## Accuracy : 0.8863
## 95% CI : (0.8408, 0.9225)
## No Information Rate : 0.9059
## P-Value [Acc > NIR] : 0.879122
##
## Kappa : 0.4937
## Mcnemar's Test P-Value : 0.002967
##
## Sensitivity : 0.9004
## Specificity : 0.7500
## Pos Pred Value : 0.9720
## Neg Pred Value : 0.4390
## Prevalence : 0.9059
## Detection Rate : 0.8157
## Detection Prevalence : 0.8392
## Balanced Accuracy : 0.8252
##
## 'Positive' Class : no
##
## [1] "The untuned gbm model gives an accuracy of 88.63 %"
Let's try boosting with repeated (10 times) 10-fold CV.
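Repeated cross-validation again only changes the trainControl() call; the 'gbm1' object name is assumed from the text:

```r
# 10-fold CV repeated 10 times, averaging accuracy over 100 resamples.
library(caret)
set.seed(123)
gbm1 <- train(responsive ~ ., data = training, method = "gbm",
              trControl = trainControl(method = "repeatedcv",
                                       number = 10, repeats = 10))
```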
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8129 nan 0.1000 0.0329
## 2 0.7641 nan 0.1000 0.0242
## 3 0.7222 nan 0.1000 0.0140
## 4 0.6901 nan 0.1000 0.0165
## 5 0.6660 nan 0.1000 0.0082
## 6 0.6379 nan 0.1000 0.0101
## 7 0.6134 nan 0.1000 0.0103
## 8 0.5940 nan 0.1000 0.0058
## 9 0.5749 nan 0.1000 0.0046
## 10 0.5585 nan 0.1000 0.0052
## 20 0.4497 nan 0.1000 0.0000
## 40 0.3448 nan 0.1000 -0.0014
## 60 0.2772 nan 0.1000 -0.0000
## 80 0.2248 nan 0.1000 -0.0006
## 100 0.1891 nan 0.1000 -0.0007
## Stochastic Gradient Boosting
##
## 600 samples
## 788 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
##
## Summary of sample sizes: 540, 540, 540, 540, 540, 541, ...
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
## 1 50 0.8896813 0.5042790 0.02687093 0.1364611
## 1 100 0.8875508 0.5105705 0.02901190 0.1374983
## 1 150 0.8890342 0.5237410 0.02760239 0.1335777
## 2 50 0.8871923 0.5185103 0.03187558 0.1435163
## 2 100 0.8876807 0.5330914 0.02902559 0.1315474
## 2 150 0.8895197 0.5496173 0.03089993 0.1260713
## 3 50 0.8875200 0.5286328 0.02802356 0.1216371
## 3 100 0.8898530 0.5491060 0.03333461 0.1458003
## 3 150 0.8877024 0.5531663 0.03474165 0.1362639
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 205 9
## yes 20 21
##
## Accuracy : 0.8863
## 95% CI : (0.8408, 0.9225)
## No Information Rate : 0.8824
## P-Value [Acc > NIR] : 0.47112
##
## Kappa : 0.5273
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.9111
## Specificity : 0.7000
## Pos Pred Value : 0.9579
## Neg Pred Value : 0.5122
## Prevalence : 0.8824
## Detection Rate : 0.8039
## Detection Prevalence : 0.8392
## Balanced Accuracy : 0.8056
##
## 'Positive' Class : no
##
## [1] "The gbm model with repeated 10-fold CV gives an accuracy of 88.63 %"
Let's look at the 20 most important contributors to the gbm1 model prediction.
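A sketch of the importance plot via caret's varImp(); 'gbm1' is the assumed object name of the repeated-CV boosting model:

```r
# Plot the 20 most important predictor words for the boosting model.
library(caret)
plot(varImp(gbm1), top = 20)
```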
The plot makes sense in the context of the problem at hand. We see that the most important predictor words, in decreasing order of importance, are:
which fits well within the context of an energy company and demand manipulation.
Since we find that the word ‘california’ is the most important predictor word, it would be insightful to know which words are closely associated with it, that is, statistically correlated. Let's look at the set of words correlated with the occurrence of the word ‘california’. We specify the correlation threshold to be 0.7.
Let's repeat the process for the 5 most important predictor words. The correlation threshold for the second and subsequent important predictor words is set to 0.6.
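These association lists can be produced with tm's findAssocs(); 'sparse_dtm' is the assumed matrix name, and the thresholds match those stated above:

```r
# Words correlated with the top predictors, at the stated thresholds.
library(tm)
findAssocs(sparse_dtm, "california", corlimit = 0.7)
findAssocs(sparse_dtm, c("price", "load", "capac", "demand"), corlimit = 0.6)
```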
## $california
## electr consum profit public caus wholesal time busi
## 0.87 0.84 0.80 0.80 0.78 0.78 0.77 0.76
## everi price action energi found paid state practic
## 0.75 0.75 0.73 0.73 0.73 0.73 0.73 0.72
## act dollar use power
## 0.71 0.71 0.71 0.70
## $price
## california electr wholesal among consum level
## 0.75 0.73 0.72 0.71 0.71 0.71
## market profit action caus part demand
## 0.69 0.68 0.67 0.67 0.67 0.65
## public period practic suppli trader act
## 0.65 0.64 0.64 0.64 0.64 0.63
## result 2000 base power real time
## 0.63 0.62 0.62 0.62 0.62 0.62
## fact plant includ
## 0.61 0.61 0.60
## $load
## staff
## 0.62
## $capac
## bid
## 0.67
## $demand
## suppli electr generat part price market among custom
## 0.79 0.69 0.67 0.65 0.65 0.64 0.62 0.62
## plant rais caus competit energi power
## 0.62 0.62 0.61 0.61 0.60 0.60
Not surprisingly, we see that electr(icity), consum(er) and profit are the most correlated with california. It's worth keeping in mind that the letters within brackets indicate the parts of words removed by stemming.
## [1] "The least accurate prediction was 85.88 %, given by tree model"
## [1] "The most accurate prediction was 88.63 %, given by rf1 model"
## [1] "The most accurate model rf1 gives an improvement of 3.202% over the least accurate model of tree"
## [1] "With this model, we can expect close to 88.63% of correct predictions of responsiveness on a similar email dataset that the model hasn't 'seen' yet."
We have used text analytics to decipher and evaluate a set of emails with respect to a response variable that classifies the emails into 2 mutually exclusive categories.
In the process of setting up the model, we explored various preprocessing and modeling steps:

- converting the text to lower case
- removing punctuation
- definition and use of stop words
- introduction to stemming, its types and its use in text analytics
- converting the data into a sparse matrix form
- numerous representations of frequently occurring words using ggplot and word cloud
- hierarchical clustering of the data to reveal clear patterns in the emails
- setting up of parallel processing
- CART models
- Random Forest models
- Boosting models
- important predictor words from the model
- correlations of the most important predictors
This analysis forms part of what's called ‘predictive coding’. In essence, predictive coding takes input from a human, who reviews samples of documents and marks them according to the needs of the task (responsive or benign in our context), and uses these human decisions to predict or generalize the classification across a larger collection of documents. This allows computer programs to at least partly, if not predominantly, replace the expensive and tedious manual review of a large set of documents.
This is of great importance as recently, in April 2012, a state judge in Virginia issued the first state court ruling allowing the use of predictive coding in e-discovery, in the case Global Aerospace, Inc. With such a ruling, it's not difficult to see the potential and scope for predictive coding.
Footnote: If the file gives any error while being converted to HTML using the ‘Knit HTML’ option in RStudio, the following code snippet can be used: library(knitr); knit2html(“enron.Rmd”)