Decision Trees in R

SUMMARY
Decision trees have many uses: exploratory data analysis, variable selection,
modeling and more. In today’s discussion we will cover:
• What are decision trees. Decision trees have many uses, are extremely versatile and easy to
interpret, and require little data preparation.
• Decision tree packages in R. rpart (package used today), C50, Cubist.
• Enhancing tree outputs. One of the attractive features of trees is that they are easy to
interpret. However, in the rpart package the output could use a little enhancing.
• What are Trees
• Some packages in R
• Enhancing Tree Outputs
• References
A decision tree is an algorithm that can handle continuous or
categorical dependent variables (DV) and independent variables (IV).
There are many advantages to using trees [1].
• Simple to understand and interpret. People are able to understand decision tree models after a brief
explanation.
• Requires little data preparation. Other techniques often require data normalisation, the creation of
dummy variables, and the removal of blank values.
• Able to handle both numerical and categorical data.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is
easily expressed in boolean logic.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability of
the model.
• Performs well with large data in a short time.
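As a minimal sketch of the points above about data preparation and white box output, the following fits a regression tree with rpart on R's built-in airquality data, which contains missing values. The data set, the object names, and the choice to treat Month as a factor are assumptions made only for illustration; they are not from the talk.

```r
library(rpart)

# Illustrative data only: airquality ships with base R and has missing values.
aq <- airquality
aq$Month <- factor(aq$Month)   # categorical predictor, no dummy coding needed

# Regression tree (continuous DV). No scaling or imputation is done beforehand.
fit_aq <- rpart(Ozone ~ Solar.R + Wind + Temp + Month,
                data = aq, method = "anova")

# The printed tree reads as a set of if/else rules on the predictors.
print(fit_aq)

# predict() returns a value for every row, including rows where a
# predictor is missing.
pred_aq <- predict(fit_aq, newdata = aq)
```

Rows with a missing response (Ozone) are dropped when fitting, so the root node count is smaller than the number of rows in the data.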
Some things to consider when coding the model…
• Splits. Gini or information.
• Type of DV (method). Classification (class), regression (anova), count (poisson), survival
(exp).
• Minimum number of observations for a split (minsplit).
• Minimum number of observations in a terminal node (minbucket).
• Cross validation (xval). Used more in model building than in exploration.
• Complexity parameter (cp). This value is used for pruning. A smaller tree is perhaps less
detailed, but can have lower cross-validated error.
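The settings above map onto arguments of rpart() and rpart.control(). The sketch below is one way to put them together. It assumes the kyphosis data that ships with rpart as a stand-in, and the specific values (minsplit = 20, minbucket = 7, xval = 10, cp = 0.01) are illustrative choices rather than recommendations from the talk.

```r
library(rpart)

set.seed(1)  # cross validation is random; fix the seed for repeatability

# Stand-in data: kyphosis is bundled with the rpart package.
fit_ky <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",                      # categorical DV
  parms   = list(split = "information"),  # or split = "gini"
  control = rpart.control(
    minsplit  = 20,    # min observations in a node to attempt a split
    minbucket = 7,     # min observations in any terminal node
    xval      = 10,    # number of cross-validation folds
    cp        = 0.01   # complexity parameter used while growing
  )
)

# Table of cp values with cross-validated error (xerror) for each tree size.
printcp(fit_ky)

# Prune back to the cp value with the lowest cross-validated error.
best_cp   <- fit_ky$cptable[which.min(fit_ky$cptable[, "xerror"]), "CP"]
pruned_ky <- prune(fit_ky, cp = best_cp)
```

Picking the cp with the lowest xerror is only one common convention; the pruned tree is never larger than the tree it was pruned from.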
R has many packages for the same or similar tasks.
• rpart. Comes with R.
• C50.
• Cubist.
• rpart.plot. Makes rpart plots much nicer.
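A small sketch for checking which of these packages are available in a given R installation. The vector name pkgs and the commented-out install step are illustrative assumptions, not part of the talk.

```r
# Package names as spelled on CRAN; C50, Cubist and rpart.plot are add-ons.
pkgs <- c("rpart", "C50", "Cubist", "rpart.plot")

# TRUE for each package that can be loaded in this R installation.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Install whatever is missing (uncomment to run; needs internet access).
# install.packages(pkgs[!available])
```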
An alternative to the rpart plots is the prp function in the
rpart.plot package.
• extra. Values 1 to 9 display extra information in the nodes.
• box.col. Defines the colors of the node boxes, such as the leaves.
• xflip. Flips the tree horizontally.
• nn. Adds node numbers for easier interpretation.
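The options above can be combined in a single prp() call, using the argument names as spelled in rpart.plot (extra, box.col, xflip, nn). This is a sketch under assumptions: the kyphosis model is a stand-in, the colors are arbitrary, and the plot is skipped if rpart.plot is not installed.

```r
library(rpart)

# Stand-in model: a classification tree on the kyphosis data shipped with rpart.
fit_plot <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class")

# rpart.plot is a separate CRAN package, so only plot if it is installed.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::prp(
    fit_plot,
    extra   = 1,      # show the number of observations per class in each node
    box.col = c("palegreen3", "pink")[fit_plot$frame$yval],  # color by predicted class
    xflip   = FALSE,  # set to TRUE to flip the tree horizontally
    nn      = TRUE    # print node numbers
  )
}
```

Indexing the color vector by fit_plot$frame$yval works here because, for a classification tree, yval holds the index of the predicted class in each node.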
References
1. http://en.wikipedia.org/wiki/Decision_tree_learning
2. http://www.stanford.edu/class/stats315b/minitech.pdf
3. http://www.milbo.org/rpart-plot/prp.pdf