You are viewing the RapidMiner Studio documentation for version 10.2.
Interactive Analysis
You need an Altair Units License to use this feature.
When you are faced with a binary classification problem in RapidMiner, Decision Trees can provide a useful solution. The Interactive Analysis view is an extension to RapidMiner Studio that enables you to build a customised node-by-node segmentation model that fits the exact needs of your data. Decision Trees are a versatile data mining technique for supervised learning: they split a dataset based on the relationships between the dependent variable and the independent variables. The view also generates a process that you can modify and put into production yourself.
Decision Trees address three large classes of problems:
- Binary Classification
- Classification
- Regression
The Interactive Analysis view helps you evaluate your data with its intuitive and easy-to-use interface, by exploring unfamiliar variables and identifying highly-predictive independent variables that can then be used in other modelling techniques, for example, a logistic regression model.
When using RapidMiner Studio, the Interactive Analysis view appears next to the Design, Results, Turbo Prep and Auto Model views.
If your data is in a scattered or inconsistent state, not yet ready for model-building, see Turbo Prep.
Example: Predict Survival on the Titanic
To show how Decision Trees work, we'll use the Titanic dataset, included with RapidMiner Studio, to predict survival, which is represented as a binary variable in this dataset. To get started, open the Interactive Analysis view by pressing the button at the top of RapidMiner Studio.
Select Data
After opening the Interactive Analysis view, the first step is to select the Titanic dataset from the Samples repository. It can be found under Samples > data. Select this dataset, then click Next at the bottom of the screen.
Select Model
Having selected the Titanic dataset, we want to predict survival on the Titanic, so select the "Survived" column, then click Next.
Continuous target variables are currently not supported as of RapidMiner Studio 10.2. This feature will be added in a future release.
Model Settings
Since "Survived" has only two values, "Yes" or "No", the problem is a classification problem. In general, for classification problems, Interactive Analysis displays a split report with the number of data points in each class. If you want to use a specific split search method or measure, you can select these from the Training Parameters panel using the Split Search Method list and the Measure drop-down box, then click Generate Split Report to refresh the split report.
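To illustrate what a split "measure" scores, here is a minimal sketch of Gini impurity, one common split criterion for classification trees. This is an illustrative calculation, not RapidMiner's internal implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 0.0 for a pure node,
    0.5 for an even two-class split."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A node containing only survivors is pure; a 50/50 node is maximally impure.
pure = gini(["Yes"] * 10)                  # 0.0
even = gini(["Yes"] * 5 + ["No"] * 5)      # 0.5
```

A split search method evaluates candidate splits by how much they reduce a measure like this across the resulting child nodes.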
Split Report
This view generates a report for each variable in the dataset (excluding the target "Survived" variable), containing univariate information and information with respect to the target. A data quality report for each variable summarises all this information in the Quality column and, based on it, makes a recommendation in the Status column, using a traffic light system (red / yellow / green), as to whether the variable should be included in the model. Variables with a green status are automatically selected as independent variables. There can be a number of reasons why a variable is or is not selected for modelling. For example, the status of the Ticket Number column is red because it shows an "ID-ness" above 70%; that is, the number of unique values is more than 70% of the total number of rows in the dataset, which would not make the variable a very effective predictor.
Not all of your data columns will help you to make a prediction. By discarding some of the data columns you can speed up your model and / or improve its performance. But how do you make that decision? A key point is that you're looking for patterns. Without some variation in the data and some discernible patterns, the data is not likely to be useful.
The following is a quick summary of what to look out for; the corresponding values are displayed alongside the quality bars for each data column.
- Columns that too closely mirror the target column, or not at all (Correlation)
- Columns where nearly all values are different (ID-ness)
- Columns where nearly all values are identical (Stability)
- Columns with missing values (Missing)
The split report summarises the situation with a colour-coded status bubble (red / yellow / green). As a general rule, it is a good idea to deselect at least those columns that have a red status bubble, but of course you can deselect any columns you like, independent of their status. The input for the machine learning model only includes the selected columns.
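The checks listed above can be sketched in plain Python. The 70% ID-ness cutoff mirrors the one mentioned for the Ticket Number column; the other thresholds and the function name are illustrative assumptions, not RapidMiner's exact defaults:

```python
from collections import Counter

def quality_report(values, id_ness_max=0.7, stability_max=0.9, missing_max=0.5):
    """Return (metrics, status) for one data column.

    Thresholds are illustrative; RapidMiner's defaults may differ.
    """
    n = len(values)
    present = [v for v in values if v is not None]
    missing = 1 - len(present) / n                 # share of missing values
    counts = Counter(present)
    id_ness = len(counts) / n                      # share of unique values
    stability = max(counts.values()) / n if counts else 0.0  # share of the most frequent value
    metrics = {"id_ness": id_ness, "stability": stability, "missing": missing}
    red = id_ness > id_ness_max or stability > stability_max or missing > missing_max
    return metrics, ("red" if red else "green")

# A column where nearly every value is unique (ID-like) gets a red status.
metrics, status = quality_report([f"ticket-{i}" for i in range(100)])
```

An ID-like column such as Ticket Number scores an ID-ness near 100%, so it is flagged red and excluded from the model input.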
In the case of the Titanic dataset, the "Name" and "Ticket Number" are equivalent to IDs. The "Cabin" values are missing for most passengers. Hence, these three columns, with a red status bubble, should be discarded when building a model. None of them is helpful in discovering a pattern.
"Life Boat" has a yellow status bubble, because the data in this column is highly correlated with "Survived". "Life Boat" and "Survived" are effectively synonyms, so it is better to remove the "Life Boat" column and let the model discover the underlying reasons for survival.
Put somewhat differently, you expect the model to help you make a plan. A passenger can't know in advance whether they will be on a lifeboat, so that can't be part of the plan, but they can decide how much to pay for their ticket, and whether or not to bring their family along.
In this example, you should also deselect the column with the yellow status bubble, "Life Boat", and press Next.
Auto Grow Settings
Having selected the variables for the model, we now configure the growth settings of the Decision Tree. A Decision Tree begins at a base node that represents the entire dataset, usually a training dataset.
It is good practice to partition your dataset beforehand into a training dataset, which you can use to train the model, and a testing dataset, which you can use to check the accuracy of the model on unseen data. You would ideally like the model to have the same level of predictability for both the training dataset and testing dataset. If the predictions on your training dataset are more accurate than your testing dataset, you are overfitting the model and you might want to either decrease the proportion for the training dataset or resample.
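The partition described above can be sketched as a simple random split. This is a minimal stand-in for RapidMiner's sampling operators; the 70/30 proportion and the function name are illustrative assumptions:

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into a training and a testing set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed for reproducibility
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(1000))
```

After training on `train`, comparing the model's accuracy on `train` against `test` reveals overfitting: a large gap means the model has memorised the training data rather than learned a general pattern.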
The base node of the training dataset is then split by a variable into further nodes. For example, a binary variable splits the base node into two child nodes, and these nodes can then be further split by other variables. Binary and discrete variables split nodes by their values; continuous variables split nodes by one or more inequalities.
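The two kinds of split can be sketched as follows; the function and column names are hypothetical, chosen only to mirror the Titanic example:

```python
def split_node(rows, column, threshold=None):
    """Split rows on a discrete value (one child per value) or,
    given a threshold, on a continuous inequality."""
    if threshold is None:
        children = {}
        for row in rows:                     # discrete: group by value
            children.setdefault(row[column], []).append(row)
        return children
    return {                                 # continuous: two-way inequality
        f"{column} < {threshold}": [r for r in rows if r[column] < threshold],
        f"{column} >= {threshold}": [r for r in rows if r[column] >= threshold],
    }

passengers = [{"sex": "female", "age": 30},
              {"sex": "male", "age": 10},
              {"sex": "male", "age": 40}]
by_sex = split_node(passengers, "sex")                 # binary variable: two children
by_age = split_node(passengers, "age", threshold=18)   # continuous: one inequality
```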
With these interactive Decision Trees, you can choose to either grow a tree yourself from the base node or have a tree grown for you automatically. In the latter case, you can still grow individual nodes further yourself.
In our example we will start with an automatically grown tree, so ensure Auto Grow Tree is selected. The settings enable you to specify the restrictions applied when automatically growing the tree: you can specify the minimum amount of data in each node in Stopping Criteria and the number of levels the tree can extend from the base node in Maximum Tree Depth. In this example we will keep all the default values and click Create.
Results
Depending on your dataset and the models you selected, you might have to wait for the results. The progress bar at the top tracks the status of an ongoing calculation.
Once the results are ready, the Decision Tree is displayed in a canvas containing multiple nodes, each representing a portion of the dataset, shown as a percentage on the node. The base node represents the whole dataset, so its proportion is always 100%. Each node also has a colour-coded orange / purple split representing the proportion of the target variable; you can hover over a node to see the exact numbers. If you didn't choose to automatically grow the tree, you can do this manually by clicking on a leaf node (a node that isn't already split) and selecting the grow button on the left to split the node by one level, or the auto grow button in the middle to fully split the node.
In our example the Decision Tree is already fully grown, and from it we can gain insights about the data. The first level splits the data by gender, and the target variable proportion already differs sharply: 19% of males survived compared with 73% of females. The female node is further split by passenger class, showing that 96% of females in first class survived, 89% in second class and 49% in third class. The male node is further split by age, with 51% surviving at ages 0-18, 12.5% at ages 18-20 and 19% at ages 20-80. From this we can see that females in first class were very likely to survive, while adult males were much less likely.

If we were to apply the model to new data, each data point would pass down the tree, following the value of each variable specified at each split, until it reaches one of the end nodes. The target variable proportion in that end node is then returned as a probability. These probabilities can be turned into a binary prediction by applying a 0.5 cutoff: for example, a Decision Tree model of the Titanic dataset predicts that a passenger with a probability ≥0.5 survives and a passenger with a probability <0.5 does not.
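The probability-to-prediction step described above can be sketched in a few lines; the 0.5 threshold is the one stated in the text, and the function name is illustrative:

```python
def predict(probability, threshold=0.5):
    """Convert a leaf node's target-variable proportion into a class label."""
    return "Yes" if probability >= threshold else "No"

# e.g. a first-class female reaches a leaf where 96% survived,
# while a typical adult male reaches a leaf where 19% survived.
first_class_female = predict(0.96)   # "Yes"
adult_male = predict(0.19)           # "No"
```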
The Decision Tree can be exported from this view and applied in a RapidMiner workflow. To do this, click the Export button to open the Export Model dialog box, select a repository folder, name the model in the Name text box, then click Next. Once the model has finished exporting, click Close to close the Export Model dialog box. Your Decision Tree model is then ready to use in a RapidMiner workflow.