Explain Predictions (Model Simulator)

Synopsis

This operator identifies the attributes that play the largest role when making a prediction.

Description

Given a model and an input, you can generate a prediction, but which of the attributes plays the largest role in forming that prediction? This operator takes a model and an ExampleSet as input and generates a table highlighting the attributes that most strongly support (green) or contradict (red) each prediction. Alternatively, the table can be displayed with two extra columns (support and contradict) containing numeric details.

For each Example in an ExampleSet, this operator generates a neighboring set of data points and uses correlation to identify the local attribute weights in that neighborhood. Although the relationship between attributes and predictions may be highly non-linear globally, this local linear approximation is powerful enough to explain the individual predictions.
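
As an illustration of this idea (a sketch of the general technique, not the operator's actual implementation), the following Python snippet perturbs a single numerical example within its training ranges, scores the neighborhood with the model, and uses the correlation between each attribute and the model's score as that attribute's local weight. The scikit-learn-style predict_proba call, the binary classification setting, and the noise scale are all illustrative assumptions.

```python
import numpy as np

def local_explanation(model, x, feature_ranges, sample_size=500, seed=42):
    """Local attribute weights for a single example x (illustrative sketch).

    model          : any scorer with a scikit-learn-style predict_proba (assumption)
    x              : 1-D numpy array holding the example to explain
    feature_ranges : per-attribute (min, max) pairs, e.g. taken from training statistics
    """
    rng = np.random.default_rng(seed)
    lo = np.array([r[0] for r in feature_ranges], dtype=float)
    hi = np.array([r[1] for r in feature_ranges], dtype=float)
    # Neighborhood: Gaussian noise around x, scaled by the training ranges.
    neighbors = x + rng.normal(0.0, 0.1 * (hi - lo), size=(sample_size, len(x)))
    scores = model.predict_proba(neighbors)[:, 1]          # assumes a binary problem
    # Correlation between each attribute and the score serves as the local weight.
    weights = np.array([np.corrcoef(neighbors[:, j], scores)[0, 1]
                        for j in range(len(x))])
    # Positive weight: the value supports the prediction; negative: it contradicts it.
    return np.nan_to_num(weights)
```

In this reading, the attributes with the largest positive local weights correspond to the green cells in the visualization output, and the most negative ones to the red cells.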

This operator can also calculate global attribute weights that are model-specific but model-agnostic. Model-agnostic means that the weights can be calculated for all model types, whereas other model-specific weight calculations only work for particular models (like Random Forest). Model-specific means that the weights are calculated specifically for this model instead of using model-independent weighting schemes like correlations.

The operator derives these weights directly from the local explanations. If the true labels are known for the test data, all supporting local explanations contribute positively to the weights for correct predictions, and all contradicting local explanations contribute positively to the weights for wrong predictions. If the true labels are not known, the global weights are built from the supporting local weights only.
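
A minimal sketch of that aggregation rule, continuing the assumptions of the previous snippet, could look as follows; the array shapes and the function name are illustrative and not part of RapidMiner's API.

```python
import numpy as np

def global_attribute_weights(local_weights, predictions, labels=None, normalize=True):
    """Aggregate per-row local weights into global attribute weights (sketch).

    local_weights : (n_rows, n_attributes) array, e.g. from local_explanation above
    predictions   : model predictions for the same rows
    labels        : true labels if known, otherwise None
    """
    weights = np.zeros(local_weights.shape[1])
    for row, lw in enumerate(local_weights):
        if labels is None:
            weights += np.clip(lw, 0, None)      # no labels: only supporting parts count
        elif predictions[row] == labels[row]:
            weights += np.clip(lw, 0, None)      # correct prediction: supporting parts add
        else:
            weights += np.clip(-lw, 0, None)     # wrong prediction: contradicting parts add
    if normalize and weights.max() > 0:
        weights = weights / weights.max()        # cf. the "normalize global weights" parameter
    return weights
```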

This operator works with all data types and data sizes. It supports both classification and regression problems. The only model type which is not recommended is k-Nearest Neighbors, since this model typically suffers from long runtimes for scoring.

Input

  • model (Model)

    This input port expects a model for which the predictions should be explained. The model is also applied to the test data automatically, generating all predictions and confidences together with the explanations.

  • training data (IOObject)

    This input port expects either the ExampleSet that was used to train the model, or a Statistics object built from the original training data with the help of the Statistics operator. These statistics define the ranges for creating the neighboring set of data points used to derive the local importance of factors.

  • test data (Data Table)

    This input port expects an ExampleSet with test data. This data will get predictions, confidences (in case of classification), and the explanations for those predictions.

Output

  • visualization output

    This output port delivers the test data with predictions and color highlighting of attributes: green when the value of the attribute supports the prediction, and red when the value of the attribute contradicts the prediction.

  • example set output (Data Table)

    This output port delivers the test data with predictions and two extra columns: one that details the attributes that support the prediction and one that details the attributes that contradict the prediction.

  • importances output (Data Table)

    This output port delivers the test data in a long table format including the importance of all attributes for each row. This can be useful if the data should be visualized later on.

  • global weights output (Attribute Weights)

    This output port delivers a set of weights for all attributes. Attributes that frequently and strongly supported correct predictions (if the true labels are known) receive higher weights.

Parameters

  • maximal explaining attributes The maximal number of attributes used to support a prediction, and also the maximal number of attributes used to contradict it. The whole point of explanations is that they let you focus on the factors that matter in each particular case. We recommend a value of 3 to achieve this, but you can increase the number if you feel you need more factors to explain the predictions. Note that you may end up with fewer factors if fewer attribute values than this maximum support or contradict a prediction; see the sketch after this list. Range: integer
  • local sample size The number of locally generated samples around each prediction data point used to identify the attributes with the biggest impact on that decision. You may want to increase this number for high-dimensional data sets if the quality of the explanations becomes worse. Please note that the runtime of this algorithm increases with higher numbers. In general, a value of around 500 delivers high-quality explanations in a reasonable amount of time. Range: integer
  • only create predictions Indicates if only predictions should be created without the explanations. The operator then behaves like Apply Model for prediction models. This can be useful since calculating the explanations takes a lot of time. Range: boolean
  • normalize global weights Indicates if the global weights should be normalized to a range between 0 and 1. Range: boolean
  • sort_weights Indicates if the resulting weights should be sorted. Range: boolean
  • sort_direction The sorting direction for the global weights, i.e. either ascending or descending weight values. Range: selection
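
As a companion to the maximal explaining attributes parameter above, the following sketch shows how such a limit could be applied to local weights like those from the first snippet; the function name and the default of 3 are illustrative assumptions, not RapidMiner API.

```python
import numpy as np

def top_factors(local_w, attribute_names, k=3):
    """Pick at most k supporting and k contradicting attributes (illustrative sketch)."""
    order = np.argsort(local_w)                  # ascending by local weight
    supporting    = [attribute_names[j] for j in order[::-1][:k] if local_w[j] > 0]
    contradicting = [attribute_names[j] for j in order[:k] if local_w[j] < 0]
    return supporting, contradicting
```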

Tutorial Processes

Explaining Predictions for Titanic

This process trains a Naive Bayes model on the Titanic data. It then uses the Explain Predictions operator to create the predictions and all local explanations for the second data set.

You can see the three results. The first result is the set of global weights summarizing how much each factor contributed to the local explanations. Factors that contributed often and strongly are in general more important for this model. We can see that gender plays an important role for the predictions of this model as well.

The second result is the data with additional columns for the predictions, the confidences, and the new explanations. The last result directly visualizes the explanations with colors: green marks a value that strongly supports the prediction, and red marks a value that contradicts it. Have a look at the third row, for example. The model predicts "Yes" for survival despite the fact that the gender is male. In general, most men died in the accident, so the model made this prediction based on the other values: in this case, the age of 71, the amount of money paid, and the fact that this person traveled without parents or children.