RapidMiner Studio documentation, version 10.1
Generate Attributes (Blending)
Synopsis
This operator constructs new user-defined attributes using mathematical expressions.
Description
The Generate Attributes operator constructs new attributes from the attributes of the input ExampleSet and arbitrary constants using mathematical expressions. The attribute names of the input ExampleSet can be used as variables in these expressions. When the operator is applied, each expression is evaluated once per row, with the variables filled with that row's attribute values. If a variable is undefined in an expression, the entire expression becomes undefined and '?' is stored in its place.
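The row-by-row evaluation described above can be imitated in a few lines of Python. This is only an illustrative sketch, not the operator's implementation: `generate_attribute`, the sample rows and the use of `None` for '?' are all hypothetical, and it ignores functions that handle undefined values explicitly.

```python
def generate_attribute(rows, variables, func):
    """Evaluate func once per row; an undefined input makes the result undefined."""
    results = []
    for row in rows:
        values = [row.get(name) for name in variables]
        if any(value is None for value in values):
            # None stands in for the '?' shown for undefined values
            results.append(None)
        else:
            results.append(func(*values))
    return results


# Hypothetical ExampleSet with two numeric attributes, 'a' and 'b'
rows = [
    {"a": 1.0, "b": 2.0},
    {"a": 4.0, "b": None},  # 'b' is undefined in this row
]

total = generate_attribute(rows, ["a", "b"], lambda a, b: a + b)
print(total)  # [3.0, None]
```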
Please note that there are some restrictions for the attribute names in order to let this operator work properly:
- Attribute names containing dashes '-' or other special characters, or having the same name as a constant (e.g. 'e' or 'pi') must be placed in square brackets e.g. '[weird-name]' or '[pi]'.
- Attribute names containing square brackets or backslashes must be placed in square brackets and the square brackets and backslashes inside the name must be escaped, e.g. '[a\\tt\[1\]]' for an attribute 'a\tt[1]'.
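The two naming rules above can be sketched as a small Python helper that turns an attribute name into the form expected inside an expression. The helper name, the set of constants (only the two named in this text) and the definition of a "plain" name are assumptions made for illustration.

```python
import re

# Assumption: only the constants named in this text; the full list is in the Edit Expression dialog
CONSTANTS = {"e", "pi"}
# Assumption: a name needs no brackets if it consists of letters, digits and underscores
PLAIN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_attribute_name(name):
    """Return the name as it would be written inside an expression."""
    if name not in CONSTANTS and PLAIN_NAME.fullmatch(name):
        return name
    # Escape backslashes first so the escapes added for brackets are not doubled
    escaped = name.replace("\\", "\\\\").replace("[", "\\[").replace("]", "\\]")
    return "[" + escaped + "]"


print(quote_attribute_name("weird-name"))  # [weird-name]
print(quote_attribute_name("pi"))          # [pi]
print(quote_attribute_name(r"a\tt[1]"))    # [a\\tt\[1\]]
```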
If you want to apply this operator but the attribute names of your ExampleSet do not fulfill the above-mentioned conditions, you can rename the attributes with the Rename operator before applying the Generate Attributes operator. When renaming several attributes that follow a certain pattern, the Rename by Replacing operator might prove useful.
A large number of operations and functions are supported, which allows you to write rich expressions. For a list of operations and functions and their descriptions, open the Edit Expression dialog. Complicated expressions can be created by combining multiple operations and functions. Parentheses can be used to nest operations.
This operator also supports various constants (for example 'INFINITY', 'PI' and 'e'). Again, you can find a complete list in the Edit Expression dialog. You can also use strings in operations, but string values must be enclosed in double quotes (").
Input
- table input
This input port expects an ExampleSet.
Output
- table output
The resulting ExampleSet with new attributes is delivered to this port.
- original
The ExampleSet that was given as input is passed through this port to the output unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Parameters
- function_descriptions The list of functions for generating new attributes is provided here. Range:
- keep_all If set to true, all the original attributes are kept, otherwise they are removed from the output ExampleSet. Range: boolean
Tutorial Processes
Generating attributes through different function descriptions
The 'Labor-Negotiations' data set is loaded using the Retrieve operator.
Now have a look at the Generate Attributes operator's parameters. The keep all parameter is checked, thus all attributes of the 'Labor-Negotiations' data set are also kept along with attributes generated by the Generate Attributes operator.
Click on the Edit List button of the function descriptions parameter to have a look at descriptions of functions defined for generating new attributes. There might be better ways of generating these attributes but here they are used to explain the usage of the different types of functions available in the Generate Attributes operator. Please read the function description of each attribute and then see the values of the corresponding attribute in the Results Workspace to understand it completely. Here is a description of attributes created by this operator:
The 'average wage-inc' attribute takes the sum of the wage-inc-1st, wage-inc-2nd and wage-inc-3rd attribute values and divides it by 3, which gives the average wage increment. There are better ways of doing this; the example only illustrates some basic functions.
The 'neglected worker bool' attribute is a boolean attribute, i.e. it has only two possible values, '0' and '1'. It shows the usage of logical operations like 'AND' and 'OR'. The attribute takes the value '1' if three conditions are satisfied: first, the working-hours attribute has a value of 35 or more; second, the education-allowance attribute is not equal to 'yes'; third, the vacation attribute has the value 'average' or 'below-average'. If any of these conditions is not satisfied, the attribute gets the value '0'.
The 'logarithmic attribute' attribute shows the usage of the base-10 logarithm and natural logarithm functions.
The 'trigno attribute' attribute shows the usage of trigonometric functions like sine and cosine.
The 'rounded average wage-inc' attribute uses the avg function to average the wage increments and then the round function to round the result.
The 'vacations' attribute uses the replaceAll function to replace all occurrences of the value 'generous' with 'above-average' in the 'vacation' attribute.
The 'deadline' attribute shows the usage of the If-Then-Else and Date functions. It takes the value of the current date plus 25 days if the class attribute has the value 'good'; otherwise, it stores the current date plus 10 days.
The 'shift complete' attribute shows the usage of the If-Then-Else, random, floor and missing functions. It holds the values of the shift-differential attribute, but without missing values: missing values are replaced with a random number between 0 and 25.
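The logic behind the 'average wage-inc', 'neglected worker bool' and 'deadline' attributes described above can be restated in Python. This is a sketch of the described behaviour only; the function names and the explicit `today` argument are hypothetical and do not reproduce the actual function descriptions stored in the tutorial process.

```python
from datetime import date, timedelta


def average_wage_inc(first, second, third):
    """Sum of the three wage increments divided by 3."""
    return (first + second + third) / 3


def neglected_worker(working_hours, education_allowance, vacation):
    """Return 1 only if all three conditions hold, otherwise 0."""
    satisfied = (
        working_hours >= 35
        and education_allowance != "yes"
        and (vacation == "average" or vacation == "below-average")
    )
    return 1 if satisfied else 0


def deadline(class_value, today):
    """Current date plus 25 days for class 'good', plus 10 days otherwise."""
    days = 25 if class_value == "good" else 10
    return today + timedelta(days=days)


print(average_wage_inc(2.0, 3.0, 4.0))           # 3.0
print(neglected_worker(40, "no", "average"))     # 1
print(neglected_worker(40, "yes", "average"))    # 0
print(deadline("good", date(2024, 1, 1)))        # 2024-01-26
```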
The 'remaining_holidays' attribute stores the difference between 15 and the statutory-holidays attribute value.
The 'remaining_holidays_percentage' attribute uses the 'remaining_holidays' attribute to find the percentage of remaining holidays. It shows that attributes created in this Generate Attributes operator can be used to generate further attributes in the same operator.
The 'constants' attribute shows the usage of constants like 'e' and 'PI'.
The 'cut' attribute shows the usage of the cut function. If you want to specify a string, place it in double quotes (""), as in the last term of this attribute's expression. If you want to specify the name of an attribute, do not place it in quotes. The first term of the expression cuts the first two characters from the 'class' attribute values, because the attribute name is not placed in quotes. The last term selects the first two characters of the string 'class'. Since these are 'cl', 'cl' is appended at the end of this attribute's values. The middle term concatenates a blank space between the results of the first and last terms.
The 'index' attribute shows the usage of the index function. If the 'class' attribute has the value 'no', 1 is stored because 'o' is at index 1 (counting from 0). If the 'class' attribute has the value 'yes', -1 is stored because 'o' is not present in this value.
The 'date string' attribute shows the usage of the date_str function. It prints the date of the 'deadline' attribute in a manually specified format and the English locale. Please note that you need to specify the time zone used to format the string (CET in this example).
The 'macro' attribute shows how to use macros in functions.
The 'macro eval' attribute shows how to use macros that contain a number. The macro function %{} always returns a string, so to obtain the number you have to use the eval function or the parse function.
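The behaviour described for the 'cut' and 'index' attributes corresponds to ordinary substring extraction and substring search. The Python helpers below are hypothetical stand-ins; in particular, the argument order (text, start, length) is an assumption, since the text above does not give the signature of the cut function.

```python
def cut(text, start, length):
    """Return `length` characters of `text`, beginning at 0-based position `start`."""
    return text[start:start + length]


def index(text, search):
    """Return the 0-based position of `search` in `text`, or -1 if it is absent."""
    return text.find(search)


def cut_attribute(class_value):
    # First term: the attribute's value; last term: the literal string "class"
    return cut(class_value, 0, 2) + " " + cut("class", 0, 2)


print(cut_attribute("good"))  # go cl
print(index("no", "o"))       # 1
print(index("yes", "o"))      # -1
```

The difference between the first and last terms of `cut_attribute` mirrors the quoting rule: an unquoted name refers to the attribute's value in the current row, while a quoted name is just a string.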
The 'expression eval' attribute shows the usage of the eval function. If a string contains an expression, for example one coming from a macro %{expression}, you can evaluate it by using the eval function.
The 'macro with attribute' attribute shows the usage of the #{} function. If a macro contains the name of an attribute, you can use this attribute in your expression by writing #{attribute_macro}, where attribute_macro is the macro containing the attribute name. Using eval(%{attribute_macro}) would lead to the same result, but the #{} function fails when the macro does not contain an attribute name, while eval(%{attribute_macro}) evaluates whatever is contained in the macro.
The 'row number' attribute shows the usage of the row_number function, which returns the number of the current row. This can be useful, for example, to handle certain rows differently.
The 'lead' attribute shows the usage of the lead function, which lets you access attribute values on other rows. This can be useful, for example, when working with time series. In this example, the value of the row number attribute in the following row is accessed. Since the last row has no following row, the last row contains a missing value.
The 'lag' attribute shows the usage of the lag function. It works similarly to the lead function, but it inverts the given offset. Therefore, with offset 1, the value of the preceding row is accessed. Since the first row has no preceding row, the first row contains a missing value.
The lead and lag functions can be used to define attributes recursively, which means they can access values of the attribute that is currently being calculated. A typical example of a recursive definition is the 'Fib' attribute, which calculates the famous Fibonacci numbers. Please note that the expected result type must be specified when using recursive definitions.
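The lead, lag and recursive 'Fib' behaviour described above can be sketched over a plain Python list that plays the role of one attribute column. The helper names mirror the functions in the text, but the signatures, the 0-based row argument and the seed values of the sequence (1, 1) are assumptions made for this illustration.

```python
def lag(column, row, offset=1):
    """Value `offset` rows before `row` (0-based), or None if no such row exists."""
    position = row - offset
    if 0 <= position < len(column):
        return column[position]
    return None


def lead(column, row, offset=1):
    """Like lag, but with the offset inverted."""
    return lag(column, row, -offset)


def fibonacci_column(n_rows):
    """Build a column recursively: each row reads rows that were already computed."""
    fib = []
    for row in range(n_rows):
        previous = lag(fib, row, 1)
        before_previous = lag(fib, row, 2)
        if previous is None or before_previous is None:
            fib.append(1)  # assumed seed for the first two rows
        else:
            fib.append(previous + before_previous)
    return fib


row_numbers = [1, 2, 3, 4, 5]
lead_column = [lead(row_numbers, r) for r in range(len(row_numbers))]
lag_column = [lag(row_numbers, r) for r in range(len(row_numbers))]

print(lead_column)          # [2, 3, 4, 5, None]
print(lag_column)           # [None, 1, 2, 3, 4]
print(fibonacci_column(7))  # [1, 1, 2, 3, 5, 8, 13]
```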
The cell_value function works similarly to lead and lag, but it lets you choose the index you want to access directly. The 'cell value' attribute shows how you can use cell_value to pick the seventh Fibonacci number from the 'Fib' attribute. (Please note that the row indices start at 1.) In combination with the row_number function, the cell_value function can be used to build interesting expressions. For example, the 'loop Fib' attribute shows how to loop over the first 7 Fibonacci numbers of the 'Fib' attribute.
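A minimal Python sketch of direct cell access with 1-based row indices follows. The modulo formula used for the looping column is only a guess at how such a loop might be expressed; the text above does not show the actual expression of the 'loop Fib' attribute, and the sample column values assume a sequence starting with 1, 1.

```python
def cell_value(column, index):
    """Return the value in row `index` (1-based), or None if the row does not exist."""
    if 1 <= index <= len(column):
        return column[index - 1]
    return None


# Hypothetical 'Fib' column with ten rows
fib = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

seventh = cell_value(fib, 7)

# Assumed looping scheme: rows 1..7 map to themselves, row 8 maps back to row 1, and so on
loop_fib = [
    cell_value(fib, (row_number - 1) % 7 + 1)
    for row_number in range(1, len(fib) + 1)
]

print(seventh)   # 13
print(loop_fib)  # [1, 1, 2, 3, 5, 8, 13, 1, 1, 2]
```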