API changes in RapidMiner 9.8
From ExampleSet to Belt Table
Forget about the ExampleSet class and start using com.rapidminer.belt.table.Table, RapidMiner's new representation of example sets. The corresponding framework is called Belt.
It comes with several advantages compared to ExampleSet:
- Column-oriented design: a column-oriented data layout allows for using compact representations for the different column types.
- Immutability: all columns and tables are immutable. This not only guarantees data integrity but also allows for safely reusing components, e.g., multiple tables can safely reference the same column.
- Thread-safety: all public data structures are thread-safe and designed to perform well when used concurrently.
- Implicit parallelism: much of Belt's built-in functionality, such as the transformations shown in the examples below, automatically scales out to multiple cores.
To learn everything about the Belt framework please refer to the official documentation of the Belt project.
This page focuses on the differences between the old example set and the new Belt framework and presents some examples of how to implement operators using the Belt framework and the Table class.
If you are new to extension development for RapidMiner Studio, the "Create your own extension" guide is a great starting point.
Sum operator example
Let's start with an example. We will create an operator that takes a table with only numeric columns, calculates the sum for each row and adds these row sums as a new column to the resulting table.
We start with the doWork() method. You receive the input table by calling:
IOTable ioTable = tableInput.getData(IOTable.class);
Table table = ioTable.getTable();
You need not worry if the actual data at the port is an IOTable or an ExampleSet since RapidMiner will automatically convert it to the requested format.
This makes the collaboration between new operators working on Tables and old operators working on ExampleSets easy.
Then, to keep the code a little cleaner, we delegate the actual work to the calculateSum method.
// read table, calculate sum and return new table
Table result = calculateSum(table);
Now deliver the resulting table to the output port.
IOTable newIOTable = new IOTable(result);
newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
tableOutput.deliver(newIOTable);
Since the Table class itself is not an IOObject, we need to wrap it with the IOTable class. It is also important to copy the annotations of the input IOTable to the new IOTable because otherwise they will be lost.
Finally, it is good practice to also deliver the input table to an output port:
originalOutput.deliver(ioTable);
That's the doWork() method.
Let's move on to implementing the calculateSum(Table table) method.
First of all, check that the given Table contains only numeric columns.
The BeltErrorTools class holds some convenience methods for these kinds of checks.
BeltErrorTools.onlyNumeric(table, getName(), this);
Next, we will determine whether the result will be of type real or integer.
If any column is of type real, the result will also be of type real.
The table provides a ColumnSelector that can be accessed via the select() method.
A column selector can be used to filter the columns of a table via predicates.
The default predicates filter regarding type, category, capability and meta data (e.g. roles).
You can even define your own predicates for custom filter operations.
The ofTypeId method does the trick:
boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();
Since the Column class is immutable, we need a column buffer to fill and instantiate a new column:
NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height())
        : Buffers.integer53BitBuffer(table.height());
Tables can be read column-wise or row-wise. In this case we want to read it row-wise so that we can calculate the sum for each row:
NumericRowReader reader = Readers.numericRowReader(table);
for (int i = 0; i < buffer.size(); i++) {
    // move must be called to advance the reader to the next row
    reader.move();
    double sum = 0;
    for (int j = 0; j < reader.width(); j++) {
        // reader.get(j) returns the value of the j-th column of the row
        sum += reader.get(j);
    }
    buffer.set(i, sum);
}
The move method advances the reader to the next row. Please note that it must be called before the first row is read.
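The cursor behaviour described above can be illustrated without Belt. The following is a minimal plain-Java sketch, not Belt's implementation: the class CursorRowReader and its backing double[][] are hypothetical stand-ins, chosen only to show why a reader that starts in front of the first row needs one move() call before the first get.

```java
/** Hypothetical stand-in for a row reader; NOT part of the Belt API. */
final class CursorRowReader {

    private final double[][] rows;
    // the cursor starts in front of the first row, so no row is readable yet
    private int position = -1;

    CursorRowReader(double[][] rows) {
        this.rows = rows;
    }

    /** Advances the cursor to the next row. */
    void move() {
        position++;
    }

    /** Returns the value of the j-th column of the current row. */
    double get(int j) {
        if (position < 0 || position >= rows.length) {
            throw new IllegalStateException("cursor is not positioned on a row");
        }
        return rows[position][j];
    }

    /** Number of columns per row (0 for an empty table). */
    int width() {
        return rows.length == 0 ? 0 : rows[0].length;
    }

    /** Sums every row, mirroring the loop structure of the example above. */
    static double[] rowSums(double[][] rows) {
        CursorRowReader reader = new CursorRowReader(rows);
        double[] sums = new double[rows.length];
        for (int i = 0; i < sums.length; i++) {
            reader.move();
            double sum = 0;
            for (int j = 0; j < reader.width(); j++) {
                sum += reader.get(j);
            }
            sums[i] = sum;
        }
        return sums;
    }

    public static void main(String[] args) {
        double[] sums = rowSums(new double[][] {{1, 2, 3}, {4, 5, 6}});
        System.out.println(java.util.Arrays.toString(sums));
    }
}
```

Starting the cursor at -1 keeps the loop body uniform: every iteration, including the first, begins with move().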
We have calculated the row sums and filled them into the buffer. Next, copy the original table and add a new column to it.
Since the Table class is immutable, we will use a table builder:
TableBuilder builder = Builders.newTableBuilder(table);
builder.add("Sum", buffer.toColumn());
Please note that the data stored in the buffer can no longer be modified after calling the toColumn method. Attempting to do so will lead to an exception.
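One way to picture this write-once rule is a buffer that freezes itself when it is converted. The sketch below is plain Java and purely illustrative: FreezableBuffer and ImmutableColumn are hypothetical names, and the use of IllegalStateException is an assumption, since the text above only says that an exception is thrown.

```java
/** Hypothetical write-once buffer; a plain-Java illustration, NOT the Belt API. */
final class FreezableBuffer {

    private final double[] data;
    private boolean frozen = false;

    FreezableBuffer(int size) {
        this.data = new double[size];
    }

    /** Writes a value; only allowed as long as toColumn() has not been called. */
    void set(int index, double value) {
        if (frozen) {
            // assumption: the concrete exception type used by Belt may differ
            throw new IllegalStateException("buffer was already converted to a column");
        }
        data[index] = value;
    }

    /** Freezes the buffer and hands out a read-only view of its values. */
    ImmutableColumn toColumn() {
        frozen = true;
        return new ImmutableColumn(data);
    }

    /** Read-only wrapper; it may share the array because no further writes are possible. */
    static final class ImmutableColumn {

        private final double[] values;

        private ImmutableColumn(double[] values) {
            this.values = values;
        }

        double get(int index) {
            return values[index];
        }

        int size() {
            return values.length;
        }
    }

    public static void main(String[] args) {
        FreezableBuffer buffer = new FreezableBuffer(2);
        buffer.set(0, 1.5);
        buffer.set(1, 2.5);
        ImmutableColumn column = buffer.toColumn();
        System.out.println(column.get(0) + column.get(1));
    }
}
```

Freezing on conversion lets the column reuse the buffer's array without copying, while still guaranteeing that the column never changes afterwards.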
Nearly done! All that's left to do is to build and to return the table.
And this is where Belt's implicit parallelism comes into play.
The build method takes the operator's context, which can be accessed via the BeltTools class, and runs the build process in parallel.
Table result = builder.build(BeltTools.getContext(this));
return result;
This concludes the example. Since ExampleSetMetaData is, for now, used as the meta data class for Belt tables, we will not go through the meta data transformation in detail. The complete source of the operator follows.
import com.rapidminer.adaption.belt.IOTable;
import com.rapidminer.belt.buffer.Buffers;
import com.rapidminer.belt.buffer.NumericBuffer;
import com.rapidminer.belt.column.Column;
import com.rapidminer.belt.reader.NumericRowReader;
import com.rapidminer.belt.reader.Readers;
import com.rapidminer.belt.table.Builders;
import com.rapidminer.belt.table.Table;
import com.rapidminer.belt.table.TableBuilder;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.UserError;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.ports.metadata.AttributeMetaData;
import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
import com.rapidminer.operator.ports.metadata.MetaData;
import com.rapidminer.operator.ports.metadata.MetaDataInfo;
import com.rapidminer.operator.ports.metadata.PassThroughRule;
import com.rapidminer.operator.ports.metadata.SimplePrecondition;
import com.rapidminer.tools.Ontology;
import com.rapidminer.tools.belt.BeltErrorTools;
import com.rapidminer.tools.belt.BeltTools;
/**
 * This operator takes a {@link Table} with only numeric columns, calculates the sum for each row
 * and adds it as a new column.
 *
 * @author Kevin Majchrzak
 * @since 9.8
 */
public class SumOperator extends Operator {

    private final InputPort tableInput = getInputPorts().createPort("example set input");
    private final OutputPort tableOutput = getOutputPorts().createPort("example set output");
    private final OutputPort originalOutput = getOutputPorts().createPort("original");

    public SumOperator(OperatorDescription description) {
        super(description);
        // we want example set meta data as input
        tableInput.addPrecondition(new SimplePrecondition(tableInput, new ExampleSetMetaData()));
        // pass through the original data
        getTransformer().addPassThroughRule(tableInput, originalOutput);
        // generate meta data for new table
        getTransformer().addRule(new PassThroughRule(tableInput, tableOutput, true) {
            @Override
            public MetaData modifyMetaData(MetaData metaData) {
                if (metaData instanceof ExampleSetMetaData) {
                    ExampleSetMetaData emd = (ExampleSetMetaData) metaData;
                    boolean resultIsReal = emd.containsAttributesWithValueType(Ontology.REAL, true)
                            != MetaDataInfo.NO;
                    AttributeMetaData sumAttribute = resultIsReal ? new AttributeMetaData("Sum", Ontology.REAL)
                            : new AttributeMetaData("Sum", Ontology.INTEGER);
                    emd.addAttribute(sumAttribute);
                }
                return metaData;
            }
        });
    }

    @Override
    public void doWork() throws OperatorException {
        // fetch table from input port
        IOTable ioTable = tableInput.getData(IOTable.class);
        Table table = ioTable.getTable();
        // read table, calculate sum and return new table
        Table result = calculateSum(table);
        // wrap the result into an IOTable
        IOTable newIOTable = new IOTable(result);
        // copy the annotations from the original IOTable
        newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
        // deliver the new IOTable to the port
        tableOutput.deliver(newIOTable);
        // deliver original table to corresponding port
        originalOutput.deliver(ioTable);
    }

    /**
     * Takes a {@link Table} with only numeric columns, calculates the sum for each row and adds it as a new column.
     *
     * @param table
     *            the original table
     * @return a new table with the original columns and a sum column
     * @throws UserError
     *             if the table contains non-numeric columns
     */
    private Table calculateSum(Table table) throws UserError {
        // check that all columns are numeric
        BeltErrorTools.onlyNumeric(table, getName(), this);
        // If any column is of type real the result will be real. Otherwise, it will be integer.
        boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();
        // initialize numeric buffer needed to create sum column
        NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height())
                : Buffers.integer53BitBuffer(table.height());
        // read the table row-wise and store the sum of each row in the buffer
        NumericRowReader reader = Readers.numericRowReader(table);
        for (int i = 0; i < buffer.size(); i++) {
            // move must be called to advance the reader to the next row
            reader.move();
            double sum = 0;
            for (int j = 0; j < reader.width(); j++) {
                // reader.get(j) returns the value of the j-th column of the row
                sum += reader.get(j);
            }
            buffer.set(i, sum);
        }
        // copy original table using table builder
        TableBuilder builder = Builders.newTableBuilder(table);
        // add the new column to the builder
        builder.add("Sum", buffer.toColumn());
        // build the new table in parallel using the operator's context
        Table result = builder.build(BeltTools.getContext(this));
        return result;
    }
}
In this example you have seen how to fetch a table from a port and deliver a table to a port, how to read a table and process its data, how to create a new column using a buffer, and how to return a modified table using the TableBuilder class.
There are alternative ways to implement the operator, of course. Look, for example, at the following code:
private Table calculateSum(Table table) throws UserError {
    // check that all columns are numeric
    BeltErrorTools.onlyNumeric(table, getName(), this);
    // If any column is of type real the result will be real. Otherwise, it will be integer.
    boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();
    // this function will be applied in parallel to the table rows
    ToDoubleFunction<NumericRow> sumUpRow = row -> {
        double sum = 0;
        for (int j = 0; j < row.width(); j++) {
            sum += row.get(j);
        }
        return sum;
    };
    // the results will be collected in a numeric buffer
    NumericBuffer buffer;
    if (resultIsReal) {
        buffer = table.transform().applyNumericToReal(sumUpRow, BeltTools.getContext(this));
    } else {
        buffer = table.transform().applyNumericToInteger53Bit(sumUpRow, BeltTools.getContext(this));
    }
    // copy original table using table builder
    TableBuilder builder = Builders.newTableBuilder(table);
    // add the new column to the builder
    builder.add("Sum", buffer.toColumn());
    // build the new table in parallel using the operator's context
    Table result = builder.build(BeltTools.getContext(this));
    return result;
}
This code uses the Table's transform method and a row transformer to achieve the same results as the calculateSum method presented earlier.
Details on the transform method can be found in the Belt documentation.
Using the transform method comes with the additional advantage that the summation can take place in parallel.
Belt once again makes use of the operator's context to automatically decide if and how to parallelize the computation.
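The idea of handing a per-row function to a framework that decides how to run it can be sketched in plain Java. This is not how Belt is implemented: the class ParallelRowTransform is hypothetical, rows are modelled as double[] arrays, and the common fork-join pool of parallel streams stands in for the operator's context.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;
import java.util.stream.IntStream;

/** Plain-Java illustration of applying a row function in parallel; NOT the Belt API. */
final class ParallelRowTransform {

    /** Applies the function to every row and collects the results by row index. */
    static double[] applyToRows(double[][] rows, ToDoubleFunction<double[]> function) {
        double[] result = new double[rows.length];
        // each index is written by exactly one task, so the writes do not conflict
        IntStream.range(0, rows.length).parallel()
                .forEach(i -> result[i] = function.applyAsDouble(rows[i]));
        return result;
    }

    public static void main(String[] args) {
        // the row function only sees a single row and has no shared mutable state
        ToDoubleFunction<double[]> sumUpRow = row -> {
            double sum = 0;
            for (double value : row) {
                sum += value;
            }
            return sum;
        };
        double[][] rows = {{1, 2}, {3, 4}, {5, 6}};
        System.out.println(Arrays.toString(applyToRows(rows, sumUpRow)));
    }
}
```

Because the row function is free of side effects, the caller does not need to know whether the rows are processed sequentially or concurrently.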
The next example shows how to use generators to fill columns and how to add meta data like, for example, roles to a table.
ID generator example
Next, let's implement an operator that takes a table and adds an ID column to it. Here is the code of its doWork() method:
@Override
public void doWork() throws OperatorException {
    // fetch table from input port and initialize builder
    IOTable ioTable = tableInput.getData(IOTable.class);
    Table table = ioTable.getTable();
    TableBuilder builder = Builders.newTableBuilder(table);
    // add id column via generator
    builder.addInt53Bit("ID", i -> i);
    // set column role
    builder.addMetaData("ID", ColumnRole.ID);
    // add annotations and deliver results
    Table result = builder.build(BeltTools.getContext(this));
    IOTable newIOTable = new IOTable(result);
    newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
    tableOutput.deliver(newIOTable);
    // deliver original table to corresponding port
    originalOutput.deliver(ioTable);
}
We fetch the input table and initialize the builder with it just as we did before. Then we add the ID column via:
builder.addInt53Bit("ID", i -> i);
This line of code makes use of one of the table builder's convenience methods that takes a label and a generator and automatically fills the column.
Furthermore, it does not fill the column straight away but does so later when the build method is called.
Thereby, the builder can fill all columns in parallel.
Let's take a closer look at the generator.
For numeric column types it is represented via an IntToDoubleFunction.
The generator consumes a row index and returns the value for that row.
Our implementation returns the row index itself as the result and, thereby, generates ids from 0 to the number of rows - 1.
Similar generator methods for other column types are also available in the table builder.
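To make the generator contract concrete, here is a small plain-Java sketch built on java.util.function.IntToDoubleFunction. The class GeneratedColumn and its fill method are hypothetical and only imitate the idea of a builder that evaluates the generator once per row index; the parallel stream is an assumption standing in for whatever the builder does internally.

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;
import java.util.stream.IntStream;

/** Plain-Java illustration of filling a column from a generator; NOT the Belt API. */
final class GeneratedColumn {

    /** Evaluates the generator for the row indices 0 to height - 1, in row order. */
    static double[] fill(int height, IntToDoubleFunction generator) {
        // toArray() keeps the index order even though the evaluation may run in parallel
        return IntStream.range(0, height).parallel().mapToDouble(generator).toArray();
    }

    public static void main(String[] args) {
        // the identity generator yields the ids 0, 1, ..., height - 1
        double[] ids = fill(4, i -> i);
        // any other function of the row index works the same way
        double[] squares = fill(4, i -> (double) i * i);
        System.out.println(Arrays.toString(ids));
        System.out.println(Arrays.toString(squares));
    }
}
```

Since the generator depends only on the row index, the values can be computed in any order, which is what allows the filling to be deferred and parallelized.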
The next step is to set the column's role to ColumnRole.ID.
The builder's addMetaData method takes a column label and meta data to attach to the corresponding column.
Since ColumnRole implements ColumnMetaData, it can be attached via this method.
Finally, the resulting table is wrapped into an IOTable, the annotations are copied and the table is delivered to the output port.
ColumnMetaData
ColumnMetaData represents additional information that can be attached to columns. The classes implementing the ColumnMetaData interface by default are:
- ColumnRole: Representing the roles used in Studio to mark special columns like, for example, labels.
- ColumnAnnotation: A textual description of the column.
- ColumnReference: A reference to another column that is somehow related to the column. An example would be a prediction column referencing the label column that it refers to.
Custom meta data can be added to the columns by implementing the ColumnMetaData interface.
Please note that column annotations and references are not visualized in RapidMiner Studio yet, but we plan on doing so in the near future.
Two important changes have been made to column roles. Firstly, roles need not be unique anymore. A table can have multiple label, prediction and even id columns. This comes in handy, e.g., when working with learners that expect multiple labels. Secondly, in Belt the set of column roles is fixed to BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT, INTERPRETATION, ENCODING, SOURCE and METADATA. While the first eleven of them are the default roles, METADATA stands for anything other than the known roles. Columns marked as METADATA will usually be ignored by operators (e.g. when creating models). Legacy roles that do not exist in Belt will be mapped to METADATA.
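The consequence of non-unique roles for operator code is that a role lookup returns a list of columns rather than a single column. The following plain-Java sketch illustrates this; the class RoleLookup, its Role enum (which merely repeats the twelve role names listed above) and the example column labels are hypothetical and not part of the Belt API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of non-unique column roles; NOT the Belt API. */
final class RoleLookup {

    /** The twelve role names listed in the text above. */
    enum Role {
        BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT,
        INTERPRETATION, ENCODING, SOURCE, METADATA
    }

    /** Returns the labels of all columns carrying the wanted role, in insertion order. */
    static List<String> columnsWithRole(Map<String, Role> roles, Role wanted) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Role> entry : roles.entrySet()) {
            if (entry.getValue() == wanted) {
                labels.add(entry.getKey());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // hypothetical table with two label columns and one id column
        Map<String, Role> roles = new LinkedHashMap<>();
        roles.put("customer", Role.ID);
        roles.put("churn", Role.LABEL);
        roles.put("revenue", Role.LABEL);
        System.out.println(columnsWithRole(roles, Role.LABEL));
    }
}
```

Code written against unique roles typically assumes at most one match, so an operator that supports only a single label has to decide explicitly how to handle the case of several.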
Automatic conversion between Table and ExampleSet
A Table will be converted to an ExampleSet and vice versa, depending on the format in which the operator requests the data from a port.
This conversion is very efficient, so in most cases it will not impact the overall performance of a process.
Please note:
- Since ExampleSet expects roles to be unique, non-unique roles will have an index appended to their name when converting from Table to ExampleSet. When such a role is converted back at a later point in the process, the unnecessary index will automatically be removed.
- Attribute / column types will be mapped to the next best representation in the converted format. Some of the Belt column types do not have a representation in the old API. Therefore, attempting to deliver an IOTable holding column types not included in BeltConverter.STANDARD_TYPES will lead to an exception. This restriction may be removed in one of the future releases.
MetaData class for IOTables
Up to this point, ExampleSetMetaData is the MetaData class used to describe IOTables at the operator ports.
This works reasonably well because ExampleSet and Table both represent data tables and are conceptually similar.
Nevertheless, in the near future we will release an IOTable-specific meta data class that can better represent the new Belt tables.