What's New in RapidMiner Studio 9.4.0?
Released: Sep 25th, 2019
The following describes the new features, enhancements, bug fixes, and development changes in RapidMiner Studio 9.4.0:
New Features
- Added maps to seamlessly visualize geospatial data. You can choose from multiple map types with many different configuration options, as well as dozens of maps for geographic regions, continents, and of course many individual countries. Main features:
- Choropleth maps: Used to display numeric values associated with regions (e.g. a country or a state) via a color gradient. The region is defined in the data by the join column, which can be either the ISO 3166 two-letter code or the actual name of the region. If your data has multiple entries per region, you have the option to aggregate on the join column (just like you can for many plots).
- Categorical maps: Used to visualize regions that belong to a number of distinct categories. The rows are joined to the map again via ISO 3166 codes or via actual region names. Each distinct category in the value column will then produce one color group.
- Point maps: These maps offer latitude and longitude support. Each row becomes a marker for its location. For best effect, you can choose the appropriate map to display your locations (e.g. a world map or a specific country). They also offer optional support for a size column (think bubbles instead of scatter dots), as well as a color column. The color can either be numerical, in which case you get a color gradient for your points, or categorical, in which case you get distinct color groups that you can individually toggle on/off on the map via the legend.
- Just like the charts, the new maps allow you to quickly select the basic settings to get started, but also to fine-tune details like marker size and shape, the map background color, whether to display region or point labels, and much more.
- Visualizations: Added new plot: Sunburst chart. This chart is interactive: When selecting multiple levels, you can drill down into each level to easily inspect details for that level.
- Visualizations: Added new plot: Chord diagram
- Visualizations: Added new plot: Parliament chart
- Added Multi Label Modeling to train a Multi Label Model. The inner subprocess is executed for each selected label attribute and a prediction model is trained.
- Added Multi Label Performance to evaluate the prediction of such a Multi Label Model. The inner subprocess is executed for each pair of prediction and label attribute and the performance can be calculated.
- New operator Replace All Missings, which universally handles all data types, can deal with missing as well as infinite values, and provides all changes as a single preprocessing model (simpler to use and more robust than a combination of other missing value handling operators, but with less flexible configuration)
- New operator Handle Unknown Values which remembers seen nominal values and creates a preprocessing model based on that. Later this model can replace unknown values by missings.
- New operator One Hot Encoding, which can remove nominal columns with too many values and transform the remaining ones into a set of numerical columns using a one-hot encoding approach with comparison groups (simpler to use and more robust than a combination of other type conversion operators, but with less flexible configuration)
- New operator Append (Robust), which appends two data sets even if their value types do not match or one of the sets contains additional values compared to the other one.
- New operator Rescale Confidences (Logistic) which rescales confidences to use the full 0-1 spectrum. While based on Platt scaling, this one uses an explicit logistic regression model and also works for more than two classes.
- New operator Cost-Sensitive Scoring: a novel approach to cost-sensitive learning that also works for more than two classes. In contrast to MetaCost, this operator does not rely on a bagged model, which keeps training times down and models simpler. Instead, it uses stochastic neighborhood scoring to create the necessary variance in confidences for optimizing the expected cost.
- Added optional bucket parameter to Amazon S3 connections. This enables you to connect to a single bucket only, so the AWS ListAllMyBuckets permission is no longer needed
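To make the One Hot Encoding item above more concrete, here is a minimal Java sketch of one-hot encoding with a comparison group. It is an illustration under assumptions, not the operator's implementation: the class `OneHotSketch`, the method `encode`, the `maxDistinct` threshold, and the "attribute = value" column naming are all hypothetical. The sketch also assumes that the comparison group is the category that receives no indicator column.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OneHotSketch {

    /**
     * Encodes one nominal column as 0/1 indicator columns (hypothetical helper).
     * Returns an empty map if the column has more than maxDistinct distinct values,
     * which models "removing" a column with too many values.
     */
    public static Map<String, double[]> encode(String attribute, List<String> values,
                                               String comparisonGroup, int maxDistinct) {
        Set<String> distinct = new LinkedHashSet<>(values);
        distinct.remove(null); // missing values get no category of their own
        Map<String, double[]> columns = new LinkedHashMap<>();
        if (distinct.size() > maxDistinct) {
            return columns; // too many values: drop the column entirely
        }
        for (String category : distinct) {
            if (category.equals(comparisonGroup)) {
                continue; // assumption: the comparison group is the all-zeros baseline
            }
            double[] indicator = new double[values.size()];
            for (int row = 0; row < values.size(); row++) {
                indicator[row] = category.equals(values.get(row)) ? 1.0 : 0.0;
            }
            columns.put(attribute + " = " + category, indicator);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "green", "blue", "green");
        Map<String, double[]> encoded = encode("color", colors, "red", 10);
        for (Map.Entry<String, double[]> column : encoded.entrySet()) {
            System.out.println(column.getKey() + " -> " + Arrays.toString(column.getValue()));
        }
    }
}
```

With three distinct values and "red" as the comparison group, the sketch produces two indicator columns. Rows with the value "red" are represented by zeros in both.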
Time Series Analysis features:
- New model: Multi Horizon Forecast Model. A meta model, able to predict multiple horizon attributes at once using machine learning models
- New operator Multi Horizon Forecast to train a Multi Horizon Forecast Model
- The inner subprocess is executed for each horizon attribute (selected by regular expression on the attribute roles) and a Prediction Model is trained
- The Prediction Models are collected and combined into the Multi Horizon Forecast Model
- New operator Sliding Window Validation which performs a sliding window validation for a general machine learning model
- New operator Multi Horizon Performance to evaluate the prediction of a 'Multi Horizon Forecast Model'
- The inner subprocess is executed for each pair of prediction and horizon attribute and a Performance can be calculated.
- The performances are provided as a collection and as one averaged performance (if averaging is possible)
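To illustrate the general idea behind sliding window validation mentioned above, the following Java sketch enumerates consecutive training and test windows over an ordered series. The class `SlidingWindowSplits`, its `splits` method, and the parameters (training size, test size, step) are assumptions made for this illustration. They do not mirror the operator's actual parameters or implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSplits {

    /** One train/test split; all ranges are half-open: [start, end). */
    public static final class Split {
        public final int trainStart;
        public final int trainEnd;
        public final int testStart;
        public final int testEnd;

        Split(int trainStart, int trainEnd, int testStart, int testEnd) {
            this.trainStart = trainStart;
            this.trainEnd = trainEnd;
            this.testStart = testStart;
            this.testEnd = testEnd;
        }
    }

    /**
     * Slides a training window followed directly by a test window over
     * a series of the given length, moving forward by 'step' rows each time.
     */
    public static List<Split> splits(int length, int trainSize, int testSize, int step) {
        if (trainSize <= 0 || testSize <= 0 || step <= 0) {
            throw new IllegalArgumentException("window sizes and step must be positive");
        }
        List<Split> result = new ArrayList<>();
        for (int start = 0; start + trainSize + testSize <= length; start += step) {
            int trainEnd = start + trainSize;
            // The test window always lies after the training window,
            // so no future rows leak into training.
            result.add(new Split(start, trainEnd, trainEnd, trainEnd + testSize));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Split s : splits(10, 4, 2, 2)) {
            System.out.printf("train [%d, %d) -> test [%d, %d)%n",
                    s.trainStart, s.trainEnd, s.testStart, s.testEnd);
        }
    }
}
```

In a validation loop, a model would be trained on each training range and scored on the matching test range. The per-window performances could then be averaged, where averaging makes sense.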
Enhancements
- Import Data from database now supports the new connection management
- Added support for JDBC connections that need .dll, .so, or .dylib files to work. You can now simply add them as additional libraries in the Driver tab, where they will then be used. See for example Windows Authentication for MSSQL
- Selecting a repository entry will now behave more consistently when loading and prevent invalid selections better
- When selecting a connection for operators, the dialog shows only compatible connections instead of the entire repository
- When selecting a process for operators (e.g. via Execute Process), the dialog shows only processes instead of the entire repository
- Improved preselected repository when creating a new connection
- Added tooltip for connection test results to make it easier to read long error messages
- Handle Exception by default no longer logs a detailed stacktrace together with the error message, because the long stacktrace pollutes the log. This behavior can be re-enabled by enabling the new "add details to log" parameter.
- Added AWS tag to Amazon S3 operators
- Visualizations: Tooltips (prefix, suffix, decimals) can now also be configured individually for each plot. Configuring them for a plot takes precedence over the global tooltip configuration
- Visualizations: The default row limits can now be increased via the Visualizations row limit modifier setting in the preferences. Note that the default limit is chosen for performance reasons; you may see a drastic decline in chart performance when increasing it
- Visualizations: Improved default settings for some plots for huge data sets to get a more reasonable default plot
- Visualizations: Exporting an image of a chart on regular displays now resembles the displayed chart more closely (most notable for data labels)
- Auto Model: all predictive processes are unified, cleaned up, and better documented
- Auto Model: improved axis range calculation for model overview performance chart to avoid edge cases where all performances are extremely close together and all shown numbers were the same
- Auto Model: scoring processes now deliver example sets as main result instead of the explained predictions object
- Auto Model: using new preprocessing model for missing value handling
- Auto Model: using new preprocessing model for unknown value handling
- Auto Model: using new preprocessing model for one-hot encoding
- Auto Model: all results are annotated and those annotations are also used as object names for stored results now
- Auto Model: added denormalization of data in clustering processes to show visualizations in original data space
- Turbo Prep: improved match calculations which are now more accurate as well as faster for numerical columns in Merge
- Turbo Prep: improved tooltip explaining the calculation of ID-ness for real-valued columns
- Time Series: Improved UserError and MetaData errors in case no attributes are selected for time series operators
- Time Series: Improved meta data information for most of the time series operators, including the information about the selected time series attributes and windowed time series and horizon attributes
- Time Series: Improved memory footprint of parallel execution of Process Windows operator and Forecast Validation operator
Bugfixes
- Fixed the Davies-Bouldin criterion of Cluster Distance Performance for empty clusters
- Fixed an issue that could cause Studio startup to hang for some time during the browser test
- The setting "maximum number of nominal values in meta data" is now respected everywhere
- Fixed some issues that could freeze Studio when dealing with large nominal meta data
- Fixed freeze when displaying large tooltip
- Removed warning for Cross Validation when sampling type is stratified sampling and split on batch attribute is selected
- Fixed an issue that could cause the Replace operator right-click action to fail
- Fixed UI freezes when working on very large processes while working with annotations or renaming operators
- Selecting a repository entry in an operator will now use a relative path also when using double-click
- Cleared operator parameters are now correctly stored when they lose focus to another operator
- Fixed bug that could cause rare places (e.g. the old Reporting Extension) to display a random value instead of a missing value (indicated by '?')
- Improved CSV parsing for too many values in Create ExampleSet
- Fixed Import Data wizard being able to overwrite connections
- Fixed Generate Sales Data meta data
- Fixed unhandled error when delimiter was not set in Amazon S3 connections
- Reading an unknown file on Amazon S3 now leads to a proper error message when KMS is used
- Visualizations: Dates are now displayed according to the selected timezone
- Visualizations: Limits on nominal columns no longer sometimes report too many values when the data does not actually contain that many distinct nominal values
- Visualizations: When zooming into a Scatter plot with more than 5,000 values, there are no longer extra dots appearing near the x-axis
- Visualizations: Sankey tooltip now also respects prefix, suffix, and decimal settings
- Visualizations: Fixed missing thousands separator for some tooltips when a fixed amount of decimals was set
- Visualizations: Fixed possible process error when trying to generate a report via Reporting Extensions which contained an unknown plot type in its configuration.
Development
- Added com.rapidminer.example.set.TableSplitter which provides a general framework to split Belt (the codename for the new data core) tables
- Added com.rapidminer.connection.ConnectionInformationFileUtils#addNativeLibraries(ConnectionInformation) which adds native libraries (.dll, .so, .dylib files) contained in a CI to the Java native lib lookup paths, so later calls to System.loadLibrary(String) by 3rd party libraries will work
- Visualizations: ChartEventCallbackHandler now gets the full series name instead of a potentially abbreviated version
- Deprecated Register Visualization from Database operator as it has not been working for years
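As background for the native library item above, the following Java sketch shows how a library name passed to System.loadLibrary(String) maps to a platform-specific file name, and how one might check whether that file is present on a lookup path. It uses only the standard library and does not call any RapidMiner API. The class `NativeLibraryLookup`, the method `locate`, and the library name `example_auth` are hypothetical, and the lookup is simplified (it ignores class-loader specific resolution).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class NativeLibraryLookup {

    /**
     * Searches the directories in searchPath (separated by the platform's
     * path separator) for the platform-specific file of the given library.
     */
    public static Optional<Path> locate(String libraryName, String searchPath) {
        // e.g. "foo" becomes foo.dll, libfoo.so, or libfoo.dylib depending on the OS
        String fileName = System.mapLibraryName(libraryName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed entries instead of failing the whole lookup
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "example_auth" is a placeholder name, not a real driver library
        String name = args.length > 0 ? args[0] : "example_auth";
        String searchPath = System.getProperty("java.library.path", "");
        System.out.println("Platform file name: " + System.mapLibraryName(name));
        System.out.println("Location: " + locate(name, searchPath)
                .map(Path::toString)
                .orElse("not found on java.library.path"));
    }
}
```

A check like this can help diagnose an UnsatisfiedLinkError raised by a third-party JDBC driver that expects its .dll, .so, or .dylib file to be discoverable.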