RapidMiner Radoop documentation, version 10.2
Connecting to Cloudera Data Platform
For reference, Cloudera Data Platform Private Cloud Base version 7.1.7 was used while creating this document. The following setup guide suits a Kerberized CDP cluster with TLS authentication and High Availability for the Hive, HDFS and YARN services, which is the most common production use case.
Configuring the Hadoop cluster
The cluster side configurations listed below can be done by a user with admin privileges in the Cloudera Manager instance used to administer your CDP cluster.
Spark dependencies
Setup Cloudera Spark3
Radoop 10.2 supports Cloudera's Spark3 distribution out of the box. Please install the Spark3 parcel following the Cloudera documentation.
Add Java 11 to worker nodes
Cloudera supports running the cluster on Java 11, and all worker nodes must be equipped with that version. This can be achieved either by running the whole cluster on Java 11 (which can be configured in Cloudera Manager) or by installing Java 11 on all worker nodes in a local file system directory of your choice. If you choose the latter, please share that location with Radoop users, since they will need it during connection setup.
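When Java 11 is installed into a custom directory, it can help to confirm on each worker node that the binary there really reports version 11 before sharing the path. The following is a minimal sketch, not part of Radoop: the function names are hypothetical, and the only assumption about Java is that `java -version` prints a line such as `openjdk version "11.0.19"`.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_371" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def is_java_11_home(java_home):
    """Return True if <java_home>/bin/java exists and reports major version 11."""
    try:
        result = subprocess.run(
            [f"{java_home}/bin/java", "-version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # Binary missing or not executable at the given location.
        return False
    # `java -version` writes to stderr; stdout is included for robustness.
    return java_major_version(result.stderr + result.stdout) == 11
```

Calling `is_java_11_home` with the directory chosen for the installation returns `False` when the path is wrong or points at a different Java release.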
Radoop Proxy 2.0
Radoop 10.2.0+ requires Radoop Proxy 2.0 (or later) to submit Spark applications. It can be installed and managed via Cloudera Manager.
Hive setup
Allow changes of advanced HiveQL properties
Radoop relies on its ability to set certain advanced HiveQL properties along query execution. These must be explicitly enabled - whitelisted - on the cluster.
- Navigate to Hive on Tez/Configuration in Cloudera Manager
- Search for Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml
- Add the following for both Service and Client configurations (the value must contain no whitespace):
Name: hive.security.authorization.sqlstd.confwhitelist.append
Value: radoop\.operation\.id|mapred\.job\.name|hive\.warehouse\.subdir\.inherit\.perms|hive\.exec\.max\.dynamic\.partitions|hive\.exec\.max\.dynamic\.partitions\.pernode|spark\.app\.name|hive\.remove\.orderby\.in\.subquery|radoop\.testing\.process\.name
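Because the whitelist value is a single regular expression with escaped dots and no whitespace, a copy-and-paste slip is easy to miss. The sketch below rebuilds the value from the plain property names listed in this section and rejects any whitespace; the helper name `build_whitelist` is hypothetical and only serves as a way to double-check what was pasted into Cloudera Manager.

```python
import re

# Property names taken from the whitelist value in this section.
RADOOP_PROPERTIES = [
    "radoop.operation.id",
    "mapred.job.name",
    "hive.warehouse.subdir.inherit.perms",
    "hive.exec.max.dynamic.partitions",
    "hive.exec.max.dynamic.partitions.pernode",
    "spark.app.name",
    "hive.remove.orderby.in.subquery",
    "radoop.testing.process.name",
]


def build_whitelist(properties):
    """Join property names into a '|'-separated regex with escaped dots."""
    value = "|".join(name.replace(".", r"\.") for name in properties)
    if re.search(r"\s", value):
        raise ValueError("whitelist value must not contain whitespace")
    return value


if __name__ == "__main__":
    print(build_whitelist(RADOOP_PROPERTIES))
```

Comparing the printed string with the value stored in the safety valve is a quick way to spot a stray space or a missing backslash.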
Enabling Radoop UDFs in Hive
Complex functionality of Radoop is partly achieved by defining custom functions (UDF, UDAF and UDTF) in HiveServer2, extending its capabilities.
- Hive UDF functions for Radoop 10.2+ require Hive to run on Java 11
- Install RapidMiner Parcel
- Navigate to Hive on Tez/Configuration in Cloudera Manager
- Search for hive_aux_jars_path_dir and add the following value: /opt/cloudera/parcels/RAPIDMINER_LIBS/lib/radoop/
- Restart Hive service (and possibly other stale services) to pick up changes.
- Register Hive UDF functions for Radoop
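After the parcel is installed, the auxiliary JAR directory configured above can be inspected on a HiveServer2 host to confirm that it exists and is not empty. This is a minimal sketch under the assumption that the parcel ships its libraries as `*.jar` files directly in that directory; the function name is hypothetical.

```python
from pathlib import Path

# Directory taken from the hive_aux_jars_path_dir value in this section.
RADOOP_AUX_DIR = "/opt/cloudera/parcels/RAPIDMINER_LIBS/lib/radoop/"


def list_aux_jars(aux_dir=RADOOP_AUX_DIR):
    """Return the sorted names of JAR files found directly in aux_dir."""
    directory = Path(aux_dir)
    if not directory.is_dir():
        # Parcel not distributed or activated on this host.
        return []
    return sorted(path.name for path in directory.glob("*.jar"))


if __name__ == "__main__":
    jars = list_aux_jars()
    print(f"{len(jars)} JAR file(s) found in {RADOOP_AUX_DIR}")
```

An empty result suggests that the parcel has not been distributed and activated on that host yet.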
YARN configuration
Set YARN's logging configuration to allow reading the logs of YARN applications submitted by Radoop. Collecting and reading IFiles, the CDP default, is currently not supported, so TFile must be configured explicitly. Reading YARN logs also requires read permission on the log folder in HDFS.
- Navigate to YARN/Configuration
- Search for yarn_log_aggregation_file_formats and set its value to TFile
- Search for yarn_log_aggregation_TFile_remote_app_log_dir and set its value to /tmp/logs
- Search for yarn_log_aggregation_TFile_remote_app_log_dir_suffix and set its value to /logs
- To finish the YARN setup, restart stale services.
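The three YARN settings above can be kept as a small checklist and compared against whatever values are currently configured. The sketch below is only an illustration: the `actual` dictionary is assumed to be filled in by hand (for example, copied from the Cloudera Manager configuration page), and `find_mismatches` is a hypothetical helper name.

```python
# Expected values taken from the YARN steps in this section.
EXPECTED_YARN_SETTINGS = {
    "yarn_log_aggregation_file_formats": "TFile",
    "yarn_log_aggregation_TFile_remote_app_log_dir": "/tmp/logs",
    "yarn_log_aggregation_TFile_remote_app_log_dir_suffix": "/logs",
}


def find_mismatches(actual, expected=EXPECTED_YARN_SETTINGS):
    """Map each differing key to an (expected, actual) pair; missing keys give None."""
    return {
        key: (wanted, actual.get(key))
        for key, wanted in expected.items()
        if actual.get(key) != wanted
    }


if __name__ == "__main__":
    # Hypothetical current values, with the file format still left at IFile.
    current = {
        "yarn_log_aggregation_file_formats": "IFile",
        "yarn_log_aggregation_TFile_remote_app_log_dir": "/tmp/logs",
    }
    for key, (wanted, found) in find_mismatches(current).items():
        print(f"{key}: expected {wanted!r}, found {found!r}")
```

An empty result means all three settings match the values this guide asks for.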
Networking
Please follow the general networking setup description for accessing a Hadoop cluster.
Security configuration
- To ease connection setup for Radoop users, it is recommended to create and share a technical account in Cloudera Manager with which the Radoop Connection Import Wizard can do its job. Such an account can be created in Cloudera Manager under Administration/Users & Roles: select Add Local User and set its Roles to Read-Only.
- Radoop users require a handful of permissions (e.g. access HDFS, execute HiveQL, submit YARN jobs), which should already be in place on a working cluster. For the exact set, please refer to Configuring Apache Ranger authorization at Hadoop Security.
- On a Kerberized cluster Radoop users need their keytab file and KDC details in order to authenticate to the cluster.
- Radoop users will need the CA certificate and other trusted certificates in PEM format to establish secure communication with Hadoop services via TLS.
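Before handing certificates to Radoop users, it can be useful to confirm that a file really is in PEM format and to see how many certificates it bundles. The sketch below uses only Python's standard `ssl` module; the function name is hypothetical. It checks the PEM framing and Base64 payload of each block, not the validity or trust chain of the certificates themselves.

```python
import re
import ssl

# Matches one PEM certificate block, including its header and footer lines.
_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_bundle(text):
    """Return the DER bytes of every certificate block found in a PEM bundle."""
    return [ssl.PEM_cert_to_DER_cert(block) for block in _PEM_BLOCK.findall(text)]


def count_certificates(pem_path):
    """Count the certificate blocks in the PEM file at pem_path."""
    with open(pem_path, "r", encoding="ascii") as handle:
        return len(split_pem_bundle(handle.read()))
```

A count of zero usually means the file is in another encoding, such as binary DER, and needs converting before it is shared.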
Setting up the Radoop connection in RapidMiner Studio
A CDP cluster can be operated in multiple environments with different network setups. During setup it is crucial to consider whether the cluster is running on a separate, isolated network. If it is, the Hadoop cluster is not aware of its nodes' external addresses, so Radoop Proxy is required for Radoop to operate properly.
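One rough way to judge whether the cluster sits on an isolated network is to try opening TCP connections from the RapidMiner Studio machine to the cluster nodes' service ports. The sketch below is a plain socket probe, not a Radoop feature; the host names and ports in the comment are placeholders to be replaced with values from your own cluster.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (host, port) pairs from endpoints that cannot be reached."""
    return [(h, p) for h, p in endpoints if not is_reachable(h, p, timeout)]


# Example with placeholder endpoints:
# unreachable_endpoints([("hive.example.internal", 10000),
#                        ("worker1.example.internal", 8042)])
```

If the nodes' internal addresses cannot be reached from the client, that is a strong hint that the connection has to go through Radoop Proxy.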
The configurations in the following section need to be set on both secured and non-secured clusters. We strongly recommend using the Import from Cluster Manager tool to create the connection, as several advanced properties required for correct operation are seamlessly gathered from the cluster during the import process.
Auto-TLS Encryption ships with CDP clusters, and Radoop Proxy also supports SSL. If either of them uses an untrusted (i.e. self-signed) certificate for SSL, you need to add the certificate(s) to the cacerts folder in the RapidMiner Studio home in order to establish a secure communication channel.
Use Import from Cluster Manager to create the connection directly from the configuration retrieved from Cloudera Manager. The import process does not use Radoop Proxy, so Cloudera Manager has to be accessible over the network for this task. If SSL is enabled, pick the hostname that corresponds to the certificate installed in the previous step.
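To find out which hostname corresponds to the installed certificate, the TLS handshake can be reproduced outside RapidMiner Studio with the same CA file. The following sketch uses Python's standard `ssl` module; the function names are hypothetical, and the host, port and CA path must be supplied by you (the Cloudera Manager TLS port in particular depends on your installation).

```python
import socket
import ssl


def san_dns_names(peer_cert):
    """Return the DNS names listed in a certificate dict from getpeercert()."""
    return [
        value
        for kind, value in peer_cert.get("subjectAltName", ())
        if kind == "DNS"
    ]


def fetch_verified_cert(host, port, ca_pem_path, timeout=5.0):
    """Connect over TLS, trusting only ca_pem_path, and return the peer certificate.

    Raises ssl.SSLCertVerificationError if the certificate is not signed by
    the given CA or does not match the host name used for the connection.
    """
    context = ssl.create_default_context(cafile=ca_pem_path)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

If `fetch_verified_cert` succeeds for a hostname, that hostname matches the certificate; `san_dns_names` applied to the result lists the other names the certificate accepts.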
When using Kerberos, set Client Principal with the corresponding Keytab File, KDC Address and the REALM on the Global tab.
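A Kerberos principal has the form `primary/instance@REALM`, where the instance part is optional. Splitting it programmatically can help fill in the Client Principal and REALM fields consistently. This is a minimal sketch with a hypothetical helper name; it checks only the overall shape of the string, not whether the principal exists in the KDC.

```python
def parse_principal(principal):
    """Split 'primary[/instance]@REALM' into (primary, instance, realm).

    The instance is None when the principal has no '/instance' part.
    Raises ValueError if the primary or the realm is missing.
    """
    name, separator, realm = principal.rpartition("@")
    if not separator or not name or not realm:
        raise ValueError(f"not a full principal with realm: {principal!r}")
    primary, _, instance = name.partition("/")
    if not primary:
        raise ValueError(f"principal has no primary part: {principal!r}")
    return primary, instance or None, realm


if __name__ == "__main__":
    # Hypothetical user principal.
    print(parse_principal("alice@EXAMPLE.COM"))
```

The realm returned here should match the REALM entered on the Global tab.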
On the Hive tab, enter the Database Name to connect to. Choose a database where the given user has privileges for all operations. If your Hadoop administrator installed the RapidMiner Parcel, tick UDFs are installed manually; otherwise Radoop will register UDFs at runtime.
If you use Radoop Proxy, a Radoop Proxy Connection for this cluster must already exist. As a final step, tick Use Radoop Proxy on the Radoop Proxy tab and select the Radoop Proxy Connection that was created for this cluster.