Python Extension Loader

An Altair AI Suite extension that can add Python-based AI Suite extensions to the platform. Part of the Altair AI Suite Python SDK.

The Altair AI Suite Python SDK consists of two parts:

  • Python AI Tools Devkit (PYSDK): This is used to create a Python extension - a .zip file - out of your Python package, as well as create the environment definition to be used by PEL
  • Python Extension Loader (PEL): Loads the created Python extensions into the AI Suite to allow workflows to use the defined operators.

The PEL loads the operator declarations (name, parameters, ports, etc.) defined in the respective Python extension. When such an operator should be run, it delegates to the required Python distribution for executing the function behind an operator and adds the Python extension .zip file as a library. This distribution is then called with a wrapper script and an extensive argument JSON describing how the library should be called.

The distribution is usually managed by Miniforge, meaning users do not need to take any action themselves: the required Python environment is installed and managed automatically. Inputs and outputs are either part of the wrapper argument JSON, or are serialized to and from Java.

In other words, the PEL registers new operators that can be added to workflows and are not distinguishable from Java-backed operators. They can be used in local or AI Hub setups as long as the Python distribution indicated by the environment entry in the Python extension .toml file (and potentially by the bundled conda env.yml file) is available.

Python Extension Lookup

To change the location where Python extensions are looked up, you have 3 options; the first one to be specified is used:

  1. Use the Settings with the key pysdk.pel.extLoc set to the directory where all Python extensions are located
  2. Use a system property with the key pysdk.pel.extLoc set to the directory where all Python extensions are located
  3. Use an environment variable with the key PYSDK_PEL_EXT_LOC set to the directory where all Python extensions are located

Look for the following log line to indicate you set this up correctly:

INFO: Looking for Python extensions in /path/to/py-extensions

In case of an error, you will see this in your log:

Failed to load Python extensions
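
The lookup order described above (Settings first, then system property, then environment variable) can be sketched as follows. This is only a minimal illustration in Python, not the PEL's actual implementation; the function name resolve_ext_loc and the plain dictionaries standing in for the Settings and the system properties are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_ext_loc(
    settings: Mapping[str, str],
    system_properties: Mapping[str, str],
    environment: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first configured extension directory, or None if unset."""
    # Candidates are listed in the documented order of precedence.
    candidates = (
        settings.get("pysdk.pel.extLoc"),
        system_properties.get("pysdk.pel.extLoc"),
        environment.get("PYSDK_PEL_EXT_LOC"),
    )
    for value in candidates:
        # Assumption: an empty value counts as "not specified".
        if value:
            return value
    return None
```

The same pattern would apply to the other settings described in this document, with the respective keys swapped in.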

Python Distribution Lookup

There are four different approaches:

  1. Environment creation via Miniforge (online)
  2. Environment creation via Miniforge (offline)
  3. Extractable Distributions (via archive file)
  4. Local Distributions (already pre-existing on disk)

1. Environment creation via Miniforge (online)

Requirements:

  • Miniforge has not been disabled via the setting below
  • A conda environment definition file env.yml is present in the root directory of the Python extension

Description:

This is the default mode. This will create the environment based on the .yml file via Miniforge on demand (if it has not already been created by an earlier start / another Python extension) via the conda-forge channel.

The Miniforge installer will be downloaded and verified automatically when needed (but can also be provided via the resource location). Currently supported Miniforge installers:

  • Miniforge3-24.3.0-0-Windows-x86_64.exe
  • Miniforge3-24.3.0-0-MacOSX-arm64.sh
  • Miniforge3-24.3.0-0-MacOSX-x86_64.sh
  • Miniforge3-24.3.0-0-Linux-aarch64.sh
  • Miniforge3-24.3.0-0-Linux-ppc64le.sh
  • Miniforge3-24.3.0-0-Linux-x86_64.sh

To disable the use of Miniforge, you have 3 options, the first one to be specified is used:

  1. Use the Settings with the key pysdk.pel.disableMiniforge set to true
  2. Use a system property with the key pysdk.pel.disableMiniforge set to true
  3. Use an environment variable with the key PYSDK_PEL_DISABLE_MINIFORGE set to true

Look for the following log line to indicate you set this up correctly:

INFO: Disabled Python distribution Miniforge usage

Sometimes the default Miniforge installation location is not desirable, for example due to Conda bugs with whitespace or special characters in paths (see https://github.com/conda/conda/issues/10239). In those cases, you can change the location where Miniforge is installed or looked up.

Note: The Windows installer is notoriously finicky, and does not like long-ish paths (~60 or more chars) or special characters or whitespaces at all. The default path is chosen with care, so modify it only if it does not work!

To change the installation directory of Miniforge, you have 3 options, the first one to be specified is used:

  1. Use the Settings with the key pysdk.pel.miniforgeDir set to the directory where Miniforge should be installed to / looked up in
  2. Use a system property with the key pysdk.pel.miniforgeDir set to the directory where Miniforge should be installed to / looked up in
  3. Use an environment variable with the key PYSDK_PEL_MINIFORGE_DIR set to the directory where Miniforge should be installed to / looked up in

Look for the following log line to indicate you set this up correctly:

INFO: Set Miniforge directory to /path/to/miniforge

2. Environment creation via Miniforge (offline)

Requirements:

  • Miniforge has not been disabled via the setting above
  • A conda environment definition file env.yml is present in the root directory of the Python extension
  • Air-gapped mode is enforced via the setting below
  • Offline Miniforge distribution installer is present in the lookup directory

Description:

This will create the environment via Miniforge as described above, but use an offline disk-based channel for doing so. This is needed for air-gapped systems which have no access to the internet.

To set air-gapped mode, you have 3 options, the first one to be specified is used:

  1. Use the Settings with the key pysdk.pel.airGapped set to true
  2. Use a system property with the key pysdk.pel.airGapped set to true
  3. Use an environment variable with the key PYSDK_PEL_AIR_GAPPED set to true

Look for the following log line to indicate you set this up correctly:

INFO: Set Python distribution to air-gapped mode

The lookup directory for the archive defaults to

  • ~/.RapidMiner/internal cache/temp, or
  • ~/.AltairRapidMiner/AI Studio/{version}/internal cache/temp,

but can be changed via the setting below:

To set the location in which the archives are looked up, you have 3 options; the first one to be specified is used:

  1. Use the Settings with the key pysdk.pel.resourceLoc set to the directory where the Python distribution archives are located
  2. Use a system property with the key pysdk.pel.resourceLoc set to the directory where the Python distribution archives are located
  3. Use an environment variable with the key PYSDK_PEL_RESOURCE_LOC set to the directory where the Python distribution archives are located

Look for the following log line to indicate you set this up correctly:

INFO: Set Python resource lookup directory to /path/to/py-resource-loc

3. Extractable Distributions (via archive file)

Requirements:

  • Miniforge has been disabled via the setting above
  • OR No conda environment definition file env.yml is present in the root directory of the Python extension
  • OR No Offline Miniforge distribution installer is present in the lookup directory
  • AND The distribution archive is present in the lookup directory (see details below)

Description:

This will look for an archive file (tar, tar.gz, or zip) for the requested Python distribution. The archive name changes depending on whether a version is provided or not. The format is pydist-%s-%s-%s.%s, where the 4 elements are constructed as follows:

  • The key of the distribution
  • If a version is provided: version, and nothing if no version is provided
  • Either windows or macos or linux
  • A file suffix, e.g. zip or tar.gz

Examples for a full name might be

  • pydist-custom-1.0.0-windows.zip or
  • pydist-other-0.1.0-macos.tar.gz

This will install the contents of the archive on disk and verify its contents on each Studio start.

The extraction folder used for the contents of the archive is the lookup directory for the distributions and can also be changed via settings (see below).
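
The archive naming scheme described in this section could be sketched as below. The helper archive_file_name is hypothetical. The text above only says that the version element is "nothing" when no version is provided, so dropping the version together with its hyphen is an assumption.

```python
from typing import Optional

SUPPORTED_OS = ("windows", "macos", "linux")
SUPPORTED_SUFFIXES = ("tar", "tar.gz", "zip")


def archive_file_name(key: str, version: Optional[str], os_name: str, suffix: str) -> str:
    """Build the expected archive name for a Python distribution."""
    if os_name not in SUPPORTED_OS:
        raise ValueError(f"unsupported OS: {os_name}")
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported suffix: {suffix}")
    parts = ["pydist", key]
    if version:
        parts.append(version)
    # Assumption: without a version, the segment and its hyphen are omitted.
    parts.append(os_name)
    return "-".join(parts) + "." + suffix
```

With the examples above, archive_file_name("custom", "1.0.0", "windows", "zip") yields pydist-custom-1.0.0-windows.zip.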

4. Local Distributions (already pre-existing on disk)

Requirements:

  • Miniforge has been disabled via the setting above
  • OR No conda environment definition file env.yml is present in the root directory of the Python extension
  • OR No Offline Miniforge distribution installer is present in the lookup directory
  • AND No matching distribution archive is present in the lookup directory

Description:

This is both the option usually used by headless execution agents like the job agent, scoring agent, etc. and the last resort if none of the previous options worked. It expects that the required Python distribution is already located on disk and ready to use as-is.
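
Taken together, the requirements of the four approaches suggest a fallback chain, which can be sketched as below. This is one reading of the requirement lists, not the PEL's actual logic. The names Mode, Situation and select_mode are hypothetical. It is also an assumption that a missing offline installer only matters when air-gapped mode is enabled.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MINIFORGE_ONLINE = 1
    MINIFORGE_OFFLINE = 2
    ARCHIVE = 3
    LOCAL = 4


@dataclass
class Situation:
    miniforge_disabled: bool
    env_yml_present: bool
    air_gapped: bool
    offline_installer_present: bool
    archive_present: bool


def select_mode(s: Situation) -> Mode:
    """Pick the distribution lookup approach for the given situation."""
    # Both Miniforge modes need Miniforge enabled and an env.yml file.
    miniforge_usable = not s.miniforge_disabled and s.env_yml_present
    if miniforge_usable and not s.air_gapped:
        return Mode.MINIFORGE_ONLINE
    if miniforge_usable and s.air_gapped and s.offline_installer_present:
        return Mode.MINIFORGE_OFFLINE
    # Otherwise fall back to an archive, then to a pre-existing distribution.
    if s.archive_present:
        return Mode.ARCHIVE
    return Mode.LOCAL
```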

The lookup directory for the distributions defaults to

  • ~/.RapidMiner/internal cache/py-dists or
  • ~/.AltairRapidMiner/AI Studio/{version}/internal cache/py-dists,

but can be changed via the setting below:

To set the location in which the distributions are looked up, you have 3 options; the first one to be specified is used:

  1. Use the Settings with the key pysdk.pel.distLoc set to the parent directory where the Python distributions are located
  2. Use a system property with the key pysdk.pel.distLoc set to the parent directory where the Python distributions are located
  3. Use an environment variable with the key PYSDK_PEL_DIST_LOC set to the parent directory where the Python distributions are located

Look for the following log line to indicate you set this up correctly:

INFO: Set Python distribution lookup directory to /path/to/py-dists

The full path for a distribution is then constructed in the following way:

Let's assume the folder specified above via the settings is /path/py-dists to keep the example simple. The name of each folder within that path is simply in the form of key-version of a distribution. In case of a non-versioned distribution, it takes the form of just key.

Examples for full paths might therefore look like:

  • /path/py-dists/other-1.0.0
  • /path/py-dists/custom

Within each of those, the fully usable Python distribution for the correct OS is expected to be ready for use.
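
As a rough illustration of the path construction above, the sketch below splits an environment entry (format key:version or key, as used in extension.json) and derives the expected folder. The helper names split_environment and distribution_dir are hypothetical. PurePosixPath is used only to keep the example output independent of the operating system.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def split_environment(environment: str) -> Tuple[str, Optional[str]]:
    """Split 'key:version' or 'key' into its key and optional version."""
    key, separator, version = environment.partition(":")
    return key, (version if separator and version else None)


def distribution_dir(dist_loc: str, environment: str) -> PurePosixPath:
    """Return the folder in which the distribution is expected."""
    key, version = split_environment(environment)
    # Versioned distributions use 'key-version', non-versioned ones just 'key'.
    folder = f"{key}-{version}" if version else key
    return PurePosixPath(dist_loc) / folder
```

For instance, distribution_dir("/path/py-dists", "other:1.0.0") gives /path/py-dists/other-1.0.0.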

External Configuration Overview

PEL behavior can be customized, to allow it to work in many different scenarios and environments, from a regular, internet-facing laptop using AI Studio, all the way to an air-gapped system running a headless RTSA.

The table below contains all the currently available settings. Each setting can be set in 3 different ways, and the first one to be specified is used:

  • via Java with the Settings mechanism (note: using context Settings.CONTEXT_ALTAIR_LIB)
  • via system properties
  • via environment variables

| Description | Settings Key | System Property | Environment Variable |
| --- | --- | --- | --- |
| Determines that no distribution mechanism requiring an online connection is used. If set, Python distributions must be provided differently: via disk-based conda channel for Miniforge, via distribution archive, or ready to use on disk. | pysdk.pel.airGapped | pysdk.pel.airGapped | PYSDK_PEL_AIR_GAPPED |
| Disables the use of Miniforge altogether. Python distributions must either be provided via archive or ready to use on disk. | pysdk.pel.disableMiniforge | pysdk.pel.disableMiniforge | PYSDK_PEL_DISABLE_MINIFORGE |
| The directory where Miniforge is located / should be installed in. With this setting you can use an existing Miniforge installation. Not used if Miniforge is disabled. | pysdk.pel.miniforgeDir | pysdk.pel.miniforgeDir | PYSDK_PEL_MINIFORGE_DIR |
| The directory used as the parent directory for all Python distribution lookups for the referenced distributions. Subfolders are treated as Python distributions and expected to be named as {key}-{version}. | pysdk.pel.distLoc | pysdk.pel.distLoc | PYSDK_PEL_DIST_LOC |
| The directory where any required resources are looked up. It is used as the lookup directory for the Miniforge installer in case it is needed, for the Miniforge offline disk-based channel archive to create environments on air-gapped systems, and for all Python distribution archives for the referenced distributions in case archive mode is needed. | pysdk.pel.resourceLoc | pysdk.pel.resourceLoc | PYSDK_PEL_RESOURCE_LOC |
| Determines the location in which Python extensions are looked up. | pysdk.pel.extLoc | pysdk.pel.extLoc | PYSDK_PEL_EXT_LOC |
| The debug mode to use. Will trigger additional logging. One of: NONE (no additional logging), OPERATOR (operators log additional information at INFO level, as opposed to FINE), ALL (Python script logs at INFO level, as opposed to FINE; also logs additional information). | pysdk.debugMode | pysdk.debugMode | PYSDK_DEBUG_MODE |
| The lower bound of the port range to use for the Python servers. | pysdk.pel.server.minPort | pysdk.pel.server.minPort | PYSDK_PEL_SERVER_MIN_PORT |
| The upper bound of the port range to use for the Python servers. | pysdk.pel.server.maxPort | pysdk.pel.server.maxPort | PYSDK_PEL_SERVER_MAX_PORT |
| Time in seconds the server is kept alive in idle state before shutting down. | pysdk.pel.server.idleShutdown | pysdk.pel.server.idleShutdown | PYSDK_PEL_SERVER_IDLE_SHUTDOWN |
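
The environment variable names listed above appear to follow a regular pattern: the dotted camelCase key is turned into upper-case words joined by underscores. The sketch below reproduces that pattern. It is an observation drawn from the listed names rather than a documented rule, and the helper env_var_name is hypothetical.

```python
import re


def env_var_name(key: str) -> str:
    """Derive the environment variable name from a settings key."""
    # Insert an underscore at each camelCase boundary, e.g. airGapped -> air_Gapped.
    with_boundaries = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key)
    # Dots become underscores and everything is upper-cased.
    return with_boundaries.replace(".", "_").upper()
```

For example, env_var_name("pysdk.pel.server.minPort") gives PYSDK_PEL_SERVER_MIN_PORT.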

extension.json

The extension.json file is created by the Altair AI Tools Devkit and added to the created Python extension .zip file. It contains all the information required by the PEL to create a "real" AI Suite extension out of it. This encompasses a description of each operator, as well as information on the required Python environment.

JSON Format Example

Here is an example of such a JSON file. Below it, you will find an accurate description of each supported element, as well as potential additional options.

{
    "extension": {
        "name": "Python Samples",
        "namespace": "pysa",
        "version": "0.1.0",
        "license": "Apache License, Version 2.0",
        "environment": "bundled:0.1.3",
        "module": "samples",
        "dependencies": [
            {
                "name": "pytensors",
                "min_version": "0.3.0",
                "module": "tensors"
            }
        ],
        "sdk_version": "0.1.0"
    },
    "operators": {
        "hello_world": {
            "name": "Hello World",
            "implementation": "hello_world",
            "parameters": [
                {
                    "type": "string",
                    "name": "name",
                    "description": "the name to greet (optional)",
                    "categories": null,
                    "default": null,
                    "optional": true
                },
                {
                    "type": "integer",
                    "name": "times",
                    "description": "the number of greetings to generate",
                    "categories": null,
                    "default": 3,
                    "conditions": [
                        [
                            "!=",
                            "name",
                            "stranger"
                        ]
                    ],
                    "optional": false
                }
            ],
            "inputs": [],
            "outputs": [
                {
                    "name": "result1",
                    "type": "table"
                }
            ],
            "icon": "message.png"
        },
        "scale": {
            "name": "Normalize",
            "implementation": "scale",
            "parameters": [
                {
                    "type": "category",
                    "name": "method",
                    "description": "the normalization method",
                    "categories": [
                        "MAX_ABSOLUTE",
                        "MIN_MAX",
                        "STANDARD"
                    ],
                    "default": "MIN_MAX",
                    "optional": false
                },
                {
                    "type": "string",
                    "name": "selected_col",
                    "description": "Selected column of the input dataframe",
                    "categories": null,
                    "default": null,
                    "optional": false,
                    "annotations": [
                        ["SelectedColumnAnnotation", "data"]
                    ]
                }
            ],
            "inputs": [
                {
                    "name": "data",
                    "type": "table",
                    "optional": false
                }
            ],
            "outputs": [
                {
                    "name": "normalized",
                    "type": "table"
                }
            ]
        }
    }
}

Extension Block

The extension block contains some meta information required by the RapidMiner extension format, as well as some information on the Python script.

  • name: The name of the extension. Will for example appear as the name of the sub-folder in the Extensions section of the available RapidMiner operators
  • namespace: The namespace of the extension. Will be used in the process XML as operator key prefix, just like any other RapidMiner extension. Will later be required to be unique, but is currently not validated for uniqueness among the installed Python extensions
  • license: The name of the license under which the Python extension is released. Must match the actual contents of the LICENSE file in the root level. Note: If you have 3rd-party package dependencies, add their licenses under licenses/package_name.license_name.license files.
  • version: The version of the Python extension. Will later be used to determine which Python extension to load if multiple versions of a single one are present, but is currently not being used
  • environment: The key and optional version of the Python distribution required to run the contained Python code. Format is either key:version or key. Environments can either be created on demand or are expected to be located on disk already. Which mode is used depends on the (optional) settings set during PEL startup. By default, the environments are built using Miniforge. The available options are:

    • Miniforge builds the environment on-demand from the conda-forge channel
    • Miniforge builds the environment on-demand from an offline disk-based channel (for air-gapped systems)
    • Distribution archives containing the entire distribution can be extracted
    • The distribution is already on disk and ready to be used
  • module: The name of the Python module from which the referenced script functions can be imported

  • dependencies: List of Python extension dependencies which the extension may use functionality from.

    • name: Namespace of the Python extension (dependency)
    • min_version: Required minimum version of the Python extension (dependency)
    • module: Same as the previously described module just for the dependency extension. Required by PEW
  • sdk_version: Version number of the Python SDK that was used to build the Python extension.
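
To illustrate what a min_version check on a dependency could look like, here is a minimal sketch. It assumes plain dotted numeric versions such as 0.3.0. The source does not specify the version scheme or how the PEL compares versions, and the helper names parse_version and satisfies_min_version are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.3.0' into (0, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies_min_version(installed: str, min_version: str) -> bool:
    """Check whether an installed dependency meets the required minimum."""
    return parse_version(installed) >= parse_version(min_version)
```

Comparing integer tuples rather than strings avoids treating 0.10.0 as older than 0.3.0.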

Operators Block

The operators block contains a dictionary of all the operators contained in this Python extension. Each of these is referenced by its key. These keys must be unique to be used as a RapidMiner operator key, which is the reason for having a dictionary here.

Each operator contains the following:

  • name: The name of the operator as it should appear to the user
  • implementation: The name of the Python function that contains the code backing this operator. Note that the required inputs of this function are split into the parameters and inputs elements, depending on their type.
  • icon: Optional. The name of one of the icons provided by AI Studio. If not set, a default icon will be used.
  • tags: Optional. Contains an array of tags for the operator. Tags are used when searching for operators. By default, no tags are present.
  • synopsis: Optional. A short 1-2 sentence synopsis of what this operator does.
  • description: Optional. A longer text describing in more detail what this operator does.
  • parameters: An array containing the parameters of the operator, which are used for getting settings into the script function as arguments. Each element within must consist of the following:

    • name: The name of the parameter, will be used as the key internally.
    • description: The description of the parameter, to inform the user what this parameter is for.
    • type: The type of the parameter. Depending on the type, some of the following elements can become optional. Currently supported types are:

      • integer: An int value will be provided as Python function input.
      • real: A float value will be provided as Python function input.
      • boolean: A bool value will be provided as Python function input.
      • string: A str value will be provided as Python function input.
      • category: An Enum constant will be provided as Python function input. Must also provide the categories with Enum values available for selection.
    • categories: Optional. Only used if the Python function requires an Enum. Lists the selectable values in an array.

    • default: Optional. A pre-selected default value for the parameter. Must conform to the type of the parameter.
    • optional: if true, the parameter can be ignored and left empty. If false, providing a proper value to the parameter will be enforced.
    • annotations: Optional. Array of parameter annotations. Each annotation itself is an array with variable number of elements. The first element is the annotation type, the remaining are the parameters. List of supported annotation types:

      • SelectedColumnAnnotation: Represents a selected column of an input dataframe. Parameters:

        1. Name of the input port it refers to (must be a dataframe).
      • TextParameterAnnotation: Represents longer texts. Parameters:

        1. Text type (for syntax highlighting). Supported values: PLAIN (default), JSON, XML, HTML, SQL and PYTHON.
      • ConditionalAnnotation: Represents operator parameter conditions (supported types: boolean, string and category). Parameters:

        1. Operator string. Must be == or != for (un-)equal string conditions.
        2. Name of the operator parameter the condition references. The type of the referenced parameter determines the type of condition.
        3. Value which the referenced parameter will be compared to. Must be either true or false for boolean conditions. Can be null, to check if a parameter is set.
  • inputs: An array containing the input ports of the operator, which are used for getting data into the script function as arguments. Each element within must consist of the following:

    • name: The name of the Python script function argument this input is for
    • type: The input type, currently supported values are:

      • table for data tables which get converted to a Pandas DataFrame
      • file for binary files which get converted to a Path object
    • optional: if true, the presence of actual input data at this port will not be enforced. If false, the absence of data will be flagged as an error.

  • outputs: An array containing the output ports of the operator, which are used for getting data from the script result back to the process. Each element within must consist of the following:

    • name: The name of the Python script function return value of this output. Used for identification if multiple output values are present.
    • type: The output type, currently supported values are:

      • table for Pandas DataFrame which will get converted to a RapidMiner data table
      • file for Path objects which will get converted to binary files
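
The parameter rules described above (supported types, categories for category parameters, defaults conforming to the type) can be checked with a small script like the following. This is a hedged sketch of such a check, not part of the PEL or the Devkit. The function check_parameter is hypothetical, and accepting whole numbers as defaults for real parameters is an assumption.

```python
from typing import Any, Dict, List

# Python types that the documented parameter types map to.
PYTHON_TYPES = {"integer": int, "real": float, "boolean": bool, "string": str}


def check_parameter(param: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one parameter declaration."""
    problems = [f"missing '{f}'" for f in ("name", "description", "type") if not param.get(f)]
    ptype = param.get("type")
    default = param.get("default")
    categories = param.get("categories")
    if ptype == "category":
        if not categories:
            problems.append("category parameter needs 'categories'")
        elif default is not None and default not in categories:
            problems.append("default is not one of the categories")
    elif ptype in PYTHON_TYPES:
        if default is not None:
            expected = PYTHON_TYPES[ptype]
            # bool is a subclass of int in Python, so treat it separately.
            is_bool = isinstance(default, bool)
            ok = isinstance(default, expected) and (expected is bool or not is_bool)
            # Assumption: a whole number is an acceptable default for 'real'.
            if ptype == "real" and isinstance(default, int) and not is_bool:
                ok = True
            if not ok:
                problems.append(f"default does not conform to type '{ptype}'")
    elif ptype is not None:
        problems.append(f"unsupported type '{ptype}'")
    return problems
```

Running it over each entry of an operator's parameters array returns an empty list for every well-formed declaration.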