You are viewing the RapidMiner Studio documentation for version 10.2.

Loop Files (Concurrency)

Synopsis

This operator executes the inner process on every selected file.

Description

With this operator you can select and filter files of a directory and execute the inner process on every selected file. Macros can be used to extract the file name, file path and file type.

In contrast to the core operator, this advanced implementation allows the inner processes to be executed in parallel. To control the degree of parallelism, activate the Background Processes Panel and change the Concurrency Level, or use the operator Set Concurrency Level.
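The behaviour described above can be approximated outside RapidMiner. The following Python sketch is only an analogy, not RapidMiner's implementation or API: a thread pool stands in for the parallel execution, and the names `file_macros`, `inner_process`, `loop_files` and `concurrency_level` are hypothetical. It runs one task per file, derives values comparable to the file name, file type and file folder macros, and collects every result.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def file_macros(path: Path) -> dict:
    """Per-file values comparable to the name, type and folder macros."""
    return {
        "file_name": path.name,
        "file_type": path.suffix.lstrip("."),  # extension without the dot
        "file_folder": str(path.parent),
    }


def inner_process(path: Path) -> dict:
    """Hypothetical stand-in for the inner process: report the file size."""
    result = file_macros(path)
    result["size_bytes"] = path.stat().st_size
    return result


def loop_files(directory: Path, concurrency_level: int = 4) -> list:
    """Run inner_process once per file and collect every result."""
    files = sorted(p for p in directory.iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=concurrency_level) as pool:
        # map() yields results in input order, even if tasks finish out of order.
        return list(pool.map(inner_process, files))


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "a.csv").write_text("x,y")
    (demo_dir / "b.txt").write_text("hello")
    for row in loop_files(demo_dir, concurrency_level=2):
        print(row["file_name"], row["file_type"], row["size_bytes"])
```

Setting `concurrency_level` to 1 makes the sketch process the files one after another, which is the rough equivalent of disabling parallel execution.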

Input

  • in (IOObject)

    This port expects an IOObject which is passed on to the inner process without being altered. It will be reproduced if used. If you want to execute an inner loop, you have to connect the operators of the subprocess with the inner loop input port. In the first iteration it delivers the IOObject of the in input port.

Output

  • output_collector (Collection)

    This port collects every result of the inner process. It will be reproduced if used.

Parameters

  • directory

    Select the directory from where to start scanning for files.

    This parameter is only available if the file set port is not connected.

    Range: string
  • filter_type Specifies how to filter file names. You can use either standard, command-shell-like glob filtering or a regular expression. Range: selection
  • filter_by_glob

    Specifies a glob expression which is used as filter for the file and directory names.

    Here is a short overview:

    • * : any number of characters
    • **: same as '*', but crosses directory boundaries. Useful to match complete paths.
    • ? : matches exactly one char
    • {}: contains collections that are separated by ','. The glob filter will try to match the string to any of the strings in the collection.
    • []: contains a range of chars or a single char (e.g. [a-z]).
    • \* : matches a literal '*'
    • \? : matches a literal '?'
    • \*\*: matches a literal '**'

    Range: string
  • filter_by_regex Specifies a regular expression which is used as filter for the file and directory names, e.g. 'a.*b' for all files starting with 'a' and ending with 'b'. Ignored if empty. Range: string
  • recursive Set whether to recursively search every directory. If set to true, the operator will include files inside sub-directories (and sub-sub-directories ...) of the selected directory. Range: boolean
  • skip_inaccessible Set whether to ignore files/directories which cannot be accessed. If set to true, the operator will continue running, even if a file/directory cannot be accessed - it will simply be ignored and logged. By default, any inaccessible file/directory will cause the operator to abort. Range: boolean
  • enable_macros If this parameter is enabled, you can name and extract three macros (for file name, file type and file folder) and use them in your subprocess. Range: boolean
  • macro_for_file_name If filled, a macro with this name will be set to the name of the current entry. To access the full path including the containing directory, combine this with the folder macro. Can be left blank. Range: string
  • macro_for_file_type If filled, a macro with this name will be set to the file's extension. Can be left blank. Range: string
  • macro_for_file_folder If filled, a macro with this name will be set to the containing folder of the current file. To access the full path, combine this with the name macro. Can be left blank. Range: string
  • reuse_results Set whether to reuse the results of each iteration as the input of the next iteration. If set to true, the output of each iteration is used as input for the next iteration. Enabling this parameter will force the operator to NOT run in a parallel fashion. If set to false, the input of each iteration will be the original input. Range: boolean
  • enable_parallel_execution This parameter enables the parallel execution of the inner processes. Disable parallel execution if you run into memory problems or if you need an inner loop. The end result will be propagated to the outside process and can be used in the usual way. Range: boolean
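To make the glob rules listed under filter_by_glob concrete, here is a minimal Python sketch that translates such an expression into a regular expression. It is an illustration of the listed rules only, not the operator's actual matcher. The function names `glob_to_regex` and `glob_matches` are hypothetical, and the sketch assumes '/' as the directory separator, a match against the whole name, and that '?' does not match a separator.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate the glob rules from the overview into a regular expression."""
    out = []
    i = 0
    in_group = False  # True while inside a {a,b,c} collection
    while i < len(glob):
        c = glob[i]
        if c == "\\" and i + 1 < len(glob):
            out.append(re.escape(glob[i + 1]))  # \* or \? -> literal character
            i += 2
            continue
        if c == "*":
            if glob[i:i + 2] == "**":
                out.append(".*")  # '**' crosses directory boundaries
                i += 2
                continue
            out.append("[^/]*")  # '*' stays inside one path segment
        elif c == "?":
            out.append("[^/]")  # exactly one character
        elif c == "[" and glob.find("]", i + 1) != -1:
            end = glob.find("]", i + 1)
            out.append(glob[i:end + 1])  # copy a class such as [a-z] unchanged
            i = end + 1
            continue
        elif c == "{":
            out.append("(?:")
            in_group = True
        elif c == "}" and in_group:
            out.append(")")
            in_group = False
        elif c == "," and in_group:
            out.append("|")  # alternatives inside the collection
        else:
            out.append(re.escape(c))
        i += 1
    if in_group:
        raise ValueError("unclosed '{' in glob expression")
    return "".join(out)


def glob_matches(glob: str, name: str) -> bool:
    """Return True if the whole name matches the glob expression."""
    return re.fullmatch(glob_to_regex(glob), name) is not None


print(glob_matches("*.{csv,xlsx}", "report.xlsx"))
print(glob_matches("*.csv", "sub/data.csv"))
print(glob_matches("**.csv", "sub/data.csv"))
```

The same `re.fullmatch` call illustrates the filter_by_regex example: 'a.*b' accepts a name only if it starts with 'a' and ends with 'b'.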

Tutorial Processes

Generating an ExampleSet with names of all files in a directory

This example process shows how the Loop Files operator can be used to iterate over the files in a directory. You need at least a basic understanding of macros and logs to follow it completely. The goal is simply to provide a list of all files in the specified directory as an ExampleSet.

The process starts with the Loop Files operator; all of its parameters keep their default values. Inside its subprocess, the Log operator stores the file name in the log table in every iteration. The Provide Macro As Log Value operator is placed before the Log operator to make the file name macro of the Loop Files operator available to it. The file name macro (specified through the file name macro parameter) is called 'file_name', so the macro name parameter of the Provide Macro As Log Value operator is set to 'file_name'.

After the Loop Files operator has run, the names of all files in the specified directory are stored in the log table. The Log to Data operator converts this data into an ExampleSet, which is connected to the result port of the process and can be viewed in the Results Workspace. Because the path of the RapidMiner repository was specified in the directory parameter of the Loop Files operator, the ExampleSet contains the names of all files in your RapidMiner repository, including files with '.properties' and '.rmp' extensions. If you want only the file names with the '.rmp' extension, you can set the filter parameter to '.*rmp'.
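For readers who think in code, the outcome of this tutorial can be sketched in Python. This is only an analogy to the process described above, not a RapidMiner API: the function `list_file_names` and its parameter `name_regex` are hypothetical, and a list of rows stands in for the resulting ExampleSet.

```python
import re
import tempfile
from pathlib import Path


def list_file_names(directory: Path, name_regex: str = "") -> list:
    """Return one row per file; an empty regex keeps every file."""
    rows = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        # The regex must match the whole file name, as '.*rmp' does for 'x.rmp'.
        if name_regex and re.fullmatch(name_regex, path.name) is None:
            continue
        rows.append({"file_name": path.name})
    return rows


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ("process.rmp", "process.properties", "notes.txt"):
        (demo_dir / name).write_text("")
    print(list_file_names(demo_dir))           # every file
    print(list_file_names(demo_dir, ".*rmp"))  # only names ending in 'rmp'
```

Each row plays the role of one log entry written during one iteration of the loop.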