Enabling a function for parallel execution

One enormous step forward for PL/SQL in Oracle9i is the ability to execute functions within a parallel query context. Prior to Oracle9i, a call to a PL/SQL function inside SQL caused serialization of that query—a major problem for data warehousing applications. With Oracle9i, you can now add information to the function header to instruct the runtime engine how that function can and should be used.

In general, if you would like your function to execute in parallel, it must have a single, strongly typed REF CURSOR input parameter.[2]

[2] The input REF CURSOR need not be strongly typed to be partitioned by ANY.

Here are some examples:

· Specify that the function can run in parallel and that the data passed to that function can be partitioned arbitrarily:

· CREATE OR REPLACE FUNCTION my_transform_fn ( · p_input_rows in employee_info.recur_t )· RETURN employee_info.transformed_t· PIPELINED PARALLEL_ENABLE ( PARTITION p_input_rows BY ANY )

In this example, the keyword ANY expresses the programmer's assertion that the results are independent of the order in which the function gets the input rows. When this keyword is used, the runtime system randomly partitions the data among the query slaves. This keyword is appropriate for use with functions that take in one row, manipulate its columns, and generate output rows based on the columns of this row only. If your program has other dependencies, the outcome will be unpredictable.

· Specify that the function can run in parallel, that all the rows for a given department go to the same slave, and that all of these rows are delivered consecutively:

· CREATE OR REPLACE FUNCTION my_transform_fn ( · p_input_rows in employee_info.recur_t )· RETURN employee_info.transformed_t· PIPELINED· CLUSTER P_INPUT_ROWS BY (department)· PARALLEL_ENABLE ( PARTITION P_INPUT_ROWS BY HASH (department) )

Oracle uses the term clustered to signify this type of delivery, and cluster key for the column (in this case, "department") on which the aggregation is done. But significantly, the algorithm does not care in what order of cluster key it receives each successive cluster, and Oracle doesn't guarantee any particular order here. This allows for a quicker algorithm than if rows were required to be clustered and delivered in the order of the cluster key. It scales as order N rather than order N.log(N), where N is the number of rows.

In this example, we can choose between HASH (department) and RANGE (department), depending on what we know about the distribution of the values. HASH is quicker than RANGE and is the natural choice to be used with CLUSTER...BY.

· Specify that the function can run in parallel and that the rows that are delivered to a particular slave process, as directed byPARTITION... BY (for that specified partition), will be locally sorted by that slave. The effect will be to parallelize the sort:

· CREATE OR REPLACE FUNCTION my_transform_fn ( · p_input_rows in employee_info.recur_t )· RETURN employee_info.transformed_t· PIPELINED· ORDER P_INPUT_ROWS BY (C1)· PARALLEL_ENABLE ( PARTITION P_INPUT_ROWS BY RANGE (C1) )

Because the sort is parallelized, there should be noORDER...BY in the SELECT used to invoke the table function. (In fact, an ORDER...BY clause in the SELECT statement would subvert the attempt to parallelize the sort.) Thus it's natural to use the RANGE option together with the ORDER...BY option. This will be slower than CLUSTER...BY, and so should be used only when the algorithm depends on it.

The CLUSTER...BY construct can't be used together with the ORDER...BY in the declaration of a table function. This means that an algorithm that depends on clustering on one key, c1, and then on ordering within the set row for a given value of c1 by, say, c2, would have to be parallelized by using the ORDER ... BY in the declaration in the table function.