Thursday, June 3, 2010

Forward looking OpenCL 'C' Programming Techniques: Kernel execution dependencies

You may be working with an implementation that divides several processing steps into separate kernels. These kernels may be algorithmically independent of each other. In that case, if you are developing for a single device, you may wish to place the enqueue kernel calls in some arbitrary order followed by a single call to 'clFinish()'. Then, at some point in the future, the kernels can be spread over multiple devices in a consistent pattern without having to recall the algorithmic interdependencies.

E.g. The following examples attempt to illustrate this concept:
(Upper case letters denote execution-independent kernels and lower case letters denote kernels that depend on the prior results for forward computation. Vertical spacing conveys 'EnqueueKernel' calls.)

Current implementation:

Dev 1 (A)
Dev 1 (B)
Dev 1 (C) (clFinish)
Dev 1 (d) (clFinish)
Dev 1 (e) (clFinish)
Dev 1 (F)
Dev 1 (G)
Dev 1 (H) (clFinish)
Dev 1 (i) (clFinish)
Dev 1 (j) (clFinish)

Notice that a single device is used and that multiple kernels are sequentially processed by that single device. Now note the placement of the clFinish calls. These come before a processing step that requires that all prior steps be complete before continuing.

Let's say that at some point in the future your application requires further performance tuning, and so you add two more devices to your system. In this case, the independent kernels that sit between clFinish() calls can be divided among the available devices:

Dev 1 (A), Dev 2 (B), Dev 3 (C) (clFinish)
Dev 1 (d) (clFinish)
Dev 1 (e) (clFinish)
Dev 1 (F), Dev 2 (G), Dev 3 (H) (clFinish)
Dev 1 (i) (clFinish)
Dev 1 (j) (clFinish)

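Note that the clFinish calls can be used as markers to divide the execution threads. Since clFinish() waits on a single command queue, and each device has its own queue, a marker on a multi-device row means calling clFinish() on every queue used in that row.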
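
The grouping rule behind the two diagrams can be sketched in plain C. This is only an illustration of the bookkeeping, not OpenCL host code: the 'Placement' struct and the 'plan_stages' function are hypothetical names, and the round-robin device choice is an assumption that happens to reproduce the rows shown above. A stage is the set of kernels enqueued before one clFinish marker.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One planned enqueue: which kernel, on which device, in which stage.
 * Every stage ends with a clFinish-style synchronization point. */
typedef struct {
    char kernel;  /* kernel label, e.g. 'A' or 'd' */
    int  device;  /* 0-based device index */
    int  stage;   /* 0-based stage index */
} Placement;

/* Hypothetical helper: plan placements for labels such as "ABCdeFGHij".
 * Upper case labels are independent kernels, lower case labels depend on
 * all prior results. 'out' must have room for one entry per label.
 * Returns the number of stages, i.e. the number of clFinish markers. */
int plan_stages(const char *labels, int num_devices, Placement *out)
{
    int stage = -1;
    int slot = 0;              /* kernels already placed in this stage */
    int prev_independent = 0;

    assert(labels != NULL && out != NULL && num_devices > 0);

    for (size_t i = 0; labels[i] != '\0'; ++i) {
        int independent = isupper((unsigned char)labels[i]) != 0;

        /* A dependent kernel always gets a stage of its own; a run of
         * independent kernels shares one stage. */
        if (stage < 0 || !independent || !prev_independent) {
            ++stage;
            slot = 0;
        }

        out[i].kernel = labels[i];
        /* Assumption: independent kernels go round-robin over the
         * devices, dependent kernels stay on the first device. */
        out[i].device = independent ? slot % num_devices : 0;
        out[i].stage = stage;

        ++slot;
        prev_independent = independent;
    }
    return stage + 1;
}
```

Calling plan_stages("ABCdeFGHij", 1, out) puts every kernel on the one device, as in the first diagram. Passing 3 spreads A, B, C and F, G, H over three devices while the stage boundaries stay where they were.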

Wednesday, June 2, 2010

Java NIO ByteOrdering with JNA and native C code on OSX

Say you want to interact with a C program that modifies a NIO Buffer allocated from Java. Java NIO supports this, but the default ByteOrder of a NIO ByteBuffer is big-endian regardless of the platform, while native code on an Intel Mac reads and writes little-endian values. If you set the buffer's order to ByteOrder.nativeOrder() when allocating it, the JVM and the native code agree on the byte layout and no mismatch occurs.
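
To see what such a mismatch looks like from the C side, here is a small sketch. It assumes a 4-byte IEEE 754 'float', and the helper names ('read_native', 'read_big_endian', 'host_is_little_endian') are made up for this illustration. It decodes the same four bytes once in the host's order and once as big-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret 4 bytes in the host's own order, which is what C code does
 * when it reads through a float pointer into the buffer. */
float read_native(const unsigned char bytes[4])
{
    float f;
    assert(bytes != NULL);
    memcpy(&f, bytes, sizeof f);
    return f;
}

/* Decode 4 bytes stored most significant byte first, the layout a
 * ByteBuffer uses by default. Assumes a 32-bit IEEE 754 float. */
float read_big_endian(const unsigned char bytes[4])
{
    uint32_t bits;
    float f;
    assert(bytes != NULL);
    bits = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Returns 1 when the least significant byte is stored first. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

The bytes 0x44 0x42 0x40 0x00 are 777.0f written big-endian. On a little-endian host read_big_endian() recovers 777.0f from them, but read_native() returns an unrelated value, which is the garbage you would see if the buffer's order were left at the default.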

Example Java class:
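import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import com.sun.jna.Native;

public class JNAByteOrderExample {
    public static void main(String[] args)
    {
        // Direct buffer for 10 floats (4 bytes each) in the platform's native byte order.
        FloatBuffer fb = ByteBuffer.allocateDirect(10*4).order(ByteOrder.nativeOrder()).asFloatBuffer();

        // Presumably the native call belongs here, so that the C code fills the buffer.
        demo(fb);

        for(int index =0; index<...)
        {
            System.out.print( " " + fb.get(index) );
        }
    }

    static
    {
        // Bind the native method below to the library built from the C code.
        Native.register("/libdemo.dylib");
    }
    public static native int demo(FloatBuffer fb);
}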

Example C code:

int demo( float * data )
{
    int i = 0;
    for( i = 0; i ...)
    {
        data[i]=777;
    }
    return 0;
}

Compile the C file with:
gcc -bundle -o libdemo.dylib -dynamic demo.c

Configuring the Eclipse 3.5.2 SDK JVM on OSX...

Many websites suggest editing the eclipse.ini file with a multiline option like:
...
-vmargs
-Xms1g
-Xmx2g
...

This did not work for me. As a workaround, I edited the 'Run Configurations' entry for the program to include these settings: I selected the 'Arguments' tab and added the following to the 'VM Arguments:' field:
-Xms1g -Xmx2g

A temporary fix!