Code Acceleration on GPU Architectures

Port and optimize code for GPU architectures based on NVIDIA or AMD chips, using either CUDA or OpenCL. The following steps are often
required to achieve good performance:

Identify acceleration candidates and make an initial port of them to the GPU.
Review the hierarchical data flow (host to device, then within the device).
Tailor data structures for GPUs and eliminate redundant data movement.
Reduce the number of registers and operations used, and mitigate warp divergence.
Auto-tune optimized kernels for the best performance on target architectures.
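
As one illustration of the data-structure step above, here is a minimal host-side sketch in C++ of converting an array-of-structures layout into a structure-of-arrays layout before the data is copied to the device. The particle fields and the names ParticleAoS, ParticlesSoA and to_soa are hypothetical, chosen only for this example. Whether the conversion pays off depends on how the kernels actually access the data.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures (AoS): convenient on the CPU, but when GPU thread i
// reads particles[i].x, neighbouring threads touch addresses that are
// sizeof(ParticleAoS) apart, i.e. a strided access pattern.
// (Hypothetical example type, not taken from any particular code base.)
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure-of-arrays (SoA): each field is stored contiguously, so
// neighbouring threads reading x[i] and x[i + 1] touch adjacent floats,
// which is the pattern GPUs can typically coalesce.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Convert once on the host, before any host-to-device transfer, so each
// field array can be copied to the device as one contiguous buffer.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    const std::size_t n = in.size();
    out.x.reserve(n);
    out.y.reserve(n);
    out.z.reserve(n);
    out.mass.reserve(n);
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
    }
    return out;
}
```

After the conversion, each of the four vectors can be transferred on its own, and a kernel that needs only positions does not have to move the mass array at all, which also helps with the redundant data movement mentioned above.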