Port and optimize code for GPU architectures based on NVIDIA or AMD chips, using either CUDA or OpenCL. The following steps are often
required to extract reasonable performance:
- Identify acceleration candidates and make an initial port of them to the GPU.
- Review the hierarchical data flow (host to device, and within the device).
- Tailor data structures for GPUs and eliminate redundant data transfers.
- Reduce the number of registers and operations, and mitigate warp divergence.
- Auto-tune optimized kernels for the best performance on target architectures.
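
As one illustration of the "tailor data structures" step, the sketch below converts an array-of-structures layout into a structure-of-arrays layout on the host before any transfer to the device. It is a minimal example under stated assumptions, not a prescribed method: the particle record and the names `ParticleAoS`, `ParticlesSoA`, `to_soa` and `particle_at` are hypothetical and chosen only for illustration. The motivation is that when neighbouring GPU threads read the same field of neighbouring elements, a contiguous per-field array usually allows coalesced memory access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side record (array-of-structures layout).
// Threads i and i+1 reading p[i].x and p[i+1].x would touch addresses
// sizeof(ParticleAoS) bytes apart.
struct ParticleAoS {
    float x, y, z;
    float mass;
};

// Structure-of-arrays layout: each field is stored contiguously, so
// x[i] and x[i+1] are adjacent in memory.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
    std::size_t size() const { return x.size(); }
};

// Convert AoS to SoA once on the host; each field array can then be
// copied to the device as a single contiguous buffer.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.mass.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.mass.push_back(p.mass);
    }
    return soa;
}

// Reassemble a single record, e.g. when checking results on the host.
ParticleAoS particle_at(const ParticlesSoA& soa, std::size_t i) {
    assert(i < soa.size());
    return ParticleAoS{soa.x[i], soa.y[i], soa.z[i], soa.mass[i]};
}
```

Whether this layout actually helps depends on the access pattern of the kernels involved, so it is worth measuring both layouts on the target architecture rather than assuming a gain.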
