GPU Architecture And Performance Computer Science Essay

Published: November 9, 2015 Words: 2259

Abstract

GPU architectures are increasingly important in the multi-core era. GPUs are designed to process massive amounts of data on a parallel architecture with a large number of parallel processors. Enabling flexibility and making GPUs more programmable are therefore very important, and high performance computing on GPUs must keep improving to meet growing demands. To evaluate the performance of GPUs, we can analyze them from different angles, such as architectures, algorithms, and platforms.

Keywords

GPU, CUDA, GPGPU, OpenCL

Introduction

A GPU (Graphics Processing Unit) [5] is an electronic circuit specially designed to rapidly manipulate and alter memory in order to accelerate the building of images in a frame buffer intended for output to a display. The modern GPU is very efficient at manipulating computer graphics, and when executing an algorithm that processes large blocks of data in parallel, a GPU is more effective than a general-purpose CPU because of its parallel computing architecture. In particular, the GPU is specialized for compute-intensive, highly parallel computation, which is exactly what graphics rendering requires, so its architecture devotes more transistors to data processing than to data caching and flow control.

This paper consists of three parts. The first part is an overall introduction to GPUs and their development trend, based on several resources. The second part is based on a paper that discusses optimization techniques and performance analysis of two algorithms on a novel GPU architecture. The third part focuses on using a framework to make GPU execution achieve high performance and scalability.

1. GPU Development Track: An Overall View of GPU

The early GPUs were fixed-function graphics pipelines: the graphics hardware was configurable but not programmable by the application developer. As demands grew more sophisticated, new features had to be built into the fixed functions. During the last 30 years, graphics architecture has evolved from a simple pipeline for drawing wireframe diagrams to a highly parallel design consisting of several deep parallel pipelines capable of rendering complex interactive imagery that appears three-dimensional. Concurrently, many of the calculations involved became far more sophisticated and user-programmable.

Because early GPUs only exposed the features required by the graphics API, programmers who wanted to access the computational resources had to cast their problems into native graphics operations. Intrepid researchers nevertheless demonstrated a handful of useful applications with painstaking effort, an approach that became known as GPGPU (general-purpose computing on GPUs).

OpenCL (Open Computing Language) is the first open, royalty-free standard for cross-platform, parallel programming of the modern processors found in personal computers, servers, and handheld/embedded devices. One of its most important benefits is portability [10]: OpenCL can be ported to embedded devices, digital signal processors, and field-programmable gate arrays. OpenCL-coded routines, called kernels, can execute on GPUs and CPUs from popular manufacturers such as Intel, AMD, NVIDIA, and IBM. OpenCL kernels can run on different types of devices, and a single application can dispatch kernels to multiple devices at once. However, OpenCL's data structures and functions are unique to it and are not easy to learn, and OpenCL is not derived from MPI, PVM, or any other distributed computing framework.

CUDA [11] is a parallel computing platform and programming model that makes using a GPU for general-purpose computing simple and elegant. CUDA is a hardware and software coprocessing architecture for parallel computing that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, OpenCL, DirectCompute, and other languages. Because most languages were designed for one sequential thread, CUDA preserves this model and extends it with a minimalist set of abstractions for expressing parallelism.

By design, CUDA enables the development of highly scalable parallel programs that can run across tens of thousands of concurrent threads and hundreds of processor cores. A compiled CUDA program executes on any size GPU, automatically using more parallelism on GPUs with more processor cores and threads.

CUDA processing consists of four steps (a minimal code sketch follows Figure 0):

Copy data from main memory to GPU memory

The CPU issues the processing instructions to the GPU

The GPU executes the work in parallel on its cores

Copy the result from GPU memory back to main memory


Figure 0. CUDA processing flow
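To make the four steps concrete, the following is a minimal sketch of a CUDA program for a hypothetical vector-addition kernel (the names vecAdd, hA, dA, etc. are illustrative, not taken from any of the cited papers):

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: each thread adds one element (step 3 of the list above).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (main memory) buffers.
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU memory) buffers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // Step 1: copy data from main memory to GPU memory.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 2: the CPU launches the kernel on the GPU.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Step 3 runs on the GPU; step 4: copy the result back to main memory.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Because the kernel launch only specifies the number of blocks and threads, the same compiled program scales to GPUs with more cores: the hardware simply schedules more blocks concurrently.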

2. Two algorithms implemented on the novel architecture

In Paper [7], the researchers evaluate two life-science algorithms for the GPU: Needleman-Wunsch (N-W) sequence alignment and Direct Coulomb Summation (DCS). For Needleman-Wunsch it is difficult to get good performance numbers, while Direct Coulomb Summation is particularly well suited to graphics cards. They use the recent NVIDIA Fermi architecture to evaluate the performance impact of novel hardware features, such as the cache subsystems, on the optimizing transformations. In the results they analyze not only the theoretical potential of the optimizations but also their effects on execution times.

In the experiments, the new hardware features introduced with NVIDIA Fermi play an important role. Until then, on-chip shared memory had been used as an explicitly managed cache and was a major source of performance improvements. With Fermi, a cache subsystem for global memory accesses is available as well, which competes with shared memory.

The applied optimization techniques are evaluated not only in terms of runtime measurements; theoretical performance estimations are also conducted to validate the results. The experiments measure both the kernel execution time, which covers the interval from deploying the kernel on the GPU until all threads have finished, and the overall execution time, which additionally includes allocation/release of the arrays in the CPU main memory and in the graphics card memory together with the time to transfer the data between both devices.

The Needleman-Wunsch experiments target single, very large alignments, which must be parallelized efficiently. Direct Coulomb Summation is well suited as a GPU example because its computational structure offers high data parallelism and high arithmetic intensity. For the DCS algorithm, they basically use the publicly available code presented in [9].

The measurements are defined as follows:

Kernel execution time is the time on the host required for deploying the kernel, executing the function, and waiting until all threads have finished.

Overall execution time is the kernel runtime plus the allocation and release of the arrays in the CPU main memory and in the graphics card memory, together with the time to transfer the data between both devices.

The two algorithms were implemented on an NVIDIA GTX 480 using CUDA.

1. Use of on-chip memory

In this case, N-W uses shared memory, and DCS uses constant memory.

Figure 1. Shared memory speedup for NW on GTX 480.

Figure 1 shows the results for the N-W implementation. Since shared memory is small and fast, it can be seen as an explicitly managed cache with all the potential inherent in caches. The shared-memory kernel has a clear speedup of between 3.7 and 8.7 w.r.t. the sequential CPU version, whereas the global-memory kernel achieved a speedup of only 1.0 to 1.9. For the overall execution time, however, both versions show similar performance numbers and a slight slowdown.
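The following is an illustrative sketch of the general technique, not the paper's actual N-W kernel: each block stages a tile of its input in on-chip shared memory (assuming 256 threads per block), so that repeated accesses hit the fast explicitly managed cache instead of global memory.

#define TILE 256   // assumed block size for this sketch

__global__ void stageAndReuse(const int *globalIn, int *globalOut, int n) {
    __shared__ int tile[TILE];                   // explicitly managed on-chip cache
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = globalIn[i];  // one global read per element
    __syncthreads();                             // tile is now visible to the whole block

    if (i < n) {
        // Further accesses hit the fast shared-memory copy instead of DRAM.
        int left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0;
        globalOut[i] = tile[threadIdx.x] + left;
    }
}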

Figure 2. Constant memory speedup for DCS on GTX 480.

Figure 2 shows the speedup obtained by using constant memory and single-precision (SP) arithmetic. In this case, the kernel execution with constant memory achieved a speedup of up to 31, and the overall execution with constant memory achieved a speedup of up to 28. In addition, the global-memory versions also obtained good speedups: the kernel speedup profits from the cache subsystem, and the overall version gains its speedup from the small amount of data that has to be transferred.
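As a rough illustration of why constant memory fits DCS (assumed data layout, not the published code from [9]): the atom array is small and read identically by every thread, so placing it in __constant__ memory lets the constant cache broadcast each value to a whole warp.

#define MAX_ATOMS 4000                    // 4 floats per atom: x, y, z, charge (fits in 64 KB)

__constant__ float atoms[MAX_ATOMS * 4];  // filled from the host with cudaMemcpyToSymbol

__global__ void dcsSlice(float *energy, int numAtoms, int gridW, float spacing, float z) {
    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    int gy = blockIdx.y * blockDim.y + threadIdx.y;
    float x = gx * spacing, y = gy * spacing, e = 0.0f;

    for (int a = 0; a < numAtoms; ++a) {         // every thread reads the same atom,
        float dx = x - atoms[4 * a];             // so the constant cache serves a broadcast
        float dy = y - atoms[4 * a + 1];
        float dz = z - atoms[4 * a + 2];
        e += atoms[4 * a + 3] * rsqrtf(dx * dx + dy * dy + dz * dz);
    }
    energy[gy * gridW + gx] = e;
}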

2. Use of shorter data types

Figure 3. Short data type overall speedup for NW on GTX 480.

Figure 3 shows the speedup obtained by using short instead of int, which is around 1.8 w.r.t. the sequential CPU version using int. For N-W, it is possible to use short instead of int as long as the sequence length does not exceed 8000, which guarantees that the computed values stay within the range of a short.

Figure 4. Overall costs of DP arithmetic for DCS.

Figure 4 shows the cost of DP arithmetic compared to SP arithmetic. For DCS the data type is changed from float to double. To still guarantee double-precision (DP) results, mixed-precision algorithms have been developed: the bulk of the computation is done in SP, and DP is used selectively to refine the result. In the figure above, the SP GPU execution time speedups are close together and much higher than the DP execution time speedups. This is because SP floating-point operations are at least two times faster than double precision on CPUs, and the gap is usually larger on GPUs.

The better performance is achieved through the use of shorter data types, which has two effects. First, replacing double and int with float and short halves the amount of data that has to be transferred between CPU and GPU, and the bandwidth between CPU and GPU is a main performance bottleneck for many applications. Second, SP floating-point operations are at least two times faster than double precision on CPUs, and for GPUs the factor is usually larger: for the GTX 285 single precision is 12 times faster, for the GTX 480 it is 8 times faster, and for the Tesla C2050 it is 2 times faster, like for CPUs. Of course, it must be semantically valid to use the shorter data types.
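The first effect is easy to see in code. In this small sketch (hypothetical buffers, not from the paper), copying the same number of elements as short instead of int simply halves the bytes that cross the CPU-GPU bus:

#include <cuda_runtime.h>

void transferSequences(const int *asInt, const short *asShort, int length) {
    int   *dInt;   cudaMalloc(&dInt,   length * sizeof(int));    // 4 bytes per element
    short *dShort; cudaMalloc(&dShort, length * sizeof(short));  // 2 bytes per element

    // Same element count, half the transferred bytes in the short version.
    cudaMemcpy(dInt,   asInt,   length * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dShort, asShort, length * sizeof(short), cudaMemcpyHostToDevice);

    cudaFree(dInt);
    cudaFree(dShort);
}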

3. Making GPU execution more efficient and scalable

In Paper [8], the researchers propose a framework for efficient and scalable execution of domain-specific parallel templates on GPUs. Many challenges impact the performance of a GPU program, such as how the computation is organized into threads and groups of threads, register and on-chip shared memory usage, off-chip memory access characteristics, synchronization among threads on the GPU and with the host, and data transfer between GPU memory and host memory. Since these concerns are all low-level, GPUs remain inaccessible to algorithm designers who are used to programming in very high-level languages.

The domain-specific templates comprise computations that can be represented as a graph of parallel operators. They act as bridges between domain experts (algorithm designers) and current GPU programming frameworks (such as CUDA). Furthermore, the information embodied in these templates can be exploited to achieve GPU execution with high efficiency (performance) and scalability.

To achieve efficiency and scalability, two challenges must be addressed:

Challenge 1: Scaling to data sizes larger than the GPU memory

When the data size of the target task is larger than the GPU memory, the algorithm needs to be split to process the oversized input, and the application programmer has to separately consider all the resulting cases and determine the optimal operation for each of them.

Challenge 2: Minimizing data transfers for efficient GPU execution

An observation that falls out of the above example is that, as data sizes increase, a given computation needs to be divided into smaller and smaller units in order to fit into the GPU memory, leading to more data transfers between the CPU and GPU. This leads to the next challenge, namely minimizing the overhead of data transfer between the host and GPU.

The limited CPU-to-GPU communication bandwidth is the major factor limiting performance gains. This limitation is especially significant for applications that do not fit in the GPU memory and hence need to frequently transfer data structures between host and GPU memory.

Solutions:

To address the challenges mentioned above, the researchers propose a GPU execution framework that generates an optimized execution plan for the template, specifying the exact sequence of offload operations and data transfers required to execute it.

1) Operator Splitting: achieving scalability with data size.

The operator splitting algorithm can be summarized as follows. Compute the memory requirements of all operators (the sum of the sizes of the data structures associated with each operator); note that any operator whose memory requirements are larger than the available GPU memory cannot be executed without modification. Split the operators whose memory requirements are greater than the GPU memory; this step ensures feasibility for all operators. When an operator is split, other operators that produce data for, or receive data from, the split operator also need to be modified. Repeat these steps until it is feasible to execute all operators on the GPU.
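A much simplified host-side sketch of that loop is given below; it is an assumption about one possible structure, not the paper's implementation, and it omits the re-wiring of producers and consumers that a real splitter would perform.

#include <vector>
#include <cstddef>

struct Operator {
    std::size_t memBytes;   // sum of the sizes of the data structures it touches
};

void splitOversizedOperators(std::vector<Operator> &ops, std::size_t gpuMemBytes) {
    bool changed = true;
    while (changed) {                          // repeat until every operator fits
        changed = false;
        std::vector<Operator> next;
        for (const Operator &op : ops) {
            if (op.memBytes > gpuMemBytes) {
                // Split into two half-sized operators; operators that produce or
                // consume the split operator's data would also be modified here.
                next.push_back({op.memBytes / 2});
                next.push_back({op.memBytes - op.memBytes / 2});
                changed = true;
            } else {
                next.push_back(op);
            }
        }
        ops = next;
    }
}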

2) Operator and data transfer scheduling:

Once the operators are split, every individual operation can run on the GPU, but the operators and their data transfers still have to be scheduled. The data transfer optimization problem can be thought of as two sub-problems: find a good operator schedule, and then find the optimal data transfer schedule given this operator schedule.

The data transfer scheduling algorithm can be summarized as follows. Calculate the "latest time of use" for each data structure (since the operator schedule is known, this can be computed statically). When a data structure needs to be brought into the GPU memory (i.e., it is the input or output of the operator being executed at the current time step) and there is insufficient space, move the data structures that have the furthest "latest time of use" to the CPU until the new data structure can be accommodated. Remove data eagerly from GPU memory, i.e., delete data structures immediately after they become unnecessary.
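The eviction rule can be sketched as follows; again this is an illustrative host-side fragment under assumed data structures, not the framework's actual code.

#include <vector>
#include <algorithm>
#include <cstddef>

struct Buffer {
    std::size_t bytes;
    int latestUse;      // last time step at which any operator touches this buffer
};

void makeRoom(std::vector<Buffer> &resident, std::size_t &freeBytes, std::size_t needed) {
    // Buffers needed furthest in the future are evicted first.
    std::sort(resident.begin(), resident.end(),
              [](const Buffer &a, const Buffer &b) { return a.latestUse > b.latestUse; });
    while (freeBytes < needed && !resident.empty()) {
        freeBytes += resident.front().bytes;   // move back to CPU memory and release
        resident.erase(resident.begin());
    }
}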

Results:

With the above two algorithms applied, it is clear that the framework reduces the amount of communication between the host and the GPU. For scalability, the results show that this method achieves a speedup of 1.7X to 1.8X compared to the baseline GPU implementation.