FDTD, proposed by Yee [1] in 1966, is a numerical method for solving partial differential equations that approximates the partial derivatives with finite differences in time and space. FDTD solves Maxwell's curl equations and is widely used to simulate electromagnetic fields; however, for the numerical method to remain stable, it requires a considerable amount of running time.
These complex computations have traditionally been tackled with supercomputers [2] [3] because of their architectural nature: supercomputers offer many cores for parallel code and reach teraflops of performance. This approach is discouraged, however, because supercomputer time is expensive and impractical for most scientists.

The second approach is cluster computing, in which a set of high-performance machines is linked over a network and the code is distributed among them. This approach suffers from network bandwidth overhead: the nodes must be connected by high-bandwidth links or performance degrades. Parallel code on such HPC systems also carries overhead to accommodate their heterogeneous nature, typically Message Passing Interface (MPI) code to handle distributed-computing issues (e.g., job scheduling and leader selection). Moreover, establishing an HPC network is not always affordable, and its cost may exceed a research grant.

The third approach is general-purpose GPU (GPGPU) computing, such as CUDA-enabled GPUs running in personal computers (PCs). A GPGPU lets the user write massively parallel code that exploits the device's parallel cores, running thousands of threads simultaneously in a Single Instruction Multiple Thread (SIMT) paradigm. Since FDTD computations exhibit high data parallelism, several attempts have been made to parallelize FDTD code for GPGPUs. This approach is far more cost effective, as GPGPU-equipped PCs are affordable, whereas the first two approaches cost from hundreds of thousands to millions. Another benefit is that CUDA-enabled GPUs provide APIs for parallel programming; CUDA is an extension of the C/C++ language with a small learning curve, and it offers performance exceeding one thousand GFLOPS as of 2009 [4].
GPUs were introduced in 1999 [5] with NVIDIA's GeForce 256. A GPU differs from a CPU: CPUs are suited to sequential code, and although modern CPUs employ parallel techniques such as multiple-thread issuing and complex instruction-level parallelism, they remain best at sequential code and flow control, making them a poor candidate for problems with a high degree of data parallelism. GPUs are more suitable in this case: they contain many parallel functional units (e.g., ALUs and floating-point units) and are optimized for parallel workloads such as graphical computations, their primary function inside the graphics card. Applications of these cards range from desktop 2D graphics to 3D game graphics and medical imaging such as MRI, while GPGPU processing is used for tasks such as protein structure prediction and wave propagation [6]. Compared with CPUs, GPUs were not easily programmed in their early development from 1999 to 2004: they were programmed through graphics languages and APIs such as OpenGL and Cg, so programmers writing scientific parallel code had to express their data as graphics primitives such as pixels and shaders. This overhead was eliminated with the advent of the CUDA-enabled NVIDIA GeForce 8800. Recent GPGPUs can also be programmed using OpenCL or ATI's Compute Abstraction Layer (CAL), while CUDA remains in use for graphics application programming as well.
Current GPUs have billions of transistors and tens of Streaming Multiprocessors (SMs), each containing multiple Streaming Processors (SPs). Threads execute in groups of 32, and some GPUs can run 12,288 threads concurrently [7]. The memory system comprises a global memory, an on-chip shared memory, and cached texture and constant memories, with bandwidths on the order of 100 GB/s.
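These figures differ from device to device; as a minimal sketch assuming the standard CUDA runtime API, the parameters quoted above can be queried at run time rather than hard-coded:

    #include <cstdio>
    #include <cuda_runtime.h>

    /* Query the hardware parameters of device 0 discussed above:
       SM count, warp size, shared memory per block, global memory size. */
    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Device:               %s\n", prop.name);
        printf("SM count:             %d\n", prop.multiProcessorCount);
        printf("Warp size:            %d\n", prop.warpSize);
        printf("Shared mem per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
        printf("Global memory:        %lu MB\n", (unsigned long)(prop.totalGlobalMem >> 20));
        return 0;
    }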
FDTD Background
The FDTD method introduced by Yee [1] provides a time-domain, step-wise approximation of the partial differential equations of Maxwell's electromagnetic curl equations. These equations compute the electric field E and magnetic field H at distinct points in space called Yee cells. They capture the temporal and spatial changes of the coupled electric and magnetic fields over a period of time using an appropriate time difference. A Yee cell carries the electric-field components Ex, Ey, Ez and the magnetic-field components Hx, Hy, Hz. These components are calculated and updated iteratively in successive loops, as shown in Fig. 1 for the electric components and Fig. 2 for the magnetic components of a Yee cell [8], where µ is the magnetic permeability and ε is the permittivity. A Yee cell is visualized in Fig. 3.
Figure 1 Electric Field Components in a Yee cell
Figure 2 Magnetic Field Components in a Yee cell
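For concreteness, one representative pair of update equations in the standard Yee scheme is given below; this is a sketch in our own index notation, which may differ from the exact form shown in Figs. 1 and 2:

    H_x^{n+1/2}(i,j,k) = H_x^{n-1/2}(i,j,k)
      + \frac{\Delta t}{\mu}\left[ \frac{E_y^{n}(i,j,k+1) - E_y^{n}(i,j,k)}{\Delta z}
      - \frac{E_z^{n}(i,j+1,k) - E_z^{n}(i,j,k)}{\Delta y} \right]

    E_x^{n+1}(i,j,k) = E_x^{n}(i,j,k)
      + \frac{\Delta t}{\varepsilon}\left[ \frac{H_z^{n+1/2}(i,j,k) - H_z^{n+1/2}(i,j-1,k)}{\Delta y}
      - \frac{H_y^{n+1/2}(i,j,k) - H_y^{n+1/2}(i,j,k-1)}{\Delta z} \right]

The remaining four components follow by cyclic permutation of (x, y, z); the half-integer time indices reflect the leapfrog interleaving of the E and H updates.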
Previous work
Several attempts have been made, and continue to be made, to implement parallel FDTD computation on GPUs, optimize the efficiency of the code, and reduce the overall running time. The main argument is almost always the same: avoid the slow global memory and cache the needed data in the fast on-chip shared memory or in the texture memory. The global memory is a DRAM with roughly 400 to 600 clock cycles of latency, compared with about 4 clock cycles for the shared memory. Normally the GPU executes other instructions during this period to hide the latency, but this is not always possible, and performance then suffers sharply from global memory accesses.
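As an illustration of the problem, consider a minimal unoptimized update kernel; the 2D TEz formulation and variable names here are our own assumptions, not code from the cited works. Every operand is fetched from global memory, so each thread pays the DRAM latency several times per time step:

    /* Naive Hz update for a 2D TEz grid: all reads and writes go straight
       to global memory; ch collapses the dt/(mu*dx) update coefficient. */
    __global__ void update_hz_naive(float *hz, const float *ex, const float *ey,
                                    int nx, int ny, float ch)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= nx - 1 || j >= ny - 1) return;

        int idx = j * nx + i;
        /* Four global reads and one global write per thread, per time step. */
        hz[idx] += ch * ((ex[idx + nx] - ex[idx])      /* dEx/dy */
                       - (ey[idx + 1]  - ey[idx]));    /* dEy/dx */
    }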
Further reading on the development of a low-dispersion 3D FDTD can be found in [9]; the paper describes an FDTD algorithm called FV24 and how it reduces the dispersion level. Dispersion acts like noise, unwanted error that enters the computation, so low-dispersion algorithms are more efficient than algorithms with high dispersion levels. In [2] [10] the authors use a CUDA GPU to implement a parallel FDTD algorithm over the problem domain. The CUDA memory hierarchy, ordered from slow and large to fast and small, consists of the global memory, the constant memory, the texture memory, and the on-chip shared memory, which is nearly as fast as the threads' local registers [11].
The authors note that performance improves when the shared memory is used instead of retrieving variables from the global memory. Another key issue when designing a parallel algorithm for CUDA is the amount of register memory each thread uses. Threads are launched in "warps" of 32 simultaneous threads, and if the threads require more registers than the SM has available, a warp will launch with fewer than 32 active threads, which sharply degrades performance. A GPU programmer must therefore be careful when parallelizing algorithms, as peak performance is reached only with proper caching of memory elements and efficient use of the warp scheduler. Improvements are gained from exploiting the shared and texture memories, and multiprocessor occupancy is an important measure of hardware-resource utilization; the Visual Profiler, provided by NVIDIA as a performance-measurement tool [12], is used to measure the efficiency of the code. Use of the constant memory is also reported to improve execution time; the constant memory is a cached window onto the global memory, though it is read-only.
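Register pressure can be capped explicitly. The sketch below uses CUDA's __launch_bounds__ qualifier with illustrative values of our own choosing (128 threads per block, at least 8 resident blocks per SM), not settings recommended by the cited papers:

    /* Hinting the compiler keeps per-thread register use low enough for
       full warps to stay resident; the actual register count per kernel
       can be inspected at compile time with "nvcc --ptxas-options=-v". */
    __global__ void __launch_bounds__(128, 8)
    update_scaled(float *e, const float *h, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) e[i] += 0.5f * h[i];   /* placeholder arithmetic */
    }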
The research in [13] details the implementation of the parallel FDTD algorithm on a CUDA GPGPU. To optimize the code one must: break the loops into CUDA blocks and threads; ensure that accesses to the global memory are coalesced, i.e., touch consecutive memory locations; cache memory elements into faster memory (the shared memory); and use a multiple of 32 threads per block for optimal scheduling and memory coalescing. The analysis finds that a problem size that is a multiple of 16 provides efficient memory coalescing; while this may seem a strict constraint, a size that is not a multiple of 16 simply forces the blocks to be padded up to the next multiple of 16. The paper presents a strategy for optimizing the FDTD algorithm, given as a FORTRAN subroutine that updates the magnetic fields. Global memory access is optimized when coalesced, shared memory holds the data used within a single iteration, and threads on the boundary or edge of a block must fetch into their shared memory data that resides in the shared memory of the neighboring block, as the kernel sketch below illustrates. In [14] the performance of the FDTD implementation is likewise found to be dominated by the block size.
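A sketch of this caching strategy, again under our own 2D TEz naming assumptions rather than the FORTRAN code of [13], stages each block's operands in shared memory and lets edge threads fetch the one-cell halo owned by the neighboring block:

    #define TILE 16   /* block edge: a multiple of 16, as recommended above */

    __global__ void update_hz_shared(float *hz, const float *ex, const float *ey,
                                     int nx, int ny, float ch)
    {
        /* One extra row/column holds the halo from the neighboring block. */
        __shared__ float s_ex[TILE + 1][TILE];
        __shared__ float s_ey[TILE][TILE + 1];

        int i = blockIdx.x * TILE + threadIdx.x;
        int j = blockIdx.y * TILE + threadIdx.y;
        int idx = j * nx + i;

        if (i < nx && j < ny) {
            s_ex[threadIdx.y][threadIdx.x] = ex[idx];   /* coalesced loads */
            s_ey[threadIdx.y][threadIdx.x] = ey[idx];
            if (threadIdx.y == TILE - 1 && j + 1 < ny)  /* top halo row */
                s_ex[TILE][threadIdx.x] = ex[idx + nx];
            if (threadIdx.x == TILE - 1 && i + 1 < nx)  /* right halo column */
                s_ey[threadIdx.y][TILE] = ey[idx + 1];
        }
        __syncthreads();   /* all loads must complete before any compute */

        if (i < nx - 1 && j < ny - 1)
            hz[idx] += ch *
                ((s_ex[threadIdx.y + 1][threadIdx.x] - s_ex[threadIdx.y][threadIdx.x])
               - (s_ey[threadIdx.y][threadIdx.x + 1] - s_ey[threadIdx.y][threadIdx.x]));
    }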
Figure 3 A visualization of a Yee cell
The findings in [7] show that a parallel FDTD algorithm is memory-bound rather than arithmetically dominated. This means the cost of repeated global-memory accesses cannot be hidden behind arithmetic instructions, so optimized memory access must be employed. A multiprocessor provides at most 8192 registers, and the shared memory can be as fast as the registers as long as unaligned accesses, which cause bank conflicts, are avoided; the shared memory used in their implementation is 16 KB. We also find that the block size plays a major role in determining how much data is brought from the global memory to the shared memory at a time, since boundary cells (threads) need extra data belonging to the neighboring block to be brought from the global memory into their shared memory before they can update their values.
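The bank-conflict point can be made concrete with the classic padding idiom; this is a generic sketch, not code from [7]. Padding each shared-memory row by one element makes column-wise accesses by a half-warp land in distinct banks:

    #define TILE 16

    __global__ void transpose_tile(float *out, const float *in, int n)
    {
        /* The TILE+1 row stride staggers successive rows across the 16
           banks; with a stride of exactly TILE, the column-wise read
           below would serialize on a single bank. */
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;   /* transposed block origin */
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];  /* column read */
    }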
The findings in [10] also agree that a block size of 16 should be used, as this reduces divergence in warp scheduling. The authors use an FDTD computation in CUDA to perform radio-coverage simulation. A comparison was made against a sequential Matlab implementation of the FDTD computation on an AMD Athlon 64 X2 Dual Core Processor 4600+ at 2.41 GHz with 3.25 GB of RAM; the sequential computation took 78 minutes, against which the parallel CUDA code on an NVIDIA GeForce 8600M GT gives a speedup of 100 times. This clearly shows the potential of massively parallel GPGPU code and its effect in revolutionizing scientific research programs. On the subject of CUDA's manual caching of memory, [3] presents a simple version of the FDTD kernel in pseudocode; the paper discusses IBM's Cyclops-64 architecture and how to use it for FDTD code. The authors then present the notion of tiling as a memory-caching pattern and its direct effect on the speedup of the computation, since the tiles determine what data is needed in a single iteration. CUDA, however, has no such hardware tiling capability: data is transferred as plain bytes, and no other transfer patterns are available.
The papers [7] [15] discuss the use of FDTD computation to simulate acoustic waves in non-homogeneous media on a GPU. FDTD, as a numerical method for solving partial differential equations, replaces partial derivatives with central differences and discretizes the unbounded domain into a set of finite-difference equations; the numerical stability and low-dispersion criteria force the FDTD computation to be long and iterative. The implementations in the paper use two approaches. In the first, global-memory data is cached in the texture memory (a read-only memory): each kernel call binds the data set to the texture, whose latency is roughly half that of the global memory (about 200-300 cycles). The second approach uses the shared memory to cache the global-memory data; the shared memory, while smaller than the texture memory, provides faster access, comparable to the registers (thread-level storage banks). The results show, not surprisingly, that the shared-memory implementation runs in less than half the time of the texture-memory version, with both implementations faster than using the global memory alone. The results also expose a hard truth about CUDA: it lets scientists and programmers harness massively parallel algorithms, but CUDA programming, although its learning curve is modest, requires knowledge of the underlying hardware, such as the maximum numbers of blocks and threads that can be issued, the sizes of the memory systems, and the available commands. The programmer must also cache memory accesses using the faster shared memory, or performance may fall far short of expectations (sometimes losing an order of magnitude of speedup). When using the CUDA memory hierarchy, one must ensure that data accesses are coalesced (within the same row in memory) and that threads access shared data synchronously, as data corruption is otherwise possible. CUDA C is unfortunately not portable, running only on NVIDIA CUDA-enabled GPUs; OpenCL, on the other hand, provides a cross-platform paradigm that runs on both NVIDIA's and ATI's GPUs. The paper demonstrates that using the shared memory and registers instead of the global memory improves performance and increases effective bandwidth; the literature also proposes a graph-coloring scheme for memory reuse.
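The texture-memory approach can be sketched with the texture-reference API that was current for this generation of hardware (it is deprecated in later CUDA releases); the kernel and variable names here are our own illustrative assumptions:

    /* Read-only data is fetched through the cached texture path. */
    texture<float, 1, cudaReadModeElementType> tex_p;

    __global__ void laplacian_1d(float *out, int n, float c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = c * (tex1Dfetch(tex_p, i - 1)
                        - 2.0f * tex1Dfetch(tex_p, i)
                        + tex1Dfetch(tex_p, i + 1));
    }

    /* Host side: each kernel call is preceded by binding the current
       data set to the texture, as described above:
         cudaBindTexture(0, tex_p, d_p, n * sizeof(float));
         laplacian_1d<<<blocks, threads>>>(d_out, n, c);
         cudaUnbindTexture(tex_p);                              */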
Another criticism of CUDA appears in [8], where an unoptimized FDTD computation on CUDA gives unsatisfactory results: an NVIDIA GeForce 8800 was compared against a CPU implementation on a 32-bit Intel Core 2 Quad Q9300 at 2.5 GHz, and the GPU ran 33% slower than the CPU. This shows that a GPU implementation is not always the better choice; one must consider the parallel potential of the algorithm, the memory hierarchy, and scheduling efficiency to obtain a worthwhile speedup from a GPU. The equations implementing the FDTD are stated mathematically, and the authors argue that multi-threaded C++ code is better than CUDA C. This case should not be generalized, however, as several GPU implementations built without careful optimization still achieve significant speedups (e.g., matrix multiplication); such algorithms carry enormous data parallelism, which is a strong predictor of successful GPU programming.
Optimizing FDTD using a GPU
The FDTD code we shall enhance was used as part of the research in [BARRAK]. The code exists in two implementations: a sequential C++ version and a parallel CUDA C version. A review of the parallel version found that the code did not cache any data; only the global memory was used. The review also concluded that the code's runtimes fall far short of what a tool such as CUDA can deliver. Our attempts to improve the runtime and efficiency of the code will focus on better data caching, using the CUDA threads' local memory and the blocks' shared memory, and on adjusting the sizes and numbers of blocks and threads. The code has host and kernel parts; in the initial parallel CUDA C implementation the variables are constant and block-level variables. The host code makes several kernel calls to compute the FDTD components, and with each call the global memory is copied from the host to the device and then back again.
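The restructuring this suggests is to keep the field arrays resident on the device for the whole time loop and copy only at the start and end. The sketch below, with hypothetical array and kernel names of our own, shows the intended host-side shape:

    #include <cuda_runtime.h>

    extern __global__ void update_e(float *ex, const float *hz, int n);
    extern __global__ void update_h(float *hz, const float *ex, int n);

    void run_fdtd(float *h_ex, float *h_hz, int n, int steps)
    {
        float *d_ex, *d_hz;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_ex, bytes);
        cudaMalloc(&d_hz, bytes);
        cudaMemcpy(d_ex, h_ex, bytes, cudaMemcpyHostToDevice);  /* once, up front */
        cudaMemcpy(d_hz, h_hz, bytes, cudaMemcpyHostToDevice);

        dim3 block(128), grid((n + 127) / 128);
        for (int t = 0; t < steps; ++t) {
            update_e<<<grid, block>>>(d_ex, d_hz, n);  /* fields stay on the  */
            update_h<<<grid, block>>>(d_hz, d_ex, n);  /* device across steps */
        }

        cudaMemcpy(h_ex, d_ex, bytes, cudaMemcpyDeviceToHost);  /* once, at the end */
        cudaMemcpy(h_hz, d_hz, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_ex);
        cudaFree(d_hz);
    }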
The code we shall optimize is found in [BARRAK]; we will work on the parallel CUDA C version. The parallel code is an FDTD computation that uses the Convolutional Perfectly Matched Layer (CPML) [10]. In this FDTD computation we separate the kernel threads into those that fall in the Absorbing Boundary Condition (ABC) region and those in the computational domain, using multiple, separate calls to the kernel. The idea is to eliminate the divergence that would occur if the kernel calls were not separated: evaluating each thread's index with an if statement that branches into two different paths hurts warp scheduling, because when the threads of a warp follow different execution paths, the number of concurrently scheduled threads shrinks. Another lever for optimizing the code is the block size and the number of threads: the original parallel code used 2916 blocks of 128 threads each. The original runtime of the parallel code is XX4XX seconds.
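A minimal sketch of this kernel separation, with simplified 1D indexing and placeholder updates of our own invention, contrasts the divergent single-kernel form with the split launches used here:

    /* Single kernel: warps straddling the ABC/interior boundary execute
       both branches serially (warp divergence). */
    __global__ void update_all(float *f, int n, int w)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i < w || i >= n - w)
            f[i] *= 0.9f;   /* placeholder ABC update */
        else
            f[i] += 0.1f;   /* placeholder interior update */
    }

    /* Split version: each launch covers a homogeneous region, so every
       warp follows a single execution path:
         update_interior<<<grid_in,  block>>>(d_f, n, w);
         update_abc     <<<grid_abc, block>>>(d_f, n, w);   */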
In our attempts to come up with an optimized version of the FDTD on a CUDA GPU, we shall use the following:
XXX
Results and findings
XXX
Concluding remarks
References
[1] K. S. Yee, "Numerical Solution of Initial Boundary Value Problems Involving Maxwell's Equations in Isotropic Media," IEEE Transactions on Antennas and Propagation, vol. 14, pp. 302-307, May 1966.
[2] D. Donno, A. Esposito, L. Tarricone, and L. Catarinucci, "Introduction to GPU Computing and CUDA Programming: A Case Study on FDTD," University of Salento, IEEE Antennas and Propagation Magazine, Vol. 52, No. 3, June 2010.
[3] D. Orozco and G. Gao, "Mapping the FDTD Application to Many-Core Chip Architecture," CAPSL at the University of Delaware, CAPSL Technical Memo 087, 2009.
[4] D. B. Kirk and W. W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann, 2010, pp. 1-120.
[5] J. Nickolls and W. J. Dally, "The GPU Computing Era," NVIDIA, IEEE Computer Society, 2010.
[6] D. Brandao, M. Zamith, E. Clua, A. Montenegro, A. Bulcao, D. Madeira, M. Kischinhevsky, and R. C. P. Leal-Toledo, "Performance Evaluation of Optimized Implementations of Finite Difference Method for Wave Propagation Problems on GPU Architecture," 22nd International Symposium on Computer Architecture and High Performance Computing Workshops, IEEE Computer Society, pp. 7-12, 2010.
[7] P. Sypek, A. Dziekonski, and M. Mrozowski, "How To Render FDTD Computations More Effective Using a Graphics Accelerator," IEEE Transactions on Magnetics, vol. 45, no. 3, pp. 1324-1327, March 2009.
[8] T. M. Millington and N. J. Cassidy, "Optimizing GPR Modeling: A Practical, Multi-Threaded Approach to 3D FDTD Numerical Modeling," Proceedings of the 12th International Conference on Ground Penetrating Radar, pp. 1-7, 2008.
[9] M. F. Hadi, "A Finite Volumes-Based 3-D Low Dispersion FDTD Algorithm," IEEE Transactions on Antennas and Propagation, Vol. 55, No. 8, pp. 2287-2293, August 2007.
[10] A. Valcarce and J. Zhang, "Implementing a 2D FDTD scheme with CPML on a GPU using CUDA," The 26th International Review of Progress in Applied Computational Electromagnetic, Tampere, Finland, April 2010.
[11] D. B. Kirk and W. W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann, 2010, pp. 1-120.
[12] NVIDIA CUDA Zone, http://www.nvidia.com/object/cuda_home.html.
[13] V. Demir and A. Z. Elsherbeni, "Compute Unified Device Architecture (CUDA) Based Finite Difference Time-Domain (FDTD) Implementation," ACES Journal, vol. 25, No. 4, April, 2010.
[14] D. Liuge, L. Kang, and K. Fanmin, "Parallel 3D Finite Difference Time Domain Simulation on Graphics Processors with CUDA," International Conference on Computational Intelligence and Software Engineering (CiSE), IEEE, December 2009.
[15] M. Moazeni, A. Bui and M. Sarrafzadeh, "A Memory Optimization Technique for Software-Managed Scratchpad Memory in GPU," IEEE 7th Symposium of Application Specific Processors, pp. 43-49, July, 2009.
[16] NVIDIA, "Getting_Started_Windows.pdf," http://www.nvidia.com/object/cuda_devlop.html, NVIDIA Corporation, 2010.
[17] NVIDIA, "CUDA_C_Programming_Guide.pdf," http://www.nvidia.com/object/cuda_develop.html, NVIDIA Corporation, 2010.
[18] NVIDIA, "Compute_Visual_Profiler_User_Guide.pdf," http://www.nvidia.com/object/cuda_devlop.html, NVIDIA Corporation, 2010.