Introduction to Parallel Computing in MATLAB


Software frameworks such as COMSOL Multiphysics, MATLAB, Maple and Mathematica enable scientists and engineers to quickly build models, perform simulations and solve common problems through the use of high-level technical computing languages. These high-level languages provide abstractions for simple and common computational tasks such as matrix multiplication, complex number support, Fourier transforms and the solving of differential equations. The obvious benefit of using these frameworks is that they reduce the time spent developing solutions to these problems compared to solutions developed using conventional low-level programming languages such as C/C++ and FORTRAN.

Traditionally, applications developed using these frameworks have been designed for serial execution due to the predominant use of numerical computing libraries. An application solving a particular problem is executed as a serial stream of instructions on a single processor computer, and each instruction is executed one at a time. In the modern world, these problems have become more complex and have demanded more computational power and memory, resulting in longer application runtimes. In the past, application runtime performance could be improved by increasing the frequency or clock speed of the processor used to execute the application [J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. 3rd edition, Morgan Kaufmann, p. 43. 2002].

However, a processor's clock speed cannot be increased indefinitely due to physical and thermal constraints. Instead of scaling the frequency of their processors, manufacturers are packing more computational cores onto a single chip [G. Koch. Discovering multi-core: extending the benefits of Moore's law. Technology@Intel Magazine. Intel. 2005]. Unfortunately, adding more computational cores to a single chip is not synonymous with increasing the clock frequency of the chip to deliver more computational power. There is a caveat to fully utilizing and taking advantage of these additional cores: they have to be supported by the software. This gives rise to the era of parallel computing.

Parallel computing is a computing paradigm where many computations are performed concurrently, or "in parallel". This is achieved by breaking down a large and complex problem into smaller units, where each unit is executed concurrently on a parallel computer. A parallel computer may consist of a single machine with multi-core or multiple processors, or of multiple machines connected through an interconnect such as Ethernet in a compute cluster or grid. Parallel computing has been utilized since the 1960s, mainly in high performance computing with supercomputers, but due to the proliferation of multi-core processors in the consumer market in recent years, interest in it has grown tremendously.

Parallel computers can be further categorized based on their hardware support for parallelism and processor configuration:

Multi-core Processor

A multi-core processor contains multiple computational cores on a single physical chip. Each computational core is able to execute multiple instructions per clock cycle. Examples: AMD Athlon 64 X2 Family, Intel Core 2 Duo Family, Intel Core 2 Quad Family

Symmetric Multi-processor (SMP)

An SMP computer contains multiple identical processors that are interconnected via a bus and share main memory.

Examples: Intel Itanium 2, Sun Microsystems UltraSPARC, SGI MIPS64

Massively Parallel Processor (MPP)

An MPP is a computer with multiple processors interconnected via a specialized network. MPPs typically have more than 100 processors.

Examples: IBM Blue Gene/L

Cluster computing

A cluster consists of multiple standalone computers interconnected via a commodity network, such as Ethernet.

Examples: Beowulf cluster built from commodity off-the-shelf hardware connected via TCP/IP Local Area Network (LAN)

Grid computing

A grid consists of multiple standalone computers connected via the Internet, each working on a small unit of the problem. Computers in a grid often perform computation only during times when they would otherwise be idle.

Examples: SETI@home, Folding@Home, Berkeley Open Infrastructure for Network Computing (BOINC)

Although parallel computing may improve an application's runtime performance, developing parallel applications can be challenging, as parallel computing introduces several issues, notably communication and synchronization. These issues have to be addressed carefully by the programmer to avoid suboptimal runtime performance in parallel applications.

There are several parallel programming models which can be used to program parallel computers. A parallel programming model is an abstraction of the hardware and memory architectures of the parallel computer and each model assumes a different memory architecture [H. Shan and J. Pal Singh. Comparison of Three Programming Models for Adaptive Applications on the Origin 2000. Journal of Parallel and Distributed Computing, 62:241-266, 2002]:

Shared Memory

In the shared memory model, parallel applications share a common main memory space, which they can read and write asynchronously during execution. Applications can access data stored by another application through the use of global memory addresses.

Examples: OpenMP

Distributed Memory

In the distributed memory model, each parallel application uses its local memory during execution. Applications access data stored by another application through message passing.

Examples: Message Passing Interface (MPI)

MPI has since become the de facto standard for message passing in distributed memory parallel programming, and most compute clusters provide at least one MPI implementation.

1.2 Motivation

MATLAB is one of the most widely used development environments and high-level technical computing languages, used in various fields such as signal and image processing, control systems, financial modelling and analysis, and communications. MATLAB has an interactive environment with excellent graphical tools and built-in mathematical functions, making it a suitable tool for performing computationally intensive tasks such as numeric computation, data analysis and data visualization. Since its introduction in the late 1970s, MATLAB has gained popularity [Richard Goering, "Matlab edges closer to electronic design automation world," EE Times, 10/04/2004] and has been used extensively due to its design emphasis on programmability rather than performance. The key advantage of MATLAB is that it offers many specialized algorithms through add-ons, called toolboxes. These toolboxes and the use of a high-level language enable MATLAB users to build models and test designs and prototypes more quickly than with low-level programming languages such as C/C++ and FORTRAN. MATLAB also provides interfaces and functions for integration with external languages such as C/C++ and Java.

Although flexible, the MATLAB language is an interpreted language: MATLAB applications are compiled at runtime and then run, a technology known as just-in-time (JIT) compilation. This design ensures programmability, user friendliness and multi-platform support for MATLAB applications, but at a cost in runtime performance. Despite having slower runtime performance than conventional low-level programming languages, the high-level abstractions provided by the MATLAB language and toolboxes provide a framework for solving computationally intensive engineering problems, thereby reducing the development time required to produce solutions to these problems. Furthermore, complex models can be built using smaller MATLAB functions or applications as building blocks. Because the overall time required to produce a solution is the sum of the development time and the runtime of the application, MATLAB applications generally have a lower overall time to solution than applications developed using low-level programming languages.

However, MATLAB applications have always been serially executed. MATLAB does not utilize parallel computation resources such as multi-processor machines, multi-core processors or compute clusters, apart from implicit multi-threading for a limited number of functions [1]. Many interested parties have attempted to create a parallel version of MATLAB by modifying and extending the existing version of MATLAB to enable parallel computing. An extensive survey by Choy and Edelman [R. Choy and A. Edelman. Parallel MATLAB: Doing it Right. Proceedings of the IEEE. 93(2), 331-341, 2005.] has identified as many as 27 parallel MATLAB projects originating from both research and commercial organizations. These projects employ various approaches to implementing a parallel version of MATLAB: embarrassingly parallel, message passing, backend support and MATLAB compilers.

As a result of the ever growing demand and interest for parallel computing with MATLAB, The MathWorks finally released commercial versions of parallel MATLAB: the Parallel Computing Toolbox and the MATLAB Distributed Computing Server. These products extend the MATLAB language to enable the use of parallel constructs such as parfor (parallel for loop), spmd (single program multiple data), distributed/codistributed arrays and message passing functions. These parallel constructs allow MATLAB users to re-use their existing MATLAB applications with minimal modifications in the parallel environment, without having to program in low-level languages such as MPI or worry about system specific architecture, threading and synchronization. This design ensures that MATLAB users are always able to create MATLAB applications that are correct, scalable, flexible, easy to debug and maintain, and cross-platform.

MATLAB performs parallel computation by distributing work, either many loop iterations or a large data set, to a resource set called a matlabpool. A matlabpool consists of workers or labs, which are individual processes of the MATLAB engine running either locally on a multi-processor or multi-core machine or on a remote compute cluster. Within a matlabpool, a user can perform parallel computation on some or all of the participating workers, but as each matlabpool is isolated, a user cannot use more workers than were initially allocated when creating the matlabpool. Furthermore, on a remote compute cluster, a matlabpool may consist of workers from the same node or from multiple nodes. A matlabpool thus exhibits a flat structure: worker locality or remoteness is not distinguished within a matlabpool.

When a matlabpool is allocated, each worker is bound into an MPI ring by MATLAB. When using parallel constructs in MATLAB, MATLAB automates the process of setting up the execution environment, code and data transfer and gathering of results from the workers in the matlabpool through low-level MPI function calls. Although there are advantages in providing these high-level abstractions to the lower level MPI function calls, this approach, however, has a number of drawbacks:

A matlabpool may consist of workers from the same node (on a local machine) or from multiple nodes (on a remote compute cluster). Nodes in a compute cluster may have multiple processors or multi-core processors. In the scenario where the workers in a matlabpool span multiple nodes, workers allocated from the same node share the available memory of that node. This can have an adverse effect on the performance of memory intensive computations, such as those using Fast Fourier Transform (FFT) kernels. In the current implementation of parallel MATLAB with the Parallel Computing Toolbox and the MATLAB Distributed Computing Server, it is not possible to dynamically assign workers at runtime due to a global policy which views every worker as equivalent.

When allocating a matlabpool, MATLAB binds each worker process in the matlabpool into an MPI ring. This approach prohibits a hierarchical structure of parallelism where workers in a matlabpool can be further organized into groups based on their locality or other properties. The notion of groups is a hierarchical structure of parallelism in which parallelism may exist both within and beyond the group level in a resource set such as a matlabpool. The flexibility provided by group computation, or a hierarchical structure of parallelism, may be exploited in certain classes of parallel computing problems.

These two drawbacks contribute to the limitations of the current implementation of parallel MATLAB using Parallel Computing Toolbox and MATLAB Distributed Computing Server: difficulty in supporting nested parallelism (parfor within a parfor) and group computing using a hierarchical structure of parallelism.

1.3 Problem Statement

While previous works revolve around the design, implementation and performance of parallel MATLAB, this thesis focuses on the design and implementation of group computing in MATLAB. The novelty of our implementation lies in our approach, design and implementation of group computing in MATLAB using existing parallel constructs provided by the Parallel Computing Toolbox and MATLAB Distributed Computing Server. To the best of our knowledge, there have been no previous attempts to implement group computing in MATLAB.

As an illustrative example of group computing, consider a parameter study in which each parameter set itself requires a parallel computation: the parameter sets can be distributed across groups of labs, while the labs within each group cooperate on a single parallel computation. Such two-level parallelism cannot be expressed with the flat structure of a matlabpool.

This thesis describes how the current implementation of parallel MATLAB using the Parallel Computing Toolbox and MATLAB Distributed Computing Server is augmented to create a framework for group computing in MATLAB.

Finally, this thesis investigates and compares the performance of group computing and parallel computing in MATLAB.

1.4 Thesis Outline

The remainder of this thesis is organized into a further five chapters.

The second chapter provides the background on parallel computing in MATLAB.

The third chapter describes the design and approach of our group computing framework.

The fourth chapter describes the implementation.

The fifth chapter presents benchmark results and performance data.

The sixth chapter discusses the results presented in the previous chapter, future work and the conclusion.

Chapter 2

Background

2.1 Introduction to Parallel Computing in MATLAB

Although the user community was rather successful in developing and building libraries and utilities for parallel computing with MATLAB, the differing approaches and implementations of a parallel version of MATLAB present a major drawback: existing MATLAB applications cannot be reused, and parallel applications developed for use with a particular library or utility are not compatible with one another due to limitations or reduced language support in each particular library or utility.

For example, parallel MATLAB applications developed for use with pMATLAB [N. Travinin and J. Kepner. pMatlab parallel MATLAB library. International Journal of High Performance Computing Applications. 21(3), 336-359. 2007.] are not compatible with MultiMATLAB [A. Trefethen, V. Menon, C. Chang, G. Czajkowski, C. Myers and L. Trefethen. Multimatlab: MATLAB on Multiple Processors. Technical Report, UMI Order Number: TR96-1586. Cornell University. 1996.] because they do not use the same parallel constructs and parallel data structures.

The only way to create a parallel MATLAB which would retain the strengths of MATLAB in terms of multi-platform support, interactivity and programmability is to extend the MATLAB language and software. This implementation would deliver a parallel MATLAB that would be as user friendly and as programmable as the MATLAB language, and with parallel constructs and functions that would abstract away the work of setting up the environment, underlying infrastructure and resource allocation.

MATLAB offers two forms of parallelism: implicit and explicit. Implicit parallelism involves multi-threading computations on a single multi-core or multi-processor machine. Usually, optimal performance is achieved when the number of threads is equal to the number of computational cores. The optimal number of threads for a parallel computation can also be determined experimentally. Explicit parallelism in MATLAB involves the use of parallel constructs and functions (examples of which will be discussed in detail later in this chapter) provided by the Parallel Computing Toolbox and MATLAB Distributed Computing Server.

The advantage of explicit parallelism is that it gives the programmer absolute control over the parallel execution and allows very efficient parallel applications to be developed. However, while developing parallel applications using explicit parallelism, the programmer also has to deal with synchronization, data management and task division. This could introduce runtime overheads and would require more time and effort from the programmer to develop the parallel code.

This thesis does not explore the implicit parallelism paradigm in MATLAB; it concentrates on explicit parallelism instead.

2.2 Parallel Computing Toolbox and MATLAB Distributed Computing Server

In November 2004, The MathWorks released the Parallel Computing Toolbox and MATLAB Distributed Computing Server. These two components are extensions of the original MATLAB language and software which enable parallel computing with MATLAB on multi-core and multi-processor machines, or on a remote compute cluster. The Parallel Computing Toolbox provides parallel constructs and functions such as parallel for loops (parfor), single program multiple data (spmd), distributed arrays and message passing functions while the MATLAB Distributed Computing Server contains the MATLAB computational engine executing the parallel application expressed using the MATLAB language, parallel constructs and message passing functions.

The MATLAB computational engine processes running either locally on a multi-core or multi-processor machine or remotely on a compute cluster are called workers. Workers are independent processes that do not require inter-process communication and thus do not support message passing functions and distributed arrays. In contrast, workers that require the use of message passing functions and distributed arrays are called labs. In this thesis, we use the terms worker and lab as appropriate to denote whether inter-process communication is required.

Figure 1: Parallel MATLAB architecture [http://www.mathworks.co.uk/products/parallel-computing/]

The Parallel Computing Toolbox enables parallel applications to be executed on up to eight workers or labs on a single multi-core or multi-processor machine. When the Parallel Computing Toolbox is used together with the MATLAB Distributed Computing Server on a remote compute cluster, the same parallel application can be executed on an arbitrary number of workers or labs.

The Parallel Computing Toolbox and the MATLAB Distributed Computing Server are separable into two components: language and infrastructure. The language component provides the high-level abstractions and interface to the parallel constructs and functions. The infrastructure component backs the language component: it provides the automated mechanisms to set up the parallel execution environment and to transfer application code and data to the workers or labs as necessary. These two components work collaboratively to ensure that the parallel MATLAB environment retains the programmability, scalability and flexibility of MATLAB. Programmability means that newly developed parallel applications use the same familiar MATLAB language as existing MATLAB applications and are easy to debug and maintain. Scalability means that parallel applications execute correctly on an arbitrary number of workers or labs, whether in parallel or serially. These are the key design goals of the Parallel Computing Toolbox and the MATLAB Distributed Computing Server.

The parallel functions and constructs which make parallel computing possible in MATLAB are described in the subsequent sections.

2.3 matlabpool

The prerequisite for performing parallel computations in MATLAB with the Parallel Computing Toolbox and the MATLAB Distributed Computing Server is a parallel computing resource. A matlabpool is an example of such a parallel computing resource. A matlabpool consists of, as its name suggests, a pool of MATLAB computational engine processes bound in a Message Passing Interface (MPI) ring with a control communication connection back to the client MATLAB process. These MATLAB computational engine processes can run either locally on a multi-core or multi-processor machine or remotely on a compute cluster.

The main purpose of having the concept of a parallel computing resource such as a matlabpool is to enable interleaving of serial and parallel code in a MATLAB application. In addition, it allows the same application to run correctly in the absence of any parallel computing resources. Whenever MATLAB encounters a parallel construct such as parallel for loops (parfor) or single program multiple data (spmd) during execution, it automatically and transparently distributes code and data to the workers or labs in the matlabpool for execution. If there is no matlabpool available, MATLAB runs the code within the parallel construct on the client MATLAB process instead.
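As a minimal sketch of this interleaving (a hypothetical fragment, not taken from the thesis), the following code mixes serial and parallel sections; with an active matlabpool the parfor iterations run on the workers, and without one the loop degrades gracefully to serial execution on the client:

% Serial code runs on the MATLAB client
A = rand(1000, 10);
s = zeros(1, 10);
parfor i = 1 : 10
    % With an active matlabpool these iterations run on the workers
    s(i) = sum(A(:, i));
end
% Serial post-processing resumes on the client
total = sum(s);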

When a matlabpool is created, MATLAB starts a parallel job through a scheduler interface which, in turn, starts the required number of workers or labs either locally or remotely. These workers or labs connect back to the client MATLAB process via socket connections to establish the control channel which coordinates the parallel execution. The process of setting up the socket connections between the client MATLAB process and the workers or labs happens through the JAVA layer.

(Diagram of the architecture to follow)

Only one matlabpool may be created (and therefore remain active) in a client MATLAB process at any given time, although in a remote compute cluster environment with many nodes there may be multiple matlabpools created by different MATLAB client processes on the network. Each matlabpool is isolated, and no communication or data transfer may occur between matlabpools.

In MATLAB, a matlabpool is created, manipulated and destroyed through the MATLAB function matlabpool, which has an interface to the underlying JAVA layer to set up the matlabpool infrastructure.
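For illustration, a typical matlabpool lifecycle might look like the following minimal sketch, using the command-style syntax of this generation of the Parallel Computing Toolbox:

matlabpool open local 4    % start four workers using the local scheduler
n = matlabpool('size');    % query the number of workers in the pool
% parallel constructs such as parfor and spmd now use these workers
matlabpool close           % shut down the pool and release the workers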

2.4 Parallel for Loops (parfor)

The parallel for loop (parfor) is one of the parallel constructs provided by the Parallel Computing Toolbox for parallel computing in MATLAB. Parfor loops are essentially the same as conventional for loops except for the way they execute each iteration of the loop, that is, the body of the loop. Parfor loops can be used in place of regular for loops to speed up computation in loops having either many small iterations or several long iterations. Instead of executing each iteration of the loop one at a time on the client, parfor distributes portions of the iterations to the workers in the matlabpool for execution to achieve better runtime performance. When a parfor loop is executed, some iterations are executed on the client MATLAB process while the remaining iterations are executed in parallel on each of the workers. If there are as many workers as iterations in the parfor loop, each worker executes exactly one iteration of the loop; if there are more iterations than available workers, some workers execute more than one iteration of the loop. If there is no matlabpool or no available workers, a parfor loop behaves as a regular for loop. Parfor exhibits task parallelism.

However, when using parfor loops, it is required that each iteration of the loop be independent of the others. This requirement means that MATLAB must be able to execute the iterations of the loop in any order and yet produce correct results. When a parfor loop is executed, the support infrastructure automatically handles the transmission of application code and data and the gathering of computation results from the workers on the fly. This also ensures that a single and coherent workspace exists throughout the execution of the parfor loop.

For example, the following for loop which is used to compute the first ten Fibonacci numbers cannot be replaced with a parfor loop:

fibonacci(1) = 0;
fibonacci(2) = 1;
for i = 3 : 10
    fibonacci(i) = fibonacci(i-2) + fibonacci(i-1);
end

This is because in every iteration of the for loop, the values of fibonacci from the previous two iterations are required to calculate the current value. In a parfor loop, the iteration ranges would be distributed across the workers, and it would not be possible to determine the previous two values of fibonacci because the exact execution sequence of the loop is determined dynamically at runtime.

In summary, when a parfor loop is executed by MATLAB (assuming that there is an active matlabpool), MATLAB performs the following tasks:

Analysis of the parfor loop

The body of the loop is checked to ensure that it does not contain constructs or statements that prevent deterministic behaviour, such as break and return clauses. In addition, it is a requirement that the loop range be confined to monotonically increasing integers.

Transmission of application code and data to workers

Persistent communication channels are established via socket connections from the master (typically the client MATLAB process) to the workers. Packages containing application code and data are serialized and then transmitted over these channels to the workers.

Execution of parfor loop on workers

On the workers, the packages received from the master are deserialized and then executed.

Collection of computation results from workers

Computation results are returned to the master where they are assembled.

In MATLAB, iteration distribution of the parfor loop is performed implicitly through the parallel_function function.

The following example demonstrates the concept and execution of a parfor loop. A parfor loop used to calculate the largest eigenvalue of each of 100 random 1000x1000 matrices drawn from a standard normal distribution is shown below:

N = 1000;
parfor i = 1 : 100
    result(i) = max(eig(randn(N)));
end

In this example, assuming that the matlabpool consists of four workers, each worker would be allocated 100 / 4 = 25 iterations. For example, worker 1 would compute i = 1:25, worker 2 would compute i = 26:50, worker 3 would compute i = 51:75 and finally, worker 4 would compute i = 76:100.

2.5 Single program multiple data (spmd)

The single program multiple data (spmd) is another parallel construct provided by the Parallel Computing Toolbox. In contrast with the parallel for loop which distributes its iteration ranges to multiple workers, the spmd construct allows a block of code to run concurrently on all labs in the matlabpool. The "single program" notion of spmd means that the same code or application runs concurrently on all labs. On the other hand, the "multiple data" notion means that the "single program" is capable of using different data sets as inputs.

Spmd is the most common style of parallel programming with MATLAB because it permits communication and synchronization between the labs. Spmd is typically used to speed up applications that perform the same computation on multiple data sets or applications that perform computations on a large data set. Because the spmd block will be executed on every lab in the matlabpool, the use of spmd as a parallel construct requires a matlabpool with available labs. If there is no matlabpool or available labs, the spmd block is executed on the client MATLAB process instead.

In general, the spmd construct is more flexible than the parallel for loop in the sense that its usage is not confined to replacing and parallelizing regular for loops. The spmd construct allows arbitrary code to be executed on each lab. Each lab in the matlabpool is differentiated by a unique identifier, which in MATLAB is a positive integer. This unique identifier is known as labindex and is analogous to the rank returned by the MPI function MPI_Comm_rank. The total number of labs in the matlabpool is accessible through the numlabs function, which is analogous to MPI_Comm_size in MPI.

The following example demonstrates the use of a spmd block to perform different computations on various labs using the labindex function:

spmd
    if labindex == 1
        % Compute on lab 1
        N = 1000;
        A = max(eig(randn(N)));
    elseif labindex == 2
        % Compute on lab 2
        N = 1000;
        B = max(svd(randn(N)));
    end
end

The entire spmd block would be executed on every lab in the matlabpool. However, by using if...elseif statements together with labindex, the flow of the spmd block can be controlled such that particular statements are executed only on a particular lab. In this example, the lab with labindex = 1 computes the largest eigenvalue while the lab with labindex = 2 computes the largest singular value of a random 1000x1000 matrix.

Similar to the parfor construct, spmd constructs also support workspace transparency. This means that variables defined or assigned inside spmd blocks can be accessed directly from the MATLAB client process. The values of the variables are actually stored on the labs, and the MATLAB client process accesses these values by means of a reference to a Composite object. The Composite is a feature of the spmd language that presents an abstraction for storing and retrieving values of variables on the labs. It is an implementation of the Partitioned Global Address Space model, in which processes share their address space [B. Carlson, T. El-Ghazawi, R. Numrich and K. Yelick. Programming in the Partitioned Global Address Space Model. Tutorial at Supercomputing 2003. November 2003.]. In this model, the shared address space or memory is partitioned such that each process owns a localized portion of the space. Computational operations performed on Composites are performed on the lab that owns the data. [This is also the concept and implementation behind distributed arrays in MATLAB.]
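As a brief illustration of Composites (a minimal sketch, not taken from the thesis), a value assigned inside a spmd block remains on the labs and is retrieved on the client by indexing the Composite:

spmd
    x = labindex^2;    % the value of x is stored on each lab
end
% On the client, x is now a Composite; indexing it fetches a lab's value
x{1}    % returns 1, retrieved from lab 1
x{2}    % returns 4, retrieved from lab 2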

When MATLAB encounters a spmd block, MATLAB performs the following tasks:

Parsing of spmd block

The spmd language component parses and analyses the spmd block and prepares the infrastructure for execution. This is performed by the spmd_feval function which is the basis of spmd block execution.

Selection of resource set for spmd block execution

The execution of a spmd block requires a resource set. A resource set is a gateway to the labs in the matlabpool: it contains the necessary MPI communicator handles and process handles to the labs. The resource set can comprise either all labs or a subset of them. When Composite objects are created, they are assigned to a particular resource set. This enables the Composite to track the labs which own and store the data it references. Hence, when Composites are used in a spmd block, the spmd block must be executed on the same resource set. If the resource set cannot be located, MATLAB throws an error.

Execution of spmd block on labs

At this stage, the resource set creates a spmd controller to manage the execution of the spmd block on the labs. The spmd controller interfaces with the underlying JAVA layer to transmit the spmd block and data to the labs for execution.

2.6 Message passing functions

MPI is the dominant model for programming parallel computers that adopt the distributed memory model. MPI enables communication and synchronization between processes in a parallel application. The message passing functions provided by the Parallel Computing Toolbox aim to provide similar features and functionality as the various MPI implementations. However, the message passing functions in MATLAB are not an implementation of the MPI specification in the MATLAB language itself, as opposed to older parallel MATLAB projects such as MatlabMPI [J. Kepner. MatlabMPI. Journal of Parallel and Distributed Computing. 64(8), 997-1005. 2004.]. Instead, the message passing functions in MATLAB are high-level abstractions of low-level MPI functions implemented using MPICH2, developed by Argonne National Laboratory.

These high-level abstractions provide the mechanisms to automatically initialize the MPI environment, create the MPI communicators and finalize the MPI environment, and they allow the message passing functions in MATLAB to send and receive arbitrary MATLAB data types. Additional features such as error detection and deadlock detection are also built into these abstractions.
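For example, the point-to-point functions labSend and labReceive and the synchronization function labBarrier can be used inside a spmd block. A minimal sketch:

spmd
    if labindex == 1
        labSend(rand(4), 2);     % send a 4x4 matrix to the lab with labindex 2
    elseif labindex == 2
        data = labReceive(1);    % blocking receive from the lab with labindex 1
    end
    labBarrier;                  % synchronize all labs before continuing
end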

Chapter 3

Approach

This chapter describes our approach and explains our design decisions throughout the process of developing a framework for group computing in MATLAB. It also explores various designs for implementing group computing in MATLAB, comparing the advantages and disadvantages of each design and the challenges faced.

3.1 Design goals

Our goal was to implement group computing in MATLAB by augmenting the current capability of MATLAB. We aim to develop a framework that is easy to use, user friendly and as programmable and flexible as the MATLAB language. Our design goals include:

Users should be able to use familiar MATLAB language and constructs in developing and executing group MATLAB applications.

Users need not manually perform low-level tasks such as configuring the MATLAB environment for group computing.

3.2 Parallel for Loops (parfor)

There are many restrictions and limitations on the usage of the parfor construct. Firstly, the behaviour of parfor is fixed: it is not possible to control which computation is executed on which worker in the matlabpool, since parfor dynamically assigns iteration ranges to the available workers and executes them there.

Secondly, the parfor construct employs the concept of a single workspace. This means that the data distribution of variables used in the loop body is pre-determined. Data represented by these variables is transmitted to each of the workers and back to the MATLAB client process in each iteration of the loop and is stored on the MATLAB client process rather than on the workers.

Thirdly, message passing functions are not supported when using the parfor construct. Workers in the matlabpool would have no means of communicating, either by point-to-point or broadcast operations. Therefore, it would not be possible to extend the message passing functionality in MATLAB to include communication between groups. In addition, the lack of support for message passing functions also prohibits the use of high-level abstractions, such as distributed arrays, that rely on message passing functions.

For these reasons, we ruled out the possibility of using the parfor construct to realize our goals of group computing in MATLAB.

3.3 Parallel Jobs

Another way to execute parallel applications in MATLAB is to create a parallel job. A parallel job is a single task that is executed concurrently on several labs either on a local machine or remote compute cluster. These labs execute the task on multiple data sets or on a portion of a large data set and can utilize message passing functions for communication. The execution of parallel jobs requires a scheduler.

A scheduler (or the MathWorks Job Manager) manages the execution of parallel jobs and tasks on the available labs on a local machine or a remote cluster. Each lab executes the task from a running job assigned by the scheduler and returns the result to the scheduler. If more labs are available, the scheduler assigns and runs the next queued job on these labs. Like a matlabpool, a scheduler is an interface to a parallel computing resource. In fact, when a matlabpool is allocated, a matlabpooljob is created and submitted to the local or remote scheduler. A matlabpool can therefore be viewed as an abstraction that allows the execution of parallel constructs without the need to explicitly define jobs.

[Figure: Execution of a parallel job. The MATLAB client submits a parallel job to the scheduler or MathWorks Job Manager, which distributes its tasks to the labs; each lab returns its result to the scheduler, which passes the collected results back to the client.]

The features of parallel jobs in MATLAB can be leveraged to support group computing. MATLAB supports running multiple parallel jobs concurrently from a single MATLAB client process, as long as there are enough labs available. The number of labs working on a parallel job can be adjusted by setting the MinimumNumberOfWorkers and MaximumNumberOfWorkers properties on the parallel job object as appropriate when defining and creating a parallel job. This allows each parallel job to be executed on a different number of labs.
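Defining and submitting such a parallel job might look like the following sketch, using the scheduler interface of this generation of the toolbox (the task function @labindex is an arbitrary choice for illustration):

sched = findResource('scheduler', 'type', 'local');    % locate a scheduler
pjob = createParallelJob(sched);                       % define a parallel job
set(pjob, 'MinimumNumberOfWorkers', 2);                % lower bound on the lab count
set(pjob, 'MaximumNumberOfWorkers', 4);                % upper bound on the lab count
createTask(pjob, @labindex, 1, {});                    % one task, executed on every lab
submit(pjob);                                          % hand the job to the scheduler
waitForState(pjob, 'finished');                        % block until the job completes
results = getAllOutputArguments(pjob);                 % collect the labs' results
destroy(pjob);                                         % clean up the job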

In fact, a parallel job can be viewed as a group entity and the labs executing it as its members.

[Figure: Group computation using parallel jobs. The MATLAB client submits group computations through the scheduler or MathWorks Job Manager to two parallel jobs, group 1 and group 2, each consisting of several labs, and collects the results.]

Although this approach satisfies our notion of groups (Section 1.3), it has a major drawback. With parallel jobs, each job is isolated from the others. As a consequence, the labs are unable to communicate beyond the job, or group, boundary. For example, the labs in group 1 cannot communicate with the labs in group 2. Again, this prohibits the development of additional message passing functions in MATLAB to support communication between groups.

Communication between groups is vital because many hierarchical algorithms require the intermediate results computed within each group to be exchanged or combined across groups; without it, group computation would be restricted to entirely independent subproblems.

3.4 Single program multiple data (spmd)

The spmd construct features MPI style programming, allowing parallel applications similar to those programmed using MPI to be developed and executed in MATLAB. The spmd construct is therefore the most flexible: it gives the user full control over the execution of code on each of the labs and allows communication between labs.

Due to its flexibility and programmability, we chose to use the spmd construct as a starting point to implement group computing in MATLAB. Having decided to use the spmd construct as a basis of our implementation of group computing, we explored two implementations.

In our first implementation, we partitioned the matlabpool into several groups, with each group consisting of several labs. Our goal was to execute multiple spmd blocks concurrently, each executed on a group of labs. In the current implementation, when a spmd block is encountered, MATLAB reserves and allocates the labs from the matlabpool, creates a resource set, executes the spmd block on the resource set and de-allocates the resource set when the spmd block has finished executing. In the case where the resource set comprises all labs in the matlabpool, it is called the "world" resource set.

In the process of creating the resource set, the spmd block executor creates an MPI communicator which assigns each lab in the resource set a unique identifier (labindex) and orders the labs in a topology. This communicator is used to co-ordinate the execution of the spmd block on the labs in the resource set and forms the MPI infrastructure supporting the message passing functions.

In order to partition the matlabpool and introduce the notion of groups, we started by building the "world" resource set. Then, we created multiple resource sets by splitting the "world" resource set. This was achieved by modifying the undocumented class RemoteResourceSet in the spmdlang package. In order to execute a spmd block, the resource set builds a spmd controller which interfaces with the underlying JAVA layer to manage the spmd block execution on the labs in the resource set. Again, this was achieved by modifying the undocumented class RemoteSpmdExecutor in the same package.

Once we had created a resource set, we used the undocumented function spmd_feval to execute a spmd block on that resource set. Although we succeeded in partitioning and manipulating the matlabpool to build the "world" resource set and subsequently splitting it into multiple resource sets, we were unable to execute multiple spmd blocks concurrently. After a thorough study and analysis of the spmdlang package, we determined that the problem lies in the underlying JAVA layer of the spmd controller. It is in this layer that all the actual work of transmitting the code in the spmd block to the labs, executing the spmd block on the labs, and collecting results from the labs is done. At present, MATLAB supports only one JAVA spmd controller at any time. Once a JAVA spmd controller has been instantiated, any further attempt to create another JAVA spmd controller results in MATLAB throwing an error. Because the underlying JAVA layer is closed source, we were unable to modify the layer to accommodate multiple JAVA spmd controllers.

In our second implementation, instead of using multiple spmd constructs and MPI communicators, we resorted to using a single spmd block to logically organize the labs in the matlabpool to introduce the notion of groups and a hierarchical structure of parallelism. In addition, we extended the message passing functions in MATLAB to support communication within the group and group-to-group communication.

We discuss this implementation in great detail in the next chapter.

Chapter 4

Implementation

This chapter describes our implementation of a framework for group computing in MATLAB.

4.1 Grouppool

In order to introduce the notion of groups and a hierarchical structure of parallelism, we introduce the concept of the grouppool. A grouppool can be viewed as a parallel computing resource comprising labs in the matlabpool that are logically organized into groups. Built on top of the matlabpool, the grouppool incorporates all of its features and has similar syntax. Grouppool defines and creates a group structure and provides the mechanism to assign each lab in the matlabpool to a group. In addition to the number of labs, grouppool allows the user to specify the number of groups required.
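Since grouppool mirrors the matlabpool syntax, its usage might look like the following sketch; the exact argument list shown here is an illustrative assumption rather than the definitive interface:

grouppool('open', 8, 2);    % request eight labs organized into two groups
% ... group computations execute here ...
grouppool('close');         % release the groups and the underlying labs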

There are several ways these labs can be organized into groups: based on a property such as hostname, sequentially, or randomly. The method used to assign each lab to a group is known as the policy. The policies supported by grouppool are summarized below:

Localized

Labs are organized into groups such that labs from the same node in a cluster are assigned to the same group. The number of groups will be equal to the number of nodes in the cluster. If this policy is used on a local machine, only one group will be created since all the labs are on the same node. This is the default policy if no policy is specified by the user.

Modulo

Labs are distributed to the groups in a circular fashion, where the labs are assigned to the groups in succession (see the sketch after this list). The number of groups is specified by the user.

Random

Labs are randomly assigned to a group. The number of groups is specified by the user. Although this policy assigns each lab to a group selected at random, it tries to distribute the labs to the groups as evenly as possible so that each group contains a roughly equal number of labs.
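As an illustration of the modulo policy, a plausible circular assignment of labs to groups is sketched below; the exact formula used by grouppool is an assumption:

numGroups = 2;    % number of groups specified by the user
numLabsInPool = 8;
groupindex = zeros(1, numLabsInPool);
for i = 1 : numLabsInPool
    % labs 1, 3, 5, 7 fall into group 1 and labs 2, 4, 6, 8 into group 2
    groupindex(i) = mod(i - 1, numGroups) + 1;
end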

An allocated grouppool is a prerequisite to group computing in MATLAB. When a grouppool is allocated, MATLAB performs the following tasks:

Allocation of a matlabpool

The required number of MATLAB computational engine processes is started on the local machine or remote compute cluster.

Creation of the group data structure

Information on every lab in the matlabpool is stored in a data structure. This data structure stores the labindex, hostname, groupindex and grouprank of each lab. Grouppool obtains the hostnames of the labs using spmd.

Assignment of labs to groups

Grouppool assigns each lab in the matlabpool to a group based on the policy specified by the user. Each group is also assigned a groupindex and within each group, each lab in the group is assigned a grouprank. This creates the hierarchical structure which enables multi-level parallelism.

Replication of group data structure to every lab

The group data structure is replicated onto the workspace of every lab. The data structure is not stored as a variable on the labs; instead, it is stored as a persistent variable in the function groupdata (sketched below). This ensures that the workspaces of the labs are empty and clean, preventing potential variable naming conflicts.
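A minimal sketch of such a persistent store, assuming the groupdata function behaves as described above (the actual thesis implementation may differ in detail):

function data = groupdata(newData)
% GROUPDATA stores and retrieves the group data structure on a lab.
% Keeping it in a persistent variable rather than in the lab workspace
% leaves the workspace clean and avoids variable naming conflicts.
persistent groupInfo;
if nargin > 0
    groupInfo = newData;    % set once when the grouppool is allocated
end
data = groupInfo;           % retrieved during subsequent group computations
end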

[Figure: The grouppool structure. The labs of the matlabpool are organized into groups: each lab retains its labindex and is additionally assigned a groupindex identifying its group and a grouprank identifying its position within that group. In the example shown, four labs form two groups of two labs each.]