Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplication operations that need to be parallelized across multiple nodes. The presence of straggling nodes—computing nodes that unpredictably slow down or fail—is a major bottleneck in such distributed computations. Ideal load balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds as well as the ability to quickly move data. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they ignore partial work done by straggling nodes, thus resulting in a lot of redundant computation. We propose a *rateless fountain coding* strategy that achieves the best of both worlds—we prove that its latency is asymptotically equal to ideal load balancing, and it performs asymptotically zero redundant computations. Our idea is to create linear combinations of the *m* rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than *m* row-vector products are collectively finished by the nodes. Evaluation on parallel and distributed computing yields as much as three times speedup over uncoded schemes.

### 1. Introduction

Matrix-vector multiplications form the core of a plethora of scientific computing and machine learning applications that include solving partial differential equations, forward and back propagation in neural networks, computing the PageRank of graphs, etcetera. In the age of Big Data, most of these applications involve multiplying extremely large matrices and vectors and the computations cannot be performed efficiently on a single machine. This has motivated the development of several algorithms that seek to speed up matrix-vector multiplication by distributing the computation across multiple computing nodes. The individual nodes (the *workers*) perform their respective tasks in parallel, while a central node (the *master*) aggregates the output of all these workers to complete the computation.

**The problem of stragglers.** Unfortunately, large-scale distributed computation jobs are often bottlenecked by tasks that are run on unpredictably slow or unresponsive workers *called stragglers.*^{3} Since the job is complete only when all its parallel tasks are executed, this problem is aggravated for jobs with a large number of parallel tasks. Even a tiny probability of a node slowing down and becoming a straggler can cause a big increase in the expected latency of the job. As pointed out in Dean^{3} (Table 1), the latency of executing many parallel tasks could be significantly larger (140 ms) than the median latency of a single task (1 ms). Straggling of nodes is widely observed in cloud infrastructure, and it is the norm rather than an exception.

**Table 1. Comparison of different strategies to multiply a m x n matrix A with vector x using p worker nodes. The latency values are approximate, and a number of computations values are for the case when none of the nodes slows down.**

**1.1. Previous solution approaches**

**Load balancing strategies.** An obvious solution to overcome the bottleneck of waiting for slow nodes is to move tasks from busy or slow nodes to idle or fast nodes. Such work stealing or dynamic load balancing strategies are often implemented in shared and distributed memory settings. This approach involves establishing a protocol for continually monitoring workers and moving tasks from slow to fast workers. It entails considerable centralized control over the computing environment, which may not be feasible in cloud systems where the nodes can unpredictably slow down due to background processes, network outages, and so forth. There may also be concerns regarding data privacy and the communication cost of moving data between nodes in a distributed system spread over a large geographical area. Thus, it is desirable to develop a principled and easy-to-implement straggler mitigation approach that does not involve moving data between workers.

**Task replication.** Existing systems such as MapReduce^{4} and Spark^{20} deal with the problem of stragglers by launching replicas of straggling tasks, which are referred to as *back-up* tasks. This strategy of task replication has been theoretically analyzed in Wang^{18} where schemes for adding redundant copies based on the tail of the runtime distribution at the workers are proposed. In the area of queuing theory, there is a line of recent works analyzing the effect of task replication on queuing delays in multiserver systems.^{8,9,11,17} For distributed matrix-vector multiplication, which is the focus of this work, a simple replication strategy is to divide matrix **A** into *p/r* (where *r* divides the number of workers *p*) sub-matrices and replicate each submatrix at *r* workers. Then, the master waits for the fastest worker from each set of *r* to finish multiplying its submatrix with the vector **x** in order to recover the result **b = Ax.**

**Erasure-coded matrix-vector multiplication.** From a coding-theoretic perspective, task replication is a special case of more general *erasure codes* that overcome loss or erasure of data and recover the message from a subset of the transmitted bits. Erasure codes were first employed to overcome stragglers in the context of fast content download from distributed storage.^{10} A file that is divided into *k* chunks and encoded using a (*p, k*) Maximum Distance Separable (MDS) code (e.g., a Reed-Solomon code) can be recovered by downloading any *k* out of *p*-encoded chunks.

Unlike distributed storage, erasure coding of computing jobs is not straightforward. A job with *n* parallel tasks needs to be designed such that the execution of any *k* out of *n* tasks is sufficient to complete the job. However, this is possible for linear computations such as matrix-vector multiplication. Recent works^{6,12,13} have employed MDS codes to speed up the computation of matrix vector products in a distributed setting. For example, suppose that we want to multiply a matrix **A** with vector **x** using 3 worker nodes and a (3, 2) MDS code. Then, we split **A** along rows into two matrices **A**_{1} and **A**_{2} such that . The worker nodes store matrices **A**_{1}, **A**_{2}, and **A**_{1} + **A**_{2}, respectively, and each node multiplies its matrix with **x.** Results from any two workers are sufficient to obtain **Ax,** and thus, the system is tolerant to 1 straggler.

**1.3. Our rateless coding approach**

The replication or MDS coding strategies used for matrix-vector multiplication are fixed-rate strategies; that is, they fix a redundancy rate *k/p* when encoding matrix **A** and use the results of the fastest *k* out of *p* worker nodes. The key drawbacks of this approach are that: (1) it cannot perform load balancing within the fastest *k* nodes and account for variabilities in their speeds, and (2) it discards all partial works done by the *p* – *k* straggling workers. We address both these issues by proposing the use of *rateless fountain codes*, specifically Luby Transform (LT) codes^{14} that are known to be scalable and form the basis for practical erasure coding schemes in wireless communication standards.^{16}

The rateless-coded matrix-vector multiplication algorithm generates *m _{e}* = α

*m*(α > 1) coded linear combinations of the

*m*rows of matrix

**A**and distributes them equally across

*p*worker nodes. Each of these linear combinations is generated by choosing

*d*of the

*m*rows uniformly at random and adding them. For example, if

*d*= 2, and we choose rows

**a**

_{1}and

**a**

_{3}of

**A,**then the encoded row is

**a**

_{1}+

**a**

_{3}, as shown in Figure 3a. The value

*d*, referred to as the degree of the linear combination, is an i.i.d. realization of a carefully chosen degree distribution ρ(

*d*). For LT codes, ρ(

*d*) is the Robust Soliton distribution. Each worker receives

*m*encoded rows of matrix

_{e}/p**A**and a copy of the vector

**x.**It computes row-vector products, for example, (

**a**

_{1}+

**a**

_{3})

^{T}

**x**=

*b*

_{1}+

*b*

_{3}for the encoded row (

**a**

_{1}+

**a**

_{3}), and sends them back to the master node. Due to the carefully chosen degree distribution ρ(

*d*), the master can use an iterative peeling decoder

^{14}(illustrated in Figure 3b) to recover the product vector

**b**=

**Ax**with a low decoding complexity of

*O*(log

*m*). Overall, it needs to wait for any

*M'*=

*m*(1 + ε) row-vector products to be completed across

*all*the nodes, where ε is a small overhead; ε → 0 as

*m*→ ∞.

Rateless codes offer the following key benefits over previously proposed coding techniques based on MDS codes.

**Near-ideal load balancing.** In order to adjust to varying speeds of worker nodes and minimize the overall time to complete the multiplication **Ax,** one can use an *ideal load-balancing scheme that dynamically assigns one row-vector product computation task to each worker node as soon as the node finishes its current task.* Thus, faster nodes complete more tasks than slower nodes, and the final product **b** = **Ax** is obtained when the *p* nodes collectively finish *m* row-vector products. Our rateless coding strategy achieves nearly the same load balancing benefit without the communication overhead of dynamically allocating the tasks one row-vector product at a time. In our strategy, the nodes need to collectively finish *M'* = *m*(1 + ε) row-vector products, for small ε that goes to zero as *m* → ∞. In contrast, MDS coding strategies do not adjust to different degrees of node slowdown; they use the results from *k* nodes and ignore the remaining *p* – *k* nodes. As a result, rateless codes achieve a much lower delay than MDS coding strategies.

**Negligible redundant computation.** A major drawback of MDS coding is that if there is no straggling, the workers collectively perform *mp/k* row-vector products, instead of *m.* With the rateless coding strategy, the nodes collectively perform at most *M'* = *m*(1 + ε) row-vector products where ε → 0 as *m*, the number of rows in the matrix **A** increases.

**Maximum straggler tolerance.** A (*p*, *k*) MDS-coded distributed computation is robust to *p* – *k* straggling nodes, for *k* ∈ [1, 2, …, *p*]. Reducing *k* increases straggler tolerance but also adds more redundant computation. The rateless coding scheme can tolerate up to *p* – 1 stragglers, with negligible redundant computation overhead.

**Low decoding complexity.** One may argue that MDS coding approaches can also use partial computations and achieve near-perfect load balancing if we construct an (*m _{e}*,

*m*) MDS code (for a given amount of redundancy

*m*) to encode a

_{e}/m*m*x

*n*matrix. The decoding complexity of such a code is

*O*(

*m*

^{3}), which is unacceptable for large

*m*in practice. Rateless codes offer a low decoding complexity:

*O*(

*m*log

*m*) for LT codes,

^{14}and

*O*(

*m*) for Raptor codes.

^{16}

The use of LT codes for matrix-vector multiplication has been recently proposed in Severinson^{15} and Wang.^{19} However, these works do not utilize the "rateless" property of LT codes and instead use them in a fixed-rate setting. To the best of our knowledge, our work is the first to exploit the *rateless* nature of LT codes to perform load balancing in distributed matrix computations and utilize all the partial works done by slow workers. We provide the first theoretical analysis of the latency achieved by this strategy with ideal load balancing and show that it asymptotically achieves near-perfect latency and computation cost. Moreover, we present extensive experimental results local parallel computing and distributed computing on Amazon EC2.

### 2. Problem Formulation

**2.1. System model**

Consider the problem of multiplying a *m* × *n* matrix **A** with a *n* × 1 vector **x** using *p* worker nodes and a master node as shown in Figure 1. The worker nodes can only communicate with the master and cannot directly communicate with other workers. The goal is to compute the result **b** = **Ax** in a distributed fashion and mitigate the effect of unpredictable node slowdown or straggling. The rows of **A** are encoded using an error-correcting code to give the *m _{e}* ×

*n*encoded matrix

**A**

_{e}, where

*m*≥

_{e}*m.*We denote the amount of redundancy added by the parameter α =

*m*Matrix

_{e}/m.**A**

_{e}is split along its rows to give

*p*submatrices

**A**

_{e,1}, …,

**A**

_{e,p}of equal size such that worker

*i*stores submatrix

**A**

_{e,i}. To compute the matrix-vector product

**b**=

**Ax**, the vector

**x**is communicated to the workers such that Worker

*i*is tasked with computing the product

**A**

_{e,i}

**x.**

**Figure 1. The system model for coded distributed matrix vector multiplication with a master-worker framework. The master generates the encoded matrix A _{e} by applying a coding scheme to the rows of A. Worker i stores a submatrix of A_{e} denoted by A_{e,i} and sends encoded row-vector products be_{e,i} to the master (i = 1, …, p). Different b_{e,i}'s may have different sizes. The master decodes the encoded row-vector products in be_{e,i}, i = 1, …, p to recover b = Ax.**

To complete the assigned task, each worker needs to compute a sequence of row vector products of the form **a**_{e,j}**x** where **a**_{e,j} is the *j*th row of **A**_{e}. The time taken by a worker node to finish computing one or more row-vector products may be random due to variability in the node speed or variability in the amount of computation assigned to it. The master node aggregates the computations of all, or a subset of, workers into the vector **b**_{e}, which is then decoded to give the final result **b** = **Ax.** If **b**_{e} is not decodable, the master waits until workers compute more row-vector products.

**2.2. Performance criteria**

We use the following metrics to compare different distributed matrix-vector multiplication schemes *via* theoretical analysis and associated simulations (Section 4), and experiments in parallel, distributed, and serverless environments (Section 5).

DEFINITION 1 (LATENCY (*T*)). *The latency T is the time required by the system to complete enough computations so that* **b** = **Ax** *can be successfully decoded from worker computations aggregated in* **b _{e}**.

DEFINITION 2 (COMPUTATIONS (*C*)). *The number of computations C is defined as the total number of row-vector products* **a**_{e,j}**x** *performed collectively by the worker nodes until* **b** = **Ax** *is decoded.*

For any strategy, we always have *C* ≥ *m* where *m* is the number of rows of **A** or the number of elements in **b.**

**2.3. Benchmarks for comparison**

We compare the performance of the proposed rateless-coded strategy with three benchmarks: ideal load balancing, *r*-replication, and the (*p, k*) MDS-coded strategy, which are formally described below. Figure 2 illustrates the differences in the way row-vector product tasks are assigned to and collected from workers in each strategy.

**Figure 2. Each square represents one row-vector product task out of a total of m tasks to be completed by p workers. In the ideal scheme, we have a central queue of m tasks and each worker is assigned a new task as soon as it becomes idle until all m tasks are completed. In the replication scheme, the master waits for the fastest worker for each submatrix. With MDS coding, the master needs to wait for k out of p workers, but each worker has to complete m/k tasks. The rateless coded strategy requires waiting for only m(1 + ε) tasks across all workers.**

**Ideal load balancing.** The multiplication of the *m* × *n* matrix **A** with the *n* × 1 vector **x** can be treated as a job with *m* tasks, where each task corresponds to one row-vector product. In the ideal load balancing strategy, the master node maintains a central queue of these *m* tasks. It dynamically assigns one task to each of the *p* workers as soon as a worker finishes its previous task. The matrix-vector multiplication is complete when exactly *m* tasks are collectively finished by the workers. This strategy seamlessly adapts to varying worker speeds without performing any redundant computation (*C* = *m*); hence, it gives the optimal latency computation trade-off. This strategy may be impractical due to the constant communication between the master and the worker nodes. Nevertheless, it serves as a good theoretical benchmark for comparison with the rateless, replication, and MDS strategies.

**The r-replication strategy.** A simple distributed multiplication strategy is to split

**A**along its rows into

*p/r*submatrices

**A**

_{1}, …,

**A**

_{p/r}, with

*rm/p*rows each (assume that

*p/r*divides

*m*) and multiply each submatrix with

**x**in parallel on

*r*distinct worker nodes. The master collects the results from the fastest of the

*r*nodes that have been assigned the task of computing the product

**A**

_{i}

**x**for all

*i.*The computed products are aggregated into the

*m*× 1 vector

**b.**

*Setting r*= 1

*corresponds to the naive or uncoded strategy where*

**A**

*is split into p sub-matrices and each worker node computes the corresponding submatrix-vector product.*While this approach performs the least number of computations, it is susceptible to straggling nodes or node failures. Increasing the number of replicas provides greater straggler tolerance at the cost of redundant computations. Real distributed computing frameworks such as MapReduce

^{4}and Spark

^{20}often use

*r*= 2; that is, each computation is assigned to two different worker nodes for added reliability and straggler tolerance.

**The** (*p, k*) **MDS-coded strategy.** Recent works^{12, 13} have applied MDS coding to overcome the problem of stragglers in the uncoded strategy. The strategy involves pre-multiplying **A** at the central node with a suitable encoding matrix **F** denoting the MDS codes. For encoding using a (*p, k*) MDS code, the matrix **A** is split along its rows into *k* matrices **A**_{1}, …, **A**_{k}, each having *m/k* rows. The MDS code adds *p* – *k* redundant matrices **A**_{k+1}, …, **A**_{p}, which are independent linear combinations of the matrices **A**_{1}, …, **A**_{k}. Worker *i* computes the product **A**_{i}**x**. Thus, the system is robust to *p – k* stragglers. However, this strategy adds a significant computation overhead. When none of the nodes are slow, the system performs *mp/k* row-vector products (as opposed to *m* row-vector products in the uncoded case).

### 3. Proposed Rateless Strategy

We describe how rateless codes, specifically LT codes,^{14} can be applied to perform coded matrix vector multiplication and then propose a distributed implementation of this scheme for straggler mitigation in computing the matrix-vector product **b** = **Ax** using the framework of Section 2.1.

**3.1. LT-coded matrix-vector multiplication**

Luby Transform (LT) codes proposed in Luby^{14} are a class of erasure codes that can be used to generate a limitless number of encoded symbols from a finite set of source symbols. We apply LT codes to matrix-vector multiplication by treating the *m* rows of the matrix **A** as source symbols. Each encoded symbol is the sum of *d* source symbols chosen uniformly at random from the matrix rows. Thus, if *S _{d}* ⊆ {1, 2, …,

*m*} is the set of

*d*row indices, the corresponding encoded row is .

The number of original rows in each encoded row, or the degree *d*, is chosen according to the Robust Soliton degree distribution as described in Luby.^{14} Once the degree *d* is chosen, encoding is performed by choosing *d* source symbols uniformly at random (this determines *S _{d}*) and adding them to generate an encoded symbol. The encoding process is illustrated in Figure 3a.

**Figure 3. (a) Bipartite graph representation of the encoding of the rows a _{1}, a_{2}, … a_{m} of matrix A. Each encoded row is the sum of d rows of A chosen uniformly at random, where d is drawn from the Robust Soliton degree distribution. (b) In each step of the iterative decoding process, a single-degree one-encoded symbol is decoded directly and is subtracted from all sums in which it participates.**

Once the rows of the encoded matrix **A**_{e} are generated, we can compute the encoded matrix vector product **b**_{e} = **A**_{e}**x**. To decode the desired matrix vector product **b** = **Ax** from a subset of *M'* symbols of **b**_{e} we use the *iterative peeling decoder* described in Luby.^{14} If **b** = [*b*_{1}, *b*_{2}, …, *b _{m}*], the decoder may receive symbols

*b*

_{1}+

*b*

_{2}+

*b*

_{3},

*b*

_{2}+

*b*

_{4},

*b*

_{3},

*b*

_{4}, and so on since each row of

**A**

_{e}is a sum of some rows of

**A.**Decoding (illustrated in Figure 3b) is performed in an iterative fashion. In each iteration, the decoder finds a degree one encoded symbol, covers the corresponding source symbol, and subtracts the symbol from all other encoded symbols connected to that source symbols.

Since the encoding uses a random bipartite graph, the number of symbols required to decode the *m* source symbols successfully is a random variable, which for the Robust Soliton degree distribution is with probability at least 1 – δ.^{14} Moreover, the complexity of the decoding process described here is *O*(*m* ln *m*) for LT codes (due to the careful design of the Robust Soliton distribution). This is the key reason for preferring LT codes in our setting over other random linear codes for which the decoding complexity can be as high as *O*(*m*^{3}).

**3.2. Distributed implementation**

The *m* × *n* matrix **A** is encoded to generate an *m _{e}* ×

*n*encoded matrix

**A**

_{e}where

*m*= α

_{e}*m.*Each row of

**A**

_{e}is the sum of a random subset of rows of

**A**as described in Section 3.1. The knowledge of the mapping between the rows of

**A**and the rows of

**A**

_{e}is crucial for successful decoding as illustrated in Figures 3a and 3b. Hence, this mapping is stored at the master. The encoding step can be treated as a pre-processing step in that it is only performed initially.

The α*m* rows of the encoded matrix are distributed equally among the *p* worker nodes as illustrated in Figure 1. To multiply **A** with a vector **x**, the master communicates **x** to the workers. Each worker multiplies **x** with each row of **A**_{e} stored in its memory and returns the product (a scalar) to the master. The master collects row-vector products of the form **a**_{e,j}**x** (elements of **b**_{e}) from the workers until it has enough elements to be able to recover **b.** If a worker node completes all the α*m/p* row-vector products assigned to it before the master is able to decode **b,** it will remain idle, while the master collects more row-vector products from other workers.

Once the master has collected a sufficient number of coded row-vector products from the workers, it can recover the desired matrix vector product **b** = **Ax** from the subset of the elements of **b**_{e} = **A**_{e}**x** that it has collected using the iterative peeling decoder. Once the master decodes all elements of the product vector **b** = **Ax**, it sends a *done* signal to all workers nodes to stop their local computation.

The following modifications can make the current implementation even more efficient in real systems.

**Blockwise communication:** To truly monitor partial work done by each worker, the master needs to receive each encoded row-vector product **a**_{e,j}**x** from the workers. However, this imposes a large communication overhead, which may increase latency in a slow network. To prevent this, in our distributed computing experiments, we communicate submatrix-vector products where is the *j*th part of the encoded submatrix **A**_{ei} stored at worker *i*, and each part corresponds to approximately 10% of the total rows of the sub-matrix. Note that if **A** is very large, then it is infeasible for worker *i* to read **A**_{ei} from memory at once and so **A**_{ei} **x** needs to be computed in parts for *any* coding scheme.

**Using raptor codes:** Despite their ease of implementation and fast decoding, LT codes^{14} are suboptimal in practice due to the overhead of *M'* – *m* extra symbols required to decode the original *m* source symbols. In our experiments, we observe that for a matrix **A** with *m* = 11,760 rows, we need to wait for 12,500 encoded row-vector products to decode **b** = **Ax** with 99% probability. Advanced rateless codes like Raptor Codes^{16} can decode *m* source symbols from *m*(1 + ε) symbols for any *constant* ε even for *finite values* of *m.* Since Raptor Codes are the rateless codes used in practical wireless standards, we expect them to be used in practical implementations of our coded distributed matrix vector multiplication strategy to improve efficiency.

**Using systematic rateless codes:** We can entirely avoid decoding (in the absence of significant straggling) by using Systematic LT/Raptor Codes^{16} where the *m* source rows **a**_{1}, **a**_{2}, …, **a**_{m} form a subset of the encoded rows in **A**_{e}. The overall scheme can be designed so that each worker first computes the row-vector products corresponding to the systematic symbols **a**_{1}, **a**_{2}, …, **a**_{m} and then computes other encoded products (in the event of node slowdown). This would preclude the need for decoding if there is no/little straggling thereby reducing the overall latency.

### 4. Performance Analysis

In this section, we theoretically analyze the performance of LT coding and the three benchmark strategies—ideal load balancing, (*p, k*)-MDS, and *r*-replication—in terms of latency (Definition 1) and computations (Definition 2). Our results are summarized in Table 1.

**4.1. Delay model**

We consider a simple delay model, illustrated in Figure 4, where worker *i* has an initial delay (setup time) of *X _{i}* after which it spends a constant time τ per row-vector product task. Thus, worker

*i*requires time

*Y*to perform

_{i}*B*row-vector product computations where

_{i}

**Figure 4. Worker i has a random initial delay X_{i}, after which it completes row-vector product tasks (denoted by the small rectangles), taking time τ per task. Latency T is the time until enough tasks have been completed for the product b = Ax to be recovered.**

This delay model is motivated by the observations of Dean^{3} where it is noted that the variability in latency arises largely from delays due to background tasks running at worker nodes and that once a request actually begins execution, the variability is considerably lower. Our model also captures the effect of increasing the amount of computations on the delay—if a worker is assigned more computations, there is larger delay. When *X _{i}* is exponentially distributed with rate μ, the time worker

*i*takes to perform

*b*computations is distributed as

**4.2. Rateless coding versus ideal load balancing**

In the ideal load balancing strategy, the *m* row-vector product tasks (which comprise the job of multiplying the *m* × *n* size **A** with vector **x**) are kept in a central queue at the master and dynamically allocated to idle workers one task at a time. The job is complete when *m* tasks are collectively finished by the workers. The rateless coding strategy differs from this ideal policy in two ways due to which its latency is larger: (1) each worker gets *m _{e}/p* = α

*m/p*encoded rows and thus, a fast worker may run out of rows before the master is able to recover

**b**=

**Ax,**and (2) the workers collectively need to finish

*m*(1 + ε) tasks where ε is a small overhead that diminishes as

*m*→ ∞. Our main theoretical result stated in the following (informal) theorem compares the two latencies.

THEOREM 1. *The latency T _{LT} and computations C_{LT} of our LT coded distributed matrix-vector multiplication strategy in computing the product of a m* ×

*n matrix*

**A**

*with a n*× 1

*vector*

**x**

*satisfy the following for large m:*

*where m _{e}* = α

*m*

*(for*α ≥ 1

*) is the number of encoded rows, the initial delay at each worker is X*∼ exp(∝)

_{i}*and*τ

*is the time taken to compute each row-vector product. Due to the inherent design of LT codes*, ε → 0

*as m*→ ∞.

These results show that as long as the number of encoded rows *m _{e}* is sufficiently larger than

*m*, despite not performing dynamic task assignment, the rateless coding strategy can seamlessly adapt to varying initial delays at the workers. Its runtime

*T*

_{LT}and computations

*C*

_{LT}asymptotically converge to the ideal strategy. This is also illustrated in Table 1, which shows that the expected latency and number of computations for the rateless-coded strategy are of the same order as the ideal strategy (with the 1 + ε overhead corresponding to the extra number of symbols required for successful decoding). Lastly note that the ideal load balancing scheme is not exactly realizable in practice. Approaches like work stealing

^{5}can potentially approximate this strategy by physically moving tasks from busy workers to idle workers. However, implementing such approaches may not be feasible in all settings, for example, when the communication latency between workers is too large, or a data is restricted to lie on a particular worker due to privacy concerns. In this work, we show that it is possible to

*algorithmically*achieve near-ideal latency performance for matrix-vector multiplication by using the rateless-coded computing strategy described in Section 3

*without*physically moving data between workers.

**4.3. Comparison with MDS and replication strategies**

Unlike our rateless coding strategy, MDS-coded and replication-based strategies give strictly worse latency and cost than the ideal scheme and the gap does not go to zero.

Table 1 contains the expressions for expected latency of the *r*-replication and (*p, k*) – MDS-coded schemes for the delay model in (2). Observe that in both cases, adding redundancy (increasing *r* and reducing *k*, respectively) leads to an increase in the first term (more computation at each node) and decrease in the second term (less delay due to stragglers). Thus, straggler mitigation comes at the cost of additional computation at the workers, which might even lead to an increase in latency. This is in contrast to Theorem 1, which indicates that the expected latency of the rateless-coded strategy always decreases on adding redundancy (increasing α). Moreover, the presence of the log factor in the second term causes the latencies *T*_{Rep}, *T*_{MDS} to always be larger than *T*_{ideal} since there is no log-factor 1/∝ term in the latencies of the Ideal and LT-coded schemes in Table 1.

REMARK 1. Another important advantage of the rateless-coded strategy is that the number of computations performed by the workers, *C*_{LT}, is always equal to *M'* and does not increase on increasing redundancy (increasing α) unlike for the MDS and replication strategies. Moreover since E[*M'*] = *m*(1 + ε) and ε → 0 as *m* → ∞, E[*C*_{LT}] asymptotically approaches the minimum number of computations (*m*) required to recover a *m*-dimensional matrix-vector product.

REMARK 2. While the benefits of using partial work from all workers can be obtained by using any random linear code on the rows of **A,** the key strength of LT codes is their low *O*(*m* ln *m*) decoding complexity. Using an (*m _{e}*,

*m*) MDS code on the rows of

**A**has

*O*(

*m*

^{3}) decoding complexity, which is unacceptable for large

*m.*

We simulate the MDS, replication, and LT-coded schemes under our delay model (1) for distributed matrix-vector multiplication with *m* = 10,000 matrix rows, *p* = 10 workers, and delay model parameters ∝ = 1.0, τ = 0.001 (Figure 5). We limit the amount of redundancy to α = *m _{e}/m* ≤ 2.0 since this is the amount of redundancy in the basic 2-replication scheme. Observe that the LT-coded strategy (α = 2.0) clearly outperforms MDS coding (with

*k*= 8) in that it not only exhibits near-ideal latency (Figure 5a) but also performs fewer total computations (Figure 5b) than MDS coding. Moreover, increasing redundancy (reducing

*k*) in MDS coding leads to higher latency after a point, as illustrated in Figure 5c (and as expected from Table 1). On the other hand, the latency of LT coding converges to that of the Ideal scheme on increasing α, without any increase in computations.

**Figure 5. The tail probability of latency is highest for replication schemes. MDS codes perform better in terms of latency but they perform a large number of redundant computations. The latency tail of LT codes is the lightest among all the schemes. Moreover, the LT-coded schemes perform significantly fewer redundant computations than MDS codes or replication. All simulations are performed for a matrix-vector multiplication task with m = 10,000 matrix rows, p = 10 worker nodes, and delay model parameters ∝ = 1.0, τ = 0.001.**

### 5. Experimental Results

We demonstrate the effectiveness of rateless codes in speeding up distributed matrix-vector multiplication in parallel and distributed computing settings.

**5.1. Parallel computing**

We consider multiplication of a 10,000 × 10,000 matrix **A** of random integers with a 10,000 × 1 vector **x** of random integers parallelized over 100 processes using Python's Multiprocessing Library.^{7} We compare the un-coded, 2-replication, MDS coding (*k* = 80, 50), and LT coding (α = 1.25, 2.0) approaches. Rows of the encoded matrix **A**_{e} are divided equally among the *p* = 100 processes, which multiply the rows with **x** in parallel. The experiment is repeated 10 times with a different random **x** each time and we record the average latency (time required to collect enough row-vector products for successful decoding) and total computations. Results of average latency (Figure 6a) show that LT-coded and MDS-coded approaches are clearly faster (about at least 1.2 times) than the uncoded and 2-replication approaches, while Figure 6c shows that the LT-coded approaches also perform fewer total computations than the MDS or 2-replication strategies thus leading to more efficient resource utilization. Note that while MDS coding with *k* = 80 has latency comparable to that of LT coding (both for α = 1.25 and α = 2.0), both latency and total computations with MDS coding increase on decreasing *k* to 50 due to the higher computational load at each node (as discussed in Section 4). Recall that *k* corresponds to the number of "fast" workers in the system. In most real systems, the number of "fast" workers is transient and thus unpredictable. Our experiments show that MDS coding is highly sensitive to the choice of *k* with incorrect choices leading to *higher* latency. LT coding on the other hand is not only fast, but is also insensitive to the amount of redundancy (α) in that α can be as large as permitted by memory constraints without loss in performance.

**Figure 6. Experiments on coded distributed matrix vector multiplication in parallel (Python Multiprocessing ^{7}) and distributed [AWS EC2^{1}] settings show that the LT-coded strategy has lower average latency than all other approaches (1.2 times to 3 times improvement across scenarios) and performs fewer computations than replication or MDS coding. Each error bar corresponds to one standard deviation.**

**5.2. Distributed computing**

We created a cluster of 70 t2. small workers on Amazon Web Services (AWS) EC2^{1} for multiplying a 11,760 × 9216 matrix **A** from the Self-Taught Learning (STL)-10^{2} dataset with different vectors (of length 9216) from the same dataset. Once again, we compared uncoded, 2-replication, MDS coding (*k* = 56, 35), and LT coding (α = 1.25, 2.0). The encoded matrix A_{e} is divided equally among the *p* = 70 workers and each worker computes approximately 14 row-vector products at a time before communicating the results to the master to balance excessive communication (communicating one row at a time) and excessive data size (communicating all rows at once). Figure 6b shows the average latency (over five trials) of the different approaches. Both LT-coded approaches are almost two times faster than the MDS-coded approaches and almost three times faster than the uncoded approaches. Figure 6d shows that LT-coded strategies also perform fewer total computations than MDS or 2-replication. Plots in Figure 7 also show that variability in individual worker times is significantly lower for our rateless-coded strategy (Figure 7d) than for other approaches as fast nodes perform more tasks than slow nodes under our approach leading to much better load balancing. Additionally, *T*_{LT} is closest to *T*_{ideal}, the latency of the ideal load-balancing strategy, approximated as the minimum time to compute 11,760 row-vector products across all workers.

**Figure 7. Comparison of load balancing across different matrix-vector multiplication approaches. The height of the bar plot for each worker indicates the time spent by the worker computing row-vector products either until it finishes its assigned tasks or is terminated by the master because the final matrix-vector product Ax has been successfully decoded. The dash-dot line indicates the overall latency (time at which matrix-vector product Ax can be successfully decoded) in each case, and the black-dashed line is the latency of ideal load balancing. The LT-coded approach exhibits near-ideal load balancing and has lower latency than other approaches.**

### 6. Conclusion

We propose an erasure coding strategy based on *rateless fountain codes* to speed up distributed matrix-vector multiplication in the presence of slow nodes (stragglers). For a matrix with *m* rows, our strategy requires the nodes to *collectively* finish slightly more than *m* row-vector products. Thus, it seamlessly adapts to varying node speeds and achieves near-perfect load balancing. Moreover, it has a small overhead of redundant computations (asymptotically zero) and low decoding complexity. Theoretical analysis and experiments show that our approach strikes a better latency-computation trade-off than existing uncoded, replication, and MDS coding approaches. In the future, we plan to extend our approach to special linear computations such as sparse matrix-vector multiplication (SpMV) and Fourier Transforms, and to devise principled erasure-coded schemes to speed up distributed *nonlinear* computations.

### Acknowledgments

This project was supported in part by the CMU Dean's Fellowship, Qualcomm Innovation Fellowship, NSF CCF grant no. 1850029, and an Amazon Credits for Research Grant.

### References

1. Amazon. Amazon web services EC2, 2006. https://aws.amazon.com/ec2/.

2. Coates, A., Ng, A., Lee, H. An analysis of single-layer networks in unsupervised feature learning. In *Proceedings of the 14 ^{th} International Conference on Artificial Intelligence and Statistics* (2011), Proceedings of Machine Learning Research (PMLR), 215–223.

3. Dean, J., Barroso, L.A. The tail at scale. *Commun. ACM 56*, 2 (2013), 74–80.

4. Dean, J., Ghemawat, S. MapReduce: simplified data processing on large clusters. *Commun. ACM 51*, 1 (2008), 107–113.

5. Dinan, J., Olivier, S., Sabin, G., Prins, J., Sadayappan, P., Tseng, C.-W. Dynamic load balancing of unbalanced computations using message passing. In *Proceedings of the IEEE International Parallel and Distributed Processing Symposium*, IEEE, NY, 2007, 1–8.

6. Dutta, S., Cadambe, V., Grover, P. "Short-dot" computing large linear transforms distributedly using coded short dot products. In *Proceedings of the 30 ^{th} International Conference on Neural Information Processing Systems* (2016), Neural Information Processing Systems Foundation, Inc. (NIPS), 2100–2108.

7. P. S. Foundation. Multiprocessing. 2008. https://docs.python.org/3/library/multiprocessing.html.

8. Gardner, K., Harchol-Balter, M., Scheller-Wolf, A., Van Houdt, B. A better model for job redundancy: Decoupling server slowdown and job size. *IEEE/ACM Trans. Network 25*, 6 (2017), 3353–3367.

9. Joshi, G. Synergy via redundancy: Boosting service capacity with adaptive replication. *ACM SIGMETRICS Perform. Eval. Rev 45*, 3 (2018), 21–28.

10. Joshi, G., Liu, Y., Soljanin, E. On the delay-storage trade-off in content download from coded distributed storage systems. *IEEE J. Sel. Areas Commun 32*, 5 (2014), 989–997.

11. Joshi, G., Soljanin, E., Wornell, G. Efficient redundancy techniques for latency reduction in cloud systems. *ACM Trans. Model. Perform. Eval. Comput. Syst 2*, 2 (2017), 1–30.

12. Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., Ramchandran, K. Speeding up distributed machine learning using codes. *IEEE Trans. Inf. Theory 64*, 3 (2017), 1514–1529.

13. Li, S., Maddah-Ali, M.A., Avestimehr, A.S. A unified coding framework for distributed computing with straggling servers. In *Proceedings of the IEEE Global Communications Conference Workshops*, IEEE, NY, 2016, 1–6.

14. Luby, M. LT codes. In *Proceedings of the 43 ^{rd} Annual IEEE Symposium on Foundations of Computer Science*, IEEE, NY, 2002, 271–271.

15. Severinson, A., i Amat, A.G., Rosnes, E. Block-diagonal and LT codes for distributed computing with straggling servers. *IEEE Trans. Commun 67*, 3 (2018), 1739–1753.

16. Shokrollahi, A., Luby, M., et al. Raptor codes. *Found. Trends*® *Commun. Inf. Theory 6*, 3–4 (2011), 213–322.

17. Sun, Y., Zheng, Z., Koksal, C. E., Kim, K., Shroff, N. B. Provably delay efficient data retrieving in storage clouds. In *Proceedings of the IEEE Conference on Computer Communications*, IEEE, NY, 2015.

18. Wang, D., Joshi, G., Wornell, G.W. Efficient straggler replication in large-scale parallel computing. *ACM Trans. Model. Perform. Eval. Comput. Syst 4*, 2 (2019), 7:1–7:23.

19. Wang, S., Liu, J., Shroff, N. Coded sparse matrix multiplication. In *Proceedings of the International Conference on Machine Learning* (2018), Proceedings of Machine Learning Research (PMLR), 5152–5160.

20. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I. Spark: cluster computing with working sets. *HotCloud 10*, 10 (2010), 95.

### Footnotes

To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3524292

The original version of this paper was published in *Proceedings of ACM Meas. Anal. Comput. Systems 3*, 3 (Dec. 2019), Article 58.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.