Artificial Intelligence and Machine Learning Research highlights

Data-Driven Algorithm Design

Posted Jun 1 2020

Abstract
1. Introduction
2. Motivating Scenarios
3. A PAC Learning Model
4. Learning Greedy Heuristics
5. Feature-Based Algorithm Selection
6. Choosing the Step Size in Gradient Descent
7. Conclusion
Acknowledgments
References
Authors
Footnotes

The best algorithm for a computational problem generally depends on the “relevant inputs,” a concept that depends on the application domain and often defies formal articulation. Although there is a large literature on empirical approaches to selecting the best algorithm for a given application domain, there has been surprisingly little theoretical analysis of the problem.

We model the problem of identifying a good algorithm from data as a statistical learning problem. Our framework captures several state-of-the-art empirical and theoretical approaches to the problem, and our results identify conditions under which these approaches are guaranteed to perform well. We interpret our results in the contexts of learning greedy heuristics, instance feature-based algorithm selection, and parameter tuning in machine learning.

1. Introduction

Rigorously comparing algorithms is hard. Two different algorithms for a computational problem generally have incomparable performance: one algorithm is better on some inputs but worse on the others. How can a theory advocate one of the algorithms over the other? The simplest and most common solution in the theoretical analysis of algorithms is to summarize the performance of an algorithm using a single number, such as its worst-case performance or its average-case performance with respect to an input distribution. This approach effectively advocates using the algorithm with the best summarizing value (e.g., the smallest worst-case running time).

Solving a problem “in practice” generally means identifying an algorithm that works well for most or all instances of interest. When the “instances of interest” are easy to specify formally in advance—say, planar graphs, the traditional analysis approaches often give accurate performance predictions and identify useful algorithms. However, the instances of interest commonly possess domain-specific features that defy formal articulation. Solving a problem in practice can require designing an algorithm that is optimized for the specific application domain, even though the special structure of its instances is not well understood. Although there is a large literature, spanning numerous communities, on empirical approaches to data-driven algorithm design (e.g., Fink¹¹, Horvitz et al.¹⁴, Huang et al.¹⁵, Hutter et al.¹⁶, Kotthoff et al.¹⁸, Leyton-Brown et al.²⁰), there has been surprisingly little theoretical analysis of the problem. One possible explanation is that worst-case analysis, which is the dominant algorithm analysis paradigm in theoretical computer science, is intentionally application agnostic.

This paper demonstrates that application-specific algorithm selection can be usefully modeled as a statistical learning problem, in the spirit of Haussler.¹³ We prove that many classes of algorithms have small pseudo-dimension; in these cases, an approximately best-in-class algorithm can be learned from a modest number of representative benchmark instances. We interpret our results in the contexts of learning greedy heuristics, instance feature-based algorithm selection, and parameter tuning in machine learning.

Dimension notions from statistical learning theory have been used almost exclusively to quantify the complexity of classes of prediction functions (e.g. Anthony and Bartlett², Haussler¹³). Our results demonstrate that these concepts are useful and relevant in a much broader algorithmic context while also providing a novel formalization of the oft-mentioned but rarely defined “simplicity” of a family of algorithms.

2. Motivating Scenarios

Our learning framework sheds light on several well-known approaches, spanning disparate application domains, to the problem of learning a good algorithm from data. To motivate and provide interpretations of our results, we describe several of these in detail. There are also strong connections between data-driven algorithm design and self-improving algorithms,^1,9 where the goal is to design an algorithm that uses a modest amount of space and, given a sequence of independent samples from an unknown input distribution, converges to the optimal algorithm for that distribution; the full version¹² elaborates on these connections.

2.1. Example 1: greedy heuristic selection

One of the most common and also most challenging motivations for data-driven algorithm design is presented by computationally difficult optimization problems. When the available computing resources are inadequate to solve such a problem exactly, heuristic algorithms must be used. For most hard problems, our understanding of when different heuristics work well remains primitive. For con-creteness, we describe one recent and high-stakes example of this issue, which also aligns well with our model and results in Section 4.

In 2016–2017, the FCC ran a novel double auction to buy back licenses for spectrum from certain television broad-casters and resell them to telecommunication companies for wireless broadband use.¹⁹ The auction generated roughly $10 billion for the US government. The “reverse” (i.e., buy back) phase of the auction determined which stations to buy out and what to pay them. The auction was tasked with buying out sufficiently many stations, so that the remaining stations (who keep their licenses) can be “repacked” into a small number of channels, leaving a target number of channels free to be repurposed for wireless broadband. To first order, the feasible repackings were determined by interference constraints between stations. Computing a repacking, therefore, resembles familiar hard combinatorial problems such as the independent set and graph coloring problems.

The reverse auction deployed by the FCC used a greedy heuristic to decide which stations would keep their licenses and stay on the air. The chosen heuristic favored stations with high value and discriminated against stations that interfered with a large number of other stations. There are many ways of combining these two criteria and no obvious reason to favor one implementation over another. The specific implementation in the FCC auction was justified through trial-and-error experiments using synthetic instances that were thought to be representative. One interpretation of our results in Section 4 is as a post hoc justification of this approach to algorithm design for sufficiently simple collections of algorithms, such as the family of greedy heuristics considered for this FCC auction.

2.2. Example 2: parameter tuning in optimization and machine learning

Many “algorithms” used in practice are really meta-algorithms, with a large number of free parameters that need to be instantiated by the user. For instance, implementing even the most basic version of gradient descent requires choosing a step size and error tolerance. For a more extreme example, CPLEX, a widely used commercial linear and integer programming solver, comes with a 221-page parameter reference manual describing 135 parameters.

An analogous problem in machine learning is “hyperparameter optimization,” where the goal is to tune the parameters of a learning algorithm, so that it learns (from training data) a model with high accuracy on test data, and in particular a model that does not overfit the training data. A simple example is regularized regression, such as ridge regression, where a single parameter governs the trade-off between the accuracy of the learned model on training data and its “complexity.” More sophisticated learning algorithms can have many more parameters.

Figuring out the “right” parameter values is notoriously challenging in practice. The CPLEX manual simply advises that “you may need to experiment with them.” In machine learning, parameters are often set by discretizing and then applying exhaustive search (a.k.a. “grid search”), perhaps with random subsampling (“random search”).⁷ When this is computationally infeasible, variants of gradient descent are often used to explore the parameter space, with no guarantee of convergence to a global optimum.

The results in Section 6 can be interpreted as a sample complexity analysis of grid search for the problem of choosing the step size in gradient descent to minimize the expected number of iterations needed for convergence. We view this as a first step toward reasoning more generally about the problem of learning good parameters for machine learning algorithms.

2.3. Example 3: empirical performance models for SAT solvers

The examples above already motivate choosing an algorithm for a problem based on characteristics of the application domain. A more ambitious and refined approach is to select an algorithm on a per-instance (instead of a per-domain) basis. Although it is impossible to memorize the best algorithm for every possible instance, one might hope to use coarse features of a problem instance as a guide to which algorithm is likely to work well.

For example, Xu et al.²⁴ applied this idea to the satisfiability (SAT) problem. Their algorithm portfolio consisted of seven state-of-the-art SAT solvers with incomparable and widely varying running times across different instances. The authors identified a number of instance features, ranging from simple features such as input size and clause/variable ratio to complex features such as Knuth’s estimate of the search tree size¹⁷ and the initial rate of progress of local search.^a The next step involved building an “empirical performance model” (EPM) for each of the seven algorithms in the portfolio—a mapping from instance feature vectors to running time predictions. They then computed their EPMs using labeled training data and a suitable regression model. With the EPMs in hand, it is clear how to perform per-instance algorithm selection: given an instance, compute its features, use the EPMs to predict the running time of each algorithm in the portfolio, and run the algorithm with the smallest predicted running time. Using these ideas (and several optimizations), their “SATzilla” algorithm won numerous medals at the 2007 SAT Competition. Section 5 outlines how to extend our learning framework to reason about EPMs and feature-based algorithm selection.

3. A PAC Learning Model

This section casts the problem of choosing the best algorithm for a poorly understood application domain as one of learning the optimal algorithm with respect to an unknown instance distribution. Section 3.1 formally defines the basic model, and Section 3.2 reviews the relevant preliminaries from statistical learning theory. Sections 4–6 describe the applications of the model to learning greedy heuristics, instance feature-based algorithm selection, and hyperparameter tuning.

3.1. The basic model

Our basic model consists of the following ingredients.

A fixed computational or optimization problem II. For example, II could be the problem of computing a maximum-weight independent set of a graph (Section 4) or minimizing a convex function (Section 6).
An unknown distribution D over instances x ∈ Π.
A set A of algorithms for Π. For example, A could be a finite set of SAT solvers or an infinite family of greedy heuristics.
A performance measure cost: A× → [0, H], indicating the performance of a given algorithm on a given instance. Two common choices for cost are the running time of an algorithm and, for optimization problems, the objective function value of the solution produced by an algorithm.

The “application-specific” information is encoded by the unknown input distribution D, and the corresponding “application-specific optimal algorithm” A_D is the algorithm that minimizes or maximizes (as appropriate)

over the algorithms A ∈ A. The error of an algorithm A ∈ A for a distribution D is

In our basic model, the goal is:

Learn the application-specific optimal algorithm from data (i.e., samples from D).

More precisely, the learning algorithm is given m i.i.d. samples x₁, …, x_m ∈ Π from D and (perhaps implicitly) the corresponding performance cost(A, x_i) of each algorithm A ∈ A on each input x_i. The learning algorithm uses this information to suggest an algorithm Â∈A to use on future inputs drawn from D. We seek learning algorithms that almost always output an algorithm of A that performs almost as well as the optimal algorithm in A for D—learning algorithms that are probably approximately correct (PAC).

DEFINITION 3.1 A learning algorithm L (ε, δ)-learns the optimal algorithm in A from m samples if, for every distribution D over Π, with probability at least 1−δ over m samples x₁, …, x_m ~ D, L outputs an algorithm Â ∈ A with error at most ε.

3.2. Pseudo-dimension and uniform convergence

PAC learning an optimal algorithm, in the sense of Definition 3.1, reduces to bounding the “complexity” of the class A of algorithms. We next review the relevant definitions from statistical learning theory.

Let H denote a set of real-valued functions defined on the set X. A finite subset S = {x₁, …, x_m} of X is (pseudo-)shattered by H if there exist real-valued witnesses r₁, …, r_m such that, for each of the 2^m subsets T of S, there exists a function h ∈ H such that h(x_i) > r_i if and only if i ∈ T (for i = 1, 2, …, m). The pseudo-dimension of H is the cardinality of the largest subset shattered by H (or +∞, if arbitrarily large finite subsets are shattered by H). The pseudo-dimension is a natural extension of the VC dimension from binary-valued to real-valued functions.^b

To bound the sample complexity of accurately estimating the expectation of all functions in H, with respect to an arbitrary probability distribution D on X, it is enough to bound the pseudo-dimension of H.

THEOREM 3.2 (UNIFORM CONVERGENCE (e.g. Anthony and Bartlett²) ) Let H be a class of functions with domain X and range in [0, H], and suppose H has pseudo-dimension d_H. For every distribution D over X, every ε > 0, and every δ ∈ (0, 1], if

for a suitable constant c (independent of all other parameters), then with probability at least 1−δ over m samples x₁, …, x_m ~ D,

for every h ∈ H.

We can identify each algorithm A ∈ A with the real-valued function x ↦ COST(A, x). Regarding the class A of algorithms as a set of real-valued functions defined on Π, we can discuss its pseudo-dimension, as defined above. We need one more definition before we can apply our machinery to learn algorithms from A.

DEFINITION 3.3 (EMPIRICAL RISK MINIMIZATION (ERM) ) Fix an optimization problem Π, a performance measure COST, and a set of algorithms A. An algorithm L is an ERM algorithm if, given any finite subset S of Π, L returns an algorithm from A with the best average performance on S.

For example, for any Π, COST, and finite A, there is the trivial ERM algorithm that simply computes the average performance of each algorithm on S by brute force and returns the best one. The next corollary follows easily from Definition 3.1, Theorem 3.2, and Definition 3.3.

COROLLARY 3.4 Fix parameters ε > 0, δ ∈ (0, 1], a set of problem instances Π, and a performance measure COST. Let A be a set of algorithms that has pseudo-dimension d with respect to Π and COST. Then, any ERM algorithm (2ε, δ) learns the optimal algorithm in A from m samples, where m is defined as in (1).

The same reasoning applies to learning algorithms beyond ERM. For example, with the same assumptions as in Corollary 3.4, an algorithm from A with approximately optimal (over A) average performance on a set of samples is also approximately optimal with respect to the true input distribution. This observation is particularly useful when the ERM problem is computationally difficult but can be approximated efficiently.

Corollary 3.4 is only interesting if interesting classes of algorithms A have small pseudo-dimension. In the simple case where A is finite, as in our example of an algorithm portfolio for SAT (Section 2.3), the pseudo-dimension of A is trivially at most log₂|A|. The following sections demonstrate the much less obvious fact that natural infinite classes of algorithms also have small pseudo-dimension.

4. Learning Greedy Heuristics

The goal of this section is to bound the pseudo-dimension of many classes of greedy heuristics such as, as a special case, the family of heuristics relevant for the FCC double auction described in Section 2.1. Throughout this section, the performance measure cost is the objective function value of the solution produced by a heuristic on an instance, where we assume without loss of generality a maximization objective.

4.1. Definitions and examples

Our general definitions are motivated by greedy heuristics for NP-hard problems. For example:

Knapsack. The input is n items with values v₁, …, v_n, sizes s₁, …, s_n, and a knapsack capacity C. The goal is to compute a subset S ⊆ {1,2, …, n} with maximum total value Σ_i∈Sv_i, subject to having total size Σ_i∈SS_i at most C. Two natural greedy heuristics are to greedily pack items (subject to feasibility) in order of nonincreasing value v_i or in order of nonincreasing density v_i/s_i (or to take the better of the two).
Maximum-Weight Independent Set (MWIS). The input is an undirected graph G = (V, E) and a nonnegative weight w_v for each vertex v ∈ V. The goal is to compute an independent set—a subset of mutually nonadjacent vertices—with maximum total weight. Two natural greedy heuristics are to greedily choose vertices (subject to feasibility) in order of nonincreasing weight w_v or nonincreasing density w_v/(1 + deg(v) ). (The intuition for the denominator is that choosing v “uses up” 1 + deg(v) vertices—v and all of its now-blocked neighbors.) The latter heuristic also has an adaptive variant, where the degree deg(v) is computed in the subgraph induced by the vertices not yet blocked from consideration, rather than in the original graph.

In general, we consider object assignment problems, where the input is a set of n objects with various attributes, and the feasible solutions consist of assignments of the objects to a finite set R, subject to feasibility constraints. The attributes of an object are represented as an element ξ of an abstract set. For example, in the Knapsack problem, ξ encodes the value and size of an object; in the MWIS problem, ξ encodes the weight and (original or residual) degree of a vertex. In the Knapsack and MWIS problems, R = {0, 1}, indicating whether or not a given object is selected.

By a greedy heuristic, we mean algorithms of the following form (cf., the “priority algorithms” of Borodin et al.⁸):

While there remain unassigned objects:
1. Use a scoring rule σ (described below) to compute a score σ(ξ_i) for each unassigned object i, as a function of its current attributes ξ_i.
2. For the unassigned object i with the highest score, use an assignment rule to assign i a value from R and, if necessary, update the attributes of the other unassigned objects.^c For concreteness, assume that ties are always resolved lexicographically.

A scoring rule assigns a real number to an object as a function of its attributes. Assignment rules that do not modify objects’ attributes yield nonadaptive greedy heuristics, which use only the original attributes of each object (such as v_i or v_i/s_i in the Knapsack problem, for instance). In this case, objects’ scores can be computed in advance of the main loop of the greedy heuristic. Assignment rules that modify object attributes yield adaptive greedy heuristics, such as the adaptive MWIS heuristic described above.

In a single-parameter family of scoring rules, there is a scoring rule of the form σ(ρ, ξ) for each parameter value ρ in some interval I ⊆ R. Moreover, σ is assumed to be continuous in ρ for each fixed value of ξ. Natural examples include Knapsack scoring rules of the form and MWIS scoring rules of the form w_v/(1 + deg(v) )^ρ for ρ ∈ [0, 1] or ρ ∈ [0, ∞). A single-parameter family of scoring rules is K-crossing if, for each distinct pair of attributes ρ, ρ’, there are at most k values of ρ for which σ(ρ, ξ) = σ(ρ, ξ’). For example, all of the scoring rules mentioned above are 1-crossing rules.

For an example assignment rule, in the Knapsack and MWIS problems, the rule simply assigns i to “1” if it is feasible to do so, and to “0” otherwise. In the adaptive greedy heuristic for the MWIS problem, whenever the assignment rule assigns “1” to a vertex v, it updates the residual degrees of other unassigned vertices (two hops away) accordingly.

We call an assignment rule β-bounded if every object i is guaranteed to take on at most β distinct attribute values. For example, an assignment rule that never modifies an object’s attribute is 1-bounded. The assignment rule in the adaptive MWIS algorithm is n-bounded, where n is the number of vertices, as it only modifies the degree of a vertex (which lies in {0,1,2, …, n−1}).

Coupling a single-parameter family of K-crossing scoring rules with a fixed β-bounded assignment rule yields a (K, β)-single-parameter family of greedy heuristics. All of our running examples of greedy heuristics are (1,1)-single-parameter families, except for the adaptive MWIS heuristic, which is a (1,n)-single-parameter family.

4.2. Upper bound on pseudo-dimension

We next show that every (K, β)-single-parameter family of greedy heuristics has small pseudo-dimension. This result applies to all of the concrete examples mentioned above; additional examples are described in the full version.¹²

THEOREM 4.1 (PSEUDO-DIMENSION OF GREEDY ALGORITHMS) If A is a (K, β)-single-parameter family of greedy heuristics for an object assignment problem with n objects, then the pseudo-dimension of A is O(log(Kβn) ).

In particular, all of our running examples are classes of heuristics with pseudo-dimension O(log n).

PROOF. Recall from the definitions (Section 3.2) that we need to upper bound the size of every set that is shatterable using the greedy heuristics in A. For us, a set is a fixed set of s inputs (each with n objects) S = {x₁, …, x_s}. For a potential witness r₁, …, r_s ∈ R, every algorithm A ∈ A induces a binary labeling of each sample x_i, according to whether COST(A,x_i) is strictly more than or at most r_i. We proceed to bound from above the number of distinct binary labelings of S induced by the algorithms of A, for any potential witness.

Consider ranging over algorithms A ∈ A—equivalently, over parameter values ρ ∈ I. The trajectory of a greedy heuristic A ∈ A is uniquely determined by the outcome of the comparisons between the current scores of the unassigned objects in each iteration of the algorithm. Because the family uses a K-crossing scoring rule, for every pair i,j of distinct objects and possible attributes ξ_i, ξ_j with ξ_i ≠ ξ_j, there are at most K values of ρ for which there is a tie between the score of i (with attributes ξ_i) and that of j (with attributes ξ_j). Because σ is continuous in ρ for every fixed ξ, the relative order of the score of i (with ξ_i) and j (with ξ_j) remains the same in the open interval between two successive values of ρ at which their scores are tied. The upshot is that we can partition I into at most K + 1 intervals, such that the outcome of the comparison between i (with attributes ξ_i) and j (with attributes ξ_j) is constant on each interval. In the case that ξ_i = ξ_j, because we break ties between equal scores lexicographically, the outcome of the comparison between σ(ρ, ξ_i) and σ(ρ, ξ_j) is the same for all ρ ∈ I.

Next, the s instances of S contain a total of sn objects. Each of these objects has some initial attributes. Because the assignment rule is β-bounded, there are at most snβ object-attribute pairs (i, ξ_i) that could possibly arise in the execution of any algorithm from A on any instance of S. This implies that, ranging across all algorithms of A on all inputs in S, comparisons are only ever made between at most (snβ)² pairs of object-attribute pairs (i.e., between an object i with current attributes ξ_i and an object j with current attributes ξ_j). We call these the relevant comparisons.

For each relevant comparison, we can partition I into at most K + 1 subintervals such that the comparison outcome is constant (in ρ) in each subinterval. Intersecting the partitions of all of the at most (snβ)² relevant comparisons splits I into at most (snβ)²K + 1 subintervals such that every relevant comparison is constant in each subinterval. That is, all of the algorithms of A that correspond to the parameter values ρ in such a subinterval execute identically on every input in S. The number of binary labelings of S induced by algorithms of A is trivially at most the number of such subintervals. Our upper bound (snβ)²K + 1 on the number of subintervals exceeds 2^s, the requisite number of labelings to shatter S, only if s = O(log(Kβn) ).□

Theorem 4.1 and Corollary 3.4 imply that, if K and β are bounded above by a polynomial in n, then an ERM algorithm (ε, δ)-learns the optimal algorithm in A from only samples,^d where H is the largest objective function value of a feasible solution output by an algorithm of A on an instance of Π.^e

Not all classes of algorithms have such a small pseudo-dimension. Theorem 4.1 thus gives one quantifiable sense in which natural greedy algorithms are “simple algorithms.”

REMARK 4.2 (EXTENSIONS) Theorem 4.1 is robust, and its proof is easily modified to accommodate various extensions. For example, consider families of greedy heuristics parameterized by d real-valued parameters ρ₁, …, ρ_d. Here, an analog of Theorem 4.1 holds with the crossing number K replaced by a more complicated parameter—essentially, the number of connected components of the co-zero set of the difference of two scoring functions (with ξ, ξ’ fixed and variables ρ₁, …, ρ_d). This number can often be bounded (by a function exponential in d) in natural cases, for example, using Bézout’s theorem.

REMARK 4.3 (NONLIPSCHITZNESS) We noted in Section 3.2 that the pseudo-dimension of a finite set A is always at most log₂|A|. This suggests a simple discretization approach to learning the best algorithm from A: take a finite “ε-net” of A and learn the best algorithm in the finite net. (Indeed, Section 6 uses precisely this approach.) The issue is that without some kind of Lipschitz condition—stating that “nearby” algorithms in A have approximately the same performance on all instances—there is no reason to believe that the best algorithm in the net is almost as good as the best algorithm from all of A. Two different greedy heuristics—say, two MWIS greedy algorithms with arbitrarily close ρ-values—can have completely different executions on an instance. This lack of a Lipschitz property explains why we take care in Theorem 4.1 to bound the pseudo-dimension of the full infinite set of greedy heuristics.

4.3. Computational considerations

The proof of Theorem 4.1 also demonstrates the presence of an efficient ERM algorithm: the O( (snβ)²) relevant comparisons are easy to identify, the corresponding subintervals induced by each are easy to compute (under mild assumptions on the scoring rule), and brute-force search can be used to pick the best of the resulting O( (snβ)²K) algorithms (with one arbitrary representative from each subinterval). This algorithm runs in polynomial time as long as β and K are polynomial in n, and every algorithm of A runs in polynomial time.

For example, for the family of Knapsack scoring rules described above, implementing this ERM algorithm reduces to comparing the outputs of O(n²m²) different greedy heuristics (on each of the m sampled inputs), with m = O(log n). For the adaptive MWIS heuristics, where β = n, it is enough to compare the sample performance of O(n⁴m²) different greedy algorithms, with m = O(log n).

5. Feature-Based Algorithm Selection

The previous section studied the problem of choosing a single algorithm for use in an application domain—of using training data to make an informed commitment to a single algorithm from a class A, which is then used for all future instances. A more refined and ambitious approach is to select an algorithm based on both the previous experience and the current instance to be solved. This approach assumes, as in the scenario in Section 2.3, that it is feasible to quickly compute some features of an instance and then to select an algorithm as a function of these features.

Throughout this section, we augment the basic model of Section 3.1 with:

A set F of possible instance feature values and a map f:X→F that computes the features of a given instance.^f

For example, if X is the set of SAT instances, then f(x) might encode the clause/variable ratio of the instance x, Knuth’s estimate of the search tree size,¹⁷ etc.

We focus on the case where A is small enough that it is feasible to learn a separate performance prediction model for each algorithm A ∈ A. (The full version also discusses the case of large A.¹²) This is exactly the approach taken in the motivating example of empirical performance models (EPMs) for SAT described in Section 2.3. We then augment the basic model to include a family of performance predictors.

A set P of performance predictors, with each p ∈ P a function from F to R.

The goal is to learn, for each algorithm A ∈ A, among all permitted predictors p ∈ P, the one that minimizes some loss function. Like the performance measure COST, we take this loss function as given. For example, for the squared error loss function, for each A ∈ A, we aim to compute the function that minimizes

over p_A ∈ P. For a fixed algorithm A, this is a standard regression problem, with domain F, real-valued labels, and a distribution on F × R induced by D via x ↦ (f(x), COST(A, x) ). Bounding the sample complexity of this learning problem reduces to bounding the pseudo-dimension of P. For standard choices of P, such bounds are well known. For example, suppose the set P is the class of linear predictors, with each p ∈ P having the form p(f(x) ) = a^Tf(x) for some coefficient vector a ∈ R^d. (The EPMs used by Xu et al.²⁴ are linear predictors.) The pseudo-dimension of P is well known to be d. If all functions in P map all possible inputs to [0, H], then Corollary 3.4 implies a sample complexity bound of for (ε, δ) learning the predictor with minimum expected square error.

6. Choosing the Step Size in Gradient Descent

For our last example, we give sample complexity results for the problem of choosing the best step size in gradient descent. When gradient descent is used in practice, the step size is generally taken much larger than the upper limits suggested by theoretical guarantees and often converges in many fewer iterations than with the step size suggested by theory. This motivates the problem of learning the best step size from examples. We view this as a first step toward reasoning more generally about the problem of learning good hyperparameters for machine learning algorithms.

6.1. Gradient descent preliminaries

Recall the basic gradient descent algorithm for minimizing a function f given an initial point z₀ over Rⁿ:

Initialize z := z₀.
While ∥∇f(z)∥₂ > v:
1. z := z − ρ · ∇f(z).

We take the error tolerance v as given and focus on the more interesting parameter, the step size ρ. Bigger values of ρ have the potential to make more progress in each step but run the risk of overshooting a minimum of f.

We instantiate the basic model (Section 3.1) to study the problem of learning the best step size. There is an unknown distribution D over instances, where an instance x ∈ Π consists of a function f and an initial point z₀. Each algorithm A_ρ of A is the basic gradient descent algorithm above, with some choice ρ of a step size drawn from some fixed interval [ρ_l, ρ_u] ⊂ (0, ∞). The performance measure COST(A, x) is the number of iterations (i.e., steps) taken by the algorithm for the instance x. Because gradient descent is translation invariant, we can assume for the sake of analysis that 0 is a minimum of f, with f(0) = 0. (We consider only instances that have a global minimum.)

To obtain positive results, we impose three further assumptions on problem instances, analogous to those used in the classical convergence analysis of gradient descent for smooth and strongly convex functions.

A1. Every function f is L-smooth for a known L, meaning that f is everywhere differentiable with ∥∇f(z₁) − ∇f(z₂)∥ ≤ L ∥z₁ − z₂∥ for all z₁ and z₂. (Every norm in this section is the l₂ norm.)

A2. Every function f is m-strongly convex for a known m ≤ L, meaning that it is continuously differentiable with for all z₁ and z₂.

A3. The magnitudes of the initial points are bounded, with ∥z₀∥ ≤ Z for some known constant Z > v.

These assumptions imply a “guaranteed progress toward 0” property: There is a known constant γ ∈ (0, 1) such that ∥z − ρ∇f(z)∥ ≤ (1 − γ)∥z∥ for all ρ ∈ [ρ_l, ρ_u]. (The standard analysis of gradient descent implies that γ ≥ ρm for every ρ ≤ 2/(m + L) over this class of functions.) Our assumptions imply that gradient descent halts within iterations.

6.2. Technical lemma

The heart of the analysis is the following technical lemma, which bounds how far two gradient descent paths with different step sizes can diverge when initialized at the same point. Define g(z, ρ) :=z − ρ∇f(z) as the result of taking a single gradient descent step from the point z with the step size ρ, and let g^j(z, ρ) denotes the result of taking j gradient descent steps.

LEMMA 6.1 With assumptions (A1)−(A3), for every starting point z, iteration number j, and step sizes ρ ≤ η,

where c_ρ = max{1, Lρ − 1} ≥ 1 is a constant that is a function only of L and ρ.

The most important point is that the right-hand side of the inequality in Lemma 6.1 goes to 0 with the step size difference η − ρ. The proof of Lemma 6.1 can be found in the full version.¹²

Lemma 6.1 easily implies a Lipschitz-type condition on COST(A_ρ, x) as a function of ρ.

LEMMA 6.2 For every input x satisfying (A1)−(A3) and step sizes ρ ≤ η with

PROOF. Suppose COST(A_η, x) ≤ COST(A_ρ, x); the argument in the other case is similar. Let j = COST(A_η, x), and recall that j ≤ H. By the stopping condition of gradient descent, ∥∇g^j(z₀, η)∥ ≤ v. By the m-strong convexity of f, . By Lemma 6.1 and the assumed gap between ρ and η,

Set . By the guaranteed progress condition,

By L-smoothness, ∥∇g^j+τ(z₀, ρ)∥ ≤ v. This implies that gradient descent with step size ρ requires at most τ more iterations to converge than with step size η. □

6.3. Learning the best step size

We can now apply the discretization approach suggested by Remark 4.3. Let . Let N denotes the set of all integer multiples of K that lie in the interval [ρ_l, ρ_u] of possible step sizes. Note that |N| ≤ ρ_u/K + 1. We can now state the main result of this section:

THEOREM 6.3 (LEARNABILITY OF STEP SIZE) There is a learning algorithm that learns the optimal algorithm in A using m = Õ(H³/ε²) samples from D.^g

PROOF. Because A_N = {A_ρ : ρ ∈ N} is a finite set, its pseudo-dimension is at most log₂|N|. Let L_N denotes the ERM algorithm for this set. Corollary 3.4 implies that L_N (ε, δ)-learns the optimal algorithm in A_N using m = Õ(H² log | N | /ε²) samples. Because log₂|N|=Õ(H), m is Õ(H³/ε²).

Now, Lemma 6.2 implies that for every ρ, there is an η ∈ N such that, for every input distribution D, the difference in expected costs of A_η and A_ρ is at most . Thus, -learns the optimal algorithm in A using Õ(H³/ε²) samples. □

7. Conclusion

Empirical work on data-driven algorithm design has far outpaced theoretical analysis of the problem, and this paper takes an initial step toward redressing this imbalance. We formulated the problem as one of learning a best-in-class algorithm with respect to an unknown input distribution. Many state-of-the-art empirical approaches to the problem map naturally to our learning framework. This paper demonstrates that many well-studied classes of algorithms have small pseudodimension, and thus, it is possible to learn a near-optimal algorithm from a relatively modest number of representative benchmark instances.

Our work suggests numerous wide-open research directions worthy of further study. For example:

Which other classes of algorithms have small pseudo-dimension? There has been recent progress on this question for local search algorithms,¹² clustering and partitioning algorithms,⁴ auctions and mechanisms,^{5, 6, 21, 22} and mathematical programs.³
In the online version of data-driven algorithm design, instances to a problem arrive one by one. The goal is to choose an algorithm to use at each time step, before the instance at that time step arrives, so that the average performance of the chosen algorithms is close to that of the best fixed algorithm in hindsight. When does such an online learning algorithm exist? The full version of this paper¹² gives positive results for classes of greedy algorithms; these are generalized in Cohen-Addad and Kanade¹⁰ and further in Balcan et al.³
When is it possible to learn a near-optimal algorithm using only a polynomial amount of computation, ideally with a learning algorithm that is better than brute-force search? Alternatively, are there (conditional) lower bounds stating that brute-force search is necessary for learning? See Roughgardern and Wang²³ for progress on these questions for learning revenue-maximizing auctions.
Are there any nontrivial relationships between statistical learning measures of the complexity of an algorithm class and more traditional computational complexity measures?

Acknowledgments

This research was supported in part by NSF awards CCF-1215965 and CCF-1524062.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Data-Driven Algorithm Design

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/3394625

June 2020 Issue

Published: June 1, 2020

Vol. 63 No. 6

Pages: 87-94

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

News Apr 23 2024

Maximizing Power Grid Security

R. Colin Johnson

Security and Privacy

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

BLOG@CACM Apr 17 2024

Technical Marvels

Herbert Bruderer

Computer History

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

1. Introduction

2. Motivating Scenarios

3. A PAC Learning Model

4. Learning Greedy Heuristics

5. Feature-Based Algorithm Selection

6. Choosing the Step Size in Gradient Descent

7. Conclusion

Acknowledgments

Data-Driven Algorithm Design

DOI

June 2020 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.