Systems and Networking Research highlights

Resolution of the Burrows-Wheeler Transform Conjecture

Posted Jun 1 2022

Abstract
1. Introduction
2. Preliminaries
3. Combinatorial Bounds
4. From LZ77 to Run-Length BWT
Acknowledgments
References
Authors
Footnotes

Abstract

The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as the r-index⁷), central in modern bioinformatics. The compressibility of BWT is quantified by the number r of equal-letter runs in the output. Despite the practical significance of BWT, no nontrivial upper bound on r is known. By contrast, the sizes of nearly all other known compression methods have been shown to be either always within a poly-log n factor (where n is the length of the text) from z, the size of Lempel–Ziv (LZ77) parsing of the text, or much larger in the worst case (by an n^ε factor for ε > 0).

In this paper, we show that r = (z log² n) holds for every text. This result has numerous implications for text indexing and data compression; in particular: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, for example, it is possible to obtain functionality of the suffix tree in (z polylog n) space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large polylog n factor; and (3) it implies the first nontrivial relation between the number of runs in the BWT of the text and of its reverse.

In addition, we provide an (z polylog n)-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop several new data structures and techniques of independent interest. In particular, we define compressed string synchronizing sets (generalizing the recently introduced powerful technique of string synchronizing sets¹¹) and show how to efficiently construct them. Next, we propose a new variant of wavelet trees for sequences of long strings, establish a nontrivial bound on their size, and describe efficient construction algorithms. Finally, we develop new indexes that can be constructed directly from the LZ77 parsing and efficiently support pattern matching queries on text substrings.

1. Introduction

Lossless data compression aims to exploit redundancy in the input data to represent it in a small space. Despite the abundance of compression methods, nearly every existing tool falls into one of the few general frameworks, among which the three most popular are the following: Lempel–Ziv compression (where the nominal and most commonly used is the LZ77 variant²⁵), statistical compression (this includes, e.g., context mixing, prediction by partial matching (PPM), and dynamic Markov coding), and Burrows-Wheeler transform (BWT).⁴

One of the features that best differentiates these algorithms is whether they better remove the redundancy caused by skewed symbol frequencies or by repeated fragments. The idea in LZ77 (which underlies, e.g., 7-zip and gzip compressors) is to partition the input text into long substrings, each having an earlier occurrence in the text. Every substring is then encoded as a pointer to the previous occurrence using a pair of integers. This method natively handles long repeated substrings and can achieve an exponential compression ratio given sufficiently repetitive input. Statistical compressors, on the other hand, are based on representing (predicting) symbols in the input based on their frequencies. This is formally captured by the notion of the kth order empirical entropy H_k(T). For any sufficiently long text T, symbol frequencies (taking length-k contexts into account) in any power of T (the concatenation of several copies of T) do not change significantly (see Kreft and Navarro,¹⁴ Lemma 2.6). Therefore, |T^t| H_k(T^t) ≈ t · |T| H_k(T) holds for any t ≥ 1, which means that entropy is not sensitive to long repetitions, and hence it is not able to capture the same type of redundancy as the LZ77 compression.^7,12,14,24

The above analysis raises the question about the nature of compressibility of the Burrows-Wheeler transform. The compression of BWT-based compressors, such as bzip2, is quantified by the number r of equal-letter runs in the BWT. The clear picture described above no longer applies to the measure r. On the one hand, Manzini¹⁶ proved that r can be upper-bounded in terms of the kth order empirical entropy of the input string. On the other hand, already in 2008, Sirén et al.²⁴ observed that BWT achieves excellent compression (superior to statistical methods) on highly repetitive collections and provided probabilistic analysis exhibiting cases when r is small. Yet, after more than a decade, no upper bound on r in terms of z (the size of the LZ77 parsing) was discovered.

This lack of understanding is particularly frustrating due to numerous applications of BWT in the field of bioinformatics and compressed computation. One of the most successful applications of BWT is in compressed indexing, which aims to store a compressed string simultaneously supporting various queries (such as random access, pattern matching, or even suffix array queries) concerning the uncompressed version. Although classical (uncompressed) indexes, such as suffix trees and suffix arrays, have been successful in many applications, they are not suitable for storing and searching big highly repetitive collections, which are virtually impossible to search without preprocessing. A recent survey¹⁷ provides several real-life examples of such datasets. In particular, Github stores more than 20 terabytes of data, with an average of over 20 versions per project, whereas the 100000 Human Genome Project produced over 70 terabytes of DNA, which is highly compressible due to 99.9% similarity between individual human genomes. Motivated by such applications, compressed indexing witnessed a remarkable surge of interest in recent years. BWT-based indexes, such as the r-index,⁷ are among the most powerful, and their space usage is up to (polylog n) factors away from the value r. For a comprehensive overview of this field, we refer the reader to a survey by Navarro.¹⁸

In addition to text indexing, the Burrows-Wheeler transform has many applications in computational biology. In particular, BWT is the main component of the popular genome read aligners such as Bowtie, BWA, and Soap2, which paved its way to general bioinformatics textbooks.²¹ Specialized literature on algorithmic analysis of biological sequences^15,19 devotes dozens of pages to BWT applications.

Given the importance and practical significance of BWT, one of the biggest open problems that emerged in the field of lossless data compression and compressed computation asks:

What is the upper bound on the output size of the Burrows-Wheeler transform?

Except for BWT, essentially every other known compression method has been proven to produce output whose size is always within an (polylog n) factor from z, the output size of the LZ77 algorithm (e.g., grammar compression, collage systems, and macro schemes)^a or larger by a polynomial factor (n^ε for some ε > 0) in the worst case (e.g., LZ78, compressed word graphs (CDAWGs)). We refer the reader to Navarro¹⁷ for a survey of repetitiveness measures. Notably, BWT is known to never compress much better than LZ77, that is, z = (r log n),⁶ and the opposite relation r = (z poly-log n) was often conjectured to be false.⁵

1.1. Our contribution

We prove that r = (z log² n) holds for all strings, resolving the BWT conjecture in the more surprising way and answering an open question of Gagie et al.^5,6 This result alone has multiple implications for indexing and compression:

It is possible to support suffix array and suffix tree functionality in (z polylog n) space.⁷
It was shown by Kempa¹⁰ that many string processing tasks (such as BWT and LZ77 construction) can be solved in (n/log_σ n+r polylog n) time (where σ is the alphabet size). Thus, if the text is sufficiently compressible by BWT (formally, n/r = Ω (polylog n)), these tasks can be solved in optimal time (which is unlikely to be possible for general texts¹¹). Our result loosens this assumption to n/z = Ω (polylog n).
Until now, methods based on the Burrows-Wheeler transform were thought to be neither statistical nor dictionary (LZ-like) compression algorithms.^5,24 Our result challenges the notion that the BWT forms a separate compression type: Because of our bound, BWT is much closer to LZ compressors than anticipated.
Our slightly stronger bound r = (δ log² n), where δ ≤ z is a symmetric (insensitive to string reversal) repetitiveness measure recently studied by Kociumaka et al.,¹³ further shows that:
The number of BWT runs in the reverse of the text satisfies , which is the first nontrivial bound in terms of r. This result is of practical importance due to many algorithms whose efficiency depends on .^2,20,22,23

After proving r = (z log² n), we refine our approach to obtain and subsequently . We then show that the latter bound, combined with the trivial one r ≤ n, is asymptotically tight for the full spectrum of values of n and δ. As a side result, we obtain a tight upper bound (n log δ) on the sum of irreducible LCP values, improving upon the previously known bound (n log r).⁹

Next, we develop an (z log⁸ n)-time algorithm converting the LZ77 parsing into the run-length compressed BWT (the polylog n factor has not been optimized). This offers up to exponential speedup over the previously fastest space-efficient algorithms, which need Ω(n log z) time.^20,22 To achieve this, we develop new data structures and techniques of independent interest. In particular, we introduce a notion of compressed string synchronizing sets, generalizing the technique by Kempa and Kociumaka.¹¹ We then describe a new variant of wavelet trees,⁸ designed to work for sequences of long strings. In the full version of this paper, we also propose new indexes that can be built directly from the LZ77 parsing and support fast pattern matching queries on text substrings.

2. Preliminaries

A string is a finite sequence of characters from a given alphabet. The length of a string S is denoted by |S| and, for i ∈ [1 . . |S|],^b the ith character of S is denoted by S[i]. A string U is a substring of S if U = S[i] S[i + 1] … S[j – 1] for some 1 ≤ i ≤ j ≤ |S| + 1. We then say that U occurs in S at position i. The occurrence of U at position i in S is denoted by S[i . . j) or S[i . . j – 1]. Such an occurrence is a fragment of S and can be represented by (a pointer to) S and the two positions i, j. Two fragments (perhaps of different strings) match if they are occurrences of the same substring. A fragment S[i . . j] is a prefix if i = 1 and a suffix if j = |S|.

We use to denote the reverse of S, that is, S[|S|] … S[2] S[1]. We denote the concatenation of two strings U and V, that is, U[1]U[2] … U[|U|]V[1]V[2] … V[|V|], by UV or U · V. Furthermore, denotes the concatenation of k ≥ 0 copies of S; note that S⁰ = ε is the empty string.

An integer p ∈ [1 . . |S|] is a period of a string S if S[i] = S[i + p] holds for every i ∈ [1 . . |S| – p]. The shortest period of S is denoted as per(S). A string S is called periodic if . The following fact is a consequence of the classic periodicity lemma (which we do not use directly).

FACT 2.1 (see Amir et al.,¹ FACT 1). Any two distinct periodic strings of the same length differ on at least two positions.

Throughout the paper, we consider a string (called the text) T of length n ≥ 1 over an ordered alphabet Σ of size σ. We assume that T[n] = $, where $ is the smallest symbol in Σ, and that $ does not occur anywhere else in T. We use ⪯ to denote the order on Σ, extended to the lexicographic order on Σ^* (the set of strings over Σ) so that U, V ∈ Σ^* satisfy U ⪯ V if and only if either U is a prefix of V, or U[1 . . i) = V[1 . . i) and U[i] ≺ V [i] holds for some i ∈ [1 . . min(|U|, |V|)].

The suffix array SA[1 . . n] of T is a permutation of [1 . . n] such that T[SA[1] . . n] ≺ T[SA[2] . . n] ≺ … ≺ T[SA[n] . . n]. The closely related Burrows-Wheeler transform BWT[1 . . n] of T is defined by BWT[i] = T[SA[i] – 1] if SA[i] > 1 and BWT[i] = T[n] otherwise. In the context of BWT, it is convenient to consider an infinite string T^∞ defined so that T^∞[i] = T[1 + (i – 1) mod n] for i ∈ ℤ in particular, T^∞ [1 . . n] = T [1 . . n]. We then have BWT[i] = T^∞ [SA[i] – 1] for i ∈ [1 . . n]. Moreover, the assumption T[n] = $ implies T^∞ [SA[1] . .] ≺ … ≺ T^∞[SA[n] . .].

For any string , where c_i ∈ Σ and ℓ_i ∈ ℤ_>0 for i ∈ [1 . . h], and c_i ≠ c_i+1 for i ∈ [1 . . h), we define the run-length encoding of S as a sequence RL(S) = ((c₁, λ₁), …, (c_h, λ_h)), where λ_i = ℓ₁ + … + ℓ_i for i ∈ [1 . . h]. Throughout, we let r = |RL(BWT)| denote the number of runs in the BWT of T. For example, for the text T = bbabaababababaababa$ in Table 1, we have BWT = a¹b⁶a¹b²a⁶b¹a²$¹, and hence r = 8.

Table 1. The lexicographically sorted suffixes of a string T = bbabaa-babababaababa$ along with the BWT, SA, and LCP tables.

By lcp(U, V) we denote the length of the longest common prefix of strings U and V. For j₁, j₂ ∈ [1 . . n], we let LCE(j₁, j₂) = lcp(T[j₁ . .], T[j₂ . .]). The LCP array LCP[1 . . n] is defined so that LCP[i] = LCE(SA[i], SA[i – 1]) for i ∈ [2 . . n] and LCP[1] = 0. We say that the value LCP[i] is reducible if BWT[i] = BWT[i – 1] and irreducible otherwise (including when i = 1). There are exactly r irreducible LCP values.

We say that a nonempty fragment T[i . . i + ℓ) is a previous factor if it has an earlier occurrence in T, that is, LCE (i, i’) ≥ ℓ holds for some i’ ∈ [1 . . i). The LZ77 parsing of T is a factorization T = F₁ ··· F_f into nonempty phrases such that the jth phrase F_j is the longest previous factor starting at position 1 + |F₁ ··· F_j-1|; if no previous factor starts there, then F_j consists of a single character. In the underlying LZ77 representation, every phrase F_j = T [i . . i+ℓ) that is a previous factor is encoded as (i’, ℓ), where i’ ∈ [1 . . i) satisfies LCE(i, i’) ≥ ℓ (and is chosen arbitrarily in case of multiple possibilities); if F_j = T[i] is not a previous factor, then it is encoded as (T[i], 0). We denote the number of phrases in the LZ77 parsing by z. For example, the text bbabaababababaababa$ of Table 1 has LZ77 factorization b · b · a · ba · aba · bababa · ababa · $ with z = 8 phrases, and its LZ77 representation is (b, 0), (1, 1), (a, 0), (2, 2), (3, 3), (7, 6), (10, 5), ($, 0).

The trie of a finite set S ⊆ Σ^* is an edge-labeled rooted tree with a node v_X for every string X that is a prefix of a string S ∈ S. The trie is rooted at v_ε and the parent of each node v_X with X ≠ ε is v_X[1..|X|). In this case, the edge from v_X[1..|X|) to v_X is labeled with X[|X|]. See Figure 1 for the trie of {$aba, aaba, abaa, abab, abb$, b$ab, baab, baba, babb, bb$a}.

Figure 1. The trie of the reversed length-4 substrings of T^∞ for T = bbabaababababaababa$ of Table 1. Light edges are thin and dotted.

3. Combinatorial Bounds

3.1. Basic upper bound

To illustrate the main idea of our proof technique, we first prove the upper bound in its simplest form r = (z log² n). The following lemma stands at the heart of our proof.

LEMMA 3.1. For every ℓ ∈ [1 . . n], the number of irreducible LCP values in [ℓ . . 2ℓ) is (z log n).

PROOF. Denote S_m = {S ∈ Σ^m: S is a substring of T^∞} for m ≥ 1. Observe that |S_m| ≤ mz because every length-m substring of T^∞ has an occurrence crossing or beginning at a phrase boundary of the LZ77 parsing of T. This includes substrings overlapping two copies of T, which cross the boundary between the last and the first phrase.

The idea of the proof is as follows: With each irreducible value LCP[i] ∈ [ℓ . . 2ℓ), we associate a cost of ℓ units, which are charged to individual characters of strings in S_3ℓ. We then show that each of the strings in S_3ℓ is charged at most 2 log n times. The number of irreducible LCP values in [ℓ . . 2ℓ) equals times the total cost, which is at most

To devise the announced assignment of cost to characters of strings in S_3ℓ, consider the trie of all reversed strings in S_ℓ (see Figure 1 for example).

Let LCP[i] ∈ [ℓ . . 2ℓ) be an irreducible LCP value; note that i > 1 due to LCP[i] ≥ ℓ > 0. Let j₀ = SA[i – 1] and j₁ = SA[i] so that LCP[i] = LCE(j₀, j₁). Because LCP[i] is irreducible, we have T^∞ [j₀ – 1] = BWT[i – 1] ≠ BWT[i] = T^∞ [j₁ – 1]. For k ∈[1 .. ℓ], the kth unit of the cost associated with LCP[i] is charged to the kth character (T^∞[j_t – 1]) of the string (T^∞[j_t – k .. j_t – k + 3ℓ)∈S_3ℓ, where t ∈ {0, 1} is such that the subtree of rooted at contains fewer leaves than the subtree rooted at (we choose t = 0 in case of ties); see Figure 2 for an illustration.

Figure 2. Proof of Lemma 3.1 on T from Table 1 for ℓ = 4, i = 11, and k = 2. Strings T^∞ [j_t – k . . j_t – k + 3ℓ) are highlighted. The subtree of rooted in v_baa is smaller than in v_bab (see Figure 1), and hence we charge the second symbol of T^∞ [j₁ – k . . j₁ – k + 3ℓ), that is, t = 1.

Note that at most log n characters of each S ∈ S_3ℓ can be charged during the above procedure: Whenever S[k], with k ∈ [1 .. ℓ], is charged, the subtree of rooted at has at least twice as many leaves as the subtree rooted at , and this can happen for at most log|S_ℓ| ≤ log n nodes on the path from to the root of .

It remains to show that, for every S ∈ S_3ℓ, a single position S[k], with k ∈[1 .. ℓ], can be charged at most twice. For this, observe that the characters charged for a single irreducible value LCP[i] are at different positions (of strings in S_3ℓ). Hence, to analyze the total charge assigned to S[k], we only need to bound the number of possible candidate positions i. Let [b . . e] be the set of indices i’ such that T^∞ [SA[i’]..] starts with S[k + 1 .. 3ℓ]. In the above procedure, if a character S[k] is charged a unit of cost corresponding to LCP[i], then S[k + 1 .. 3ℓ] is a prefix of either T^∞[SA[i−1]..]=T^∞[j₀ ..] or T^∞[SA[i]..]=T^∞[j₁ ..]. Hence, {i−1,i}∩[b..e]≠∅. At the same time, LCE(SA[i –1], SA[i]) < 2ℓ, and all strings T^∞[SA[i’]..] with i’ ∈[b .. e] share a common prefix S[k + 1 .. 3ℓ] of length 3ℓ – k ≥ 2ℓ. Thus, i = b or i = e + 1. ⅈ

THEOREM 3.2. All length-n strings satisfy r = (z log² n).

PROOF. Recall that r is the total number of irreducible LCP values. Thus, the claim follows by applying Lemma 3.1 for ℓ_i = 2ⁱ, with i ∈[0 .. ⌊log n⌋] , and observing that the number of LCP values 0 is exactly σ≤z.

3.2. Tighter upper bound

To obtain a tighter bound, we refine the ideas from Section 3.1, starting with a counterpart of Lemma 3.1.

LEMMA 3.3. For every ℓ ∈[1 .. n], the number of irreducible LCP values in [ℓ .. 2ℓ) is (z log z).

PROOF. The proof follows closely that of Lemma 3.1. However, with each irreducible value LCP[i] ∈ [ℓ .. 2ℓ), we associate cost instead of ℓ. We then show that each of the strings in S_3ℓ is charged at most 2 · (3 + log z) times (rather than 2 log n times). Then, the number of irreducible LCP values in the range [ℓ .. 2ℓ) does not exceed times the total cost, which is bounded by

Recall the trie of all reversed strings in S_ℓ. For a node v of , by size(v), we denote the number of leaves in the subtree of rooted in v. An edge connecting v ≠root( ) to its parent in is called light if v has a sibling v’ satisfying size(v’) ≥ size(v) (see Figure 1). In the proof of Lemma 3.1, we observed that the characters S[k] of S ∈ S_3ℓ that can be charged correspond to light edges on the path from the root of to the leaf : Whenever S[k], with k ∈[1 .. ℓ], is charged, the edge connecting to its parent is light. We then noted that there are at most log | S_ℓ | ≤ log n light edges on each root-to-leaf path in . Here, we charge the characters of strings in S_3ℓ in the same way as in Lemma 3.1, but only for units . This implies that only characters S[k] of S ∈ S_3ℓ with are charged. It remains to show that any root-to-leaf path in contains at most 3 + log z light edges between a node at depth at least and its child.

Consider a light edge from a node v to its parent u at depth at least . Let v’ be a sibling of v satisfying size(v’) ≥ size(v), and let S_v, S_v’ be the labels of the paths from the root to v and v’, respectively. These labels differ on the last position only so, by Fact 2.1, they cannot be both periodic. Let ṽ ∈ {v, v’} be such that S_ṽ is not periodic, and let .

Consider the set S of length-ℓ strings corresponding to the leaves in the subtree of rooted at ṽ (i.e., the labels of the root-to-leaf paths passing through ṽ). Define and note that , because is the trie of reversed strings from S_ℓ. Let denote the ending positions of the leftmost occurrences in T^∞[1 . .) of strings in . By definition, we have an occurrence of ending in T^∞ at every position e_i with . Now, implies that for every (otherwise, the two close occurrences of would yield . Consequently, at least length-ℓ substrings of T^∞[1 . .) have disjoint leftmost occurrences. Because each leftmost occurrence crosses or begins at a phrase boundary of the LZ77 parsing of T, we conclude that , and therefore .

The reasoning above shows that once a root-to-leaf path encounters a light edge connecting a node u at depth at least to its child v, we have size(v) ≤ 4z. The number of the remaining light edges on the path is at most log(size(v)) ≤ 2 + log z by the standard bound applied to the subtree of rooted at v.

THEOREM 3.4. Every string T of length n satisfies .

PROOF. To obtain tighter bounds on the number of irreducible LCP values in [ℓ . . 2ℓ), we consider three cases:

ℓ ≤ log z. We repeat the proof of Lemma 3.1, except that we observe that the number of light edges on each root-to-leaf path in is bounded by ℓ. Thus, the number of irreducible LCP values in [ℓ . . 2ℓ) is (z ℓ).

. We use the bound (z log z) of Lemma 3.3. . We repeat the proof of Lemma 3.3, except that we observe that |S_3ℓ| ≤ n. Thus, the number of irreducible LCP values in .

The above upper bounds, applied for every ℓ = 2ⁱ with i ∈ [0 . . ⌊log n⌋], yield

3.3. Tight bounds in terms of δ

Let denote the (cyclic) substring complexity of T.¹³ Note that letting is equivalent because |S_m| ≤ n holds for m ≥ 1, which implies for m ≥ n. We start by noting that δ ≤ z because |S_m| ≤ mz holds for every m ≥ 1, as observed in the proof of Lemma 3.1. Furthermore, |S_m| ≤ mδ holds by definition of δ, so δ can replace z in the proof of Lemma 3.1.

To adapt the proof Lemma 3.3, we need to generalize the observation that at most z substrings from S_ℓ may have disjoint leftmost occurrences in T^∞[1 . .). This observation is easy because the LZ77 parsing naturally yields a set of z positions (phrase boundaries) in T. The substring complexity δ does not provide such structure, but as the lemma here implies, we can replace z by 3δ in the aforementioned observation. The proof of Lemma 3.5 is a straightforward modification of the argument used in Kociumaka et al.¹³ (Lemma 6). For completeness, we write down the full reasoning, with technical details tailored to our notation (e.g., S_ℓ defined in terms of T^∞ rather than T).

LEMMA 3.5 (based on Kociumaka et al.,¹³ LEMMA 6). For any integer ℓ ≥ 1, the total number of positions in T^∞[1 . .) covered by the leftmost occurrences of strings from S_ℓ is at most 3δℓ.

PROOF. Let C denote the set of positions in T^∞[1 . .) covered by the leftmost occurrences of strings from S_ℓ, and let C’ = C\[1 .. ℓ). For any i ∈ C’ denote S =T^∞[i−ℓ+1..i+ℓ], and let S ={S_i:i∈C‘}⊆S_2ℓ. We will show that |S| = |C’|. Let i ∈ C’. First, observe that, due to i ≥ ℓ, the fragment S_i is entirely contained in T^∞[1 . .). Furthermore, by definition, S_i. contains the leftmost occurrence of some S ∈ S_ℓ. Thus, this occurrence of S_i in T^∞[1 . .) must also be the leftmost one in T^∞[1 . .). Consequently, the substrings S_i for i ∈ C’ are distinct.

We have thus shown that |C’| = |S|≤|S_2ℓ. Because |S_2ℓ| ≤ 2δℓ holds by definition of δ, we obtain |C|<|C‘|+ ℓ ≤ |S_2ℓ|+ℓ≤(2δ+1)ℓ≤3δℓ. ⅈ

LEMMA 3.6. For every ℓ ∈[1 .. n], the number of irreducible LCP values in [ℓ .. 2ℓ) is (δ log δ).

PROOF. Compared to the proof of Lemma 3.3, we use the bound |S_3ℓ| ≤ 3ℓδ instead of |S_3ℓ| ≤ 3ℓz. The only other modification required is that, for every light edge connecting a node u of at depth at least to its child v, we need to prove size(v) = (δ).

Let be defined as in the proof of Lemma 3.3. Recall that there are at least strings in S_ℓ with disjoint leftmost occurrences in T^∞[1 . .). By Lemma 3.5, there are at most 3δ such substrings. Thus, . ⅈ

By replacing the thresholds log z and with log δ and , respectively, in the proof of Theorem 3.4, we immediately obtain a bound in terms of δ.

THEOREM 3.7. Every string T of length n satisfies .

The following construction, analyzed in the full version of this paper, provides tight examples with substring complexity δ covering the whole spectrum between (1) and . For ℓ ≥ 1 and x ∈[0 .. 2^ℓ), we set bin_ℓ (x) ∈ {0, 1}^ℓ to be the binary representation of x.

LEMMA 3.8. For all integers ℓ ≥ 2 and K ≥ 1, the length n, the substring complexity δ, and the number of runs r in the BWT of a string T_l,K ∈ {$, 0, 1, 2}⁺, defined with

satisfy n = Θ(2^K+ℓℓ, δ=Θ(2^ℓ), and r = Ω(2^ℓ ℓK).

If , on the other hand, then a trivial upper bound r = (n) is tighter than that of Theorem 3.7. In this case, the following construction, also analyzed in the full version of this paper, provides tight examples.

LEMMA 3.9. For all integers ℓ ≥ 2 and Δ∈Ω(2^ℓ)∩ (2^ℓℓ), the length n, the substring complexity δ, and the number of runs r in the BWT of a string , defined with

satisfy n =Θ(2^ℓℓ), δ=Θ(Δ), and r = Ω(2^ℓℓ).

Overall, combining Lemmas 3.8 and 3.9, we obtain the following tight lower bound for δ ranging from (1) to Ω(n).

THEOREM 3.10. For every N ≥ 1 and Δ ∈ [1 .. N], there exists a string T whose length n, substring complexity δ, and the number of runs r in the BWT satisfy n=Θ(N), δ=Θ(Δ), and .

3.4. Further combinatorial bounds

By combining Theorem 3.7 with known properties of the substring complexity δ, we obtain the first bound relating the number of BWT runs in the string and its reverse. No such bounds (even polynomial in r log n) were known before.

COROLLARY 3.11. If r and denote the number of runs in the BWT of a length-n text and its reverse, respectively, then .

PROOF. Because the value of δ is the same for the text and its reverse, Theorem 3.7 yields . Combining the results of Kempa and Prezza¹² (Theorem 3.9) and Kociumaka et al.¹³ (Lemma 2) gives δ ≤ r. Consequently, we obtain .

In the full version of this paper, we also characterize the sum of all irreducible LCP values:

THEOREM 3.12. For every string of length n, the sum r_Σ of all irreducible LCP values satisfies r_Σ = (n log δ).

Due to δ ≤ r, the presented upper bound is always (asymptotically) at least as strong as the previous bound r_σ ≤ n log r by Kärkkäinen et al.⁹ Furthermore, it can be strictly stronger because log δ = o(log r) is possible when δ = log^o(1) n. In the full version of this paper, we show that the strings of Lemmas 3.8 and 3.9 yield a matching lower bound:

THEOREM 3.13. For every N ≥ 1 and Δ ∈ [1 .. N], there exists a string T whose length n, substring complexity δ, and sum r_Σ of irreducible LCP values satisfy n=Θ(N), δ=Θ(Δ), and r_Σ = Θ(n log δ).

4. From LZ77 to Run-Length BWT

In this section, we outline our algorithm that, given the LZ77 parsing of a text T ∈ Σⁿ, computes its run-length compressed BWT in (z polylog n) time. We start with an overview that explains the key concepts. Next, we present two new data structures utilized in our algorithm: the compressed string synchronizing set (Section 4.1) and the compressed wavelet tree (Section 4.2). The sketch of the final algorithm is then presented in Section 4.3.

For any substring Y of T^∞, we define

We say that a substring Y of T^∞ is left-maximal if there exist distinct symbols a, b ∈ Σ such that the strings aY and bY are also substrings of T^∞. The following definition, assuming Σ∩ℕ=∅, plays a key role in our construction.

DEFINITION 4.1 (BWT MODULO ℓ). Let T ∈ Σⁿ, ℓ ≥ 1 be an integer, and Y_i= T^∞[SA[i]..SA[i]+ℓ) for i ∈[1..n]. We define the string BWT_ℓ ∈(Σ∪ℕ)ⁿ, called the BWT modulo ℓ (of T), as follows. For i ∈ [1 .. n],

The algorithm runs in k = ⌈log n⌉ rounds. For q ∈ [0 .. k), the input to the qth round is RL(BWT_2^q) and the output is RL(BWT_2^q+1). At the end of the algorithm, we have RL(BWT_2^k) = RL(BWT), because X ∈ S_2^k is never left-maximal for 2^k ≥ n.

Informally, in round q, we are given a (run-length compressed) subsequence of BWT that can be determined based on sorting the suffixes only up to their prefixes of length 2^q. Notice that BWT_ℓ[b .. e] ∈ Σ^* implies BWT_ℓ+1 [b .. e] ∈ Σ^* (because a prefix of a left-maximal substring is left-maximal). Hence, these subsequences need not be modified until the end of the algorithm (except possibly merging their runs with adjacent runs). For the remaining positions, BWT_ℓ identifies the (leftmost occurrences of) substrings to be inspected in the qth round with the aim of replacing their corresponding runs in BWT_ℓ with previously unknown BWT symbols (as defined in BWT_2ℓ).

We call a block BWT[b . . e] uniform if all symbols in BWT[b . . e] are equal and nonuniform otherwise. The following lemma ensures the feasibility of the above construction.

LEMMA 4.2. For any ℓ ≥ 1, it holds |RL(BWT_ℓ)| < 2r.

PROOF. Denote RL(BWT_ℓ) = ((c₁, λ₁, …,(c_h,λ_h)), letting λ₀ = 0. By definition of BWT_ℓ, if c_i ∈ ℕ, then the block BWT(λ_i-1 .. λ_i] is nonuniform. Thus, there are at most r – 1 runs of symbols from ℕ in BWT_ℓ.

On the other hand, c_i ∈ Σ and c_j ∈ Σ, with i < j, cannot both belong to the same run in BWT. If this was true, then either c_i+1 ∈ Σ (which implies c_i+1 = c_i, contradicting the definition of RL(BWT_ℓ) ), or c_i+1 ∈ ℕ, which is impossible because BWT(λ_i .. λ_i+1] would then be nonuniform. Thus, there are at most r runs of symbols from Σ in BWT_ℓ. ⅈ

4.1. Compressed string synchronizing sets

Our algorithm builds on the notion of string synchronizing sets, recently introduced by Kempa and Kociumaka.¹¹ Synchronizing sets are one of the most powerful techniques for sampling suffixes. In the uncompressed setting, they are the key in obtaining time-optimal solutions to multiple problems, such as a state-of-the-art BWT construction algorithm for texts over small alphabets.¹¹ In this section, we introduce a notion of compressed string synchronizing sets. Our construction is the first implementation of synchronizing sets in the compressed setting and thus of independent interest.

We start with the definition of basic synchronizing sets.

DEFINITION 4.3 (τ-SYNCHRONIZING SET¹¹). Let T ∈ Σⁿ be a string and let be a parameter. A set S⊆[1..n−2τ+1] is called a τ-synchronizing set of T if it satisfies the following consistency and density conditions:

If T[i..i+2τ)=T[j..j+2τ), then i ∈ S holds if and only if j ∈ S (for i, j ∈[1 .. n – 2τ +1]),
S∩[i..i+τ)=∅ if and only if (for i ∈[1 .. n – 3τ + 2]).

In most applications, we want to minimize |S|. However, the density condition imposes a lower bound for strings of length n ≥ 3τ – 1 that do not contain substrings of length 3τ – 1 that are periodic with period at most . Therefore, we cannot hope to achieve an upper bound improving in the worst case upon the following one.

THEOREM 4.4 (Kempa and Kociumaka¹¹). For any string T of length n and parameter , there exists a τ-synchronizing set S of size . Moreover, such S can be (deterministically) constructed in (n) time.

Storing S for compressible strings presents the following challenge: Although is sometimes possible, it is not implied by z ≪ n. For example, as noted by Bille et al.,³ Thue–Morse strings satisfy z = (log n) yet they contain no periodic substring X with , and thus their synchronizing sets satisfy . This prevents us from keeping the plain representation of S when .

We therefore exploit a different property of compressible strings: That all their substrings Y satisfy lpos(Y) , where e_j is the last position of the jth phrase in the LZ77 parsing of T. By consistency of S, it suffices to store . To decide whether i ∈ S, we then locate i’ = lpos(T[i .. i + 2τ) and check if . This motivates the following (more general) definition.

DEFINITION 4.5 (COMPRESSED τ-SYNCHRONIZING SET). Let S be a τ-synchronizing set of string T[1 . . n] for some , and, for every j ∈[1 .. z], let e_j denote the last position of the jth phrase in the LZ77 parsing of T. For k ∈ ℤ_≥2, we define the compressed representation of S as

In the full version of this paper, we prove that every text T has a synchronizing set S with a small compressed representation, and we show how to efficiently compute such S from the LZ77 parsing of T.

THEOREM 4.6. There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a string T ∈ Σⁿ, a parameter , and a constant k ∈ ℤ_≥2, constructs in (z log⁵ n) time a compressed representation comp_k(S) of a τ-synchronizing set S of T satisfying |comp_k(S)|≤ 72kz.

4.2. Compressed wavelet trees

Along with string synchronizing sets, wavelet trees,⁸ originally invented for text indexing, play a central role in our algorithm. Unlike virtually all prior applications of wavelet trees, ours uses a sequence of very long strings (up to Θ(n) symbols). This approach is feasible because all strings are substrings of the text, which is stored in the LZ77 representation. In this section, we describe this novel variant of wavelet trees, dubbed here compressed wavelet trees. In particular, we prove an upper bound on their size, describe an efficient construction from the LZ77-compressed text, and show how to augment them to support some fundamental queries.

Let Σ be an alphabet of size σ ≥ 1. Consider a string W[1 . . m] over the alphabet Σ^ℓ so that W is a sequence of m ≥ 0 strings of length ℓ ≥ 0 over the alphabet Σ. The wavelet tree of W is the trie of Σ with each node v_X associated to a string B_X ∈ Σ^* defined here based on W. We let denote the node-set of .

With each node v_X ∈ V( ), we associate an increasing sequence I_X [1 . . h] of primary indices such that

Based on I_X, we define B_X ∈ Σ^* so that, for i ∈ [1 . . h],

if |X| < ℓ, and B_X = ε if |X| = ℓ. In other words, B_X is a string containing the symbol at position |X| + 1 for each string of W that is prefixed by X. Importantly, the symbols in B_X occur in the same order as these strings occur in W.

As typically done in the applications of wavelet trees, we only explicitly store the strings B_X. The values of primary indices I_X are retrieved using additional data structures, based on the following observation.

LEMMA 4.7 (Grossi et al.⁸). Let X ∈ Σ^d, where d ∈ [0 . . ℓ). For every c ∈ Σ and j ∈ [1 . . |I_Xc|], we have I_Xc [j] = I_X [i], where B_X[i] is the jth occurrence of c in B_X.

We define the compressed wavelet tree _c of W as the wavelet tree of W in which all strings B_X have been run-length compressed and, except , all nodes v_X satisfying |RL(B_X)| ≤ 1 have been removed (the unary paths are collapsed into single edges with their labels concatenated). The shape and edge labels of the resulting tree are identical to the compacted trie of {W[1], …, W[m]}.

We store the edge labels of _c as pointers to substrings in W. We assume that each edge of _c and each element of RL(B_X) can be encoded in (1) space. Because |RL(B_Y)| ≥ 1 holds for every internal node v_Y ∈ V( _c) and, unless |V( _c)| = 1, each leaf v_Z in _c can be injectively mapped to an element of RL(B_Z’) for the parent v_Z’ of v_Z, the tree _c uses space.

THEOREM 4.8. If _c is the compressed wavelet tree of a sequence W of length-l strings, then .

PROOF. Let m = |W|, k = |RL(W)| ≤ m, and k’ = |{W[i] : i ∈ [1 . . m]}|. Due to |V(T_c)|≤ 1+2k‘= (1 + k), we can focus on nodes v_x∈V(T_c) such that |RL(B_X)| ≥ 2.

The proof resembles that of Lemma 3.1. With each X ∈ Σ^* such that |RL(B_X)| ≥ 2, we associate |RL(B_X)| – 1 units of cost and charge them to individual elements of W. We then show that each run in RL(W) is in total charged at most 2 log k’ units of cost. Consequently,

Consider X ∈ Σ^d with |RL(B_X)| ≥ 2; note that d < ℓ. Let RL(B_X) = ((c₁, …, (c_h, λ_h)). Observe that if we let p₀ = I_X[λ_i] and p₁ = I_X[λ_i + 1] for some i ∈ [1 . . h), then W[p₀][d + 1] = c_i ≠ c_i + ₁ = W[p₁][d + 1]. Moreover, B_X [λ_i] ≠ B_X [λ_i + 1] implies W[p₀ + 1] ≠ W[p₀] and W[p₁ – 1] ≠ W[p₁]. The ith unit of cost is charged to W[p_t], where t ∈ {0, 1} is chosen depending on the sizes of subtrees of _c rooted at the children of v_X, so that the subtree containing v_{w[p_t]} has at most as many leaves as the subtree containing v_{w[p_1−t]}.

Now, consider a run W[b . . b’] = Y^δ in RL(W). For a single depth d, the run could be charged at most twice, with at most one unit assigned to W[b] due to p₁ = b and at most one unit assigned to W[b’] due to p₀ = b’, both for X = Y[1 . . d]. Moreover, note that the subtree size on the path from v_Y to the root v_ε of _c doubles for every depth d for which the run was charged. Thus, the total charge of the run is at most 2 log k’ units. ⅈ

The key operation that we want to support on _c is to compute the value I_X [q] given q ∈ [1 . . |I_X|] and a pointer to v_x ∈ V(T_c). Let W[1 . . m] be a sequence of substrings of T^∞ of common length ℓ. Notice that if we have access to T, then the sequence W can be encoded in (1 + |RL(W)|) space. Namely, it suffices to store the length ℓ and the sequence RL((lpos(W[i])) _{i ∈ [1 .. m]}). In the full version of this paper, we show that, given such encoding of W and the LZ77 parsing of T, the compressed wavelet tree of W supporting fast primary index queries can be constructed efficiently.

THEOREM 4.9. Given the LZ77 representation of T[1 . . n] and a sequence W[1 .. m] of m ≤ n same-length substrings of T^∞, represented as RL((lpos(W[i]))_{i ∈[1 .. m]}), the compressed wavelet tree of W, supporting primary index queries in time (log⁴ n), can be constructed in time ((z+|RL(W)|) log² n).

4.3. The algorithm

We are now ready to sketch the procedure constructing the sequences RL(BWT_2^q), where q ∈ [0 . . ⌈log n⌉].

Let q ∈ [4 . . ⌈log n⌉) (values q ∈ [0 . . 3] are handled separately) and let ℓ = 2^q. We show how to compute RL(BWT_2ℓ) given RL(BWT_ℓ) and the LZ77 parsing of T. The main idea of the algorithm is as follows.

Let S be a τ-synchronizing set of T, where . As noted earlier, BWT_ℓ [j] ∈ Σ implies BWT_2ℓ [j] ∈ Σ. Let BWT_ℓ [y . . y’] ∈ ℕ⁺ be a run in BWT_ℓ. By definition of BWT_ℓ, the suffixes of T^∞ starting at positions i ∈ SA[y . . y’] share a common prefix of length ℓ ≥ 3τ. Thus, assuming that S∩[i..i+τ) = ⌽ holds for all i ∈ SA[y . . y’] (the periodic case is handled separately), by the consistency of S, all text positions i ∈ SA[y . . y’] share a common offset Δ with i + Δ = min (S∩[i..i+τ)). This lets us deduce the order of length-21 prefixes T[i . . i+2ℓ) based on the order of strings T[i + Δ . . i + 2ℓ) starting at synchronizing positions. For this, from the sorted list of fragments T[s . . s + 2ℓ) across s ∈ S, we extract, using a wavelet tree, those preceded by T[i . . i + Δ) (a prefix common to T^∞[i . .) for i ∈ SA[y . . y’]). Importantly, the synchronizing positions s sharing T[s – τ . . s + 2ℓ) can be processed together; hence, by Theorem 4.6, applied for k = 7 so that 2ℓ ≤ kτ, it suffices to use (z) distinct substrings.

In the full version of the paper, we provide the details of the above construction and obtain the following result.

THEOREM 4.10. There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a text T of length n, computes its run-length compressed Burrows-Wheeler transform in time ((r + z) log⁶ n) = (z log⁸ n).

Acknowledgments

This work was performed when the first author was a post-doctoral scholar at the University of California, Berkeley (supported by NSF grants no. 1652303 and 1934846, and an Alfred P. Sloan Fellowship grant), whereas the second author was a postdoctoral scholar at Bar-Ilan University, Israel (supported by ISF grants no. 1278/16 and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme, agreement no. 683064).

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Resolution of the Burrows-Wheeler Transform Conjecture

View in the ACM Digital Library

This work is licensed under a https://creativecommons.org/licenses/by/4.0/

DOI

10.1145/3531445

June 2022 Issue

Published: June 1, 2022

Vol. 65 No. 6

Pages: 91-98

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

BLOG@CACM Apr 17 2024

Technical Marvels

Herbert Bruderer

Computer History

BLOG@CACM Apr 16 2024

The Value of Data in Embodied Artificial Intelligence

Shaoshan Liu

Artificial Intelligence and Machine Learning

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

Abstract

1. Introduction

2. Preliminaries

3. Combinatorial Bounds

4. From LZ77 to Run-Length BWT

Acknowledgments

Resolution of the Burrows-Wheeler Transform Conjecture

DOI

June 2022 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.