Small space and streaming pattern matching with $k$ edits - INRIA - Institut National de Recherche en Informatique et en Automatique Accéder directement au contenu
Pré-Publication, Document De Travail Année : 2021

Small space and streaming pattern matching with $k$ edits

Résumé

In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n\geq m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\overset{\sim}{O}(k^5)$ space and $\overset{\sim}{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\overset{\sim}{O}$(•) hides a poly(log $n$ ) factor.) This answers a decade-old question: since the discovery of a poly($k$ log $n$)-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no poly ($k$ log $n$)-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for readonly access. In this model, we give a deterministic algorithm that achieves slightly better complexity. Our central technical contribution is a new space-efficient deterministic encoding of two strings, called the greedy encoding, which encodes a set of all alignments of cost $\leq k$ with a certain property (we call such alignments greedy). On strings of length at most $n$, the encoding occupies $\overset{\sim}{O}(k^2)$ space. We use the encoding to compress substrings of the text that are close to the pattern. In order to do so, we compute the encoding for substrings of the text and of the pattern, which requires read-only access to the latter. In order to develop the fully streaming algorithm, we further introduce a new edit distance sketch parametrised by integers $n\leq k$. For any string of length at most $n$, the sketch is of size $\overset{\sim}{O}(k^2)$ and it can be computed with an $\overset{\sim}{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\overset{\sim}{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\overset{\sim}{O}(k^8)$-size sketches of Belazzougui and Zhu [FOCS 2016] and very recent $\overset{\sim}{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
Fichier principal
Vignette du fichier
main.pdf (687.54 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-03257386 , version 1 (10-06-2021)

Licence

Paternité

Identifiants

  • HAL Id : hal-03257386 , version 1

Citer

Tomasz Kociumaka, Ely Porat, Tatiana Starikovskaya. Small space and streaming pattern matching with $k$ edits. 2021. ⟨hal-03257386⟩
20 Consultations
105 Téléchargements

Partager

Gmail Facebook X LinkedIn More