Posts in: string-algorithms

Brute force pattern matching

String Algorithms

In this series I cover basic and advanced algorithmic techniques that allow for rapid searching of patterns in text. In this post, I will start with an introduction to pattern matching, and discuss a brute-force solution to exact pattern matching.

Introduction

Aside from being a fascinating subject, pattern matching techniques have a wide range of applications in the real world. Breakthroughs in efficient pattern matching algorithms continue to have a defining role in the advancement of biotechnology, especially in relation to DNA sequencing. The human genome has around 3 * 10^9 nucleotide and 10^12 unique patterns. Even on the world’s best supercomputers the analysis of the human genome would be impossible without ingenious solutions to pattern matching problems. Pattern matching algorithms are also employed in virtually every text editor, search engine, and search bar. So let’s get started!

Brute force approach

Picture the text as a path of letters. We begin by walking down the path, letter by letter. Each time we take another step down the path we compare the letter at our feet with the first letter of our pattern. When the letter at our feet matches with the first letter of the pattern, we stop walking. We then use our eyes to compare each letter of the pattern with the steps ahead of us. If there’s a mismatch then we stop scanning with our eyes, and continue walking down the path again comparing the letter at our feet with the first letter of the pattern. Otherwise, if we’ve scanned with our eyes the length of the pattern and there’s no mismatch then we scribble on a notepad the # of steps we’ve walked down the path so far; and then we continue walking again. When we reach the end of the path with our feet, we are done, and we check our notepad to see our results.

Here’s an implementation in javascript, but if you don’t know javascript that isn’t a problem, just read it like pseudocode:

Limitations

The brute force algorithm has a runtime of O(|Text| * |Patterns|). That’s not terribly slow when we are matching a single pattern across a small portion of text. However, this algorithm grinds to a halt as the text and/or cumulative length of multiple patterns grow. The human genome is around 3 * 10^9 characters long with roughly 10^12 unique permutations. This algorithm would not work for such a large sample set. Luckily, more efficient methods have been discovered, and I will cover them in future posts!