Brute force pattern matching

String Algorithms

In this series I cover basic and advanced algorithmic techniques that allow for rapid searching of patterns in text. In this post, I will start with an introduction to pattern matching, and discuss a brute-force solution to exact pattern matching.

Introduction

Aside from being a fascinating subject, pattern matching techniques have a wide range of applications in the real world. Breakthroughs in efficient pattern matching algorithms continue to have a defining role in the advancement of biotechnology, especially in relation to DNA sequencing. The human genome has around 3 * 10^9 nucleotide and 10^12 unique patterns. Even on the world’s best supercomputers the analysis of the human genome would be impossible without ingenious solutions to pattern matching problems. Pattern matching algorithms are also employed in virtually every text editor, search engine, and search bar. So let’s get started!

Brute force approach

Picture the text as a path of letters. We begin by walking down the path, letter by letter. Each time we take another step down the path we compare the letter at our feet with the first letter of our pattern. When the letter at our feet matches with the first letter of the pattern, we stop walking. We then use our eyes to compare each letter of the pattern with the steps ahead of us. If there’s a mismatch then we stop scanning with our eyes, and continue walking down the path again comparing the letter at our feet with the first letter of the pattern. Otherwise, if we’ve scanned with our eyes the length of the pattern and there’s no mismatch then we scribble on a notepad the # of steps we’ve walked down the path so far; and then we continue walking again. When we reach the end of the path with our feet, we are done, and we check our notepad to see our results.

Here’s an implementation in javascript, but if you don’t know javascript that isn’t a problem, just read it like pseudocode:

Limitations

The brute force algorithm has a runtime of O(|Text| * |Patterns|). That’s not terribly slow when we are matching a single pattern across a small portion of text. However, this algorithm grinds to a halt as the text and/or cumulative length of multiple patterns grow. The human genome is around 3 * 10^9 characters long with roughly 10^12 unique permutations. This algorithm would not work for such a large sample set. Luckily, more efficient methods have been discovered, and I will cover them in future posts!

Symbol.hasInstance

ES6 Symbols

Symbol.hasInstance

In this post I will discuss Symbol.hasInstance which is the first of 11 well-known symbols provided by the ES6 specification. Well-known symbols, for those unfamiliar, are provided by JavaScript as properties on an object which let you re-configure how ES6 treats the object. We will see how they are used in a minute.

Which native functions make use of Symbol.hasInstance?

The short answer is that the native instanceof function holds the sole rights to Symbol.hasInstance. Symbol.hasInstance in turn lets you customize and modify the behavior of the instanceof method. Let’s review the basics of instanceof briefly…

How is instanceof implemented?

instanceof is a method available to nearly every function. Objects in JavaScript are linked via a prototype chain. Therefore a method that is a metaphorical “child” of its parent does not inherit the properties of the parent but rather contains a link to the parent’s prototype. As such, instanceof compares the left-hand of its operand to the right-hand and determines if in the entire prototype chain of the left side, does the right-side’s prototype appear. It does this by querying the property Symbol.hasInstance placed on the prototypes of Objects.

instanceof‘s low-level implementation is via a Symbol.hasInstance property placed on native functions of an Object. Symbol.hasInstance is referred to as @@hasInstance for short in the ES6 spec. To implement instanceof , child instanceof parent evaluates according to the spec to parent[@@hasInstance](child) and when reduced further essentially evaluates to parent[Symbol.hasInstance](child).  So in short instanceof behind the scenes is implemented through querying the property Symbol.hasInstance which happens to be the topic of this post. instanceof is actually the only native function to make use of Symbol.hasInstance.

Here is a piece describing what we just discussed from the ES6 spec:

This (i.e. Function.prototype[@@hasInstance](V)) is the default implementation of @@hasInstance that most functions inherit. @@hasInstance is called by the instanceof operator to determine whether a value is an instance of a specific constructor. An expression such as
v instanceof F
evaluates as
F[@@hasInstance](v)
. . . . . .
The value of the name property of this function (i.e. @@hasInstance) is "[Symbol.hasInstance]".

Note: I know there is a trend to use Object.create() instead of new; however, this will not work as intended with instanceof since the left-side of instanceof is intended to be an object and the right-side a function. The Object.create() design pattern means that both the left and right-side will be objects which throws an error. If you plan on using instanceof stick to function name() and new instantiation as in the examples in this post. Here’s an example of the error thrown with the Object.create() design pattern.

What are some use cases for Symbol.hasInstance?

Here are two examples, the first is basic and less useful while the second is a more practical application…

I can’t say those examples are extremely practical, but it’s nice to know how to overwrite Symbol.hasInstance. If you have a more practical use-case please comment below!

Exploiting Symbol.hasInstance for malicious purposes

Symbol.hasInstance is a powerful property and one that can be exploited for nefarious purposes. Let’s explore how to exploit Symbol.hasInstance to expose the scope of private functions, and also how to modify instanceof to always return true.

Unfortunately for the black-hatters among us, ECMAScript figured out this exploit before it was released and made Function.prototype[Symbol.hasInstance] non-writable and non-configurable. So says the spec:

This property (referring to Function.prototype[Symbol.hasInstance]) is non-writable and non-configurable to prevent tampering that could be used to globally expose the target function of a bound function.

This makes using Object.defineProperty() to redefine Function.prototype throw TypeError: Cannot redefine property: Symbol(Symbol.hasInstance) saving the day and countless code!

That’s it for now. There’s plenty more to write regarding the use of Object.setPrototypeOf and Object.getPrototypeOf instead of and perhaps preferable to instanceof. There’s also potentially more to add regarding prototypes in general. Perhaps I’ll add those as another post or tack on to this one!

Thanks for reading!

Object Property Attributes

ES6 Language Specification

6.1.7.1 Property Attributes

The spec makes a distinction between 2 types of object properties: data properties and accessor properties. Most people are very familiar with the basics of data properties but less familiar with the powerful accessor properties. This post will discuss both.

Data properties are the typical properties that users define on objects while accessor properties are a unique way to set and get those properties from outside the function. For instance here’s some code that sums up which are which:

Data properties have 4 special attributes that are built into Javascript: Value, Writable, Enumerable, and Configurable. Let’s run through each of these.

The first attribute to discuss for data properties is ValueIn short, the spec says that this is the value retrieved by a Get access of the property. Building off of the previous code snippet, here’s an example of its use:

The second built in attribute for data properties is the Writable attribute. This attribute when false will prevent the accessor properties’ Set attribute from working. In other words in the following snippet setting firstName through the Set attribute will not change its value…

The third built in attribute for data properties is the Enumerable attribute. There are a number of built-in functions that either ignore or don’t ignore non-enumerable items. Setting Enumerable to false will mask those items for the functions that ignore them.  Enumeration is quite idiosyncratic as some functions will hide it and others won’t. It’s important to note that when using Object.defineProperty() the default is non-enumerable, but when using a standard object declaration then the default is enumerable. This can lead to some confusion or bugs and is therefore helpful to be aware of. The common use case for setting enumeration to false is to hide it from the for-in loop, however, you should be aware that this property can still be accessed through other functions as you will see in the following snippet. The following snippet lists functions that ignore or include non-enumerable items…

The fourth and final built in attribute for data properties is the Configurable attribute. When this is set to false then it isn’t possible to delete the property or change its attributes. The only exceptions are that the Value attribute may be modified (assuming Writable is true) and that Writable may be switched to false (but not vis-versa).

The defaults for Writable, Enumerable, and Configurable are true in the standard form of object declaration. However, when using property descriptors the defaults are false. The following code snippet demonstrates this…

Accessor properties have 4 special attributes that are built into Javascript: Get, Set, Enumerable, and Configurable. The first 2 we have already seen examples of in the first snippet. The last 2, Enumerable and Configurable, have the same effect as they have with data properties. The defaults for Get and Set are undefined.

All of this said, the following are the relevant portions of the spec:

6.1.7: 

An Object is logically a collection of properties. Each property is either a data property, or an accessor property:

  • A data property associates a key value with an ECMAScript language value and a set of Boolean attributes.
  • An accessor property associates a key value with one or two accessor functions, and a set of Boolean attributes. The accessor functions are used to store or retrieve an ECMAScript language value that is associated with the property.

6.1.7.1

Attributes are used in this specification to define and explain the state of Object properties. A data property associates a key value with the attributes listed in Table 2.

Table 2 — Attributes of a Data Property
Attribute Name Value Domain Description
[[Value]] Any ECMAScript language type The value retrieved by a get access of the property.
[[Writable]] Boolean If false, attempts by ECMAScript code to change the property’s [[Value]] attribute using [[Set]] will not succeed.
[[Enumerable]] Boolean If true, the property will be enumerated by a for-in enumeration (see 13.7.5). Otherwise, the property is said to be non-enumerable.
[[Configurable]] Boolean If false, attempts to delete the property, change the property to be an accessor property, or change its attributes (other than [[Value]], or changing [[Writable]] to false) will fail.

 

An accessor property associates a key value with the attributes listed in Table 3.

Table 3 — Attributes of an Accessor Property
Attribute Name Value Domain Description
[[Get]] Object | Undefined If the value is an Object it must be a function object. The function’s [[Call]] internal method (Table 6) is called with an empty arguments list to retrieve the property value each time a get access of the property is performed.
[[Set]] Object | Undefined If the value is an Object it must be a function object. The function’s [[Call]] internal method (Table 6) is called with an arguments list containing the assigned value as its sole argument each time a set access of the property is performed. The effect of a property’s [[Set]] internal method may, but is not required to, have an effect on the value returned by subsequent calls to the property’s [[Get]] internal method.
[[Enumerable]] Boolean If true, the property is to be enumerated by a for-in enumeration (see 13.7.5). Otherwise, the property is said to be non-enumerable.
[[Configurable]] Boolean If false, attempts to delete the property, change the property to be a data property, or change its attributes will fail.

 

If the initial values of a property’s attributes are not explicitly specified by this specification, the default value defined in Table 4 is used.

Table 4 — Default Attribute Values
Attribute Name Default Value
[[Value]] undefined
[[Get]] undefined
[[Set]] undefined
[[Writable]] false
[[Enumerable]] false
[[Configurable]] false

The Numeric String Grammar

ES6 Language Specification

5.1.3 The Numeric String Grammar

There is a separate grammar used for converting Strings to numbers. It’s very similar to the lexical grammar in that it’s terminals are SourceCharacters or in other words Unicode characters. There’s really not much to say on this other than it is considered separate from the lexical grammar. It is not clear to me why we need to single this out into its own grammar, so if someone has an idea to why this is necessary then please comment below.

Another grammar is used for translating Strings into numeric values. This grammar is similar to the part of the lexical grammar having to do with numeric literals and has as its terminal symbols SourceCharacter. This grammar appears in 7.1.3.1.

Productions of the numeric string grammar are distinguished by having three colons “:::” as punctuation.

An interesting thing to note between this spec and ES5 is that ES6 has dropped the JSON Grammar defined in section 5.1.5 of the ES5 spec. I’m assuming this is because the JSON has made its way into ECMA-404

The Lexical and RegExp Grammars

ES6 Language Specification

5.1.2  The Lexical and RegExp Grammars

What is the lexical grammar?

The lexical grammar is the most rudimentary part of a grammar’s syntax. It is better described in section 11 of the specs, however, we will briefly describe it here, and leave some of the more intricate details to when we discuss section 11.

We will first quote the spec as a point of reference and then describe it in simpler language. The first paragraph states:

A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

To better understand what the lexical grammar is we need to cover some of the terminology. The characters that make up the lexical grammar are called SourceCharacters, which are defined as being any Unicode character/symbol. These characters in turn define the lexical productions labeled InputElementDiv, InputElementTemplateTail, InputElementRegExp, and InputElementRegExpOrTemplateTail. These 4 productions are composed of what are termed tokens which are words like while or for or characters like ( or +, as well as non-tokens such as line terminators, comments, and white space. We have 4 distinct productions because each of these have additional valid grammars such as regular expressions or additional punctuators. These 4 productions are used in different contexts within the spec. 

We can now better understand the remaining part of 5.1.2:

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.

Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (11.9). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A MultiLineComment (that is, a comment of the form /**/regardless of whether it spans more than one line) is likewise simply discarded if it contains no line terminator; but if a MultiLineComment contains one or more line terminators, then it is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar.

A RegExp grammar for ECMAScript is given in 21.2.1. This grammar also has as its terminal symbols the code points as defined by SourceCharacter. It defines a set of productions, starting from the goal symbol Pattern, that describe how sequences of code points are translated into regular expression patterns.

Productions of the lexical and RegExp grammars are distinguished by having two colons “::” as separating punctuation. The lexical and RegExp grammars share some productions.

 

Context-Free Grammars

ES6 Language Specification

5.1.1 Context-Free Grammars

In order to understand much of the specification moving forward, we will need to understand some basic information on grammar used to describe programming languages. Indeed the spec itself devotes all of section 5 to the basic notation and grammar used in the rest of the documentation. Unfortunately for some, the writers assume familiarity with some aspects of language theory. In order to bridge the gap for those that are not familiar with some of the terminology, and as a refresher for those that are, we will discuss the concepts as they appear.

The first concept to understand is what a context-free grammar is, as it lays the foundation of much of the remaining specification.

Let’s start with an example of a context-free grammar so we have something to reference while we discuss the topic:

Syntactically, each rule here appears as a line. The left-hand side is a single non-terminal (another word for a variable). The right-hand side is a string that consists of any number of terminals (constants) and non-terminals (variables). In our example, 0, 1, and @ are terminals, and A and B are non-terminals. The notation for the non-terminals (variables) is a capital letter or a word that is camel-cased. Terminals (constants) are written in lowercase, or as a number, or a symbol.

Formal grammars have a recursive nature and consist of a collection of substitution rules referred to as productions. Their recursive nature empowers them to be widely expressive as you will see in many real-world examples throughout the specification. To apply the grammar, you start with the top rule, and simply perform substitutions until there are no non-terminals on the right-hand side. The top-left non-terminal is known as the goal terminal because our goal is to define it with everything on its right. In our example, valid syntax that will conform to our grammar include 0@1 00@11 000@111 and so on.

Why the name ‘context-free’?

What makes something ‘context-free’ and why is it called that? Consider the following:

In the first rule you can replace A without worrying about anything else around it, i.e. context-free. However, with the second rule, you can only replace A if B is alongside it. This means that you need to worry about the context of A being next to B. That is why AB : a isn’t context-free! For a formula to be context-free it must have only one non-terminal on the left-hand side. More than one non-terminal and we no longer have a context-free formula.

Armed with all this information we can understand 5.1.1 of the spec.

A context-free grammar consists of a number of productions. Each production has an abstract symbol called a nonterminal as its left-hand side, and a sequence of zero or more nonterminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.

There we go! Exactly like we described albeit less cryptic now that we know what it’s talking about.

A chain production is a production that has exactly one nonterminal symbol on its right-hand side along with zero or more terminal symbols.

It’s called a chain because the rules or productions are linked together like a chain. A chain provides an optimal level of simplicity.

Starting from a sentence consisting of a single distinguished nonterminal, called the goal symbol, a given context-free grammar specifies a language, namely, the (perhaps infinite) set of possible sequences of terminal symbols that can result from repeatedly replacing any nonterminal in the sequence with a right-hand side of a production for which the nonterminal is the left-hand side.

Run-on sentence anyone? It’s not technically a run-on, but it is definitely a mouthful. If you read it chunk by chunk you’ll see that it re-iterates what we already know from what we just discussed.

Web Scripting

ES6 Language Specification

4.1 Web Scripting

An overview of the role ECMAScript can play in the web client and web server. Using ES for both provides for an even more powerful computational environment in which tasks can be fluidly distributed between the client and server.

A web browser provides an ECMAScript host environment for client-side computation including, for instance, objects that represent windows, menus, pop-ups, dialog boxes, text areas, anchors, frames, history, cookies, and input/output. Further, the host environment provides a means to attach scripting code to events such as change of focus, page and image loading, unloading, error and abort, selection, form submission, and mouse actions. Scripting code appears within the HTML and the displayed page is a combination of user interface elements and fixed and computed text and images. The scripting code is reactive to user interaction and there is no need for a main program.

A web server provides a different host environment for server-side computation including objects representing requests, clients, and files; and mechanisms to lock and share data. By using browser-side and server-side scripting together, it is possible to distribute computation between the client and server while providing a customized user interface for a Web-based application.

Each Web browser and server that supports ECMAScript supplies its own host environment, completing the ECMAScript execution environment.

Overview

ES6 Language Specification

4. Overview

This section contains a non-normative overview of the ECMAScript language.

In other words it is an informal overview. Don’t confuse that to mean less important. It’s just not formal or binding in the same way as the normative parts of the spec. However, this is an interesting, and highly-informative section of the specification.

ECMAScript is an object-oriented programming language for performing computations and manipulating computational objects within a host environment. ECMAScript as defined here is not intended to be computationally self-sufficient; indeed, there are no provisions in this specification for input of external data or output of computed results. Instead, it is expected that the computational environment of an ECMAScript program will provide not only the objects and other facilities described in this specification but also certain environment-specific objects, whose description and behaviour are beyond the scope of this specification except to indicate that they may provide certain properties that can be accessed and certain functions that can be called from an ECMAScript program.

In other words, ECMAScript is a programming language that requires additional features provided by its hosting environment. Most commonly this is the web browser, but it can be a web server or another operating environment that implements the language.

ECMAScript was originally designed to be used as a scripting language, but has become widely used as a general purpose programming language. A scripting language is a programming language that is used to manipulate, customize, and automate the facilities of an existing system. In such systems, useful functionality is already available through a user interface, and the scripting language is a mechanism for exposing that functionality to program control. In this way, the existing system is said to provide a host environment of objects and facilities, which completes the capabilities of the scripting language. A scripting language is intended for use by both professional and non-professional programmers.

ECMAScript was originally designed to be a Web scripting language, providing a mechanism to enliven Web pages in browsers and to perform server computation as part of a Web-based client-server architecture. ECMAScript is now used to provide core scripting capabilities for a variety of host environments. Therefore the core language is specified in this document apart from any particular host environment.

ECMAScript usage has moved beyond simple scripting and it is now used for the full spectrum of programming tasks in many different environments and scales. As the usage of ECMAScript has expanded, so has the features and facilities it provides. ECMAScript is now a fully featured general propose programming language.

Some of the facilities of ECMAScript are similar to those used in other programming languages; in particular C, Java™, Self, and Scheme as described in:

ISO/IEC 9899:1996, Programming Languages – C.

Gosling, James, Bill Joy and Guy Steele. The Java Language Specification. Addison Wesley Publishing Co., 1996.

Ungar, David, and Smith, Randall B. Self: The Power of Simplicity. OOPSLA ’87 Conference Proceedings, pp. 227–241, Orlando, FL, October 1987.

IEEE Standard for the Scheme Programming Language. IEEE Std 1178-1990.

ECMAScript originated by Netscape for use as a scripting language in browsers. It was created to enhance webpages and accessible to non-professional programmers. However, as we know, the language has evolved, grown, and blossomed into a fully featured and powerful general purpose programming language to be used in a host of environments.

Normative references

ES6 Language Specification

3. Normative references

Listed are 3 external specifications that go hand-in-hand with ECMAScript:

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO/IEC 10646:2003: Information Technology – Universal Multiple-Octet Coded Character Set (UCS) plus Amendment 1:2005, Amendment 2:2006, Amendment 3:2008, and Amendment 4:2008, plus additional amendments and corrigenda, or successor

ECMA-402, ECMAScript 2015 Internationalization API Specification.
http://www.ecma-international.org/publications/standards/Ecma-402.htm

ECMA-404, The JSON Data Interchange Format.
http://www.ecma-international.org/publications/standards/Ecma-404.htm

  1. ISO/IEC 10646. This is a low level spec that is incorporated as part of Unicode. As mentioned in its abstract, this specification focuses on the “representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages of the world as well as additional symbols”.
  2. ECMA-402. As mentioned earlier, this is the standard that needs to be referenced by implementations that provide international APIs. Known also as ECMAScript 2015 Internationalization API Specification.
  3. ECMA-404. This is the JSON specification. It is only 5 pages long. Although originally derived from ECMAScript, it is not part of the ECMAScript spec, because it is intended to be language independent.

Conformance

ES6 Language Specification

2. Conformance

Popular implementations of ECMAScript include Chrome’s V8 Engine, Firefox’s SpiderMonkey,  and Microsoft Edge’s Chakra. To properly conform to ECMAScript the following criteria must be met:

A conforming implementation of ECMAScript must provide and support all the types, values, objects, properties, functions, and program syntax and semantics described in this specification.

A conforming implementation of ECMAScript must interpret source text input in conformance with the Unicode Standard, Version 5.1.0 or later and ISO/IEC 10646. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the Unicode set, collection 10646.

A conforming implementation of ECMAScript that provides an application programming interface that supports programs that need to adapt to the linguistic and cultural conventions used by different human languages and countries must implement the interface defined by the most recent edition of ECMA-402 that is compatible with this specification.

A conforming implementation of ECMAScript may provide additional types, values, objects, properties, and functions beyond those described in this specification. In particular, a conforming implementation of ECMAScript may provide properties not described in this specification, and values for those properties, for objects that are described in this specification.

A conforming implementation of ECMAScript may support program and regular expression syntax not described in this specification. In particular, a conforming implementation of ECMAScript may support program syntax that makes use of the “future reserved words” listed in subclause 11.6.2.2 of this specification.

A conforming implementation of ECMAScript must not implement any extension that is listed as a Forbidden Extension in subclause 16.1.

In other words the following must be true to be considered a valid implementation of ECMAScript:

  1. Must support all the language features, syntax, and semantics provided in this spec.
  2. Must interpret source code in Unicode, minimum version 5.1, although note that Unicode 8.0 is available as of 2015. Implied and incorporated in every Unicode standard is ISO/IEC 10646.
  3. If the implementation provides APIs to support internationalization, then those must conform to the interface specified by the ECMAScript 2015 Internationalization API Specification, referred to here as ECMA-402.
  4. May create additional features beyond the core ECMAScript functionality.
  5. May provide additional syntax beyond what’s in the spec. This includes using future reserved words (enum, await, implements, interface, package, private, protected, and public).
  6. Cannot implement forbidden extensions which will be the topic of a future post.