Regex Grammars

Regex grammars are built around sets of rules which use regular expressions. These rules (or “scopes”) break down text into logical tokens which form a parse tree.

Note: With the introduction of Tree-sitter support in Nova 10, building new regex grammars is discouraged. Support will remain for the foreseeable future for backward compatibility, but new features will be targeted exclusively at Tree-sitter.

Regular Expression Engine

Nova uses the Perl Compatible Regular Expressions, Version 2 (PCRE2) library for all regex grammars in its syntax engine. This means that any regular expression defined by an extension should be compatible with this format.

Scopes

Scopes are parsing rules that form the building blocks from which a Regular Expression grammar is constructed.

Each scope defines how a specific type of token, block, or region of text is parsed. There are several types of scopes available in the parser, but virtually all work around the concept of using a simple regular expression (or pair of expressions) to identify and tokenize text.

Within a regex grammar, scopes can be defined in several places:

Top-level Scopes (defined in a syntax’s <scopes> tag)
Template Scopes (defined in a syntax’s <template-scopes> tag)
Collection Scopes (defined in a syntax’s <collections> tag)

Top-Level Scopes

Each syntax grammar will have a top-level <scopes> tag that defines its first level of scopes. When parsing of a document begins, these scopes are evaluated. As scopes match, they may cause the parser to enter a deeper state and reference other scopes (like collection scopes).

The top-level scopes are only evaluated when the parser is in its top-level state. As the parser pushes scope rules onto its stack (see start-end scopes, below), the top-level scopes will not be referenced until the parse state “pops” back to its root, or they are explicitly referenced using a syntax-wide include scope (see includes, below).
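
As a rough sketch of how this fits together, a small top-level <scopes> element might look like the following. The mylang names and the comments collection are hypothetical and exist only for illustration.

```xml
<scopes>
    <!-- Pull in scopes from a collection defined elsewhere in this syntax -->
    <include syntax="self" collection="comments" />

    <!-- A simple match scope evaluated at the root of the document -->
    <scope name="mylang.keyword.let">
        <expression>\blet\b</expression>
    </scope>
</scopes>
```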

Template Scopes

Template scopes are optionally defined within the <template-scopes> element of the syntax grammar. They are otherwise defined in exactly the same way as the <scopes> element (see above).

These are a special set of scopes which allow easy construction of template languages, like PHP and Jinja.

They behave very similarly to top-level scopes, except that they are evaluated at every level of the parse tree during recursive parsing and not just at the root of the tree. They are also evaluated before any child scopes of the current parent. This allows template tags from an enclosing language to be handled deeply within the language they wrap, such as PHP tags within HTML.

If you are not developing a template language that uses this type of tag, template scopes are probably unnecessary.
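
A minimal sketch of a template scope, assuming a hypothetical template language whose tags open with <% and close with %>. The mytemplate names and the expressions collection are illustrative and not taken from any shipping syntax.

```xml
<template-scopes>
    <scope name="mytemplate.tag">
        <starts-with>
            <expression>&lt;%</expression>
            <capture number="0" name="mytemplate.tag.bracket" />
        </starts-with>
        <ends-with>
            <expression>%&gt;</expression>
            <capture number="0" name="mytemplate.tag.bracket" />
        </ends-with>
        <subscopes>
            <!-- Hypothetical collection holding the template language's own rules -->
            <include syntax="self" collection="expressions" />
        </subscopes>
    </scope>
</template-scopes>
```

Because this scope lives in <template-scopes>, it would be attempted at every level of the parse tree rather than only at the root, so a template tag could open deep inside the markup of the wrapped language.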

Collections

To make building syntax grammars easier and cleaner, scopes may be grouped logically into Collections.

A syntax’s top-level <collections> element contains one or more collections, which in turn contain scopes that may be referenced elsewhere using an Include scope.

One collection can easily include scopes from another collection, allowing for multiple levels of including for more complex syntaxes.

A collection is defined using a <collection> element.

Each collection should have a name attribute, used to reference the collection for including. Collection names may contain the same set of characters as scope names: alphanumeric characters, as well as the period, underscore, and dash.

Collections are scoped to their defining syntax, so collection names will not conflict between multiple unrelated syntaxes.

<collections>
    <!-- Keywords -->
    <collection name="keywords">
        <scope name="javascript.keyword">
            <strings>
                <string>await</string>
                <string>break</string>
                <string>case</string>
                <string>catch</string>
                <string>class</string>
                <string>const</string>
                […additional strings…]
            </strings>
        </scope>
        
        […additional scopes…]
    </collection>
    
    […additional collections…]
</collections>
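
As a sketch of one collection including another, the hypothetical values collection below reuses the keywords collection through an include scope. The collection names and the mylang scopes are illustrative assumptions.

```xml
<collections>
    <collection name="keywords">
        <scope name="mylang.keyword">
            <strings>
                <string>break</string>
                <string>case</string>
            </strings>
        </scope>
    </collection>

    <collection name="values">
        <!-- Reuse every scope from the keywords collection -->
        <include syntax="self" collection="keywords" />
        <scope name="mylang.number">
            <expression>\b\d+\b</expression>
        </scope>
    </collection>
</collections>
```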

Types of Scopes

There are four main types of scope:

Match Scopes: a single expression that matches a token within a line
Start-End Scopes: a pair of expressions that begin and end a recursive level of parsing
Cut-Off Scopes: a single expression that can be used to break out of the current recursive level of parsing early
Include Scopes: imports scopes from a collection or syntax into the current level of parsing

Scope Names

The first two types of scope (match and start-end) each require a name. The name of a scope is a set of components (identifiers) separated by period characters, such as mylang.identifier.keyword. The order of the components in the name does not matter, except for developer preference and readability.

Each identifier may consist of the ASCII alphanumeric set (a-z, A-Z, 0-9), the underscore, and the dash. They cannot contain spaces or any other special characters.

The name is most prominently used during syntax highlighting and theming, where theme rules are filtered based on whether, and how many, components of the name they match.

As a general rule of thumb, names should include the syntax name in which they are defined, and stick to a predefined set of common components where they apply, which can be found in the Themes documentation.

Match Scopes

Match scopes are the simplest type. They define a regular expression (inside of an expression element) that is used as the basis for the resulting parse rule:

<scope name="mylang.keyword.let">
    <expression>let</expression>
</scope>

Match scopes support treating regular expression capture groups as a special type of sub-scope, called a Capture:

<scope name="mylang.identifier.variable">
    <expression>\b(let)\s+([a-zA-ZÀ-ÖØ-öø-ÿ_][A-Za-zÀ-ÖØ-öø-ÿ0-9_]*)</expression>
    <capture number="1" name="mylang.keyword.let" />
    <capture number="2" name="mylang.identifier.variable.name" />
</scope>

A capture element defines the regular expression capture group number (starting at 1, with 0 being the entire regular expression match), and can be referenced by name in the same way as scopes for syntax highlighting.

One note: Match scopes can only match content within the current line of the document being parsed. They cannot cause the parser to consume text within the next line (or consume the newline itself). To parse text that encompasses multiple lines, use a Start-End scope instead. Match scopes can, however, see content on the previous or next line through the use of regular expression look-behinds and look-aheads, respectively.

Match scopes can be configured with additional options, such as spell-checking and symbolication. See Scope Options below.
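
For instance, a look-ahead lets a match scope depend on what follows a token without consuming it. This sketch assumes a hypothetical language in which a function call is an identifier followed by an opening parenthesis; only the identifier becomes part of the match.

```xml
<scope name="mylang.identifier.function">
    <!-- Match the identifier only when "(" follows; the parenthesis is not consumed -->
    <expression>\b[a-zA-Z_][a-zA-Z0-9_]*(?=\s*\()</expression>
</scope>
```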

Start-End Scopes

Start-End scopes define expressions for the beginning and end of a recursive parse rule. The starts-with element behaves in the same way as a Match scope, with the same support for capture groups.

Once the starting expression is matched, the parser pushes a new state onto its stack, and begins parsing using scopes defined within the scope’s ends-with and subscopes elements (or, alternatively, a subsyntax element). The subscopes element may contain any number of other scopes, including other Start-End scopes and Include scopes (see below).

The ends-with expression will take the highest priority during this state. If matched, the parser will end parsing the Start-End scope at this point, and pop its state from the parse stack, returning to the previous set of scopes that were being used before the starts-with expression was encountered.

<scope name="mylang.function">
    <starts-with>
        <!-- Opening bracket -->
        <expression>(\[)</expression>
        <capture number="1" name="mylang.bracket" />
    </starts-with>
    <ends-with>
        <!-- Closing bracket -->
        <expression>(\])</expression>
        <capture number="1" name="mylang.bracket" />
    </ends-with>
    <subscopes>
        <scope name="mylang.number">
            <!-- Matches a number -->
            <expression>\d+</expression>
        </scope>
        <scope name="mylang.boolean">
            <!-- Matches a boolean keyword -->
            <expression>true|false</expression>
        </scope>
        <scope name="mylang.string">
            <!-- Matches text between double-quotes -->
            <expression>&quot;[^&quot;]*&quot;</expression>
        </scope>
    </subscopes>
</scope>

Start-End scopes can be configured with additional options, such as spell-checking and symbolication. See Scope Options below.

Anchored vs. Unanchored Parsing

The subscopes element of a start-end scope contains the scopes that will be used for matching after the scope is pushed onto the parser’s stack.

Unanchored Parsing

By default, subscopes behave just like the top-level <scopes> array: They are matched repeatedly ahead of the current parse position to the end of the line the parser is parsing, and any matches are combined, filtered, and prioritized based on which matched first, which intersect, etc. This is called Unanchored Parsing, as matches in the line are not anchored to any specific location in the line range being parsed.

In unanchored parsing, the order of matches doesn’t matter, so long as they can be sorted and prioritized properly. This is ideal for parsing where the order of tokens isn’t particularly important outside of the start and end expressions of a scope, such as attributes within HTML tags.

Anchored Parsing

New in Nova 4.

Alternatively, a start-end scope may define its <subscopes> element using the anchored="true" attribute. By doing so, the parser will instead use a different parsing method, known as Anchored Parsing. When anchored parsing is used, the scope’s subscopes will be required to match in order, and may (by default) only match once each. This allows parser rules to be defined which will match specific procedural constructs using multiple subscopes, allowing for more expressive (and accurate) grammars to be constructed for certain types of language tokens.

If a subscope should only be matched conditionally, it can be annotated using the optional="true" attribute. By marking a subscope as optional, it will be attempted at the position within the array of subscopes but will not be required to match. If it does not, the parser will continue on to the next subscope (if any).

If a subscope should be attempted multiple times, it can be annotated using the repeat="true" attribute. By marking a subscope as repeating, it will be attempted continuously until it does not match. Combining both optional="true" and repeat="true" allows a subscope to match zero or more times.

If at any point the parser encounters a token that is not expected while performing subscope matching, the scope will be immediately ended as if a <cut-off> scope had been encountered, and the parser will pop back to the previous scope level.

Additionally, when using anchored parsing, it is not necessary to specify an expression within the ends-with element (although the element itself should still be included as a self-closing tag). Since the parser can determine a specific rule for when the scope should end, it can do so automatically. If an ends-with expression is provided, it will be matched only after all subscopes are matched or optionally ignored.

Finally, when using anchored parsing, whitespace will be automatically consumed between subscopes that do not otherwise match it. This allows parse rules to be written without needing to worry about whitespace. However, if the presence of whitespace preceding or following a subscope is important, this behavior can be disabled by setting skip-whitespace="false" on the subscopes element of the containing scope.

Consider the following example:

<scope name="mylang.function">
    <starts-with>
        <!-- Matches the form 'function foobar' -->
        <expression>(function)\s+([a-zA-Z0-9_]+)</expression>
        <capture number="1" name="mylang.function.keyword" />
        <capture number="2" name="mylang.identifier.name" />
    </starts-with>
    <ends-with />
    <subscopes anchored="true">
        <!-- Matches a function arguments list -->
        <scope name="mylang.function.arguments">
            <starts-with>
                <expression>\(</expression>
                <capture number="0" name="mylang.function.arguments.bracket" />
            </starts-with>
            <ends-with>
                <expression>\)</expression>
                <capture number="0" name="mylang.function.arguments.bracket" />
            </ends-with>
            <subscopes>
                <!-- ... -->
            </subscopes>
        </scope>
        
        <!-- Match comments (optionally) -->
        <include syntax="self" collection="comments" optional="true" />
        
        <!-- Matches a function body -->
        <scope name="mylang.function.body">
            <starts-with>
                <expression>\{</expression>
                <capture number="0" name="mylang.function.body.bracket" />
            </starts-with>
            <ends-with>
                <expression>\}</expression>
                <capture number="0" name="mylang.function.body.bracket" />
            </ends-with>
            <subscopes>
                <!-- ... -->
            </subscopes>
        </scope>
    </subscopes>
</scope>

In this example, the scope defines rules for parsing a JavaScript-like function definition. If the parser matches text of the form function <function-name>, the parser will begin performing anchored parsing with its subscopes. It will then attempt the following:

Attempt to match an arguments list by looking for the ( character. If found, it will begin parsing arguments using this child scope’s subscopes (in unanchored mode). If not, the parser will stop parsing the function immediately.
Attempt to match a comment using an include scope. If not found, it will advance to the next subscope.
Attempt to match a function body by looking for the { character. If found, it will begin parsing the body using this child scope’s subscopes (in unanchored mode). If not, the parser will stop parsing the function immediately.
End the function after this matching.
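
The optional and repeat attributes are not combined in the example above. A minimal sketch of that combination, assuming a hypothetical import statement of the form import a, b, c (all mylang names are illustrative):

```xml
<scope name="mylang.import">
    <starts-with>
        <expression>\b(import)\b</expression>
        <capture number="1" name="mylang.keyword.import" />
    </starts-with>
    <ends-with />
    <subscopes anchored="true">
        <!-- Required: the first imported name -->
        <scope name="mylang.identifier.import">
            <expression>[a-zA-Z_][a-zA-Z0-9_]*</expression>
        </scope>

        <!-- Zero or more: a comma followed by another name -->
        <scope name="mylang.identifier.import.additional" optional="true" repeat="true">
            <expression>(,)\s*([a-zA-Z_][a-zA-Z0-9_]*)</expression>
            <capture number="1" name="mylang.comma" />
            <capture number="2" name="mylang.identifier.import" />
        </scope>
    </subscopes>
</scope>
```

Here the first identifier is required, while the comma-prefixed subscope may match zero or more times. Whitespace between the subscopes is skipped automatically unless skip-whitespace="false" is set.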

Back-Referencing Between Start and End Expressions

If the closing expression of a start-end scope (defined by <ends-with>) is somehow dependent on the starts-with expression, you can opt to use regular expression capture group references which resolve to the capture groups of the starts-with expression.

To use capture group references in ends-with, the <expression> tag should be replaced with a <template> tag. This instructs the parser to resolve capture group references inside of the template expression before compiling its regular expression.

Capture group references use the “backslash” format, just as regular expression back-references do, such as \x, where x is the capture group number.

<scope name="mylang.tag">
    <starts-with>
        <expression>&lt;([a-zA-Z0-9_]+)&gt;</expression>
        <capture number="1" name="mylang.tag.name" />
    </starts-with>
    <ends-with>
        <template>&lt;/\1&gt;</template>
        <capture number="0" name="mylang.tag.name" />
    </ends-with>
    <subscopes />
</scope>
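
Another common use of this technique is a string that must close with the same quote character that opened it. The following is a sketch under that assumption, with hypothetical mylang names:

```xml
<scope name="mylang.string">
    <starts-with>
        <!-- Capture group 1 holds whichever quote character opened the string -->
        <expression>(&quot;|')</expression>
    </starts-with>
    <ends-with>
        <!-- \1 resolves to the text captured by group 1 of starts-with -->
        <template>\1</template>
    </ends-with>
    <subscopes />
</scope>
```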

Subsyntaxes

Alternatively, Start-End scopes can be used to define a fenced block of code, also known as a Subsyntax. When a subsyntax element is used in place of subscopes, the parser will automatically treat this as a fenced code block and take extra care to parse the subsyntax with this in mind. Subsyntax elements may contain Cut-off scopes (see below) to further instruct the parser on how to “break out” of the fenced code block should the code within be incomplete.

<scope name="mylang.fenced-code-block">
    <starts-with>
        <expression>\`\`\`</expression>
    </starts-with>
    <ends-with>
        <expression>\`\`\`</expression>
    </ends-with>
    <subsyntax name="otherlang">
        
    </subsyntax>
</scope>

Subsyntaxes are most often used, as mentioned before, for fenced code blocks. Examples of this behavior include:

JavaScript within <script> tags in HTML
The PHP procedural language within <?php ?> template tags
The triple-backtick fenced blocks in Markdown

Subsyntaxes are not generally recommended for simple inclusion of specific parts of another syntax (such as including parts of CSS inside of SCSS, for example). For this, you should use Include scopes instead.

The <subsyntax> element defines the name of the syntax to use within the fenced code block, which must be a validly registered syntax. The subsyntax should not reference the outer syntax recursively. Instead, consider using an include scope (see below) to include the syntax’s own rules within itself.

Unlike normal Start-End scopes, the ends-with expression for a subsyntax scope will be evaluated deeply in the fenced region of code, so that its expression can “break” out of the subsyntax early. An example of this is the use of a </script> tag for a JavaScript fenced code block in HTML. The end tag should be able to be matched even if the JavaScript code within is not fully complete and valid.

The one exception to this behavior is the use of atomic scopes (such as comments and strings). The ends-with expression will never be evaluated within these scopes (for more information on atomic scopes, see Scope Options below).

The <subsyntax> element has the following possible options:

name: The name of the syntax that should be included
noncontiguous: Whether separate regions defined by the subsyntax represent a noncontiguous symbolic namespace. This is useful in cases such as PHP’s template tags, which use a shared namespace for all tags within a document. The default value is false. Set to true to enable this.
export-symbols: Whether symbols defined within the subsyntax are exported for completion in other files. By default this is false, indicating that symbols are scoped only to this file or module. Setting this to true will enable symbols to be visible in other files (depending on their assigned symbol scope). (New in Nova 1.1.)
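
As an illustration of these options, the sketch below embeds a subsyntax in a hypothetical template language. It assumes that a syntax named php is registered; the scope name and the cut-off expression are illustrative only.

```xml
<scope name="mytemplate.embedded.block">
    <starts-with>
        <expression>&lt;\?php\b</expression>
    </starts-with>
    <ends-with>
        <expression>\?&gt;</expression>
    </ends-with>
    <!-- All blocks in the document share one symbolic namespace -->
    <subsyntax name="php" noncontiguous="true">
        <!-- Break out early if the enclosing document appears to end -->
        <cut-off>
            <expression>(?=&lt;/html&gt;)</expression>
        </cut-off>
    </subsyntax>
</scope>
```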

String Expressions

As an alternative to using a regular expression, both match scopes and start-end scopes may optionally define their expressions using a set of strings. This is very useful when the possible expressions being matched are from a known set of words or expressions:

<scope name="javascript.keyword">
    <strings>
        <string>await</string>
        <string>break</string>
        <string>case</string>
        <string>catch</string>
        <string>class</string>
        <string>const</string>
    </strings>
</scope>

Behind the scenes, the syntax engine will compile this word list into an optimized regular expression that will be used in much the same way as a normal match expression.

The <strings> element has several possible options:

prefix: a regular expression component that will be prepended before the word list expression
suffix: a regular expression component that will be appended after the word list expression
word-boundary: whether the expression will be wrapped in a regex word boundary check (\b) (enabled by default, may be set to "false" to disable)
case-insensitive: whether the regular expression’s word list will be case insensitive (disabled by default, may be set to "true" to enable)
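
A short sketch combining these options, assuming a hypothetical language whose directives begin with an @ sign. The default word-boundary wrapping is disabled so that the prefix and suffix fully control what surrounds the word list.

```xml
<scope name="mylang.keyword.directive">
    <strings prefix="@" suffix="\b" word-boundary="false" case-insensitive="true">
        <string>import</string>
        <string>include</string>
        <string>media</string>
    </strings>
</scope>
```

With these settings, text such as @import or @IMPORT should match, while a bare import should not.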

Cut-Off Scopes

Cut-off scopes are used primarily for improving the performance of complex language definitions. When matched, a cut-off instructs the parser to stop evaluating the current start-end scope as if its ends-with expression had been encountered.

They are most often used for defining cases when known text should never be encountered within the current scope (such as encountering a class definition inside of a method’s implementation, assuming the language doesn’t support that).

Consider the following case:

<cut-off>
    <expression>(?=\b(?:class)\b)</expression>
</cut-off>

If this cut-off expression were placed within the subscopes of a method’s implementation, it would indicate that should the parser encounter the class token ahead of the current location, the method should immediately stop parsing, as the class definition is not valid here.

This type of “early cut-off” allows the parser to be far quicker at adjusting the parse tree when changes are made, as encountering this type of token would not cause the parser to continue parsing as if the method’s implementation was still open.
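
Placed in context, the cut-off might sit alongside the other subscopes of a method body. This is a sketch only: the mylang names and the statements collection are hypothetical.

```xml
<scope name="mylang.method.body">
    <starts-with>
        <expression>\{</expression>
    </starts-with>
    <ends-with>
        <expression>\}</expression>
    </ends-with>
    <subscopes>
        <!-- Stop parsing the body if a class definition appears ahead -->
        <cut-off>
            <expression>(?=\b(?:class)\b)</expression>
        </cut-off>
        <include syntax="self" collection="statements" />
    </subscopes>
</scope>
```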

Include Scopes

Include scopes, along with Collections, are used to organize scopes into logical sets that can be reused in multiple places within a syntax definition.

When encountered, the parser will look up the collection referenced by the include, and evaluate the collection’s scopes as if they were defined in place of the include. An include scope can be used at any place other scope types are valid, including within the subscopes element of a start-end scope.

The most common use of an include is to reference a collection within the current syntax, through the use of the special self name:

<include syntax="self" collection="variables" />

Syntaxes that reference collections within themselves should always use self instead of the syntax name, as this allows the parser to properly handle cases where syntaxes inherit from one another and override collections.

An include scope can also be used to reference a collection within another syntax:

<include syntax="javascript" collection="comments" />

Doing so forms a dependency between the two syntaxes. If the referenced syntax is not available, the include will evaluate to an empty set. It is generally recommended to only use this behavior if both syntaxes are provided as part of the same extension. Additionally, relying on collections within built-in syntaxes should be avoided, as these may change at any time.

Finally, includes can also be used to include an entire syntax, including self:

<include syntax="self" />

This is effectively the same as including the special top-level scopes collection within the syntax. Care should be taken when doing this, as it can cause deep recursion within a syntax that can impact parsing performance.
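
For example, a parenthesized group that may contain anything valid at the top level could include the whole syntax within its subscopes. The mylang.group scope below is a hypothetical sketch of that pattern.

```xml
<scope name="mylang.group">
    <starts-with>
        <expression>\(</expression>
    </starts-with>
    <ends-with>
        <expression>\)</expression>
    </ends-with>
    <subscopes>
        <!-- Re-enter the syntax's top-level scopes, which allows nested groups -->
        <include syntax="self" />
    </subscopes>
</scope>
```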

Spell Checking

Spell checking is most often reserved for comments and prose within a language. When this is enabled, the editor will automatically perform spell checking using the user’s default language dictionary and highlight misspelled text.

By default, all scopes have spell checking set to “inherit” from their parent.

A scope can explicitly opt in to spell checking by adding the spell-check="true" attribute to its <scope> tag. This value may also be set to "false" to disable spell checking in a scope when its closest non-inheriting ancestor has enabled it.

A syntax may also include the spell-check attribute on its top-level <scopes> tag to enable spell checking for the entire syntax, after which individual scopes may disable it as needed. This is most often used for prose-heavy languages like Markdown and HTML.
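
A minimal sketch of this arrangement, assuming a hypothetical prose-heavy markup language in which inline code is wrapped in backticks:

```xml
<scopes spell-check="true">
    <!-- Inline code is not prose, so opt out of spell checking here -->
    <scope name="mymarkup.code.inline" spell-check="false">
        <expression>`[^`]*`</expression>
    </scope>
</scopes>
```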

Lookup

Lookup allows the user to perform the “define” gesture on their mouse or trackpad to invoke various actions. This gesture is most often a “deep-click” on force-touch trackpads or a three-finger tap on older trackpads and touch-enabled mice.

By default, all scopes have lookup set to “inherit” from their parent.

There are two types of lookup available to scopes:

Index (attempts to “jump” to the definition of the identifier under the cursor, which may be in the current or another file)
Dictionary (displays the standard system dictionary popup for the word under the cursor)

By default, “index” lookups are used. A scope may set the lookup="dictionary" attribute on its <scope> tag to define that words within the scope should use dictionary lookup behavior.

A syntax may also include the lookup attribute on its top-level <scopes> tag to set the default lookup behavior for the entire syntax, after which individual scopes may change it as needed. This is most often used for prose-heavy languages like Markdown and HTML.
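
As a sketch, a single-line comment scope in a hypothetical language could request dictionary lookup, since words in a comment are more likely prose than identifiers:

```xml
<scope name="mylang.comment.single" spell-check="true" lookup="dictionary">
    <expression>//.*$</expression>
</scope>
```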

Atomic Scopes

Start-End scopes represent a recursive level of parsing downward. Certain types of other scopes, such as template scopes, cut-offs, and ends-with expressions can be referenced during this deeper level of recursion to “break out” of the current level of parsing.

However, for certain types of parsing, this behavior is not ideal. The best examples are comments and strings. When typing a JavaScript string within an HTML script tag, for example, you would not want the expression let string = "use a </script> tag"; to be able to “break out” of the JavaScript subsyntax. Therefore, by default, all comments and strings are marked as “atomic”. This means that deeper cut-off rules (like ends-with expressions) will not be evaluated within their level of the parse tree unless that cut-off is defined within that specific scope.

Other types of Start-End scopes may opt in to being atomic by adding the atomic="true" attribute to their <scope> tag. However, it is generally rare for this to be used outside of comments and strings, which is why these two types of token are automatically set to atomic.
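
One plausible case is a verbatim block whose contents should never be interpreted. The sketch below assumes a hypothetical template language with raw and endraw tags; the names are illustrative.

```xml
<scope name="mytemplate.raw-block" atomic="true">
    <starts-with>
        <expression>\{%\s*raw\s*%\}</expression>
    </starts-with>
    <ends-with>
        <expression>\{%\s*endraw\s*%\}</expression>
    </ends-with>
    <subscopes />
</scope>
```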

Symbolication

The process of taking a parse tree built by a syntax grammar in the editor and forming a list of “Symbols” is known as Symbolication. This allows the elements of a parse tree to form logical “higher-level” Symbols that refer to language constructs that should appear in places such as the Symbols list and be used to power IDE features such as Jump To Definition.

Symbolication in the Nova parse engine is achieved by adding additional metadata to scopes using a <symbol> element. The presence of this element declares that a scope either defines a Symbol, or adds metadata to an already existing Symbol.

<scope name="mylang.function">
    <symbol type="function">
        <context behavior="subtree" />
    </symbol>
    <starts-with>
        <expression>(function)\s*(\{)</expression>
        <capture number="1" name="mylang.function.keyword" />
        <capture number="2" name="mylang.bracket" />
    </starts-with>
    <ends-with>
        <expression>\}</expression>
        <capture number="0" name="mylang.bracket" />
    </ends-with>
    <subscopes>
        
    </subscopes>
</scope>

Basic Symbols

The most basic of <symbol> elements defines attributes about its type:

<scope name="javascript.definition.class">
    <symbol type="class" />
    <starts-with>
        <expression>\b(class)\b</expression>
        <capture number="1" name="javascript.keyword.class" />
    </starts-with>
    …
</scope>

Defining a Symbol’s Type

In this example, a scope that parses JavaScript classes is marked using a symbol element that has its type set to class, indicating to the parse engine that this scope defines a Class symbol.

Types are provided for symbolic constructs common to most procedural, structured, and markup languages. Valid values are defined in the Symbol documentation.

Note: not all symbol types will appear in the Symbols list in the IDE. This list is filtered to specific types to ensure the most relevance to users.

Defining a Symbol’s Scope

The scope attribute of the <symbol> element may be used to define the lexical scope in which the symbol is valid. (The term “scope” is overloaded here: this should not be confused with the “scopes” used in syntax grammars.)

The scope of a symbol can affect how it is offered in completions and global project indexing.

Valid values for the scope attribute are:

global: The symbol is valid globally for the current file or module. It can be offered for completion nearly anywhere that is valid.
private: The symbol is valid only within the type in which it is defined (such as a class or module). For example: it can be offered for completion within methods of that class.
local: The symbol is valid only within the most relevant Symbolic Context, depending on the grammar’s symbolication options. It can be offered only within this region of the text, and never outside of it.
external: The symbol is defined outside of the current file or module, and is referenced here only as an external link (such as by an import statement).

Most often, the constructs of a language define the scope of symbols using generally understood rules:

Classes, interfaces, and other types are most often global
Global variables and constants are likely global
Variables and constants defined only in a specific context are most often local
Functions, methods, and properties are often global (indicating they are public) or private (indicating they are not)

If no scope is defined on a symbol, it will be inferred using rules like those above. If no scope can be inferred, it will be assumed to be local.

Additionally, a symbol may be marked using the anonymous attribute (set to true) to indicate that it does not export a name that should be indexed anywhere for completion, even if it has a name in its local scope. This is most often used for anonymous functions in languages like JavaScript and Python.

There are, however, no strict rules enforced for the scope of a symbol in relation to where it is defined. This allows a syntax grammar to tune symbols to the scope in which they make the most sense with regard to completion and indexing.

Consider the following definition for a JavaScript arrow function:

<scope name="javascript.definition.function.arrow">
    <symbol type="function" scope="local" anonymous="true" />
    …
</scope>

Since JavaScript arrow functions are anonymous functions, the scope attribute is set to local, which indicates that the symbol should only be valid in its current parent (local) scope.

Symbols defined in the local scope are made available within the most relevant symbolic context, which depends on the grammar’s Symbolication Options. See that section for more information on controlling the meaning of the “local” scope.
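
By contrast, a grammar could mark some symbols as private. The sketch below assumes a hypothetical language convention in which functions whose names begin with an underscore are not public; the mylang names are illustrative.

```xml
<scope name="mylang.definition.function.private">
    <symbol type="function" scope="private" />
    <starts-with>
        <!-- Assumed convention: a leading underscore marks a non-public function -->
        <expression>\b(function)\s+(_[a-zA-Z0-9_]*)\s*\{</expression>
        <capture number="1" name="mylang.keyword.function" />
        <capture number="2" name="mylang.identifier.function.name" />
    </starts-with>
    <ends-with>
        <expression>\}</expression>
    </ends-with>
    <subscopes />
</scope>
```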

Computing a Symbol’s Name

When a symbol is created from a <scope> tag, the parser performs a set of heuristics to attempt to determine both syntactically-relevant and user-readable names for it.

By default, if no other options are specified, the parser will look through the symbol’s children for any <scope> or <capture> elements that contain the class “name” in their name. If one is found, it will be assumed that it represents the name of the symbol.

Consider this example, from the XML syntax:

<scope name="xml.tag.open">
    <symbol type="tag" />
    <starts-with>
        <expression>&lt;([a-zA-Z_][A-Za-zÀ-ÖØ-öø-ÿ0-9_:.-]*)</expression>
        <capture number="1" name="xml.tag.name" />
    </starts-with>
    …
</scope>

The scope is marked for symbolication with the type tag. No other options are specified, so the parser will search the scope’s starts-with expression for a capture that includes the class “name”, which it will find, defined as capture group 1 (xml.tag.name). The name for this symbol will then be constructed by referencing what text was captured by this group.

Alternatively, if the scope or capture element that should be used to compute the name does not contain the class “name”, the symbol may specify the selector to use for searching by setting the name-selector attribute of the <symbol> element to a simple selector. (name-selector was added in Nova 4.)

The name of a symbol will be used when inserting symbols using completions, when searching for them using the project index, and more. It is important to ensure that the name of symbols is being properly constructed by the parser to ensure proper behavior in the language grammar.
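
A sketch of name-selector in use, assuming the selector takes the same period-separated class form used elsewhere in grammars. The mylang names and the def/end syntax are hypothetical.

```xml
<scope name="mylang.definition.function">
    <!-- The capture below has no "name" class, so point the symbol at it explicitly -->
    <symbol type="function" name-selector="identifier.function" />
    <starts-with>
        <expression>\b(def)\s+([a-zA-Z_][a-zA-Z0-9_]*)</expression>
        <capture number="1" name="mylang.keyword.def" />
        <capture number="2" name="mylang.identifier.function" />
    </starts-with>
    <ends-with>
        <expression>\bend\b</expression>
    </ends-with>
    <subscopes />
</scope>
```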

Computing a Symbol’s Display Name

It is often the case that the user-displayable name for a symbol (such as in the Symbols list) needs to convey more information about the symbol than just its simple syntactic name.

In HTML, for example, a symbols list that contains only a list of the word div is not super helpful. For this, the parser supports complex expressions to build a Display Name.

The <display-name> element of the <symbol> element allows for this.

This element contains one or more <component> elements that pull pieces of the symbol’s subtree in the same way the parser computes the name. These pieces are then concatenated together in specific ways.

Component elements may have the following attributes:

variable: Uses the text of an available parser variable, which includes:
- name: The name of the symbol
selector: A simple selector (classes only) that defines a parse tree match that should be used for computing the component. The text of this match will be used.
prepend: Text to prepend to the resulting component
append: Text to append to the resulting component
replace: A regular expression to be used in targeting text within the component for replacement
replace-with: Text to use when using the replace attribute

Consider this example from the HTML syntax:

<scope name="html.tag.open.paired" spell-check="false" lookup="documentation">
    <symbol type="tag-heading">
        <display-name>
            <component variable="name" />
            <component selector="tag.attribute.value.id" prepend="#" />
            <component selector="tag.attribute.value.class" prepend="." replace="\s+" replace-with="." />
        </display-name>
        <context behavior="start" group-by-name="true" unclosed="parent">
            <auto-close string="&lt;/" completion="${name}&gt;" />
        </context>
    </symbol>
    <starts-with>
        <strings prefix="&lt;" suffix="\b" word-boundary="false" case-insensitive="true">
            <string>h1</string>
            <string>h2</string>
            <string>h3</string>
            …
        </strings>
        <capture number="1" name="html.tag.name" />
    </starts-with>
    …
</scope>

The <display-name> element in this symbol has three <component> elements within:

The first uses the symbol’s name (such as h1)
The second searches for a subscope with the name tag.attribute.value.id and uses its text, such as the myheading of id="myheading". It then prepends a hash (#) character to this text. This will result in myheading becoming #myheading.
The third searches for a subscope with the name tag.attribute.value.class and uses its text, such as the foo bar of class="foo bar". It then prepends a period (.) character and replaces all occurrences of whitespace with another period character (.). This will result in foo bar becoming .foo.bar.

When these three computed components are joined, the result will be the display name h1#myheading.foo.bar, which is a nicely specific descriptor for the user.
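
The append attribute works the same way as prepend. As a sketch, assume a hypothetical grammar in which a function's parameter list is parsed into a subscope with the class arguments.list (an invented name for illustration):

```xml
<scope name="mylang.definition.function">
    <symbol type="function">
        <display-name>
            <!-- The symbol's computed name, such as render -->
            <component variable="name" />
            <!-- Hypothetical subscope holding the parameter list text -->
            <component selector="arguments.list" prepend="(" append=")" />
        </display-name>
    </symbol>
    …
</scope>
```

If the name were render and the matched subscope text were a, b, the two components would join to form render(a, b).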

Parsing Arguments

For certain types of symbols, such as functions and methods, the parser can automatically symbolicate arguments in such a way that they can be used in completions when inside the function, or invoking the function in other code.

By default, argument parsing is enabled for functions and methods. It can be enabled for any other symbol type in which it makes sense by using the arguments attribute set to true on the <symbol> element. Likewise, it can be disabled for functions and methods using the value false.
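
As a rough sketch, assume a hypothetical grammar in which class declarations take constructor-style parameters. Argument parsing could then be switched on for the class symbol and, separately, switched off for a function symbol. The scope names are illustrative, and the function type name is assumed from the prose above:

```xml
<!-- Hypothetical: enable argument parsing for a non-function symbol type -->
<scope name="mylang.definition.class">
    <symbol type="class" arguments="true" />
    …
</scope>

<!-- Hypothetical: disable argument parsing for a function symbol -->
<scope name="mylang.definition.function">
    <symbol type="function" arguments="false" />
    …
</scope>
```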

When argument parsing is enabled, the parser will enable a special type of symbolication using the argument symbol type. This type specifically calls out that the symbol created is an argument to a symbol in its ancestry. If one or more argument symbols are found, they will automatically be parsed as arguments for the parent symbol.

The name of the argument can be further refined by using a subscope or sub-capture with the class name argument.name, much in the same way as computing symbol names. If no name can be found, the entire text of the argument will be used.

<scope name="javascript.identifier.argument.name">
    <symbol type="argument" />
    <expression>(?&lt;!\=)\b[a-zA-Z_][A-Za-zÀ-ÖØ-öø-ÿ0-9_]*\b</expression>
</scope>

Filtering Symbols

When constructing symbols, it is possible that a <scope> element present in the parse tree should only be turned into a symbol if certain textual characteristics are met, such as matching a regular expression.

The <filter> element used within the <symbol> element allows for this.

Consider this example, from XML, where only tags that are not self-closing will generate symbols.

<scope name="xml.tag.open">
    <symbol type="tag">
        <!-- Do not match self-closing tags -->
        <filter match-end="(?&lt;!/&gt;)" />
        <context behavior="start" group-by-name="true">
            <auto-close string="&lt;/" completion="${name}&gt;" />
        </context>
    </symbol>
    <starts-with>
        <expression>&lt;([a-zA-Z_][A-Za-zÀ-ÖØ-öø-ÿ0-9_:.-]*)</expression>
        <capture number="1" name="xml.tag.name" />
    </starts-with>
    …
</scope>

The <filter> element defines a match-end attribute that is a regular expression pattern that must match at the end of the scope’s text for it to be symbolicated.

Filters can have the following attributes:

match-start: A regular expression to match at the head of the scope’s text (including as a lookbehind)
match-end: A regular expression to match at the end of the scope’s text (including as a lookahead)
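
A minimal sketch of match-start, written as a hypothetical variation on the XML example above rather than the real XML grammar: here only tags whose text does not begin with <_ would generate symbols. The underscore convention is invented for illustration:

```xml
<scope name="xml.tag.open">
    <symbol type="tag">
        <!-- Hypothetical: skip tags whose name begins with an underscore -->
        <filter match-start="&lt;(?!_)" />
    </symbol>
    <starts-with>
        <expression>&lt;([a-zA-Z_][A-Za-zÀ-ÖØ-öø-ÿ0-9_:.-]*)</expression>
        <capture number="1" name="xml.tag.name" />
    </starts-with>
    …
</scope>
```

As in the match-end example, the angle bracket is written as &lt; because the pattern sits inside an XML attribute value.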

Grammar Symbolication Options

Certain symbolication behaviors can be controlled for the entire grammar by using the optional <symbols> element as a direct child of the grammar’s <syntax> element.

<syntax name="mylanguage">
    <meta>
        ...
    </meta>
    
    <symbols />
</syntax>

Controlling What “Local” Scope Means

Depending on the language, definition of a variable in “local” scope may mean different things. In Python, for example, defining a variable within an if statement’s block will make it visible for the entire outer construct (function, method, etc.). In other languages like C, however, this is not the case, and local variables have their scope restricted to the current block.

To control how the symbolication process defines the local scope, use the local element within the grammar’s <symbols> element, with a scope attribute using one of the following values:

within-parent: The symbol will be scoped to the innermost symbolic context parent (the deepest block, function, method, etc.) in which it is defined. This is the default behavior, and mirrors how languages like C and JavaScript behave.
within-construct: The symbol will be scoped to the innermost named symbolic construct (such as a function, method, or type) in which it is defined. If the symbol is defined within an anonymous block within that construct, it will behave as if it were defined outside of it. This is how languages like Python behave.

<syntax name="mylanguage">
    <meta>
        ...
    </meta>
    
    <symbols>
        <local scope="within-construct" />
    </symbols>
</syntax>

Controlling Symbol Redefinition Behaviors

Certain languages behave differently with regard to the redefinition of a symbol, such as a variable. With some languages, such as C, redefinition of a variable in a deeper scope (such as within an if block) will create a second symbol that is only valid within that scope. Other languages, like Python, treat a redefined variable as the same symbol, whose value changes and is reflected in outer scopes, since Python does not have a keyword separately denoting variable definition from assignment.

To control how the symbolication process determines how redefinition affects symbols, use the redefinition attribute of the grammar’s <symbols> element, with one of the following values:

distinct: Redefinition of a symbol in a deeper scope is treated as a new symbol. This is the default behavior, and mirrors how languages like C and JavaScript behave.
within-construct: Redefinition of a symbol in a deeper scope is treated as the same symbol, so long as the redefinition occurs within the same named symbolic construct (function, method, type, etc.). This mirrors how languages like Python behave.
non-distinct: Redefinition of a symbol in a deeper scope is treated as the same symbol no matter the scope of the original symbol.

<syntax name="mylanguage">
    <meta>
        ...
    </meta>
    
    <symbols redefinition="within-construct" />
</syntax>
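
Because one option is an attribute of <symbols> and the other is a child element, the two can presumably be combined in a single element. A sketch for a hypothetical Python-like language, using only the values described above:

```xml
<syntax name="mylanguage">
    <meta>
        ...
    </meta>

    <!-- Hypothetical combination: Python-like redefinition and local scoping -->
    <symbols redefinition="within-construct">
        <local scope="within-construct" />
    </symbols>
</syntax>
```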

Symbolic Contexts

Many symbols, such as classes, functions, and expression blocks, define regions of code in their respective language that are self-contained for the purposes of things such as variable resolution. This is most often called “scope” in procedural languages, but because that term is already heavily overloaded, it is referred to as Symbolic Context in the Nova parse engine.

Symbolic Contexts are a special behavior of symbols that allows them to easily define the boundaries of code blocks to power IDE features such as code folding, identifier resolution, and intelligent completion.

Using the <context> element within a <symbol> allows the symbol to describe to the parse engine exactly how to build a symbolic context starting, including, or ending with that symbol.

Consider this example of a JavaScript class:

<scope name="javascript.definition.class">
    <symbol type="class">
        <context behavior="subtree" />
    </symbol>
    <starts-with>
        <expression>\b(class)\b</expression>
        <capture number="1" name="javascript.keyword.class" />
    </starts-with>
    …
</scope>

The <context> element here defines that the symbol is a symbolic context. This enables features like code folding for it. The behavior attribute determines how the bounds of the text that is contained by the symbol is defined.

Subtree Contexts

If the behavior attribute is defined as subtree (the default), then the symbolic context is completely defined within the subtree of the current <scope> element. The contents of the text parsed within the <scope> element and its subscopes all form the region that is the symbolic context.

Whitespace Contexts

If the behavior is defined as whitespace, then the symbol starts a symbolic context which is then computed based on the whitespace of lines succeeding the symbol. This is most often used in languages such as Python, which use whitespace for block delineation.

There are several automatic rules when using this type of context:

Lines that contain no text other than whitespace are ignored
Lines must have the same relative number of spaces but not necessarily the same indentation. If a line is indented by 1 tab, and the next line by 4 spaces, for documents using a 4-width tab stop these will be considered part of the same symbolic context
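
As a sketch, a hypothetical Python-like function definition could start a whitespace context as follows. The scope names, capture names, and expression are illustrative only:

```xml
<scope name="mylang.definition.function">
    <symbol type="function">
        <!-- The context is computed from the indentation of the lines that follow -->
        <context behavior="whitespace" />
    </symbol>
    <starts-with>
        <expression>\b(def)\s+([a-zA-Z_][a-zA-Z0-9_]*)</expression>
        <capture number="1" name="mylang.keyword.def" />
        <capture number="2" name="mylang.identifier.function.name" />
    </starts-with>
    …
</scope>
```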

Start-Next-End Contexts

Complex symbolic contexts may be defined using multiple symbols. This is most often utilized when a single symbol, defined by a <scope> element, cannot fully express the boundary of a symbolic context, or when there are multiple parts to a symbolic context chained together.

For this, there are three values for the behavior attribute that define the boundaries of the context:

start: This symbol starts a symbolic context, but does not end it. This allows multiple symbols to define symbolic context between them
end: This symbol ends a symbolic context, but does not start it. This allows multiple symbols to define symbolic context between them
next: This symbol continues a symbolic context, but does not start nor end it. This allows multiple symbols to define symbolic context between them

The simplest combination of these is a single start and end symbol. This can be used to define a single symbolic context, much like using subtree, but with two symbols. The second symbol can be kept out of the symbols list by specifying its type as expression. For example, this allows tokens such as end to close the symbolic context of functions and classes in Ruby and other similar languages.

A more complex context can chain together multiple parts, such as the if-elseif-else clauses found in most procedural programming languages. In this example, the if expression would define a start symbol, elseif would define a next symbol, and else would define either an end symbol or a next symbol (depending on whether the language in question has an explicit token to end the chain, such as end).

Using these rules, if a symbol appears that is marked with a context behavior of start, the parser will automatically begin looking for symbols marked as next and end, and will combine them together. How they are combined may be controlled with the context options specified below.

The simplest way to ensure that the right expressions are always combined together properly is through the use of the group attribute on the <context> element. This defines a name for the context that will be used to match together start, next, and end symbols that appear in a sequence in the parse tree.

Additionally, the group-by-name attribute, set to true, may be used to automatically group together symbols in the same way using the symbol’s name instead of a constant string value.
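
The if-elseif-end chain described above might be sketched as follows, using the group attribute to tie the three symbols together. This assumes a hypothetical language with an explicit end token; the scope names and the group name conditional are invented for illustration:

```xml
<!-- Starts the chain -->
<scope name="mylang.keyword.if">
    <symbol type="expression">
        <context behavior="start" group="conditional" />
    </symbol>
    <expression>\bif\b</expression>
</scope>

<!-- Continues the chain without starting or ending it -->
<scope name="mylang.keyword.elseif">
    <symbol type="expression">
        <context behavior="next" group="conditional" />
    </symbol>
    <expression>\belseif\b</expression>
</scope>

<!-- Ends the chain -->
<scope name="mylang.keyword.end">
    <symbol type="expression">
        <context behavior="end" group="conditional" />
    </symbol>
    <expression>\bend\b</expression>
</scope>
```

All three symbols use the expression type here so that none of them would appear in the symbols list.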

Symbolic Context Options

The <context> element has several attributes that may be used to configure its behavior:

behavior: Defines how the symbolic context is constructed relative to the symbol’s scope element
group: Allows a “name” to be assigned for use with start-next-end contexts (see above)
group-by-name: Automatically groups symbols together by the computed name instead of a constant string value
priority: Allows an arbitrary integer value to be specified that determines a symbol’s priority in matching. This can be used to control the hierarchy of symbolic constructs in languages such as Markdown.
export-local: Whether locally-scoped variables defined within the symbolic context are exported into the global namespace for the purposes of indexing. Defaults to false; set to true to enable exporting.
unclosed: How a multi-part context should be closed if an end symbol is not encountered. Valid values are:
- extend-parent: Extend the symbolic context up until the point where its parent ends (or the document ends). This is the behavior for most procedural languages like JavaScript, and this is the default.
- extend-document: Always extend the symbolic context and all ancestors to the end of the document.
- truncate: Truncate the symbolic context back to the most recent start or next symbol, effectively ignoring the last part of the chain.
foldable: Controls whether code folding is enabled for a symbolic context. By default, code folding is enabled if a symbolic context extends across multiple lines, but this may be set to false to disable code folding for the context (but not necessarily any child contexts).
fold-type: Controls what type of symbol is used for code folding. By default, the symbol’s type is used for code folding, but this allows a different symbol to be used specifically for it, such as if a context is defined as an expression but should be folded along with functions when invoking “Fold All Functions”.
arguments: Controls whether arguments are automatically parsed from the symbol’s subtree for completions if the symbol defines a function or other type that includes arguments. By default, only symbols that are typed as functions and methods compute arguments.
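
As a sketch of the fold-type option, a hypothetical closure scope could be typed as an expression while still folding together with functions. The scope name is illustrative only:

```xml
<scope name="mylang.expression.closure">
    <symbol type="expression">
        <!-- Folded along with functions, for example by "Fold All Functions" -->
        <context behavior="subtree" fold-type="function" />
    </symbol>
    …
</scope>
```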

Auto-closing

Symbolic contexts have the ability to define an auto-closing behavior. This allows the context to be automatically closed if the user begins to type text that matches a specific expression.

An example of this behavior is the ability to automatically close the current HTML tag if a user types the </ expression. If the user were in a <div> tag, the editor would automatically finish out the expression as </div>.

This behavior is enabled and controlled by using the <auto-close> element within <context>.

Options for the auto-close element include:

string: the textual string that the user must type to invoke auto-closing dynamically
completion: the expression that will be inserted when auto-closing is invoked
indentation: whether indentation is adjusted when the auto-closing takes place. Valid values are:
- auto: Indentation should be adjusted using the syntax grammar’s indentation rules (the default)
- none: No indentation adjustment should be performed

The completion expression supports variable replacement syntax within ${} brackets. Available variables include:

name: The name of the current symbol
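
Putting these options together, the tag auto-close rule from the earlier XML example could be written with indentation adjustment turned off. This is an illustrative variation and not taken from a shipped grammar:

```xml
<context behavior="start" group-by-name="true">
    <!-- Typing "</" inserts the current symbol's name followed by ">" -->
    <auto-close string="&lt;/" completion="${name}&gt;" indentation="none" />
</context>
```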