Scopes

Scopes are parsing rules that form the building blocks from which a Syntax is constructed. Each scope defines how a specific type of token, block, or region of text is parsed. There are several types of scopes available in the parser, but virtually all work around the concept of using a simple regular expression (or pair of expressions) to identify and tokenize text.

Regular Expressions

See notes in the Syntax Grammar Regular Expressions section for more information on the features supported by regular expressions in syntax grammars.

Defining Scopes

Within a syntax grammar, scopes can be defined in several places:

Top-Level Scopes

Each syntax grammar will have a top-level <scopes> tag that defines its first level of scopes. When parsing of a document begins, these scopes are evaluated. As scopes match, they may cause the parser to enter a deeper state and reference other scopes (like collection scopes).

The top-level scopes are only evaluated when the parser is in its top-level state. As the parser pushes scope rules onto its stack (see start-end scopes, below), the top-level scopes will not be referenced until the parse state “pops” back to its root, or until they are explicitly referenced using a syntax-wide include scope (see includes, below).
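
As a minimal sketch of the layout described above, a top-level <scopes> element might look like the following. The scope name and the comments collection are hypothetical and only serve as an illustration:

<scopes>
    <!-- Evaluated only while the parser is in its top-level state -->
    <include syntax="self" collection="comments" />
    <scope name="mylang.keyword">
        <expression>\b(?:if|else|return)\b</expression>
    </scope>
</scopes>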

Template Scopes

Template scopes are a special set of scopes which allow easy construction of template languages, like PHP and Jinja.

They behave very similarly to Top-Level Scopes, except that they are evaluated at every level of the parse tree during recursive parsing, and not just at the root of the tree. They are evaluated before any subscopes of the current scope. This allows template tags from these languages to be handled deeply within the language they wrap, such as PHP tags within HTML.

If you are not developing a template language that uses this type of tag, template scopes are probably unnecessary.
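
A rough sketch of what a template scope could look like. It assumes the containing element is named <template-scopes> and sits alongside <scopes>; check the Syntaxes documentation for the exact element name. The tag delimiters and scope name are hypothetical:

<template-scopes>
    <!-- Tried at every level of the parse tree, before the current scope's subscopes -->
    <scope name="mytemplate.tag">
        <starts-with>
            <expression>\{%</expression>
        </starts-with>
        <ends-with>
            <expression>%\}</expression>
        </ends-with>
        <subscopes />
    </scope>
</template-scopes>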

Collection Scopes

To make building syntax grammars easier and cleaner, scopes may be grouped logically into Collections. A syntax’s top level <collections> element contains reference to one or more collections, which in turn contain scopes that may be referenced elsewhere using an include scope (see below). One collection can easily include scopes from another collection, allowing for multiple levels of including for more complex syntaxes.
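
The following sketch shows two collections, one of which includes the other. The collection and scope names are hypothetical:

<collections>
    <collection name="comments">
        <scope name="mylang.comment.line">
            <expression>//.*$</expression>
        </scope>
    </collection>
    <collection name="values">
        <!-- One collection pulling in the scopes of another -->
        <include syntax="self" collection="comments" />
        <scope name="mylang.value.number">
            <expression>\b\d+\b</expression>
        </scope>
    </collection>
</collections>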

Types of Scopes

There are four main types of scope: Match, Start-End, Cut-Off, and Include.

Scope Names

The first two types of scope (match and start-end) each require a name. The name of a scope is a set of components (identifiers) separated by period characters, such as mylang.identifier.keyword. The order of the components in the name does not matter, except for developer preference and readability.

Each identifier may consist of the ASCII alphanumeric set (a-z, A-Z, 0-9), the underscore, and the dash. They cannot contain spaces or any other special characters.

The name is most prominently used during syntax highlighting and theming, where theme rules are filtered based on whether, and how many, components of the name they match.

As a general rule of thumb, names should include the syntax name in which they are defined, and stick to a predefined set of common components where they apply, which can be found in the Themes documentation.

Match Scopes

Match scopes are the simplest type. They define a regular expression (inside of an expression element) that is used as the basis for the resulting parse rule:

<scope name="mylang.keyword.let">
    <expression>let</expression>
</scope>

Match scopes support treating regular expression capture groups as a special type of sub-scope, called a Capture:

<scope name="mylang.identifier.variable">
    <expression>\b(let)\s+([a-zA-ZÀ-ÖØ-öø-ÿ_][A-Za-zÀ-ÖØ-öø-ÿ0-9_]*)</expression>
    <capture number="1" name="mylang.keyword.let" />
    <capture number="2" name="mylang.identifier.variable.name" />
</scope>

A capture element defines the regular expression capture group number (starting at 1, with 0 being the entire regular expression match), and can be referenced by name in the same way as scopes for syntax highlighting.

One note: Match scopes can only match content within the current line of the document being parsed. They cannot cause the parser to consume text within the next line (or consume the newline itself). To parse text that encompasses multiple lines, use a Start-End scope instead. Match scopes can, however, see content on the previous or next line through the use of regular expression look-behinds and look-aheads, respectively.
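
For instance, a look-ahead lets a match scope depend on the text that follows without consuming it. The sketch below (scope name hypothetical) matches an identifier only when it is followed by an opening parenthesis, leaving the parenthesis itself for another scope to match:

<scope name="mylang.identifier.function">
    <!-- The (?=\s*\() look-ahead checks for "(" but does not consume it -->
    <expression>\b[a-zA-Z_][a-zA-Z0-9_]*(?=\s*\()</expression>
</scope>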

Match scopes can be configured with additional options, such as spell-checking and symbolication. See Scope Options below.

Start-End Scopes

Start-End scopes define expressions for the beginning and end of a recursive parse rule. The starts-with element behaves in the same way as a Match scope, with the same support for capture groups.

Once the starting expression is matched, the parser pushes a new state onto its stack, and begins parsing using scopes defined within the scope’s ends-with and subscopes elements (or, alternatively, a subsyntax element). The subscopes element may contain any number of other scopes, including other Start-End scopes and Include scopes (see below).

The ends-with expression will take the highest priority during this state. If matched, the parser will end parsing the Start-End scope at this point, and pop its state from the parse stack, returning to the previous set of scopes that were being used before the starts-with expression was encountered.

<scope name="mylang.function">
    <starts-with>
        <!-- Opening bracket -->
        <expression>(\[)</expression>
        <capture number="1" name="mylang.bracket" />
    </starts-with>
    <ends-with>
        <!-- Closing bracket -->
        <expression>(\])</expression>
        <capture number="1" name="mylang.bracket" />
    </ends-with>
    <subscopes>
        <scope name="mylang.number">
            <!-- Matches a number -->
            <expression>\d+</expression>
        </scope>
        <scope name="mylang.boolean">
            <!-- Matches a boolean keyword -->
            <expression>true|false</expression>
        </scope>
        <scope name="mylang.string">
            <!-- Matches text between double-quotes -->
            <expression>&quot;[^&quot;]*&quot;</expression>
        </scope>
    </subscopes>
</scope>

Start-End scopes can be configured with additional options, such as spell-checking and symbolication. See Scope Options below.

Anchored vs. Unanchored Parsing

The subscopes element of a start-end scope contains the scopes that will be used for matching after the scope is pushed onto the parser’s stack.

Unanchored Parsing

By default, subscopes behave just like the top-level <scopes> array: They are matched repeatedly ahead of the current parse position to the end of the line the parser is parsing, and any matches are combined, filtered, and prioritized based on which matched first, which intersect, etc. This is called Unanchored Parsing, as matches in the line are not anchored to any specific location in the line range being parsed.

In unanchored parsing, the order of matches doesn’t matter, so long as they can be sorted and prioritized properly. This is ideal for parsing where the order of tokens isn’t particularly important outside of the start and end expressions of a scope, such as attributes within HTML tags.

Anchored Parsing

New in Nova 4.

Alternatively, a start-end scope may define its <subscopes> element using the anchored="true" attribute. By doing so, the parser will instead use a different parsing method, known as Anchored Parsing. When anchored parsing is used, the scope’s subscopes will be required to match in order, and may (by default) only match once each. This allows parser rules to be defined which will match specific procedural constructs using multiple subscopes, allowing for more expressive (and accurate) grammars to be constructed for certain types of language tokens.

If a subscope should only be matched conditionally, it can be annotated using the optional="true" attribute. By marking a subscope as optional, it will be attempted at the position within the array of subscopes but will not be required to match. If it does not, the parser will continue on to the next subscope (if any).

If a subscope should be attempted multiple times, it can be annotated using the repeat="true" attribute. By marking a subscope as repeating, it will be attempted continuously until it does not match. Combining both optional="true" and repeat="true" allows a subscope to match zero or more times.
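
A sketch of how these attributes might combine inside an anchored subscopes element. It assumes optional and repeat can be placed on a <scope> in the same way as on an include; all scope names are hypothetical:

<subscopes anchored="true">
    <!-- Zero or more modifiers: optional and repeating -->
    <scope name="mylang.keyword.modifier" optional="true" repeat="true">
        <expression>\b(?:public|static|final)\b</expression>
    </scope>
    <!-- Required: must match once, after any modifiers -->
    <scope name="mylang.identifier.type">
        <expression>\b[A-Z][a-zA-Z0-9_]*\b</expression>
    </scope>
    <!-- Required: must match once, after the type -->
    <scope name="mylang.identifier.name">
        <expression>\b[a-z_][a-zA-Z0-9_]*\b</expression>
    </scope>
</subscopes>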

If at any point the parser encounters a token that is not expected while performing subscope matching, the scope will be immediately ended as if a <cut-off> scope had been encountered, and the parser will pop back to the previous scope level.

Additionally, when using anchored parsing, it is not necessary to specify an expression within the ends-with element (although the element itself should still be included as a self-closing tag). Since the parser can determine a specific rule for when the scope should end, it can do so automatically. If an ends-with expression is provided, it will be matched only after all subscopes are matched or optionally ignored.

Finally, when using anchored parsing, whitespace will be automatically consumed between subscopes that do not otherwise match it. This allows parse rules to be written without needing to worry about whitespace. However, if the presence of whitespace preceding or succeeding a subscope is important, this behavior can be disabled by setting skip-whitespace="false" on the subscopes element of the containing scope.
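
As an illustration of disabling whitespace skipping, the hypothetical subscopes below would require a decorator name to follow the @ symbol directly, with no whitespace in between:

<subscopes anchored="true" skip-whitespace="false">
    <scope name="mylang.decorator.symbol">
        <expression>@</expression>
    </scope>
    <!-- Must begin exactly where the previous subscope ended -->
    <scope name="mylang.decorator.name">
        <expression>[a-zA-Z_][a-zA-Z0-9_]*</expression>
    </scope>
</subscopes>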

Consider the following example:

<scope name="mylang.function">
    <starts-with>
        <!-- Matches the form 'function foobar' -->
        <expression>(function)\s+([a-zA-Z0-9_]+)</expression>
        <capture number="1" name="mylang.function.keyword" />
        <capture number="2" name="mylang.identifier.name" />
    </starts-with>
    <ends-with />
    <subscopes anchored="true">
        <!-- Matches a function arguments list -->
        <scope name="mylang.function.arguments">
            <starts-with>
                <expression>\(</expression>
                <capture number="0" name="mylang.function.arguments.bracket" />
            </starts-with>
            <ends-with>
                <expression>\)</expression>
                <capture number="0" name="mylang.function.arguments.bracket" />
            </ends-with>
            <subscopes>
                <!-- ... -->
            </subscopes>
        </scope>
        
        <!-- Match comments (optionally) -->
        <include syntax="self" collection="comments" optional="true" />
        
        <!-- Matches a function body -->
        <scope name="mylang.function.body">
            <starts-with>
                <expression>\{</expression>
                <capture number="0" name="mylang.function.body.bracket" />
            </starts-with>
            <ends-with>
                <expression>\}</expression>
                <capture number="0" name="mylang.function.body.bracket" />
            </ends-with>
            <subscopes>
                <!-- ... -->
            </subscopes>
        </scope>
    </subscopes>
</scope>

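In this example, the scope defines rules for parsing a JavaScript-like function definition. If the parser matches text of the form function <function-name>, the parser will begin performing anchored parsing with its subscopes. It will then attempt the following, in order: match an arguments list enclosed in parentheses (required), match any scope from the comments collection (optional), and match a function body enclosed in braces (required). Because the ends-with element is empty, the scope ends automatically once the function body has been matched.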

Back-Referencing Between Start and End Expressions

If the closing expression of a start-end scope (defined by <ends-with>) is somehow dependent on the starts-with expression, you can opt to use regular expression capture group references which resolve to the capture groups of the starts-with expression.

To use capture group references in ends-with, the <expression> tag should be replaced with a <template> tag. This instructs the parser to resolve capture group references inside of the template expression before compiling its regular expression.

Capture group references use the “backslash” format, just as regular expression back-references do, such as \x, where x is the capture group number.

<scope name="mylang.tag">
    <starts-with>
        <expression>&lt;([a-zA-Z0-9_]+)&gt;</expression>
        <capture number="1" name="mylang.tag.name" />
    </starts-with>
    <ends-with>
        <template>&lt;/\1&gt;</template>
        <capture number="0" name="mylang.tag.name" />
    </ends-with>
    <subscopes />
</scope>

Subsyntaxes

Alternatively, Start-End scopes can be used to define a fenced block of code, also known as a Subsyntax. When a subsyntax element is used in place of subscopes, the parser will automatically treat this as a fenced code block and take extra care to parse the subsyntax with this in mind. Subsyntax elements may contain Cut-off scopes (see below) to further instruct the parser on how to “break out” of the fenced code block should the code within be incomplete.

<scope name="mylang.fenced-code-block">
    <starts-with>
        <expression>\`\`\`</expression>
    </starts-with>
    <ends-with>
        <expression>\`\`\`</expression>
    </ends-with>
    <subsyntax name="otherlang">
        
    </subsyntax>
</scope>

Subsyntaxes are most often used, as mentioned before, for fenced code blocks, such as the contents of a <script> tag in HTML, where the embedded JavaScript is parsed using its own syntax.

Subsyntaxes are not generally recommended for simple inclusion of specific parts of another syntax (such as including parts of CSS inside of SCSS, for example). For this, you should use Include scopes instead.

The <subsyntax> element defines the name of the syntax to use within the fenced code block, which must be a validly registered syntax. The subsyntax should not reference the outer syntax recursively. Instead, consider using an include scope (see below) to include the syntax’s own rules within itself.

Unlike normal Start-End scopes, the ends-with expression for a subsyntax scope will be evaluated deeply in the fenced region of code, so that its expression can “break” out of the subsyntax early. An example of this is the use of a </script> tag for a JavaScript fenced code block in HTML. The end tag should be able to be matched even if the JavaScript code within is not fully complete and valid.

The one exception to this behavior is the use of atomic scopes (such as comments and strings). The ends-with expression will never be evaluated within these scopes (for more information on atomic scopes, see Scope Options below).

The <subsyntax> element has the following possible options:

String Expressions

As an alternative to using a regular expression, both match scopes and start-end scopes may optionally define their expressions using a set of strings. This is very useful when the possible expressions being matched are from a known set of words or expressions:

<scope name="javascript.keyword">
    <strings>
        <string>await</string>
        <string>break</string>
        <string>case</string>
        <string>catch</string>
        <string>class</string>
        <string>const</string>
    </strings>
</scope>

Behind the scenes, the syntax engine will compile this word list into an optimized regular expression that will be used in much the same way as a normal match expression.
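
Conceptually, the word list above behaves much like the hand-written alternation below. This is only an illustration: the expression the engine actually generates is not specified here, and whether word boundaries are applied is assumed to depend on the options of the <strings> element.

<scope name="javascript.keyword">
    <!-- Hand-written approximation of the compiled word list -->
    <expression>\b(?:await|break|case|catch|class|const)\b</expression>
</scope>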

The <strings> element has several possible options:

Cut-Off Scopes

Cut-off scopes are used primarily for improving the performance of complex language definitions. When matched, a cut-off instructs the parser to stop evaluating the current start-end scope as if its ends-with expression had been encountered.

They are most often used for defining cases when known text should never be encountered within the current scope (such as encountering a class definition inside of a method’s implementation, assuming the language doesn’t support that).

Consider the following case:

<cut-off>
    <expression>(?=\b(?:class)\b)</expression>
</cut-off>

If this cut-off expression were placed within the subscopes of a method’s implementation, it would indicate that should the parser encounter the class token ahead of the current location, the method should immediately stop parsing, as the class definition is not valid here.

This type of “early cut-off” allows the parser to be far quicker at adjusting the parse tree when changes are made, as encountering this type of token would not cause the parser to continue parsing as if the method’s implementation was still open.
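
To show the cut-off in context, here is a sketch of a method scope that contains it. The func and end keywords, the scope names, and the statements collection are all hypothetical:

<scope name="mylang.method">
    <starts-with>
        <expression>\b(func)\b</expression>
        <capture number="1" name="mylang.keyword.func" />
    </starts-with>
    <ends-with>
        <expression>\b(end)\b</expression>
        <capture number="1" name="mylang.keyword.end" />
    </ends-with>
    <subscopes>
        <!-- Stop parsing the method if a class definition appears ahead -->
        <cut-off>
            <expression>(?=\b(?:class)\b)</expression>
        </cut-off>
        <include syntax="self" collection="statements" />
    </subscopes>
</scope>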

Include Scopes

Include scopes, along with Collections, are used to organize scopes into logical sets that can be reused in multiple places within a syntax definition.

When encountered, the parser will look up the collection referenced by the include, and evaluate the collection’s scopes as if they were defined in place of the include. An include scope can be used at any place other scope types are valid, including within the subscopes element of a start-end scope.

The most common use of an include is to reference a collection within the current syntax, through the use of the special self name:

<include syntax="self" collection="variables" />

Syntaxes that reference collections within themselves should always use self instead of the syntax name, as this allows the parser to properly handle cases where syntaxes inherit from one another and override collections.
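
As a short sketch, includes can stand in for inline scopes within the subscopes of a start-end scope. The block scope and the comments collection used here are hypothetical:

<scope name="mylang.block">
    <starts-with>
        <expression>\{</expression>
    </starts-with>
    <ends-with>
        <expression>\}</expression>
    </ends-with>
    <subscopes>
        <!-- Each include is evaluated as if the collection's scopes were written here -->
        <include syntax="self" collection="comments" />
        <include syntax="self" collection="variables" />
    </subscopes>
</scope>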

An include scope can also be used to reference a collection within another syntax:

<include syntax="javascript" collection="comments" />

Doing so forms a dependency between the two syntaxes. If the referenced syntax is not available, the include will evaluate to an empty set. It is generally recommended to only use this behavior if both syntaxes are provided as part of the same extension. Additionally, relying on collections within built-in syntaxes should be avoided, as these may change at any time.

Finally, includes can also be used to include an entire syntax, including self:

<include syntax="self" />

This is effectively the same as including the special top-level scopes collection within the syntax. Care should be taken when doing this, as it can cause deep recursion within a syntax that can impact parsing performance.

Scope Options

Spell Checking

Spell checking is most often reserved for comments and prose within a language. When this is enabled, the editor will automatically perform spell checking using the user’s default language dictionary and highlight misspelled text.

By default, all scopes have spell checking set to “inherit” from their parent.

A scope can explicitly opt in to spell checking by adding the spell-check="true" attribute to its <scope> tag. This value may also be set to "false" to disable spell checking in a scope when its closest non-inheriting ancestor has enabled it.

A syntax may also include the spell-check attribute on its top-level <scopes> tag to enable spell checking for the entire syntax, after which individual scopes may disable it as needed. This is most often used for prose-heavy languages like Markdown and HTML.
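
A sketch of a prose-heavy syntax that enables spell checking for everything and then opts out for inline code. The scope name and its expression are hypothetical:

<scopes spell-check="true">
    <!-- Inline code should not be flagged as misspelled prose -->
    <scope name="mylang.code.inline" spell-check="false">
        <expression>`[^`]*`</expression>
    </scope>
</scopes>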

Lookup

Lookup allows the user to perform the “define” gesture on their mouse or trackpad to invoke various actions. This gesture is most often a “deep-click” on force-touch trackpads or a three-finger tap on older trackpads and touch-enabled mice.

By default, all scopes have lookup set to “inherit” from their parent.

There are two types of lookup available to scopes: index and dictionary.

By default, “index” lookups are used. A scope may set the lookup="dictionary" attribute on its <scope> tag to define that words within the scope should use dictionary lookup behavior.

A syntax may also include the lookup attribute on its top-level <scopes> tag to set the default lookup behavior for the entire syntax, after which individual scopes may change it as needed. This is most often used for prose-heavy languages like Markdown and HTML.
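
For example, a block comment scope might opt in to dictionary lookup as sketched below. The scope name and comment delimiters are hypothetical:

<scope name="mylang.comment.block" lookup="dictionary">
    <starts-with>
        <expression>/\*</expression>
    </starts-with>
    <ends-with>
        <expression>\*/</expression>
    </ends-with>
    <subscopes />
</scope>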

Atomic Scopes

Start-End scopes represent a recursive level of parsing downward. Certain types of other scopes, such as template scopes, cut-offs, and ends-with expressions can be referenced during this deeper level of recursion to “break out” of the current level of parsing.

However, for certain types of parsing, this behavior is not ideal. The best examples are comments and strings. When typing a JavaScript string within an HTML script tag, for example, you would not want the expression let string = "use a </script> tag"; to be able to “break out” of the JavaScript subsyntax. Therefore, by default, all comments and strings are marked as “atomic”. This means that deeper cut-off rules (like ends-with expressions) will not be evaluated within their level of the parse tree unless that cut-off is defined within that specific scope.

Other types of Start-End scopes may opt into being atomic by adding the atomic="true" attribute to their <scope> tag. However, it is generally rare for this to be used outside of comments and strings, which is why these two types of token are automatically set to atomic.
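
As a sketch of opting in, a regular expression literal is one case where outer ends-with expressions and cut-offs should arguably not apply. The scope names and expressions below are hypothetical:

<scope name="mylang.regex" atomic="true">
    <starts-with>
        <expression>/(?![/*])</expression>
    </starts-with>
    <ends-with>
        <expression>/</expression>
    </ends-with>
    <subscopes>
        <!-- Escaped characters, including an escaped slash -->
        <scope name="mylang.regex.escape">
            <expression>\\.</expression>
        </scope>
    </subscopes>
</scope>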

Symbolication

Building a tree of symbols in a text document is done through a process known as symbolication. Scopes help define the rules of symbolication directly, allowing the symbol tree to closely mirror the parse tree of the document.

The symbol tree is also used to power several IDE features, such as “Jump To Definition” and “Select All In Scope”.

Scopes define themselves as contributing to symbolication through the use of the <symbol> tag within their <scope> tag:

<scope name="mylang.function">
    <symbol type="function">
        <context behavior="subtree" />
    </symbol>
    <starts-with>
        <expression>(function)\s*(\{)</expression>
        <capture number="1" name="mylang.function.keyword" />
        <capture number="2" name="mylang.bracket" />
    </starts-with>
    <ends-with>
        <expression>\}</expression>
        <capture number="0" name="mylang.bracket" />
    </ends-with>
    <subscopes>
        
    </subscopes>
</scope>

More information on defining the <symbol> element can be found in Symbols.