Scopes

Scopes are parsing rules that form the building blocks from which a Syntax is constructed. Each scope defines how a specific type of token, block, or region of text is parsed. There are several types of scopes available in the parser, but virtually all work around the concept of using a simple regular expression (or pair of expression) to quantify and tokenize text.

Defining Scopes

Within a syntax grammar, scopes can be defined in several places:

Top-Level Scopes

Each syntax grammar will have a top-level <scopes> tag that defines its first level of scopes. When parsing of a document begins, these scopes are evaluated. As scopes match, they may cause the parser to enter a deeper state and reference other scopes (like collection scopes).

The top-level scopes are only evaluated when the parser is in its top-level state. As the parser pushes scope rules onto its stack (see start-end scopes, below), the top level scopes will not be referenced until the parse state “pops” back to its root, or they are explicitly referenced using a syntax-wide include scope (see includes, below).

Template Scopes

Template scopes are a special set of scopes which allow easy construction of template languages, like PHP and Jinja.

They behave very similarly to Top-Level Scopes, except that they are evaluated at every level of the parse tree during recursive parsing, and not just at the root of the tree. They are evaluated before any subscopes of the current scope. This allows template tags from these languages to be handled deeply within the language they wrap, such as PHP tags within HTML.

If you are not developing a template language that uses this type of tag, template scopes are probably unnecessary.

Collection Scopes

To make building syntax grammars easier and cleaner, scopes may be grouped logically into Collections. A syntax’s top level <collections> element contains reference to one or more collections, which in turn contain scopes that may be referenced elsewhere using an include scope (see below). One collection can easily include scopes from another collection, allowing for multiple levels of including for more complex syntaxes.

Types of Scopes

There are four main types of scope:

Scope Names

The first two types of scope (match and start-end) each require a name. The name of a scope is a set of components (identifiers) separated by period characters, such as mylang.identifier.keyword. The order of the components in the name does not matter, except for developer preference and readability.

Each identifier may consist of the ASCII alphanumeric set (a-z, A-Z, 0-9), the underscore, and the dash. They cannot contain spaces or any other special characters.

The name is most prominently used during syntax highlighting and theming, where theme rules are filtered based on if and how many components of the name they match.

As a general rule of thumb, names should include the syntax name in which they are defined, and stick to a predefined set of common components where they apply, which can be found in the Themes documentation.

Match Scopes

Match scopes are the simplest type. They define a regular expression (inside of an expression element) that is used as the basis for the resulting parse rule:

<scope name="mylang.keyword.let">
    <expression>let</expression>
</scope>

Match scopes support treating regular expression capture groups as a special type of sub-scope, called a Capture:

<scope name="mylang.identifier.variable">
    <expression>\b(let)\s+([a-zA-ZÀ-ÖØ-öø-ÿ_][A-Za-zÀ-ÖØ-öø-ÿ0-9_]*)</expression>
    <capture number="1" name="mylang.keyword.let" />
    <capture number="2" name="mylang.identifier.variable.name" />
</scope>

A capture element defines the regular expression capture group number (starting at 1, with 0 being the entire regular expression match), and can be referenced by name in the same way as scopes for syntax highlighting.

One note: Match scopes can only match content within the current line of the document being parsed. They cannot cause the parser to consume text within the next line. To parse text that encompasses multiple lines, use a Start-End scope instead. Match scopes can, however, see content on the previous or next line through the use of regular expression look-behinds and look-aheads, respectively.

Match scopes can be configured with additional options, such as spell-checking and symbolication. See Scope Options below.

Start-End Scopes

Start-End scopes define expressions for the beginning and end of a recursive parse rule. The starts-with element behaves in the same way as a Match scope, with the same support for capture groups.

Once the starting expression is matched, the parser pushes a new state onto its stack, and begins parsing using scopes defined within the scope’s ends-with and subscopes element. This element may contain any number of other scopes, including other Start-End scopes and Include scopes (see below).

The ends-with expression will take the highest priority during this state. If matched, the parser will end parsing the Start-End scope at this point, and pop its state from the parse stack, returning to the previous set of scopes that were being used before the starts-with expression was encountered.

<scope name="mylang.function">
    <starts-with>
        <expression>(function)\s*(\{)</expression>
        <capture number="1" name="mylang.function.keyword" />
        <capture number="2" name="mylang.bracket" />
    </starts-with>
    <ends-with>
        <expression>\}</expression>
        <capture number="0" name="mylang.bracket" />
    </ends-with>
    <subscopes>
        
    </subscopes>
</scope>

Start-End scopes can be configured with additional options, such as spell-checking and symbolication. See Scope Options below.

Subsyntaxes

Additionally, Start-End scopes can be used to define a fenced block of code, also known as a Subsyntax. When a subsyntax element is used in place of subscopes, the parser will automatically treat this as a fenced code block and take extra care to parse the subsyntax with this in mind. Subsyntax elements may contain Cut-off scopes (see below) to further instruct the parser on how to “break out” of the fenced code block should the code within be incomplete.

<scope name="mylang.fenced-code-block">
    <starts-with>
        <expression>\`\`\`</expression>
    </starts-with>
    <ends-with>
        <expression>\`\`\`</expression>
    </ends-with>
    <subsyntax name="otherlang">
        
    </subsyntax>
</scope>

Subsyntaxes are most often used, as mentioned before, for fenced code blocks. Examples of this behavior include:

Subsyntaxes are not generally recommended for simple inclusion of specific parts of another syntax (such as including parts of CSS inside of SCSS, for example). For this, you should use Include scopes instead.

The <subsyntax> element defines the name of the syntax to use within the fenced code block, which must be a validly registered syntax. The subsyntax should not reference the outer syntax recursively. Instead, consider using an include scope (see below) to include the syntax’s own rules within itself.

Unlike normal Start-End scopes, the ends-with expression for a subsyntax scope will be evaluated deeply in the fenced region of code, so that its expression can “break” out of the subsyntax early. An example of this is the use of a </script> tag for a JavaScript fenced code block in HTML. The end tag should be able to be matched even if the JavaScript code within is not fully complete and valid.

The one exception to this behavior is the use of atomic scopes (such as comments and strings). The ends-with expression will never be evaluated within these scopes (for more information on atomic scopes, see Scope Options below.

The <subsyntax> element has the following possible options:

String Expressions

As an alternative to using a regular expression, both match scopes and start-end scopes may optionally define its expression using a set of strings. This is very useful when the possible expressions being matched are from a known set of words or expressions:

<scope name="javascript.keyword">
    <strings>
        <string>await</string>
        <string>break</string>
        <string>case</string>
        <string>catch</string>
        <string>class</string>
        <string>const</string>
    </strings>
</scope>

Behind the scenes, the syntax engine will compile this word list into an optimized regular expression that will be used in much the same way as a normal match expression.

The <strings> element has several possible options:

Cut-Off Scopes

Cut-off scopes are used primarily for improving the performance of complex language definitions. When matched, a cut-off instructs the parser to stop evaluating the current start-end scope as if its ends-with expression had been encountered.

They are most often used for defining cases when known text should never be encountered within the current scope (such as encountering a class definition inside of a method’s implementation, assuming the language doesn’t support that).

Consider the following case:

<cut-off>
    <expression>(?=\b(?:class)\b)</expression>
</cut-off>

If this cut-off expression were placed within the subscopes of a method’s implementation, it would indicate that should the parser encounter the class token ahead of the current location, the method should immediately stop parsing, as the class definition is not valid here.

This type of “early cut-off” allows the parser to be far quicker at adjusting the parse tree when changes are made, as encountering this type of token would not cause the parser to continue parsing as if the method’s implementation was still open.

Include Scopes

Include scopes, along with Collections, are used to organize scopes into logical sets that can be reused in multiple places within a syntax definition.

When encountered, the parser will look up the collection referenced by the include, and evaluate the collection’s scopes as if they were defined in place of the include. An include scope can be used at any place other scope types are valid, including within the subscopes element of a start-end scope.

The most common use of an include is to reference a collection within the current syntax, through the use of the special self name:

<include syntax="self" collection="variables" />

Syntaxes that reference collections within themselves should always use self instead of the syntax name, as this allows the parser to properly handle cases where syntaxes inherit from one another and override collections.

An include scope can also be used to reference a collection within another syntax:

<include syntax="javascript" collection="comments" />

Doing so forms a dependency between the two syntaxes. If the referenced syntax is not available, the include will evaluate to an empty set. It is generally recommended to only use this behavior if both syntaxes are provided as part of the same extension. Additionally, relying on collections within built-in syntaxes should be avoided, as these may change at any time.

Finally, includes can also be used to include an entire syntax, including self:

<include syntax="self" />

This is effectively the same as including the special top-level scopes collection within the syntax. Care should be taken when doing this, as it can cause deep recursion within a syntax that can impact parsing performance.

Scope Options

Spell Checking

Spell checking is most often reserved for comments and prose within a language. When this is enabled, the editor will automatically perform spell checking using the user’s default language dictionary and highlight misspelled text.

By default, all scopes have spell checking set to “inherit” from their parent.

A scope can explicitly opt-in to spell checking by adding the spell-check="true" attribute to its <scope> tag. This value may also be set to "false" to disable spell checking in a scope when its closes non-inherited ancestors has enabled it.

A syntax may also include the spell-check attribute on its top-level <scopes> tag to enable spell checking for the entire syntax, after which individual scopes may disable it as needed. This is most often used for prose-heavy languages like Markdown and HTML.

Lookup

Lookup allows the user to perform the “define” gesture on their mouse or trackpad to invoke various actions. This gesture is most often a “deep-click” on force-touch trackpads or a three-finger tap on older trackpads and touch-enabled mice.

By default, all scopes have lookup set to “inherit” from their parent.

There are two types of lookup available to scopes:

By default, “index” lookups are used. A scope may set the lookup="dictionary" attribute on its <scope> tag to define that words within the scope should use dictionary lookup behavior.

A syntax may also include the lookup attribute on its top-level <scopes> tag to set the default lookup behavior for the entire syntax, after which individual scopes may change it as needed. This is most often used for prose-heavy languages like Markdown and HTML.

Atomic Scopes

Start-End scopes represent a recursive level of parsing downward. Certain types of other scopes, such as template scopes, cut-offs, and ends-with expressions can be referenced during this deeper level of recursion to “break out” of the current level of parsing.

However, for certain types of parsing, this behavior is not ideal. The best examples are comments and strings. When typing a JavaScript string within an HTML script tag, for example, you would not want the expression let string = "use a </script> tag"; to be able to “break out” of the JavaScript subsyntax. Therefore, by default, all comments and strings are marked as “atomic”. This means that deeper cut-off rules (like ends-with expressions) will not be evaluated within their level of the parse tree unless that cut-off is defined within that specific scope.

Other types of Start-End scopes may opt-into to being atomic by adding the atomic="true" attribute to their <scope> tag. However, it is generally rare for this to be used outside of commends and strings, which is why these two types of token are automatically set to atomic.

Symbolication

Building a tree of symbols in a text document is done through a process known as symbolication. Scopes help define the rules of symbolication directly, allowing the symbol tree to closely mirror the parse tree of the document.

The symbol tree is also used to power several IDE features, such as “Jump To Definition” and “Select All In Scope”.

Scopes define themselves as contributing to symbolication through the use of the <symbol> tag within their <scope> tag:

<scope name="mylang.function">
    <symbol type="function">
        <context behavior="subtree" />
    </symbol>
    <starts-with>
        <expression>(function)\s*(\{)</expression>
        <capture number="1" name="mylang.function.keyword" />
        <capture number="2" name="mylang.bracket" />
    </starts-with>
    <ends-with>
        <expression>\}</expression>
        <capture number="0" name="mylang.bracket" />
    </ends-with>
    <subscopes>
        
    </subscopes>
</scope>

More information on defining the <symbol> element can be found in Symbols.