Syntaxes

A syntax grammar can be used to add a new syntax language to the editor for parsing files.

Syntax grammars are XML documents defined within the Syntaxes/ folder of an extension. The name of the file does not matter, but it most often reflects the name of the syntax itself (such as HTML.xml for an HTML syntax).

Syntaxes declare a top-level <syntax> tag that contains the entirety of the rules for the syntax.

The syntax should define an attribute name that is the reference identifier for the syntax, such as "html". This identifier must consist only of ASCII alphanumeric characters (a-z, A-Z, 0-9), the underscore, and the dash. Any other characters are invalid in syntax names. Identifiers are most often written in lower case for consistency with other syntaxes.

A syntax may also define the subsyntax="true" attribute to denote that it is intended for use only as a subsyntax. A subsyntax is not shown in UI elements and cannot be selected by the user as the parse language for a document. Instead, subsyntaxes can be referenced by other syntaxes in their Start-End and Include scopes.
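As a minimal sketch of the outer structure described above, a grammar file might begin like this. The identifier example-lang is hypothetical, and the XML declaration line is an assumption rather than a stated requirement.

<?xml version="1.0" encoding="UTF-8"?>
<syntax name="example-lang">
    <!-- meta, detectors, scopes, and other elements go here -->
</syntax>

A grammar intended only for embedding in other syntaxes would add the subsyntax attribute to the same tag (again with a hypothetical identifier):

<syntax name="example-embedded" subsyntax="true">
    <!-- rules referenced from other syntaxes -->
</syntax>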

There are several elements contained within this top-level tag that define features of the syntax:

Meta Properties

The <meta> element contains metadata about the syntax, including its identifying name, user-readable name, and language category.

The full set of elements that may be contained within the meta property are:

User-Readable Name

The <name> element contains the user-readable name of the syntax, which is displayed in user interface elements, such as “HTML”.

Syntax Type

The <type> element contains the category in which the syntax should be considered. This determines how documents using the syntax are presented to the user, including accent colors in the document’s tab.

Valid types are:

Preferred File Extension

The <preferred-file-extension> element contains the file extension used by default for new documents using the syntax, such as html for HTML files.

Parent

The <parent> element defines that the syntax is a conceptual “child” of another syntax. This is not used in parsing, but for IDE features that may restrict certain items by syntax, such as Clips.

Consider using this in the definition for TypeScript (a superset of JavaScript):

<parent>javascript</parent>

By declaring a syntax like TypeScript as having the parent JavaScript, any features of the IDE restricted to JavaScript (such as specific Clips) will also be valid within TypeScript documents.

It is generally allowed to reference built-in syntaxes when defining parents of extension-provided syntaxes, as the names of built-in syntaxes are unlikely to change in the future.

Scriptable

The <scriptable> element declares that a syntax can be used for scripting in shells and executed externally. This enables the language to be used in the IDE’s Tasks UI.

The scriptable tag can declare an attribute shebang, which defines the shebang expression inserted at the top of the script when used in Tasks.

<scriptable shebang="#!/usr/bin/env python" />
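Putting the elements above together, a <meta> block for a hypothetical JavaScript superset might look like the following sketch. All values are placeholders, the nesting of each element directly inside <meta> is inferred from this section, and the <type> element is left out because its valid values are not reproduced here.

<meta>
    <!-- Hypothetical values, for illustration only -->
    <name>Example Script</name>
    <preferred-file-extension>exmpl</preferred-file-extension>
    <parent>javascript</parent>
    <scriptable shebang="#!/usr/bin/env example" />
</meta>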

Detectors

Sets of Detectors are used when files are opened in the editor, and determine which syntax is used for the document based on a set of definable rules that evaluate to a “score”. Each syntax’s detectors are evaluated against the document being opened, and whichever set scores the “highest” will determine the language used.

Detectors are declared within the <detectors> element of the syntax.

There are four types of detectors available:

Syntaxes may declare multiple detectors within their <detectors> element. Each detector will be evaluated in succession to determine if a file matches, and will not affect one another’s score.

Detector Priorities

Each detector can define a “priority” as a value between 0.0 and 1.0. This value can be used to define the confidence the detector has that the file being evaluated should match its syntax.

Most often, detectors will use a priority of 1.0 (especially if matching based on filename or file extension). However, when matching content, it’s often useful to declare priorities lower than 1.0 to prevent over-ambitious matching of documents.
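For instance, a syntax might pair a high-confidence extension detector with a lower-confidence content detector. This is only a sketch: the extension exmpl, the matched directive, and the priority values are hypothetical.

<detectors>
    <!-- The file extension is a strong signal -->
    <extension priority="1.0">exmpl</extension>
    <!-- A content match alone is a weaker signal -->
    <match-content lines="5" priority="0.4">\buse\s+example\b</match-content>
</detectors>

A file with the exmpl extension would match at full confidence, while any other file containing the directive in its first five lines would match at a lower score that a more specific syntax could outrank.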

File Extension Detectors

File extension detectors will match a document based on one or more of its file extension components.

The text of the element may be a comma-separated list to include multiple possible extensions, and should not include the leading period character.

<extension priority="1.0">md,mdown,markdown</extension>

This defines a detector that matches several types of Markdown documents.

If a document has a compound file extension (one made up of multiple components), such as myscript.min.js, the combined extension (min.js) will be evaluated first, followed by each file extension component (min, js).
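Because the combined extension is checked first, a syntax could in principle target a compound extension directly. The sketch below assumes a detector may list the combined form as written; the choice of min.js is purely illustrative.

<extension priority="1.0">min.js</extension>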

Filename Detectors

Filename detectors will match a document based on one or more predefined filenames.

They are often useful when documents of a certain type may not have a file extension, or when a specific file that uses a common extension should use a different syntax.

The text of the element may be a comma-separated list to include multiple possible filenames.

<filename priority="1.0">Gemspec</filename>

This defines a detector that matches a file named Gemspec.
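Since the element text may be a comma-separated list, one detector can cover several conventional filenames. The filenames below are illustrative choices, not taken from a built-in syntax.

<filename priority="1.0">Makefile,GNUmakefile</filename>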

Content Match Detectors

Content match detectors will match the text of a document using a regular expression.

They are often used for documents that may share the same file extension or filename across multiple languages, such as HTML documents that contain template tags for various template engines (such as Django, ERB, and Jinja).

Content match detectors may define a lines attribute that determines the number of lines from the top of text that should be evaluated. By default, this is unbounded, but if the content being matched is guaranteed to be within a certain number of lines it is advised to include this to improve performance of detection in long documents.

<match-content lines="2" priority="0.3">&lt;(?i:\!DOCTYPE)</match-content>

This defines a detector that finds the prefix of an HTML <!DOCTYPE declaration, matched case-insensitively, in the first two lines of text.

Content match detectors are also very useful for detecting shebang declarations in various types of scripts, such as Perl, Python, and Ruby.

<match-content lines="1" priority="0.7">\#\!.*?\bpython\b</match-content>

This defines a detector that finds a shebang declaration that contains the word python.

Compound Detectors

Compound detectors can be used to combine multiple other detectors into a single rule. This is most often useful when you wish to restrict certain detectors together, such as only matching a content match detector when a specific filename or file extension also matches.

A compound detector can define its own priority, which overrides the priorities of its children. If it does not, the children’s individual scores are averaged together.

<combo priority="0.7">
    <extension>html</extension>
    <match-content>{% [a-zA-Z]</match-content>
</combo>

This defines a compound detector that matches documents with an html file extension that contain the start of Django/Jinja-style template tags.
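When the <combo> element declares no priority of its own, the children’s scores are averaged. As an illustrative sketch with hypothetical values:

<combo>
    <extension priority="1.0">html</extension>
    <match-content lines="20" priority="0.6">&lt;\?php</match-content>
</combo>

If both children match, averaging 1.0 and 0.6 would give this rule a score of 0.8.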

Indentation Rules

The <indentation> element defines rules for automatically adjusting indentation as the user types.

These regular expressions are evaluated automatically at specific points as the user types to determine when indentation should increase or decrease. They are most often evaluated when pressing Return to determine the indentation of the line being inserted.

<indentation>
    <increase>
        <expression>(\{[^}\"']*$)|(\[[^\]\"']*$)|(\([^)\"']*$)</expression>
    </increase>
    <decrease>
        <expression>^\s*(\s*/\*.*\*/\s*)*[\}\]\)\\]</expression>
    </decrease>
</indentation>

This defines a set of rules that increase indentation after JavaScript-style opening bracket and comment tokens, and decrease indentation when typing JavaScript-style closing bracket and comment tokens.

Matching Increase

If this expression matches the current line just before the user presses Return, the succeeding line being inserted will be automatically indented one level.
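As a sketch for a hypothetical language in which blocks open with a trailing colon (similar to Python), an increase rule could look like this. The expression is an assumption for illustration, and the sketch also assumes the <decrease> element may be omitted.

<indentation>
    <increase>
        <!-- A colon followed only by optional whitespace at the end of the line -->
        <expression>:\s*$</expression>
    </increase>
</indentation>

With this rule, pressing Return at the end of a line such as "if ready:" would indent the new line one level.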

Matching Decrease

If this expression matches the current line when the user types text matching the expression, the current line will be automatically dedented one level.

Matching Both Increase and Decrease

If both expressions match the current line when the user presses Return, the succeeding line will be indented, and an additional line will be inserted afterward that is dedented to the same level as the original line. The cursor will be placed on the middle line.

Comment Rules

The <comments> element of the syntax defines the rules for commenting and uncommenting text within a document. There are individual elements within to define rules for both single-line and multi-line commenting.

<comments>
    <single>
        <expression>//</expression>
    </single>
    <multiline>
        <starts-with>
            <expression>/*</expression>
        </starts-with>
        <ends-with>
            <expression>*/</expression>
        </ends-with>
    </multiline>
</comments>

The expression values are text that will be wrapped around text being commented, or detected and removed from text being uncommented.
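For a language that only has line comments, the rules might reduce to a single element. This sketch assumes the <multiline> element may be omitted, and uses # as a hypothetical comment marker.

<comments>
    <single>
        <expression>#</expression>
    </single>
</comments>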

Brackets

The <brackets> element defines the set of characters that should be treated as brackets by the editor when performing bracket matching, bracket highlighting, and bracket auto-closing.

<brackets>
    <pair open="{" close="}" />
    <pair open="[" close="]" />
    <pair open="(" close=")" />
</brackets>

This defines three pairs, one for each of the common types of bracket used in procedural languages.

Pairs may be defined for any two sets of characters, as long as the characters are each one UTF-16 codepoint in length.
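Because grammars are XML documents, characters that are special in XML need to be written as entities inside attribute values. The following sketch adds angle brackets as a pair, as a language with generics might want; whether that suits a given language is a judgment call.

<brackets>
    <pair open="{" close="}" />
    <!-- The characters < and > written as XML entities -->
    <pair open="&lt;" close="&gt;" />
</brackets>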

Surrounding Pairs

The <surrounding-pairs> element defines the set of characters, which are most often varying types of brackets and quotes, that should be treated as pairs by the editor when performing wrapping of selected text as well as inserting and consuming pairs during typing.

Surrounding pairs are treated as separate from brackets (see above) due to the greater number of pairs that are likely needed to be supported by languages.

<surrounding-pairs>
    <pair open="{" close="}" />
    <pair open="[" close="]" />
    <pair open="(" close=")" />
    <pair open="'" close="'" />
    <pair open="&quot;" close="&quot;" />
    <pair open="`" close="`" />
</surrounding-pairs>

This defines six pairs, one for each of the common types of bracket and quote used in procedural languages.

Pairs may be defined for any two sets of characters, as long as the characters are each one UTF-16 codepoint in length.

Scopes

Each syntax grammar will have a top-level <scopes> element that defines its first level of Scopes. When parsing of a document begins, these scopes are evaluated. As scopes match, they may cause the parser to enter a deeper state and reference other scopes (like collection scopes).

The top-level scopes are only evaluated when the parser is in its top-level state. As the parser pushes scope rules onto its stack (see start-end scopes), the top level scopes will not be referenced until the parse state “pops” back to its root, or they are explicitly referenced using a syntax-wide include scope.
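A minimal top-level <scopes> element might look like the following sketch. The scope form mirrors the keyword scope shown under Collections; the scope name and strings are hypothetical.

<scopes>
    <scope name="examplelang.keyword">
        <strings>
            <string>if</string>
            <string>else</string>
            <string>while</string>
        </strings>
    </scope>
</scopes>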

Template Scopes

Template scopes are defined within the <template-scopes> element of the syntax grammar, and otherwise are defined exactly the same way as the <scopes> element (see above).

Template scopes are a special set of scopes which allow easy construction of template languages, like PHP and Jinja.

They behave very similarly to Top-Level Scopes, except that they are evaluated at every level of the parse tree during recursive parsing, and not just at the root of the tree. They are evaluated before any subscopes of the current scope. This allows template tags from these languages to be handled deeply within the language they wrap, such as PHP tags within HTML.

If you are not developing a template language that uses this type of tag, template scopes are probably unnecessary.

Collections

To make building syntax grammars easier and cleaner, scopes may be grouped logically into Collections.

A syntax’s top-level <collections> element contains one or more collections, which in turn contain scopes that may be referenced elsewhere using an Include scope.

One collection can easily include scopes from another collection, allowing for multiple levels of including for more complex syntaxes.

A collection is defined using a <collection> element.

Each collection should have a name attribute, used to reference the collection for including. Collection names may contain the same set of characters as Scope names: alphanumeric characters, as well as the period, underscore, and dash.

Collections are scoped to their defining syntax, so collection names will not conflict between multiple unrelated syntaxes.

<collections>
    <!-- Keywords -->
    <collection name="keywords">
        <scope name="javascript.keyword">
            <strings>
                <string>await</string>
                <string>break</string>
                <string>case</string>
                <string>catch</string>
                <string>class</string>
                <string>const</string>
                […additional strings…]
            </strings>
        </scope>
        
        […additional scopes…]
    </collection>
    
    […additional collections…]
</collections>
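Scopes in a collection are pulled in elsewhere with an Include scope. The sketch below assumes an <include> element with syntax and collection attributes, where syntax="self" refers to the defining grammar; those names are assumptions here and should be checked against the Include scope documentation.

<scopes>
    <!-- Assumed form: evaluate every scope in the "keywords" collection at the top level -->
    <include syntax="self" collection="keywords" />
</scopes>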

Completions

Editor completions, displayed as the user types, can be provided for a syntax grammar in several ways, including using a static Completions definition file.

As a convenience, the <completions> element of a completions file may be included in the top-level of a syntax grammar if all of the following conditions are met:

This allows a syntax grammar to add symbolic completion providers easily inline in the syntax grammar without creating a separate completions file. Additionally, an extension may provide both of these (an inline <completions> element and a completions file) if needed.