Defining a Syntax

Syntaxes are the core piece of supporting a new language in Nova. A syntax defines both the grammar from which a language is parsed and auxiliary information about editing documents written in the language, such as indentation and commenting rules.

A syntax is defined by an XML document within the Syntaxes/ subfolder of an extension. The name of the file does not matter, but it most often reflects the name of the syntax itself (such as HTML.xml for an HTML syntax).

These XML documents should have a top-level <syntax> element.

This element should have an attribute name, which is the reference identifier for the syntax, such as name="html". This identifier must consist only of ASCII alphanumeric characters (a-z, A-Z, 0-9), the underscore, and the dash. Any other characters are invalid in syntax names. By convention, the identifier is usually lowercase, for consistency with other syntaxes.

The syntax element may also define the subsyntax="true" attribute to denote that it is used for subsyntax use only. A subsyntax will not be shown in UI elements (like the Syntax menu) and is not selectable by the user as the parse language for a document. Instead, subsyntaxes can be referenced by other syntaxes for embedded code fences.
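As a minimal sketch, a syntax file following the rules above might look like the skeleton below. The file name Example.xml and the identifier example are hypothetical, chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file: Syntaxes/Example.xml -->
<!-- The name uses only ASCII letters, digits, underscores, and dashes. -->
<syntax name="example">
    <!-- Child elements such as meta and detectors go here. -->
</syntax>
```

A syntax intended only for embedding in other languages would add subsyntax="true" to the same <syntax> element.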

Meta Properties

The <meta> element contains metadata about the syntax, including its identifying name, user-readable name, and language category.

The full set of elements that may be contained within the meta property are:

User-Readable Name

The <name> element contains the user-readable name of the syntax, which is displayed in user interface elements, such as “HTML”.

Language Type

The <type> element contains the category to which the syntax belongs. This determines how documents using the syntax are presented to the user, including accent colors in the document’s tab.

Valid types are:

Preferred File Extension

The <preferred-file-extension> element contains the file extension used by default for new documents using the syntax, such as html for HTML files.

Parent

The optional <parent> element defines that the syntax is a conceptual “child” of another syntax. This is not used in parsing, but for IDE features that may restrict certain items by syntax, such as Clips.

Consider using this in the definition for TypeScript (a superset of JavaScript):

<parent>javascript</parent>

By declaring a syntax like TypeScript as having the parent JavaScript, any features of the IDE restricted to JavaScript (such as specific Clips) will also be valid within TypeScript documents.

It is generally safe to reference built-in syntaxes when defining parents of extension-provided syntaxes, as the names of built-in syntaxes are unlikely to change in the future.

Scriptable

The optional <scriptable> element declares that a syntax can be used for scripting in shells and executed externally. This enables the language to be used in the IDE’s Tasks UI.

The scriptable tag can declare an attribute shebang, which defines the shebang expression inserted at the top of the script when used in Tasks.

<scriptable shebang="#!/usr/bin/env python" />
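Putting the meta properties together, a <meta> element for a Python-like syntax might look like the sketch below. The type value script is an assumption made for illustration, so confirm it against the list of valid types before relying on it.

```xml
<meta>
    <!-- User-readable name shown in the interface. -->
    <name>Python</name>
    <!-- Assumed type value; check the list of valid types. -->
    <type>script</type>
    <!-- Default extension for new documents, without the leading period. -->
    <preferred-file-extension>py</preferred-file-extension>
    <!-- Allows the language to be used in Tasks, with this shebang inserted. -->
    <scriptable shebang="#!/usr/bin/env python" />
</meta>
```

A syntax that is conceptually a child of another, such as TypeScript, would add a <parent> element alongside these.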

Detectors

Sets of Detectors are used when files are opened in the editor, and determine which syntax is used for the document based on a set of definable rules that evaluate to a “score”. Each syntax’s detectors are evaluated against the document being opened, and whichever set scores the “highest” will determine the language used.

Detectors are declared within the <detectors> element.

There are four types of detectors available: file extension detectors, filename detectors, content match detectors, and compound detectors.

Syntaxes may declare multiple detectors within their <detectors> element. The detectors are evaluated in succession to determine if a file matches, and they do not affect one another’s scores.

Detector Priorities

Each detector can define a “priority” as a value between 0.0 and 1.0. This value can be used to define the confidence the detector has that the file being evaluated should match its syntax.

Most often, detectors will use a priority of 1.0 (especially if matching based on filename or file extension). However, when matching content, it’s often useful to declare priorities lower than 1.0 to prevent over-ambitious matching of documents.

File Extension Detectors

File extension detectors will match a document based on one or more of its file extension components.

The text of the element may be a comma-separated list to include multiple possible extensions, and should not include the leading period character.

<extension priority="1.0">md,mdown,markdown</extension>

This defines a detector that matches several types of Markdown documents.

If a document’s filename has a compound extension made up of multiple components, such as myscript.min.js, the combined extension (min.js) will be evaluated first, followed by each file extension component (min, js).

Filename Detectors

Filename detectors will match a document based on one or more predefined filenames.

They are often useful when documents of a certain type may not have a file extension, or cases when a file type that uses a common extension should use a specific syntax.

The text of the element may be a comma-separated list to include multiple possible filenames.

<filename priority="1.0">Gemspec</filename>

This defines a detector that matches a file named Gemspec.

Content Match Detectors

Content match detectors will match the text of a document using a regular expression.

They are often used for documents that may use the same file extension or filename between multiple languages, such as HTML documents that contain template tags for various template engines (such as Django, ERB, and Jinja).

Content match detectors may define a lines attribute that determines the number of lines from the top of the text that should be evaluated. By default, this is unbounded, but if the content being matched is guaranteed to be within a certain number of lines, it is advised to include this attribute to improve the performance of detection in long documents.

<match-content lines="2" priority="0.3">&lt;(?i:\!DOCTYPE)</match-content>

This defines a detector that finds the prefix of an HTML <!DOCTYPE tag, matched case-insensitively, in the first two lines of text.

Content match detectors are also very useful for detecting shebang declarations in various types of scripts, such as Perl, Python, and Ruby.

<match-content lines="1" priority="0.7">\#\!.*?\bpython\b</match-content>

This defines a detector that finds a shebang declaration that contains the word python.

Compound Detectors

Compound detectors can be used to combine multiple other detectors into a single rule. This is most often useful when you wish to restrict certain detectors together, such as only matching a content match detector when a specific filename or file extension also matches.

A compound detector can define its own priority, which overrides the priorities of its children. If it does not, the children’s individual scores are averaged together.

<combo priority="0.7">
    <extension>html</extension>
    <match-content>{% [a-zA-Z]</match-content>
</combo>

This defines a compound detector that matches documents with an html file extension that contain the start of Django/Jinja-style template tags.
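The detector types above can be combined in a single <detectors> element. The following is a sketch for a hypothetical ERB-flavored HTML syntax; the extension list and the template-tag expression are illustrative assumptions, not taken from a shipping syntax.

```xml
<detectors>
    <!-- A dedicated extension is unambiguous, so it uses full priority. -->
    <extension priority="1.0">erb</extension>
    <!-- No priority on the combo, so the children's scores are averaged. -->
    <combo>
        <extension priority="1.0">html</extension>
        <!-- Matches the start of an ERB tag; the "less than" sign is escaped for XML. -->
        <match-content priority="0.5">&lt;%=?\s</match-content>
    </combo>
</detectors>
```

Because the compound detector declares no priority of its own, a document matching both children would presumably score the average of 1.0 and 0.5, which keeps plain .html files with ERB tags below files that carry the dedicated extension.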

Indentation Rules

The <indentation> element defines rules for automatically adjusting indentation as the user types.

The regular expressions in these rules are evaluated automatically as the user types to determine when indentation should increase or decrease. They are most often evaluated when pressing Return to determine the indentation of the line being inserted.

<indentation>
    <increase>
        <expression>(\{[^}\"']*$)|(\[[^\]\"']*$)|(\([^)\"']*$)</expression>
    </increase>
    <decrease>
        <expression>^\s*(\s*/\*.*\*/\s*)*[\}\]\)\\]</expression>
    </decrease>
</indentation>

This defines a set of rules that increase indentation after JavaScript-style opening bracket and comment tokens, and decrease indentation when typing JavaScript-style closing bracket and comment tokens.

Matching Increase

If this expression matches the current line just before the user presses Return, the succeeding line being inserted will be automatically indented one level.

Matching Decrease

If this expression matches the current line when the user types text matching the expression, the current line will be automatically dedented one level.

Matching Both Increase and Decrease

If both expressions match the current line when the user presses Return, the succeeding line will be indented, and an additional line will be inserted afterward that is dedented to the same level as the original line. The cursor will be placed on the middle line.
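The same mechanism works for languages that delimit blocks with keywords rather than brackets. The sketch below assumes a hypothetical Ruby-like language in which if, while, and def open a block and end closes it; the expressions are deliberately simple and would also match one-line forms such as an if with an end on the same line.

```xml
<indentation>
    <increase>
        <!-- Current line starts with a block-opening keyword. -->
        <expression>^\s*(if|while|def)\b.*$</expression>
    </increase>
    <decrease>
        <!-- Current line starts with the block-closing keyword. -->
        <expression>^\s*end\b</expression>
    </decrease>
</indentation>
```

With these rules, pressing Return at the end of a line beginning with def would indent the new line one level, and typing end at the start of a line would dedent that line one level.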

Comment Rules

The <comments> element defines the rules for commenting and uncommenting text within a document. There are individual elements within to define rules for both single-line and multi-line commenting.

<comments>
    <single>
        <expression>//</expression>
    </single>
    <multiline>
        <starts-with>
            <expression>/*</expression>
        </starts-with>
        <ends-with>
            <expression>*/</expression>
        </ends-with>
    </multiline>
</comments>

The expression values are text that will be wrapped around text being commented, or detected and removed from text being uncommented.
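Because the values are inserted and removed as literal text, other comment styles follow the same shape. As an illustration, a Lua-style definition might be sketched like this:

```xml
<comments>
    <single>
        <!-- Lua single-line comments start with two dashes. -->
        <expression>--</expression>
    </single>
    <multiline>
        <starts-with>
            <expression>--[[</expression>
        </starts-with>
        <ends-with>
            <expression>]]</expression>
        </ends-with>
    </multiline>
</comments>
```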

Brackets

The optional <brackets> element defines the pairs of characters which should be treated as brackets by the editor when performing matching (when the cursor sits against them), ping highlighting (as the cursor crosses them), and auto-closing.

Generally, it is not required to specify this element, as the default bracket set is usually appropriate for most languages.

The default set of brackets is: {}, [], ().

To specify a custom set of brackets:

<brackets>
    <pair open="{" close="}" />
    <pair open="[" close="]" />
    <pair open="(" close=")" />
</brackets>

This defines three pairs, one for each of the common types of bracket used in procedural languages.

Pairs may be defined for any two characters, as long as each character is one UTF-16 codepoint in length.
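A language that also treats angle brackets as brackets could extend the list as sketched below. This assumes that a custom <brackets> element replaces the default set rather than adding to it, which is why the default pairs are repeated. Note that the opening angle bracket must be escaped in XML.

```xml
<brackets>
    <pair open="{" close="}" />
    <pair open="[" close="]" />
    <pair open="(" close=")" />
    <!-- Angle brackets, written as XML entities. -->
    <pair open="&lt;" close="&gt;" />
</brackets>
```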

Surrounding Pairs

The optional <surrounding-pairs> element defines the set of characters which should be treated as wrappable pairs by the editor. When selecting a range of text and typing the first half of the pair, the text will be wrapped in the pair instead of replaced.

Generally, it is not required to specify this element, as the default pair set is usually appropriate for most languages.

The default set of surrounding pairs is: {}, [], (), <>, "", '', “”, ‘’, `` (backquotes)

To specify a custom set of surrounding pairs:

<surrounding-pairs>
    <pair open="{" close="}" />
    <pair open="[" close="]" />
    <pair open="(" close=")" />
    <pair open="'" close="'" />
    <pair open="&quot;" close="&quot;" />
    <pair open="`" close="`" />
</surrounding-pairs>

This defines six pairs, one for each of the common types of bracket and quote used in procedural languages.

Pairs may be defined for any two characters, as long as each character is one UTF-16 codepoint in length.
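Surrounding pairs are not limited to brackets and quotes. As a sketch, a hypothetical Markdown-like syntax might let users wrap a selection in emphasis or code markers. This assumes that a custom element replaces the default set, so any default pairs that should remain available would need to be listed as well.

```xml
<surrounding-pairs>
    <pair open="(" close=")" />
    <pair open="[" close="]" />
    <!-- Emphasis markers: selecting text and typing * wraps it in asterisks. -->
    <pair open="*" close="*" />
    <pair open="_" close="_" />
    <!-- Inline code. -->
    <pair open="`" close="`" />
</surrounding-pairs>
```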

Injections

To facilitate including one language within another, such as code fences in Markdown documents, a syntax can specify an Injection regular expression.

When a parent language defines a region that should be reparsed as another language, it can define text in the document which tells the syntax engine which language to use. This textual content is evaluated using the injection regular expressions of all registered languages. The first to match will be used to parse the injected region.

<injection>
    <expression>(html|HTML|Html)$</expression>
</injection>

For example, consider a Markdown document which contains:

Lorem ipsum dolor sit amet.

```html
<html>
    <head></head>
    <body></body>
</html>
```

When the document is parsed, the Markdown extension marks the content between the triple backticks as an injected region. It also specifies that any text appearing after the opening set of backticks names the language that should be used. This text (html) is what is compared against injection regular expressions.

If no injection regular expressions match the text, the syntax engine checks whether a syntax is registered whose name is equal to the text. If nothing matches, the region is not reparsed.

Added in Nova 10. Only supported with languages injected into Tree-sitter-based languages.
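As a further sketch, a JavaScript syntax could accept several common spellings of its name in a code fence. The alternatives listed here are illustrative assumptions; the expression is anchored at both ends so that longer names that merely contain js do not match.

```xml
<injection>
    <!-- Matches the fence labels js, javascript, and JavaScript. -->
    <expression>^(js|javascript|JavaScript)$</expression>
</injection>
```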

Tree-sitter

Tree-sitter-based languages, as the name implies, are built around the Tree-sitter parsing library. Tree-sitter grammars are written in a subset of JavaScript and then must be compiled into a native dynamic library for use in Nova.

For a deep look at integrating a Tree-sitter-based language into Nova, follow our Tree-sitter guide.

Added in Nova 10.

Regex Grammars

Regex Grammar-based languages are built around a set of rules which use regular expressions to break down text into logical tokens.

Note: With the introduction of Tree-sitter support in Nova 10, building new regex grammars is discouraged. Support will remain for the foreseeable future for backward compatibility, but new features will be targeted exclusively at Tree-sitter.

For a deep look at building a Regex Grammar-based language, follow our Regex Grammars guide.

Language Completions

Basic completion rules for a language which do not require a full AssistantsRegistry completions provider may be provided by a language extension as part of its syntax definition.

These types of completions are known as Language Completions.