Kotlin provides an expressive syntax that enables a clean, modern style but complicates parsing. This two-part blog series sheds light on Kotlin’s unique syntactic features and the parsing techniques needed to handle them.

At Gitar, we build tools that analyze and transform various programming languages, and among the languages we work on, Kotlin has been the most challenging to parse. Syntax and parsing are important aspects of the design and implementation of modern programming languages, yet they are often overshadowed by other aspects. With this two-part blog series, we share what we have learned from working extensively on Kotlin, aiming to shed some light on these aspects of the language for other tool developers and Kotlin enthusiasts.

An example

Consider the following Kotlin code:

class Employee(val name: String, val location: String) {
   override fun toString(): String = name
}

fun numEmployeesInLocation(employees: List<Employee>) =
   employees.groupingBy { it.location }.eachCount()

This example highlights two syntactic features that are relevant from a parsing point of view:

There are no semicolons in this code because semicolons are optional. In Kotlin, newlines can serve as statement separators, and semicolons are necessary only to separate statements on the same line. Among languages that allow newlines as statement separators, such as Go, Scala, and JavaScript, Kotlin feels the most flexible and natural: newlines can be used freely, as if they were whitespace, for laying out and formatting code. In other languages with newlines as separators, newlines can sometimes lead to surprising parse errors.
Curly brackets ({ and }) serve as both blocks and lambdas. In the above example, employees.groupingBy { it.location } is a method call with a lambda passed as the last argument. This overloaded use of curly brackets adds flexibility to the syntax but makes the Kotlin grammar ambiguous and complex to parse. Kotlin has other syntactic constructs that are similarly ambiguous or cannot be parsed with traditional deterministic parsing techniques.

In this two-part blog series, we discuss these two categories of parsing challenges in detail. Part I discusses how Kotlin handles newlines to allow flexible syntax. Part II focuses on ambiguities in the Kotlin grammar and disambiguation strategies. Throughout, we show examples of Kotlin code that exhibit these parsing challenges, and compare the behavior of the Kotlin compiler parser to that of the Kotlin specification grammar written in ANTLR 4.

Newlines and optional semicolons

Semicolons are optional in Kotlin except when writing multiple statements on the same line; for example:

fun main() {
  print("Hello"); print(" "); println("World")
}

Mandatory semicolons are a well-known syntactic feature of the C-family of programming languages — C, C++, and Java being the most prominent members. Although semicolons may appear to unnecessarily clutter the code, they make the parser implementation simpler as semicolons unambiguously define statement boundaries.

Languages that allow newlines as statement separators need additional mechanisms to determine statement boundaries. In these languages, a newline character can have two meanings: it is either whitespace, which is insignificant, or a statement separator, which is significant. This overloading makes parsing more complicated.

Depending on the mechanism used to handle newlines, some surprising behaviors can leak into the program as unexpected parse errors. If you are familiar with Go or Scala, you know that an extra newline in the wrong place sometimes leads to a parse error. What sets Kotlin apart from these languages is that newlines can be used almost everywhere as if they were whitespace.

Let’s consider the following Kotlin program, where there is a newline after the function signature:

fun main()
{
  println("Hello world")
}

The equivalent Go program gives a parse error:

func main() // Error: syntax error: unexpected semicolon or newline before {
{
  println("Hello world")
}

Scala does not have this behavior with method declarations, but has a similar issue with class declarations when there are two newlines between the class declaration and its body. The following Scala class gives a parse error, whereas Kotlin allows as many newlines as desired between a class declaration and its body:

class A

{ // Error: Illegal start of toplevel definition

}

It’s also possible in Kotlin to format class declarations with extra newlines; for example, the class constructor and its modifiers can be placed on a new line:

class A

  private constructor(i: Int)

The equivalent Scala code gives a parse error:

class A

  private (i: Int)  // Error: Expected start of definition

Another example is import statements. Kotlin, Go, and Scala allow imports to be separated by semicolons or newlines. But unlike Go and Scala, Kotlin also allows multiple import statements on the same line separated only by whitespace:

import java.util.HashMap import java.util.LinkedHashMap

The last example is method call chains. In Go, the dot character (.) must be placed on the same line as the closing parenthesis; otherwise, it leads to a parse error:

a
 .b() // Error: expected statement, found '.'
 .c()

Scala and Kotlin do not have this restriction as they allow the dot character on the next line when formatting method chains. (Method chaining is popular in object-oriented languages like Java, especially when using fluent interfaces.)

As can be seen, among the three modern languages with significant newlines, Go has the most restrictions. Scala feels more flexible but still cannot be freely formatted with newlines. Kotlin is the most flexible, and in terms of formatting it does not feel much different from a language like Java, where newlines are insignificant.

Newline handling

How does Kotlin achieve such flexibility with newline handling? The answer lies in the interaction between the lexer and the parser, and who determines the meaning of a newline character. Traditionally, parsing is done in two separate phases. First, the lexer converts the input source into a stream of tokens. Lexers typically remove insignificant characters — such as whitespace and comments — while producing tokens. Second, the parser processes this token stream to construct a parse tree based on the language grammar rules.

In Go and Scala, the lexer determines which newline characters are whitespace and which ones are statement separators. But because lexers do not know the parser context (the current grammar rules that the parser is processing), they rely on local rules to define when a newline is significant. These rules are defined so that they can be implemented as extensions to lexical scanning. Sometimes, these rules determine that a newline is a separator where the programmer naturally expects whitespace. This leads to surprising parse errors in seemingly valid programs. The parse errors in the previous section illustrate some of these behaviors.

Let’s take Go as an example. In Go, the lexer inserts semicolons into the token stream when it determines that a newline character serves as a separator. In other words, the Go lexer completely hides newline handling from the parser. This technique is known as semicolon insertion. The Go language specification section on semicolons, for example, states that a semicolon is automatically inserted if a line’s final token is a right parenthesis. In the Go example above, the newline after the function signature leads to a parse error because the lexer inserts a semicolon after the right parenthesis. This semicolon is an unexpected token for the parser at this position, which leads to the parse error above.
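To make the mechanism concrete, here is a minimal sketch of lexer-level semicolon insertion, written in Kotlin. It is a simplified illustration of the rule in the Go specification, not Go's actual lexer: the helper names (tokenizeLine, endsStatement, insertSemicolons) are our own, and the sketch ignores comments, multi-line strings, and most operators.

```kotlin
// Go keywords that do NOT trigger semicolon insertion at the end of a line.
// (break, continue, fallthrough and return do trigger it.)
val nonTriggerKeywords = setOf(
    "case", "chan", "const", "default", "defer", "else", "for", "func", "go",
    "goto", "if", "import", "interface", "map", "package", "range", "select",
    "struct", "switch", "type", "var"
)

// Splits one line into words, numbers, string literals, ++/--, and single symbols.
fun tokenizeLine(line: String): List<String> =
    Regex("""[A-Za-z_]\w*|\d+|"[^"]*"|\+\+|--|\S""")
        .findAll(line).map { it.value }.toList()

// True if a newline right after this token should become a semicolon.
fun endsStatement(token: String): Boolean = when {
    token in setOf(")", "]", "}", "++", "--") -> true
    token.first().isLetter() || token.first() == '_' -> token !in nonTriggerKeywords
    token.first().isDigit() || token.first() == '"' -> true
    else -> false
}

// Produces the token stream the parser would see: newlines are gone,
// and a ";" token appears wherever the lexer decided a statement ended.
fun insertSemicolons(source: String): List<String> =
    source.lines().flatMap { line ->
        val tokens = tokenizeLine(line)
        if (tokens.isNotEmpty() && endsStatement(tokens.last())) tokens + ";" else tokens
    }

fun main() {
    val src = "func main()\n{\n  println(\"Hello world\")\n}"
    // prints: func main ( ) ; { println ( "Hello world" ) ; } ;
    println(insertSemicolons(src).joinToString(" "))
}
```

The decision is made from the last token of each line alone. The lexer has no way of knowing that a function body is still expected after the parameter list, so the parser receives a semicolon directly before the opening brace.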

The Scala language specification similarly specifies rules for handling newlines in the lexer. In Scala, the rules for detecting significant newlines are more flexible than Go’s. Unlike Go, the lexer does not replace significant newlines with semicolons. Instead, Scala defines an explicit newline token, which the lexer inserts into the token stream when a newline character is significant. The Scala grammar has rules that use these newline tokens both as explicit tokens and as separators.

To avoid limitations imposed by lexer-based handling of newlines, Kotlin exposes the parsing context when interpreting the meaning of newlines.

There are two existing Kotlin parser implementations: the Kotlin compiler parser and the Kotlin specification grammar written in ANTLR. The Kotlin compiler uses a handwritten, recursive-descent parser with a separate lexer. Rather than throwing the newlines away or replacing them with semicolons, the lexer passes them along to the parser, which then interprets their meaning based on the parsing context. In places where a separator is expected, the parser treats newline tokens as separators.
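The following sketch illustrates the general idea of parser-side newline handling. It is a toy recursive-descent parser of our own, not the Kotlin compiler's code: the names Tok, Word, Sym, NL, lex, and ToyParser are hypothetical, and a "statement" is reduced to a single word. The lexer keeps every newline as a token, and the parser decides what each one means depending on where it is in the grammar.

```kotlin
sealed interface Tok
data class Word(val text: String) : Tok   // identifiers and keywords
data class Sym(val ch: Char) : Tok        // punctuation such as ( ) { } ;
object NL : Tok                           // a newline, kept in the token stream

// Lexer: drops spaces and tabs but passes every newline on to the parser.
fun lex(src: String): List<Tok> {
    val out = mutableListOf<Tok>()
    var i = 0
    while (i < src.length) {
        val c = src[i]
        when {
            c == '\n' -> { out += NL; i++ }
            c.isWhitespace() -> i++
            c.isLetterOrDigit() -> {
                val start = i
                while (i < src.length && src[i].isLetterOrDigit()) i++
                out += Word(src.substring(start, i))
            }
            else -> { out += Sym(c); i++ }
        }
    }
    return out
}

class ToyParser(private val toks: List<Tok>) {
    private var pos = 0
    private fun peek(): Tok? = toks.getOrNull(pos)
    private fun skipNewlines() { while (peek() == NL) pos++ }
    private fun skipSeparators() { while (peek() == NL || peek() == Sym(';')) pos++ }
    private fun expect(t: Tok) {
        check(peek() == t) { "expected $t at token $pos" }
        pos++
    }

    // function: 'fun' NAME '(' ')' NL* '{' statements '}'
    fun parseFunction(): List<String> {
        expect(Word("fun"))
        pos++                     // function name (not validated in this sketch)
        expect(Sym('(')); expect(Sym(')'))
        skipNewlines()            // in this context a newline is only layout
        expect(Sym('{'))
        val body = parseStatements()
        expect(Sym('}'))
        return body
    }

    // statements: single-word statements separated by newlines or semicolons
    private fun parseStatements(): List<String> {
        val stmts = mutableListOf<String>()
        skipSeparators()
        while (true) {
            val word = peek() as? Word ?: break
            stmts += word.text
            pos++
            if (peek() is Word) error("missing separator at token $pos")
            skipSeparators()      // in this context a newline is a separator
        }
        return stmts
    }
}

fun main() {
    // prints: [a, b, c]
    println(ToyParser(lex("fun main()\n\n{\n  a\n  b; c\n}")).parseFunction())
}
```

The same NL token is skipped before the opening brace and treated as a separator between statements. Making that distinction requires knowing which rule is being parsed, which is information a standalone lexer does not have.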

ANTLR has separate lexer and parser grammars. The Kotlin specification grammar excludes newlines from the whitespace declarations in the lexical grammar, and instead, directly includes newline characters in the grammar rules. For example, the rule for statements requires one or more semis (a non-empty list of semicolons or newlines) after each statement. This effectively enforces that between two statements, we need at least one semicolon or newline.

statements: (statement (semis statement)*)? semis?

semis: (SEMICOLON | NL)+

Other rules, such as the one for function declarations, declare newlines as optional (NL*):

functionDeclaration
    : modifiers?
      FUN (NL* typeParameters)? (NL* receiverType NL* DOT)? NL* simpleIdentifier
      NL* functionValueParameters
      (NL* COLON NL* type)?
      (NL* typeConstraints)?
      (NL* functionBody)?

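Adding newline tokens to the grammar rules in this way, however, makes the grammar non-deterministic: after reading a newline, the parser cannot always tell from the next token alone which rule that newline belongs to. This is not a problem for ANTLR 4, which can handle any context-free grammar that has no indirect left recursion.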
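A small sketch shows where the extra lookahead comes from. After a function's parameter list, a newline may either belong to the optional `(NL* functionBody)?` part or separate the declaration from whatever comes next, and the parser only finds out by looking past the whole run of newlines. The token names and the bodyFollows helper below are hypothetical, and the sketch only considers block bodies (not `= expression` bodies).

```kotlin
// Hypothetical token kinds for illustration only.
enum class T { FUN, IDENT, LPAREN, RPAREN, NL, LCURL, RCURL }

// Decides whether a block body follows the parameter list that ends just
// before index `afterParams`. One token of lookahead is not enough when that
// token is NL: the scan has to skip an unbounded number of NL tokens first.
fun bodyFollows(tokens: List<T>, afterParams: Int): Boolean {
    var i = afterParams
    while (i < tokens.size && tokens[i] == T.NL) i++
    return i < tokens.size && tokens[i] == T.LCURL
}

fun main() {
    // fun f() <NL> <NL> { }
    val withBody = listOf(T.FUN, T.IDENT, T.LPAREN, T.RPAREN, T.NL, T.NL, T.LCURL, T.RCURL)
    // fun f() <NL> <NL> fun g()   (e.g. two members without bodies in an interface)
    val withoutBody = listOf(T.FUN, T.IDENT, T.LPAREN, T.RPAREN, T.NL, T.NL,
                             T.FUN, T.IDENT, T.LPAREN, T.RPAREN)
    println(bodyFollows(withBody, 4))     // prints true
    println(bodyFollows(withoutBody, 4))  // prints false
}
```

A hand-written parser can code this scan directly, while a parser generator needs a strategy that allows for arbitrary lookahead.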

Newlines in expressions

Despite the flexibility of Kotlin’s newline handling, there are cases in Kotlin where newlines introduce ambiguities in the grammar, causing surprising behaviors. The prime example is how newlines change the meaning of some binary expressions. In the following Kotlin program, the function returns a logical conjunction of its two boolean arguments:

fun f(a: Boolean, b: Boolean): Boolean {
  return a
         && b
}

Unlike &&, the binary add operator (+) cannot be placed on the next line. In the following program, the function returns the first argument, so the call f(1, 2) returns 1:

fun f(a: Int, b: Int): Int {
  return a
         + b
}

This program is parsed as a function body with two expressions: a return expression followed by a unary + expression. Because return exits the function as soon as it is executed, + b is unreachable. This restriction, that the operator cannot begin the continuation line, applies to some binary operators such as * and +, but not to && and ||. Andrey Breslav, the creator of Kotlin, has commented on why only some binary operators show this behavior.
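The difference is easy to observe in a small runnable program. The function names below are ours, chosen for illustration. The usual way to avoid the problem is to end the line with the operator, which keeps the expression open across the newline.

```kotlin
// The leading `+ b` is parsed as a separate unary-plus expression, so this
// function returns `a` (the compiler warns that the second line is unreachable).
fun leadingPlus(a: Int, b: Int): Int {
    return a
        + b
}

// Ending the line with the operator continues the expression on the next line.
fun trailingPlus(a: Int, b: Int): Int {
    return a +
        b
}

// `&&` is allowed to start the continuation line.
fun leadingAnd(a: Boolean, b: Boolean): Boolean {
    return a
        && b
}

fun main() {
    println(leadingPlus(1, 2))        // prints 1
    println(trailingPlus(1, 2))       // prints 3
    println(leadingAnd(true, false))  // prints false
}
```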

Other expressions can also be parsed differently when formatted with newlines; for example, the following code is parsed as an identifier followed by a parenthesized expression rather than a function call.

fun main() {
    f  
    ()
}

Context-sensitivity in newline handling in Kotlin

To allow arbitrary formatting of expressions using newlines, Kotlin considers newlines within parentheses or square brackets as insignificant. The following code example shows the behavior of newlines with regard to parentheses.

// Returns 3 as newlines are ignored inside ()
fun f(): Int {
    return (1
            + 2)
}

// Also returns 3, as newlines are ignored at any nesting level inside ()
fun f(): Int { 
    (return 1
            + 2)
}

// Returns 2, as the body of the lambda is parsed as two separate expressions.
// In Kotlin, the last expression is the return value of a lambda.
// Note that the inner {} eclipses the outer (), making newlines significant again.
fun f(): Int {
    (
        return {
            1
            + 2
        }()
    )
}

This feature of Kotlin makes the grammar effectively context-sensitive: the meaning of a newline in an expression depends on whether that expression is enclosed in a parenthesized expression. Implementing this context-sensitive behavior requires maintaining a stack of open parentheses. This stack allows the parser to interpret the significance of newlines depending on whether the parser has encountered a left parenthesis or bracket (i.e., a ( or [) earlier in the current parsing path.

The Kotlin compiler’s parser tracks opening and closing parentheses using the disableNewLines and enableNewLines calls, and because it is a recursive-descent parser, it relies on the call stack rather than an explicit stack. The Kotlin specification grammar uses the ANTLR lexer mode feature, which effectively maintains a stack in the lexer as brackets are encountered, and uses this stack to alter the meaning of newlines.
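The stack discipline can be sketched in a few lines. The classifyNewlines function below is a hypothetical helper of our own, not code from either parser: it walks over the source text and reports, for each newline, whether it would be significant. The sketch uses an explicit stack and ignores string literals and comments.

```kotlin
// For each newline in `src`, returns true if it is significant (a potential
// separator) and false if it is ignored as layout.
fun classifyNewlines(src: String): List<Boolean> {
    val enabled = ArrayDeque<Boolean>()   // one entry per currently open bracket
    enabled.addLast(true)                 // at the top level newlines are significant
    val result = mutableListOf<Boolean>()
    for (c in src) {
        when (c) {
            '(', '[' -> enabled.addLast(false)   // newlines disabled inside ( and [
            '{' -> enabled.addLast(true)         // a nested { enables them again
            ')', ']', '}' -> if (enabled.size > 1) enabled.removeLast()
            '\n' -> result += enabled.last()
        }
    }
    return result
}

fun main() {
    // prints: [false]
    println(classifyNewlines("return (1\n+ 2)"))
    // prints: [false, true, true, true, false]
    println(classifyNewlines("(\nreturn {\n1\n+ 2\n}()\n)"))
}
```

In the second call, the newlines inside the braces are significant even though the braces themselves sit inside parentheses, which matches the third example above, where the lambda body is parsed as two expressions.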

Conclusions

Kotlin’s flexible approach to newlines enables cleanly formatted code but adds complexity to parsing. In this blog post, we showed how two existing Kotlin parsers handle newlines by leveraging parser context and advanced parsers that handle non-deterministic context-free grammars.

Part II of this blog series will cover ambiguity-related issues in Kotlin. Newline handling in Kotlin is unique and innovative, and its implementation fits recursive-descent parsing nicely. The problems related to ambiguities are harder to solve and require tradeoffs between performance and expressivity.