Adding row polymorphism to Damas-Hindley-Milner

October 22, 2024

Intro to HM

Damas-Hindley-Milner (HM) is a type system for Standard ML and the ML-family languages. River and I wrote about it earlier this month. Check out that post if you would like an introduction or refresher. In this post, we’ll add row polymorphism to it. We’ll do it in the context of scrapscript’s implementation of Algorithm J, but the core ideas should be language and algorithm-independent.

Records

A record is an unordered collection of name to value mappings, like { x = 1, y = "hello" }. Scrapscript has a very dynamic notion of records in the interpreter and the compiler: any record can have any number of fields, with any names, with any values. This makes code like the following possible:

get_x { x=1, y=2 }
. get_x =
  | { x=x, ... } -> x

At run-time, the pattern matching looks for a field named x in the record and ignores the rest. The spread (...) indicates that it’s okay to also have other fields in the record. If the pattern were instead { x=x }, then the match would fail at run-time due to excess fields.

The following snippet also runs even though left and right have different leftover record parts:

get_x left + get_x right
. left  = { x = 1, y = 2 }
. right = { x = 1, y = 2, z = 3 }
. get_x = | { x = x, ... } -> x

This “leftover” bit isn’t straightforward to model in the type system. We’ll need some special sauce there.

Right now, in the type system, we don’t support records at all. If we try to infer the type of one, it’ll fail with “unsupported” or something. We need to model records in the type system.

Rows

A lot of papers use a notion of rows to model records, kind of how you can use cons cells to model lists. A row is a mapping of names to types, like {x=int, y=string}. Rows also have a special field called rest which contains the type of the leftover fields. It’s either a type variable or the empty row.

Having the empty row means that the record has no leftover fields. In fact, the example row above ({x=int, y=string}) is assumed to have the empty row as its rest field because all of these papers love shorthand. If it were extensible, if it had a type variable as its rest field, then it would be written as {x=int, y=string, ...'a}.

This does mean that it’s possible to have nested rows like {x=int, ...'a={y=string, ...'b}}. This is normal but you’ll never see it in notation because everybody implicitly flattens the rows because it looks nicer. Also because unification happens on the flattened rows, but we’ll get there later. Now let’s talk about the data structures we’ll use to implement rows. We’ll reuse some of the data structures from the intro-to-HM post.

The data structures

It’s as I said: a row is either an empty row (TyEmptyRow) or has fields and a leftover/rest field (TyRow).

@dataclasses.dataclass
class TyEmptyRow(MonoType):
    pass

@dataclasses.dataclass
class TyRow(MonoType):
    fields: dict[str, MonoType]
    rest: TyVar | TyEmptyRow = dataclasses.field(default_factory=TyEmptyRow)

In order to unify and print rows, we’ll often deal with them in their flattened representation, which is a collapsed tuple of fields and a rest type. One key difference: because it is collapsed, the rest must be either the empty row or an unbound type variable.

Also: not everybody uses the exact same representation. Some go for more cons-cell-like representations where every row has a label and a type and then a rest—no fields dict. They’re equivalent.

Inferring record literals

When we’re inferring the type of a record literal, we infer the types of the values and then construct a row from them. We should never see a ... in a record literal because we only support it in pattern matching.

def infer_type(expr: Object, ctx: Context) -> MonoType:
    # ...
    if isinstance(expr, Record):
        fields = {}
        rest: TyVar | TyEmptyRow = TyEmptyRow()
        for key, value in expr.data.items():
            assert not isinstance(value, Spread)
            fields[key] = infer_type(value, ctx)
        return set_type(expr, TyRow(fields, rest))
    # ...

Pattern matching on records

Pattern matching inference changed a little bit since the last post but not enough to make a big deal out of it. We now use this kind-of-similar infer_pattern_type function to infer the type of a pattern. It looks similar to infer_type but it’s a little more complicated because it has to invent fresh type variables for all variables in the pattern and it also has to deal with spread.

In the case of records, we default to building a closed row. However, if we see a spread, we invent a fresh type variable and use that as the rest of the row.

def infer_pattern_type(pattern: Object, ctx: Context) -> MonoType:
    # ...
    if isinstance(pattern, Record):
        fields = {}
        rest: TyVar | TyEmptyRow = TyEmptyRow()  # Default closed row
        for key, value in pattern.data.items():
            if isinstance(value, Spread):
                # Open row
                rest = fresh_tyvar()
                if value.name is not None:
                    ctx[value.name] = Forall([], rest)
                break
            fields[key] = infer_pattern_type(value, ctx)
        return set_type(pattern, TyRow(fields, rest))
    # ...

We also bind the rest variable to the name of the spread, if it’s named. This means we can do stuff like | { x=x, ...rest } -> something and then use rest in the something expression.

Unifying rows: building intuition

I don’t want to get too into it yet, but I want to explain that rows aren’t special in unification. I thought that there was some quirk or trick in the representation that enabled row polymorphism, but that’s not the case at all (shout-out to Sinan and River for making this clear).

Unification is just unification. Whatever types it gets, it will make them equal. Permanently. If it can’t, there’s a type error. Let’s look at some examples:

Okay, so that means that if you have a function that is meant to be generic over a single field, how does that work? How can you have | {x=x, ...} -> x if for every call to that function, the type of ... keeps expanding as it is unified? It might start as {x=int, ...'a} -> int but over time 'a will get bound to other row types and it might end up as {x=int, y=int, z=int, ...'d} and that will be too restrictive! What’s going on??

Let polymorphism

What’s actually happening is boring: let polymorphism strikes once again. We talked about let polymorphism in the last post, but here’s a refresher: if you have an unbound type variable in a let-bound function type, it will get generalized with a forall.

Have a function 'a list -> 'a? Well, the type in the context (the environment that deals in type variables) is actually forall 'a. 'a list -> 'a. Every use of that named function will instantiate that forall with a fresh type variable (maybe 'b or 't123 or whatever). This means that the same function can be generic over one or more type variables. Great.

So that pattern match function above? The one with type {x=int, ...'a} -> int? Well, if it’s bound to a name, it will get generalized to forall 'a. {x=int, ...'a} -> int. So every call, well… you know the rest. That’s row polymorphism.

Unifying rows: brass tacks

Alright, well, assuming we have what I thought was the hard part done already, let’s go over unification in excruciating detail. We’ll start with a neat little base case: empty rows can only unify with other empty rows:

def unify_type(ty1: MonoType, ty2: MonoType) -> None:
    # ...
    if isinstance(ty1, TyEmptyRow) and isinstance(ty2, TyEmptyRow):
        return
    # ...

Yay, nice. Let’s continue. We’ll start by flattening the rows.

For every field name present in either row, we’ll try to unify the types pairwise. But if it’s missing in one of the rows, we have to add it to the other one. So we collect that in ty1_missing (missing from ty1) and ty2_missing (missing from ty2).

This leaves us with four cases:

  1. Both rows have all the same fields. In that case, we unify the rests so they stay the same.
  2. One row has all the same fields as the other and also some more. In that case, add the missing fields to the other (smaller) row. Note that they share a rest.
  3. The same, but flipped the other direction.
  4. They each have fields the other lacks. In that case, create new rows with the missing fields and share a rest.

Now follow along in code:

def unify_type(ty1: MonoType, ty2: MonoType) -> None:
    # ...
    if isinstance(ty1, TyRow) and isinstance(ty2, TyRow):
        ty1_fields, ty1_rest = row_flatten(ty1)
        ty2_fields, ty2_rest = row_flatten(ty2)
        ty1_missing = {}
        ty2_missing = {}
        all_field_names = set(ty1_fields.keys()) | set(ty2_fields.keys())
        for key in sorted(all_field_names):  # Sort for deterministic error messages
            ty1_val = ty1_fields.get(key)
            ty2_val = ty2_fields.get(key)
            if ty1_val is not None and ty2_val is not None:
                unify_type(ty1_val, ty2_val)
            elif ty1_val is None:
                assert ty2_val is not None
                ty1_missing[key] = ty2_val
            elif ty2_val is None:
                assert ty1_val is not None
                ty2_missing[key] = ty1_val
        # In general, we want to:
        # 1) Add missing fields from one row to the other row
        # 2) "Keep the rows unified" by linking each row's rest to the other
        #    row's rest
        if not ty1_missing and not ty2_missing:
            # They both have the same fields for now, so make sure they have
            # the same fields *forever* by unifying the rests.
            unify_type(ty1_rest, ty2_rest)
            return
        if not ty1_missing:
            # The first row has fields that the second row doesn't have; add
            # them to the second row
            unify_type(ty2_rest, TyRow(ty2_missing, ty1_rest))
            return
        if not ty2_missing:
            # The second row has fields that the first row doesn't have; add
            # them to the first row
            unify_type(ty1_rest, TyRow(ty1_missing, ty2_rest))
            return
        # They each have fields the other lacks; create new rows sharing a rest
        # and add the missing fields to each row
        rest = fresh_tyvar()
        unify_type(ty1_rest, TyRow(ty1_missing, rest))
        unify_type(ty2_rest, TyRow(ty2_missing, rest))
        return
    # ...

There’s a lot of code but the important thing to know is that both rows are going to end up with the same fields and the same rest type.

Wrapping up

There you have it. That constitutes row polymorphism. There’s some other stuff we don’t need to get into because Scrapscript doesn’t have some language features. For example, Daan Leijen of scoped labels fame jumped into the Hacker News thread to explain why we might not need scoped/shadowed/duplicate labels:

About duplicate labels.. one needs to retain the duplicate field at runtime if there is a “remove_l” or “mask_l” operation that drops a field “l”. For example, {x=2,x=True}.remove_x.x == True. (Where the type of remove_l is {l:a|r} -> {r})

Since we don’t have such a masking operation, our lives are easier. Cool.

We’ve heard that we can use rows to implement polymorphic variants, but haven’t figured that part out yet…

See also

Check out the Scrapscript PR.

Our row polymorphism references are the same as last time, repeated here:

Please recommend additional papers, blog posts, and implementations.