Damas-Hindley-Milner (HM) is the type system used by Standard ML and other ML-family languages. River and I wrote about it earlier this month. Check out that post if you would like an introduction or refresher. In this post, we’ll add row polymorphism to it. We’ll do it in the context of scrapscript’s implementation of Algorithm J, but the core ideas should be language- and algorithm-independent.
A record is an unordered collection of name to value mappings, like `{ x = 1, y = "hello" }`. Scrapscript has a very dynamic notion of records in the interpreter and the compiler: any record can have any number of fields, with any names, with any values. This makes code like the following possible:
```
get_x { x=1, y=2 }
. get_x =
  | { x=x, ... } -> x
```
At run-time, the pattern matching looks for a field named `x` in the record and ignores the rest. The spread (`...`) indicates that it’s okay to also have other fields in the record. If the pattern were instead `{ x=x }`, then the match would fail at run-time due to excess fields.
The following snippet also runs even though `left` and `right` have different leftover record parts:
```
get_x left + get_x right
. left = { x = 1, y = 2 }
. right = { x = 1, y = 2, z = 3 }
. get_x = | { x = x, ... } -> x
```
This “leftover” bit isn’t straightforward to model in the type system. We’ll need some special sauce there.
Right now, in the type system, we don’t support records at all. If we try to infer the type of one, it’ll fail with “unsupported” or something. We need to model records in the type system.
A lot of papers use a notion of rows to model records, kind of how you can use cons cells to model lists. A row is a mapping of names to types, like `{x=int, y=string}`. Rows also have a special field called rest, which contains the type of the leftover fields. It’s either a type variable or the empty row.
Having the empty row means that the record has no leftover fields. In fact, the example row above (`{x=int, y=string}`) is assumed to have the empty row as its rest field, because all of these papers love shorthand. If it were extensible (that is, if it had a type variable as its rest field), then it would be written as `{x=int, y=string, ...'a}`.
This does mean that it’s possible to have nested rows like `{x=int, ...'a}` where `'a` is itself bound to `{y=string, ...'b}`, which flattens to `{x=int, y=string, ...'b}`. This is normal, but you’ll never see it in notation because everybody implicitly flattens the rows; it looks nicer. Also because unification happens on the flattened rows, but we’ll get there later.
Now let’s talk about the data structures we’ll use to implement rows. We’ll
reuse some of the data structures from the intro-to-HM post.
It’s as I said: a row is either an empty row (`TyEmptyRow`) or has fields and a leftover/rest field (`TyRow`).
```python
@dataclasses.dataclass
class TyEmptyRow(MonoType):
    pass


@dataclasses.dataclass
class TyRow(MonoType):
    fields: dict[str, MonoType]
    rest: TyVar | TyEmptyRow = dataclasses.field(default_factory=TyEmptyRow)
```
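For example, here is how the rows from the notation section might be built with these classes (a sketch: `IntType` and `StringType` stand in for however the base types are named in the intro-to-HM post):

```python
# {x=int, y=string} -- closed: rest defaults to TyEmptyRow()
xy = TyRow({"x": IntType, "y": StringType})

# {x=int, ...'a} -- open: rest is a fresh, unbound type variable
open_x = TyRow({"x": IntType}, fresh_tyvar())
```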
In order to unify and print rows, we’ll often deal with them in their flattened representation, which is a collapsed tuple of fields and a rest type. One key difference: because it is collapsed, the rest must be either the empty row or an unbound type variable.
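Here’s a sketch of what that flattening might look like, assuming a `find` helper along the lines of the Algorithm J post that resolves a type to its representative (chasing bound type variables):

```python
def row_flatten(ty: TyRow) -> tuple[dict[str, MonoType], TyVar | TyEmptyRow]:
    fields: dict[str, MonoType] = {}
    rest: MonoType = ty
    while isinstance(rest, TyRow):
        # Well-formed rows never repeat a label, so update never clobbers
        fields.update(rest.fields)
        rest = find(rest.rest)
    assert isinstance(rest, (TyVar, TyEmptyRow))
    return fields, rest
```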
Also: not everybody uses the exact same representation. Some go for more cons-cell-like representations where every row has a `label`, a `type`, and then a `rest`; no `fields` dict. They’re equivalent.
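For the curious, that cons-cell-style encoding might look something like this (a hypothetical sketch; `TyRowExtend` is an illustrative name, not something from the scrapscript code):

```python
@dataclasses.dataclass
class TyRowExtend(MonoType):
    label: str      # One field name...
    ty: MonoType    # ...and its type
    rest: MonoType  # Another TyRowExtend, a TyEmptyRow, or a TyVar
```

The `fields`-dict version is just this with the chain of single-field links collapsed.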
When we’re inferring the type of a record literal, we infer the types of the values and then construct a row from them. We should never see a `...` in a record literal because we only support it in pattern matching.
```python
def infer_type(expr: Object, ctx: Context) -> MonoType:
    # ...
    if isinstance(expr, Record):
        fields = {}
        rest: TyVar | TyEmptyRow = TyEmptyRow()
        for key, value in expr.data.items():
            assert not isinstance(value, Spread)
            fields[key] = infer_type(value, ctx)
        return set_type(expr, TyRow(fields, rest))
    # ...
```
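To make that concrete, here’s roughly what inferring a record literal yields (a hypothetical demo; it assumes scrapscript’s `tokenize`/`parse` helpers and an empty typing context, and exact names may differ):

```python
expr = parse(tokenize('{ x = 1, y = "hello" }'))
ty = infer_type(expr, {})
# ty is a closed row: TyRow({"x": int, "y": string}) with rest TyEmptyRow()
```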
Pattern matching inference changed a little bit since the last post, but not enough to make a big deal out of it. We now use this kind-of-similar `infer_pattern_type` function to infer the type of a pattern. It looks similar to `infer_type` but it’s a little more complicated because it has to invent fresh type variables for all variables in the pattern and it also has to deal with spread.
In the case of records, we default to building a closed row. However, if we see a spread, we invent a fresh type variable and use that as the rest of the row.
```python
def infer_pattern_type(pattern: Object, ctx: Context) -> MonoType:
    # ...
    if isinstance(pattern, Record):
        fields = {}
        rest: TyVar | TyEmptyRow = TyEmptyRow()  # Default closed row
        for key, value in pattern.data.items():
            if isinstance(value, Spread):
                # Open row
                rest = fresh_tyvar()
                if value.name is not None:
                    ctx[value.name] = Forall([], rest)
                break
            fields[key] = infer_pattern_type(value, ctx)
        return set_type(pattern, TyRow(fields, rest))
    # ...
```
We also bind the rest variable to the name of the spread, if it’s named. This means we can do stuff like `| { x=x, ...rest } -> something` and then use `rest` in the `something` expression.
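For example, this (hypothetical) snippet re-packages everything except `x` and would evaluate to `{ y = 2, z = 3 }`:

```
get_rest { x = 1, y = 2, z = 3 }
. get_rest =
  | { x = x, ...rest } -> rest
```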
I don’t want to get too into it yet, but I want to explain that rows aren’t special in unification. I thought that there was some quirk or trick in the representation that enabled row polymorphism, but that’s not the case at all (shout-out to Sinan and River for making this clear).
Unification is just unification. Whatever types it gets, it will make them equal. Permanently. If it can’t, there’s a type error. Let’s look at some examples:
- `{x=int}` unifies with `{x=int}`
- `{x=int}` does not unify with `{x=string}` because the types that correspond with `x` don’t unify
- `{x=int}` does not unify with `{y=string}` because the fields don’t match and they are not extensible
- `{x=int, ...'a}` unifies with `{y=string, ...'b}`, making `'a` equal to `{y=string, ...'c}` and `'b` equal to `{x=int, ...'c}`; note how both leftovers share the same fresh `'c` because they are unified and supposed to stay that way

Okay, so that means that if you have a function that is meant to be generic over a single field, how does that work? How can you have `| {x=x, ...} -> x` if for every call to that function, the type of `...` keeps expanding as it is unified? It might start as `{x=int, ...'a} -> int`, but over time `'a` will get bound to other row types and it might end up as `{x=int, y=int, z=int, ...'d}`, and that will be too restrictive! What’s going on??
What’s actually happening is boring: let polymorphism strikes once again. We talked about let polymorphism in the last post, but here’s a refresher: if you have an unbound type variable in a let-bound function type, it will get generalized with a `forall`.
Have a function `'a list -> 'a`? Well, the type in the context (the environment that deals in type variables) is actually `forall 'a. 'a list -> 'a`. Every use of that named function will instantiate that `forall` with a fresh type variable (maybe `'b` or `'t123` or whatever). This means that the same function can be generic over one or more type variables. Great.
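As a reminder, instantiation looks something like this (a sketch in the style of the intro-to-HM post, assuming `Forall` stores its quantified variables and body as `tyvars` and `ty`, and that an `apply_ty` helper substitutes type variables in a type):

```python
def instantiate(scheme: Forall) -> MonoType:
    # Replace each quantified type variable with a fresh one
    fresh = {tyvar.name: fresh_tyvar() for tyvar in scheme.tyvars}
    return apply_ty(scheme.ty, fresh)
```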
So that pattern match function above? The one with type `{x=int, ...'a} -> int`? Well, if it’s bound to a name, it will get generalized to `forall 'a. {x=int, ...'a} -> int`. So every call, well… you know the rest. That’s row polymorphism.
Alright, well, assuming we have what I thought was the hard part done already, let’s go over unification in excruciating detail. We’ll start with a neat little base case: empty rows can only unify with other empty rows:
```python
def unify_type(ty1: MonoType, ty2: MonoType) -> None:
    # ...
    if isinstance(ty1, TyEmptyRow) and isinstance(ty2, TyEmptyRow):
        return
    # ...
```
Yay, nice. Let’s continue. We’ll start by flattening the rows.
For every field name present in either row, we’ll try to unify the types pairwise. But if a field is missing from one of the rows, we have to add it to the other one. So we collect those in `ty1_missing` (missing from `ty1`) and `ty2_missing` (missing from `ty2`).
This leaves us with four cases:

- Neither row is missing fields: they agree on field names, so unify the rests to keep them in sync.
- Only `ty2` is missing fields: unify `ty2`’s rest with a new row containing the missing fields, chained to `ty1`’s rest.
- Only `ty1` is missing fields: the mirror image of the previous case.
- Both rows are missing fields: invent a fresh shared rest and unify each row’s rest with a new row containing that row’s missing fields, chained to the shared rest.

Now follow along in code:
```python
def unify_type(ty1: MonoType, ty2: MonoType) -> None:
    # ...
    if isinstance(ty1, TyRow) and isinstance(ty2, TyRow):
        ty1_fields, ty1_rest = row_flatten(ty1)
        ty2_fields, ty2_rest = row_flatten(ty2)
        ty1_missing = {}
        ty2_missing = {}
        all_field_names = set(ty1_fields.keys()) | set(ty2_fields.keys())
        for key in sorted(all_field_names):  # Sort for deterministic error messages
            ty1_val = ty1_fields.get(key)
            ty2_val = ty2_fields.get(key)
            if ty1_val is not None and ty2_val is not None:
                unify_type(ty1_val, ty2_val)
            elif ty1_val is None:
                assert ty2_val is not None
                ty1_missing[key] = ty2_val
            elif ty2_val is None:
                assert ty1_val is not None
                ty2_missing[key] = ty1_val
        # In general, we want to:
        # 1) Add missing fields from one row to the other row
        # 2) "Keep the rows unified" by linking each row's rest to the other
        #    row's rest
        if not ty1_missing and not ty2_missing:
            # They both have the same fields for now, so make sure they have
            # the same fields *forever* by unifying the rests.
            unify_type(ty1_rest, ty2_rest)
            return
        if not ty1_missing:
            # The first row has fields that the second row doesn't have; add
            # them to the second row
            unify_type(ty2_rest, TyRow(ty2_missing, ty1_rest))
            return
        if not ty2_missing:
            # The second row has fields that the first row doesn't have; add
            # them to the first row
            unify_type(ty1_rest, TyRow(ty1_missing, ty2_rest))
            return
        # They each have fields the other lacks; create new rows sharing a rest
        # and add the missing fields to each row
        rest = fresh_tyvar()
        unify_type(ty1_rest, TyRow(ty1_missing, rest))
        unify_type(ty2_rest, TyRow(ty2_missing, rest))
        return
    # ...
```
There’s a lot of code but the important thing to know is that both rows are going to end up with the same fields and the same rest type.
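Here’s a quick demo of the open-row example from earlier (a sketch: `IntType` and `StringType` again stand in for the actual base types):

```python
a, b = fresh_tyvar(), fresh_tyvar()
r1 = TyRow({"x": IntType}, a)     # {x=int, ...'a}
r2 = TyRow({"y": StringType}, b)  # {y=string, ...'b}
unify_type(r1, r2)
# Flattening either row now yields the same fields {x: int, y: string}
# and one shared fresh rest variable 'c
print(row_flatten(r1))
print(row_flatten(r2))
```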
There you have it. That constitutes row polymorphism. There’s some other stuff we don’t need to get into because Scrapscript lacks the relevant language features. For example, Daan Leijen of scoped labels fame jumped into the Hacker News thread to explain why we might not need scoped/shadowed/duplicate labels:
> About duplicate labels.. one needs to retain the duplicate field at runtime if there is a “remove_l” or “mask_l” operation that drops a field “l”. For example, `{x=2,x=True}.remove_x.x == True`. (Where the type of `remove_l` is `{l:a|r} -> {r}`)
Since we don’t have such a masking operation, our lives are easier. Cool.
We’ve heard that we can use rows to implement polymorphic variants, but haven’t figured that part out yet…
Check out the Scrapscript PR.
Our row polymorphism references are the same as last time; see the previous post for the list.
Please recommend additional papers, blog posts, and implementations.