Adding row polymorphism to Damas-Hindley-Milner
Intro to HM
Damas-Hindley-Milner (HM) is a type system for Standard ML and the ML-family languages. River and I wrote about it earlier this month. Check out that post if you would like an introduction or refresher. In this post, we’ll add row polymorphism to it. We’ll do it in the context of scrapscript’s implementation of Algorithm J, but the core ideas should be language and algorithm-independent.
Records
A record is an unordered collection of name to value mappings, like { x =
1, y = "hello" }
. Scrapscript has a very dynamic notion of records in the
interpreter and the compiler: any record can have any number of fields, with
any names, with any values. This makes code like the following possible:
get_x { x=1, y=2 }
. get_x =
| { x=x, ... } -> x
At run-time, the pattern matching looks for a field named x
in the record and
ignores the rest. The spread (...
) indicates that it’s okay to also have
other fields in the record. If the pattern were instead { x=x }
, then the
match would fail at run-time due to excess fields.
The following snippet also runs even though left
and right
have different
leftover record parts:
get_x left + get_x right
. left = { x = 1, y = 2 }
. right = { x = 1, y = 2, z = 3 }
. get_x = | { x = x, ... } -> x
This “leftover” bit isn’t straightforward to model in the type system. We’ll need some special sauce there.
Right now, in the type system, we don’t support records at all. If we try to infer the type of one, it’ll fail with “unsupported” or something. We need to model records in the type system.
Rows
A lot of papers use a notion of rows to model records, kind of how you can
use cons cells to model lists. A row is a mapping of names to types, like
{x=int, y=string}
. Rows also have a special field called rest which
contains the type of the leftover fields. It’s either a type variable or the
empty row.
Having the empty row means that the record has no leftover fields. In fact, the
example row above ({x=int, y=string}
) is assumed to have the empty row as its
rest field because all of these papers love shorthand. If it were extensible,
if it had a type variable as its rest field, then it would be written as
{x=int, y=string, ...'a}
.
This does mean that it’s possible to have nested rows like {x=int,
...'a={y=string, ...'b}}
. This is normal but you’ll never see it in notation
because everybody implicitly flattens the rows because it looks nicer. Also
because unification happens on the flattened rows, but we’ll get there later.
Now let’s talk about the data structures we’ll use to implement rows. We’ll
reuse some of the data structures from the intro-to-HM post.
The data structures
It’s as I said: a row is either an empty row (TyEmptyRow
) or has fields and a
leftover/rest field (TyRow
).
@dataclasses.dataclass
class TyEmptyRow(MonoType):
pass
@dataclasses.dataclass
class TyRow(MonoType):
fields: dict[str, MonoType]
rest: TyVar | TyEmptyRow = dataclasses.field(default_factory=TyEmptyRow)
In order to unify and print rows, we’ll often deal with them in their flattened representation, which is a collapsed tuple of fields and a rest type. One key difference: because it is collapsed, the rest must be either the empty row or an unbound type variable.
Also: not everybody uses the exact same representation. Some go for more
cons-cell-like representations where every row has a label
and a type
and
then a rest
—no fields
dict. They’re equivalent.
Inferring record literals
When we’re inferring the type of a record literal, we infer the types of the
values and then construct a row from them. We should never see a ...
in a
record literal because we only support it in pattern matching.
def infer_type(expr: Object, ctx: Context) -> MonoType:
# ...
if isinstance(expr, Record):
fields = {}
rest: TyVar | TyEmptyRow = TyEmptyRow()
for key, value in expr.data.items():
assert not isinstance(value, Spread)
fields[key] = infer_type(value, ctx)
return set_type(expr, TyRow(fields, rest))
# ...
Pattern matching on records
Pattern matching inference changed a little bit since the last post but not
enough to make a big deal out of it. We now use this kind-of-similar
infer_pattern_type
function to infer the type of a pattern. It looks similar
to infer_type
but it’s a little more complicated because it has to invent
fresh type variables for all variables in the pattern and it also has to deal
with spread.
In the case of records, we default to building a closed row. However, if we see a spread, we invent a fresh type variable and use that as the rest of the row.
def infer_pattern_type(pattern: Object, ctx: Context) -> MonoType:
# ...
if isinstance(pattern, Record):
fields = {}
rest: TyVar | TyEmptyRow = TyEmptyRow() # Default closed row
for key, value in pattern.data.items():
if isinstance(value, Spread):
# Open row
rest = fresh_tyvar()
if value.name is not None:
ctx[value.name] = Forall([], rest)
break
fields[key] = infer_pattern_type(value, ctx)
return set_type(pattern, TyRow(fields, rest))
# ...
We also bind the rest variable to the name of the spread, if it’s named. This
means we can do stuff like | { x=x, ...rest } -> something
and then use
rest
in the something
expression.
Unifying rows: building intuition
I don’t want to get too into it yet, but I want to explain that rows aren’t special in unification. I thought that there was some quirk or trick in the representation that enabled row polymorphism, but that’s not the case at all (shout-out to Sinan and River for making this clear).
Unification is just unification. Whatever types it gets, it will make them equal. Permanently. If it can’t, there’s a type error. Let’s look at some examples:
{x=int}
unifies with{x=int}
{x=int}
does not unify with{x=string}
because the types that correspond withx
don’t unify- Some type systems allow this by having a notion of “type sets” and I think this might be related to polymorphic variants but I’m not sure
{x=int}
does not unify with{y=string}
because the fields don’t match and they are not extensible{x=int, ...'a}
unifies with{y=string, ...'b}
- unification will set
'a
equal to{y=string, ...'c}
and'b
equal to{x=int, ...'c}
- note how they both get the same leftover row
'c
because they are unified and supposed to stay that way
- unification will set
Okay, so that means that if you have a function that is meant to be generic
over a single field, how does that work? How can you have | {x=x, ...} -> x
if for every call to that function, the type of ...
keeps expanding as it is
unified? It might start as {x=int, ...'a} -> int
but over time 'a
will get bound
to other row types and it might end up as {x=int, y=int, z=int, ...'d}
and
that will be too restrictive! What’s going on??
Let polymorphism
What’s actually happening is boring: let polymorphism strikes once again. We
talked about let polymorphism in the last post, but here’s a refresher: if you
have an unbound type variable in a let-bound function type, it will get
generalized with a forall
.
Have a function 'a list -> 'a
? Well, the type in the context (the environment
that deals in type variables) is actually forall 'a. 'a list -> 'a
. Every use
of that named function will instantiate that forall
with a fresh type
variable (maybe 'b
or 't123
or whatever). This means that the same function
can be generic over one or more type variables. Great.
So that pattern match function above? The one with type {x=int, ...'a} ->
int
? Well, if it’s bound to a name, it will get generalized to forall 'a.
{x=int, ...'a} -> int
. So every call, well… you know the rest. That’s row
polymorphism.
Unifying rows: brass tacks
Alright, well, assuming we have what I thought was the hard part done already, let’s go over unification in excruciating detail. We’ll start with a neat little base case: empty rows can only unify with other empty rows:
def unify_type(ty1: MonoType, ty2: MonoType) -> None:
# ...
if isinstance(ty1, TyEmptyRow) and isinstance(ty2, TyEmptyRow):
return
# ...
Yay, nice. Let’s continue. We’ll start by flattening the rows.
For every field name present in either row, we’ll try to unify the types
pairwise. But if it’s missing in one of the rows, we have to add it to the
other one. So we collect that in ty1_missing
(missing from ty1
) and
ty2_missing
(missing from ty2
).
This leaves us with four cases:
- Both rows have all the same fields. In that case, we unify the rests so they stay the same.
- One row has all the same fields as the other and also some more. In that case, add the missing fields to the other (smaller) row. Note that they share a rest.
- The same, but flipped the other direction.
- They each have fields the other lacks. In that case, create new rows with the missing fields and share a rest.
Now follow along in code:
def unify_type(ty1: MonoType, ty2: MonoType) -> None:
# ...
if isinstance(ty1, TyRow) and isinstance(ty2, TyRow):
ty1_fields, ty1_rest = row_flatten(ty1)
ty2_fields, ty2_rest = row_flatten(ty2)
ty1_missing = {}
ty2_missing = {}
all_field_names = set(ty1_fields.keys()) | set(ty2_fields.keys())
for key in sorted(all_field_names): # Sort for deterministic error messages
ty1_val = ty1_fields.get(key)
ty2_val = ty2_fields.get(key)
if ty1_val is not None and ty2_val is not None:
unify_type(ty1_val, ty2_val)
elif ty1_val is None:
assert ty2_val is not None
ty1_missing[key] = ty2_val
elif ty2_val is None:
assert ty1_val is not None
ty2_missing[key] = ty1_val
# In general, we want to:
# 1) Add missing fields from one row to the other row
# 2) "Keep the rows unified" by linking each row's rest to the other
# row's rest
if not ty1_missing and not ty2_missing:
# They both have the same fields for now, so make sure they have
# the same fields *forever* by unifying the rests.
unify_type(ty1_rest, ty2_rest)
return
if not ty1_missing:
# The first row has fields that the second row doesn't have; add
# them to the second row
unify_type(ty2_rest, TyRow(ty2_missing, ty1_rest))
return
if not ty2_missing:
# The second row has fields that the first row doesn't have; add
# them to the first row
unify_type(ty1_rest, TyRow(ty1_missing, ty2_rest))
return
# They each have fields the other lacks; create new rows sharing a rest
# and add the missing fields to each row
rest = fresh_tyvar()
unify_type(ty1_rest, TyRow(ty1_missing, rest))
unify_type(ty2_rest, TyRow(ty2_missing, rest))
return
# ...
There’s a lot of code but the important thing to know is that both rows are going to end up with the same fields and the same rest type.
Wrapping up
There you have it. That constitutes row polymorphism. There’s some other stuff we don’t need to get into because Scrapscript doesn’t have some language features. For example, Daan Leijen of scoped labels fame jumped into the Hacker News thread to explain why we might not need scoped/shadowed/duplicate labels:
About duplicate labels.. one needs to retain the duplicate field at runtime if there is a “remove_l” or “mask_l” operation that drops a field “l”. For example,
{x=2,x=True}.remove_x.x
==True
. (Where the type ofremove_l
is{l:a|r} -> {r}
)
Since we don’t have such a masking operation, our lives are easier. Cool.
We’ve heard that we can use rows to implement polymorphic variants, but haven’t figured that part out yet…
See also
Check out the Scrapscript PR.
Our row polymorphism references are the same as last time, repeated here:
- A Polymorphic Type System for Extensible Records and Variants (PDF, 1996)
- Extensible records with scoped labels (PDF, 2005)
- Extensible Programming with First-Class Cases (PDF, 2006)
- Set-Theoretic Types for Polymorphic Variants (2016)
- Generic programming in OCaml (PDF, 2018)
- Abstracting extensible data types: or, rows by any other name (PDF, 2019)
- Incredible paper title
- Structural Subtyping as Parametric Polymorphism (PDF, 2023)
- Fast polymorphic record access (2023)
Please recommend additional papers, blog posts, and implementations.