Parse, Don't Validate (2019)
lexi-lambda.github.io
132 points by shirian 3 hours ago
Maybe I'm missing something and I'm glad this idea resonates, but it feels like sometime after Java got popular and dynamic languages got a lot of mindshare, a large chunk of the collective programming community forgot why strong static type checking was invented and are now having to rediscover this.
In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around. You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead to avoid writing defensive code everywhere e.g. a Date object that would throw an exception in the constructor if the string given didn't validate as a date (Edit: Changed this from email because email validation is a can of worms as an example). So there, "parse, don't validate" is the norm and not a tip/idea that would need to gain traction.
> In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around.
In 99% of the projects I have worked on in my professional life, anything that comes from human input is handled as a string, and most of the time it stays that way through all of the application layers (with more or fewer checks along the way).
On your precise example, I can even say that I never saw something like an "Email object".
I've seen a mix between stringly typed apps and strongly typed apps. The strongly typed apps had an upfront cost but were much better to work with in the long run. Define types for things like names, email address, age, and the like. Convert the strings to the appropriate type on ingest, and then inside your system only use the correct types.
What's funny is that this is exactly one of the reasons I happen to like JavaScript... at its core, the type coercion and falsy boolean rules work really well (imo) for ETL-type work, where you're dealing with potentially untrusted data. How many times have you had to import a CSV with a bad record/row? It seems to happen all the time. Why? Because people use and manually manipulate data in spreadsheets.
In the end, it's a big part of why I tend to reach for JS/TS first (Deno) for most scripts that are even a little complex to attempt in bash.
> On your precise example, I can even say that I never saw something like an "Email object".
Well that's.... absolutely horrifying. Would you mind sharing what industry/stack you work with?
The easiest and most robust way to deal with email is to have 2 fields. string email, bool isValidated. (And you'll need some additional way to handle a time based validation code). Accept the user's string, fire off an email to it and require them to click a validation link or enter a code somewhere.
Email is weird and ultimately the only decider of a valid email is "can I send email to this address and get confirmation of receipt".
If it's a consumer website you can do some client-side validation of ".@.\\..*" to catch easy typos. That will end up rejecting a very small number of users, but they can usually deal with it. Validating against known good email domains and whatnot will just create a mess.
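A minimal sketch of the two-field approach described above, in Python. The names (`EmailRecord`, `issue_code`, `confirm`) and the 15-minute expiry are assumptions made up for illustration, and actually sending the message is left out.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

CODE_TTL = timedelta(minutes=15)  # assumed expiry window, pick your own


@dataclass
class EmailRecord:
    email: str                                   # kept exactly as the user typed it
    is_validated: bool = False
    code: Optional[str] = None                   # pending one-time code, if any
    code_expires_at: Optional[datetime] = None

    def issue_code(self, now: datetime) -> str:
        """Create a one-time code to put in the confirmation email."""
        self.code = secrets.token_urlsafe(16)
        self.code_expires_at = now + CODE_TTL
        return self.code

    def confirm(self, code: str, now: datetime) -> bool:
        """Flip is_validated only if the code matches and has not expired."""
        if self.code is None or self.code_expires_at is None:
            return False
        if now > self.code_expires_at:
            return False
        # Constant-time comparison; bytes avoid errors on non-ASCII input.
        if not secrets.compare_digest(code.encode(), self.code.encode()):
            return False
        self.is_validated = True
        self.code = None
        self.code_expires_at = None
        return True
```

The address itself is never judged by its shape here: the only thing that marks it valid is the user proving they received the code.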
I've seen some devs prefer that route of programming and it very often results in performance problems.
An undiscussed issue with "everything is a string or dictionary" is that strings and dictionaries both consume very large amounts of memory. Particularly in a language like java.
A Java object which has 2 fields in it, an int and a long, will spend most of its memory on the object header. You end up with an object that has 12 bytes of payload and 32 bytes of object header (Valhalla can't come soon enough). But when you talk about a HashMap in Java, just the map structure itself ends up blowing way past that. The added overhead of 2 Strings for each of the fields plus a Java `Long` and `Integer` just decimates that memory requirement. It's even worse if someone decided to represent those numbers as Strings (I've seen that).
Beyond that, every single lookup is costly, you have to hash the key to lookup the value and you have to compare the key.
In a POJO, when you say "foo.bar", it's just an offset in memory that Java ends up doing. It's absurdly faster.
Please, for the love of god, if you know the structure of the data you are working with, turn it into your language's version of a struct. Stop using dictionaries for everything.
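The same advice, sketched in Python rather than Java: convert the dictionary into a fixed-shape record once, at the point where it enters the program. `Measurement` and its field names are hypothetical, and the memory figures above are specific to the JVM, so this only illustrates the shape of the conversion.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Measurement:
    sensor_id: int
    reading: int

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "Measurement":
        # A missing key raises KeyError; a non-numeric value raises
        # ValueError or TypeError. Either way no Measurement is produced.
        return cls(sensor_id=int(raw["sensor_id"]), reading=int(raw["reading"]))
```

After `m = Measurement.from_mapping(row)`, the rest of the code reads `m.reading` as a plain attribute and never needs to ask whether the key exists or whether the value is still a string.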
I work with PHP, where classes are supposedly a lot slower than strings and arrays (PHP calls dictionaries "associative arrays").
this is likely an ecosystem sort of thing. if your language gives you the tools to do so at no cost (memory/performance) then folks will naturally utilize those features and it will eventually become idiomatic code. kotlin value classes are exactly this and they are everywhere: https://kotlinlang.org/docs/inline-classes.html
My condolences, I urge you to recover from past trauma and not let it prohibit a happy life.
> it feels like sometime after Java got popular [...] a large chunk of the collective programming community forgot why strong static type checking was invented and are now having to rediscover this.
I think you have a very rose-tinted view of the past: while on the academic side static types were intended for proofs, on the industrial side they were for efficiency. C didn't get static types in order to prove your code was correct (and it's really not great at doing that); it got static types so you could account for memory and optimise it.
Java didn't help either, when every type has to be a separate file the cost of individual types is humongous, even more so when every field then needs two methods.
> In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around.
In most strong statically typed languages you would not, but in most statically typed codebases you would. Just look at the Windows interfaces. In fact, while Simonyi's original "Apps Hungarian" had dim echoes of static types, that got completely washed out in Systems Hungarian, which was used widely in C++, which is already a statically typed language.
> I think you have a very rose-tinted view of the past
I think they also forgot the entire Perl era.
> You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead to avoid writing defensive code everywhere e.g. a Date object that would throw an exception in the constructor if the string given didn't validate as a date
It's tricky because `class` conflates a lot of semantically-distinct ideas.
Some people might be making `Date` objects to avoid writing defensive code everywhere (since classes are types), but...
Other people might be making `Date` objects so they can keep all their date-related code in one place (since classes are modules/namespaces, and in Java classes even correspond to files).
Other people might be making `Date` objects so they can override the implementation (since classes are jump tables).
Other people might be making `Date` objects so they can overload a method for different sorts of inputs (since classes are tags).
I think the pragmatics of where code lives, and how the execution branches, probably have a larger impact on such decisions than safety concerns. After all, the most popular way to "avoid writing defensive code everywhere" is to.... write unsafe, brittle code :-(
This is an idea that is not ON or OFF
You can get ever so gradually stricter with your types, which means that the operations you perform on a narrow type are even more solid
It is also 100% possible to do in dynamic languages, it's a cultural thing
2 out of 3 problematic bugs I've had in the last two years or so were in statically typed languages where previous developers didn't use the type system effectively.
One bug was in a system that had an Email type but didn't actually enforce the invariants of emails. The one that caused the problem was it didn't enforce case insensitive comparisons. Trivial to fix, but it was encased in layers of stuff that made tracking it down difficult.
The other was a home grown ORM that used the same optional / maybe type to represent both "leave this column as the default" and "set this column to null". It should be obvious how this could go wrong. Easy to fix but it fucked up some production data.
Both of these are failures to apply "parse, don't validate". The former didn't enforce the invariants it had supposedly parsed the data into. The latter didn't differentiate between two different parse results.
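For the ORM case, one way to keep the two meanings apart is to give each its own type instead of overloading a single optional. This is a hedged Python sketch with hypothetical names (`UseDefault`, `SetNull`, `SetValue`, `assignment`); it is not the API of any real ORM.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple, Union


@dataclass(frozen=True)
class UseDefault:
    """Leave the column out of the statement so the database default applies."""


@dataclass(frozen=True)
class SetNull:
    """Write NULL explicitly."""


@dataclass(frozen=True)
class SetValue:
    value: Any


ColumnWrite = Union[UseDefault, SetNull, SetValue]


def assignment(column: str, write: ColumnWrite) -> Optional[Tuple[str, List[Any]]]:
    """Return (sql_fragment, params) for the column, or None to omit it."""
    if isinstance(write, UseDefault):
        return None
    if isinstance(write, SetNull):
        return (f"{column} = NULL", [])
    if isinstance(write, SetValue):
        return (f"{column} = ?", [write.value])
    raise TypeError(f"unsupported column write: {write!r}")
```

With three distinct cases there is no value that can be read both as "leave it alone" and as "null it out", which is the ambiguity described above.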
that's a bit of a hairy situation. You're doing it wrong. Or not really, but.. complicated.
As per [RFC 5321](https://www.rfc-editor.org/rfc/rfc5321.html):
> the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.
You're not allowed to do that. The email address `foo@bar.com` is identical to `foo@BAR.com`, but not necessarily identical to `FOO@bar.com`. If we're going to talk about 'commonly applied normalisations at most email providers', where do you draw that line? Should `foo+whatever@bar.com` be considered equal to `foo@bar.com`? That sounds weird, except - that is exactly how gmail works, a couple of other mail providers have taken up that particular torch, and if your aim is to uniquely identify a 'recipient', you can hardcode that `a@gmail.com` and `a+whatever@gmail.com` definitely, guaranteed, end up at the same mailbox.
In practice, yes, users _expect_ that email addresses are case insensitive. Not just users, even - various intermediate systems apply the same incorrect logic.
This gets to an intriguing aspect of hardcoding types: you mostly lose the flex. Types are still better. The alternative is that you have to reliably write the same logic (or at least a call to some logic) to disentangle this mess every time you do anything with a string you happen to know is an email address, which is terrible, but it does give you the option of intentionally not doing that if you don't want to apply the usual logic.
That's no way to program, hence actual types and the general trend that comes with them (namely: we do this right, we write that once, and there is no flexibility left). Programming is too hard to leave room for exotic cases that programmers aren't going to think about when dealing with this concept. And if you do need to deal with it, it can still be encoded in the type, but that then makes visible things that are invisible in untyped systems: if my email type only has a '.compare(boolean caseSensitive)' style method, and is not itself inherently comparable because of the case sensitivity thing, that makes it _seem_ much more complicated than plain old strings. This is a lie - emails in strings *ARE* complicated. They just are. You can't make that go away. But you can hide it, and shoving all data in overly generic data types (numbers and strings) tends to do that.
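To make the comparison rules concrete, here is a small Python sketch of an email type whose default equality follows the RFC reading quoted above (domain compared case-insensitively, local part compared exactly), with the "what users expect" comparison available only as an explicitly named method. The class and method names are hypothetical, and the shape check is deliberately minimal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailAddress:
    local_part: str   # case preserved: only the receiving host may interpret it
    domain: str       # stored lower-cased, so default equality ignores its case

    @classmethod
    def parse(cls, raw: str) -> "EmailAddress":
        # Split on the last "@"; everything before it is the local part.
        local, sep, domain = raw.rpartition("@")
        if not sep or not local or not domain:
            raise ValueError(f"not an email address: {raw!r}")
        return cls(local_part=local, domain=domain.lower())

    def equals_ignoring_case(self, other: "EmailAddress") -> bool:
        """The looser comparison many users expect; opt-in, never the default."""
        return (self.domain == other.domain
                and self.local_part.lower() == other.local_part.lower())
```

So `EmailAddress.parse("foo@BAR.com") == EmailAddress.parse("foo@bar.com")` holds, `FOO@bar.com` and `foo@bar.com` compare unequal by default, and a caller who wants the looser rule has to say so at the call site.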
In my experience that's pretty rare. Most people pass around string phone numbers instead of a phonenumber class.
Java makes it a pain though, so most code ends up primitive obsessed. Other languages make it easier, but unless the language and company have a strong culture around this, they still usually end up primitive obsessed.
record PhoneNumber(String value) {}
Huge pain.
What have you gained?
Without any other context? Nothing - it's just a type alias...
But the context this type of an alias should exist in is one where a string isn't turned into a PhoneNumber until you've validated it. All the functions taking a string that might end up being a PhoneNumber need to be highly defensive - but all the functions taking a PhoneNumber can lean on the assumptions that go into that type.
It's nice to have tight control over the string -> PhoneNumber parsing that guarantees all those assumptions are checked. Ideally that'd be done through domain based type restrictions, but it might just be code - either way, if you're diligent, you can stop being defensive in downstream functions.
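A minimal Python sketch of that string -> PhoneNumber boundary. The accepted shape (an optional leading "+" followed by 7 to 15 digits once separators are stripped) is an assumption for illustration; the real rules depend on the business. Note that Python will not stop anyone from calling `PhoneNumber("garbage")` directly, so going through `parse` is a convention here rather than something the language enforces.

```python
import re
from dataclasses import dataclass

_SEPARATORS = re.compile(r"[\s().-]")     # spaces, parentheses, dots, hyphens
_SHAPE = re.compile(r"\+?[0-9]{7,15}")    # assumed rule, adjust to your domain


@dataclass(frozen=True)
class PhoneNumber:
    value: str  # compact form: optional "+" followed by digits only

    @classmethod
    def parse(cls, raw: str) -> "PhoneNumber":
        compact = _SEPARATORS.sub("", raw)
        if not _SHAPE.fullmatch(compact):
            raise ValueError(f"not a phone number: {raw!r}")
        return cls(compact)


def dial_uri(number: PhoneNumber) -> str:
    # No defensive checks needed: a PhoneNumber built via parse() is
    # already known to be in compact form.
    return f"tel:{number.value}"
```

Everything that accepts a `str` still has to be defensive; everything that accepts a `PhoneNumber`, like `dial_uri`, can lean on the checks done in `parse`.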
> But the context this type of an alias should exist in is one where a string isn't turned into a PhoneNumber until you've validated it.
Even if you don't do any validation as part of the construction (and yeah, having a separate type for validated vs unvalidated is extremely helpful), universally using type aliases like that pretty much entirely prevents the class of bugs from accidentally passing a string/int typed value into a variable of the wrong stringy/inty type, e.g. mixing up different categories of id or name or whatever.
> All the functions taking a string that might end up being a PhoneNumber need to be highly defensive
Yeah, I can't relate at all with not using a type for this after having to write gross defensive code a couple of times e.g. if it's not a phone number, return -1...throw an exception? The typed approach is shorter, cleaner, self-documenting, reduces bugs and makes refactoring easier.
One issue is that it's not a type alias but a type encapsulation. This has a cost at runtime; unlike in some functional languages, it's not a zero-cost abstraction.
An explicit type
Obviously the pseudo code leaves to the imagination, but what benefits does this give you? Are you checking that it is 10-digits? Are you allowing for + symbols for the international codes?
Can't pass a PhoneNumber to a function expecting an EmailAddress, for one, or mix up the order of arguments in a function that may otherwise just take two or more strings
That's going to be up to the business building the logic. Ideally those assumptions are clearly encoded in an easily readable manner but at the very least they should be captured somewhere code adjacent (even if it's just a comment and the block of logic to enforce those restraints).
How to make a crap system that users will hate: Let some architecture astronaut decide what characters should be valid or not.
If you are not checking that the phone number is 10 digits (or whatever the rules are for the phone number for your use case), it is absolutely pointless. But why would you not?
I would argue it's the other way around. If I take a string I believe to be a phone number and wrap it in a `PhoneNumber` type, and then later I try to pass it in as the wrong argument to a function like say I get order of name & phone number reversed, it'll complain. Whereas if both name & phone number are strings, it won't complain.
That's what I see as the primary value to this sort of typing. Enforcing the invariants is a separate matter.
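A small Python sketch of the argument-order point, with no invariants enforced at all. `CustomerName`, `PhoneNumber`, and `send_sms` are hypothetical names. In a statically typed language the swapped call would not compile; in Python a static checker such as mypy reports it, and the `isinstance` guard below only exists so the sketch also fails loudly at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerName:
    value: str


@dataclass(frozen=True)
class PhoneNumber:
    value: str


def send_sms(to: PhoneNumber, greeting_name: CustomerName) -> str:
    # Runtime guard for the sketch; a type checker would catch this earlier.
    if not isinstance(to, PhoneNumber) or not isinstance(greeting_name, CustomerName):
        raise TypeError("arguments are swapped or of the wrong type")
    return f"SMS to {to.value}: Hello, {greeting_name.value}!"
```

With two plain `str` parameters, `send_sms(name, phone)` would run without complaint and text the wrong thing to nobody in particular.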
And parentheses. And spaces (that may, or may not, be trimmed). And all kind of unicode equivalent characters, that might have to be canonicalized. Why not treat it as a byte buffer anyway.
Strong static type checking is helpful when implementing the methodology described in this article, but it is beside its focus. You still need to use the most restrictive type. For example, uint, instead of int, when you want to exclude negative values; a non-empty list type, if your list should not be empty; etc.
When the type is more complex, specific constraints should be used. For a real-life example: I designed a type for the occupation of a hotel booking application. The number of occupants of a room must be positive and a child must be accompanied by at least one adult. My type Occupants has a constructor Occupants(int adults, int children) that verifies that condition on construction (and also some maximum values).
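The Occupants type described above might look roughly like this in Python. The maximum values are invented for the sketch, since the comment does not give them.

```python
from dataclasses import dataclass

MAX_ADULTS = 4     # assumed limit, not from the original description
MAX_CHILDREN = 4   # assumed limit, not from the original description


@dataclass(frozen=True)
class Occupants:
    adults: int
    children: int

    def __post_init__(self) -> None:
        # Runs on every construction, so an invalid Occupants cannot exist.
        if self.adults < 0 or self.children < 0:
            raise ValueError("counts must not be negative")
        if self.adults + self.children < 1:
            raise ValueError("a room needs at least one occupant")
        if self.children > 0 and self.adults < 1:
            raise ValueError("children must be accompanied by an adult")
        if self.adults > MAX_ADULTS or self.children > MAX_CHILDREN:
            raise ValueError("too many occupants for one room")
```

Because the check lives in the constructor and the instance is frozen, any function that receives an `Occupants` can rely on those rules without rechecking them.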
> Edit: Changed this from email because email validation is a can of worms as an example
Email honestly seems much more straightforward than dates... Sweden had a Feb 30 in 1712, and there's all sorts of date ranges that never existed in most countries (e.g. the American colonies skipped September 3-13 in 1752).
Dates are unfortunate in that you can only really parse them reliably with a TZDB.
this is very much a nitpick, but I wouldn't call throwing an exception in the constructor a good use of static typing. sure, it's using a separate type, but the guarantees are enforced at runtime
Given that the compiler can't enforce that users only enter valid data at compile time, the next best thing is enforcing that when they do enter invalid data, the program won't produce an `Email` object from it, and thus all `Email` objects and their contents can be assumed to be valid.
I think you're quite right that the idea of "parse don't validate" is (or can be) quite closely tied to OO-style programming.
Essentially the article says that each data type should have a single location in code where it is constructed, which is a very class-based way of thinking. If your Java class only has a constructor and getters, then you're already home free.
Also for the method to be efficient you need to be able to know where an object was constructed. Fortunately class instances already track this information.
It's a design choice more than anything. Haskell's type safety is opt-in — the programmer has to actually choose to properly leverage the type system and design their program this way.
I'm not sure, maybe a little bit. My own journey started with BASIC and then C-like languages in the 80s, dabbling in other languages along the way, doing some Python, and then transitioning to more statically typed modern languages in the past 10 years or so.
C-like languages have this a little bit, in that you'll probably make a struct/class from whatever you're looking at and pass it around rather than a dictionary. But dates are probably just stored as untyped numbers with an implicit meaning, and optionals are a foreign concept (although implicit in pointers).
Now, I know that this stuff has been around for decades, but it wasn't something I'd actually use until relatively recently. I suspect that's true of a lot of other people too. It's not that we forgot why strong static type checking was invented, it's that we never really knew, or just didn't have a language we could work in that had it.
This is a great article, but people often trip over the title and draw unusual conclusions.
The point of the article is about locality of validation logic in a system. Parsing in this context can be thought as consolidating the logic that makes all structure and validity determination about incoming data into one place in the program.
This lets you then rely on the fact that you have valid data in a known structure in all other parts of the program, which don't have to be crufted up with validation logic when used.
Related, it's worth looking at tools that further improve structure/validity locality like protovalidate for protobuf, or Schematron for XML, which allow you to outsource the entire validity checking to library code for existing serialization formats.
When I came to this idea on my own, I called it "translation at the edge." But for me it was more than just centralizing data validation; it was also about giving you access to all the tools your programming language has for manipulating data.
My main example was working with a co-worker whose application used a number of timestamps. They were passing them around as strings and parsing and doing math with them at the point of usage. But, by parsing the inputs into the language's timestamp representation, their internal interfaces were much cleaner and their purpose was much more obvious since that math could be exposed at the invocation and not the function logic, and thus necessarily, through complex function names.
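A rough Python sketch of that timestamp example: parse the string once at the edge, then let internal functions take real `datetime` values. `parse_timestamp` and `is_stale` are hypothetical names, and requiring an explicit UTC offset is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(raw: str) -> datetime:
    """Edge of the system: ISO 8601 string in, timezone-aware UTC datetime out."""
    ts = datetime.fromisoformat(raw)  # raises ValueError on malformed input
    if ts.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {raw!r}")
    return ts.astimezone(timezone.utc)


def is_stale(last_seen: datetime, now: datetime, max_age: timedelta) -> bool:
    """Internal code: plain datetime arithmetic, no string handling."""
    return now - last_seen > max_age
```

The signature of `is_stale` now says what it needs, and the subtraction happens on values the language already knows how to do math with.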
I disagree. I think the key insight is to carry the proof with you in the structure of the type you 'parse' into.
I think that's an excellent way to build a defensive parsing system but... I still want to build that and then put a validator in front of it to run a lot of the common checks and make sure we can populate easy to understand (and voluminous) errors to the user/service/whatever. There is very little as miserable as loading a 20k CSV file into a system and receiving "Invalid value for name on line 3" knowing that there are likely a plethora of other issues that you'll need to discover one by one.
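One way to get both, sketched in Python: a single pass that still parses into a typed record but collects every problem instead of stopping at the first. The `Person` type, the column names, and the message wording are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Person:
    name: str
    age: int


def parse_people(text: str) -> Tuple[List[Person], List[str]]:
    """Parse CSV text with a name,age header; return (good rows, all errors)."""
    people: List[Person] = []
    errors: List[str] = []
    reader = csv.DictReader(io.StringIO(text))
    # Line 1 is the header; this numbering assumes one record per line.
    for line_no, row in enumerate(reader, start=2):
        row_errors: List[str] = []
        name = (row.get("name") or "").strip()
        if not name:
            row_errors.append(f"line {line_no}: name is empty")
        age_raw = (row.get("age") or "").strip()
        age = 0
        try:
            age = int(age_raw)
        except ValueError:
            row_errors.append(f"line {line_no}: age {age_raw!r} is not a whole number")
        else:
            if age < 0:
                row_errors.append(f"line {line_no}: age must not be negative")
        if row_errors:
            errors.extend(row_errors)
        else:
            people.append(Person(name, age))
    return people, errors
```

The caller can refuse the whole file if `errors` is non-empty and show the full list at once, while anything that does come back as a `Person` has already passed every check.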
A frequent visitor to HN. Tip: if you click on the "past" link under the title (but not the "past" link at the top of the page), you'll trigger a search for previous posts.
https://hn.algolia.com/?query=Parse%2C%20Don%27t%20Validate&...
However, it's more effective to throw quotes into the mix, reduces false positives.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
I make great use of value objects in my applications but there are things I needed to do to make it ergonomic/performant. A "small" application of mine has over 100 value objects implemented as classes. Large apps easily get into the 1000s of classes just for value objects. That is a lot of boilerplate. It's a lot of boxing/unboxing. It'd be a lot more typing than "stringly typed" programs.
To make it viable, all value objects are code-generated from model schemas, and then customized as needed (only like 5% need customization beyond basic data types). I have auto-upcasting on setters so you can code stringly when wanted, but everything is validated (very useful for writing unit tests more quickly). I only parse into types at boundaries or on writes/sets, not on reads/gets (limits the amount of boxing, particularly on reading large amounts of data). Heavy use of reflection, and auto-wiring/dependency injection.
But with these conventions in place, I quite enjoy it. Easy to customize/narrow a type. One convention for all validation. External inputs are by default secure with nice error messages. One place where all values validation happens (./values classes folder).
A great piece.
Unfortunately, it's somewhat of a religious argument about the one true way. I've worked on both sides of the fence, and each field is equally green in its own way. I've used OCaml, with static typing, and Clojure, with maybe-opt-in schema checking. They both work fine for real purposes.
The big problem arrives when you mix metaphors. With typing, you're either in, or you're out - or should be. You ought not to fall between stools. Each point of view works fine, approached in the right way, but don't pretend one thing is the other.
It seems modern statically-typed and even dynamically-typed languages all adopted this idea, except Go, where they decided zero values represent valid states always (or mostly).
A sincere question to Go programmers – what's your take on "Parse, Don't Validate"?
Not speaking for all Go programmers, but I think there is a lot of merit in the idea of "making zero a meaningful value". Zero Is Initialization (ZII) is a whole philosophy that uses this idea. Also, "nil-punning" in Clojure is worth looking at. Basically, if you make "zero" a valid state for all types (the number 0, an empty array, a null pointer) then you can avoid wrapping values in Option types and design your code for the case where a block of memory is initialized to zero or zeroed out.
Only if you ignore the billion cases where it doesn't work, such that half the standard library explodes if you try to use it with zero values because they make no sense[0], special mention to reflect.Value's
> Panic: call of reflect.Value.IsZero on zero Value
And the "cool" stuff like database/sql's plethora of Null* for every single type it can support. So you're not really avoiding "wrapping values in Option types", you're instead copy/pasting ad-hoc ones all over, and have to deal with zero values in places where they have no reason to be, forced upon you by the language.
And then of course it looks even worse because... not having universal default values doesn't preclude having opt-in default values. So when that's useful and sensible your type gets a default value, and when it's not it doesn't, and that avoids having to add a check in every single method so your code doesn't go off the rails when it encounters a nonsensical zero value.
[0] or even when you might think it does, like a nil Logger or Handler
Each repost is worth it.
This, along with John Ousterhout's talk [1] on deep interfaces was transformational for me. And this is coming from a guy who codes in python, so lots of transferable learnings.
I did a lightning talk on this topic last year, with a concrete example in Yesod.
Semi tangent but I am curious. For those with more experience in Python, do you just pass around generic Pandas Dataframes or do you parse each row into an object and write logic that manipulates those instead?
Definitely do not parse each row into eg pydantic models. You lose the entire performance benefit of pandas / polars by doing this.
If you need it, use a dataframe validation library to ensure that values are within certain ranges.
There are not yet good, fast implementations of proper types in Python dataframes (or databases for that matter) that I am aware of.
Pass as immutable values, and try to enforce schema (eg, arrow) to keep typed & predictable. This is generally easy by ensuring initial data loads get validated, and then basic testing of subsequent operations goes far.
If Python had dependent types, that's how I'd think about them, and keeping them typed would be even easier, eg, nulls sneaking in unexpectedly and breaking numeric columns
When using something like dask, which forces stronger adherence to typings, this can get more painful
Speaking personally, I try not to write code that passes around dataframes at all. I only really want to interact with them when I have to in order to read/write parquet.
The circumstances where you would use one or the other are vastly different. A dataframe is an optimized datastructure for dealing with columnar data, filtering, sorting, aggregating, etc. So if that is what you are dealing with, use a dataframe.
The goal is more about cleaning and massaging data at the perimeter (coming in, and going out) versus what specific tool (a collection of objects vs a dataframe) is used.
The author's point here is great, but the post does (imho) a poor job illustrating it.
The tl;dr on this is: stop sprinkling guards and if statements all over your codebase. Convert (parse) the data into truthful objects/structs/containers at the perimeter. The goal is to do that work at the boundaries of your system, so that inside of your system you can stop worrying about it and trust the value objects you have.
I think my hangup here is on the use of terms parse vs validate. They are not the right terms to describe this.
Hot take: Static typing is often touted as the end all be all, and all you need to do is "parse, don't validate" at the edge of your program and everything is fine and dandy.
In practice, I find that staunch static typing proponents are often middle or junior engineers that want to work with an idealised version of programming in their heads. In reality what you are looking for is "openness" and "consistency", because no amount of static typing will save you from poorly defined or optimised-too-early types that encode business logic constraints into programmatic types.
This is also why in practice a lot of customer input ends up being passed as "strings" or has a raw copy + parsed copy, because business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program.
> no amount of static typing will save you from poorly defined or optimised-too-early types that encode business logic constraints into programmatic types.
That's not a fault of type systems, though.
> because business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program
That's a problem with overly-tight coupling, poor design, and poor planning, not type systems
> In practice, I find that staunch static typing proponents are often middle or junior engineers
I find people become enthusiastic about it around intermediate stages in their career, and they sometimes embrace it in ways that can be a bit rigid and over-zealous, but again it isn't a problem with type systems
> I find that staunch static typing proponents are often middle or junior engineers
I wouldn't go this far as it depends on when the individual is at that phase of their career. The software world bounces between hype cycles for rigorous static typing and full on dynamic typing. Both options are painful.
I think what's more often the case is that engineers start off by experiencing one of these poles and then after getting burned by it they run to the other pole and become zealous. But at some point most engineers will come to realize that both options have their flaws and find their way to some middle ground between the two, and start to tune out the hype cycles.
This is such a tired take. The burden of using static types is incredibly minimal and makes it drastically simpler to redesign your program around changing business requirements while maintaining confidence in program behavior.