Loading Pydantic models from JSON without running out of memory

pythonspeed.com

134 points by itamarst 7 months ago


scolvin - 7 months ago

Pydantic author here. We have plans for an improvement to pydantic where JSON is parsed iteratively, which will make way for reading a file as we parse it. Details in https://github.com/pydantic/pydantic/issues/10032.

Our JSON parser, jiter (https://github.com/pydantic/jiter) already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.

This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.

fidotron - 7 months ago

Having only recently encountered this, does anyone have any insight as to why it takes 2GB to handle a 100MB file?

This looks highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for XML parsing.

jmugan - 7 months ago

My problem isn't running out of memory; it's loading a complex model where the fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load all the way down and leaves some of the deeper parts as dictionaries. I almost need a parser that searches the space of possible loads. Anyone have any ideas for software that does that?

deepsquirrelnet - 7 months ago

Alternatively, if you have to go with JSON, you could consider using JSON Lines (jsonl). I think I’d start by evaluating whether this is a good application for JSON at all; I tend to only want to use it for small files. Binary formats are usually much better in this scenario.
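A minimal sketch of the JSON Lines idea, using only the standard library. The helper name `iter_records` and the sample data are made up for illustration; the point is that each line is parsed on its own, so only one record needs to be in memory at a time.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Hypothetical sample data; in practice this would be an open file object.
sample = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')

# Records are consumed one at a time; the full list is never built.
total = sum(record["id"] for record in iter_records(sample))
```

Each yielded dict could be handed to a model constructor and then discarded, which keeps peak memory tied to the largest single record rather than the whole file.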

dgan - 7 months ago

I gave up on Python dataclasses and JSON. I use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.

Automatic, statically typed deserialization is worth the trouble, in my opinion.

fjasdfas - 7 months ago

So are there downsides to just always setting slots=True on all of my Python data types?

thisguy47 - 7 months ago

I'd like to see a comparison of `ijson` vs. just `json.load(f)`. `ujson` would also be interesting to see.
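One way such a comparison could be set up, sketched with the standard library's `tracemalloc`. The helper `peak_bytes` and the synthetic document are assumptions made for illustration, and only `json.load` is measured here; no numbers are claimed.

```python
import io
import json
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Synthetic document standing in for the real file (an assumption of this sketch).
raw = json.dumps([{"id": i, "name": "x" * 10} for i in range(10_000)])

records, peak = peak_bytes(json.load, io.StringIO(raw))
ratio = peak / len(raw)  # peak traced bytes per input character
```

The same helper could wrap a function that loops over a streaming parser. Note that `tracemalloc` only sees allocations made through Python's memory allocators, so memory that a C extension allocates directly may not show up; measuring process-level peak memory would be a fairer test across parsers.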

zxilly - 7 months ago

Maybe using mmap would also save some memory, though I'm not quite sure whether this can be implemented in Python.
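Python does ship an `mmap` module in the standard library. A rough sketch of reading newline-delimited JSON through a memory map follows; the temporary file and its contents are made up for the demo, and whether this actually saves memory for a given workload is untested here.

```python
import json
import mmap
import os
import tempfile

# Hypothetical sample file in JSON Lines form, created only for the demo.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    tmp.write(b'{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    path = tmp.name

ids = []
with open(path, "rb") as f:
    # Length 0 maps the whole file; the OS pages data in lazily.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # mmap objects support readline(), so records can be read one at a time.
        for line in iter(mm.readline, b""):
            ids.append(json.loads(line)["id"])

os.remove(path)
```

This only avoids holding the raw file bytes in process memory. The parsed Python objects still have to be allocated, so mmap alone would not remove that part of the overhead.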

kayson - 7 months ago

How does the speed of the dataclass version compare?

m_ke - 7 months ago

Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/
