Loading Pydantic models from JSON without running out of memory

56 itamarst 16 5/22/2025, 6:06:37 PM pythonspeed.com ↗

Comments (16)

m_ke · 2h ago
Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/
mbb70 · 1h ago
A great feature of Pydantic is its validation hooks, which let you intercept serialization/deserialization of specific fields and augment behavior.

For example, if you are querying a DB that returns a column as a JSON string, it's trivial with Pydantic to JSON-parse the column as part of deserialization with an annotation.

Pydantic is definitely slower and not a 'zero cost abstraction', but you do get a lot for it.

itamarst · 2h ago
msgspec is much more memory efficient out of the box, yes. Also quite fast.
jmugan · 2h ago
My problem isn't running out of memory; it's loading in a complex model where the fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load it all the way and leaves some of the deeper parts as dictionaries. I need like almost a parser to search the space of different loads. Anyone have any ideas for software that does that?
enragedcacti · 1h ago
The only reason I can think of for the behavior you are describing is if one of the unioned types at some level of the hierarchy is equivalent to Dict[str, Any]. My understanding is that Pydantic will explore every option provided recursively and raise a ValidationError if none match but will never just give up and hand you a partially validated object.

Are you able to share a snippet that reproduces what you're seeing?

causasui · 1h ago
You probably want to use Discriminated Unions https://docs.pydantic.dev/latest/concepts/unions/#discrimina...
cbcoutinho · 1h ago
At some point, we have to admit we're asking too much from our tools.

I know nothing about your context, but in what context would a single model need to support so many permutations of a data structure? Just because software can, doesn't mean it should.

zxilly · 1h ago
Maybe using mmap would also save some memory, though I'm not quite sure whether this can be done in Python.
itamarst · 1h ago
Once you switch to ijson it will not save any memory, no, because ijson essentially uses zero memory for the parsing. You're just left with the in-memory representation.
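
On the "can this be done in Python" part: the standard library does ship an mmap module. A minimal sketch follows; the helper name load_json_via_mmap is made up for illustration. Note that json.loads() only accepts str, bytes, or bytearray, so the mapped data gets copied anyway, and the parsed objects still live in memory afterwards.

```python
import json
import mmap


def load_json_via_mmap(path):
    """Parse a JSON file by way of a read-only memory map (hypothetical helper)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map copies its contents into a bytes object,
            # because json.loads() does not accept an mmap object directly.
            return json.loads(mm[:])
```

So the map itself does not reduce the size of the resulting Python objects.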
fjasdfas · 2h ago
So are there downsides to just always setting slots=True on all of my python data types?
itamarst · 2h ago
You can't add extra attributes that weren't part of the original dataclass definition:

  >>> from dataclasses import dataclass
  >>> @dataclass
  ... class C: pass
  ... 
  >>> C().x = 1
  >>> @dataclass(slots=True)
  ... class D: pass
  ... 
  >>> D().x = 1
  Traceback (most recent call last):
    File "<python-input-4>", line 1, in <module>
      D().x = 1
      ^^^^^
  AttributeError: 'D' object has no attribute 'x' and no __dict__ for setting new attributes
Most of the time this is not a thing you actually need to do.
monomial · 1h ago
I rarely need to dynamically add attributes to dataclasses like this myself, but unfortunately this also means things like `@cached_property` won't work: it caches the method result in the instance's `__dict__`, which a slotted class doesn't have.
masklinn · 1h ago
Also some of the introspection stops working e.g. vars().

If you're using dataclasses it's less of an issue, because dataclasses.asdict still works.

dgan · 1h ago
I gave up on Python dataclasses and JSON. I use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.

Automatic, statically typed deserialization is worth the trouble in my opinion

thisguy47 · 2h ago
I'd like to see a comparison of ijson vs just `json.load(f)`. `ujson` would also be interesting to see.
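
For the `json.load(f)` side, one rough way to get a number is the standard library's tracemalloc. This is only a sketch: the helper name is made up, and tracemalloc reports memory allocated through Python's allocators, not the whole process's resident memory.

```python
import json
import tracemalloc


def peak_memory_of_json_load(path):
    """Return (parsed_data, peak_bytes) for json.load() on the given file."""
    tracemalloc.start()
    try:
        with open(path, "rb") as f:
            data = json.load(f)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return data, peak
```

The same wrapper could be pointed at another parser to compare peaks on the same input file.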
itamarst · 2h ago
For my PyCon 2025 talk I did this. Video isn't up yet, but slides are here: https://pythonspeed.com/pycon2025/slides/

The linked-from-original-article ijson article was the inspiration for the talk: https://pythonspeed.com/articles/json-memory-streaming/