Microsoft Office is using an artificially complex XML schema as a lock-in tool

209 points by firexcy | 7/19/2025, 4:22:45 AM | blog.documentfoundation.org

Comments (120)

jonathaneunice · 11h ago
I wish this article had shown side-by-side examples. Back when I built document transformation tools as part of a publishing pipeline, the simplicity and clarity benefits of OpenDocument's XML over Microsoft's OOXML were *staggering* in practice. A beautiful, clean, logical approach vs. beyond-Byzantine cruft and complexity at every turn.

I don't remember every element enough to render from memory, but ChatGPT's example feels about right:

OpenDocument

    <text:p text:style-name="Para">
      This is some <text:span text:style-name="Bold">bold text</text:span> in a paragraph.
    </text:p>

OOXML

    <w:p>
      <w:pPr>
        <w:pStyle w:val="Para"/>
      </w:pPr>
      <w:r>
        <w:t>This is some </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>bold text</w:t>
      </w:r>
      <w:r>
        <w:t> in a paragraph.</w:t>
      </w:r>
    </w:p>

OpenDocument is not always 100% "simple," but it's logical and direct. Comprehensible on sight. OOXML is...something else entirely. Keep in mind the above are the simplest possible examples, not including named styles, footnotes, comments, change markup, and 247 other features commonly seen in commercial documents. The OpenDocument advantage increases at scale. In every way except breadth of adoption.

_the_inflator · 10h ago
By accident, I saw firsthand how a simple layout, such as a page and a few paragraphs, can make you understand why formats like Markdown even exist, because the number one text processing tool would throw such a gargantuan load of crude syntax at you for a few paragraphs.

Respect to MS for keeping the lights on.

People need to understand that there is no MS format per se, but different standards from which you can choose. Years ago, when OpenDocument was fairly popular, MS was kind of hesitant to use an XML format. XML is a strict format, no matter the syntax.

And I bet that MS intended such a complicated format to prevent open source projects from developing parsers, and thus to keep MS from losing market share that way. I bet considerations of such a strategy, discussed at the time, are buried in Archive.org.

On the other hand, MS didn't want, nor foresee, the XML chaos which would follow later on. XML is a format, and all it demands is formal correctness. It is like assembler: fixed instruction sets with lots of freedom, and only the computer needs to "understand" the code. If it runs, ship it.

The Zen of whatever cannot be enforced. JavaScript was once the Web's assembly language. Everything was possible, but you had to do the gruntwork and encapsulate every higher-level function in a module consisting of hundreds of LoC: do in hundreds of lines what Python could achieve in one.

Then Babel came, then TypeScript, and today I've lost track of all the changes and features of the language and its dialects. The same goes for PHP, Java, C++, and even Python. So many features were hyped, and you must learn this crap nevertheless, because it is valid code.

Humans cannot stand a steady state. The more you add to something, the more active and valuable it seems. I hate feature creep — kudos to all the compiler devs, who deserve credit for keeping the lights on.

Someone · 9h ago
> And I bet that MS intended such a complicated format to prevent Open Source Projects from developing parsers and MS from losing market share this way.

It wouldn’t surprise me at all if it simply was “the XML schema mostly follows how our implementation represents this kind of stuff”.

The source code of MS Word almost certainly has lots of now weird-looking design choices based on having to run in constrained memory. It also has dark corners for "we released a version that did this slightly differently, so we have to keep supporting it".

ninkendo · 2h ago
> It wouldn’t surprise me at all if it simply was “the XML schema mostly follows how our implementation represents this kind of stuff”

That’s exactly what it was. They originally had a binary representation (.doc) which was pretty much just a straight-up dump of their internal data structures to disk. When they felt forced to make an “open” “xml-based” format, they basically converted their binary serialization to XML without changing what it represented at all. It was basically malicious compliance.

dathinab · 9h ago
As far as I understand, just parsing OOXML is by far not enough to get anywhere close to a reasonably correct understanding of a document's layout, because the format is "super flexible" in ways going "beyond the OOXML standard", i.e. you still have to reverse engineer a ton of things.

(i.e. they worked around the "XML is a strict format" part ;) )

Or at least it was that way back when OOXML was new and the whole scandal about MS "happening" to not correctly implement their own standard was still news (so 10+ years ago).

dathinab · 9h ago
I wonder how much of this is related to accidentally grown complexity (in their original closed format) and their WYSIWYG editor just doing dumb stuff that devs aren't sure how or why it ended up that way, but also don't want to touch lest it break.

Which they then carried over into OOXML.

Just to be clear: MS has, back then and recently again, repeatedly and very clearly shown that the whole embrace-extend-extinguish thing is at the core of its actions toward most things open or standardized (1). And what better way to "extinguish" open text standards than to make one yourself which is built in a way guaranteed to not work well, i.e. to fail, for anyone (or most anyone) but first-party MS products, and then use that to push the propaganda FUD that open text standards just can't be good.

So I'm very sure that having an obscure, hyper-complex OOXML "open standard", where actually implementing it in a standard-compliant way is far from sufficient for correctly displayed/interpreted documents, is a very intentional thing.

But if you already have a mess internally, it is a very good move to just use/expand on that, because it gives you an excuse for why things ended up how they are and saves implementation time.

----

(1): Disclaimer: In between there were a few years where they acted quite friendly; specific devs at MS genuinely love open source; in some areas open source has simply won; in some places it's just a very bad time for "extend and extinguish", so it's not (yet) done; and sometimes it's done very slowly and creepingly. So yes, you will find good MS open source projects and contributions. But it's still pretty much everywhere, no matter which direction you look, as long as you look closely enough.

dathinab · 8h ago
Honestly, OOXML looks a lot like someone took a non-XML format and gave it an XML encoding.

XML is a markup language, so it _should_ interleave quite "naturally" and work well for text formatting tasks (see the OpenDocument example, or super simple "ancient style" HTML).

But OOXML looks more like someone force-serialized some live OOP object hierarchy with (potentially cyclic) references and a ton of subclasses, etc.

tl;dr: it looks a lot like a simplified form of how text editors internally represent formatted text.

Like, w:r looks like a text section, you could say a r_ow of wide characters or words. w:p looks like a subclass of an implicit type which is basically a `Vec<w:r>`. w:pPr looks like a ".presentation" property of w:p, same for w:rPr, probably both being subtypes of some generic Presentation base class. w:t looks like a generic `.text: String` property. w:pStyle looks like a property of Presentation or its ParagraphPresentation subclass; its `w:val` attribute makes it look like a shared reference which can be looked up by the key `"Para"`. w:b is just another subclass of Presentation you can use in any context, etc.
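To make that concrete, here is a purely speculative sketch (my guess at the shape, not Word's actual internals) of the kind of object model whose field-by-field dump would come out looking exactly like the OOXML quoted upthread:

    # Speculative sketch: an in-memory text model whose naive serialization
    # would look like <w:p><w:pPr/><w:r><w:rPr/><w:t/></w:r></w:p>.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class RunProperties:              # would serialize as <w:rPr>
        bold: bool = False            # <w:b/>

    @dataclass
    class Run:                        # <w:r>: a span of uniformly formatted text
        text: str                     # <w:t>
        props: RunProperties = field(default_factory=RunProperties)

    @dataclass
    class ParagraphProperties:        # <w:pPr>
        style_ref: Optional[str] = None  # <w:pStyle w:val="Para"/>: key into a style table

    @dataclass
    class Paragraph:                  # <w:p>: essentially a Vec<Run>
        props: ParagraphProperties = field(default_factory=ParagraphProperties)
        runs: List[Run] = field(default_factory=list)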

Which opens the question:

"do they mostly just dump their internal app state?"

And did they make their format that over-complicated and "over-flexible" so that they can just change their internal structure and still dump it?

Which would also explain how they might have ended up "accidentally" implementing their own standard incorrectly around 10 years ago, during early OOXML times.

And if so, isn't that basically "proof" that OOXML isn't really an open format, but just a make-pretend one?

xg15 · 7h ago
I read somewhere that in the first versions of Office, the "documents" were literally just memory dumps.

So I guess they're going back to that old strategy...

Edit: Source might have been this: https://news.ycombinator.com/item?id=39402595 , so part of it might have been an urban myth.

jpalomaki · 14h ago
What we should really do is abandon the WYSIWYG approach to document editing. It inevitably leads to vendor lock-in.

Instead of perfect looks, we should focus on the content. Formats like Markdown are nice because they force you to do this. The old way made sense 30 years ago, when information was consumed on paper.

Ekaros · 12h ago
Looking at LaTeX, I don't think hand-tuning some parameters until you get the right look in every single case is much better user experience...

Of course, if we stopped really caring what things look like, we could save a lot of energy and time. Just go back to pure HTML without any JavaScript or CSS...

lcnielsen · 11h ago
> Looking at LaTeX, I don't think hand-tuning some parameters until you get the right look in every single case is much better user experience...

Having written many papers, reports and my entire Ph.D. thesis in LaTeX, and also moved between LaTeX classes/templates when changing journals... I'm inclined to agree, to an extent. I think every layout system has a final hand-tweaking component (like inline HTML in Markdown, for example), but LaTeX has a very steep learning curve once you go beyond the basic series of plots and paragraphs. There are so many tricks and hacks for padding and shifting and adjusting your layout, and some of them are "right" and others are "wrong" for really quite esoteric reasons (like which abstraction layer they work at, or some priority logic).

Of course in the end it's extremely powerful and still my favourite markup language when I need something more powerful than markdown (although reStructuredText is not so bad either). But it's really for professionals with the time to learn a layout system.

Then again, there are other advantages to writing out the layout, when it comes to archiving and accessibility, due to the structured information contained in the markup beyond what is rendered. arXiv makes a point of this and forces you to submit the LaTeX source rather than a rendered PDF, so that they can really preserve it.

techjamie · 3h ago
I've been quite happy with using Typst to write things at home. It's less arcane than LaTeX and easier to reason about.

At work I use our ChatGPT page to generate an HTML+CSS skeleton of what I want and tweak that. It's quicker for me than doing the equivalent in Word, and easier to manipulate later. Most of the time I don't need anyone else editing my docs, so it works out.

dathinab · 8h ago
> Looking at LaTeX, I don't think hand-tuning some parameters until you get the right look in every single case is much better user experience...

In my experience you do that more in Word than in LaTeX (the "I added some paragraphs here, and WTF is that picture two pages later doing now" problem).

The issue is to some degree fundamental to the underlying challenge of laying out formatted text with embedded objects, and it affects both Word and LaTeX.

Though that assumes you know how to properly use Word / LaTeX; if you don't, you can cause yourself a huge amount of work ;)

homebrewer · 11h ago
> if we stop really caring what things look like we could save lot of energy and time

Yet simple Markdown documents automatically converted into PDF by pandoc look ten times better than most MS Office documents I've had to deal with over the past couple of decades. Most MS Office users have very little knowledge of its capabilities and do things like adjusting text blocks with spaces, numbering figures manually (which results in broken references that lead to the wrong figure — or nowhere), applying styles to text manually instead of using style presets (resulting in similar things being styled differently), etc.
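(For anyone curious, that conversion is a one-liner: pandoc report.md -o report.pdf. A minimal scripted sketch, assuming pandoc plus a LaTeX engine are installed and using the pypandoc wrapper; the file names are placeholders:)

    # Convert Markdown to PDF via pandoc.
    # Assumes pandoc and a LaTeX engine on PATH, and `pip install pypandoc`.
    import pypandoc

    pypandoc.convert_file("report.md", "pdf", outputfile="report.pdf")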

DemocracyFTW2 · 11h ago
LaTeX is just as bad as WYSIWYG, pure unadulterated TeX is what real programmers use! Of course, real hardcore programmers just bathe their hard drives in showers of cosmic rays until just the right bits have flipped et voila!—document ready
mannycalavera42 · 9h ago
I am a pro when it comes to M-x butterfly
dathinab · 8h ago
Even many of the very widely used LaTeX "editors" have some semi-live "preview" feature, which often allows some WYSIWYG-like interaction in the preview.

Similarly, if some random non-technical person just needs to write, I don't know, 5 paragraphs of text with a headline (no high formatting requirements, no templates, nothing fancy), why would you force them to go without WYSIWYG when it is the perfect fit for their use case in every single aspect?

Similarly, Markdown doesn't scale to a lot of writing requirements, like _AT ALL_. I know because I wrote pretty much every thesis and larger report during my studies in Markdown, and I had to step out of Markdown all the time. Whether by using inline LaTeX, or by tweaking Markdown-to-PDF conversion templates (splitting the Markdown into many different files and folders, each of which could be Markdown or something else, and/or using inline LaTeX to pull non-Markdown sources into LaTeX), etc. It was a nice pipeline, but it also wasn't really Markdown anymore; it was some amalgamation of Markdown, LaTeX, and other things. As a programmer I was fine with that, but it doesn't scale at all to the "standard" user of office applications.

ozim · 12h ago
The not-fun part is that focusing on content will lead you to a place where customers use BS arguments for power play.

You want to be able to get everything just right for the looks, because there will always be someone negotiating the price down because your PDF report does not look right and they know a competitor who "does this heading exactly right".

In theory, garbled content is of course not acceptable, but small deviations should be tolerated.

Unfortunately we have all kinds of power games where exact looks matter, and you don't always have the option to walk away from asshole customers nitpicking BS issues.

znpy · 11h ago
This could only work if everyone using computers were a tech person, willing to spend a couple of months learning LaTeX or whatever.

Neither condition is reality, of course.

So yeah, that's not happening.

redeeman · 1h ago
No; instead they spend all their professional lives not knowing what they work with, and waste much more time in aggregate.
p_ing · 7h ago
> Instead of perfect looks, we should focus on the content.

Many documents are created for looks rather than content.

constantcrying · 13h ago
I don't think WYSIWYG is the issue here; WYSIWYG editors for Markdown exist. It is the premise that document creation is about representing a piece of paper digitally.

For most documents nowadays it makes no sense to see them as a representation of physical paper. And the Word paradigm of representing a document as if it were a piece of paper is obsolete in many areas where it is still being used.

Ironically, Atlassian, with Confluence, is a large force pushing companies away from documents as a representation of paper.

xcrunner529 · 57m ago
People inevitably want to print the contents, never mind the implications of column size for ease of reading.
mawadev · 13h ago
Do HTML WYSIWYG editors ever lead to vendor locking?
ZiiS · 13h ago
Yes, FrontPage tried several times to lock reading to Internet Explorer and hosting to IIS. Even with the best will in the world, switching editors lost fidelity.
nikanj · 13h ago
HTML WYSIWYG editors lead to eldritch horrors, the likes of which haven't been seen since someone tried parsing HTML with regex
sixtyj · 12h ago
Wix, Webflow, WordPress… I have tried all the WYSIWYG (block) editors. Oh my, what a mess the HTML is if you try to edit a file manually…
piker · 14h ago
Not sure why you're getting so downvoted. It's a totally reasonable opinion in 2025, but it faces massive adoption headwinds. People still cling to the idea of printing pages of documents even if it's increasingly rare (even, say, in legal) for them to do so.
Arainach · 12h ago
This is getting downvoted because it's ludicrous. Users want WYSIWYG: documents that match what appears on the printed page, or what the people they share the document with see when they open it.

"Interoperability" is something technical enthusiasts talk about, not something that users creating documents care about, beyond the people they share with seeing exactly what was created.

nlitened · 11h ago
As you say, tech enthusiasts only _talk_ about interoperability, but very, very few actually care about it. Try interoperating between two pieces of code written in two different languages without spinning up an HTTP server or a separate virtual machine with a database system. Not even two different languages: just between two major versions of the same damn programming language.
piker · 5h ago
I think you're missing the point that the "G" in WYSIWYG is "when printed". Everybody wants uniform rendering across applications, in the way that, say, websites render uniformly across browsers. But fewer people care today what it looks like when printed--they're never going to print it anyway. They're just going to DocuSign it and call it a day.

In other words, fidelity to the printed page isn't really as important or as magical today as it was in 1984.

Arainach · 5h ago
No, the G is "get". Users want to create documents in their presentation format. When they create a table they want to see a table, not a bunch of pipes and plus signs. When they create a hyperlink they want to see the link text, not the URL and a bunch of brackets.
piker · 3h ago
Yes, the markup is hidden and the document is rendered in all cases. The point I was making, and I believe the parent was making, is that it doesn't have to be wed to the physical medium. Perhaps we're talking past each other here.
Arainach · 1h ago
Writing Markdown is decidedly not WYSIWYG. Storing documents from a WYSIWYG editor in a Markdown-like format isn't viable either: look how frustrated word processor users get if changing one element repositions another ("all my pictures moved"). Users want things in a precise format.
piker · 43m ago
Nobody is proposing that people will be writing markdown by hand in text editors. Just that a lot of the complexity of rendering stuff on the screen precisely as it would be printed on, say, A4 paper can perhaps soon be left behind. People print a lot less these days, and thus WYSIWYG (in the original 70s, 80s and early 90s sense of the term https://en.wikipedia.org/wiki/WYSIWYG) may be less important.
bboygravity · 13h ago
Yeah, and let's all move to Arch Linux without any window manager. And you have to write your own driver to use Wi-Fi.

Death to user-friendliness! Advanced users only! /s

fxtentacle · 12h ago
Try to create a PDF report with collapsible subheadings in Excel. After you have learned the necessary MacroScript and JavaScript to pull that off, writing a Wi-Fi driver will feel like a joke in comparison.
flohofwoe · 12h ago
I don't even think it's intentional, they had to come up with a file format which supports all the weird historical artefacts in the various Office tools. They didn't have the luxury to first come up with a clean file format and then write the tools around it.

And I bet they didn't switch to XML because it was superior to their old file formats, but simply because of the unbelievable XML hype that existed for a short time in the late 1990s and early 2000s.

Arainach · 12h ago
An XML format, even one with a lot of cruft to handle legacy complexity, is absolutely easier to parse/interop with than a legacy binary format that was to a large degree a serialization of undocumented in-memory content.

OOXML was, if anything, an attempt to get ahead of requirements to have a documented interoperable format. I believe it was a consequence of legal settlements with the US or EU but am too tired at the moment to look up sources proving that.

dathinab · 7h ago
> is absolutely easier to parse/interop with than a legacy binary format

Depends.

You can have well-designed, clean, and fully documented binary formats which are relatively easy to parse (e.g. msgpack, CBOR, BSON).

You might still not know what the parsed things mean, but that also applies to text formats (including random binary blob fields; thanks to base64 they fit into any text format).
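As a quick illustration of that first point, round-tripping one of those formats takes a couple of lines; a minimal sketch using the msgpack Python package:

    # A clean, documented binary format is easy to round-trip;
    # being binary is not what makes a format hard to parse.
    # Assumes `pip install msgpack`.
    import msgpack

    payload = {"style": "Para", "runs": [{"text": "bold text", "bold": True}]}
    blob = msgpack.packb(payload)          # encode to compact binary
    assert msgpack.unpackb(blob) == payload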

folbec · 9h ago
Exactly. There is no need for nefarious intentions when time constraints and mild incompetence suffice.

The OOXML format is likely a not-very-deeply-thought-out XML serialization of the in-memory structure, or of the old binary format, done under time pressure (there was legal pressure on Microsoft at the time).

dathinab · 7h ago
> The OOXML format is likely a not very deeply thought out XML serialization of the in memory structure or of the old binary format

It somewhat looks like that, but that old binary format changed with every few major yearly versions, and IMHO it doesn't look far off from a lightly serialized dump of their internal app data structures ;)

But putting aside that they initially managed to implement their own OOXML standard incorrectly, and the mess that "accident" caused:

they also supported import and even export (with limited features) of the OpenDocument format before even fully supporting OOXML, and even used it as the default save option (when editing such a document)...

So there really was no technical reason why they couldn't just have adopted the OpenDocument format, maybe at worst with some "custom" (but open, and "standardized" by MS itself) extensions to it.

MS at the time had every incentive to comply in as bad faith as they could get away with,

and what we saw at the time looked exactly like that,

sure, hidden behind "accidents" and incompetence.

But let's be honest: if a company has every interest and incentive to make something in bad faith and have it fail absurdly, and then exactly that happens, it's very naive to assume it was actually accidental. Most likely it wasn't.

That doesn't mean any programmer sat down and intentionally thought about how to make it extra complicated; there is no need for that, and it would just be a liability. Instead you make bad management decisions: resource-starve the team responsible (especially keep your best seniors away), give them messed-up deadlines, give them messed-up requirements you know can't work out, mess up communication channels, give them only bad tooling for the task, etc. etc. Funniest of all, given how messy software production often is, the engineers involved might not even notice ;) which means no liability on that side either.

redeeman · 1h ago
And they should have gotten the corporate death penalty for it. I think it should still be done; the sheer amount of crap Microsoft has purposely bestowed upon the world should lead to life in prison for many of its decision makers.
dathinab · 7h ago
> They didn't have the luxury to first come up with a clean file format and then write the tools around it.

This is just not right.

They were not required to provide (AFAIK), and in some edge cases also didn't provide, a perfect conversion of all old documents to the open format. Actually, even just converting between different versions of their proprietary formats had a tendency to break things sometimes (back then)!

> unbelievable XML hype that existed for a short time in the late 1990s and early 2000s.

(EDIT: actually 2006, so, uh, maybe XML hype.) We're talking about ~2010; the hype was pretty dead again by that time, and the main reason they chose it was to position it as "competition" to the emerging standardized open office document formats, which all used XML as their markup language. (Except OOXML doesn't really use XML as a markup language; it's more like a serialization, as if to JSON but way more complex. But that doesn't matter: they mostly needed to convince not-super-tech-savvy people that they were "no longer trying to hamper competition", to preclude legislative action and to keep governments from switching to other office suites out of worry about the closed format.)

So they were more than able to:

- do a clean design; if a lot of old "proprietary" documents break subtly when converting anyway, it doesn't matter (and they did break)

- just adopt the OpenDocument format

Neil44 · 12h ago
This format of XML in a zip with a .docx extension came into existence in Office 2007.
mickeyp · 12h ago
Sorry, but XML is a good fit for this. Most people who've never used XML cannot fathom that it actually does a number of things well.

Being able to layer markup with text before, inside, and after elements is especially important --- as anyone with HTML knowledge should know. Being able to namespace things so that, you know, the OLE widget you pulled into your document continues to work? Even more important. And that third-party compiled plugin your company uses for some obscure thing? Guess what: its metadata gets correctly embedded and saved too, in a way that is forward- and backward-compatible with tooling that does not have said plugin installed.

So no, it wasn't 'hype'.
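That first point (mixed content) is exactly what JSON-shaped data models handle poorly. A small illustration with Python's standard-library ElementTree, where the text around a child element lives in .text and .tail:

    # XML mixed content: text before, inside, and after an element.
    import xml.etree.ElementTree as ET

    p = ET.fromstring("<p>This is some <b>bold text</b> in a paragraph.</p>")
    b = p.find("b")
    print(repr(p.text))  # 'This is some '     (text before the child element)
    print(repr(b.text))  # 'bold text'         (text inside it)
    print(repr(b.tail))  # ' in a paragraph.'  (text after it, still owned by <p>)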

Hnrobert42 · 10h ago
There are good use cases for XML.

There was also huge hype. XML databases, anyone? XML is now an also-ran next to JSON, YAML, and Markdown. At the time, it was XML all the things!

mcswell · 2h ago
I guess this is not directly related, but I worked on a modified DocBook XML schema for some years. We were writing grammars of natural languages, so we had no use for some of the DocBook constructs, and needed to add others. That wasn't hard. And there are WYSIWYM (What You See Is What You Mean) editors, like XMLmind, which read the schema and helped you create conforming documents.

There are at least two ways to get from such an XML document to a PDF; we used pdfLaTeX, modified to handle our extra constructs, and then XeLaTeX.

I won't say it was a simple toolpath, but it allowed us to do at least two things that would have been difficult with Word or OpenOffice:

(1) It gave us an archival XML format, which will probably be readable and understandable for centuries. For grammars of endangered languages, that's important, because the languages won't be around more than a couple decades.

(2) It gave us the ability to cleanly typeset documents that had multiple scripts (including both Roman and various right-to-left scripts, like Arabic and Thaana).

wvenable · 17h ago
> Unfortunately, while an XML schema can be simple, it can also be unnecessarily complex, bloated, convoluted and difficult to implement without specific knowledge of its features.

One could now use that exact sentence to describe the most popular open document format of all: HTML and CSS.

masa331 · 15h ago
Can you be more specific here? HTML and CSS can't be described like that, in my opinion.

It is complex but not complicated. You can start with just a few small parts and get to a usable, clean document within hours of first contact with the languages. The tags and rules are usually quite self-describing while concise, and there are tons and tons of good docs and tools. The development of the standards is also open, and you can peek there if you want to understand decisions and rationales.

yegle · 14h ago
You could say the existing browser vendors pushed to make the HTML standard more complicated to the point that there's no chance for a newcomer to compete with the existing ones.
Voultapher · 9h ago
Ladybird would like a word.

Though I agree that the web standards are extremely large. Not sure if they are too large, given their cross-platform near OS layer functionality.

alterom · 14h ago
It's not about making a document.

It's about making software that would display a document in that format correctly.

I.e., a browser.

perching_aix · 14h ago
The current HTML spec alone is a 1000+ page PDF, and I can't imagine the CSS spec being much shorter.

Wordsmithing your way around this doesn't make them any easier.

acdha · 10h ago
Sure, technical documents are long, but that still doesn't support the original claim that they are "unnecessarily complex, bloated, convoluted", and it's actually evidence against the assertion that they're "difficult to implement without specific knowledge of its features": most of why those documents are long is that they carefully detail how necessarily complex systems interact, in sufficient detail to implement them. The Office XML specs, at least historically, instead had things like flags telling you to behave like, say, Word 95, without fully specifying the behaviour in question.
perching_aix · 10h ago
The original claim was clearly just an opinion; I don't think there's merit in treating it as a series of logical statements, or in analyzing it at intricate depth in general.

Evidence for this is in the very words used: unnecessary, complex, bloated, convoluted. These are very human terms, and thus subject to personal interpretation and opinion.

It shouldn't be surprising, then, that their "claim" fails scrutiny. All they actually meant to say is that HTML and CSS are both verbose standards with a lot of particularities. Still something subjective, but I think page / word / character counts are pretty agreeable attributes for estimating this in an objective way. Hence why I brought those up.

acdha · 10h ago
Of course it’s an opinion: the point is that it’s neither persuasive nor internally inconsistent. They haven’t given any reason to believe they have enough domain knowledge to compare the two authoritatively. It’s also inconsistent to criticize OOXML for being difficult to implement without extra knowledge and then to criticize a truly open spec for being detailed enough to implement without extra – the entire HTML5 process was intended to reduce the number of cases where people were relying on things which required implementers to know how a specific engine like IE worked.
masa331 · 11h ago
Sure, the spec might be enormous, but you don't need to touch it at all to be productive quickly. No HTML or CSS tutorial I've ever seen referenced the spec, nor did I ever need to go there to solve something. That in itself is further proof of how nicely it's actually designed. On the other hand, there are document types and schemas where you absolutely have to go to the spec, because they're so cryptic, badly designed, and non-self-explaining that there is nothing else you can do.
perching_aix · 11h ago
HTML and CSS tutorials are for people authoring HTML and CSS documents, not for people authoring HTML and CSS parsers and renderers.
72deluxe · 10h ago
Since HTML is valid XML, it really is perfectly acceptable to say it's the same!
mdaniel · 30m ago

  <p>I don't think that's true.<br>
  Perhaps you're thinking of xhtml?
Observe the lack of a closing p tag, to say nothing of the multiple self-closing tags in HTML: hr, img, link, meta, ...

https://html.spec.whatwg.org/multipage/grouping-content.html...

leonewton253 · 14h ago
Yeah, but those are open standards, whereas Microsoft is the only one with true knowledge of its XML.
wvenable · 1h ago
You know you're referring to ECMA-376 and ISO/IEC 29500?
mdaniel · 25m ago
In OP's defense, is there a freely available reference implementation of that standard? I know that LibreOffice certainly tries, but I'd guess theirs is closer to a reverse-engineered implementation than a reference one.
dathinab · 6h ago
Yes, HTML and CSS have gone unhinged, without question.

A big reason for that is that they were not designed for modern requirements, like being used as a general-purpose application UI toolkit.

CSS especially was designed for printable documents, not modern websites,

and HTML was designed to represent the core semantic structure of a "classical" document (and not a too-fancy one, either) with minimal formatting (e.g. bold, italic, underline). But even on old websites it was very common for it to not be used like that at all (e.g. think of the old whole-page-table trick to create headers and sidebars, now doable more nicely with HTML5/modern CSS).

So it's kind of a markup language and a style language chosen in the very early internet days, only to realize shortly afterwards that websites were developing in a direction very mismatched to the designs of both languages (but both happened to be squeezable into their new roles, barely).

Kinda funny. But not really the situation behind OOXML.

rullelito · 14h ago
This is similar in zero ways.
bob1029 · 15h ago
This is a comical perspective to me. I've been ass-deep in core banking APIs where we generate service references from WSDL/XSDs. Some of the resulting codegen measures in the tens of megabytes for some files. I wouldn't even attempt to quantify the number of pages of documentation. And this is just for the mid-size US banking domain. Microsoft Office has to work literally everywhere for everything. The fact that it's only 8,000 pages of documentation is likely a miracle.

If you're working with an XML schema that is served up in XSD format, using code gen is the best (only) path. I understand it's old and confusing to the new generation, but if you just do it the boomer way you can have the whole job done in like 15 minutes. Hand-coding to an XML interface would be like cutting a board with an unplugged circular saw.
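(In .NET land that's the classic `xsd.exe schema.xsd /classes`. For a rough sketch of the same schema-driven idea in Python, using the third-party xmlschema package; the file names here are hypothetical:)

    # Schema-driven decoding: let the XSD do the heavy lifting
    # instead of hand-walking the XML tree.
    # Assumes `pip install xmlschema`; file names are placeholders.
    import xmlschema

    schema = xmlschema.XMLSchema("service.xsd")
    data = schema.to_dict("response.xml")  # validates and decodes in one step
    print(data)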

deknos · 11h ago
It's not only about the XML itself, but that Microsoft really likes to change the standard any time open source catches up.

And most of the time they do not use their open standard, but the other document type.

The artificial vendor lock-in is real.

bob1029 · 11h ago
You can simply re-run your codegen on the newly published schema and review for compile-time errors.

We do this about once a quarter in the banking industry. It takes about an hour on average.

piker · 14h ago
While I generally agree, I don't think the author is complaining about the XML spec's complexity per se but rather that rendering the underlying structures to a page is hard.
perching_aix · 14h ago
Interfacing sounds like only half the battle, though? Like, I don't understand why this is a counter-argument.
bob1029 · 11h ago
As argued elsewhere in this thread there is essential complexity inherent to any problem that cannot be eliminated.

https://news.ycombinator.com/item?id=44613270

If you have this much complexity and there is nothing you can do to reduce it, then the next best thing is to have an incredibly convenient way to stand up a perfect client on the other side of the fence within a single business day.

perching_aix · 10h ago
It's not that I agree with the characterization in the OP, that these formats are deliberately obtuse. It's that I do agree about them being obtuse, and that being able to say that you can auto-generate bindings for them doesn't actually help to make them not obtuse.

I do also think that Office should have created separate formats for project files and export files; if an RTF can hold onto all the formatting details of a typical Word document sufficient for pixel-accurately rendering it for example, then they should have conveyed that better and promoted it as the default export format (along with the idea of an export format), rather than immediately hitting people with a popup that claims their data will be partially lost. If this does exist (just not as an RTF), this point still stands - I don't use it, nobody I know uses it, so it may as well not exist.

Current state of affairs is people passing around docx, xlsx, etc. files, which are project files, hence why they (have to) contain (fancifully) serialized application state. Imagine if people passed around PSDs rather than PNGs. Or if people passed around FLPs rather than WAVs, FLACs or MP3s. It's this separation between the features of a document / spreadsheet / presentation and the features of the authoring software that appears to be completely absent from Microsoft Office, and this is something that just based on the information I have available, MS can legitimately be faulted for. Transitioning from a bespoke binary format to an XML based format with schemas available did basically nothing to help this.

And while it might seem like I'm suggesting that export formats are these cleanly definable, self-evident things, I don't mean to suggest that either. It would have had to be a business decision. Where to draw the line is a decision that apparently never came to be debated internally, from what anyone can tell in retrospect at least, from the outside.

jajko · 14h ago
Yeah, another b(w)anker dev here. Complex XSDs seem to be the baseline in the industry as soon as the spec's role escapes the simple 1 server : 1 client use case.

One example I sometimes work with is almost 1MB of XSDs, and that's a rather small internal data tool. They even have a RESTful JSON variant, but it's not used much, and the complexity is roughly the same (you escape namespace hell, XML character escaping, etc., but the tooling around JSON is a bit less evolved). An XML-to-object mapping tool is a must.

markus_zhang · 10h ago
I think the lock-in is more about MSFT's contracts with schools, governments, and corporations. I wish they'd break the large corporations to pieces.
donatj · 7h ago
Mind you, Microsoft already had an earlier very capable XML spreadsheet format that was much easier to parse, SpreadsheetML.

Back in the early 2000's I wrote readers and writers for it and made pretty heavy use of the format at my job at the time.

The biggest problem with SpreadsheetML was that it expected the extension to be .XML. Microsoft had some sort of magic that would still associate the files with Excel on Windows, but it wasn't super reliable. We started using .xls, but after an update Excel started barking about files with the wrong extension.

https://en.wikipedia.org/wiki/SpreadsheetML

praseodym · 7h ago
They had a few more variants in the same era, such as WordProcessingML for Word: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
khelavastr · 17h ago
Does this person not understand XML serializers..?
yftsui · 17h ago
Exactly, there is no data provided on why the author believes it is "too complex"; it's just one random person ranting.
ranger_danger · 17h ago
You're not wrong, but it's funny that this same topic was posted just earlier today with a very different sentiment in the comments.

https://news.ycombinator.com/item?id=44606646

But if you dig hard enough, there's actually links to more evidence of why it is that complicated... so I don't think it was necessarily intentionally done as a method of lock-in, but where's the outrage in that? /s

"Complicated file format has legitimate reasons for being complicated" just doesn't have the same ring to it as a sensationalized accusation with no proof.

dathinab · 6h ago
You can have both:

a lot of intent to keep it complicated, cause vendor lock-in, and comply in bad faith,

with all of that being very easy to achieve just by not trying to improve on the status quo and by creating a standard where you are the only one who decides what goes in where. Or by other simple things, like intentionally assigning a senior engineer you know tends to painfully overengineer things while keeping them in a working state, etc. etc. Just by management decisions made at a level above the project, you can pretty reliably mess things up in whatever ways are needed, as long as you have enough people to choose from.

mjevans · 16h ago
Weren't those reasons effectively...

'special-case everything we ever used to do in Office so everything renders exactly the same'

...instead of offering some suitable placebo for properly rendering into a new format ONCE, with those specific quirks fixed in place?

lozenge · 14h ago
What would that look like?

"You have opened your Word 97 document in Office 2003. The quirks have been removed, so it might look different now. Check every page before saving as docx."

"You have pasted from a Word 97 document into an Office 2003 OOXML document. Some things will not work."

ranger_danger · 7h ago
If you want a look at how this (doesn't) work in practice, just look at LibreOffice. I once made 3 hours' worth of comments on a Word doc, and as soon as I saved it, they all vanished into thin air.

And for the pedantic: yes, it warns you when saving as a .docx that "not all features are supported", but it does that every time, for every document, so nobody pays attention to it or has any idea what it even means. To me, the way it handles this is just completely unacceptable.

dathinab · 6h ago
They probably do; parsing the XML syntax was never the complexity issue with OOXML, or with XML formats in general.

Everything on top of the XML AST is the issue.

constantcrying · 13h ago
Do you not understand that there is a difference between parsing something and implementing a specification? These are totally separate things.

Obviously parsing the XML is trivial. What is not trivial is what you do with parsed XML and what the parsed structure represents.

pessimizer · 15h ago
I have no idea what you mean to express by this. I've never met an XML or SOAP truther, but are you really saying that because XML can be serialized, it's impossible for an XML schema to be artificially complex?

What is it about serializing XML that would optimize the expression of a data model?

piker · 14h ago
This is a dupe from: https://news.ycombinator.com/item?id=44606646 but I'll repeat what I said over there.

I feel qualified to opine on this as both a former power user of Word and someone building a word processor for lawyers from scratch [1]. I've spent hours poring over both the .doc and OOXML specs and implementing them. There's a pretty obvious journey visible in those specs: from 1984, when computers were underpowered with RAM rounding to zero, through the '00s, when XML was the hot idea, to today, when MSFT wants everyone on the cloud for life. Unlike, say, an IDE or generic text editor, where developers are excited to work on and dogfood the product via self-hosting, word processors are kind of boring and require separate testing/QA.

It's not "artificial", it's just complex.

MSFT has the deep pockets to fund that development and testing/QA. LibreOffice doesn't.

The business model is just screaming that GPL'd LibreOffice is toast.

[1] Plug: https://tritium.legal

dathinab · 6h ago
> The business model is just screaming that GPL'd LibreOffice is toast.

Or MS might find itself accidentally toasting itself.

A lot of places (including very important MS Office customers) insist on an open document format, for various reasons.

If MS convinces people that LibreOffice and the like are toast because they can't afford to keep pace with the format in question, it being too expensive, they might also end up convincing those customers that it's too expensive for _them_ too, and that they should try to find a way to switch away from MS Office.

unyttigfjelltol · 12h ago
LO is at least as functional as some other market-leading SaaS word processors. LO could spin their product into a cloud application and not at all be "toast", because people in separate walled gardens no longer expect interoperability.

As for complexity, an illustration: while using M365 I recently was confounded by a stretch of text that had background highlighting that was neither highlight markup nor paragraph or style formatting. An AI turned me on to an obscure dialog for background shading at the text level, which explained the mystery. I've been a sophisticated user of M365 for decades and never encountered such a thing, nor have a clear idea of why anyone would use text-level background formatting in preference to the more obvious choices. Yet there it is. With that kind of complexity and obscurity in the actual product, it's inevitable that the file format would be convoluted and complex.

piker · 12h ago
Agreed, but the point the author is missing is that the complexity doesn't exist due to deliberate corporate lock-in, but because the product is 40 years old and has had 10-11 ways to do just about everything it does. Unfortunately, as your case illustrates, there are still documents in the wild that depend on these legacy features. So to render with 100% fidelity, you end up in a sprawling web of complexity. Microsoft can afford to navigate that web (and already owns it). It's nigh impossible for an open-source product to do so.
cranberryturkey · 15h ago
The post is essentially reminding people that XML doesn't magically equal openness. A schema can be "unnecessarily complex, bloated, convoluted and difficult to implement", and in the case of Office 365 the spec runs to "over 8,000 pages" and uses deeply nested tags, overloaded elements and wildcards. The result is that only the vendor can feasibly implement it, which eliminates third‑party implementations and lets the vendor dictate terms. The rail‑control analogy in the article makes the point well.

What isn’t acknowledged is that a lot of that complexity isn’t purely malicious. OOXML had to capture decades of WordPerfect/Office binary formats, include every oddball feature ever shipped, and satisfy both backwards‑compatibility and ISO standardisation. A comprehensive schema will inevitably have “dozens or even hundreds of optional or overloaded elements” and long type hierarchies. That’s one reason why the spec is huge. Likewise, there’s a difference between a complicated but documented standard and a closed format—OOXML is published (you can go and download those 8 000 pages), and the parts of it that matter for basic interoperability are quite small compared with the full kitchen‑sink spec.

That doesn’t mean the criticism is wrong. The sheer size and complexity of OOXML mean that few free‑software developers can afford to implement more than a tiny subset. When the bar is that high, the practical effect is the same as lock‑in. For simple document exchange, OpenDocument is significantly leaner and easier to work with, and interoperability bodies like the EU have been encouraging governments to use it for years. The takeaway for anyone designing document formats today should be the same as the article’s closing line: complexity imprisons people; simplicity and clarity set them free.

mrweasel · 13h ago
The complaint that OOXML was overly complex was a criticism when Microsoft first introduced the format, but as you point out, it needed to be able to handle decades of old formatting rules even back then. While I'm sure there is stuff in the format that Microsoft made needlessly complex, one has to remember that they still need to be able to maintain the code, so throwing in too many roadblocks for open source developers would likely come back to haunt them. Still, we know they did just that with SMB, so why not with OOXML.

What surprises me is how well LibreOffice handles various file formats, not just OOXML. In some cases LibreOffice has the absolute best support for abandoned file formats. I'm not the one maintaining them, so it's easy enough for me to say "See, you managed just fine". It must be especially frustrating when you have the OpenDocument format, which does effectively the same thing, only simpler.

grahameb · 10h ago
A friend had a book she'd written in a Mac version of Word from the early 90s; none of the current Microsoft versions of Word (Windows, Mac, web) would read it, but LibreOffice worked fine. So a little script later, using LibreOffice's CLI tools, it was all converted, pretty much intact.
charcircuit · 14h ago
> that few free‑software developers can afford to.

Considering how little most free software makes, they can't afford to do a lot of things. It's not a hard bar to hit.

catmanjan · 12h ago
Does software that produces files have an obligation to provide interoperability?
dathinab · 6h ago
No; only if you have a quasi-monopoly on office applications in pretty much every single (Western) government, across all departments and sectors.
happymellon · 12h ago
When they have a monopoly, places like the EU will frown on purposefully breaking compatibility.

It's called antitrust.

graemep · 11h ago
> When they have a monopoly, places like the EU will frown on purposefully breaking compatibility.

What exactly have they done about it?

piker · 9h ago
Without knowing too much of the EU history, I have always understood that antitrust pressure from the EU effectively forced Microsoft to publish the OOXML spec in the first place.
ranger_danger · 7h ago
> Purposefully

According to who? With what proof? And how/why do they get to be the arbiters of that?

another_twist · 15h ago
How hard would it be to generate a parser for this spec with AI codegen?
choeger · 15h ago
A parser is trivial. It's XML and you have a schema.

What you want is a compiler (e.g., into a different document format) or an interpreter (e.g., for running a search or a spell checker).

That's a task that's massively complicated, because you cannot give an LLM the semantic definition of the XML and of your target (both are typically underdocumented and underspecified). Without that information, the LLM would almost certainly generate an incomplete or broken implementation.
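To illustrate the gap: extracting the raw text from a .docx really is trivial, as in this minimal standard-library sketch (the file name is a placeholder):

    # A .docx is a zip archive; the main document part is word/document.xml.
    # Pulling out the text is easy; *interpreting* styles, layout,
    # and compatibility quirks is the actual hard part.
    import zipfile
    import xml.etree.ElementTree as ET

    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    with zipfile.ZipFile("example.docx") as z:
        root = ET.fromstring(z.read("word/document.xml"))

    for para in root.iter(W + "p"):  # every <w:p> paragraph
        print("".join(t.text or "" for t in para.iter(W + "t")))  # its <w:t> runs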

constantcrying · 13h ago
If a spec is "difficult to implement without specific knowledge of its features" it is ridiculous to assume an AI could do an adequate job.
kaleidawave · 10h ago
HTML?
sim7c00 · 7h ago
Well, essentially, considering desktop office apps are basically web browsers :')
danjc · 14h ago
So, basically the same as Adobe with PDF
jiggawatts · 15h ago
The opinion in the article misses something fundamental.

The complexity is not artificial, it is completely organic and natural.

It is incidental complexity born of decades of history, backwards compatibility, lip-service to openness, and regulatory compliance checkbox ticking. It wasn't purposefully added, it just happened.

Every large document-based application's file format is like this, no exceptions.

As a random example, Adobe Photoshop PSD files are famously horrific to parse, let alone interpret in any useful way. There are many, many other examples, I don't aim to single out any particular vendor.

All of this boils down to the simple fact that these file formats have no independent existence apart from their editor programs.

They're simply serialised application state, little better than memory-dumps. They encode every single feature the application has, directly. They must! Otherwise the feature states couldn't be saved. It's tautological. If it's in Word, Excel, PowerPoint, or any other Office app somewhere, it has to go into the files too.

There are layers and layers of this history and complex internal state that have to be represented in the file: everything from compatibility flags, OLE embedding, macros, external data sources, incremental saves, support for quirks of legacy printers that no longer exist, CMYK, document signing, document review notes, and on and on.

No extra complexity had to be added to the OOXML file formats; they're just a reflection of the complexity of the Microsoft Office applications.

Simplicity was never engineered into these file formats. If it had been, it would have been a tremendous extra effort for zero gain to Microsoft.

Don't blame Microsoft for this either, because other vendors did the exact same thing, for the exact same pragmatic reasons.

Ekaros · 12h ago
You might start with something simple, aiming for simplicity. Then you need to add more features. Eventually, after enough years, you will have lost the simplicity, because you have that many features to support.

You might not add features, but that is most likely a losing proposition against those competitors that do. Normal users generally each want some tiny subset of features, be it images, tables, internal links, comments, or versions.

jiggawatts · 10h ago
Everyone uses 10% of the features of complex software... it's just not the same 10%, which is why the other 90% needs to be in there and included in the file formats.

It's also not sufficient to find that "perfect" lean and mean application that happens to cover precisely the 10% that you need for yourself, because now you can't interchange content with other people that need different features!

I regularly open and edit Office documents created by others that utilise features I had never even heard of. I didn't know until very recently that PowerPoint has extensive animation support, or that Excel embeds Python, or that both it and Power BI can reach out to OData API endpoints to refresh data tables or even ingest Parquet directly.

You might not need that, but the guy that prepared the report for you needed it.

ranger_danger · 7h ago
100% agree... I think most people don't get this. People whine that a program doesn't use a "standardized" (read: popularized FOSS) format, but then dismiss logical rebuttals like it not supporting everything they need.

What do they expect people to do, remove features in order to support other formats? Users won't like that.

drewcoo · 11h ago
As opposed to the original binary format, designed to copy directly to the heap on restore?
skywhopper · 11h ago
While the critique is correct, the complexity is probably not "artificial". Rather, it directly reflects the internal, decades-old, complex architecture of the Office applications, and makes no attempt to be a schema actually useful for sharing between applications.

It only exists because Microsoft was desperate to avoid antitrust consequences for the dominance of Office 25 years ago.

scarface_74 · 12h ago
There has been third-party support for importing and exporting Office documents for as long as I can remember. It was part of Apple's File Exchange extension in 1994. No one is locked into Office because of file formats.
lcnielsen · 11h ago
> No one is locked into Office because of file formats

A lot of people are locked in because those import/export features are typically imperfect (or perhaps the documents themselves are) and will badly and often "invisibly" (to the non-Office user) break something.

scarface_74 · 11h ago
You could say the same about a web page or even Markdown…

But honestly these days, the only time I use Word is to keep my resume up to date once per quarter. That’s a really simple document.

fithisux · 14h ago
I have seen the same claim made about Bluetooth in the past.

I think this needs to end, and it is up to ordinary people to seek alternatives.

Apart from LibreOffice, we still have many other alternatives.

ddtaylor · 16h ago
Again?
pessimizer · 15h ago
Strange that this is getting traction again, and good on the people getting it out there. Saw something about "OOXML" make Google News the other day.

Having a debate about the quality of OOXML feels like a waste of time, though. This was all debated in public when Microsoft was making its proprietary products into national standards, and nobody on Microsoft's side debated the formats on the merits, because there obviously weren't any, except a dubious backwards-compatibility promise that was already being broken because MS Office couldn't even render OOXML properly. People trying to open old MS Office documents were advised to try OpenOffice.

They instead did the wise thing and just named themselves after their enemy ("Open Office? Well we have Office Open!"), offered massive discounts and giveaways to budget-strapped European countries for support, and directly suborned individual politicians.

Which means to me that it's potentially a winnable battle at some point in the future, but I don't know why now would be a better outcome than then. Maybe if you could trick MS into fighting with Google about it. Or just maybe, this latest media push is some submarine attempt by Google to start a new fight about file formats?

jongjong · 14h ago
Microsoft is using an artificially complex everything as a lock-in tool. I learned this many years ago when I learned how to create a window in C++, and it took around 100 lines of over-engineered code just to create an empty window on Windows.

Even TypeScript encourages artificial complexity of interfaces and creates lock-in; that's why Microsoft loves it. That's why they made its type system Turing-complete, and why they don't want TypeScript to be made backwards-compatible with JavaScript via the type annotations ECMAScript proposal. They want complex interfaces, and they want all these complex interfaces locked into their tsc compiler, which they control.

They love it when junior devs use obscure 'cutting edge' or 'enterprise grade' features of their APIs and disregard the benefits of simplicity and backwards compatibility.