Java Virtual Threads Ate My Memory: A Web Crawler's Tale of Speed vs. Memory (dariobalinzo.medium.com)

How can it take 3-4 months to get an eCommerce site back online? I assume you could redeploy everything from scratch in less time if you have source code and release assets. With backups and failover sites I can’t think of any world where this would happen?

paxys · 17h ago

It isn't surprising at all. There's a reason why tech companies have insanely large engineering teams even though it feels to an outsider (and inept management) that nobody is doing anything. It takes a lot of manpower and hours to keep a complex system working and up to date. Who validates the backups? Who writes the wikis? Who trains new hires? Who staffs all the on-call rotations? Who organizes disaster recovery drills? Who runs red team exercises? After the company has had repeated layoffs and fired, outsourced or otherwise pushed out all this "overhead" eventually there's no one remaining who actually understands how the system works. One small outage later, this is exactly the situation you end up in.

coliveira · 16h ago

Agreed, and that is a wonderful punishment to these companies.

phatfish · 14h ago

Yup, it turns out all those Indian contractors/outsourced staff don't really give a shit.

gosub100 · 12h ago

They're paid not to. Disagreement rocks the boat, they get fired and have 2 weeks to pack it up and fly home.

nebula8804 · 11h ago

And yet somehow Twitter plods along.

chownie · 2h ago

It turns out if you login-wall the site you can get away with a few catastrophic outages, people won't even remember them.

wiether · 3h ago

I never understood this Twitter thing.

- Everybody in their right mind agreed that, for what they were achieving, Twitter was completely over-staffed. Like most of the big tech co in this period. And like most of those co, they went through a leaning program with mass layoffs.

- If the service is running fine with only 10% of the staff, it doesn't necessarily mean that the 90% that got fired were useless. I can get a 6yo to heat their food using a microwave. Does it mean that the kid is a genius, or that the people who made the microwave did it in a way that allows a kid to operate it, even though it's a complex system at its core?

- Comparing Twitter to an international eCom website is disingenuous. If "design Twitter" is a common system design interview question, it's not because the website is popular, it's because the basics are quite simple. Whereas, behind an eCom website, there are dozens of parts running at any time, with hundreds of interoperability issues. You're not mainly relying on your main DB for your data, most of it is coming from external systems.

chatmasta · 11h ago

Sure, but for every efficiently run company, there’s another with 80% of its engineers working on a “new vision” with zero customers, while the revenue-generating software sits idle or attended by one or two developers…

And maybe this is intentional, rational strategy - why not reinvest profits in R&D? But just because an organization is large does not mean that it’s efficient.

ksec · 11h ago

Which means it is an opportunity for most of these to be SaaS and not internal. I wish Shopify could help them to migrate to their own.

CobrastanJorji · 17h ago

Yep. It takes way fewer people to operate a working system than to build a new one. And the nature of capitalism is that you will pare down your numbers until you have the absolute minimum staffing you need to keep the lights on. Then when everything explodes, you completely lack the know-how to fix it. Then the CEO yells at the tech executive, who responds by demanding hourly updates from the two junior devs who operate the site. Nobody wants to admit that they aren't capable of fixing it, and nobody's gonna OK a really expensive "we're gonna spend a month emergency building a new thing" plan, because a month is obviously way too much time and you need to fix it right now. And then three months go by and here you are.

chatmasta · 11h ago

A friend of a friend told me about an organization that has a steady income from existing products maintained by just enough engineers to keep the lights on, while the other 80% of the organization is building the “new version” that no customer asked for and that nobody is currently paying for. There’s one product that is used by more than 80% of customers that’s maintained by 2 developers and that the CEO isn’t aware even exists.

jemmyw · 10h ago

Ya I've been there. I even tried pitching to management that a small team of us wanted to move to the legacy product and iteratively improve it because it had customers and revenue and we could make an impact while the new product was under development. They said no. I left about 6 months after. 9 years later the legacy product is still running. I can't find any evidence that they launched a new one.

spacebanana7 · 16h ago

I get the opposite impression. Stale software organisations with steady operating products seem to use massive headcounts, whereas startups building new products often get by with relatively few people.

esseph · 16h ago

Startups don't have to run a software stack for decades, handle hardware refreshes, SKU updates and replatforms, or deal with multiple types of turnover and reorgs, knowledge transfer, etc.

Plus at least monthly if not daily, even hourly system patching.

Planting a garden is one thing.

Weeding it is another.

donnachangstein · 12h ago

> whereas startups building new products often get by with relatively few people

90% of startups fail within 5 years so probably not the best example of how to run things.

The few that do "succeed" often carry over mountains of cruft and garbage code into perpetuity (for example Reddit).

nebula8804 · 11h ago

There is way too much disrespect of true technologists. I get frustrated with right wing 'bros'(eg. Joe Rogan and other online right wingers) who benefit so much from what the technologists provide while at the same time calling them a bunch of 'dorks' because they don't believe the same stuff that he does or because they push for progressive ways of thinking (ie support of LGBTQ+).

These same people also claim that "Real America" does real work like running a farm whereas tech people get paid too much to do nothing all day. So many people don't understand how much work and effort it takes to keep all these online systems running. It's only when you've been humbled by these systems as a tech worker that you truly understand.

Tech CEOs who are 'in the know' have a way of treating technologists like crap as well. I go to DEFCON and other conferences often and it's crazy how little all these security researchers get paid for their countless hours of work.

People like Musk who are 'in the know' used to get kudos from places like DEFCON because they were willing to just drop their car in Automotive village with an employee or two and watch others do free work for them "for fun" or for tiny prizes.

Technologists empowered people like Musk and now that he is nearly invincible monetarily and security-wise, he shows his true colors.

I think the powers that be realized the disrupting powers of technologists early on and so they pushed nonsense like 'learn to code' to reduce their sway. Now the technologists are losing their power because there is enough desperation that someone will do the work for less money and less headache.

The only way back from this is if there is a giant collapse from the vibe coded disasters that AI might produce + enough of these 'learn to code' people have already exited the industry into something else. Then I could see some power returning to technologists.

levocardia · 11h ago

I'm sorry but if an enterprise team can't at least get a stopgap ecommerce site up and running in a week, what are you even doing? Literal amateurs can launch a WooCommerce site from nothing in a weekend; two Stanford grads in YC can do a hundred-fold better than that. Yes, a big site is more complicated, maybe there will be some frazzled manual data entry in Excel sheets while your team gets the "real" site back up, but this is total madness.

donnachangstein · 11h ago

> what are you even doing?

Forensics, among a hundred other things.

> Literal amateurs can launch a WooCommerce site from nothing in a weekend

Selling low-volume horseshit out of your garage is in no way comparable to running a major eCommerce site.

> two Stanford grads in YC can do a hundred-fold better than that.

No they literally can't.

> Yes, a big site is more complicated, maybe there will be some frazzled manual data entry in Excel sheets while your team gets the "real" site back up

Great idea, we'll have Chloe in Accounts manage all the orders in a million-row Excel sheet. Only problem might be they come in at 50 orders a minute, but don't worry I hear she's a fast typist.

didroe · 17h ago

How do you know it's safe to redeploy? If your entire operation may be compromised, how can you trust the code hasn't been modified, that some information the attackers have doesn't present a further threat, or that flaws that allowed the attack aren't still present in your services? It's a large company so likely has a mess of microservices and outsourced development where no-one really understands parts of it. Also, if they get compromised again it would be a PR disaster.

They're probably having to audit everything, invest a lot of effort in additional hardening, and re-architect things to try and minimise the impact of any future attack. And via some bureaucratic organisational structure/outsourcing contract.

ajb · 4h ago

You literally have some of your team buy new laptops and hang out in a temporary wework to set it up on entirely new infra, air-gapped from your ongoing forensic exercise. You just need to make sure none of the people you send are dumb enough to reuse their password. You need to take the domain name, but they will be using one of the high end domain companies so that can be handled.

Bear in mind that this is a company which still sells physically and has retail and warehouse staff. All that the e-commerce side needs to do is issue orders of what skus to send to what addresses, and pause items that are out of stock. M&S is not Amazon and doesn't have that many SKUs, 5 people could probably walk round the store in a few days and photograph all of them for the new shopping site.

Sure, customers will need to make a new account or buy as a guest. But this stuff is not hard on the technical side. There is no interaction between customers like a social media site, so horizontal scaling is easy.

Now I get that there are loads of refinements that go into maximising profit, like analytics, price optimization, etc. But to get in revenue these guys don't even need to set up advertising on day one because they have customers that have been buying from them for decades. The time to set up all that stuff is when your revenue is nonzero

Oras · 4h ago

I don’t think you realise how complicated e-commerce is for a company. You are thinking of a garage sale.

With each order:

- you need warehouse integration to keep the sync of physical to digital store. That has to happen fast or you’ll get orders with no stock.

- You need to sync the payment to whatever ancient accounting system they use, again while issuing invoices, consolidating customers … etc.

- Logistics management: where to get the order from, issuing a label, using the right fleet, making sure it is dispatched on time and arrives on time.

- Customer support, refunds, partial refunds, adding items after order … etc.

So yeah, 5 people!

ajb · 2h ago

I didn't say 5 people in total

prmoustache · 2h ago

> M&S is not Amazon and doesn't have that many SKUs, 5 people could probably walk round the store in a few days and photograph all of them for the new shopping site.

I can't speak about M&S, but all the big physical retail brands which started selling online are operating exactly like Amazon, with SKUs coming from various third party entities. The offering is much bigger than what is sold at the physical shops.

ajb · 2h ago

I had the impression that M&S wasn't, but if that's the case then yeah, that would invalidate my analysis. Especially if even their retail stock goes through that route when bought online.

cjs_ac · 16h ago

Your comment suggests that you're not familiar with the diversity in M&S' operation.

Marks and Spencers started as a department store; they still have this operation. They sell clothes, beauty products, cookware, homeware and furniture. All these things are sold in physical shops and online. Most of this is straightforward for an e-commerce operation, but the furniture will involve separate warehousing and delivery systems.

They also offer financial services (bank accounts, credit cards and insurance). These are white labelled products, but they are closely linked to their loyalty programme (the Sparks card).

Finally, they have their food operation: M&S is also a high-end supermarket. You can't do your food shop on the M&S website (although their food products are available from online-only supermarket Ocado), but you can order some food products (sandwich platters and party food) and fresh flowers from the website.

So M&S is a mid-tier department store and a high-end supermarket. These are very different styles of retail operation: supermarkets require a lot of data processing to ensure the right things get to the right shops at the right time to ensure that food doesn't go to waste but also shoppers aren't annoyed by the unavailability of staples like bread and milk.

Finally, M&S is traditionally fairly strong in customer service; it's not exactly Harrod's or Fortnum and Mason's, but their bra-fitting service, for example, has a legendary reputation. The internet isn't their natural home.

So all-in-all, you have a business doing complicated things online because they think they have to, not because they want to: a pretty clear recipe for disaster.

neepi · 14h ago

Their banking op is a fucking mess as well. Had no end of problems with their card services which were rebranded HSBC.

donnachangstein · 12h ago

HN posters love talking gangster shit when something goes offline but never walked a mile in their boots.

I most recently remember sifting through gloating that 4chan - a shoestring operation with basically no staff - was offline for a couple weeks after getting hacked.

I've worked at a shop that had DR procedures for EVERYTHING. The recovery time for non-critical infra was measured in months. There are only so many hands to go around, and stuff takes time to rebuild. And that's assuming you have procedures on file! Not to mention if there was a major compromise you need to perform forensics to make sure you kick the bad guys out and patch the hole so the same thing doesn't happen again a week after your magical recovery.

And if you don't know, you shut it down till it's deemed safe. How do you know the backups and failover sites aren't tainted? Nothing worse than running an e-commerce site processing customer payment card data when you know you're owned. That's a good way to get in deeper trouble.

pavel_lishin · 17h ago

> with backups and failover sites

What a fun pair of assumptions!

chatmasta · 17h ago

The Co-Op (grocery store chain) was hacked around the same time in likely the same incident. It took three weeks for them to get food back on the shelves at my local store. I don’t understand how that’s even possible… what happened to all the meat and vegetables in the supply chain? They just stopped flowing? They rotted? Why couldn’t they use pen and paper? It’s unbelievable to me that a business would go three weeks without stocking inventory.

tonyhart7 · 16h ago

You can say this because you're ignorant; stock inventory is really hard, especially in a huge warehouse where many items come and go 24/7.

They can "move" it of course, but who can guarantee how much goes out, from where, and to whom?

Pen and paper, when there are a thousand items in a single rack, is a nightmare, I can tell you that.

gosub100 · 12h ago

Don't call someone ignorant for asking a question. He said "I don't understand how". If you know the answer, answer. Don't call him ignorant.

chatmasta · 16h ago

well, apparently co-op couldn’t answer those questions with their computers because they got locked out of them…

glenjamin · 14h ago

I chatted to a staff member on the checkout of my local coop supermarket

She said that every shelf item is ordered on a JIT basis as the store stock levels require them - there are no standing orders to a store

Based on that, I presume they didn’t really know what any store would need

Even when they were struggling my local store still had a decent stock of lots of stuff - just some shelves were empty

bobthepanda · 14h ago

You could (and people did) run this in the pre-internet days with basically just phone calls and a desk to receive them. The problem is that by now this represents an incredible increase in manpower required overnight.

grues-dinner · 8h ago

And you need a process to follow. You can't just have nearly 4000 supermarkets ringing up HQ at random and reading out lists of 1000 items each. Then what? Back when a supermarket chain did operate like that, the processes were like "fill in form ABC in triplicate, forward two to department DEF for batching and then forward one to department GHI for supplier orders and they produce forms XYZ to send to department JKL for turning into orders for dispatch from warehouses". And so on and so on. You can't just magic up that entire infrastructure and knowledge even if you could get the warm bodies to implement it. Everyone who remembers how to operate a system like that is retired or has forgotten the details, all the forms were destroyed years ago and even the buildings with the phones and vacuum tubes and mail rooms don't exist.

Of course you could stand up a whole new system like that eventually, but you could also use the time to fix the computers and get back to business probably sooner.

But I imagine during those 3 weeks, there were a lot of phone calls, ad-hoc processes being invented and general chaos to get some minimal level of service limping along.

7952 · 2h ago

I agree, although it seems like a failure of imagination that this is so difficult. The staff will have a good understanding of what usually happens and what needs to happen. What they are lacking is some really basic things that are the natural monopoly of "the system".

Perhaps we need fallback systems that can rebuild some of that utility from scratch...

* A communication channel of last resort that can be bootstrapped. Like an emergency RCS messaging number that everyone is given or even a print/mailing service.

* A way to authenticate people getting in touch using photo ID, archived employee data or some kind of web of trust.

* A way to send messages to everyone using the RCS system.

* A way to commission printing, delivery and collection of printed forms.

* A bot that can guide people to enter data into a particular schema.

* An append only data store that records messages. A filtering and export layer on top of that.

* A way to give people access to an office suite outside of the normal MS/Google subscription.

* A reliable third party wifi/cell service that is detached from your infrastructure.

* A pool of admin people who can run OCR, do data entry.

Basically you onboard people onto an emergency system. And have some basic resources that let people communicate and start spreadsheets.
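To make the "append only data store" plus "filtering and export layer" idea concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the `Message` fields, the `AppendOnlyLog` class and the CSV columns are hypothetical, and a real fallback system would need durable storage and authentication on top.

```python
import csv
import io
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Message:
    """One immutable record in the log (hypothetical fields)."""
    seq: int
    sender: str
    store: str
    body: str
    received_at: float


class AppendOnlyLog:
    """Records can be appended and read, but never edited or deleted."""

    def __init__(self) -> None:
        self._records: List[Message] = []

    def append(self, sender: str, store: str, body: str) -> Message:
        # The sequence number is simply the position in the log.
        msg = Message(len(self._records), sender, store, body, time.time())
        self._records.append(msg)
        return msg

    def filter(self, predicate: Callable[[Message], bool]) -> Tuple[Message, ...]:
        # Return a read-only view of the matching records.
        return tuple(m for m in self._records if predicate(m))

    def export_csv(self, predicate: Callable[[Message], bool] = lambda m: True) -> str:
        # Export layer: dump the matching records as CSV text for a spreadsheet.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["seq", "sender", "store", "body", "received_at"])
        for m in self.filter(predicate):
            writer.writerow([m.seq, m.sender, m.store, m.body, m.received_at])
        return buf.getvalue()
```

Usage would look like `log.append("alice", "store-1", "need milk")` followed by `log.export_csv(lambda m: m.store == "store-1")`, which gives someone a file they can open in whatever office suite is still reachable.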

chatmasta · 7h ago

> Everyone who remembers how to operate a system like that is retired or has forgotten the details

Anyone who’s experienced the sudden emergence of middle management might feel otherwise :) please don’t teach those people the meaning of “triplicate,” they might try to apply it to next quarter’s Jira workflows…

grues-dinner · 7h ago

One day you'll find a sheet of carbon paper in the office laserjet and you'll know it's starting.

I wonder if we could negotiate a return to typewriters and paper if it means individual offices and a tea trolley?

chatmasta · 11h ago

I remember when I was a teenager working the register at a local store. The power went out one day, and we processed credit cards with a device that imprinted the embossed card number onto a paper for later reconciliation.

That wouldn’t work today for a number of reasons but it was cool to see that kind of backup plan in place.

phinnaeus · 10h ago

I’ve seen cc impression machines within the past 5 years in small town america

fredoralive · 16m ago

In the UK the credit / debit cards I've had issued in the last few years have been flat, with details just printed, so that level of manual processing is presumably defunct here.

chatmasta · 14h ago

In my case all the perishable shelves were empty - no fruit, no vegetables, no meat, no dairy. I checked every few days for multiple weeks and it wasn’t until three weeks after the incident I was able to buy chicken again.

It’s possible they were ordering some default level of stock and I just didn’t go at the right time to see it, but it sure looked like they were missing the inventory… when I first asked the lady “is the food missing because of the bank holiday?” and she said “no because of the cyber attack” I thought she was joking! It reminded me of the March 2020 shelves.

Henchman21 · 16h ago

You forget we have entered the “Who the fuck cares?” era. When no one in the chain is incentivized to care, things just fall apart.

chatmasta · 16h ago

Interestingly Co-Op is so-called because it’s a cooperative business, which vaguely means it’s owned by its employees, and technically means it’s a “Registered Society” [0].

If you check CompaniesHouse [1], which normally has all financial documents for UK corporations, it points you to a separate “Public Register” for the Co-Op [2].

So, your comment has more basis in reality than simply being snark… the fact that “nobody is incentivized to care” is actually by design. That has some positive benefits but in this case we’re seeing how it breaks down for the same reasons nobody in a crowd calls an ambulance for someone hurt… it’s the bystander effect applied to corporate governance with diluted accountability.

[0] https://www.gov.uk/hmrc-internal-manuals/company-taxation-ma...

[1] https://find-and-update.company-information.service.gov.uk/c...

[2] https://mutuals.fca.org.uk/Search/Society/7240

bonaldi · 13h ago

I’m not following your logic. The co-op is designed for everyone to care _more_ because they are part-owners and because the organisation is set up for a larger good than simple profit-making.

In practice the distinction has long been lost both for employees and members (customers), but the intent of the organisational structure was not for nobody to care; quite the opposite

chatmasta · 11h ago

But there are millions of part-owners. Every “member” of co-op (i.e. a customer in the same membership program that just lost all their data to this hack) is an owner of it. Maybe the employees get more “shares” but it’s not at all significant.

And at the executive governance level, there are a few dozen directors.

There is a CEO who makes £750k a year, so it has elements of traditional governance. I’m not saying the structure is entirely to blame for the slow reaction to the hack, or that there is zero accountability, but it’s certainly interesting to see the lack of urgency to restore business continuity.

My family used to own a local market, and as my dad said when I told him this story, “my father would have been on the farm killing the chickens himself if that’s what he had to do to ensure he had inventory to sell his customers.”

You simply won’t get that level of accountability in an organization with thousands of stakeholders. And a traditional for-profit corporation will have the same problems, but it will also have a stock price that starts tanking after half a quarter of empty shelves. The co-op is missing that sort of accountability mechanism.

Henchman21 · 11h ago

Responsibility diluted to the point of no actual responsibility?

chatmasta · 11h ago

Exactly, the bystander effect. But it’s not strictly due to the large size. Other big companies get hacked too. But if they have a stock price then there’s an obvious metric to indicate when the CEO needs to be fired. It’s the dilution of responsibility combined with a lack of measurable accountability that causes the dysfunction.

grues-dinner · 8h ago

The problem is that cutting IT and similar functions to the bone is really good for CEOs. It juices the profits in the short/mid term, the stock price goes up because investors just see line go up, money goes in, and the CEO gets plaudits. There's only one figure of merit: stock price. What you measure is what you get.

It's only much later that the wheels fall off and it all goes to hell. The hack isn't a result of the CEO's actions this quarter, it's years and years of cumulative stock price optimisation for which the CEO was rewarded.

And you can't even blame all the investors because many will be diluted and mixed through funds and pensions. Is Muriel to blame because her private pension, which everyone told her is good and responsible financial planning, invested in Co-Operative Group on the back of strong growth and "business optimisation initiatives"? Is she supposed to call up Legal and General and say "look I know 2% of my pension is invested in Co-Op Group Ltd and it's doing well, and yes I'm with you guys because you have good returns, but I'm concerned their supermarket division is outsourcing their IT too much, could you please reduce my returns for the next few years and invest in companies that make less money by doing the IT more correctly?"

The incentives are fucked from end to end.

Henchman21 · 14h ago

I guess this is more snark, but honestly I am genuinely shocked when people care about anything anymore. Sad times.

chatmasta · 14h ago

There is a serious crisis of competence and caring all throughout society and it is indeed frightening. It’s this nagging worry that never goes away, while little cracks keep appearing in the mechanisms we usually take for granted…

coliveira · 16h ago

When everything is done by computers, no human really knows what needs to be done even for a simple thing as buying vegetables.

TheOtherHobbes · 11h ago

Buying and distributing vegetables for stores is not remotely a simple thing. It includes statistical analysis with estimates of demand for every store, seasonal scheduling, weather awareness, complicated national and/or international logistics, plus accounting and payments.

Some or all of those may be broken during a cyberattack.

chatmasta · 6h ago

That’s a good point but perhaps you underestimate the ingenuity borne from constraints.

If you’ve got trucks arriving with meat that’s going to expire in a week, and all your stores have empty shelves, surely there is a system to get that meat into customer mouths before it expires. It could be as simple as asking each store, when they call (which they surely will), how much meat they ordered last week, and sending them the same this week. You could build out more complicated distribution mechanisms, but it should be enough to keep your goods from perishing until you manage to repair your digital crutch.
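A minimal sketch of that "same as last week" fallback, assuming the only data available is each store's order from the previous week and a count of what is on hand. The function name, the store names and the proportional scale-down when stock is short are all hypothetical choices for illustration, not anything Co-Op or M&S is known to do.

```python
from typing import Dict


def fallback_allocation(last_week: Dict[str, int], available: int) -> Dict[str, int]:
    """Send each store what it ordered last week, scaled down if stock is short."""
    total = sum(last_week.values())
    if total <= available:
        # Enough stock: repeat last week's orders unchanged.
        return dict(last_week)
    # Not enough stock: give every store the same fraction of its usual order,
    # rounding down so we never allocate more than is available.
    return {store: qty * available // total for store, qty in last_week.items()}
```

With last week's orders of 100 and 50 units and only 75 units on hand, the two stores get 50 and 25 units. Crude, but it keeps perishables moving while the real forecasting system is down.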

7952 · 2h ago

The suppliers will know and be able to predict what a large customer like M&S is likely to order. They will probably be preparing items before they are even ordered. And surely there must be some kind of understanding of what a typical store will receive.

wrs · 16h ago

“If you have source code and release assets.” And a build process that works from a clean code base. And a deploy process that works on fresh servers.

All of which assumes you even know what services exist, which in any company of this age and size you probably don’t.

cameronh90 · 12h ago

The British Library still aren't fully back up and running after their cyberattack in Oct 2023: https://www.bl.uk/cyber-incident/

kelnos · 11h ago

I'm not that surprised, though 3-4 months does feel like a long time.

When I was at early Twilio (2011? 2012? ish), we would completely tear down our dev and staging environments every month (quarter? can't remember), and build them back up from scratch. That was everything, including databases (which would get restored from backup during the re-bring-up) and even the deployment infrastructure itself.

At that point we were still pretty small and didn't have a ton of services. Just bringing my product (Twilio Client) back up, plus some of the underlying voice services, took about 24 hours (spread across a few days). And the bits I handled were a) a small part of the whole, and b) some of the easier parts to bring up.

We stopped doing those teardowns sometime later in 2012, or perhaps 2013, because they started taking way too much time away from doing Actual Work. People can't get things done when the staging environment is down for more than a week. Over the following 10 years or so, Twilio's backend exploded in complexity, number of services, and the dependencies between those services.

I left Twilio in early 2022, and I wouldn't have been surprised if it would have taken several months to bring up Twilio (prod) from scratch at that point, though in their case it would be a situation where some products and features would be available earlier than others, so it's not really the same as an e-commerce site. And that was when I left; I'm sure complexity has increased further in the past 3 years.

Also consider that institutional knowledge matters too. I would guess that for all the services running at Twilio, the people who first brought up many (most?) of them are long gone. So I wouldn't be surprised if the people at M&S right now just have no idea how to bring up an e-commerce site like theirs from scratch, and have to learn as they go.

tw04 · 9h ago

So you haven’t dealt with ransomware gangs yet? Because they have gotten sophisticated enough to nuke source code repos and backups and replicated copies.

It’s part of the reason tape is literally never going to die for organizations with data that simply cannot be lost, regardless of RTO.

dylan604 · 13h ago

For this particular audience, it's one of those things that could be rewritten in Rust over a weekend and then deployed on the cheap via Hetzner. At least then it'll be memory safe!

briffle · 13h ago

of course, if you redeployed everything from the source code, you could very well still have the same vulnerabilities that caused the problem in the first place..

internetter · 17h ago

There are no backups. There are no failovers. There is no git. There are no orchestration or deployment strategies. Programmers ssh into the server and edit code there. Years and years of patchwork on top of patchwork with closely coupled code.

Such is a taste of what needs to be done if you wish to have a service that takes months to set back up after any disruption.

squiffsquiff · 15h ago

This is an ignorant position. Look at e.g. https://engineering.marksandspencer.com/mobile/2024/09/05/re...

throwawaymgb123 · 16h ago

This is a perfect description of how things work at one of the largest health care networks in the northeast US (speaking as someone who works there and keeps saying "where's the automation? where are the procedures?" and keeps being told to shut up, we don't have TIME for that sort of thing).

internetter · 16h ago

lol the healthcare industry was definitely in my mind as I wrote this. Never worked there but I read a lot of postmortems and it shows whenever I use their digital products. Recent example is CVS.

Somehow, at some point, they decided that my CVS pharmacy account should be linked to my Mom's Extracare. Couldn't find any menu to fix it online. So the next time I went to the register I asked to update it. They read the linked phone number. It was mine. Ok, it is fixed, I think. But then the receipt prints out and it is my mom's Extracare card number. So the next time I press harder. I ask them to read me the card number they have linked from their screen. They read my card number. Ok, it is fixed, I think. But then the receipt prints out and the card number is different: it is my mom's. Then I know the system is incredibly fucked. Being an engineer, I think about how this could happen. I'm guessing there are a hundred database fields where the Extracare number is stored, and only one is set to my mom's or something. I poke around the CVS website and find countless different portals made with clearly different frameworks and design practices. Then I know all of CVS's tech looks like this and a disaster is waiting to happen.

Goes like this for a lot of finance as well.

E.g. I can say with confidence that Equifax is still as scuffed as it was back in 2017 when it was hacked. That is a story for another time.

Nobody bothers to keep things clean until it is too late. The features you deliver give promotions, not the potential catastrophes you prevent. Humans have a tendency to be so short sighted, chasing endless earnings beats without anticipating future problems.

aspenmayer · 16h ago

If you don't have time to prepare for failure, then you'll have little time to invest in success, either, if/when failure strikes.

98codes · 17h ago

[citation needed]

internetter · 17h ago

Sorry if I phrased it poorly. I wasn’t definitively saying that all these things are the case. But what always is the case is that when an attack takes down an organization for months, it was employing a tremendous number of horrendous practices. My list was supposed to be some.

M&S isn’t down for months because of something innocuous like a full security audit. As a public company losing tens of millions of dollars a week, their only priority is to stop the bleed, even if that means a hasty partial restoration. The fact they can’t even do that suggests they did stuff terribly wrong. There’s an infinite amount of things I didn’t list that could also be the case. Like if Amazon gave them proprietary blobs they lost after the attack and Amazon won’t provide again. But no matter what they are, things were wrong beyond belief. That is a given.

pavel_lishin · 17h ago

To be fair, I would bet that nearly every organization employs a tremendous number of horrendous practices. We only gasp at the ones who get taken down for some reason.

internetter · 16h ago

Horrendous practices exist on a spectrum. Every org has bad code that somebody will fix someday™. It is reasonable to expect that after a catastrophic event like this, a full recovery takes some time. But at a "good" org, these practices are isolated. Not every org is entirely held together with masking tape. For the entire thing to be down for so long, the bad practices need to be widespread, seeping into every corner of the product. Ubiquitous.

For instance, when all of Cloudflare went down a while ago due to a bad regex, it took less than an hour to roll back the changes. Undoubtedly there were bad practices that led to a regex having the ability to take everything out, but the problem was isolatable, and once it was addressed partial service was quickly restored, with preventative measures employed shortly after. This bug didn't destroy Cloudflare for months.

P.S. In anticipation of the "but cloudflare has SLAs!!" reply: that isn't really a distinction worth making, because M&S has an implicit SLA with their customers: they are losing 40 million each week they can't offer service. Plenty of non-B2B companies invest in quick recovery as well, as with Netflix's monkey testing.

PaulHoule · 16h ago

No, best practice is that you have a checklist to bring up a copy of your system, better yet that checklist is "run a script". In the cloud age you ought to be able to bring a copy up in a new zone with a repeatable procedure.

Makes a big difference in developer quality of life and improves productivity right away. If you onboard a new dev you give them a checklist and they are up and running that day.

I had a coworker who taught me a lot about sysadmining, (social) networking, and vendor management. She told me that you'd better have your backup procedures tested. One time we were doing a software upgrade and I screwed up and dropped the Oracle database for a production system. She had a mirror in place so we had less than a minute of downtime.

softwaredoug · 17h ago

At the same time we’re talking about AI replacing developers we also see cases like this of organizational technical incompetency.

How does one square those two realities?

paxys · 17h ago

99% of "AI" talk in the public is for the sole purpose of making wall street happy to boost stock price and/or pump private valuations of AI startups. The reality on the ground is very different. CEOs are bragging about replacing senior software engineers with AI meanwhile their recruiters and hiring managers are desperately advertising $300-500K/yr jobs for these same engineers while still not being able to hire enough of them because of high demand.

barbazoo · 14h ago

I honestly doubt that there is any overlap between the "$300-500K/yr" jobs and the jobs being replaced by AI.

bobthepanda · 14h ago

also at least some of the businesses that were doing this are now being run into the ground, like Klarna.

AnotherGoodName · 17h ago

Well we need to fix the business leadership problem asap. From the bio of the current M&S CEO. https://en.wikipedia.org/wiki/Stuart_Machin

>He resigned as managing director of Target in April 2016 because of accounting irregularities that he was unaware of but "happened on [his] watch".[4] He then became the chief executive of Steinhoff International.[4] (which seemed to have a lot of issues too https://en.wikipedia.org/wiki/Steinhoff_International#Debt_p...)

Foresight to mitigate potential major issues is exactly what CEOs are expected to do. I'm not sure how being unaware of major account irregularities is not seen as a career ending move here.

AI replacing CEOs seems straightforward as well. Accounting is such a data-driven environment that I think spotting account irregularities early would be straightforward. Likewise AI has the potential to think past the short-term thinking that leads to IT outsourcing (to the extent the store is not coming back online anytime soon!).

e2le · 16h ago

>AI replacing CEOs seems straightforward as well.

I'm not sure I want AI replacing all CEOs; ideally it would raise the bar for quality and performance, forcing human CEOs to compete.

bradly · 17h ago

> How does one square those two realities?

People eat terrible food because they are bombarded with messages to do so. People can use terrible software for the same reasons. It doesn't matter that the food tastes worse than it used to–food companies are having record profits.

imhoguy · 17h ago

We just need one event of a C*O of a critical/big company bragging about firing engineering and replacing it with AI, followed by a huge cyberattack like that. Then see how the AI balloon pops across news outlets.

chgs · 2h ago

The market rewards failure. Look at crowdstrike.

You aren’t buying technical service or even technical assurance. You are buying someone to blame so the stakeholders don’t hold you accountable.

coliveira · 16h ago

They'll respond saying they need to invest the entire revenue of the company on new data centers to fix the issue, and the stock will double in price.

jacobsenscott · 17h ago

AI replacing devs talk is about short term stock pumping and short term COGS reduction. The long tail is someone else's problem.

umanwizard · 17h ago

I’m pretty sure devs are not usually counted as part of COGS.

tonyhart7 · 16h ago

wait until they release AI for security and system orchestration

nickdothutton · 18h ago

I don't believe most (pre-internet) retailers should be building and operating their own sites. They already run core supply chain, distribution, and certain other apps (e.g. rostering and so on, accounting and payroll), but they probably shouldn't even be running some of those either.

fredoralive · 17h ago

M&S tried that, Amazon used to run the website:

https://www.theguardian.com/technology/2005/apr/19/business....

But they eventually took control back, so it clearly didn't work for them:

https://www.theguardian.com/business/2014/feb/18/marks-spenc...

M&S orders still use the same ###-#######-####### order number format as Amazon, so I'm not sure if it's still some sort of fork of whatever white-label Amazon technology they were using back then.

I'm not sure if getting Amazon to run your own ecommerce website is really the greatest idea in the long term (Amazon kinda want your customers to use Amazon, not your website), but M&S using them isn't as mad as that bit in the early 2000s where Waterstone's website was just a subsection of Amazon.co.uk.

spacebanana7 · 17h ago

> I'm not sure if getting Amazon to run your own e-commerce website is really the greatest idea in the long term

Amazon has a clear conflict of interest with anyone in e-commerce. Shopify is probably a better example.

xp84 · 14h ago

Famously, Borders and Target both allowed Amazon to run their ecommerce operations on this side of the pond, until they realized what a bad idea it is to partner with your competitor on something important. Target can be forgiven, I suppose, as in those days Amazon was mainly a store for books, CDs, and DVDs. Unclear how Borders didn't see it coming, though!

fredoralive · 13h ago

The Waterstone’s example I gave is similar to Borders. If you're not from the UK you might not know it’s a bookshop, and they were basically just routing traffic to Amazon as a glorified affiliate link for a while.

youngtaff · 2h ago

M&S’ online store is outsourced to someone like Tata, Publicis Sapient etc and last time I looked was built on one of the commercially available platforms (DemandWare and the like)

Warehousing and delivery is probably contracted out to another third-party

My guess is it’s one of these that’s been hacked

ecshafer · 17h ago

This is the core thesis of a company like Shopify. Shopify will run everything else about being an e-commerce company (website, inventory, shipping, returns, ads, sales channels, etc.) and then the merchant can focus on selling their product. But this is part of the larger thesis about running a business that you hear in business school classes: focus on your specialization and outsource your non-core expertise. Don't do payroll or HR yourself, buy Workday/ADP/Paychex. Don't build a data center, buy AWS/Azure/GCP. Don't build a sales or marketing database, get Hubspot or Salesforce. Does your company take in a lot of mail? Outsource to a company that specializes in processing mail. Outsource your Technical Helpdesk. Outsource your customer support. This is why componentization is accelerating.

runako · 16h ago

> to focus on your specialization and outsource your non-core expertise

Most retailers will argue that connecting with their core customers and delivering delightful experiences to them is their core expertise.

More practically, it will be tension between things like "our marketing department wants X on the site for summer" and "Shopify is planning on launching X in January." It will be less of a resistance to using a third-party provider and more that the third-party provider imposes constraints on the mode of contact with customers. That's a hard pill to swallow for a lot of consumer-focused companies.

xp84 · 13h ago

> Most retailers will argue that connecting with their core customers and delivering delightful experiences to them is their core expertise.

Having worked in e-commerce for most of my career, for individual retailers, I can assure you that the perpetual tension you describe is real. The problem as I see it is, every little retailer thinks that their two-bit designers and product managers are so uniquely visionary in designing interactions that they rightfully should have full control over the product that is the ecommerce website. Shopify employs God-knows-how-many engineers to build and maintain this experience, and probably thousands of SREs to be there 24/7 making sure a random DDOS or slow query doesn't take your site out. "But we think we can build a better site than Shopify with 10 engineers and a couple of managers," they say.

They can build one that has the 3 cute whiz-bang features that their self-important product design staff thinks matter, but it will be unreliable, and they won't have sufficient expertise to get right the other 90% of what a "good" ecom site should have. And on top of it all, none of those gimmicks will likely improve conversion or order value enough to be worth doing.

The smarter ones IMHO do use Shopify. It lacks so many things in its core that it's infuriating (decent search, any nontrivial filtering), but retailers who use it mostly patch over those flaws with plugins sold by third parties (which often introduce ghastly single points of failure that you have no visibility into, and you can't sue some random plugin vendor you pay $50 a month for your site going down on Black Friday).

Ecommerce is hard tbh. But I do personally think that most of my previous employers probably should have done lightweight Shopify skins and made their core competence sourcing, merchandising, and advertising product rather than designing cute search filters, or their own product recommendations algorithm.

runako · 10h ago

Have done a few turns through e-commerce over the years, and I agree 100% with you.

That said, in the context of a Marks & Spencer-sized company (~$13B revenue), it absolutely can be a competitive advantage to in-house e-commerce if it is resourced & staffed appropriately. They are talking about a £300m hit to profit, so they appear to have some headroom for running a complex site.

Doing in-house gives the opportunity for a company to fix the kinds of things you mention with dodgy plugins etc. It also lets them take advantage of Doing Things Our Way, which sounds silly until you consider that Doing Things Our Way is how they got to be so big. And of course, in-house builds are still allowed to use off-the-shelf software where it makes sense.

Also DIY allows companies to adopt new stuff at their own pace. This tends to be important at times of tech transition like it appears we are now reentering.

Will reiterate that e-commerce is hard & there are really no easy answers.

madeofpalk · 16h ago

I guess the question is whether e-commerce should be a core competency of a business with a significant e-commerce business.

I’m not sure what it’s like in the US, but grocery delivery is a reasonably big deal in the UK.

neepi · 17h ago

Err they were breached most likely through Tata Consultancy's helpdesk apparently which is literally the people they outsourced it to.

Their approach was to sell the UK operation to Tata in 2018 and piss everyone off until they leave and replace them with Indian staff to save costs over time.

You get what you pay for. They're now paying for it.

dangus · 11h ago

Honestly, reading your comment I'm not sure how I can be generous enough with it to not consider the main purpose of it to be xenophobia/racism/nationalism.

As an example of Tata’s general competency, Tata owns the Jaguar Land Rover group, which reported its best profit in a decade for the fiscal year ending March 31.

It’s certainly very possible for any help desk to be an attack vector regardless of the nationality of the employee.

So it seems to me the main point of your comment is to hate on an Indian company solely for being Indian and to stereotype Indian companies as low quality and low cost.

Tata Consultancy is a global company that includes offices in places like Chicago, Dallas, and Atlanta.

neepi · 7h ago

The point is if you're outsourcing to India you want low cost and are not concerned with quality. That's exactly why you go there. And India has plenty of that to give. No one there wants to work for those outsourcing chop shops, but it's an easy way into the industry so staff turnover is high as people move up the chain. So if you're outsourcing for cost cutting reasons, which they are, it's a constant cycle of low quality staff. That is the risk. Nothing to do with race or xenophobia. You put low costs above risk provision.

We can all pick and choose good stories. What about Tata Steel's total mismanagement of Tata Steel Europe? And JLR isn't exactly in good shape as you say it is. You just picked some numbers that sound good. And wrapped it in a xenophobia straw man.

d1sxeyes · 6h ago

> The point is if you're outsourcing to India you want low cost and are not concerned with quality

First part is a reasonable assumption, the second is not, and this is what’s opening you up to allegations of xenophobia.

The allure (and promise) of outsourcing is the idea that you can pay less for a comparable service due to the cost of living disparity between your location and the outsourcing provider. Whether any individual provider achieves that or not is another question, but saying “if your service is provided from India you are not concerned about quality”, or “if your service is provided from India you will have a constant cycle of low quality staff” does sound a lot like xenophobia.

> No one there wants to work for those outsourcing chop shops

This is simply not true. There are a lot of benefits to working for a company like this, although as with any company, it’s not all upside. Of course turnover is high because you have a lot of entry level folks. Regardless of where you are in the world, no-one wants to work a level one help desk until they retire.

> We can all pick and choose good stories. What about Tata Steel's total mismanagement of Tata Steel Europe? And JLR isn't exactly in good shape as you say it is. You just picked some numbers that sound good.

Yeah odd example, JLR is not doing great, and Tata Steel is also struggling, but overall the Tata Group are doing well.

neepi · 5h ago

Have you considered that quality tends to cost the same everywhere as it is fairly rare? It’s not commoditised.

The problem is crap people are cheaper in India than crap people elsewhere. And that looks good on the balance sheet.

And as we’re about maximising shareholder value these days then that’s fine apparently.

arp242 · 17h ago

This incident has little to do with website or web store as such, and the only reason those are impacted is because pretty much all of M&S's IT systems have been impacted. Even if someone else would be running all of that, chances are that would still interface with the M&S computer systems to accurately get inventory information and the like.

benjaminwootton · 17h ago

They outsource as much as they can to the cheapest system integrators they can find, primarily TCS.

dangus · 11h ago

This take doesn't make sense. If one of your core businesses is selling clothes online, and you're a large enough entity, you should write your own software to sell clothes online.

Basically by your exact same logic you're asking Walmart and Target to outsource their websites, which is completely insane.

crop_rotation · 4h ago

> Basically by your exact same logic you're asking Walmart and Target to outsource their websites, which is completely insane.

No, because the Walmart website being down for months is just unthinkable. If a company is so incompetent/resource deficient/use your favorite phrase to describe it that their e-commerce is down for a month, and the outlook is that it will be down for longer, then something is seriously wrong with the company. Such companies are 100% going to have a much better experience with Shopify.

> If one of your core businesses is selling clothes online, and you're a large enough entity, you should write your own software to sell clothes online.

If only this was so easy. For various reasons writing non trivial software is hard, and unless these companies can make some structural changes to hire and retain very good engineers (which also is hard for various reasons for these companies), they simply have no chance of doing better than shopify.

woah · 16h ago

Holy shit why don't they just set up a Shopify

wavemode · 16h ago

Bureaucracy is almost always the reason. They don't just need a website, they need -their- website back, because it was programmed with a million little business rules and pricing logic and regulatory requirements.

xp84 · 13h ago

You're surely not wrong that a lot of things would need to be done without, but I'd like to think that if I were 'king of M&S' I could have identified a subset of merchandise that could be loaded into a suitable interim solution like Shopify within say, 4 weeks, if the only other option was forgoing all online sales for 12 weeks +.

That would also take a lot of the pressure off of the "full recovery team."

Of course, the real situation must be 100x more complex than I'm imagining it so "I'd like to think" != "I am confident"

NoImmatureAdHom · 11h ago

Four weeks? You mean two days???

The real situation is not 100x more complex, it's just that this is happening in Britain where everything is someone else's job and no one has any reason to care about the actual goals and everyone will go home at 5pm. Or, more likely, to the pub.

dangus · 11h ago

You really think Shopify scales to a large department store?

The largest enterprise example of a Shopify customer on their marketing website has $500 million in sales.

M&S has an annual revenue of over £10 billion

mystified5016 · 11h ago

> “There is no change to our strategy and our longer-term plans to reshape M&S for growth and, if anything, the incident allows us to accelerate the pace of change as we draw a line and move on.”

I wonder if people like this ever hear themselves talking.

wyager · 15h ago

It's weird to me how it often seems like the US and China are the only countries capable of mega-scale tech infrastructure like this (and even then, only in some industries). Can you imagine Wal-mart's website going down for multiple months?

I think a lot of companies (especially in Europe) have not internalized that, yes, you actually do need to expend apparently exorbitant amounts of money on highly-paid engineers if you want your tech to actually be good. Many countries, including the UK, are simply not wealthy enough to do it at scale. They produce plenty of engineers, but most of the ones capable of holding complicated stuff together probably end up working for US companies that can pay them market rates.

zwog · 12h ago

Interestingly, the gov.uk website and everything around it is a prime example of software that just works, in terms of both performance and accessibility. I work/volunteer for a non-profit design agency and we use the gov.uk design system and I just love it: https://design-system.service.gov.uk/

NoImmatureAdHom · 11h ago

I have had very different experiences!

tristor · 14h ago

With the case of M&S, and in many other cases in UK tech history that have gone poorly, it's mostly examples of the failure of hiring outside consultancies in India to do everything. Business executives continuously fall afoul of the fungibility myth. They believe that engineers are fungible, and that they should therefore simply pay for the cheapest engineers possible that meet the "requirements" on paper, usually set by someone who is not an engineer (HR, project manager, or a lower ranked middle-manager).

Time and time and time again we have seen major failures globally, and especially in the UK, that prove that there is no fungibility of engineers, and that outsourcing the critical technical infrastructure for your core systems and services is doomed to failure. They'd rather save a dollar today and lose ten million dollars tomorrow by damaging their national economy and sending more money to India. India's GDP is basically entirely propped up by tech services, and most of that is /failed service delivery/, hard to differentiate from frauds and scams at scale.

theuppermiddle · 8h ago

Just because consultants are from India does not make them incompetent. Not to mention the attack vector was social engineering, which can happen to any person of any nationality. 70% of India's GDP is domestic consumption. 7% of the GDP is service export, of which only 3.5% is software services. While significant, it certainly is not "propping" up India's GDP.

When will M&S take online orders again?

Comments (116)