You Wouldn't Download a Hacker News

280 points by jasonthorsness | 146 comments | 4/30/2025, 1:26:31 AM | jasonthorsness.com

Comments (146)

mattkevan · 6h ago
I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.

nthingtohide · 1h ago
> an innocent machine about wanking and divorce

Let's say you discovered a pendrive from a long-lost civilization and trained a model on that text data. How would you or the model know that the pendrive contained data on wanking and divorce without any kind of external grounding for that data?

falcor84 · 5h ago
What's wrong with wanking and divorce? These are respectively a way for people to be happier and more self-reliant, and a way for people to get out of a situation that isn't working out for them. I think both are net positives, and I'm very grateful to live in a society that normalizes them.
pc86 · 1h ago
I'm not implying that divorce should be stigmatized or prohibited or anything, but it is bad (necessary evil?) and most people would be much happier if they had never married that person in the first place rather than married them then gotten divorced.

So "normalize divorce" is pretty backward when what we should be doing is normalizing making sure you're marrying the right person.

nhod · 34m ago
This reminds me of one of my very favorite essays of all time, "Why You Will Marry the Wrong Person" by Alain de Botton from the School of Life. The title is somewhat misleading, and I resisted reading it for a couple years as a result. It is exquisite writing — it couldn't be said with fewer words, and adding more wouldn't help either — and an extraordinary and ultimately hopeful meditation on love and marriage.

NYT Gift Article: https://www.nytimes.com/2016/05/29/opinion/sunday/why-you-wi...

cgriswald · 41m ago
Making sure you are marrying the right person is normalized. I’d have never even known my ex wasn’t the right person if I hadn’t married her. I didn’t come out of my marriage worse off.

Normalize divorce and stop stigmatizing it by calling it bad or evil.

dcuthbertson · 5h ago
The innocent machine can't do either. It's akin to having no mouth, but it must scream (apologies to Harlan Ellison)
falcor84 · 2h ago
That is a fair point, but it would then apply to everything else we teach it about, like how we perceive the color of the sky or the taste of champagne. Should we remove these from the training set too?

Is it not still good to be exposed to the experiences of others, even if one cannot experience these things themself?

adamc · 41m ago
Having gone through a divorce... no. It would be better if people tried harder to make relationships work. Failing that, it would be better to not marry such a person.
falcor84 · 34m ago
People sometimes grow in different directions. Sometimes the person who was perfect for you at 25 just isn't a good fit for you at age 40, regardless of how hard you try to make it work.
montebicyclelo · 7h ago
There are also two DBs I know of that keep an updated Hacker News table for running analytics on, without needing to download it first.

- BigQuery (requires a Google Cloud account; querying should fit within the free tier, I'd guess): `bigquery-public-data.hacker_news.full`

- ClickHouse: no signup needed, you can run queries directly in the browser [1]

[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
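
For anyone who wants to try the BigQuery route, here is a minimal sketch in Go. It assumes the `cloud.google.com/go/bigquery` client, that the public table exposes a `timestamp` column, and a placeholder project ID ("my-project"); treat it as an illustration, not a tested recipe.

    package main

    import (
        "context"
        "fmt"
        "log"

        "cloud.google.com/go/bigquery"
        "google.golang.org/api/iterator"
    )

    func main() {
        ctx := context.Background()
        // "my-project" is a placeholder for whatever GCP project the query is billed to.
        client, err := bigquery.NewClient(ctx, "my-project")
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Count Hacker News items per year in the public dataset.
        q := client.Query(
            "SELECT EXTRACT(YEAR FROM `timestamp`) AS year, COUNT(*) AS n " +
                "FROM `bigquery-public-data.hacker_news.full` " +
                "GROUP BY year ORDER BY year")
        it, err := q.Read(ctx)
        if err != nil {
            log.Fatal(err)
        }
        for {
            var row struct {
                Year int64 `bigquery:"year"`
                N    int64 `bigquery:"n"`
            }
            if err := it.Next(&row); err == iterator.Done {
                break
            } else if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%d: %d items\n", row.Year, row.N)
        }
    }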

kordlessagain · 2h ago
xnx · 2h ago
The ClickHouse resource is amazing. It even has history! I had already done my own exercise of downloading all the JSON before discovering the Clickhouse HN DBs.
bambax · 7h ago
> Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.

The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?

icoder · 6h ago
I'm more and more convinced of an old idea that seems to become more relevant over time: to somehow form a network of trust between humans so that I know that your account is trusted by a person (you) that is trusted by a person (I don't know) [...] that is trusted by a person (that I do know) that is trusted by me.

Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).

Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.

Because it's a network, it can adjust itself to people trying to game the system, but it remains an open question how robust it would be.
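
As a toy illustration of the weighted-chain idea (the names, weights, hop limit, and decay factor below are all invented, and it ignores the privacy and scale issues mentioned above):

    package main

    import (
        "fmt"
        "math"
    )

    // Trust is a signed weight in [-1, 1] on directed edges; indirect trust decays
    // with each hop, and distrust (negative weights) lowers the value of any chain
    // that reaches an account through it.
    type Graph map[string]map[string]float64

    // trust returns the strongest-known trust value from 'from' to 'target',
    // exploring chains of at most maxDepth hops and attenuating each hop by decay.
    func trust(g Graph, from, target string, maxDepth int, decay float64) float64 {
        type state struct {
            node  string
            value float64
            depth int
        }
        best := 0.0
        queue := []state{{from, 1.0, 0}}
        for len(queue) > 0 {
            s := queue[0]
            queue = queue[1:]
            if s.depth == maxDepth {
                continue
            }
            for next, w := range g[s.node] {
                v := s.value * w * decay
                if next == target && math.Abs(v) > math.Abs(best) {
                    best = v
                }
                if v > 0 { // only keep walking through accounts we positively trust
                    queue = append(queue, state{next, v, s.depth + 1})
                }
            }
        }
        return best
    }

    func main() {
        g := Graph{
            "me":    {"alice": 0.9, "mallory": -0.8},
            "alice": {"bob": 0.8},
            "bob":   {"carol": 0.7},
        }
        fmt.Printf("me -> carol:   %.2f\n", trust(g, "me", "carol", 4, 0.9))   // trusted via a chain
        fmt.Printf("me -> mallory: %.2f\n", trust(g, "me", "mallory", 4, 0.9)) // directly distrusted
    }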

Philpax · 5h ago
genewitch · 5h ago
The Matrix protocol, or at least the clients, represents a key as a set of several emoji - which is fine - and you verify by looking at the keys (on each client) at the same time, ideally in person. I've only ever signed for people in person, plus one remote attestation; but there we had a separate verified private channel and attested the emoji that way.
nickdothutton · 3h ago
Do these still happen? They were common (-ish, at least in my circles) in the 90s during the crypto wars, often at the end of conferences and events, but I haven't come across them in recent years.
haswell · 2h ago
I’ve also been thinking about this quite a bit lately.

I also want something like this for a lightweight social media experience. I’ve been off of the big platforms for years now, but really want a way to share life updates and photos with a group of trusted friends and family.

The more hostile the platforms become, the more viable I think something like this will become, because more and more people are frustrated and willing to put in some work to regain some control of their online experience.

brongondwana · 1h ago
Also there's the problem that every human has to have perfect opsec or you get the problem we have now, where there are massive botnets out there of compromised home computers.
marcusb · 2h ago
Isn't this vaguely how the invite system at Lobsters functions? There's a public invite tree, and users risk their reputation (and posting access) when they invite new users.
withinboredom · 1h ago
I know exactly zero people over there. I am also not about to go brown nose my way into it via IRC (or whatever chat they are using these days). I'd love to join, someday.
somethingsome · 53m ago
Hey, I never actually tried Lobsters; do you mind if I ask for an invite?
SuperShibe · 3h ago
I think this idea's problem might be the people part, specifically the majority of people who will click absolutely anything for a free iPad.
XorNot · 6h ago
I think technically this is the idea that GPG's web of trust was circling around without quite landing on, which is the oddest thing about the protocol: it's used mostly today for machine authentication, which it's quite good at (e.g. deb repos)... but the tooling is actually generally oriented around verifying and trusting people.
wobfan · 5h ago
Yeah, exactly, this was exactly the idea behind it. Unfortunately, while on paper it sounds like a sound idea (at least IMO), the WOT approach in PGP has proven time and time again to have no chance against the laziness of humans.
drcongo · 6h ago
I actually built this once, a long time ago for a very bizarre social network project. I visualised it as a mesh where individuals were the points where the threads met, and as someone's trust level rose, it would pull up the trust levels of those directly connected, and to a lesser degree those connected to them - picture a trawler fishing net and lifting one of the points where the threads meet. Similarly, a user whose trust lowered over time would pull their connections down with them. Sadly I never got to see it at the scale it needed to become useful as the project's funding went sideways.
littlestymaar · 6h ago
Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.

For a mix of ideological reasons and a lack of genuine interest in the internet from legislators, mainly due to the generational factor I'd guess, it hasn't happened yet, but I expect government-issued equivalents of IDs and passports for the internet to become mainstream sooner rather than later.

eadmund · 5h ago
> Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.

I don't think that really follows. Business credit bureaus and Dun & Bradstreet have been privately enabling trust between unfamiliar parties for quite a long time. Various networks of merchants did the same in the Middle Ages.

littlestymaar · 4h ago
> Business credit bureaus and Dun & Bradstreet have been privately enabling trust between unfamiliar parties for quite a long time.

Under the supervision of the State (they are regulated and rely on the justice and police system to make things work).

> Various networks of merchants did the same in the Middle Ages.

They did, and because there was no State, the amount of trust they could build was fairly limited compared to what has later been made possible by the development of modern states (the industrial revolution appearing in the UK has partly been attributed to the institutional framework that existed there early on).

Private actors can, do, and always have built their own makeshift trust networks, but building a society-wide trust network is a key pillar of what makes modern states “States” (and it derives directly from the “monopoly of violence”).

im3w1l · 4h ago
GPG lost, TLS won. Both are actually webs of trust with the same underlying technology, but they have different cultures and so different shapes. GPG culture is to trust your friends and have them trust their friends. With TLS culture you trust one entity (e.g. the browser) that trusts a couple dozen entities (the root certificate authorities), which either sign keys directly or fan out to intermediate authorities that then sign keys. The hierarchical structure has proven much more successful than the decentralized one.

Frankly I don't trust my friends of friends of friends not to add thirst trap bots.

lxgr · 3h ago
The difference is in both culture and topology.

TLS (or more accurately, the set of browser-trusted X.509 root CAs) is extremely hierarchical and all-or-nothing.

The PGP web of trust is non-hierarchical and decentralized (from an organizational point of view). That unfortunately makes it both more complex and less predictable, which I suppose is why it “lost” (not that it’s actually gone, but I personally have about one or maybe two trusted, non-expired keys left in my keyring).

kevin_thibedeau · 2h ago
The issue is key management. TLS doesn't usually require client keys. GPG requires all receivers to have a key.
nashashmi · 7h ago
We LLMs only output the average response of humanity because we can only give results that are confirmed by multiple sources. On the contrary, many of HN’s comments are quite unique insights that run contrary to the average popular thought. If this is ever to be emulated by an LLM, we would give only gibberish answers. If we had a filter on that gibberish to only permit answers that are reasonable and sensible, our answers would be boring and still be gibberish. In order for our answers to be precise, accurate and unique, we must use something other than LLMs.
miki123211 · 6h ago
How do you know it isn't already happening?

With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.

Joker_vD · 5h ago
But what if LLMs start leaving constructive and helpful comments? I personally would feel like xkcd [0], but others may disagree.

[0] https://xkcd.com/810/

Pikamander2 · 1h ago
I was browsing a Reddit thread recently and noticed that all of the human comments were off-topic one-liners and political quips, as is tradition.

Buried at the bottom of the thread was a helpful reply by an obvious LLM account that answered the original question far better than any of the other comments.

I'm still not sure if that's amazing or terrifying.

gosub100 · 4h ago
That's the moment we will realize that it's not the spam that bothers us, but rather that there is no human interaction. How vapid would it be to have a bunch of fake comments saying eat more vegetables, good job for not running over that animal in the road, call mom tonight it's been a while, etc. They mean nothing if they were generated by a piece of silicon.
miki123211 · 13m ago
I think a much more important question is what happens when we have no idea who's an LLM and who's a real person.

Do we accuse everybody of being an LLM? Will most threads devolve into "you're an LLM, no you're the LLM" wars? Will this give an edge to non-native English speakers, because grammatical errors are an obvious tell that somebody is human? Will LM makers get over their squeamishness and make "write like a Mexican who barely speaks English" a prompt that works and produces good results?

Maybe the whole system of anonymity on the internet gets dismantled (perhaps after uncovering a few successful llm-powered psy-ops or under the guise of child safety laws), and everybody just needs to verify their identity everywhere (or login with Google)? Maybe browser makers introduce an API to do this as anonymously and frictionlessly as possible, and it becomes the new normal without much fuss? Is turnstile ever going to get good enough to make this whole issue moot?

I think we have a very interesting few years in front of us.

withinboredom · 1h ago
I believe they mean whatever you want them to mean. Humanity has existed on religion based on what some dead people wrote down, just fine. Er, well, maybe not "just fine", but hopefully you get the gist: you can attribute whatever meaning you want to the AI, a holy text, or other people.
gosub100 · 32m ago
Religion is the opposite of AI text generation. It brings people together to be less lonely.

AI actively tears us apart. We no longer know if we're talking to a human, or if an artists work came from their ability, or if we will continue to have a job to pay for our living necessities.

melagonster · 3h ago
This is just another Reddit or HN.
kriro · 1h ago
I think LLMs could be a great driver of public/private key cryptography. I could see a future where everyone finally wants to sign their content. Then at least we know it's from that person, or from an LLM agent run by that person.

Maybe that'll be a use case for blockchain tech. See the whole posting history of the account on-chain.

r3trohack3r · 3h ago
HN already has a pretty good immune system for this sort of thing. Low-effort or repetitive comments get down-voted, flagged, and rate-limited fast. The site’s karma and velocity heuristics are crude compared with fancy ML, but they work because the community is tiny relative to Reddit or Twitter and the mods are hands-on. A fleet of sock-puppet LLM accounts would need to consistently clear that bar—i.e. post things people actually find interesting—otherwise they’d be throttled or shadow-killed long before they “replace all human text.”

Even if someone managed to keep a few AI-driven accounts alive, the marginal cost is high. Running inference on dozens of fresh threads 24/7 isn’t free, and keeping the output from slipping into generic SEO sludge is surprisingly hard. (Ask anyone who’s tried to use ChatGPT to farm karma—it reeks after a couple of posts.) Meanwhile the payoff is basically zero: you can’t monetize HN traffic, and karma is a lousy currency for bot-herders.

Could we stop a determined bad actor with resources? Probably, but the countermeasures would look the same as they do now: aggressive rate-limits, harsher newbie caps, human mod review, maybe some stylometry. That’s annoying for legit newcomers but not fatal. At the end of the day HN survives because humans here actually want to read other humans. As soon as commenters start sounding like a stochastic parrot, readers will tune out or flag, and the bots will be talking to themselves.

Written by GPT-3o

stephenhumphrey · 31m ago
Regardless of whether that final line reflects reality or is merely tongue-in-cheek snark, it elevates the whole post into the sublime.
djoldman · 3h ago
A variant of this was done for 4chan by the fantastic Yannic Kilcher:

https://en.wikipedia.org/wiki/GPT4-Chan

Etheryte · 4h ago
See the Metal Gear franchise [0], the Dead Internet Theory [1], and many others who have predicted this.

> Hideo Kojima's ambitious script in Metal Gear Solid 2 has been praised, some calling it the first example of a postmodern video game, while others have argued that it anticipated concepts such as post-truth politics, fake news, echo chambers and alternative facts.

[0] https://en.wikipedia.org/wiki/Metal_Gear

[1] https://en.wikipedia.org/wiki/Dead_Internet_theory

holuponemoment · 6h ago
Does it even matter?

Perhaps I am jaded, but most if not all people regurgitate about topics without thought or reason along very predictable paths, myself very much included. You can mention a single word draped in a muleta (the Spanish bullfighting cape) and the average person will happily charge at it and give you a predictable response.

bob1029 · 6h ago
It's like a Pavlovian response in me to respond to anything SQL or C# adjacent.

I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...

OccamsMirror · 3h ago
I always love checking the comments on articles about Bevy to see how the metaverse client guy is going.
gosub100 · 3h ago
The paths are going to be predictable by necessity. It's not possible for everyone to have a uniquely derived interpretation of most common issues, whether that's standard lightning-rod politics or, to some extent, tech socio/political issues.
no_time · 6h ago
I can't think of a solution that preserves the open and anonymous nature we enjoy now. I think most open internet forums will go one of the following routes:

- ID/proof-of-human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper, etc. Note that some sites already do this by proxy when they whitelist something like 5 big email providers they accept for a new account.

- Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.

- Playing whack-a-mole with spammers (and eventually losing). 4chan does this by requiring you to solve a captcha and to pass the Cloudflare Turnstile, which may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonymizing you through this process too.

All options sound pretty grim to me. I'm not looking forward to the AI spam era of the internet.

theasisa · 5h ago
Wouldn't those only mean that the account was initially created by a human? Afterwards there would be no guarantee that the posts are by humans.

You'd need to have a permanent captcha that tracks that the actions you perform are human-like, such as mouse movement or scrolling on a phone, etc. And even then it would only deter current AI bots, and not for long, as impersonating human behavior would be a 'fun' challenge to break.

Trusted relationships are only as trustworthy as the humans trusting each other, eventually someone would break that trust and afterwards it would be bots trusting bots.

Due to bots already filling up social media with their spew, and that spew being used to train other bots, the only way I see this resolving itself is by everything eventually becoming nonsensical, and I predict we aren't that far from it happening. AI will eat itself.

no_time · 4h ago
> Wouldn't those only mean that the account was initially created by a human? Afterwards there would be no guarantee that the posts are by humans.

Correct. But for curbing AI slop comments this is enough, imo. As of writing this, you can quite easily spot LLM-generated comments and ban them. If you have a verification system in place, then you've banned the human too, meaning you've put a stop to their spamming.

icoder · 6h ago
I sometimes think about account verification that requires work/effort over time (it could even be something fun), so that it becomes a lot harder to verify a whole army of accounts. We don't need identification per se, just proof of being human and (somewhat) unique.

See also my other comment on the same parent wrt a network of trust. That could perhaps vet out spammers and trolls. On one hand it seems far-fetched and a quite underdeveloped idea; on the other hand, social interaction (including discussions like these) as we know it is in serious danger.

dns_snek · 6h ago
There must be a technical solution to this based on some cryptographic black magic that both verifies you to be a unique person to a given website without divulging your identity, and without creating a globally unique identifier that would make it easy to track us across the web.

Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.

vvillena · 5h ago
These kinds of solutions are already deployed in some places. A trusted ID server creates a bunch of anonymous keys for a person, and the person uses these keys to identify themselves to pages that accept the ID server's keys. The page has no way to identify a person from a key.

The weak link is in the ID servers themselves. What happens if the servers go down, or if they refuse to issue keys? Think a government ID server refusing to issue keys for a specific person. Pages that only accept keys from these government ID servers, or that are forced to only accept those keys, would be inaccessible to these people. The right to ID would have to be enshrined into law.

no_time · 4h ago
As I see it, a technical solution to AI spam inherently must include a way to uniquely identify particular machines at best, and particular humans responsible for said machines at worst.

This verification mechanism must include some sort of UUID to rein in a single bad actor who happens to validate his/her bot farm of 10,000 accounts from the same certificate.

05 · 5h ago
dns_snek · 1h ago
I don't think that's what I was going for? As far as I can see it relies on a locked down software stack to "prove" that the user is running blessed software on top of blessed hardware. That's one way of dealing with bots but I'm looking for a solution that doesn't lock us out of our own devices.
ahoka · 7h ago
Probably already happening.
dangoodmanUT · 2h ago
I imagine LLMs already have this too
genewitch · 5h ago
I have all of n-gate as json with the cross references cross referenced.

Just in case I need to check for plagiarism.

I don't have enough Vram nor enough time to do anything useful on my personal computer. And yes I wrote vram like that to pothole any EE.

_Algernon_ · 6h ago
This is probably already happening to some extent. I think the best we can hope for is xkcd 810: https://xkcd.com/810/
drcongo · 6h ago
The internet is going to become like William Basinski's Disintegration Loops, regurgitating itself with worse fidelity until it's all just unintelligible noise.
userbinator · 8h ago
> I had a 20 GiB JSON file of everything that has ever happened on Hacker News

I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN existed? That averages to over 2MB per day, or around 7.5KB/s.

olalonde · 7m ago
7.5KB/s (aka 7500 characters per second) didn't sound realistic... So I did the math[0] and it turns out it's closer to 34 bytes/s (0.03 KB/s). And it's really lower than that because of all the metadata and syntax in the JSON. You were right about the "over 2MB per day" though.

[0] Well, ChatGPT did, but I verified and the math seemed to check out: https://chatgpt.com/share/68124afc-c914-800b-8647-74e7dc4f21...
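
For reference, the back-of-the-envelope arithmetic, assuming a launch date of February 2007 (approximate):

    package main

    import (
        "fmt"
        "time"
    )

    // Back-of-the-envelope check of the numbers above: 20 GB of JSON spread over
    // HN's lifetime (launched February 2007; dates approximate).
    func main() {
        const totalBytes = 20e9
        launch := time.Date(2007, time.February, 19, 0, 0, 0, 0, time.UTC)
        asOf := time.Date(2025, time.April, 30, 0, 0, 0, 0, time.UTC)
        seconds := asOf.Sub(launch).Seconds()
        fmt.Printf("%.0f bytes/s\n", totalBytes/seconds)          // roughly 35 bytes/s
        fmt.Printf("%.1f MB/day\n", totalBytes/seconds*86400/1e6) // roughly 3 MB/day
    }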

sph · 7h ago
2 MB per day doesn't sound like a lot. The number of posts has probably increased exponentially over the years, especially after the Reddit fiasco, when we had our latest, and biggest, neverending September.

Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.

samplatt · 7h ago
Plus the JSON structure metadata, which for the average comment is going to add, what, 10%?
kevincox · 5h ago
I suspect it is closer to a 100% increase for the average comment. If the average comment is a few sentences and the metadata has an id, parent id, author, timestamp, and a vote count, that can add up pretty fast.
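
As a rough illustration, here is a hypothetical comment (made-up ID, author, and text) serialized with the field set the HN API returns for comments; for short comments the keys and metadata are comparable in size to the text itself:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Rough illustration of the metadata overhead for a single short comment.
    func main() {
        comment := map[string]any{
            "id":     44000001, // made-up item ID
            "type":   "comment",
            "by":     "someuser",
            "time":   1746000000,
            "parent": 44000000,
            "kids":   []int{44000002},
            "text": "This is roughly what an average Hacker News comment looks like: " +
                "a few sentences of plain text, maybe with a link or a quote thrown in.",
        }
        b, _ := json.Marshal(comment)
        text := comment["text"].(string)
        fmt.Printf("total %d bytes, text %d bytes, metadata %d bytes\n",
            len(b), len(text), len(b)-len(text))
    }
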
FabHK · 6h ago
Around one book every 12 hours.
xnx · 2h ago
20 GB of JSON is surprising to me. I have an SQLite file of all HN data that is 20 GB; it would be much larger as JSON.
jakegmaths · 11h ago
Your query for Java will include all instances of JavaScript as well, so you're over-representing Java.
smarnach · 10h ago
Similarly, the Rust query will include "trust", "antitrust", "frustration" and a bunch of other words
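
One cheap mitigation, shown here purely as an illustration (the article's actual queries may differ): match on word boundaries rather than raw substrings.

    package main

    import (
        "fmt"
        "regexp"
    )

    // Word-boundary matching avoids counting "JavaScript" as Java and
    // "antitrust"/"frustration" as Rust.
    func main() {
        java := regexp.MustCompile(`(?i)\bjava\b`)
        rust := regexp.MustCompile(`(?i)\brust\b`)

        for _, s := range []string{
            "I love Java and the JVM",
            "JavaScript fatigue is real",
            "Rust's borrow checker",
            "the antitrust ruling caused frustration",
        } {
            fmt.Printf("java=%-5v rust=%-5v %q\n", java.MatchString(s), rust.MatchString(s), s)
        }
    }
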
sph · 7h ago
A guerrilla marketing plan for a new language is to name it after a common one-syllable word, so that it appears much more prominent than it really is in badly-done popularity contests.

Call it "Go", for example.

(Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)

setopt · 7h ago
Let’s make a language called “A” in that case. (I mean C was fine, so why not one letter?)
TZubiri · 58m ago
Or call it the name of a popular song to appeal to the youngins.

I present to you "Gangam C"

InDubioProRubio · 7h ago
You also wouldn't hijack and overload an acronym to boost mental presence among gamers LOL
matsemann · 8h ago
Reminded me of the Scunthorpe problem https://en.wikipedia.org/wiki/Scunthorpe_problem
jasonthorsness · 11h ago
Ah right… maybe even more unexpected then to see a decline
cs02rm0 · 8h ago
I'm not so sure. While Java's never looked better to me, it does "feel" to me to be in significant decline in terms of what people are asking for on LinkedIn.

I'd imagine these days TypeScript or Node might be taking over some of what would have hit on JavaScript.

karel-3d · 2h ago
New Java actually looks good, but most of the actual Java ecosystem is stuck in the past... and you will mostly be working within the existing ecosystem.
cess11 · 5h ago
Recruiting Java developers is easy mode; there are rather large consultancies and similar suppliers that will sell or rent them to you in bulk, so you don't need to nag with adverts to the same extent as with Pythonistas, Rubyists, and TypeScript developers.

But there is likely some decline for Java. I'd bet Elixir and Erlang have been nibbling away at the JVM space for quite some time; they make it pretty comfortable to build the kind of systems you'd otherwise use a JVM-JMS-Wildfly/JBoss rig for. Oracle doesn't help: they take zero issue with being widely perceived as nasty, and it takes a bit of courage and knowledge to avoid getting a call from them at an inconvenient time.

patates · 4h ago
Speaking as someone who ended up in the corporate Java world somewhat accidentally (wasn't deep in the ecosystem before): even the most invested Java shops seem wary of Oracle's influence now. Questioning Oracle tech, if not outright planning an exit strategy, feels like the default stance.
SilverBirch · 7h ago
What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?
euroderf · 7h ago
Not to mention three-letter agencies, incidentally attaching real names to HN monikers?
krapp · 6h ago
HN has an API, as mentioned in the article, which isn't even rate limited. And all of the data is hosted on Firebase, which is a YC company. It's fine.
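
The endpoints in question are plain Firebase REST, e.g. `/v0/maxitem.json` and `/v0/item/<id>.json`. A minimal sketch (the Item fields shown are only a subset of what the API returns):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // A subset of the fields the HN API returns for an item.
    type Item struct {
        ID    int    `json:"id"`
        Type  string `json:"type"`
        By    string `json:"by"`
        Time  int64  `json:"time"`
        Title string `json:"title"`
        Text  string `json:"text"`
    }

    func getJSON(url string, v any) error {
        resp, err := http.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        return json.NewDecoder(resp.Body).Decode(v)
    }

    func main() {
        // Fetch the current max item ID, then fetch that item.
        var maxID int
        if err := getJSON("https://hacker-news.firebaseio.com/v0/maxitem.json", &maxID); err != nil {
            panic(err)
        }
        var item Item
        url := fmt.Sprintf("https://hacker-news.firebaseio.com/v0/item/%d.json", maxID)
        if err := getJSON(url, &item); err != nil {
            panic(err)
        }
        fmt.Printf("latest item %d: type=%s by=%s\n", item.ID, item.Type, item.By)
    }
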
mikeevans · 3h ago
Firebase is owned and operated by Google (has been for a while).
alt227 · 2h ago
If something is on the public web, it is already being scraped by thousands of bots.
TZubiri · 57m ago
Well, it's called Hacker News, so hacking is fair game, at least in the good sense of the word.
dangoodmanUT · 2h ago
there's literally an API they promote. Did you read that part before trying to cancel them?
wslh · 7m ago
It would be great if it were available as a torrent. There are also mutable torrents [1]. They're not implemented everywhere, but implementations are available [2].

[1] https://www.bittorrent.org/beps/bep_0046.html

[2] https://www.npmjs.com/package/bittorrent-dht

flakiness · 10h ago
I have done something similar. I cheated by using the BigQuery dataset (which somehow keeps getting updated), exporting the data to Parquet, downloading it, and querying it with DuckDB.
minimaxir · 10h ago
That's not cheating, that's just pragmatic.
AbstractH24 · 2h ago
What a pragmatic way to rationalize most cheating
ashish01 · 11h ago
I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing which would be cool to implement: more recent items update more over time, so recently downloaded items go stale faster than older ones.
jasonthorsness · 11h ago
Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…

I used a function based on the age for staleness, it considers things stale after a minute or two initially and immutable after about two weeks old.

    // DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
    // created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.

    const DefaultStaleIf = "(:now-refreshed)>" +
        "(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"

https://github.com/jasonthorsness/unlurker/blob/main/hn/core...
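
For the curious, that expression translates to roughly the following Go; this is an illustrative rendering only, since the real evaluation happens inside unlurker's expression engine:

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    // staleAfter returns, for an item of the given age, how long after the last
    // refresh it is considered stale: about a minute when new, then growing
    // logarithmically, then cubically so old items are effectively never refreshed.
    func staleAfter(age time.Duration) time.Duration {
        minutes := math.Max(0, age.Minutes())
        days := age.Hours() / 24
        seconds := 60 * (math.Log2(minutes+1) + math.Pow(days, 3))
        return time.Duration(seconds * float64(time.Second))
    }

    func main() {
        for _, age := range []time.Duration{
            time.Minute, time.Hour, 24 * time.Hour, 7 * 24 * time.Hour, 14 * 24 * time.Hour,
        } {
            fmt.Printf("age %-10v -> stale after %v\n", age, staleAfter(age).Round(time.Second))
        }
    }
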
xnx · 2h ago
I have this data and a bunch of interesting analysis to share. Any suggestions on the best method to share results?

I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.

Is there a good tool for making charts directly from Clickhouse data?

texodus · 1h ago
No Clickhouse connector for free accounts yet, but if you can drop a Parquet file on S3 you can try https://prospective.co
xnx · 1h ago
Thanks! I'll check that out. Thought it was a typo of "Perspective" for a moment: https://perspective.finos.org/
texodus · 51m ago
Yes! This is the pro version; we also develop the open-source https://github.com/finos/perspective (which Prospective is substantially built on, with some customizations such as a wasm64 runtime).
shayway · 3h ago
Hah, I've been scraping HN over the past couple weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what the actual percentage of posts on HN were about AI, and also how it compared to other things heavily hyped in the past like Web3 and crypto.
alt227 · 41m ago
Here, the entire history of HN with the ability to run queries on it directly in the browser :)

https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

stefs · 8h ago
please do not use stacked charts! i think it's close to impossible not to distort the reader's impression because a) it's very hard to gauge the height of a certain data point in the noise and b) they imply a dependency where there _probably_ is none.
jasonthorsness · 2h ago
It's true :( but line charts of the data had too much overlap and it was hard to see anything. I was thinking next time maybe multiple line charts aligned and stacked, with one series per region?
seabass · 7h ago
My first thought as well! The author of uPlot has a good demo illustrating their pitfalls https://leeoniya.github.io/uPlot/demos/stacked-series.html
dguest · 7h ago
How do you feel about stacked plots on a logarithmic y axis? Some physics experiments do this all the time [1] but I find them pretty unintuitive.

[1]: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-...

lblume · 6h ago
What is this even supposed to represent? The entire justification I could give for stacked bars is that you could permute the sub-bars and obtain comparable results. Do the bars still represent additive terms? Multiplicative constants? As a non-physicist I would have no idea on how to interpret this.
dguest · 6h ago
It's a histogram. Each color is a different simulated physical process: they can all happen in particle collisions, so the sum of all of them should add up to the data the experiment takes. The data isn't shown here because it hasn't been taken yet: this is an extrapolation to a future dataset. And the dotted lines are some hypothetical signal.

The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.

For the record I think it's a terrible convention, it just somehow became standard in some fields.

sebastianmestre · 2h ago
Can you remake the stacked graphs with the variable of interest at the bottom? It's hard to see the percentage of Rust when it's all the way at the top with a lot of noise in the lower layers.

Edit: or make a non-stacked version?

jasonthorsness · 2h ago
Lots of valid criticism here of these graphs and the queries; I'll write a follow-up article.
9rx · 9h ago
> The Rise Of Rust

Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!

emilbratt · 8h ago
The chart is a stacked one, so we are looking at the height each category takes up, not the height each category reaches.
matsemann · 10h ago
One thing I'm curious about, but I guess isn't visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I guess could be scraped: which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
pjc50 · 6h ago
I don't think you can get the individual vote interactions, and that's probably a good thing. It is irritating that the "API" won't let me get vote counts; I should go back to my Python scraper of the comments page, since that's the only way to get data on post scores.

I've probably written over 50k words on here and was wondering if I could restructure my best comments into a long meta-commentary on what does well here and what I've learned about what the audience likes and dislikes.

(HN does not like jokes, but you can get away with it if you also include an explanation)

minimaxir · 10h ago
The only vote data that is visible via any HN API is the scores on submissions.

Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.

ryandrake · 9h ago
Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.
saagarjha · 9h ago
I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.
xnx · 2h ago
Some of this data is available through the API (and Clickhouse and BigQuery).

I wrote a Puppeteer script to export my own data that isn't public (upvotes, downvotes, etc.)

nottorp · 9h ago
> Are there users I constantly upvote/downvote?

Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...

vidarh · 7h ago
The exception, to me, is if I'm questioning whether a comment was made in good faith or not, where the track record of the user on a given topic could go some way toward untangling that. It happens rarely here, compared to e.g. Reddit, but sometimes it's mildly useful.
pjc50 · 6h ago
I recognize twenty or so of the most frequent and/or annoying posters.

The leaderboard https://news.ycombinator.com/leaders absolutely doesn't correlate with posting frequency. Which is probably a good thing. You can't bang out good posts non-stop on every subject.

matsemann · 8h ago
Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?
thaumasiotes · 8h ago
> It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...

...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?

nottorp · 4h ago
Either you got the direction wrong or you'd support someone who is wrong just because you like them.

You're wrong in both cases :)

thaumasiotes · 3h ago
Maybe try rereading my comment?
nottorp · 1h ago
You're right. But I still disagree with you. Both ways are wrong if you want to maintain a constructive discussion.

Maybe you don't like my opinions on cogwheel shaving but you will agree with me on quantum frobnicators. But if you first come across my comments on cogwheel shaving and note the user name, you may not even read the comments on quantum frobnicators later.

9rx · 9h ago
> What's my upvote/downvote ratio?

Undefined, presumably. What reason would there be to take time out of your day to press a pointless button?

It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But for those not participating in good faith, i.e. trolls, it adds incredible value to know that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.

For those who seek fidget toys, there are better devices for that.

immibis · 8h ago
Actually, its most useful purpose is to hide opinions you disagree with - if 3 other people agree with you.

Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.

9rx · 8h ago
So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?
jpc0 · 6h ago
It’s not so much about rereading the comments as it being an indication to other users.

Take the C++ example above: you are likely to be downvoted for supporting C++ over Rust, and therefore most people reading through the comments (and LLMs correlating comment “karma” with how liked a comment is) will generally conclude Rust > C++, which isn’t a nuanced opinion at all and IMHO is just plain wrong a decent amount of the time. They are tools and have their uses.

So generally it shows the sentiment of the group, and humans are conditioned to follow the group.

9rx · 3h ago
An indication of what? It is impossible to know why a user pressed an arrow button. Any meaning the user may have wanted to convey remains their own private information.

All it can fundamentally serve is to act as an impoverished man's read receipt. And why would you want to give trolls that information? Fishing to find out if anyone is reading what they're posting is their whole game. Do not feed the trolls, as they say.

matsemann · 8h ago
Since there are no rules on downvoting, people probably use it for different things: some to show dissent, some only to downvote things they think don't belong, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?
saagarjha · 9h ago
If Hacker News had reactions I’d put an eye roll here.
9rx · 9h ago
You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.
deadbabe · 5h ago
Is the 20GB JSON file available?
tacker2000 · 8h ago
Yea, i also get the feeling that these rust evangelists get more annoying every day ;p
Am4TIfIsER0ppos · 2h ago
I hope they snatched my flagged comments. I would be pleased to have helped make the AI into an asshole. Here's hoping for another Tay AI.
hsbauauvhabzb · 9h ago
Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.
Havoc · 7h ago
It’s on Firebase/BigQuery to avoid people doing what OP did.

If you click the API link at the bottom of the page, it’ll explain.

jasonthorsness · 3h ago
I used the API! It only takes a few hours to download your own copy with the tool I used https://github.com/jasonthorsness/unlurker

I had to CTRL-C and resume a few times when it stalled; it might be a bug in my tool

xnx · 2h ago
Is there any advantage to making all these requests instead of using ClickHouse or BigQuery?
jasonthorsness · 2h ago
Probably not :P. I made the client for another project, https://hn.unlurker.com, and then just jumped straight to using it to download the whole thing instead of searching for an already available full data set.
andrewshadura · 9h ago
Funny nobody's mentioned "correct horse battery staple" in the comments yet…
pier25 · 10h ago
would love to see the graph of React, Vue, Angular, and Svelte
a3w · 4h ago
Cool project. Cool graphs.

But any GDPR requests for info and deletion in your inbox, yet?