Show HN: AI Peer Reviewer – Multiagent system for scientific manuscript analysis
The system uses multiple specialized agents to analyze different aspects of scientific papers, from methodology to writing quality.
Key features:
- 24 specialized agents analyzing sections, scientific rigor, and writing quality
- Detailed feedback with actionable recommendations
- PDF report generation
- Support for custom review criteria and target journals
Two ways to use it:
1. Cloud version (free during testing): https://www.rigorous.company
   - Upload your manuscript
   - Get a comprehensive PDF report within 1–2 working days
   - No setup required
2. Self-hosted version (GitHub): https://github.com/robertjakob/rigorous
   - Use your own OpenAI API keys
   - Full control over the review process
   - Customize agents and criteria
   - MIT licensed
The system is particularly useful for researchers polishing manuscripts before sharing them with co-authors or submitting to target journals.
Would love to get feedback from the HN community, especially from PhDs and researchers across all academic fields. The project is open source and we welcome contributions!
GitHub: https://github.com/robertjakob/rigorous Cloud version: https://www.rigorous.company
We're working through the submissions and will send out reports ASAP!
Since we're currently covering the model costs for you, we'd appreciate any feedback via this short form in return: https://docs.google.com/forms/d/1EhQvw-HdGRqfL01jZaayoaiTWLS...
Thanks again for testing!
I think AI systems like this could greatly help with peer review, especially as a first check before submitting a manuscript to a journal.
That said, this particular system appears to focus on the wrong issues with peer review, in my opinion. I'll ignore the fact that an AI system is not a peer, since another person already brought that up [1]. Even if this kind of system were a peer, it appears to be checking superficial issues and not the deeper issues that many peer reviewers/referees care about. I'll also ignore any security risks (other posts discuss that too).
A previous advisor of mine said that a good peer reviewer needs to ask one major/deep question when reviewing a manuscript: does the manuscript present any novel theories, novel experiments, or novel simulations, or does it serve as a useful literature review?
Papers with more novelty are inherently more publishable. This system does not address this major question and focuses on superficial aspects like writing quality, as if peer review were mere distributed editing and not something deeper. Even a well-written manuscript can lack any novelty, and novelty is what makes it worthy of publication. Moreover, many manuscripts have at best superficial literature reviews that name-drop important papers and often mischaracterize their importance.
It takes deep expertise in a subject to see how a work is novel and fits into the larger picture of a given field. This system does nothing to aid in that. Does it help identify what parts of a paper you should emphasize to prove its novelty? That is, does it help you find the "holes" in the field that need patching? Does it help show what parts of your literature review are lacking?
A lot of peer review is kinda performative, but if we are going to create automated systems to help with peer review, I would like them to focus on the most important task of peer review: assessing the novelty of the work.
(I will note that I have not tried out this particular system myself. I am basing my comments here on the documentation I looked at on GitHub and the information in this thread.)
[1] https://news.ycombinator.com/item?id=44144672
(A machine could point to similar work though.)
We didn’t think too deeply about the term “AI peer reviewer” and didn’t mean to imply it’s equivalent to human peer review. Based on your comments, we’ll stick to using “AI reviewer” going forward.
Regarding security: there is an open-source version for those who want full control. The free cloud version is mainly for convenience and faster iteration. We don’t store manuscript files longer than necessary to generate feedback (https://www.rigorous.company/privacy), and we have no intention of using manuscripts for anything beyond testing the AI reviewer.
On novelty: totally agree, it's a core part of good peer review. The current version actually includes agents evaluating originality, contribution, impact, and significance. It's still v1, of course, but we want to improve it. We'd actually love for critical thinkers like you to help shape it. If you're open to testing it with a preprint and sharing your thoughts on the feedback, that would be extremely valuable to us.
Thanks again for engaging, we really appreciate it.
When I first read "Originality and Contribution" at [1], I actually assumed it was a plagiarism check. It did not occur to me until now that you were referring to novelty with that. Similarly, I assumed "Impact and Significance" referred to more about whether the subject was appropriate for a given journal or not (would readers of this journal find this information significant/relevant/impactful or should it be published elsewhere?). That's a question that many journals do ask of referees, independent of overall novelty, but I see how you mean a different aspect of novelty instead.
I'm not opposed to testing your system with a manuscript of my own, but currently the one manuscript that I have approaching publication is still in the internal review stage at my national lab, and I don't know precisely when it will be ready for external submission. But I'll keep it in mind whenever it passes all of the internal checks.
[1] https://github.com/robertjakob/rigorous/blob/main/Agent1_Pee...
I mostly share the same sentiment, and I see a similar issue with the product. The system is not in its current poor state due to a lack of reviewers; it is due to a lack of quality reviewers and arbitrary notions of "good enough for this venue." So I wanted to express a difference of opinion about what peer review should be (I think you'll likely agree).
I don't think we are doing the scientific community any service with our current conference/journal based "peer review". The truth is that you cannot verify a paper by reading it. You can falsify it, but even that is difficult. The ability to determine novelty and utility is also a crapshoot, and we have a long history illustrating how bad we are at it. Several Nobel-prize-worthy works have been rejected multiple times for "obviousness", "lack of novelty", and being "clearly wrong." All three apply to the paper that led to the 2001 Nobel Prize in Economics[1]!
The truth of the matter is that peer review is done in the lab. It is done through replication, reproduction, and the further development of ideas. What we saw around LK-99[2] was higher-quality and more impactful peer review than any reviewer for a venue could provide. The impact existed long before any of those works were published in venues.
I think this comes down to forgetting the purpose of journals. They existed before we had tools like arXiv, OpenReview, or even GitHub, and were primarily focused on solving the logistical problem of distribution. So I consider all those technical works, "preprints", and blog posts around LK-99 replications as much of a publication as anything else. The point is that we are communicating with our peers. There has always been prestige around certain venues, but most people did not publish in them. The other venues checked for plagiarism, factual errors, and other obvious problems; otherwise, they proceeded with publication.
This silly notion of acceptance rates just creates a positive feedback loop that is overburdening the system (highly apparent in ML conferences). The notions of novelty and impact are highly noisy (as demonstrated in multiple NeurIPS studies and elsewhere), making the process far more random than acceptable. I don't think this is all that surprising: it is quite easy to poke holes in any work you come across. It does not take a genius to figure out the limitations of a work (often they're explicitly stated!).
The result of this is obvious, and it is what most researchers end up doing: resubmit elsewhere and try your luck again. Maybe the papers are improved, maybe they aren't; mostly the latter. The only thing this accomplishes is an exponentially increasing number of paper submissions and a slowing of research progress, as we spend time reformatting and resubmitting that should instead be spent researching. The quality of review comments seems to have high variance, but I can say that early in my PhD they mainly resulted in me making my work worse as I chased reviewers' comments rather than just re-rolling and moving on.
In this sense, I don't think there's a "lack of reviewers" problem so much as an acceptance-threshold problem with an arbitrary metric. I think we should check for critical errors, check for plagiarism, and then just make sure the work is properly communicated. The rest is far more open to interpretation, and not even us experts[0] are that good at it.
[0] Well my defense is in a week...
[1] https://en.wikipedia.org/wiki/The_Market_for_Lemons
[2] https://en.wikipedia.org/wiki/LK-99
On this tool: I fully expect that it will not capture high-level conceptual peer review, but it could very much serve a role in identifying errors of omission from a manuscript, as a checklist to improve quality (as long as this remains an author-controlled process).
I will be interested to throw in some of my own published papers to see if it catches all the things I know I would have liked to improve in my papers.
We did find a few datasets that offer a starting point:
https://arxiv.org/abs/2212.04972
https://arxiv.org/abs/2211.06651
https://arxiv.org/abs/1804.09635
There’s also interesting potential in comparing preprints to their final published versions to reverse-engineer the kinds of changes peer review typically drives.
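As a toy illustration of that idea, a simple diff between the two versions already surfaces a lot (the file names and the plain-text extraction step below are assumed, not something we've built yet):

    import difflib
    from pathlib import Path

    # Assumes the preprint and the published version have already been
    # converted to plain text (e.g. with any PDF-to-text tool).
    preprint = Path("preprint_v1.txt").read_text().splitlines()
    published = Path("published.txt").read_text().splitlines()

    # A unified diff surfaces the kinds of edits peer review drove:
    # added caveats, reworked methods text, new references, etc.
    for line in difflib.unified_diff(preprint, published, lineterm=""):
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            print(line)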
A growing number of journals and publishers (PLOS, Nature Communications, BMJ, and others) now publish peer review reports openly, which could be valuable as training data.
That said, while this kind of data might help generate feedback to improve publication odds (by surfacing common reviewer demands early), I am not fully convinced it would lead to the best feedback. In our experience, reviewer comments can be inconsistent or even unreasonable, yet authors often comply anyway to get past the gate.
We're also working on a pre-submission screening tool that checks whether a manuscript meets hard requirements like formatting or scope for specific journals and conferences, hoping this will save a lot of time.
Would love to hear your take on what kind of feedback you find useful, what feels like nonsense, and what you would want in an ideal review report... via this questionnaire https://docs.google.com/forms/d/1EhQvw-HdGRqfL01jZaayoaiTWLS...
As a free GitHub project it seems... I don't know. It's not peer review and shouldn't be advertised as such, but as a basic review I guess it's fine. Still, why would someone pay you for a handful of LLM prompts?
If your business can be completely replicated by leaked system prompts, I think you're going to have issues.
The way I see it, it can function, at best, as a text-analysis tool, e.g., as part of augmented analytics engines in a CAQDAS.
1. Agents are defined as having agency, with sentience as an obvious prerequisite.
I was joking, but probably so, to the extent that 80% of peer reviewers are men and 80% of authors of peer reviewed articles are men[0]
0. https://www.jhsgo.org/article/S2589-5141%2820%2930046-3/full...
That's a dream which is unlikely to come true.
One reason being that the training data is not unbiased and it's very hard to make it less biased, let alone unbiased.
The other issue being that the AI companies behind the models are not interested in this. You can see the Grok saga playing out in plain sight, but the competitors are not much better. They patch a few things over, but don't solve it at the root. And they don't have incentives to do that.
The best you can hope for is to provide technical means to point out indicators of bias. But anything beyond that could, at worst, do more harm than good. ("The tool said this result is unbiased now! Keep your skepticism to yourself and let me publish!")
Bias is systematic error.
Maybe your thermometer just always reads 5° high.
Maybe it reads high on sunny days and low on rainy days.
Bias is distinct from random error, say if it's an electronic thermometer with a loose wire.
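A tiny simulation of the distinction (the numbers are just illustrative):

    import random

    true_temp = 20.0
    biased = [true_temp + 5 + random.gauss(0, 0.1) for _ in range(1000)]  # always reads ~5° high
    noisy = [true_temp + random.gauss(0, 5) for _ in range(1000)]         # loose wire: big random error

    # The biased thermometer is precise but systematically off;
    # the noisy one is right on average but unreliable per reading.
    print(sum(biased) / 1000)  # ~25.0
    print(sum(noisy) / 1000)   # ~20.0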
For classification problems, there's also this impossibility result: https://www.marcellodibello.com/algorithmicfairness/handout/...
Source: I've personally been involved in peer reviewing in fields as diverse as computer science, quantum physics and applied animal biology. I've recently left those fields in part because of how terrible some of the real-world practices are.
They were already on life support: the need to "move fast", "there is no time", "we have a 79-file PR with 7k lines changed that we have been working on for 6 weeks. Can you please review it quickly? We wanna demo at tomorrow's GTM meeting." Management found zero value in code reviews. You still can't catch everything, so what's the point? They can't measure the value of such a process.
Now? Now every junior dev is pushing 12 PRs a week, all adding 37 new files and thousands of auto-generated lines, with a ton of patterns and themes that are all over the place, and you expect anyone to keep up?
Just merge it. I have seen people go from:
> “asking who is best to review changes in area X? I have a couple of questions to make sure I’m doing things right”
To
> “this seems to work fine. Can I get a quick review? Trying to push it out and see how it works”
To
> “need 2 required approvals on this PR please?”
If that's how it works at your company then run as fast as you can. There are many reasonable alternatives that won't push this AI-generated BS on you.
That is, if you care. If you don't then please stay where you are so reasonable places don't need to fight in-house pressure to move in that direction.
Regarding code reviews, I can't see a way out, unfortunately. We already have GitHub (and other) agents/features where you write an issue on a repo and kick off an agent to "implement it and send a PR". As it exists today, every repo has 100X more issues, discussions, and comments than it has PRs. Now imagine if the barrier to opening a PR is basically: open an issue + click "Have a go at it, GitHub". Who has the time or bandwidth to review that? That wouldn't make any sense either.
In my view, the peer-review process is flawed. Reviewers have little incentive to engage meaningfully. There’s no financial compensation, and often no way to even get credit for it. It would be cool to have something like a Google Scholar page for reviewers to showcase their contributions and signal expertise.
Judging the actual contents may feel like the holy grail but is unlikely to be taken well by the actual academic research community. At least the part that cares about progressing human knowledge instead of performative paper milling.
Would be great to see contributions from the community!
Fundamentally, each of those 24 agents seems to be just:
"load from pdf > put text into this prompt > Call OpenAI API"
So is it actually just posting 24 different prompts to a generalist AI?
I'm also wondering about the prompts; one I read said "find 3-4 problems per section... find 10-15 problems per paper". What happens when you put in a good paper? Does this force it to find meaningless, nit-picky problems? Have you tried papers which are acknowledged to be well written?
From a programming perspective, the code has a lot of room for improvement.
The big one: if you'd used the same interface for each "agent", you could have had them all self-register and be called in a loop, rather than doing what you've done in this file:
https://github.com/robertjakob/rigorous/blob/main/Agent1_Pee...
TBH, that's a bit of a WTF file. The `def _determine_research_type` method looks like a placeholder you've forgotten about too, as it uses a pretty wonky way to determine the paper type.
Also, you really didn't need specialized classes for each prompt. You could have had the prompts as text files that a single class loads as templates and substitutes text into. As it stands, you're going to have a lot of work whenever you need to change how your prompting works, editing 24 files each time, probably copy/pasting, which is error-prone.
I've done this before: keep the templates in a folder and have the program load them dynamically, so you can add more really easily. The next stage is to add pre-processor directives to your loader that let you put some config at the top of each text file.
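A rough sketch of what I mean (all names here are illustrative, not taken from your repo):

    from pathlib import Path

    class PromptAgent:
        """One generic agent class; the specialization lives in the template file."""
        def __init__(self, name: str, template: str):
            self.name = name
            self.template = template

        def build_prompt(self, paper_text: str) -> str:
            return self.template.replace("{{PAPER}}", paper_text)

    def load_agents(template_dir: str = "prompts") -> list[PromptAgent]:
        # Dropping a new .txt file into prompts/ adds a new "agent"
        # with no code changes at all.
        return [PromptAgent(p.stem, p.read_text())
                for p in sorted(Path(template_dir).glob("*.txt"))]

    # The orchestrator then becomes a single loop:
    # for agent in load_agents():
    #     send_to_llm(agent.build_prompt(paper_text))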
I'm also not looking that hard at the code, but it seems you dump the entire paper into each prompt rather than just the section it needs to review. An easy money saver would be to have an AI chop up the paper and inject only the relevant section, to reduce token costs, although you then run the risk of it chopping things up badly. A dumb splitter would cover a lot of papers; see the sketch below.
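Something like this, where the section names in the regex are just a guess at common headings:

    import re

    SECTION_RE = re.compile(
        r"^\s*(?:\d+\.?\s+)?(Abstract|Introduction|Methods?|Results|Discussion|Conclusions?)\b",
        re.IGNORECASE | re.MULTILINE,
    )

    def split_sections(paper_text: str) -> dict[str, str]:
        """Return {section name: section body}, so each prompt only gets what it needs."""
        matches = list(SECTION_RE.finditer(paper_text))
        sections = {}
        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(paper_text)
            sections[m.group(1).title()] = paper_text[m.end():end].strip()
        return sections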
Finally, and this is a real nitpick, but it's twitch-inducing when reading the prompts: comments in JavaScript are two forward slashes, not a hash.
You're right: In the current version each "agent" essentially loads the whole paper, applies a specialized prompt, and calls the OpenAI API. The specialization lies in how each prompt targets a specific dimension of peer review (e.g., methodological soundness, novelty, citation quality). While it’s not specialization via architecture yet (i.e., different models), it’s prompt-driven specialization, essentially simulating a review committee, where each member is focused on a distinct concern. We’re currently using a long-context, cost-efficient model (GPT-4.1-nano style) for these specialized agents to keep it viable for now. Think of it as an army of reviewers flagging areas for potential improvement.
To synthesize and refine feedback, we also run Quality Control agents (acting like an associate editor), which review all prior outputs from the individual agents to reduce redundancy, surface the most constructive insights, and filter out less relevant feedback.
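In rough Python terms the flow looks like this (the model name and prompt strings below are placeholders, not the exact ones we ship):

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    paper_text = Path("manuscript.txt").read_text()  # plain text extracted from the PDF upstream

    def run_agent(prompt: str, content: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1-nano",  # long-context, cost-efficient tier
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": content},
            ],
        )
        return resp.choices[0].message.content

    # Stage 1: many narrow reviewers, each focused on one concern.
    reviewer_prompts = ["Assess methodological soundness...", "Assess novelty and contribution..."]
    raw_feedback = [run_agent(p, paper_text) for p in reviewer_prompts]

    # Stage 2: a quality-control pass acting like an associate editor,
    # deduplicating and filtering the combined feedback into one report.
    report = run_agent("Synthesize these reviews into constructive feedback:", "\n\n".join(raw_feedback))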
On your point about nitpicking: we've tested the system on several well-regarded, peer-reviewed papers. The output is generally reasonable and we haven't seen "made up" issues yet, but there are occasional instances where feedback is misaligned. We're convinced, however, that we can almost fully eliminate such noise in future iterations (community feedback is super important to achieve this).
On the code side: 100% agree. This is very much an MVP focused on testing potential value to researchers, and the repeated agent classes were helpful for fast iteration. However, your suggestion of switching to template-based prompt loading and dynamic agent registration is great and would improve maintainability and scalability. We'll 100% consider it in the next version.
The _determine_research_type method is indeed a stub. Good catch. Also, lol @ the JS comment hashes, touché.
If you're open to contributing or reviewing, we’d love to collaborate!
The article in question is currently on arXiv, and I'd love to know what you think of it: https://arxiv.org/abs/2504.16621. The latest version of the manuscript (locally, shrunk) is 9.4 MiB (9879403 bytes), so I don't hit the 413 error if you have a 10 MB / MiB limit :-).
As an aside, I know from many others that ChatGPT is already writing a lot of reports for journals, curated by a human but not exclusively so. Is this a good thing for science?
(except your GitHub usernames on the repo, posted only here)
Regardless of how useful this is, it's hard to take it seriously.
We're in very early MVP mode, trying to move fast and see if this works. We pushed a cloud version to support users who don't want to run the GitHub script themselves. That said, you're absolutely encouraged to run it yourself (with your own OpenAI key); the results are identical.
For context: we're two recent ETH Zurich PhD graduates.
Robert Jakob: https://www.linkedin.com/in/robertjakob Kevin O'Sullivan: https://www.linkedin.com/in/kevosull
Going to add contact information immediately.
Thanks again for the feedback — it's exactly what we need at this stage.
(I'm not trying to sound overly critical - I very much like the idea and the premise. I merely wouldn't use this business approach)
Hard disagree. The "in between" is where you want to be, and where most are already ending up. Initially you had everyone so worried about privacy and what OpenAI is doing with their precious private data: "They will train on it. Privacy is important to me. I'm not about to, like, give OpenAI access to my private, secure Google Drive backups or Gmail history or Facebook private messages or any real private, local-only information."
Also, among those who understand data privacy concerns, when it comes to work data, in the span of 2-3 years all the business folks I know went from "this is confidential business information, please never upload it to ChatGPT and only email it to me" to "just put everything on ChatGPT and see what it tells you".
The initial worry was driven by not understanding how LLMs worked. What if it just learned as you talked to it? And what if it used that learning with somebody else? Like, "I told it a childhood secret; will it turn around and tell others my secret?"
People understand how that works now, and some concerns have eased. Basically, most understand that it carries a similar risk to the rest of their existing digital life.
You're telling me companies in Europe aren't putting all their user data on AWS and Azure regions in Europe? Both AWS and Azure are gigantic in Europe.
Was there some level of support beyond this that you were referring to?
Note: The current version uses the OpenAI API, but it should be adaptable to run on local models instead.
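For example, the same OpenAI client can be pointed at any OpenAI-compatible local server; the Ollama endpoint and model name below are just one possible setup, not something the repo ships:

    from openai import OpenAI

    # Ollama, vLLM, llama.cpp's server, etc. expose an OpenAI-compatible API;
    # only the base_url and model name need to change.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "Summarize the main methodological weaknesses of this section: ..."}],
    )
    print(resp.choices[0].message.content)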
We'd be happy to hear what kind of feedback you find useful, what is useless, and what you would want in an ideal review report. (https://docs.google.com/forms/d/1EhQvw-HdGRqfL01jZaayoaiTWLS)