Show HN: I built a tool that extracts structured financial data from PDFs

1 igor_strelkov 1 7/2/2025, 3:50:31 PM assess.finance ↗
As a founder working with financial models, I was spending way too much time copy-pasting numbers from PDF balance sheets and income statements into Excel.

So I built Assess Finance, a tool that extracts financial data from any PDF (even scanned ones) and automatically generates clean, standardized:

Income Statements

Balance Sheets

Cash Flow Reports

It’s fast, works with multi-year reports, and exports to Excel/CSV. No AI hype, just time saved.

Would love your feedback. I also wrote a breakdown of how it works under the hood (OCR + financial structure mapping) if anyone’s interested.

Comments (1)

igor_strelkov · 9h ago
Here’s how it works under the hood:

1. PDF Parsing: We detect whether the PDF is native (text-based) or scanned (image-based). Native PDFs are parsed using pdfplumber; scanned files go through Tesseract OCR.

2. Table Extraction: We use heuristics + a fine-tuned model to identify financial tables (not just any table) and extract structured data like Revenue, EBITDA, Net Income, etc., even if labels vary.

3. Standardization Engine: A rule-based mapper matches extracted rows to a standardized chart of accounts (GAAP/IFRS-style), handling multi-year columns and inconsistent formats across companies.

4. Validation Layer: We auto-check for accounting errors (e.g., Assets ≠ Liabilities + Equity), date mismatches, or missing totals. Flagged reports are pushed for manual review or cleanup.

5. Export Formats: Results are exported as standardized Excel/CSV files, ready for financial modeling, BI dashboards, or credit analysis.
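For anyone curious what step 1 might look like in practice, here is a minimal Python sketch of the native-versus-scanned check. It is an illustration, not the actual Assess Finance code: the function names, the 50-character threshold, and the "more than half the pages" rule are all assumptions. It relies on pdfplumber's `extract_text()` and `to_image()` and on pytesseract's `image_to_string()`.

```python
from typing import Optional, Sequence

MIN_CHARS_PER_PAGE = 50  # assumed threshold; tune it on real documents


def looks_scanned(page_texts: Sequence[Optional[str]],
                  min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Treat a PDF as scanned when most pages have almost no text layer."""
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len((t or "").strip()) < min_chars)
    return sparse > len(page_texts) / 2


def extract_pages(path: str) -> list[str]:
    """Return one text string per page, falling back to OCR when needed."""
    # Imported lazily so looks_scanned() works without these packages installed.
    import pdfplumber
    import pytesseract

    with pdfplumber.open(path) as pdf:
        texts = [page.extract_text() or "" for page in pdf.pages]
        if not looks_scanned(texts):
            return texts
        # Scanned file: render each page to an image and run Tesseract on it.
        return [
            pytesseract.image_to_string(page.to_image(resolution=300).original)
            for page in pdf.pages
        ]
```

Keeping the decision in a pure function makes it easy to test the heuristic on its own, without PDFs or an OCR install.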

No LLMs involved yet, just focused, fast, deterministic extraction and mapping logic. But we’re experimenting with retrieval-augmented generation (RAG) for interpreting footnotes.
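To make the deterministic mapping and validation in steps 3 and 4 concrete, here is a rough Python sketch. Everything in it is hypothetical: the alias table, the account names, the function names, and the rounding tolerance are illustrative assumptions rather than the real chart of accounts or rules.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table: standardized account -> known label variants.
ALIASES: Dict[str, List[str]] = {
    "revenue": ["revenue", "revenues", "net sales", "turnover", "total revenue"],
    "net_income": ["net income", "net profit", "profit for the year"],
    "total_assets": ["total assets"],
    "total_liabilities": ["total liabilities"],
    "total_equity": ["total equity", "shareholders equity", "stockholders equity"],
}

# Inverted index for exact lookup on the normalized label.
LOOKUP: Dict[str, str] = {
    alias: account for account, aliases in ALIASES.items() for alias in aliases
}


def normalize(label: str) -> str:
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    cleaned = re.sub(r"[^a-z ]", "", label.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def map_label(label: str) -> Optional[str]:
    """Return the standardized account for a raw row label, or None."""
    return LOOKUP.get(normalize(label))


def standardize(
    rows: Dict[str, List[float]]
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Split raw rows (label -> one value per year) into mapped and unmapped."""
    mapped: Dict[str, List[float]] = {}
    unmapped: List[str] = []
    for label, values in rows.items():
        account = map_label(label)
        if account is None:
            unmapped.append(label)  # candidate for manual review
        else:
            mapped[account] = values
    return mapped, unmapped


def unbalanced_columns(statement: Dict[str, List[float]],
                       tolerance: float = 0.5) -> List[int]:
    """Indices of year columns where Assets != Liabilities + Equity."""
    assets = statement["total_assets"]
    liabilities = statement["total_liabilities"]
    equity = statement["total_equity"]
    return [
        i for i, (a, l, e) in enumerate(zip(assets, liabilities, equity))
        if abs(a - (l + e)) > tolerance
    ]
```

For a two-year balance sheet where the second column is off by 5, `unbalanced_columns` returns `[1]`, and that report would be flagged for review. A real mapper would need fuzzier matching than an exact alias lookup, but the shape of the logic is the same.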

Happy to answer any questions or go deeper on architecture, caching, or product edge cases.