Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To

Last year, I needed to find all software companies in São Paulo for a project. The good news: Brazil publishes all company registrations as open data at dados.gov.br. The bad news: it's 85GB of ISO-8859-1 encoded CSVs with semicolon delimiters, decimal commas, and dates like "00000000" meaning NULL. My laptop crashed after 4 hours trying to import just one file.
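Just reading one of these files correctly looks roughly like this (a hedged sketch, not the pipeline's code - the filename, column index, and header-less layout are assumptions for illustration):

import pandas as pd

reader = pd.read_csv(
    "Empresas0.csv",          # illustrative filename
    sep=";",                  # semicolon delimiters
    encoding="iso-8859-1",    # Latin-1, not UTF-8
    header=None,              # assuming no header row
    dtype=str,                # keep raw text, convert columns afterwards
    chunksize=100_000,        # stream instead of loading 85GB at once
)
for chunk in reader:
    # Decimal commas: "1234,56" -> 1234.56 (column index is illustrative)
    chunk[4] = pd.to_numeric(
        chunk[4].str.replace(",", ".", regex=False), errors="coerce"
    )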

So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline

THE PROBLEM NOBODY TALKS ABOUT

Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:

- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
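Concretely, the fixes for the first, second, and last traps can be as small as this (a sketch with illustrative names, not the repo's actual functions - the future founding dates, it turns out, are legitimate; see below):

import re
from datetime import date

def normalize_cnpj(raw):
    # Strip punctuation so "12.345.678/0001-90" matches "12345678000190".
    return re.sub(r"\D", "", raw)

def parse_registry_date(raw):
    # The registry uses "00000000" as a NULL sentinel, not a real date.
    if raw in ("", "0", "00000000"):
        return None
    try:
        return date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))
    except ValueError:
        return None  # impossible dates also become NULL

def fix_double_encoding(s):
    # Undo UTF-8 bytes that were decoded as Latin-1: "SÃ£o" -> "São".
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # already clean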

WHAT YOU CAN NOW DO IN SQL

Find all fintechs founded after 2020 in São Paulo:

SELECT COUNT(*)
FROM estabelecimentos e
JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
WHERE e.uf = 'SP'
  AND e.cnae_fiscal_principal LIKE '64%'
  AND e.data_inicio_atividade > '2020-01-01'
  AND emp.porte IN ('01', '03');

Result: 8,426 companies (as of Jun 2025)

SURPRISING THINGS I FOUND

1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.

2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.

3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they all turned out to be government entities from before the classification system existed.

4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.

5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.
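You can reproduce that last spike yourself once the data is loaded. A week-by-week count over the estabelecimentos table from the query above is enough (the connection string is illustrative, and narrowing to MEI registrations specifically needs an extra join I'm leaving out):

import pandas as pd

df = pd.read_sql(
    "SELECT data_inicio_atividade FROM estabelecimentos "
    "WHERE data_inicio_atividade BETWEEN '2020-01-01' AND '2020-06-30'",
    "postgresql://localhost/cnpj",  # illustrative connection string
)
weekly = (
    pd.to_datetime(df["data_inicio_atividade"])
    .dt.to_period("W")
    .value_counts()
    .sort_index()
)
print(weekly)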

TECHNICAL BITS

The pipeline:

- Auto-detects your RAM and adapts its strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that the official files omit
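For flavor, the first two bullets boil down to something like this (a sketch assuming psutil and psycopg2; the middle memory tier and all names are mine, not the pipeline's internals):

import os
import psutil
import psycopg2

# Pick a load strategy from available RAM (thresholds from the list above;
# the middle tier is an assumption).
ram_gb = psutil.virtual_memory().total / 1024**3
strategy = "streaming" if ram_gb < 8 else "parallel" if ram_gb > 32 else "chunked"
print(f"{ram_gb:.0f}GB RAM -> {strategy} strategy")

# Bulk-load with COPY: one statement per file instead of one INSERT per row.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur, open("empresas.csv", encoding="iso-8859-1") as f:
    cur.copy_expert(
        "COPY empresas FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
        f,
    )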

Processing 60M companies:

- VPS (4GB RAM): ~8 hours
- Desktop (16GB RAM): ~2 hours
- Server (64GB RAM): ~1 hour

THE CODE

It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline

One-command setup:

docker-compose --profile postgres up --build

Or if you prefer Python:

python setup.py   # Interactive configuration
python main.py    # Start processing

WHY OPEN SOURCE THIS?

I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent two months building an ETL pipeline that crashed after processing 10% of the data.

The Brazilian tech ecosystem loses countless engineering hours reinventing this wheel. That's time that could be spent building actual products.

COMMUNITY RESPONSE

I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."

QUESTIONS FOR HN

1. What other government datasets are this painful? I'm thinking of tackling more.

2. For those who've worked with government data - what's your worst encoding/format horror story?

3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.

The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.
