Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To
So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline
THE PROBLEM NOBODY TALKS ABOUT
Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:

- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
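For a sense of what the cleanup looks like in practice, here's a minimal Python sketch of those three fixes. It's illustrative only - the function names are mine, not the pipeline's, and it assumes the raw files carry dates as YYYYMMDD strings - but it captures the rules: strip CNPJ formatting, treat "00000000" as missing, and undo the Latin-1/UTF-8 double encoding.

  import re
  from datetime import date

  def normalize_cnpj(raw: str) -> str:
      # "12.345.678/0001-90" and "12345678000190" both become "12345678000190"
      return re.sub(r"\D", "", raw)

  def parse_founding_date(raw: str) -> date | None:
      # "00000000" is a placeholder, not January 0th of year 0 - treat it as missing
      if not raw or raw == "00000000":
          return None
      return date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))

  def fix_double_encoding(text: str) -> str:
      # Undo UTF-8 bytes that were mis-decoded as Latin-1; leave clean strings alone
      try:
          return text.encode("latin-1").decode("utf-8")
      except (UnicodeEncodeError, UnicodeDecodeError):
          return text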
WHAT YOU CAN NOW DO IN SQL
Find all fintechs founded after 2020 in São Paulo:
  SELECT COUNT(*)
  FROM estabelecimentos e
  JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
  WHERE e.uf = 'SP'
    AND e.cnae_fiscal_principal LIKE '64%'
    AND e.data_inicio_atividade > '2020-01-01'
    AND emp.porte IN ('01', '03');
Result: 8,426 companies (as of Jun 2025)
SURPRISING THINGS I FOUND
1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.
2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.
3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they all turned out to be government entities from before the classification system existed.
4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.
5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.
TECHNICAL BITS
The pipeline:

- Auto-detects your RAM and adapts strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that official files omit
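As a rough illustration of the first two bullets (not the project's actual code - psutil, psycopg2, the thresholds, and the COPY statement here are my own assumptions), the adaptive strategy boils down to picking a chunk size from total RAM and pushing each chunk through COPY instead of row-by-row INSERTs:

  import io
  import psutil
  import psycopg2

  def pick_chunk_rows() -> int:
      # Choose how many CSV rows to buffer per COPY based on total RAM
      gb = psutil.virtual_memory().total / 1024**3
      if gb < 8:
          return 100_000     # low-memory VPS: small streaming chunks
      if gb > 32:
          return 2_000_000   # big server: large chunks (parallelism handled elsewhere)
      return 500_000

  def copy_chunk(conn, csv_lines: list[str]) -> None:
      # Bulk-load one chunk with COPY - far faster than individual INSERTs at this scale
      buf = io.StringIO("".join(csv_lines))
      with conn.cursor() as cur:
          cur.copy_expert(
              "COPY empresas FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
              buf,
          )
      conn.commit()

The key win is COPY's single round trip per chunk; the RAM check only decides how much of the file to buffer at a time.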
Processing 60M companies:

- VPS (4GB RAM): ~8 hours
- Desktop (16GB): ~2 hours
- Server (64GB): ~1 hour
THE CODE
It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline
One command setup:

  docker-compose --profile postgres up --build

Or if you prefer Python:

  python setup.py   # Interactive configuration
  python main.py    # Start processing
WHY OPEN SOURCE THIS?
I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent two months building an ETL pipeline that processed 10% of the data before crashing.
The Brazilian tech ecosystem loses tons of hours reinventing this wheel. That's time that could be spent building actual products.
COMMUNITY RESPONSE
I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."
QUESTIONS FOR HN
1. What other government datasets are this painful? I'm thinking of tackling more.
2. For those who've worked with government data - what's your worst encoding/format horror story?
3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.
The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.