Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To

Last year, I needed to find all software companies in São Paulo for a project. The good news: Brazil publishes all company registrations as open data at dados.gov.br. The bad news: it's 85GB of ISO-8859-1 encoded CSVs with semicolon delimiters, decimal commas, and dates like "00000000" meaning NULL. My laptop crashed after 4 hours trying to import just one file.
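Just reading one of these files correctly looks roughly like this (a hedged sketch, not the pipeline's code - the filename, column index, and header-less layout are assumptions for illustration):

import pandas as pd

reader = pd.read_csv(
    "Empresas0.csv",          # illustrative filename
    sep=";",                  # semicolon delimiters
    encoding="iso-8859-1",    # Latin-1, not UTF-8
    header=None,              # assuming no header row
    dtype=str,                # keep raw text, convert columns afterwards
    chunksize=100_000,        # stream instead of loading 85GB at once
)
for chunk in reader:
    # Decimal commas: "1234,56" -> 1234.56 (column index is illustrative)
    chunk[4] = pd.to_numeric(
        chunk[4].str.replace(",", ".", regex=False), errors="coerce"
    )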

So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline

THE PROBLEM NOBODY TALKS ABOUT

Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:

- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
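Concretely, the fixes for the first, second, and last traps can be as small as this (a sketch with illustrative names, not the repo's actual functions - the future founding dates, it turns out, are legitimate; see below):

import re
from datetime import date

def normalize_cnpj(raw):
    # Strip punctuation so "12.345.678/0001-90" matches "12345678000190".
    return re.sub(r"\D", "", raw)

def parse_registry_date(raw):
    # The registry uses "00000000" as a NULL sentinel, not a real date.
    if raw in ("", "0", "00000000"):
        return None
    try:
        return date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))
    except ValueError:
        return None  # impossible dates also become NULL

def fix_double_encoding(s):
    # Undo UTF-8 bytes that were decoded as Latin-1: "SÃ£o" -> "São".
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # already clean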

WHAT YOU CAN NOW DO IN SQL

Find all fintechs founded after 2020 in São Paulo:

SELECT COUNT(*)
FROM estabelecimentos e
JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
WHERE e.uf = 'SP'
  AND e.cnae_fiscal_principal LIKE '64%'
  AND e.data_inicio_atividade > '2020-01-01'
  AND emp.porte IN ('01', '03');

Result: 8,426 companies (as of Jun 2025)

SURPRISING THINGS I FOUND

1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.

2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.

3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they all turned out to be government entities from before the classification system existed.

4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.

5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.
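You can reproduce that last spike yourself once the data is loaded. A week-by-week count over the estabelecimentos table from the query above is enough (the connection string is illustrative, and narrowing to MEI registrations specifically needs an extra join I'm leaving out):

import pandas as pd

df = pd.read_sql(
    "SELECT data_inicio_atividade FROM estabelecimentos "
    "WHERE data_inicio_atividade BETWEEN '2020-01-01' AND '2020-06-30'",
    "postgresql://localhost/cnpj",  # illustrative connection string
)
weekly = (
    pd.to_datetime(df["data_inicio_atividade"])
    .dt.to_period("W")
    .value_counts()
    .sort_index()
)
print(weekly)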

TECHNICAL BITS

The pipeline:

- Auto-detects your RAM and adapts its strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that the official files omit
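For flavor, the first two bullets boil down to something like this (a sketch assuming psutil and psycopg2; the middle memory tier and all names are mine, not the pipeline's internals):

import os
import psutil
import psycopg2

# Pick a load strategy from available RAM (thresholds from the list above;
# the middle tier is an assumption).
ram_gb = psutil.virtual_memory().total / 1024**3
strategy = "streaming" if ram_gb < 8 else "parallel" if ram_gb > 32 else "chunked"
print(f"{ram_gb:.0f}GB RAM -> {strategy} strategy")

# Bulk-load with COPY: one statement per file instead of one INSERT per row.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur, open("empresas.csv", encoding="iso-8859-1") as f:
    cur.copy_expert(
        "COPY empresas FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
        f,
    )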

Processing 60M companies:

- VPS (4GB RAM): ~8 hours
- Desktop (16GB RAM): ~2 hours
- Server (64GB RAM): ~1 hour

THE CODE

It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline

One-command setup:

docker-compose --profile postgres up --build

Or if you prefer Python:

python setup.py   # Interactive configuration
python main.py    # Start processing

WHY OPEN SOURCE THIS?

I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent two months building an ETL pipeline that crashed after processing 10% of the data.

The Brazilian tech ecosystem loses countless engineering hours reinventing this wheel. That's time that could be spent building actual products.

COMMUNITY RESPONSE

I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."

QUESTIONS FOR HN

1. What other government datasets are this painful? I'm thinking of tackling more.

2. For those who've worked with government data - what's your worst encoding/format horror story?

3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.

The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.
