Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To
So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline
THE PROBLEM NOBODY TALKS ABOUT
Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:

- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
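For a sense of what the cleanup looks like in practice, here's a minimal Python sketch of those three fixes. It's illustrative only - the function names are mine, not the pipeline's, and it assumes the raw files carry dates as YYYYMMDD strings - but it captures the rules: strip CNPJ formatting, treat "00000000" as missing, and undo the Latin-1/UTF-8 double encoding.

  import re
  from datetime import date

  def normalize_cnpj(raw: str) -> str:
      # "12.345.678/0001-90" and "12345678000190" both become "12345678000190"
      return re.sub(r"\D", "", raw)

  def parse_founding_date(raw: str) -> date | None:
      # "00000000" is a placeholder, not January 0th of year 0 - treat it as missing
      if not raw or raw == "00000000":
          return None
      return date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))

  def fix_double_encoding(text: str) -> str:
      # Undo UTF-8 bytes that were mis-decoded as Latin-1; leave clean strings alone
      try:
          return text.encode("latin-1").decode("utf-8")
      except (UnicodeEncodeError, UnicodeDecodeError):
          return text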
WHAT YOU CAN NOW DO IN SQL
Find all fintechs founded after 2020 in São Paulo:
  SELECT COUNT(*)
  FROM estabelecimentos e
  JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
  WHERE e.uf = 'SP'
    AND e.cnae_fiscal_principal LIKE '64%'
    AND e.data_inicio_atividade > '2020-01-01'
    AND emp.porte IN ('01', '03');
Result: 8,426 companies (as of Jun 2025)
SURPRISING THINGS I FOUND
1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.
2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.
3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they all turned out to be government entities from before the classification system existed.
4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.
5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.
TECHNICAL BITS
The pipeline:

- Auto-detects your RAM and adapts strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that official files omit
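As a rough illustration of the first two bullets (not the project's actual code - psutil, psycopg2, the thresholds, and the COPY statement here are my own assumptions), the adaptive strategy boils down to picking a chunk size from total RAM and pushing each chunk through COPY instead of row-by-row INSERTs:

  import io
  import psutil
  import psycopg2

  def pick_chunk_rows() -> int:
      # Choose how many CSV rows to buffer per COPY based on total RAM
      gb = psutil.virtual_memory().total / 1024**3
      if gb < 8:
          return 100_000     # low-memory VPS: small streaming chunks
      if gb > 32:
          return 2_000_000   # big server: large chunks (parallelism handled elsewhere)
      return 500_000

  def copy_chunk(conn, csv_lines: list[str]) -> None:
      # Bulk-load one chunk with COPY - far faster than individual INSERTs at this scale
      buf = io.StringIO("".join(csv_lines))
      with conn.cursor() as cur:
          cur.copy_expert(
              "COPY empresas FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
              buf,
          )
      conn.commit()

The key win is COPY's single round trip per chunk; the RAM check only decides how much of the file to buffer at a time.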
Processing 60M companies:

- VPS (4GB RAM): ~8 hours
- Desktop (16GB): ~2 hours
- Server (64GB): ~1 hour
THE CODE
It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline
One command setup:

  docker-compose --profile postgres up --build

Or if you prefer Python:

  python setup.py   # Interactive configuration
  python main.py    # Start processing
WHY OPEN SOURCE THIS?
I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent two months building an ETL pipeline that processed 10% of the data before crashing.
The Brazilian tech ecosystem loses tons of hours reinventing this wheel. That's time that could be spent building actual products.
COMMUNITY RESPONSE
I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."
QUESTIONS FOR HN
1. What other government datasets are this painful? I'm thinking of tackling more.
2. For those who've worked with government data - what's your worst encoding/format horror story?
3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.
The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.