Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse

2 super_ar 0 5/7/2025, 3:56:33 PM github.com ↗

Hi HN! We are Ashish and Armend, founders of GlassFlow. We just launched our open-source streaming ETL that deduplicates and joins Kafka streams before ingesting them into ClickHouse: https://github.com/glassflow/clickhouse-etl

Why we built this: Deduplicating batch data is straightforward. You load the data into a temporary table, keep only the latest version of each record (identified by hash or key), and then move the clean data into your main table. But have you tried this with streaming data? Users of our previous product were running real-time analytics pipelines from Kafka to ClickHouse and noticed that their analyses were wrong because of duplicates. The source systems produced duplicates because they ingested similar user data from CRMs, shop systems and click streams.
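For reference, the batch approach described above can be sketched in a few lines of Python. This is only an illustration, not GlassFlow code: the field names `id` and `updated_at` and the `dedup_batch` helper are hypothetical, and an in-memory list stands in for the temporary table.

```python
def dedup_batch(records, key="id", version="updated_at"):
    """Keep only the latest version of each record, identified by `key`."""
    latest = {}
    for rec in records:
        k = rec[key]
        # Replace the stored record only if this one is strictly newer.
        if k not in latest or rec[version] > latest[k][version]:
            latest[k] = rec
    return list(latest.values())


# Hypothetical staging data containing two versions of record 1.
staging = [
    {"id": 1, "updated_at": 1, "email": "a@old.example"},
    {"id": 2, "updated_at": 1, "email": "b@example.com"},
    {"id": 1, "updated_at": 2, "email": "a@new.example"},
]
clean = dedup_batch(staging)  # what would be moved into the main table
```

This only works because the whole batch is available at once. With a stream there is no "end of batch", which is what makes the problem harder.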

We wanted to solve this for them with the existing ClickHouse options, but ClickHouse's ReplacingMergeTree deduplicates in a background merge process that you cannot control. The new data is in the system, but you never know when the merge will finish, and until then your queries can return incorrect results.

We looked into using FINAL, but we weren't happy with its speed for real-time workloads.

We tried Flink, but managing Java Flink jobs is too much overhead. A self-built solution would have required us to set up and maintain state storage, possibly a very large one (it grows with the number of unique keys), to track whether we had already seen a record. And if the dedup service fails, you need to rehydrate that state before processing new records. That would have been too much maintenance for us.

We decided to solve it by building a new product and are excited to share it with you.

The key difference is that the streams are deduplicated before they are ingested into ClickHouse. So ClickHouse always has clean data and less load, which removes the risk of wrong results from duplicates. We want more people to benefit from it, so we decided to open-source it (Apache-2.0).

Main components:

- Streaming deduplication: You define the deduplication key and a time window (up to 7 days), and it handles the checks in real time to avoid duplicates before hitting ClickHouse. The state store is built in.

- Temporal Stream Joins: You can join two Kafka streams on the fly with a few config inputs. You set the join key, choose a time window (up to 7 days), and you're good.

- Built-in Kafka source connector: There is no need to build custom consumers or manage polling logic. Just point it at your Kafka cluster, and it auto-subscribes to the topics you define. Payloads are parsed as JSON by default, so you get structured data immediately. Under the hood, we chose NATS to keep it lightweight and low-latency.

- ClickHouse sink: Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.
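To make the first two components more concrete, here is a minimal in-memory sketch of time-windowed deduplication and a temporal join. It is not how GlassFlow implements them: the class names, method names and parameters are all hypothetical, state lives in plain Python dicts rather than a durable store, timestamps are assumed to arrive in non-decreasing order, and the join buffer is never evicted.

```python
from collections import OrderedDict, defaultdict


class WindowedDeduplicator:
    """Drops records whose key was already seen within `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._seen = OrderedDict()  # key -> timestamp of first sighting

    def _evict(self, now):
        # Oldest entries sit at the front, assuming non-decreasing timestamps.
        while self._seen:
            key, ts = next(iter(self._seen.items()))
            if now - ts < self.window:
                break
            self._seen.popitem(last=False)

    def is_new(self, key, now):
        self._evict(now)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True


class TemporalJoiner:
    """Joins two streams on a key when timestamps are within the window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def add(self, side, key, ts, payload):
        other = "right" if side == "left" else "left"
        # Pair the new record with buffered records from the other stream.
        matches = [
            (payload, p) if side == "left" else (p, payload)
            for t, p in self._buffers[other][key]
            if abs(ts - t) <= self.window
        ]
        self._buffers[side][key].append((ts, payload))
        return matches


dedup = WindowedDeduplicator(window_seconds=60)
events = [("e1", 0), ("e1", 10), ("e2", 20), ("e1", 70)]
accepted = [key for key, ts in events if dedup.is_new(key, ts)]

joiner = TemporalJoiner(window_seconds=30)
first = joiner.add("left", "u1", 0, {"order": 1})
second = joiner.add("right", "u1", 10, {"user": "ann"})
third = joiner.add("right", "u1", 100, {"user": "ann-late"})
```

In this run the second `e1` is dropped as a duplicate, while the last `e1` is accepted again because its first sighting has aged out of the 60-second window. The join only emits a pair for the record that arrives within 30 seconds of its counterpart. A real deployment also has to persist and recover this state, which is the maintenance burden described earlier.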

We'd love to hear your feedback, and to know whether you have solved this nicely with existing tools. Thanks for reading!
