I built a neural classifier to replace Plaid's transaction categories
My startup's core product relied on inferring whether a transaction was health-related or not. I quickly realized that layering rules and heuristics on top of Plaid's categories wouldn't work, and that Plaid's categorization was far too inaccurate to base financial rewards on.
Here's an account of what I built to make it work, verified with a cleaned dataset of 6k data points collected from my platform.
First of all, Plaid's baseline categorization accuracy was low:
- Categorization accuracy was 65.22% overall
- Accuracy was better for well-known merchants (Plaid identified an "Entity ID") at 83.99%
I tried RAG to start, but that immediately fell apart due to name collisions and regional duplication.
Thankfully I was able to start with Plaid's already cleaned transaction data. To better resolve entities, my pipeline took in:
- Transaction amount (for product band heuristics)
- Location
- POS method (in-person vs. online)
- A list of known bank-specific formatting quirks that I collected as I tried to build this pipeline (for now limited to the Big Banks™)
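To make the role of these inputs concrete, here is a minimal sketch of how amount, location, and POS method could help pick between entities that share a name. It is an illustration only, not the actual pipeline: the `Transaction` and `KnownEntity` types, their fields, and the scoring weights are all hypothetical, and a hand-written score stands in for whatever the real classifier learned.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class Transaction:
    # Hypothetical input record; field names are assumptions.
    name: str
    amount: float
    lat: float
    lon: float
    in_person: bool


@dataclass
class KnownEntity:
    # Hypothetical entry in a known-entities table.
    entity_id: str
    name: str
    lat: float
    lon: float
    min_amount: float  # lower bound of the typical price band
    max_amount: float  # upper bound of the typical price band


def score(tx: Transaction, entity: KnownEntity) -> float:
    """Higher means the entity is a more plausible match for the transaction."""
    s = 0.0
    # Product band heuristic: the amount falls in the entity's usual range.
    if entity.min_amount <= tx.amount <= entity.max_amount:
        s += 1.0
    # Location only helps for in-person purchases; closer scores higher.
    # Distance is a crude difference in degrees, good enough for a sketch.
    if tx.in_person:
        s += 1.0 / (1.0 + hypot(tx.lat - entity.lat, tx.lon - entity.lon))
    return s


def resolve(tx: Transaction, entities: List[KnownEntity]) -> Optional[KnownEntity]:
    """Pick the best-scoring entity among those sharing the transaction's name."""
    same_name = [e for e in entities if e.name.lower() == tx.name.lower()]
    return max(same_name, key=lambda e: score(tx, e), default=None)
```

For example, two small businesses both named "City Wellness" in different cities would be separated by the location term for an in-person purchase, and by the price band alone for an online one.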
Using that data, I could much better figure out:
- Which entity a purchase was made from among entities with duplicate names (mostly SMBs)
- How to collapse regional identifiers into a single parent organization
  - Side note: did you know that Orangetheory has a different regional identifier for every location? For example, "Orangetheory", "OTF", "otf", "otf {city}", and "orangetheory {city}" are all possible names. This one took so long to solve robustly.
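The Orangetheory case can be sketched as a simple alias lookup. This is a deliberately naive version, assuming a hand-maintained alias table (`PARENT_ALIASES` and `canonical_merchant` are hypothetical names) and matching only on the first token. A robust solution clearly needed more than this, since a short alias like "otf" could collide with an unrelated merchant.

```python
import re
from typing import Optional

# Hypothetical alias table: parent organization -> lowercase leading tokens.
PARENT_ALIASES = {
    "Orangetheory": ("orangetheory", "otf"),
}


def canonical_merchant(raw_name: str) -> Optional[str]:
    """Return the parent organization for a raw merchant string, or None."""
    # Lowercase and replace anything that is not a letter, digit, or space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", raw_name.lower())
    tokens = cleaned.split()
    if not tokens:
        return None
    for parent, aliases in PARENT_ALIASES.items():
        # Matching on the first token lets "otf {city}" collapse to the parent.
        if tokens[0] in aliases:
            return parent
    return None


for name in ["Orangetheory", "OTF", "otf", "otf Austin", "orangetheory Denver"]:
    print(name, "->", canonical_merchant(name))
```

All five variants listed above map to the same parent here, while an unrelated name returns `None` and would fall through to the rest of the pipeline.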
This approach also let me supply a custom category to look for. In my case it was "health-related" or not, which I defined using the FSA/HSA eligibility rules (in JSON format) plus some other properties, such as fitness/studio class merchants and supplements.
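To give a feel for what a JSON category definition might look like, here is a small sketch. The schema, field names, and merchant types are invented for illustration and are not the actual FSA/HSA rule set, and a plain lookup stands in for the classifier that consumed the definition. The "needs more info" label mirrors the tag used for marketplace cases like Amazon.

```python
import json

# Hypothetical category definition; every field and value is illustrative.
CATEGORY_JSON = """
{
  "name": "health-related",
  "eligible_merchant_types": ["pharmacy", "dental", "fitness_studio", "supplements"],
  "marketplaces": ["amazon"]
}
"""


def classify(merchant: str, merchant_type: str, category: dict) -> str:
    """Label a resolved merchant against a custom category definition."""
    if merchant.lower() in category["marketplaces"]:
        # Marketplaces sell both eligible and ineligible items, so the
        # merchant alone cannot decide the label.
        return "needs more info"
    if merchant_type in category["eligible_merchant_types"]:
        return category["name"]
    return "not " + category["name"]


category = json.loads(CATEGORY_JSON)
print(classify("Orangetheory", "fitness_studio", category))
print(classify("Amazon", "marketplace", category))
```

Keeping the definition in data rather than code means a different category could be swapped in without touching the resolution steps.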
The results:
- 87.28% accuracy on classifying "health-related" spend (with a "needs more info" tag for marketplace cases like Amazon)
- 95.78% accuracy on personal finance category classification, with only 300 known entities logged in my database. So this can definitely improve with more effort put into expanding the known entities list
I made this writeup mostly as catharsis after shutting down my startup, and to warn of potential things to look out for when trying to make proper use of transaction data.
But I really do believe that this kind of infra, semantic understanding of financial data, is becoming increasingly valuable as financial data becomes more available, and that new businesses can be built with it. I am considering expanding this infra into a developer API or toolkit. So if you're working on financial rewards, personal finance apps, FSA/HSA/expense platforms, accounting tools, etc., I'd love to hear from you!