Taxed on a me

So I’ve got all these posts, and most of the word count is packed into the last few years. It’s really something goofy like a million and a half words in the last three years alone.

Over and over again I run into this problem of not being sure whether I’ve ever written on a specific topic or not.

Between how absolutely all over the fucking place I am on a post-by-post basis (something I’d been improving on until I embarked on this fool “post a day for a month” silliness) and the fact that, even if I weren’t, there’s no really good way to run a topic search, I never stand a chance of finding anything.

Categories and tag clouds fail me utterly because I can’t recall my own taxonomy, much less exhibit the discipline of assigning categories and tags to each post.

But it occurs to me that a piece of code I wrote a VERY long time ago (FAR closer to 30 years ago than not) might hold at least a hint of a solution to the issue.

I’m pretty sure I could kick the shit out of it, what with decades more coding and thinking about this specific kind of problem.

Imagine you’re an insurance company, and companies are trying to figure out if they should use you for their corporate health insurance plan.

What they do (or did) was submit a giant document written in Microsoft Word. That document was really a questionnaire: a giant outline of topics, categories, subcategories, and lists of questions within them.

The insurance company would take this document and hand it to some poor bastard whose job it was to go through the whole thing and look up each question in a giant database of questions with “subject matter expert” answers that had been vetted by the legal department. If they found a match to the question in the database, they’d pull up the answer to the matched question and plug it into the document, then move on.

If there wasn’t a match, then that question would be sent off to a SME, who would double-check it wasn’t just a reworded version of something already in there and then, if necessary, write up an answer, submit it to legal, and, once approved, stuff it in the database for next time and send it back to Said Poor Bastard.

This was just stupid. But it had to be done with every single one. There were teams of these people who’d split these giant documents up (they frequently ran to 200-300 pages or more) and go to work trying to match them.

The naive solution would be to say “well, get a computer to look up those blocks of text.” But that’s fraught with nightmarish problems. There are a thousand ways to say a thing, so the likelihood you’d find a match, even when a perfect one existed, was effectively zero.

So what I did as an experiment was build a piece of software that took each question, eliminated what I called “noise words,” then ran a scoring search of the tuple of resulting words against each of the stock questions in the database. What would come back was a list of questions that contained the greatest number of matching words from the original.
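In modern terms it’s a stopword-filtered bag-of-words match. A rough Python sketch of the idea (the original obviously wasn’t Python; `NOISE_WORDS` and `stock_questions` are my illustrative stand-ins, and the real noise list was surely far longer):

```python
# Hypothetical reconstruction of the matching step, not the original code.
NOISE_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "are",
    "how", "does", "do", "this", "that", "your", "what", "with",
}

def significant_words(text):
    """Lowercase, strip punctuation, drop noise words, return the rest as a set."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return {w for w in cleaned.split() if w not in NOISE_WORDS}

def score_matches(question, stock_questions):
    """Rank stock questions by how many significant words they share
    with the incoming question; highest overlap first."""
    query = significant_words(question)
    scored = [
        (len(query & significant_words(stock)), stock)
        for stock in stock_questions
    ]
    return sorted((s for s in scored if s[0] > 0), reverse=True)
```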

We all know that as a plain old web search now. But that shit didn’t really exist in the mid 90s. Not like that.

But that’s not enough. Let’s say you have an outline and the bottom/deepest node contains a question like “How does this affect your pricing structure?” Well that’s no fucking good. You can’t search for that. But if you take every parent, grandparent, etc., item up the line and just append them all, THEN filter out the noise words and run the search?
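Sketched as a continuation of the code above, with a made-up outline path:

```python
def contextualized_query(outline_path):
    """outline_path: headings from the outline root down to the question,
    e.g. ["Pharmacy Benefits", "Generic Substitution",
          "How does this affect your pricing structure?"].
    Concatenate the whole path, then noise-filter it as one blob."""
    return significant_words(" ".join(outline_path))

# The bare question yields only {"affect", "pricing", "structure"};
# with its ancestors appended you also pick up "pharmacy", "benefits",
# "generic", and "substitution", which is enough to find the right match.
```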

Yeah, you can automatically populate more than 80% of the answers with no human interaction (aside from the verification step to be sure everything matches up cleanly, which I insisted on because I didn’t want the legal team from an insurance company up my fucking ass).

In fact the tests were SO good that I predicted we could indeed reduce headcount by a tremendous margin.

They heard that and bailed out on the project. Weren’t interested. “Nope. We can’t say that.”

Bunch of fucking cowards.

BUT

Take some of the ideas behind that tech. You can replace “removing noise words” with “search for statistically significant digrams and trigrams” and then look for clusters of occurrences of those across the document space.
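A sketch of that swap, using raw recurrence count as a crude stand-in for whatever significance test you’d actually want (PMI, log-likelihood, take your pick):

```python
from collections import Counter

def ngrams(words, n):
    """All consecutive n-word sequences in a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def significant_phrases(docs, n=2, min_count=3):
    """Digrams (n=2) or trigrams (n=3) that recur across the corpus often
    enough to matter. Frequency thresholding is a placeholder for a real
    statistical significance test."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), n))
    return {g for g, c in counts.items() if c >= min_count}
```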

Now what would that give you? The software itself wouldn’t “understand” shit. Software can’t. (Sorry Carmack, no AI for you.) But what it WOULD be able to do is say “Hey, these two documents talk about the same thing. Not only that, but talking about topic X (having a preponderance of commonly related statistically significant phrases) crosses over with talking about topic Y.”

So it would be able to figure out that Document A and Document B were topically related and keep that information indexed.
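Continuing the sketch (and reusing `ngrams` from above): fingerprint each document by which significant phrases it contains, then relate documents by overlap. Jaccard similarity here is my stand-in for “preponderance of commonly related phrases”:

```python
def phrase_fingerprint(doc, phrases, n=2):
    """The set of significant phrases that actually appear in this doc."""
    return {g for g in ngrams(doc.lower().split(), n) if g in phrases}

def topical_similarity(doc_a, doc_b, phrases):
    """Jaccard overlap of the two fingerprints: 0.0 means unrelated,
    1.0 means the same set of significant phrases."""
    a = phrase_fingerprint(doc_a, phrases)
    b = phrase_fingerprint(doc_b, phrases)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```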

How the hell would that help?

At the end of the day, one of the things I’d be able to do is write a paragraph of a few sentences and submit it, looking for other instances of the same topic.

So no more mindless repetition!

(Lol. That one was just for us.)
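Concretely, that “submit a paragraph” step might look like this, with `archive` as a hypothetical title-to-text mapping and the helpers above doing the work:

```python
def have_i_written_this_before(paragraph, archive, phrases, top=5):
    """Treat a fresh paragraph as a query document and rank the whole
    archive by topical similarity to it."""
    ranked = sorted(
        ((topical_similarity(paragraph, text, phrases), title)
         for title, text in archive.items()),
        reverse=True,
    )
    return ranked[:top]
```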

Now, a couple other things:

Because the author is always one of three people (them being me, myself, and I) the style will be remarkably consistent, especially since I tend to codify the same topics over time using the same nomenclature. So such a thing should work particularly well.

Plus, it’s not inconceivable that I could take these pseudo-topics and use them to come up with a category or topic name. There can’t be THAT many of them, and the process would be purely additive, so I could just kinda chip away at it.
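One way to chip at it, hypothetically: surface each pseudo-topic’s most widespread significant phrases as candidate labels and pick by hand (again reusing the helpers above):

```python
def suggest_labels(cluster_docs, phrases, n=2, top=3):
    """Candidate names for a pseudo-topic: the significant phrases that
    appear in the most documents of the cluster."""
    counts = Counter()
    for doc in cluster_docs:
        counts.update(phrase_fingerprint(doc, phrases, n))
    return [" ".join(gram) for gram, _ in counts.most_common(top)]
```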

THAT way I’d be able to tell whether I’d already written about all this before embarking on writing this post about it, possibly again.