Friday, November 27, 2020

The tool I want

I would like to store ALL the actual text I ingest from the world; it wouldn't be that big. It would be immune to the unreliable nature of the web. I'd trigram-index it as it went in to enable better search. (I read somewhere yesterday that's the way to do it.)
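
Roughly, a trigram index might look like the toy sketch below: every 3-character substring of a document maps to the set of documents containing it, and a search intersects those sets before doing an exact substring check. This is only a minimal in-memory sketch; a real version would live in something like SQLite or Postgres's pg_trgm.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of the lowercased text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

class TrigramIndex:
    """Toy in-memory trigram index: maps each trigram to the set of
    document ids whose text contains it."""

    def __init__(self):
        self.index = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for tri in trigrams(text):
            self.index[tri].add(doc_id)

    def search(self, query):
        """Candidates must share every trigram of the query; a final
        substring check removes false positives. Queries shorter than
        three characters aren't handled here."""
        tris = trigrams(query)
        if not tris:
            return []
        candidates = set.intersection(*(self.index[t] for t in tris))
        return [d for d in candidates if query.lower() in self.docs[d].lower()]
```

Usage is just `idx.add(url, text)` at ingest time and `idx.search("phrase")` later; the intersection of posting lists is what makes substring search cheap.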

I'd also like all the audio and video I hear or see to have a transcript of any spoken words, along with a recording and timestamps. Video is huge, so that would need to be managed a bit.
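
As one possible shape for the transcription step, here's a hedged sketch using the open-source Whisper model to pull timestamped segments out of an audio or video file. The model size and the file paths are placeholders, not a commitment to any particular pipeline.

```python
import json
import whisper  # open-source speech-to-text; pip install openai-whisper

def transcribe_with_timestamps(media_path, out_path):
    """Produce a timestamped transcript for one audio/video file.
    Both paths here are placeholders."""
    model = whisper.load_model("base")  # small model; larger ones are more accurate
    result = model.transcribe(media_path)
    segments = [
        {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
        for s in result["segments"]
    ]
    with open(out_path, "w") as f:
        json.dump({"source": media_path, "segments": segments}, f, indent=2)
    return segments
```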


Given these requirements, I can see a need to start riding along with Moore's law again. Text is well within our capabilities; we can't read that fast, so it should be quite feasible to store it all. The need for more storage and processing power than my laptop has comes into play with audio and video, especially transcription and storage.
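
A quick back-of-envelope makes the point about text (the reading speed and bytes-per-word figures are rough assumptions):

```python
# Rough back-of-envelope: how much text could one person ingest in a year?
WORDS_PER_MINUTE = 250      # fast-ish reading speed
HOURS_PER_DAY = 16          # awake and (implausibly) reading nonstop
BYTES_PER_WORD = 6          # average English word plus a space

words_per_year = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * 365
bytes_per_year = words_per_year * BYTES_PER_WORD

print(f"{words_per_year:,} words/year, about {bytes_per_year / 1e6:.0f} MB/year")
# ~87,600,000 words/year, about 526 MB/year -- trivial next to audio/video
```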


Having stored this content, I want to be able to search it. I want some form of content/sentiment analysis to allow search by concept and association.
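
Search by concept probably means embeddings of some kind. Here's a sketch of the shape of it, with a dummy hashed bag-of-words standing in for a real sentence-embedding model; the document format (a list of dicts with a "text" field) is just an assumption for illustration.

```python
import numpy as np

def embed(text, dim=256):
    """Stand-in embedding: a hashed bag-of-words vector. A real system
    would swap in a proper sentence-embedding model here."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def concept_search(query, documents, top_k=10):
    """Rank stored documents by similarity to the query rather than
    by exact keyword match. `documents` is a list of {"text": ...} dicts."""
    q = embed(query)
    scored = [(cosine(q, embed(doc["text"])), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```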


I also want to be able to rate it, not just on a single dimension, but on an arbitrary number of them. Something can be funny, insightful, literally false and metaphorically true, a bit racist, somewhat political, and in English. Thumbs up/down, or a single 1-10 scale, works well for forcing a rating into a single database field, but not for actual real-world use.


Every single piece of content fits into multiple orthogonal hierarchies; you can't store that information in any single ranking system without information loss.
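
Concretely, that probably means storing ratings as arbitrary dimension-to-score pairs and letting an item sit on a path in more than one hierarchy at once. A sketch, where the dimension and hierarchy names are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """One stored piece of content, rated along arbitrary dimensions and
    placed in multiple independent hierarchies."""
    source_url: str
    text: str
    ratings: dict = field(default_factory=dict)   # dimension -> score
    paths: list = field(default_factory=list)     # one path per hierarchy

item = Item(
    source_url="https://example.com/post",
    text="...",
    ratings={"funny": 0.8, "insightful": 0.6, "literally_true": 0.1,
             "metaphorically_true": 0.9, "political": 0.4},
    paths=[["humor", "satire"], ["politics", "media-criticism"]],
)
```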


As for sharing, it has to be something I pay for, or host myself, with possible federation. Ads corrupt.


---


Implementation - The first step is to simply tap the stream of web traffic I see in the browser and train a classifier to recognize text vs. not-text. It is important to link each piece back to its source.
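
One way the tap could work is a mitmproxy addon sitting between the browser and the network, logging anything that looks like text as JSON lines, with the URL kept for the link back to the source. The looks_like_text() heuristic below is a crude stand-in for the classifier I'd actually train, and the output path is a placeholder.

```python
# tap.py -- sketch of the "tap the browser's traffic" step as a mitmproxy
# addon (run with: mitmdump -s tap.py, with the browser proxied through it).
import json
import time
from mitmproxy import http

LOG_PATH = "ingested.jsonl"  # placeholder output path

def looks_like_text(content_type: str, body: str) -> bool:
    """Cheap stand-in for a real text/not-text classifier: check the
    declared type and the ratio of printable characters in a sample."""
    if not content_type.startswith(("text/", "application/json")):
        return False
    sample = body[:5000]
    if not sample:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in sample)
    return printable / len(sample) > 0.95

def response(flow: http.HTTPFlow) -> None:
    content_type = flow.response.headers.get("content-type", "")
    body = flow.response.get_text(strict=False) or ""
    if looks_like_text(content_type, body):
        record = {
            "url": flow.request.pretty_url,   # link back to the source
            "fetched_at": time.time(),
            "content_type": content_type,
            "text": body,
        }
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(record) + "\n")
```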


Once I have a reliable stream of text, I think the rest starts to align.
