It is a bit late for a question like this, but I woke up with it this morning: “What about performance?”
“Imagine how a larger audience would affect a plugin like this. Every time you, and hundreds of other people, visit a website, the extension has to start analyzing it.”
Well, this is not a real problem as long as you store the data locally (or via the extension’s storage API, so it is shared between browsers). In that case, storage hands the user the stored data for a given site immediately. If we key the data on the currently visited domain, we save both time and network traffic. The real problems begin when we try to synchronize visits against external APIs, especially if each element contains different kinds of URLs that have to be sent away for analysis. On sites like Facebook a lot of data may be transferred out, since the extension won’t know what the page contains up front.
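To make the “store it locally, keyed on the domain” part concrete, here is a minimal sketch in TypeScript. It assumes the extension uses chrome.storage.local (or chrome.storage.sync, to share the data between browsers); the Verdict shape and the analyzeRemotely() helper are invented for illustration, not part of the actual extension.

```typescript
// A minimal sketch of a per-domain cache on top of chrome.storage.local
// (swap in chrome.storage.sync to share the data between browsers).
// The Verdict shape and analyzeRemotely() are made-up placeholders.

type Verdict = { blacklisted: boolean; checkedAt: number };

// Placeholder for the remote lookup this post is worried about.
async function analyzeRemotely(domain: string): Promise<Verdict> {
  return { blacklisted: false, checkedAt: Date.now() };
}

function getCachedVerdict(domain: string): Promise<Verdict | undefined> {
  return new Promise((resolve) => {
    // The callback form of the storage API works in both Manifest V2 and V3.
    chrome.storage.local.get(domain, (items) => resolve(items[domain]));
  });
}

function cacheVerdict(domain: string, verdict: Verdict): Promise<void> {
  return new Promise((resolve) => {
    chrome.storage.local.set({ [domain]: verdict }, () => resolve());
  });
}

// Only the first visit to a domain costs a network round trip;
// every later visit is answered straight from storage.
async function verdictFor(domain: string): Promise<Verdict> {
  const cached = await getCachedVerdict(domain);
  if (cached) return cached;
  const fresh = await analyzeRemotely(domain);
  await cacheVerdict(domain, fresh);
  return fresh;
}
```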
If we’d like to blacklist a linked URL, we’re not limited to Facebook either: blacklisted URLs must be reachable from every site that goes through analysis. It probably won’t be pretty once the extension grows larger. One solution is to send only the domain name (hashed?), but with a large amount of traffic this could still be an issue.
The idea itself could look like this:
Fetch all elements when the document has loaded.
All hostnames will be hashed – so www.test.com/link/to/somewhere looks like this …
aac5c2fc2cd9b325c8361ff2da0b2c4864e0c948
… where only www.test.com is the part that gets hashed. Reducing all hrefs to their domains cuts down the amount of data being sent. If many of the links on a website point to the same place – or, even better, if all of them do – there will be only one hash to send initially.
Waiting for the document to be loaded comes with a price though: Facebook loads its pages dynamically, so when you scroll downwards there is never a finished document.
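A rough sketch of that idea in a content script could look like the following. SHA-1 is assumed only because the example hash above is 40 hex characters, and the console.log at the end stands in for whatever endpoint would actually receive the hashes.

```typescript
// A sketch of the "hash the hostname only" idea, run once the document has
// loaded. SHA-1 is assumed from the 40-character example hash above; the
// console.log at the end stands in for the real analysis endpoint.

async function sha1Hex(text: string): Promise<string> {
  const bytes = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest("SHA-1", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

document.addEventListener("DOMContentLoaded", async () => {
  // Collect each linked hostname once, no matter how many links share it.
  const hosts = new Set<string>();
  document.querySelectorAll<HTMLAnchorElement>("a[href]").forEach((a) => {
    try {
      const host = new URL(a.href).hostname;
      if (host) hosts.add(host);
    } catch {
      /* ignore hrefs that are not valid URLs */
    }
  });

  // If every link on the page points to the same place, this is one hash.
  const hashes = await Promise.all([...hosts].map((h) => sha1Hex(h)));
  console.log("hashes to send:", hashes);
});
```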
Fetch elements with DOMSubtreeModified.
The first versions of the Chrome extension handled this very well. Since the data was stored locally, fetching and analyzing elements was instant; there were only a few short words (nyatider, friatider, etc.) to look for. But sending data out over the network in this mode will also happen instantly: the elements don’t arrive in bulk, so each one triggers its own request and the data stream drags on. With lots of usage this could of course become a problem. Hashing the hosts helps, but it doesn’t get rid of the stream itself; the payloads will be smaller, yet the stream will still be there.
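Since DOMSubtreeModified has been deprecated, the sketch below uses MutationObserver, its modern replacement, and adds a simple batching loop to thin out that stream. The two-second interval and the sendHashes() stub are assumptions made for illustration, not how the extension actually works.

```typescript
// DOMSubtreeModified is a deprecated mutation event; MutationObserver is its
// modern replacement and is what this sketch uses. Batching every two seconds
// and the sendHashes() stub are illustrative assumptions.

const pendingHosts = new Set<string>();

// Placeholder for hashing (as in the earlier sketch) plus the real upload.
async function sendHashes(hosts: string[]): Promise<void> {
  console.log("would hash and send", hosts.length, "hosts");
}

function collectHosts(root: Element): void {
  const anchors = root.matches("a[href]")
    ? [root as HTMLAnchorElement]
    : Array.from(root.querySelectorAll<HTMLAnchorElement>("a[href]"));
  for (const a of anchors) {
    try {
      const host = new URL(a.href).hostname;
      if (host) pendingHosts.add(host);
    } catch {
      /* ignore unparsable hrefs */
    }
  }
}

// Watch for nodes that infinite scrolling keeps adding to the page.
const observer = new MutationObserver((mutations) => {
  for (const m of mutations) {
    m.addedNodes.forEach((node) => {
      if (node instanceof Element) collectHosts(node);
    });
  }
});
observer.observe(document.body, { childList: true, subtree: true });

// Flush in batches instead of firing one request per mutation, so an
// endless page produces fewer, larger requests; the stream is still
// there, just thinner.
setInterval(() => {
  if (pendingHosts.size === 0) return;
  const batch = [...pendingHosts];
  pendingHosts.clear();
  void sendHashes(batch);
}, 2000);
```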
The next problem to handle here is location. Imagine usage from Japan, when/if all the analysis servers are located in a different part of the world: there will be delays. And downtime isn’t even considered here…