Dave Doolin said
1 year, 2 months ago: @robpeck – I’m agreeing with Erik on this one, because Google has historically been reasonably transparent about when they do specifically penalize sites.
There is definitely a certain amount of duplicate content posted which is deceptive and attempts to manipulate search engine rankings.
But not all of it.
Consider, for example, stories distributed by AP. It’s duplicate content, but it may be shown on many websites, depending on the target audience of the site.
My hunch is that deception and manipulation have specific algorithmic signals, while syndication and other forms of duplication have different signals.
On the other hand, consider the mathematics required to find and assess duplicate content. The naive approach checks every web page against every other web page, so the number of comparisons grows with the square of the number of pages. At web scale, the computational cost of that is roughly “Order(heat-death-of-the-universe)”. (The same limitation applies to LSI, except that LSI is a *lot* more expensive to compute.)
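To put a rough number on the naive case, here is a minimal sketch in Python. It only handles exact duplicates, and the function names and sample pages are made up for illustration; it is not a claim about how Google or any other search engine does this. It counts the pairwise comparisons for a given number of pages, then contrasts the all-pairs check with a single pass that groups pages by a content hash.

```python
from __future__ import annotations

import hashlib
from itertools import combinations


def naive_pair_count(n_pages: int) -> int:
    """Number of comparisons when every page is checked against every other."""
    return n_pages * (n_pages - 1) // 2


def naive_duplicates(pages: dict[str, str]) -> set[tuple[str, str]]:
    """All-pairs check: quadratic in the number of pages."""
    return {
        (url_a, url_b)
        for (url_a, text_a), (url_b, text_b) in combinations(sorted(pages.items()), 2)
        if text_a == text_b
    }


def hashed_duplicates(pages: dict[str, str]) -> set[tuple[str, str]]:
    """One pass over the pages: group URLs by a hash of their text.

    Assumption: only byte-for-byte identical text counts as a duplicate.
    Near-duplicates (syndicated stories with different headers, say) need
    something fuzzier than this.
    """
    buckets: dict[str, list[str]] = {}
    for url, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        buckets.setdefault(digest, []).append(url)
    return {
        pair
        for urls in buckets.values()
        for pair in combinations(sorted(urls), 2)
    }


if __name__ == "__main__":
    # Hypothetical sample data.
    pages = {
        "site-a/story": "AP wire story text",
        "site-b/story": "AP wire story text",
        "site-c/post": "original post",
    }
    print(naive_pair_count(len(pages)))
    print(naive_duplicates(pages) == hashed_duplicates(pages))
    print(naive_pair_count(10**9))
```

For a hypothetical billion pages the naive count comes to about 5 × 10^17 comparisons, which is why the all-pairs approach is a non-starter. Grouping by hash gets the exact-duplicate case down to one pass, but it says nothing about intent, which is the part the signals would have to cover.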