When you build probabilistic models of something (say, natural language grammars), you always fall prey, to some degree or other, to wrong independence assumptions. For example, a model might capture the fact that two events are each very likely to occur, but fail to capture the fact that they are quite unlikely to occur together. Since it’s always nice to have examples from everyday life or popular culture for scientific concepts, I’m quoting the following dialogue from The Big Bang Theory, in which Sheldon quite conspicuously makes a wrong independence assumption:
HOWARD: Someone has to go up with the telescope as a payload specialist, and guess who that someone is.
SHELDON: Muhammad Li.
HOWARD: Who’s Muhammad Li?
SHELDON: Muhammad is the most common first name in the world, Li, the most common surname. As I didn’t know the answer, I thought that gave me a mathematical edge.
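To make Sheldon’s reasoning concrete, here is a minimal Python sketch. The population below is entirely made up for illustration (the names and counts are assumptions, not real statistics); it only shows how multiplying the two marginal probabilities can wildly overestimate the probability of a combination that almost never occurs.

```python
from collections import Counter

# Hypothetical toy population of (first name, surname) pairs.
# All counts are invented for illustration only.
population = (
    [("Muhammad", "Khan")] * 25
    + [("Muhammad", "Ahmed")] * 20
    + [("Wei", "Li")] * 26
    + [("Na", "Li")] * 18
    + [("Muhammad", "Li")] * 1
    + [("Peter", "Müller")] * 10
)

total = len(population)
first_counts = Counter(first for first, _ in population)
last_counts = Counter(last for _, last in population)
pair_counts = Counter(population)

# Sheldon's strategy: pick the most frequent first name and the most
# frequent surname separately, as if the two were independent.
top_first = first_counts.most_common(1)[0][0]
top_last = last_counts.most_common(1)[0][0]

# Probability of that pair under the independence assumption:
# P(first) * P(last).
p_independent = (first_counts[top_first] / total) * (last_counts[top_last] / total)

# Probability of that pair as actually observed in the toy population.
p_joint = pair_counts[(top_first, top_last)] / total

# The pair that is really the most frequent one.
best_pair = pair_counts.most_common(1)[0][0]

print(f"Sheldon's guess: {top_first} {top_last}")
print(f"P under independence: {p_independent:.3f}")
print(f"P observed:           {p_joint:.3f}")
print(f"Most common full name: {best_pair[0]} {best_pair[1]}")
```

In this toy population the independence estimate for “Muhammad Li” is about twenty times higher than the observed frequency, and the best guess for a full name is a different pair altogether.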
As we all know, Sheldon can’t be wrong, so this is clearly intended as a joke by Sheldon that plays on the wrong independence assumption (Sheldon: h..h… [+ingressive]), and he took common names from two different ethnic groups on purpose. If he had taken common names from a more homogeneous group, it would have been quite possible to get a hit (in Germany this would be ‘Peter’ + ‘Müller’, a combination that is found quite often).

Applied to texts, the independence assumption is a bad idea. Word frequencies may indeed be related to each other through frequent sentences (‘He loved her’), but word frequencies change over time and don’t inherit their ‘frequency status’ through usage… but since name-giving is also a changing process and old names disappear, I’m quite puzzled and should eat something now…