After completely scanning Wikipedia with my original script (a process that took about 4 days), I realized I had made a boo-boo. The script was case-sensitive! This meant that not all the occurrences of some words were being counted. Affect/effect would be counted only if they appeared in the middle or at the end of a sentence. If they started a sentence, and so were capitalized, they were not counted. This in turn meant the statistics were skewed. The search needed to be case-insensitive--counting occurrences of the homophone forms regardless of their case!
Here I hit a wall. Making the search case-insensitive was easy enough. Really, really easy actually. All I did was change
    $plaintext =~ /\b$key\b/g
(where $plaintext is the text of the Wikipedia article being scanned and $key is the homophone form (e.g. affect) being searched for) to
    $plaintext =~ /\b$key\b/gi
The change is tiny. You probably wouldn't notice it if you weren't looking for it. I added an "i" to the end.
The portion between the two forward slashes (/) is a regular expression, an easy method in programming to search for a specific string of text. I use the "g" or global option in both regular expressions. This tells the regular expression to find every occurrence of $key rather than just the first one. I added the "i" or case-insensitive option to make the search case-insensitive. This solved the problem of not all the occurrences of a homophone form being counted, but it introduced a new problem: this one little change made the search way longer--about 81x longer!
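To make the two options concrete, here is a minimal sketch of counting whole-word matches with and without "i". The count_occurrences sub, its $ignore_case flag, and the sample sentence are made up for illustration and are not from the original script; the \Q...\E quoting is also an addition, there only to keep any special characters in $key from being read as regex syntax.

```perl
use strict;
use warnings;

# Count whole-word matches of $key in $text.
# $ignore_case is a hypothetical flag, not part of the original script.
sub count_occurrences {
    my ($text, $key, $ignore_case) = @_;
    # \Q...\E quotes any regex metacharacters in $key.
    my $re = $ignore_case ? qr/\b\Q$key\E\b/i : qr/\b\Q$key\E\b/;
    # In list context, /g returns every match; assigning that list
    # to () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $sample = "Affect is usually a verb. Noise can affect sleep.";
print count_occurrences($sample, 'affect', 0), "\n";  # prints 1 (misses "Affect")
print count_occurrences($sample, 'affect', 1), "\n";  # prints 2
```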
Let me show you how that breaks down for searching 20 articles:
Case-sensitive - 3 sec.
Case-insensitive - 4 min. 3 sec.
Now remember, Wikipedia has about 2 million articles! Multiply the above durations by 100,000 (2 million/20) and you get:
Case-sensitive - 3.5 days
Case-insensitive - 278 days
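The extrapolation is simple enough to sketch in a few lines. The estimate_days sub is a hypothetical helper, and it works from the rounded timings quoted above (3 sec. and 4 min. 3 sec., i.e. 243 sec.), so it lands near the figures in the list rather than exactly on them: roughly 3.5 days and roughly 281 days.

```perl
use strict;
use warnings;

# Scale a timing measured on a small sample up to the full corpus.
# Returns the estimated duration in days. (Hypothetical helper.)
sub estimate_days {
    my ($sample_seconds, $sample_articles, $total_articles) = @_;
    my $total_seconds = $sample_seconds * ($total_articles / $sample_articles);
    return $total_seconds / 86_400;   # 86,400 seconds in a day
}

printf "Case-sensitive:   %.1f days\n", estimate_days(3,   20, 2_000_000);
printf "Case-insensitive: %.1f days\n", estimate_days(243, 20, 2_000_000);
```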
Obviously waiting 278 days is not practical. It would take forever to get this homophone checker programmed.
Fortunately I bulldozed the wall and found a solution. Before passing $plaintext and $key to the regular expression, I applied Perl's lc function, which lowercases a string, to both of them. Since the entire article and homophone form were now lowercase, I just did a case-sensitive search. It still takes a little longer (about 1.3x longer) but much, much less time than a case-insensitive search:
Case-sensitive - 3 sec. for 20 articles/3.5 days estimated for 2 million articles
Case-sensitive with everything lowercased - 4 sec. for 20 articles/4.5 days estimated for 2 million articles
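The lowercase-first approach might look something like the sketch below. The count_lowercased sub and the sample sentence are illustrative assumptions, not the original script; only lc and the case-sensitive /g match come from the description above.

```perl
use strict;
use warnings;

# Lowercase both inputs with lc, then run the plain case-sensitive match.
# (Hypothetical helper; the original script's structure is not shown.)
sub count_lowercased {
    my ($plaintext, $key) = @_;
    my $lc_text = lc $plaintext;
    my $lc_key  = lc $key;
    # No "i" option needed: both strings are already lowercase.
    my $count = () = $lc_text =~ /\b\Q$lc_key\E\b/g;
    return $count;
}

my $sample = "Affect is usually a verb. Noise can affect sleep.";
print count_lowercased($sample, 'affect'), "\n";  # prints 2
```

If many homophone forms are checked against the same article, lowercasing the article once and reusing the result for every form avoids repeating that work; whether the original script did this is not stated.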
We are back on track. In approximately 4.5 days, I should be able to post the frequency results from a complete scan of Wikipedia!