Swing the Sickle
Homophone Checker
Matthew
January 28, 2008, 5:59am

Administrator Group
Posts: 171
Remember homophones from...like first grade? The problem is that they can still trip us up today. And your computer's nifty spell checker won't even catch the slip-up, because the wrong spelling is still a word! I had a crazy idea to write a Homophone Checker: a spell checker that will know when you used the wrong spelling of a homophone.

I'm writing it in Perl. It is in the very early stages of development right now. I'm mainly writing Perl scripts to parse large amounts of text, gathering all the statistics I possibly can about properly-used homophones. These statistics will be used by the eventual Homophone Checker in determining the probability that the right homophone spelling was used.

Where better to find large (and I mean LARGE) amounts of text with (hopefully) properly-used homophones than Wikipedia, with its more than 2 million articles? Every few weeks, Wikipedia dumps all the articles into a single file that it makes available for download to programmers who want to analyze the complete Wikipedia text. If you had to guess how big Wikipedia is (minus images and revision history) in gigabytes, how big would you guess?

Uncompressed: 13.9 GB. That would fill 3 DVDs! Thankfully they compress it pretty heavily for downloading, but the compressed file still weighed in at 3.2 GB and took an entire night to download.

I based my list of homophones on this list available free online for academic use and research. The first Perl script I wrote scans all of Wikipedia for these words, counting how many times each one occurs. Then it generates a list showing what percentage of the time a specific spelling of a homophone is used compared to its other spellings.
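The counting and percentage steps described above could look roughly like the sketch below. This is only an illustration, not the actual script: the two homophone groups, the sample sentence, and the helper names count_forms and group_shares are all made up for the example, and the match is case-sensitive.

```perl
use strict;
use warnings;

# Hypothetical homophone groups; the real list is much longer.
my @groups = ( [qw(affect effect)], [qw(their there)] );

# Count whole-word occurrences of every form in a piece of text.
sub count_forms {
    my ($text, $groups) = @_;
    my %count;
    for my $group (@$groups) {
        for my $form (@$group) {
            # A /g match in list context returns every match; assigning
            # that list to () in scalar context yields the match count.
            $count{$form} = () = $text =~ /\b\Q$form\E\b/g;
        }
    }
    return \%count;
}

# Turn raw counts into each form's share of its group, in percent.
sub group_shares {
    my ($count, $groups) = @_;
    my %share;
    for my $group (@$groups) {
        my $total = 0;
        $total += $count->{$_} for @$group;
        next unless $total;    # skip groups that never occurred
        $share{$_} = 100 * $count->{$_} / $total for @$group;
    }
    return \%share;
}

# Made-up sample text standing in for one article.
my $text   = 'The effect was small. It did not affect their plans, and the effect faded there.';
my $counts = count_forms($text, \@groups);
my $shares = group_shares($counts, \@groups);

printf "%-8s %3d  %5.1f%%\n", $_, $counts->{$_}, $shares->{$_}
    for sort keys %$shares;
```

Running the same two steps over every article and summing the counts before computing the shares would give a frequency list of the kind described here.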

I thought you might find the statistics thus far interesting. The list below is just a preliminary one, generated from scanning 6,000 Wikipedia articles. Scanning all 2 million+ articles will probably take my computer about 3 days of processing. Scanning articles that average 435 words/article for 1,543 different homophones is pretty processor-intensive. My script is scanning a little less than 500 articles/minute--a lot faster than a human doing it by hand!

Frequency List (Based on 6,000 Wikipedia Articles)
Joab
January 30, 2008, 1:29am

Baby Member
Posts: 27
Location: Tampa, FL
That's freaking cool and amazing.  I'm most impressed with you're skills.
Daniel
January 31, 2008, 11:22pm
Administrator Group
Posts: 60
What percent of Wikipedia have you scanned?
Matthew
March 24, 2008, 7:06am

Administrator Group
Posts: 171
After completely scanning Wikipedia with my original script (a process that took about 4 days), I realized I had made a boo-boo. The script was case-sensitive! This meant that not all the occurrences of some words were being counted. Affect/effect would be counted only if they appeared in the middle or at the end of a sentence. If they started a sentence, and so were capitalized, they were not counted. This in turn meant the statistics were skewed. The search needed to be case-insensitive--counting occurrences of the homophone forms regardless of their case!

Here I hit a wall. Making the search case-insensitive was easy enough. Really, really easy actually. All I did was change

Code
$plaintext =~ /\b$key\b/g


(where $plaintext is the text of the Wikipedia article being scanned and $key is the homophone form (e.g. affect) being searched for) to

Code
$plaintext =~ /\b$key\b/gi


The change is tiny. You probably wouldn't notice it if you weren't looking for it. I added an "i" to the end.

The portion between the two forward slashes (/) is a regular expression, an easy way in programming to search for a specific string of text. I use the "g" or global option in both regular expressions. This tells the regular expression to find every occurrence of $key rather than just the first one. I added the "i" or case-insensitive option to make the search case-insensitive. This solved the problem of not all the occurrences of a homophone form being counted, but it introduced a new problem: this one little change made the search take way longer--about 81x longer!
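To see what the two options do to the count, here is a small sketch using the same $plaintext and $key names. The sample sentence is made up, and counting matches by collecting them in list context is an assumption for the example; the real script may collect its counts differently.

```perl
use strict;
use warnings;

# Made-up sample text: one capitalized and one lowercase "affect".
my $plaintext = 'Affect is usually a verb. Weather can affect mood; the effect varies.';
my $key       = 'affect';

# With /g in list context the match returns every occurrence,
# so the length of the resulting list is the number of matches.
my @sensitive   = $plaintext =~ /\b$key\b/g;     # misses "Affect"
my @insensitive = $plaintext =~ /\b$key\b/gi;    # counts both forms

print "case-sensitive:   ", scalar(@sensitive),   "\n";
print "case-insensitive: ", scalar(@insensitive), "\n";
```

The case-sensitive match skips the sentence-initial "Affect", which is exactly the undercounting described above.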

Let me show you how that breaks down for searching 20 articles:

Case-sensitive - 3 sec.
Case-insensitive - 4 min. 3 sec.

Now remember Wikipedia has about 2 million articles! Multiply the above durations by (2 million/20) and you get:

Case-sensitive - 3.5 days
Case-insensitive - 278 days
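The scaling above is just multiplication and a unit conversion, sketched below. The estimate_days helper is made up for the example, and it uses the rounded sample timings (3 seconds and 243 seconds), so the case-insensitive result lands within a few days of the figure quoted above rather than exactly on it.

```perl
use strict;
use warnings;

# Scale a timing measured on a small sample up to the full corpus, in days.
sub estimate_days {
    my ($sample_seconds, $sample_articles, $total_articles) = @_;
    my $seconds = $sample_seconds * $total_articles / $sample_articles;
    return $seconds / 86_400;    # 86,400 seconds in a day
}

my $total       = 2_000_000;                         # "about 2 million articles"
my $sensitive   = estimate_days(3,   20, $total);    # 3 sec. per 20 articles
my $insensitive = estimate_days(243, 20, $total);    # 4 min. 3 sec. per 20 articles

printf "case-sensitive:   %.1f days\n", $sensitive;
printf "case-insensitive: %.1f days\n", $insensitive;
printf "slowdown:         %.0fx\n",     243 / 3;
```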

Obviously waiting 278 days is not practical. It would take forever to get this homophone checker programmed.

Fortunately I bulldozed the wall and found a solution. Before passing $plaintext and $key to the regular expression, I applied Perl's lc function, which lowercases a string, to both of them. Since the entire article and homophone form were now lowercase, I just did a case-sensitive search. It still takes a little longer (about 1.3x longer) but much, much less time than a case-insensitive search:

Case-sensitive - 3 sec. for 20 articles/3.5 days estimated for 2 million articles
Case-sensitive with everything lowercased - 4 sec. for 20 articles/4.5 days estimated for 2 million articles
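A minimal sketch of the lowercase-first approach follows. The count_lowercased helper and the sample sentence are made up for the example, and the \Q...\E quoting around the key is an extra precaution that is not in the original one-liner.

```perl
use strict;
use warnings;

# Count whole-word occurrences of $key in $plaintext regardless of case,
# by lowercasing both with lc and then doing a plain case-sensitive match.
sub count_lowercased {
    my ($plaintext, $key) = @_;
    my $lc_text = lc $plaintext;
    my $lc_key  = lc $key;
    # \Q...\E quotes any regex metacharacters in the key.
    my $n = () = $lc_text =~ /\b\Q$lc_key\E\b/g;
    return $n;
}

# Made-up sample text: one capitalized and one lowercase "affect".
my $plaintext = 'Affect is usually a verb. Weather can affect mood; the effect varies.';
print count_lowercased($plaintext, 'affect'), "\n";
```

When scanning for many keys per article, it would probably make sense to lowercase the article once outside the loop over keys rather than once per key as this helper does.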

We are back on track. In approximately 4.5 days, I should be able to post the frequency results from a complete scan of Wikipedia!