06272456 is referenced by 176 patents and cites 12 patents.

A window of letters is identified within a text sample input. If the window contains matches to reference letter sequences (RLS) contained in multiple sets of n-gram language profiles (profiles), the longest match is kept and scored for each language. Each language's score is based on frequency parameters of the matched RLS in that language's profiles. The window is shifted incrementally through the sample, and the matching and scoring are repeated on the letters within the window at each position. At the end of the sample input, the language with the highest cumulative score is identified as the sample's language. Scoring may be improved by: restricting the RLS within longer profiles to full words; using two passes, where the second pass disregards languages that did not score near the highest-scoring language during the first pass; favoring matched RLS within profiles of complete words; favoring longer matched RLS within profiles; and increasing the score of a match that does not appear frequently in many languages. The profiles may be enhanced by removing an RLS whose frequency does not meet a predefined threshold and a variable threshold.
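
The sliding-window matching and cumulative scoring described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the toy `PROFILES` table, the function name `identify_language`, the window size of 3, and the `frequency * length` weighting are all assumptions chosen for illustration. The sketch also assumes that the longest match is chosen separately within each language's profile.

```python
from collections import defaultdict

# Hypothetical toy profiles: language -> {reference letter sequence: frequency}.
# Real profiles would be built from large corpora; these values are made up.
PROFILES = {
    "en": {"th": 0.03, "the": 0.02, "he": 0.025, "and": 0.01, "ing": 0.012},
    "de": {"ch": 0.03, "der": 0.015, "ein": 0.014, "sch": 0.012, "en": 0.03},
}


def identify_language(text, profiles, window=3):
    """Return (best_language, scores) for text, or (None, {}) if nothing matches."""
    scores = defaultdict(float)
    text = text.lower()
    # Shift the window one letter at a time through the sample.
    for start in range(len(text)):
        chunk = text[start:start + window]
        for lang, profile in profiles.items():
            # Try the longest prefix of the window first and keep only
            # the first (longest) match found for this language.
            for length in range(len(chunk), 0, -1):
                freq = profile.get(chunk[:length])
                if freq is not None:
                    # Assumed weighting: frequency scaled by match length,
                    # so that longer matches are favored.
                    scores[lang] += freq * length
                    break
    if not scores:
        return None, {}
    # The language with the highest cumulative score wins.
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    language, totals = identify_language("the thing and the other", PROFILES)
    print(language, totals)
```

The two-pass refinement mentioned in the abstract could be layered on top of this by calling the function once, discarding languages whose score is far below the top score, and calling it again with the reduced profile set. The cutoff for "near the highest score" would be a further assumption.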

Title
System and method for identifying the language of written text having a plurality of different length n-gram profiles
Application Number
9/44752
Publication Number
6272456 (B1)
Application Date
March 19, 1998
Publication Date
August 7, 2001
Inventor
Miguel Cardoso de Campos
Seattle
WA, US
Agent
Kilpatrick Stockton
US
Assignee
Microsoft Corporation
WA, US
IPC
G06F 17/27