05548507 is referenced by 151 patents and cites 9 patents.

Provides a process that identifies the language or genre of a stored or transmitted document. The process uses a plurality of Word Frequency Tables (WFTs), each associated with one of the languages/genres of interest. Each WFT contains relatively few of the most common words of its language or genre. Each word code in a WFT has an associated normalized frequency of occurrence (NFO) value; use of NFOs increases the language/genre detection ability of the process. A plurality of accumulators are respectively associated with the plurality of WFTs. All accumulators are set to zero before identification processing starts. The language/genre identification process receives a sequence of words from an input document and compares each received word to all of the words in all WFTs. Whenever a received word is found in any WFT, the process adds the word's associated NFO to the current total in the accumulator associated with that WFT. In this manner, the totals in the accumulators build up into language-discriminating values after a number of words have been read from the document. Processing stops either when the end of the document is reached or when a predetermined number of words has been received; the language/genre associated with the accumulator containing the largest total is then the identified language/genre.
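
The accumulator scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation. The table contents, the NFO values, the function name identify_language, and the max_words parameter are all hypothetical, chosen only for illustration. The sketch also assumes plain lowercase strings in place of the coded words named in the title, and it resolves ties in favor of the first table listed.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical Word Frequency Tables (WFTs): a few common words per language,
# each with a made-up normalized frequency of occurrence (NFO) value.
WFTS: Dict[str, Dict[str, float]] = {
    "english": {"the": 1.0, "of": 0.5, "and": 0.5, "to": 0.25},
    "french": {"de": 1.0, "la": 0.5, "le": 0.5, "et": 0.25},
    "german": {"der": 1.0, "die": 0.5, "und": 0.5, "in": 0.25},
}


def identify_language(
    words: Iterable[str],
    wfts: Dict[str, Dict[str, float]],
    max_words: Optional[int] = None,
) -> Tuple[str, Dict[str, float]]:
    """Return the language with the largest accumulator total, plus all totals."""
    # One accumulator per WFT, all set to zero before processing starts.
    accumulators = {language: 0.0 for language in wfts}

    for received, word in enumerate(words):
        # Stop once the predetermined number of words has been received.
        if max_words is not None and received >= max_words:
            break
        key = word.lower()  # assumption: case-insensitive matching
        # Compare the received word against every WFT.
        for language, table in wfts.items():
            nfo = table.get(key)
            if nfo is not None:
                # Add the word's NFO to the accumulator for that WFT.
                accumulators[language] += nfo

    # The accumulator with the largest total names the identified language.
    # Assumption: on a tie, max() keeps the first language in table order.
    best = max(accumulators, key=lambda language: accumulators[language])
    return best, accumulators


if __name__ == "__main__":
    sample = "The cat sat in the house and slept".split()
    language, totals = identify_language(sample, WFTS)
    print(language, totals)
```

In the sample sentence, "in" matches only the hypothetical German table, yet the English accumulator still ends up larger because "the" and "and" carry higher NFO values. This is the sense in which the totals become language-discriminating as more words are read.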

Title
Language identification process using coded language words
Application Number
8/212490
Publication Number
5548507
Application Date
March 14, 1994
Publication Date
August 20, 1996
Inventor
Robert C Paulsen Jr, Highland, NY, US
Michael J Martino, Gardiner, NY, US
Agent
Bernard M Goldman
Assignee
International Business Machines Corporation, NY, US
IPC
G06F 17/27