text, topic, trouble

full text search = searching a collection of documents, such as websites or online journal articles, based on the words that appear in the documents (as opposed to searching based on categories or subjects assigned to the documents)

The problems with using words within the document itself to determine what this document is about are:

  1. homonyms, heteronyms, polysemes, not to mention figurative language (the plot thickens)
  2. what the document is actually about may not appear as a word in the document (an article about homonyms, heteronyms, and polysemes may not mention the word “linguistics” but is relevant to that topic)
  3. words that appear in the document may have no relevance to the topic of the document (an article that uses the words “homonyms,” “heteronyms,” and “polysemes” may be about neither those things nor linguistics)
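
The three problems above can be seen in a toy keyword search. What follows is only a minimal sketch, not how any real search engine works: the three documents, their names, and the helper functions `tokenize`, `build_index`, and `search` are all invented for illustration, and matching is by exact lowercase word only.

```python
import re
from collections import defaultdict

# Hypothetical three-document collection, made up for this illustration.
DOCS = {
    "grammar_article": "Homonyms, heteronyms, and polysemes are words "
                       "that share a spelling or a sound.",
    "cooking_blog": "Simmer the sauce until the plot thickens, "
                    "as my aunt liked to say.",
    "spelling_bee": "The final round included homonyms and heteronyms, "
                    "but the story is about a school fundraiser.",
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(DOCS)

# Problem 1: the figurative "plot" matches a recipe, not a story.
print(sorted(search(index, "plot")))         # ['cooking_blog']
# Problem 2: the article on homonyms never says "linguistics".
print(sorted(search(index, "linguistics")))  # []
# Problem 3: the fundraiser story matches although it is not about homonyms.
print(sorted(search(index, "homonyms")))     # ['grammar_article', 'spelling_bee']
```

The index only knows which words occur where, so it has no way to tell a topic from a passing mention.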

[Search engines have a variety of ways to compensate for these issues, though maybe not to solve them.]

Moby Dick is about a whale, Othello is about a handkerchief, and about other things. The difficulties are to identify which of the things mentioned refer to relevant topics, and how to deal with topics of the document that are not mentioned explicitly….Parts of the document are not always what the entire document is about, nor is a document usually about the sum of the things it mentions.

—Robert A. Fairthorne, “Content Analysis, Specification and Control” (1969)

In that vein, I’ll find what the books on my shelf of Norton Critical Editions are about.


I used the first non-proper noun I found that was at all related to the overall topic of the book (and judging that relevance is the part a full text search couldn’t do for you!).

  • Hamlet is about “murder” (several of them)
  • Tristram Shandy is about “vexations” (alas)
  • Pride and Prejudice is about an “engagement” (a few, in fact)
  • Emma is about “heart” (in metaphor only)
  • Frankenstein is about “feelings” (so many)
  • Wuthering Heights is about a “devil” (indeed)
  • Jane Eyre is about “friendship” (and more)
  • The Scarlet Letter is about “confinement” (confined by pregnancy and society)
  • The House of the Seven Gables is about “death” (very dark)
  • Bleak House is about “proceedings” (repeated proceedings)
  • Great Expectations is about a “house” (this old house)
  • A Connecticut Yankee in King Arthur’s Court is about “consternation” (plenty to go around)
  • The Sound and the Fury is about “nothing” (signifying)