| List of Tables | p. xv |
| List of Figures | p. xxi |
| Table of Notations | p. xxv |
| Preface | p. xxix |
| Road Map | p. xxxv |
| Preliminaries | p. 1 |
| Introduction | p. 3 |
| Rationalist and Empiricist Approaches to Language | p. 4 |
| Scientific Content | p. 7 |
| Questions that linguistics should answer | p. 8 |
| Non-categorical phenomena in language | p. 11 |
| Language and cognition as probabilistic phenomena | p. 15 |
| The Ambiguity of Language: Why NLP Is Difficult | p. 17 |
| Dirty Hands | p. 19 |
| Lexical resources | p. 19 |
| Word counts | p. 20 |
| Zipf's laws | p. 23 |
| Collocations | p. 29 |
| Concordances | p. 31 |
| Further Reading | p. 34 |
| Exercises | p. 35 |
| Mathematical Foundations | p. 39 |
| Elementary Probability Theory | p. 40 |
| Probability spaces | p. 40 |
| Conditional probability and independence | p. 42 |
| Bayes' theorem | p. 43 |
| Random variables | p. 45 |
| Expectation and variance | p. 46 |
| Notation | p. 47 |
| Joint and conditional distributions | p. 48 |
| Determining P | p. 48 |
| Standard distributions | p. 50 |
| Bayesian statistics | p. 54 |
| Exercises | p. 59 |
| Essential Information Theory | p. 60 |
| Entropy | p. 61 |
| Joint entropy and conditional entropy | p. 63 |
| Mutual information | p. 66 |
| The noisy channel model | p. 68 |
| Relative entropy or Kullback-Leibler divergence | p. 72 |
| The relation to language: Cross entropy | p. 73 |
| The entropy of English | p. 76 |
| Perplexity | p. 78 |
| Exercises | p. 78 |
| Further Reading | p. 79 |
| Linguistic Essentials | p. 81 |
| Parts of Speech and Morphology | p. 81 |
| Nouns and pronouns | p. 83 |
| Words that accompany nouns: Determiners and adjectives | p. 87 |
| Verbs | p. 88 |
| Other parts of speech | p. 91 |
| Phrase Structure | p. 93 |
| Phrase structure grammars | p. 96 |
| Dependency: Arguments and adjuncts | p. 101 |
| X' theory | p. 106 |
| Phrase structure ambiguity | p. 107 |
| Semantics and Pragmatics | p. 109 |
| Other Areas | p. 112 |
| Further Reading | p. 113 |
| Exercises | p. 114 |
| Corpus-Based Work | p. 117 |
| Getting Set Up | p. 118 |
| Computers | p. 118 |
| Corpora | p. 118 |
| Software | p. 120 |
| Looking at Text | p. 123 |
| Low-level formatting issues | p. 123 |
| Tokenization: What is a word? | p. 124 |
| Morphology | p. 131 |
| Sentences | p. 134 |
| Marked-up Data | p. 136 |
| Markup schemes | p. 137 |
| Grammatical tagging | p. 139 |
| Further Reading | p. 145 |
| Exercises | p. 147 |
| Words | p. 149 |
| Collocations | p. 151 |
| Frequency | p. 153 |
| Mean and Variance | p. 157 |
| Hypothesis Testing | p. 162 |
| The t test | p. 163 |
| Hypothesis testing of differences | p. 166 |
| Pearson's chi-square test | p. 169 |
| Likelihood ratios | p. 172 |
| Mutual Information | p. 178 |
| The Notion of Collocation | p. 183 |
| Further Reading | p. 187 |
| Statistical Inference: n-gram Models over Sparse Data | p. 191 |
| Bins: Forming Equivalence Classes | p. 192 |
| Reliability vs. discrimination | p. 192 |
| n-gram models | p. 192 |
| Building n-gram models | p. 195 |
| Statistical Estimators | p. 196 |
| Maximum Likelihood Estimation (MLE) | p. 197 |
| Laplace's law, Lidstone's law and the Jeffreys-Perks law | p. 202 |
| Held out estimation | p. 205 |
| Cross-validation (deleted estimation) | p. 210 |
| Good-Turing estimation | p. 212 |
| Briefly noted | p. 216 |
| Combining Estimators | p. 217 |
| Simple linear interpolation | p. 218 |
| Katz's backing-off | p. 219 |
| General linear interpolation | p. 220 |
| Briefly noted | p. 222 |
| Language models for Austen | p. 223 |
| Conclusions | p. 224 |
| Further Reading | p. 225 |
| Exercises | p. 225 |
| Word Sense Disambiguation | p. 229 |
| Methodological Preliminaries | p. 232 |
| Supervised and unsupervised learning | p. 232 |
| Pseudowords | p. 233 |
| Upper and lower bounds on performance | p. 233 |
| Supervised Disambiguation | p. 235 |
| Bayesian classification | p. 235 |
| An information-theoretic approach | p. 239 |
| Dictionary-Based Disambiguation | p. 241 |
| Disambiguation based on sense definitions | p. 242 |
| Thesaurus-based disambiguation | p. 244 |
| Disambiguation based on translations in a second-language corpus | p. 247 |
| One sense per discourse, one sense per collocation | p. 249 |
| Unsupervised Disambiguation | p. 252 |
| What Is a Word Sense? | p. 256 |
| Further Reading | p. 260 |
| Exercises | p. 262 |
| Lexical Acquisition | p. 265 |
| Evaluation Measures | p. 267 |
| Verb Subcategorization | p. 271 |
| Attachment Ambiguity | p. 278 |
| Hindle and Rooth (1993) | p. 280 |
| General remarks on PP attachment | p. 284 |
| Selectional Preferences | p. 288 |
| Semantic Similarity | p. 294 |
| Vector space measures | p. 296 |
| Probabilistic measures | p. 303 |
| The Role of Lexical Acquisition in Statistical NLP | p. 308 |
| Further Reading | p. 312 |
| Grammar | p. 315 |
| Markov Models | p. 317 |
| Markov Models | p. 318 |
| Hidden Markov Models | p. 320 |
| Why use HMMs? | p. 322 |
| General form of an HMM | p. 324 |
| The Three Fundamental Questions for HMMs | p. 325 |
| Finding the probability of an observation | p. 326 |
| Finding the best state sequence | p. 331 |
| The third problem: Parameter estimation | p. 333 |
| HMMs: Implementation, Properties, and Variants | p. 336 |
| Implementation | p. 336 |
| Variants | p. 337 |
| Multiple input observations | p. 338 |
| Initialization of parameter values | p. 339 |
| Further Reading | p. 339 |
| Part-of-Speech Tagging | p. 341 |
| The Information Sources in Tagging | p. 343 |
| Markov Model Taggers | p. 345 |
| The probabilistic model | p. 345 |
| The Viterbi algorithm | p. 349 |
| Variations | p. 351 |
| Hidden Markov Model Taggers | p. 356 |
| Applying HMMs to POS tagging | p. 357 |
| The effect of initialization on HMM training | p. 359 |
| Transformation-Based Learning of Tags | p. 361 |
| Transformations | p. 362 |
| The learning algorithm | p. 364 |
| Relation to other models | p. 365 |
| Automata | p. 367 |
| Summary | p. 369 |
| Other Methods, Other Languages | p. 370 |
| Other approaches to tagging | p. 370 |
| Languages other than English | p. 371 |
| Tagging Accuracy and Uses of Taggers | p. 371 |
| Tagging accuracy | p. 371 |
| Applications of tagging | p. 374 |
| Further Reading | p. 377 |
| Exercises | p. 379 |
| Probabilistic Context Free Grammars | p. 381 |
| Some Features of PCFGs | p. 386 |
| Questions for PCFGs | p. 388 |
| The Probability of a String | p. 392 |
| Using inside probabilities | p. 392 |
| Using outside probabilities | p. 394 |
| Finding the most likely parse for a sentence | p. 396 |
| Training a PCFG | p. 398 |
| Problems with the Inside-Outside Algorithm | p. 401 |
| Further Reading | p. 402 |
| Exercises | p. 404 |
| Probabilistic Parsing | p. 407 |
| Some Concepts | p. 408 |
| Parsing for disambiguation | p. 408 |
| Treebanks | p. 412 |
| Parsing models vs. language models | p. 414 |
| Weakening the independence assumptions of PCFGs | p. 416 |
| Tree probabilities and derivational probabilities | p. 421 |
| There's more than one way to do it | p. 423 |
| Phrase structure grammars and dependency grammars | p. 428 |
| Evaluation | p. 431 |
| Equivalent models | p. 437 |
| Building parsers: Search methods | p. 439 |
| Use of the geometric mean | p. 442 |
| Some Approaches | p. 443 |
| Non-lexicalized treebank grammars | p. 443 |
| Lexicalized models using derivational histories | p. 448 |
| Dependency-based models | p. 451 |
| Discussion | p. 454 |
| Further Reading | p. 456 |
| Exercises | p. 458 |
| Applications and Techniques | p. 461 |
| Statistical Alignment and Machine Translation | p. 463 |
| Text Alignment | p. 466 |
| Aligning sentences and paragraphs | p. 467 |
| Length-based methods | p. 471 |
| Offset alignment by signal processing techniques | p. 475 |
| Lexical methods of sentence alignment | p. 478 |
| Summary | p. 484 |
| Exercises | p. 484 |
| Word Alignment | p. 484 |
| Statistical Machine Translation | p. 486 |
| Further Reading | p. 492 |
| Clustering | p. 495 |
| Hierarchical Clustering | p. 500 |
| Single-link and complete-link clustering | p. 503 |
| Group-average agglomerative clustering | p. 507 |
| An application: Improving a language model | p. 509 |
| Top-down clustering | p. 512 |
| Non-Hierarchical Clustering | p. 514 |
| K-means | p. 515 |
| The EM algorithm | p. 518 |
| Further Reading | p. 527 |
| Exercises | p. 528 |
| Topics in Information Retrieval | p. 529 |
| Some Background on Information Retrieval | p. 530 |
| Common design features of IR systems | p. 532 |
| Evaluation measures | p. 534 |
| The probability ranking principle (PRP) | p. 538 |
| The Vector Space Model | p. 539 |
| Vector similarity | p. 540 |
| Term weighting | p. 541 |
| Term Distribution Models | p. 544 |
| The Poisson distribution | p. 545 |
| The two-Poisson model | p. 548 |
| The K mixture | p. 549 |
| Inverse document frequency | p. 551 |
| Residual inverse document frequency | p. 553 |
| Usage of term distribution models | p. 554 |
| Latent Semantic Indexing | p. 554 |
| Least-squares methods | p. 557 |
| Singular Value Decomposition | p. 558 |
| Latent Semantic Indexing in IR | p. 564 |
| Discourse Segmentation | p. 566 |
| TextTiling | p. 567 |
| Further Reading | p. 570 |
| Exercises | p. 573 |
| Text Categorization | p. 575 |
| Decision Trees | p. 578 |
| Maximum Entropy Modeling | p. 589 |
| Generalized iterative scaling | p. 591 |
| Application to text categorization | p. 594 |
| Perceptrons | p. 597 |
| k Nearest Neighbor Classification | p. 604 |
| Further Reading | p. 607 |
| Tiny Statistical Tables | p. 609 |
| Bibliography | p. 611 |
| Index | p. 657 |