Natural Language Technology(*)

Lillian Lee
Cornell University
http://www.cs.cornell.edu/home/llee

(*) Some of this material comes from a joint tutorial, co-organized with John Lafferty, at the Sixteenth National Conference on Artificial Intelligence, 1999.

Outline
I. Overview of the Field
II. The Statistical Revolution
III. Language as a Statistical Source
IV. Tools of the Trade
V. The Sparse Data Problem
VI. Conclusions and References

I. Overview of the Field

Natural Language Processing (NLP)
Goal: computers using natural language as input and/or output.
• language understanding (NLU); example: convert an utterance into a sequence of computer instructions.
• language generation (NLG); example: produce a summary of a patient’s records.

Why NLP?
Lots of information is in natural language format:
• Documents
• News broadcasts
• User utterances
Lots of users want to communicate in natural language:
• “Do what I mean!”

NLP is Useful
“Now we’re betting the company on these natural interface technologies” – Bill Gates, 1997

Task                    Input                            Output
user interfaces         command in natural language      computer instructions
information retrieval   query                            relevant documents
question answering      query                            answer to query
machine translation     signal in language 1             signal in language 2
summarization           “document(s)” (CNN broadcasts)   summary

NLP is Cross-Disciplinary
Excellent opportunities for interdisciplinary work.
• Linguistics: models of language; emphasizes 100% accuracy (competence)
• Psychology: models of cognitive processes; emphasizes biological/cognitive plausibility
• Mathematics and statistics: properties of models; emphasizes formal aspects
On the whole, NLP tends to be applications-oriented: 95% is OK; models need be neither biologically plausible nor mathematically satisfying.

NLP is Challenging
It is often said that NLP is “AI-complete”: all the difficult problems in artificial intelligence manifest themselves in NLP problems.
This idea dates back at least to the Turing Test: “The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include” [Turing, “Computing Machinery and Intelligence”, 1950].

Why is NLP hard?
• “Doesn’t Microsoft do that already?”
• Ad from the 70’s or 80’s (source: S. Shieber): the problem has already been solved ...
  “At last, a computer that understands you like your mother”

Ambiguity
“At last, a computer that understands you like your mother”
What can we infer about the computer?
1. (*) It understands you as well as your mother understands you
2. It understands (that) you like your mother
3. It understands you as well as it understands your mother
1 and 3: Does this mean well, or poorly?

Ambiguity at Many Levels (I)
At the acoustic level (speech recognition):
1. “... a computer that understands you like your mother”
2. “... a computer that understands your lie cured mother”

Ambiguity at Many Levels (II)
At the morphological (word-form) level: “... a computer that understands you like ...”
understands = understand + s
            = under + stands (although derived historically from this)
            = un + derstands
In practice, storing root forms reduces database size.
The morphological analysis problem is especially difficult in new domains: unionized = ?

Ambiguity at Many Levels (III)
At the syntactic (structural) level: different structures lead to different interpretations.
[Parse trees for two readings: “... understands [that] you like your mother” vs. “... understands you like your mother [does]”.]
Ambiguity at Many Levels (IV)
Ellipsis: missing (elided) syntactic material.
“... a computer that understands you like ...” = ?
  ... understands you like [it understands] your mother ...
  ... understands you like your mother [understands you] ...
In fact, the identity of the elided syntactic material is ambiguous.

Ambiguity at Many Levels (V)
At the semantic (meaning) level:
mother = ? (OED)
  A female parent
  A cask or vat used in vinegar-making
This is an instance of word sense ambiguity. A more typical example: “They put money in the bank”.

Ambiguity at Many Levels (VI)
At the discourse (multiple-clause) level:
1. Alice says they’ve built a computer that understands you like your mother.
2. But she ...
  2a. ... doesn’t know any details.
  2b. ... doesn’t understand me at all.
This is an instance of anaphora, where “she” co-refers to some other discourse entity.

What Will It Take?
The task seems so difficult! What resources do we need?
1. Knowledge about language
2. Knowledge about the world
Two veins of work to combat the knowledge acquisition bottleneck:
• handcrafted and expert-driven
• automated and data-driven
It often helps to restrict the domain.

Success Stories
Not an exclusive list!
• The TAUM-METEO system [Chandioux 76]: essentially perfect French-English translation of weather reports.
• JUPITER [MIT Spoken Language Systems group (1-888-573-TALK)]: conversational system for weather information for novices.
• Information Extraction: systems exist for analyzing and summarizing reports of joint business ventures [Message Understanding Conferences (MUC) 1994]; ∼80% “understanding” rate.
Note – restricted domains.

General NLP References
Not an exclusive list!
• Jurafsky and Martin, 2000. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition.
• Allen, 1995. Natural Language Understanding, 2nd edition.
• comp.ai.nat-lang FAQ (Frequently Asked Questions), http://www.cs.columbia.edu/∼radev/nlpfaq.txt
• The Association for Computational Linguistics (ACL) Universe, http://www.cs.columbia.edu/∼radev/u/db/acl/

II. The Statistical Revolution

What is Statistical NLP?
Goal: infer language properties from (annotated?) (text?) samples.
• Helps ease the knowledge acquisition bottleneck.
Draws on probability, statistics, information theory, machine learning.
Two threads (often intertwined; not everyone distinguishes!):
• statistical models — language assumed generated by a statistical source
• statistical methods — no assumption on language source; sample statistics used to make decisions

Non-statistical Models Example: CFG’s
Context-Free Grammars (CFG’s): strings generated by choosing some sequence of rewriting rules.
  S → NP VP    NP → the tutorial    VP → rocked    VP → bombed
Example derivation: S −→ NP VP −→ the tut. VP −→ the tut. rocked

Statistical Models Example: PCFG’s
Probabilistic Context-Free Grammars (PCFG’s): strings generated by randomly picking rules according to their probabilities.
  S → NP VP (1.0)    NP → the tutorial (1.0)    VP → rocked (.75)    VP → bombed (.25)
P(“the tut. rocked”) = P(S→NP VP) × P(NP→the tut.) × P(VP→rocked) = 1 × 1 × .75 = .75

Statistical Methods Example: WSD
Word sense disambiguation (WSD): find correct word sense from context.
“They put money in the bank” (savings bank? river bank?)
A statistical solution [Lesk 86]: estimate the likelihood of each sense of “bank” co-occurring with “money”, using entries in a machine-readable dictionary.
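To make the dictionary-overlap idea concrete, here is a minimal sketch of Lesk-style disambiguation in Python. The sense glosses and context below are invented placeholders, not the actual machine-readable dictionary entries used in [Lesk 86].

```python
# A minimal sketch of the dictionary-overlap idea behind Lesk-style WSD.
# The glosses and context below are illustrative placeholders only.

def lesk_score(context_words, gloss):
    """Count how many context words appear in a sense's dictionary gloss."""
    gloss_words = set(gloss.lower().split())
    return sum(1 for w in context_words if w.lower() in gloss_words)

def disambiguate(context_words, sense_glosses):
    """Pick the sense whose gloss overlaps most with the context."""
    return max(sense_glosses,
               key=lambda s: lesk_score(context_words, sense_glosses[s]))

senses = {
    "bank/financial": "an institution for receiving, lending, and safeguarding money",
    "bank/river": "sloping land beside a body of water such as a river",
}
context = ["they", "put", "money", "in", "the"]
print(disambiguate(context, senses))   # -> bank/financial
```

The sense whose gloss shares the most words with the surrounding context wins; a real implementation would use full dictionary definitions and handle ties, stemming, and stopwords.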
But statistical approaches were (are) not universally accepted ...

Why Statistical NLP?
• Statistical models allow degrees of uncertainty (not just “grammatical/ungrammatical”):
  confidence can be assessed (helps combine knowledge sources)
  models can be iteratively trained/updated
• Statistical methods reduce the knowledge acquisition bottleneck: transfer to new domains is easier.

A Brief History
The 40’s and 50’s: statistical NLP popular
• Harris, Firth: empirical linguistics (“You shall know a word by the company it keeps” [Firth 57])
• Shannon, Weaver: cryptographic notions, the noisy channel model (language 1 → corruption process → language 2)

A Brief History (cont.)
Late 50’s–80’s: statistical NLP in disfavor
“It is fair to assume that neither sentence (1) Colorless green ideas sleep furiously nor (2) Furiously sleep ideas green colorless ... has ever occurred .... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally “remote” from English. Yet (1), though nonsensical, is grammatical, while (2) is not.” [Chomsky 1957]

A Brief History (cont.)
The 80’s – present: statistical NLP once again mainstream
• revived by IBM: influenced by speech recognition
• confluence with interest in machine learning
• nowadays, “no one can profess to be a computational linguist without a passing knowledge of statistical methods .... anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL [conference] banquet.” [Abney 97]

Statistics on Statistical NLP
From Julia Hirschberg’s AAAI-98 invited talk (ACL is the main conference of the Association for Computational Linguistics):

Source      Percentage of statistically-based papers
ACL 1990    12.8%
ACL 1998    63.5%

1983 was the last year in which there were no such papers.

The “Opposite” of Statistical NLP?
Some draw contrasts with knowledge-based methods, higher-level processes, linguistics...
• Chomsky
• “I don’t believe in this statistics stuff”
• “that’s not learning, that’s statistics”
• “AI-NLP ... is going nowhere fast”
• “Every time I fire a linguist, my performance goes up”

Statistics Complements Other Approaches
• Knowledge-based models can be converted to stochastic versions: CFG’s → PCFG’s; statistical semantics, discourse models [Miller 96]
• Statistical methods can make use of knowledge bases (don’t confuse methods and models): WSD using dictionaries

III. Language as a Statistical Source

A Recent Anniversary
C.E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, July 1948.
Famous First Words
“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.”

Generative Models
A useful conceptual and practical device: coin-flipping models
• A string is generated by a randomized algorithm:
  The generator can be in one of several “states”
  A coin (or a bunch of coins) is flipped to choose the next state
  Another coin is flipped to decide which letter or word to output
• Shannon: “The states will correspond to the “residue of influence” from preceding letters”

Coin-Flipping Models
[Figure: two example state diagrams with per-state letter-output probabilities, together with sample strings they generate, e.g. AAACDCBDCEAADADACEDAEADCABEDADDCECAA and ABBABABABABABABBBABBBBBABABABABABBAC.]

The Soul of a New Machine
When designing a new statistical model for an NLP task, it is often very helpful to simulate it in your mind.

Markov Approximations to English
From Shannon’s original paper:
1. Zero-order approximation:
   XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation:
   OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation:
   ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Markov Approximations (cont.)
From Shannon’s original paper:
4. Third-order approximation:
   IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTABIN IS REGOACTIONA OF CRE
Markov random field with 1000 “features,” no underlying “machine” (Della Pietra et al., 1997):
   WAS REASER IN THERE TO WILL WAS BY HOMES THING BE RELOVERATED THER WHICH CONISTS AT FORES ANDITING WITH PROVERAL THE CHESTRAING FOR HAVE TO INTRALLY OF QUT DIVERAL THIS OFFECT INATEVER THIFER CONSTRANDED STATER VILL MENTTERING AND OF IN VERATE OF TO

Word-Based Approximations
1. First-order approximation:
   REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE T
2. Second-order approximation:
   THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Shannon’s comment: “It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.”

Estimating Redundancy
• Redundancy helps us communicate: TH R S NLY N W Y T F LL N TH V W LS N TH S S NT NC
• Coin-flipping models can be used to estimate the redundancy or entropy of English:
  First order model: 4.03 bits/letter
  Fourth order model: 2.8 bits/letter
  Trigram word model: 1.72 bits/letter

The Source-Channel Model
[Diagram: Source (Language Model, p(E)) → E → Channel Model (p(Y | E)) → Y → Decoder → E*]
E* = arg max_E p(E | Y) = arg max_E p(E) p(Y | E)

Source-Channel Examples

Application           Observation Y
speech recognition    acoustic signal
French translation    French text
spelling correction   typed text
OCR                   processed image

In each case, the source E is text with language model p(E), and the channel model is p(Y | E).
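To make the arg max concrete, here is a minimal sketch of source-channel decoding over a tiny candidate set. The candidate sentences and all probability values are invented for illustration; a real system would use trained language and channel models.

```python
# A minimal sketch of source-channel (noisy channel) decoding:
# choose E* = argmax_E p(E) * p(Y | E) over a small candidate set.
# The candidates and probability values below are invented for illustration only.

def decode(y, candidates, p_source, p_channel):
    """Return the source sentence E maximizing p(E) * p(Y = y | E)."""
    return max(candidates, key=lambda e: p_source[e] * p_channel[(y, e)])

candidates = ["it's hard to recognize speech",
              "it's hard to wreck a nice beach"]

# Toy language model: the fluent reading is far more probable a priori.
p_source = {"it's hard to recognize speech": 1e-5,
            "it's hard to wreck a nice beach": 1e-8}

# Toy channel model: both candidates explain the acoustic signal about equally well.
y = "acoustic-signal-placeholder"
p_channel = {(y, e): 0.5 for e in candidates}

print(decode(y, candidates, p_source, p_channel))
# -> "it's hard to recognize speech"
```

When the channel model cannot distinguish the candidates, the language model alone picks the fluent reading, which is exactly the division of labor the source-channel decomposition is meant to exploit.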
IV. Tools of the Trade
• PCFG’s
• HMM’s
• EM
• Special cases

Predicting String Probabilities
Which is more likely? (both are grammatical)
“The tutorial was a roaring success” vs. “The tutorial was a boring address”
  P(“ ... roaring success”) = .003
  P(“ ... boring address”) = .000001
Classic applications: speech, handwriting, and optical character recognition
Standard models: PCFG’s, n-grams/HMM’s

Language Modeling
Language model: method for assigning probabilities to strings; want to approximate source probabilities.

PCFG’s
PCFG’s give both parse structures and probability.

Rule                 Prob.
S → NP VP            (1.0)
NP → the tutorial    (1.0)
VP → rocked          (.75)
VP → bombed          (.25)

P(“the tut. rocked”) = P(S→NP VP) · P(NP→the tut.) · P(VP→rocked) = 1 × 1 × .75 = .75

PCFG Facts
• Probabilities of all rules with the same lefthand side must sum to one.
• Ambiguous PCFG’s: multiple parse trees for the same sentence.
  Example: S → S S (.1), S → police (.9) generates “Police, Police police, Police police police, ...”, and the longer strings have multiple parse trees.

PCFG Language Modeling
For sentence w1 · · · wn and PCFG G,
  P_G(w1 · · · wn) = Σ_{parse trees π for w1 · · · wn} P(π)
How can we compute this efficiently? Dynamic programming!

Computing Sentence Probabilities
Inside probability In(A, i, j): probability that A generates the string wi · · · wj
Outside probability Out(A, i, j): probability that S generates w1 · · · wi−1 A wj+1 · · · wn
Combining inside probs bottom-up yields In(S, 1, n), the probability of the sentence.

Training PCFG’s
Where do we get the rule probabilities?
Inside-Outside algorithm [Baker 79]: for a large training corpus T,
• iterative re-estimation using inside and outside probabilities
• each step increases T’s likelihood according to the PCFG
Note: there are algorithms for taking advantage of structural annotation [Pereira and Schabes 92].

PCFG’s: Overall Effectiveness
PCFG’s: linguistically intuitive, provide parse structures.
If you don’t need parses, simpler models have (so far) been more effective at estimating sentence probabilities. But cf. [Chelba and Jelinek 98].

Restricting Generative Capacity
CFG’s: top-down generation. Example: S→aSb; S→ab. “Matching” a’s and b’s are not necessarily adjacent.
What if we limit modeling capacity to local correlations?

HMM’s
Hidden Markov Models (HMM’s): states, state transitions, outputs.
[Diagram: a state that outputs “the tut.” (1.0) and transitions with probability 1.0 to a state that outputs “rocked” (.75) or “bombed” (.25).]
P(“the tut. rocked”) = 1 × 1 × .75 = .75

HMM Facts
• The transition probs out of a state must sum to one. Same for output probs. (cf. PCFG rule probs)
• The same string may be generated via different paths (cf. ambiguous PCFG’s)
• HMM’s cannot simulate all PCFG’s; ex: (.1) S→aSb; (.9) S→ab

HMM Language Modeling
For sentence w1 · · · wn and HMM h (arbitrary states),
  P_h(w1 · · · wn) = Σ_{paths p for w1 · · · wn} P(w1 · · · wn | p) P(p)
(cf. PCFG language modeling)
How can we compute this efficiently? Dynamic programming!

Computing Sentence Probabilities (cf. Inside-Outside)
Forward prob For(s, i): prob of generating w1 · · · wi, ending at state s
Backward prob Back(s, i): prob of generating wi+1 · · · wn, starting at s
Combining forward probs left-to-right yields For(start, n), the prob of the sentence.
Alternatively, combining backward probs right-to-left yields Back(start, 0).

Training HMM’s
Where do we get transition/output probs?
Forward-Backward, or Baum-Welch: for a large training corpus T,
• iterative re-estimation using forward and backward probabilities
• each step increases T’s likelihood according to the HMM
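Here is a minimal sketch of the forward computation For(s, i) described above, summing over paths to obtain P_h(w1 · · · wn). The two-state HMM is a toy encoding of the “the tut. rocked” example; the state names are invented, and a real implementation would work in log space.

```python
# A minimal sketch of the forward algorithm: fwd[s] accumulates the probability
# of the prefix seen so far, ending in state s.  The toy two-state HMM below
# loosely encodes the "the tut. rocked" example; state names are invented.

def forward_prob(words, states, start, trans, emit):
    """Return P(words) under the HMM by summing forward probabilities."""
    fwd = {s: start.get(s, 0.0) * emit[s].get(words[0], 0.0) for s in states}
    for w in words[1:]:
        fwd = {s: sum(fwd[r] * trans[r].get(s, 0.0) for r in states)
                  * emit[s].get(w, 0.0)
               for s in states}
    return sum(fwd.values())

states = ["NOUNISH", "VERBISH"]                      # hypothetical state names
start  = {"NOUNISH": 1.0}
trans  = {"NOUNISH": {"VERBISH": 1.0}, "VERBISH": {"NOUNISH": 1.0}}
emit   = {"NOUNISH": {"the tut.": 1.0},
          "VERBISH": {"rocked": 0.75, "bombed": 0.25}}

print(forward_prob(["the tut.", "rocked"], states, start, trans, emit))  # 0.75
```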
Training Commonalities
Inside-Outside (PCFG’s) and Forward-Backward (HMM’s) look very similar.
In general, training a probabilistic model: find parameter settings maximizing (locally) the training data likelihood.
  θ : model parameters (e.g., transition probs)
  T : training data
Goal: find θ* = arg max_θ P_θ(T)
Settle for finding a local stationary point via hillclimbing.

EM
The EM algorithm [Dempster, Laird, Rubin 77]: used when it is difficult to calculate the likelihood directly.
Ex: Inside-Outside, Forward-Backward

EM (cont.)
Use auxiliary variable Y, dependent on θ. If
  Σ_y P_{θi}(y | T) log P_{θi+1}(y, T) > Σ_y P_{θi}(y | T) log P_{θi}(y, T)      (1)
then
  P_{θi+1}(T) > P_{θi}(T)    (training likelihood increased!)

EM Algorithm
Iterative process:
• Expectation: calculate E(log P_θ(Y, T)), with respect to P_{θi}(y | T), as a function of θ
• Maximization: find θ_{i+1} maximizing this function of θ
The trick is to find an auxiliary variable Y making these computations easy (ex: for HMM’s, the paths).

Special Cases
For simpler versions of HMM’s (nothing hidden), EM is not necessary:
• Part-of-speech HMM’s
• n-gram models

HMM POS Model
[Diagram: states correspond to parts of speech (DET, ADJ, N, V, ...); each state outputs words, e.g. DET outputs “the” (.1), “those” (.02).]
Training: state semantics known – use POS-tagged training data, not EM
State-of-the-art for POS tagging (∼95% accuracy): “reverse” the HMM to find the most likely tag (state) sequence

N-gram Models
Special simple case of HMM’s: state represents the N − 1 previous words.
Calculations much simpler (avoid Forward-Backward, EM)
Bigram model for P(the tut. was a roaring success):
  P(the) · P(tut.|the) · P(was|tut.) · P(a|was) · P(roaring|a) · · ·
Trigram model:
  P(the) · P(tut.|the) · P(was|the tut.) · P(a|tut. was) · P(roaring|was a) · · ·
Bigrams/trigrams: dominant language-modeling technology

Training N-gram models
Estimates for P(wn | w1 w2 · · · wn−1) are typically based on the maximum-likelihood estimate
  #(w1 w2 · · · wn−1 wn) / #(w1 w2 · · · wn−1),
where #(·) indicates frequency in a large training corpus.
Standard techniques: interpolation [Jelinek and Mercer 80], backoff [Katz 87]   (more later ...)

V. The Sparse Data Problem

Predicting Probabilities
Which is more likely? (both are grammatical)
“It’s hard to wreck a nice beach” vs. “It’s hard to recognize speech”
Applications: speech recognition, handwriting recognition, spelling correction, ...
General problem in statistical NLP: density estimation
  P(“I saw her duck [with a telescope]” → verb attachment)
  P(“L’avocat general” → “the general avocado”)

Maximum-Likelihood Estimation
Training: find parameters maximizing the likelihood of training set T.
A simple model: P_true(“informative brown bag seminar”) ≈ #(“informative brown bag seminar”) / |T|

[Screenshot: a Google search (©2000, 1,060,000,000 web pages indexed) for "informative brown bag seminar" returns “Your search - "informative brown bag seminar" - did not match any documents.”]

Sparse Data Problems
Why care about unseen strings?
• For a 350M-word sample of English, an estimated 14% of triples in any new sample would be unseen [Brown et al. 1992].
• A standard corpus of word 4-tuples has a 95% unseen rate for the test set [Collins and Brooks 1995, PP-attachment].
The aggregate probability of unseen events can be very large, so we need to accurately model them.
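As a concrete illustration of why unseen events matter, here is a minimal sketch of maximum-likelihood bigram estimation, P(wn | wn−1) = #(wn−1 wn) / #(wn−1), over a tiny invented corpus; any bigram that never occurs in training receives probability zero.

```python
# A minimal sketch of maximum-likelihood bigram estimation, illustrating the
# sparse-data problem: bigrams unseen in training get probability 0.
# The tiny "corpus" below is invented for illustration.
from collections import Counter

corpus = "the tutorial was a roaring success . the talk was a success .".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

print(p_mle("success", "roaring"))   # 1.0  (seen in training)
print(p_mle("address", "boring"))    # 0.0  (unseen: sparse data!)
```

Interpolation and backoff, mentioned above, exist precisely to redistribute some probability mass onto such unseen events.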
Sparse Data Problems (cont.)
Chomsky: the sparse data problem is insurmountable!
“It is fair to assume that neither sentence (1) Colorless green ideas sleep furiously nor (2) Furiously sleep ideas green colorless ... has ever occurred .... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally “remote” from English.” [Chomsky 1957]

Similarity Information
Key idea: look at information provided by similar words.
“informative brown bag talk”, “informative brown bag presentation” ⇒ “informative brown bag seminar” is reasonable.
We wish to determine similarity automatically, not with pre-existing thesauri:
• domain variance (apple, sun)
• unknown words

Distributional Similarity
We are interested in distributional similarity: x and x′ are similar means P(Y | x) ≈ P(Y | x′).
[Figure: bar charts of P(V | apple), P(V | pear), and P(V | veto) over the verbs eat, peel, throw, threaten.]
Not (necessarily) semantic similarity (“You used me!” vs. “You utilized me!”).

Distributional Similarity Models
• Clustering [Brown et al. 92; Schütze 92; Pereira-Tishby-Lee 93; Karov-Edelman 96; Li-Abe 97; Rooth et al. 99; Lee-Pereira 99]
  Group words into global clusters; use clusters as models
  Compresses the data
• Nearest neighbors [Dagan-Marcus-Markovitch 93, Dagan-Pereira-Lee 94, Dagan-Lee-Pereira 97, Lee-Pereira 99, Lee 99]
  For each word, use words in its specific local neighborhood as model
[Figure: example contrasting two clusters (A, B) vs. two nearest neighbors.]

Example: Nearest Neighbors of “Company”
Verb-noun pairs from AP newswire. Underline: unique to a function.
[Table: nearest neighbors of “company” under several similarity functions (JS, jac, cos, conf, tau, euc, var); typical neighbors include airline, business, firm, bank, agency, state, govt., group, country, city, industry, program, people, year, with some neighbors unique to particular functions.]
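One common way to score distributional similarity is the Jensen-Shannon divergence between the conditional distributions P(V | x), as in the JS column of the table above. Here is a minimal sketch; the toy verb distributions are invented for illustration (cf. the P(V|apple), P(V|pear), P(V|veto) figure).

```python
# A minimal sketch of distributional similarity via Jensen-Shannon divergence
# between conditional verb distributions P(V | noun).  The toy distributions
# below are invented for illustration.
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    return sum(p[v] * math.log(p[v] / q[v], 2) for v in p if p[v] > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    verbs = set(p) | set(q)
    m = {v: 0.5 * (p.get(v, 0.0) + q.get(v, 0.0)) for v in verbs}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_apple = {"eat": 0.6, "peel": 0.3, "throw": 0.1, "threaten": 0.0}
p_pear  = {"eat": 0.5, "peel": 0.4, "throw": 0.1, "threaten": 0.0}
p_veto  = {"eat": 0.0, "peel": 0.0, "throw": 0.1, "threaten": 0.9}

print(js(p_apple, p_pear))   # small: apple and pear are distributionally similar
print(js(p_apple, p_veto))   # large: apple and veto are not
```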
VI. Conclusions and References

No Myths...Only a Beginning
“The linguistic content of our program thus far is scant indeed. It is limited to one set of rules for analyzing a string of characters into a string of words, and another set of rules for analyzing a string of words into a string of sentences. Doubtless even these can be recast in terms of some information theoretic objective function. But it is not our intention to ignore linguistics, neither to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems in general and to better machine translation systems in particular.”
— The Mathematics of Statistical Machine Translation [Brown, Della Pietra, Della Pietra, and Mercer, 1993]

For Further Information ...
“The $64,000 question in computational linguistics these days is: ‘What should I read to learn about statistical natural language processing?’” [Magerman 95]

Short Overviews
• C. Cardie and R. Mooney, “Machine Learning and Natural Language”, introduction to Machine Learning 34(1-3), special issue on natural language learning, 1999.
• S. Abney, “Statistical Methods and Linguistics”, in The Balancing Act, J. Klavans and P. Resnik, eds., 1997.
• E. Brill and R. Mooney, “An Overview of Empirical Natural Language Processing”, AI Magazine 18(4), 1997.
• K. Church and R. Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora”, Computational Linguistics 19(1), 1993.

Books
• E. Charniak, Statistical Language Learning, MIT Press, 1993. Reviewed by D. Magerman in Computational Linguistics 21(1), 1995.
• T. Cover and J. Thomas, Elements of Information Theory, Wiley-Interscience, 1991.
• F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
• C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999. Reviewed by L. Lee in Computational Linguistics 26(2), 2000.

Conferences
• Empirical Methods in Natural Language Processing (EMNLP)
• Workshop on Very Large Corpora (WVLC)
• General NLP conferences:
  Association for Computational Linguistics (ACL)
  North American Chapter of the ACL (NAACL)
  European Chapter of the ACL (EACL)
  Applied Natural Language Processing (ANLP)
  International Conference on Computational Linguistics (COLING)
Many recent papers are posted on the cmp-lg server, http://xxx.lanl.gov/cmp-lg/, later absorbed into the computer science holdings of the Computing Research Repository, http://xxx.lanl.gov/archive/cs.