Natural Language Technology(*)
Lillian Lee
Cornell University
http://www.cs.cornell.edu/home/llee
(*) Some of this material comes from a joint tutorial, co-organized with John Lafferty, at the Sixteenth National Conference on Artificial Intelligence, 1999.

Outline
I. Overview of the Field
II. The Statistical Revolution
III. Language as a Statistical Source
IV. Tools of the Trade
V. The Sparse Data Problem
VI. Conclusions and References
I. Overview of the Field
Natural Language Processing (NLP)
Goal: computers using natural language as input and/or output
  language → computer: understanding (NLU)
  computer → language: generation (NLG)
NLU example: convert an utterance into a sequence of computer instructions.
NLG example: produce a summary of a patient’s records.

Why NLP?
Lots of information is in natural language format.
  • Documents
  • News broadcasts
  • User utterances
Lots of users want to communicate in natural language.
  • “Do what I mean!”
NLP is Useful

Task                      Input                            Output
user interfaces           command in natural language      computer instructions
question answering        query                            answer to query
→ information retrieval   query                            relevant documents
machine translation       signal in language 1             signal in language 2
summarization             “document(s)” (CNN broadcasts)   summary

“Now we’re betting the company on these natural interface technologies”
– Bill Gates, 1997
NLP is Cross-Disciplinary
Excellent opportunities for interdisciplinary work.
• Linguistics: models of language
  emphasizes 100% accuracy (competence)
• Psychology: models of cognitive processes
  emphasizes biological/cognitive plausibility
• Mathematics and statistics: properties of models
  emphasizes formal aspects
On the whole, NLP tends to be applications-oriented: 95% is OK; models need be neither biologically plausible nor mathematically satisfying.
NLP is Challenging
It is often said that NLP is “AI-complete”: all the difficult problems in artificial intelligence manifest themselves in NLP problems.
This idea dates back at least to the Turing Test: “The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include” [Turing, “Computing Machinery and Intelligence”, 1950]

Why is NLP hard?
• “Doesn’t Microsoft do that already?”
• Ad from the 70’s or 80’s (source: S. Shieber): the problem has already been solved ...
  “At last, a computer that understands you like your mother”

Ambiguity
“At last, a computer that understands you like your mother”
What can we infer about the computer?
1. (*) It understands you as well as your mother understands you
2. It understands (that) you like your mother
3. It understands you as well as it understands your mother
1 and 3: Does this mean well, or poorly?
Ambiguity at Many Levels (I)
At the acoustic level (speech recognition):
1. “... a computer that understands you like your mother”
2. “... a computer that understands your lie cured mother”

Ambiguity at Many Levels (II)
At the morphological (word-form) level:
“... a computer that understands you like ...”
understands = understand + s
            = under + stands (although derived historically from this)
            = un + derstands
In practice, storing root forms reduces database size.
The morphological analysis problem is especially difficult in new domains: unionized = ?
Ambiguity at Many Levels (III)
At the syntactic (structural) level:
[Parse-tree figure: alternative VP structures for “understands you like your mother”, e.g. [VP [V understands] [NP you] [S′ like your mother [does]]] vs. a structure in which the material following “understands” forms a single constituent.]
Different structures lead to different interpretations.

Ambiguity at Many Levels (IV)
Ellipsis: missing (elided) syntactic material
“... a computer that understands you like ...” = ?
  ... understands you like your mother [understands you]
  ... understands you like [it understands] your mother
In fact, the identity of the syntactic material is ambiguous.
[Tree fragment: S′ → [that] you like your mother]
Ambiguity at Many Levels (V)
At the semantic (meaning) level:
mother = ? (OED)
  A female parent
  A cask or vat used in vinegar-making
This is an instance of word sense ambiguity.
A more typical example: “They put money in the bank”.

Ambiguity at Many Levels (VI)
At the discourse (multiple-clause) level:
1. Alice says they’ve built a computer that understands you like your mother.
2. But she ...
  2a. ... doesn’t know any details.
  2b. ... doesn’t understand me at all.
This is an instance of anaphora, where “she” co-refers to some other discourse entity.
What Will It Take?
The task seems so difficult! What resources do we need?
1. Knowledge about language
2. Knowledge about the world
Two veins of work to combat the knowledge acquisition bottleneck:
• handcrafted and expert-driven
• automated and data-driven
It often helps to restrict the domain.
Success Stories
Not an exclusive list!
The TAUM-METEO system [Chandioux 76]: essentially perfect French-English translation of weather reports.
JUPITER [MIT Spoken Language Systems group (1-888-573-TALK)]: conversational system for weather information. ∼80% “understanding” rate for novices.
Information Extraction: systems exist for analyzing and summarizing reports of joint business ventures [Message Understanding Conferences (MUC) 1994]
Note – restricted domains.
General NLP References
Not an exclusive list!
• Jurafsky and Martin, 2000. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition.
• Allen, 1995. Natural Language Understanding, 2nd edition.
• comp.ai.nat-lang FAQ (Frequently Asked Questions), http://www.cs.columbia.edu/∼radev/nlpfaq.txt
• The Association for Computational Linguistics (ACL) Universe, http://www.cs.columbia.edu/∼radev/u/db/acl/
II. The Statistical Revolution
What is Statistical NLP?
Goal: infer language properties from (annotated?) (text?) samples.
• Helps ease the knowledge acquisition bottleneck.
Draws on probability, statistics, information theory, machine learning.
Two threads (often intertwined; not everyone distinguishes!):
• statistical models — language assumed generated by a statistical source
• statistical methods — no assumption on language source; sample statistics used to make decisions
Non-statistical Models Example: CFG’s
Context-Free Grammars (CFG’s): strings generated by choosing some sequence of rewriting rules.

Rules:
  S  → NP VP
  NP → the tutorial
  VP → rocked
  VP → bombed

Example derivation of “the tut. rocked” (“the tut.” abbreviates “the tutorial”):
  S → NP VP → the tut. VP → the tut. rocked
[Corresponding parse tree shown on the slide.]
Statistical Models Example: PCFG’s
Probabilistic Context-Free Grammars (PCFG’s): strings generated by randomly picking rules according to their probabilities.

Rules (with probabilities):
  S  → NP VP        (1.0)
  NP → the tutorial (1.0)
  VP → rocked       (.75)
  VP → bombed       (.25)

P(“the tut. rocked”) = P(S→NP VP) × P(NP→the tut.) × P(VP→rocked) = 1 × 1 × .75 = .75
Statistical Methods Example: WSD
Word sense disambiguation (WSD): find correct word sense from context
“They put money in the bank”   bank = savings? river?
A statistical solution [Lesk 86]: estimate the likelihood of savings bank co-occurring with “money” from entries in a machine-readable dictionary.
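A rough sketch of the dictionary-overlap idea (the two-sense inventory, glosses, and stopword list below are illustrative inventions, not Lesk's actual machine-readable dictionary): pick the sense of "bank" whose gloss shares the most content words with the sentence.

```python
# Simplified Lesk-style WSD sketch: score each sense of "bank" by word overlap
# between its (made-up) dictionary gloss and the sentence context.

STOPWORDS = {"the", "a", "an", "in", "of", "or", "and", "is", "are", "they", "put"}

glosses = {
    "bank (savings)": "an institution where money is deposited and loans are made",
    "bank (river)":   "the sloping land alongside a river or a lake",
}

def lesk(context):
    context_words = set(context.lower().split()) - STOPWORDS
    def overlap(sense):
        return len(context_words & (set(glosses[sense].split()) - STOPWORDS))
    return max(glosses, key=overlap)

print(lesk("They put money in the bank"))   # -> "bank (savings)" (shared word: "money")
```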
Why Statistical NLP?
• Statistical models allow degrees of uncertainty (not just “grammatical/ungrammatical”)
  confidence can be assessed (helps combine knowledge sources)
  models can be iteratively trained/updated
• Statistical methods reduce the knowledge acquisition bottleneck
  transfer to new domains is easier
But statistical approaches were (are) not universally accepted ...
A Brief History
The 40’s and 50’s: statistical NLP popular
• Harris, Firth: empirical linguistics (“You shall know a word by the company it keeps” [Firth 57])
• Shannon, Weaver: cryptographic notions, the noisy channel model
  [Diagram: language 1 → corruption process → language 2; the decoder recovers language 1 from language 2.]
A Brief History (cont.)
Late 50’s–80’s: statistical NLP in disfavor
“It is fair to assume that neither sentence
  (1) Colorless green ideas sleep furiously
nor
  (2) Furiously sleep ideas green colorless
... has ever occurred .... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally “remote” from English. Yet (1), though nonsensical, is grammatical, while (2) is not.” [Chomsky 1957]
A Brief History (cont.)
The 80’s – present: statistical NLP once again mainstream
• revived by IBM: influenced by speech recognition
• confluence with interest in machine learning
• nowadays, “no one can profess to be a computational linguist without a passing knowledge of statistical methods .... anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL [conference] banquet.” [Abney 97]
Statistics on Statistical NLP
From Julia Hirschberg’s AAAI-98 invited talk:

Source     Percentage of statistically-based papers
ACL 1990   12.8%
ACL 1998   63.5%

(ACL is the main conference of the Association for Computational Linguistics)
1983 was the last year in which there were no such papers.
The “Opposite” of Statistical NLP?
Some draw contrasts with knowledge-based methods, higher-level processes, linguistics...
• Chomsky
• “I don’t believe in this statistics stuff”
• “that’s not learning, that’s statistics”
• “AI-NLP ...is going nowhere fast”
• “Every time I fire a linguist, my performance goes up”
Statistics Complements Other Approaches
• Knowledge-based models can be converted to stochastic versions
  CFG’s → PCFG’s
  statistical semantics, discourse models [Miller 96]
• Statistical methods can make use of knowledge bases (don’t confuse methods and models)
  WSD using dictionaries
III. Language as a Statistical Source
A Recent Anniversary
C.E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, July 1948.
Famous First Words
“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.”
Generative Models
A useful conceptual and practical device: coin-flipping models
• A string is generated by a randomized algorithm
  The generator can be in one of several “states”
  A coin (or a bunch of coins) is flipped to choose the next state
  Another coin is flipped to decide which letter or word to output
• Shannon: “The states will correspond to the “residue of influence” from preceding letters”
Coin-Flipping Models
[Figure: two example coin-flipping (finite-state) models, with per-state output probabilities over the letters A–E, and sample strings they generate:]
  ABBABABABABABABBBABBBBBABABABABABBAC
  AAACDCBDCEAADADACEDAEADCABEDADDCECAA
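As a rough illustration of such a generator, here is a sketch that "flips coins" to walk between two states and emit letters; the state names and probabilities are illustrative stand-ins, not necessarily those pictured above.

```python
# Minimal simulation of a coin-flipping generator: one "coin" chooses the next
# state, another chooses the output letter from the current state.
import random

transitions = {"s1": {"s1": 0.7, "s2": 0.3},        # P(next state | current state)
               "s2": {"s1": 0.4, "s2": 0.6}}
emissions = {"s1": {"A": 0.5, "B": 0.4, "C": 0.1},  # P(letter | state)
             "s2": {"A": 0.4, "B": 0.1, "C": 0.2, "D": 0.2, "E": 0.1}}

def pick(dist):
    """Flip a biased, many-sided coin: sample a key according to its probability."""
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def generate(n_letters, state="s1"):
    out = []
    for _ in range(n_letters):
        out.append(pick(emissions[state]))   # coin flip: which letter to output
        state = pick(transitions[state])     # coin flip: which state comes next
    return "".join(out)

print(generate(36))   # e.g. a string of A's, B's, ... resembling the samples above
```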
The Soul of a New Machine
When designing a new statistical model for an NLP task, it is often very helpful to simulate it in your mind.
[Figure: a coin-flipping state machine with binary (0/1) labels.]
Markov Approximations to English
From Shannon’s original paper:
1. Zero-order approximation:
  XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation:
  OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation:
  ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Markov Approximations (cont.)
From Shannon’s original paper:
4. Third-order approximation:
  IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTABIN IS REGOACTIONA OF CRE
Shannon’s comment: “It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.”
Markov random field with 1000 “features,” no underlying “machine” (Della Pietra et al., 1997):
  WAS REASER IN THERE TO WILL WAS BY HOMES THING BE RELOVERATED THER WHICH CONISTS AT FORES ANDITING WITH PROVERAL THE CHESTRAING FOR HAVE TO INTRALLY OF QUT DIVERAL THIS OFFECT INATEVER THIFER CONSTRANDED STATER VILL MENTTERING AND OF IN VERATE OF TO
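A sketch of how such letter-level approximations can be produced mechanically: train a character model with k letters of context on some English text and sample from it (k = 1 corresponds roughly to Shannon's second-order approximation, k = 2 to his third-order). The embedded training text, Shannon's own opening sentence quoted earlier, is far too small for interesting output; a large corpus is needed, and Shannon constructed his published examples by hand from books.

```python
# Character-level Markov approximations: count next-letter frequencies given the
# previous k letters, then generate by repeated sampling.
import random
from collections import Counter, defaultdict

def train(text, k):
    """Counts of the next character given the previous k characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - k):
        counts[text[i:i + k]][text[i + k]] += 1
    return counts

def generate(counts, k, length=80):
    out = random.choice(list(counts))            # seed context (empty string if k == 0)
    for _ in range(length):
        nxt = counts[""] if k == 0 else counts.get(out[-k:])
        if not nxt:
            break                                # context never seen with a continuation
        out += random.choices(list(nxt), weights=list(nxt.values()), k=1)[0]
    return out

sample = ("THE FUNDAMENTAL PROBLEM OF COMMUNICATION IS THAT OF REPRODUCING AT ONE "
          "POINT EITHER EXACTLY OR APPROXIMATELY A MESSAGE SELECTED AT ANOTHER POINT")
for k in (0, 1, 2):
    print(k, generate(train(sample, k), k))
```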
Word-Based Approximations
1. First-order approximation:
  REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE T
2. Second-order approximation:
  THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Estimating Redundancy
• Redundancy helps us communicate:
  TH R S NLY N W Y T F LL N TH V W LS N TH S S NT NC
  (“There is only one way to fill in the vowels in this sentence.”)
• Coin-flipping models can be used to estimate the redundancy or entropy of English:
  First order model: 4.03 bits/letter
  Fourth order model: 2.8 bits/letter
  Trigram word model: 1.72 bits/letter
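As a sketch of where a bits/letter figure comes from in the simplest case, the entropy of a first-order (letter-frequency) model is H = −Σ_c p(c) log2 p(c). The tiny text below is only the Shannon quotation from this tutorial; a real estimate such as the 4.03 bits/letter figure requires a large corpus.

```python
# Entropy (bits per symbol) of a first-order character model estimated from a text.
import math
from collections import Counter

text = ("THE FUNDAMENTAL PROBLEM OF COMMUNICATION IS THAT OF REPRODUCING AT ONE "
        "POINT EITHER EXACTLY OR APPROXIMATELY A MESSAGE SELECTED AT ANOTHER POINT")

counts = Counter(text)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(entropy, 2), "bits/letter under a first-order (letter-frequency) model")
```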
The Source-Channel Model
[Diagram: Source (Language Model, p(E)) → E → Channel Model (p(Y | E)) → Y → Decoder → E*]
  E* = arg max_E p(E | Y) = arg max_E p(E) p(Y | E)
Source-Channel Examples

Application           Observation Y (Channel Model, p(Y | E))
speech recognition    acoustic signal
French translation    French text
spelling correction   typed text
OCR                   processed image

In each case the Source / Language Model p(E) is over the underlying text E.
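A minimal sketch of source-channel decoding for the spelling-correction row of the table, choosing E* = argmax_E p(E) p(Y | E); the candidate list and all probabilities are made up for illustration, whereas a real system derives p(E) from a language model and p(Y | E) from an error model.

```python
# Noisy-channel spelling correction sketch: score each candidate intended string E
# by source probability times channel probability, and keep the best.

observed = "the tutoral was a roaring success"

p_source = {                                        # made-up language-model scores p(E)
    "the tutorial was a roaring success": 3e-6,
    "the tutorial was a boring address":  1e-9,
    "the tutoral was a roaring success":  1e-12,    # "tutoral" is a rare/unknown word
}

p_channel = {                                       # made-up typo-model scores p(Y | E)
    "the tutorial was a roaring success": 0.01,     # one small typo away from Y
    "the tutorial was a boring address":  1e-8,     # many edits away from Y
    "the tutoral was a roaring success":  0.9,      # typing exactly what was intended
}

best = max(p_source, key=lambda e: p_source[e] * p_channel[e])
print(best)   # -> "the tutorial was a roaring success"
```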
IV. Tools of the Trade
• PCFG’s
• HMM’s
• EM
• Special cases
Predicting String Probabilities
“The tutorial was a roaring success”
vs.
“The tutorial was a boring address”
Which is more likely? (both are grammatical)
Language Modeling
Language model: method for assigning probabilities to strings; want to approximate source probabilities, e.g.
  P(“... roaring success”) = .003
  P(“... boring address”) = .000001
Classic applications: speech, handwriting, and optical character recognition
Standard models: PCFG’s, n-grams/HMM’s
PCFG’s
PCFG’s give both parse structures and probability.

Rule                 Prob.
S  → NP VP           (1.0)
NP → the tutorial    (1.0)
VP → rocked          (.75)
VP → bombed          (.25)

P(“the tut. rocked”) = 1 × 1 × .75 = .75
[Parse tree for “the tut. rocked” shown alongside.]
PCFG Facts
Probabilities of all rules with the same lefthand side must sum to one.
Ambiguous PCFG’s: multiple parse trees for the same sentence. Example:
  S → S S      (.1)
  S → police   (.9)
generates “Police”, “Police police”, “Police police police”, ...
[Figure: two distinct parse trees for “police police police”.]
PCFG Language Modeling
For sentence w1 · · · wn and PCFG G,
  P_G(w1 · · · wn) = Σ over parse trees π for w1 · · · wn of P(π)
How can we compute this efficiently? Dynamic programming!
Computing Sentence Probabilities
[Figure: sentence w1 · · · wn with S at the root and a nonterminal A spanning wi · · · wj.]
Inside probability In(A, i, j): probability that A generates string wi · · · wj
Outside probability Out(A, i, j): probability that S generates w1 · · · wi−1 A wj+1 · · · wn
Combining inside probs bottom-up yields In(S, 1, n), prob of the sentence
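A sketch of the bottom-up inside computation for a PCFG in Chomsky-normal-form style, using the ambiguous "police" grammar from the PCFG Facts slide; the dynamic program sums over both parse trees of "police police police".

```python
# Inside algorithm sketch: In[(A, i, j)] = probability that A generates words[i..j].
from collections import defaultdict

binary_rules = {("S", ("S", "S")): 0.1}    # A -> B C, with rule probability
lexical_rules = {("S", "police"): 0.9}     # A -> terminal

def inside_prob(words, start="S"):
    n = len(words)
    In = defaultdict(float)
    # Base case: length-1 spans come from lexical rules.
    for i, w in enumerate(words):
        for (A, term), p in lexical_rules.items():
            if term == w:
                In[(A, i, i)] += p
    # Longer spans: combine two adjacent sub-spans via binary rules.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, (B, C)), p in binary_rules.items():
                for k in range(i, j):
                    In[(A, i, j)] += p * In[(B, i, k)] * In[(C, k + 1, j)]
    return In[(start, 0, n - 1)]

# Two parse trees, each with probability (0.1**2) * (0.9**3); the sum is 0.01458.
print(inside_prob("police police police".split()))   # ≈ 0.01458
```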
Training PCFG’s
Where do we get the rule probabilities?
Inside-Outside algorithm [Baker 79]:
• iterative re-estimation using inside and outside probabilities for large training corpus T
• each step increases T’s likelihood according to the PCFG
Note: algorithms for taking advantage of structural annotation [Pereira and Schabes 92]
PCFG’s: Overall Effectiveness
PCFG’s: linguistically intuitive, provide parse structures
If parses are not needed, simpler models have (so far) been more effective at estimating sentence probabilities
But cf. [Chelba and Jelinek 98]
Restricting Generative Capacity
CFG’s: top-down generation.
Example: S → a S b; S → a b
“Matching” a’s and b’s not necessarily adjacent.
[Figure: derivation tree showing nested a ... b pairs.]
What if we limit modeling capacity to local correlations?
HMM’s
Hidden Markov Models (HMM’s): states, state transitions, outputs.
[Figure: two equivalent toy HMMs. In one, a state outputs “the tut.” (1.0) and a transition (1.0) leads to a state outputting “rocked” (.75) or “bombed” (.25); in the other, the transition probabilities (.75 / .25) choose between states that output “rocked” (1.0) or “bombed” (1.0).]
P(“the tut. rocked”) = 1 × 1 × .75 = 1 × .75 × 1 = .75
HMM Facts
• All transition probs out of a state must sum to one. Same for output probs. (cf. PCFG rule probs)
• Same string may be generated via different paths (cf. ambiguous PCFG’s)
• HMM’s cannot simulate all PCFG’s. Ex: (.1) S→aSb; (.9) S→ab
HMM Language Modeling
For sentence w1 · · · wn and HMM h (arbitrary states),
  P_h(w1 · · · wn) = Σ over paths p for w1 · · · wn of P(w1 · · · wn | p) P(p)
(cf. PCFG language modeling)
How can we compute this efficiently? Dynamic programming!
Computing Sentence Probabilities
[Figure: HMM trellis over w1 · · · wn, with a state s between wi and wi+1.]
Forward prob For(s, i): prob of generating w1 · · · wi, ending at state s
Backward prob Back(s, i): prob of generating wi+1 · · · wn, starting at s
Combining forward probs left-to-right yields For(start, n), prob of the sentence
Alternatively, combining backward probs right-to-left yields Back(start, 0)
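A sketch of the forward computation; the toy HMM below (state names, transition and output probabilities) is an invented example, not one from the tutorial.

```python
# Forward algorithm sketch: For[s] after reading w1..wi is the probability of
# generating w1..wi and ending in state s; summing over final states gives P(sentence).

states = ["DET", "N"]
start_prob = {"DET": 0.8, "N": 0.2}
trans_prob = {"DET": {"DET": 0.1, "N": 0.9},
              "N":   {"DET": 0.4, "N": 0.6}}
emit_prob = {"DET": {"the": 0.9, "tutorial": 0.1},
             "N":   {"the": 0.05, "tutorial": 0.95}}

def forward(words):
    # Initialization: first word emitted from the start distribution.
    For = {s: start_prob[s] * emit_prob[s].get(words[0], 0.0) for s in states}
    # Induction: extend by one word at a time, summing over predecessor states.
    for w in words[1:]:
        For = {s: sum(For[r] * trans_prob[r][s] for r in states)
                  * emit_prob[s].get(w, 0.0)
               for s in states}
    return sum(For.values())

print(forward(["the", "tutorial"]))   # ≈ 0.63 for these made-up parameters
```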
Training HMM’s
Where do we get transition/output probs?
Forward-Backward, or Baum-Welch:
• iterative re-estimation using forward and backward probabilities for large training corpus T
• each step increases T’s likelihood according to the HMM
(cf. Inside-Outside)

Training Commonalities
Inside-Outside (PCFG’s) and Forward-Backward (HMM’s) look very similar.
In general, training a probabilistic model: find parameter settings maximizing (locally) training data likelihood
EM
The EM algorithm [Dempster, Laird, Rubin 77]: used when it is difficult to calculate likelihood directly.
Ex: Inside-Outside, Forward-Backward
θ: model parameters (e.g., transition probs)
T: training data
Goal: find θ* = arg max_θ P_θ(T)
Settle for finding local stationary point via hillclimbing.
EM (cont.)
Use auxiliary variable Y, dependent on θ. If
  Σ_y P_θi(y | T) log P_θi+1(y, T) > Σ_y P_θi(y | T) log P_θi(y, T)
then
  P_θi+1(T) > P_θi(T)    (training likelihood increased!)
EM algorithm
Iterative process:
  Expectation: calculate E(log P_θ(Y, T)), with respect to P_θi(y | T), as a function of θ
  Maximization: find θi+1 maximizing this
Trick is to find auxiliary variable Y making these computations easy (ex: HMM’s: the paths)
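A sketch of EM on a problem much simpler than HMM's or PCFG's: a mixture of two biased coins, where the hidden variable Y is which coin produced each trial (cf. the hidden HMM paths). The data and initial parameters below are made up for illustration.

```python
# EM for a two-coin mixture: E-step computes responsibilities under the current θ,
# M-step re-estimates θ from expected counts; the training likelihood never decreases.
import math

FLIPS = 10
data = [9, 8, 2, 1, 9, 2, 8, 1, 9, 2]   # heads observed in 10 flips per trial

def binom(h, p):
    return math.comb(FLIPS, h) * p**h * (1 - p)**(FLIPS - h)

# Initial guesses θ0 = (mixture weight of coin 1, bias of coin 1, bias of coin 2).
w, p1, p2 = 0.5, 0.6, 0.4
for _ in range(50):
    # E-step: posterior probability that coin 1 produced each trial.
    resp = [w * binom(h, p1) / (w * binom(h, p1) + (1 - w) * binom(h, p2))
            for h in data]
    # M-step: maximize the expected complete-data log likelihood.
    w = sum(resp) / len(data)
    p1 = sum(r * h for r, h in zip(resp, data)) / (FLIPS * sum(resp))
    p2 = sum((1 - r) * h for r, h in zip(resp, data)) / (FLIPS * sum(1 - r for r in resp))

print(round(w, 2), round(p1, 2), round(p2, 2))   # ≈ 0.5, 0.86, 0.16
```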
Special Cases
For simpler versions of HMM’s (nothing hidden), EM is not necessary.
• Part-of-speech HMM’s
• n-gram models
HMM POS Model
[Figure: HMM with one state per part of speech (DET, ADJ, N, V, ...); e.g., the DET state outputs “the” (.1), “those” (.02), ...]
Training: state semantics known – use POS-tagged training data, not EM
State-of-the-art for POS tagging (∼95% accuracy): “reverse” the HMM to find most likely tag (state) sequence
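A sketch of "reversing" the HMM with the Viterbi algorithm to find the most likely tag (state) sequence; the miniature tag set and probabilities are illustrative assumptions, not trained values.

```python
# Viterbi sketch for POS tagging: keep, for each tag, the best path ending in that tag.

tags = ["DET", "N", "V"]
start = {"DET": 0.6, "N": 0.3, "V": 0.1}
trans = {"DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
         "N":   {"DET": 0.1,  "N": 0.3, "V": 0.6},
         "V":   {"DET": 0.5,  "N": 0.4, "V": 0.1}}
emit = {"DET": {"the": 0.7, "those": 0.1},
        "N":   {"tutorial": 0.4, "police": 0.3, "rocked": 0.05},
        "V":   {"rocked": 0.5, "police": 0.2}}

def viterbi(words):
    """Return the highest-probability tag sequence for the given words."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            # Choose the predecessor tag that maximizes the path probability.
            prob, path = max((best[r][0] * trans[r][t], best[r][1]) for r in tags)
            new_best[t] = (prob * emit[t].get(w, 1e-6), path + [t])
        best = new_best
    return max(best.values())[1]

print(viterbi(["the", "police", "rocked"]))   # -> ['DET', 'N', 'V']
```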
N-gram Models
Special simple case of HMM’s: state represents N − 1 previous words.
Calculations much simpler (avoid Forward-Backward, EM)
Bigram model for P(the tut. was a roaring success):
  P(the) · P(tut.|the) · P(was|tut.) · P(a|was) · P(roaring|a) · · ·
Trigram model:
  P(the) · P(tut.|the) · P(was|the tut.) · P(a|tut. was) · P(roaring|was a) · · ·
Bigrams/trigrams: dominant language-modeling technology
Training N-gram models
Estimates for P(wn | w1 w2 · · · wn−1) are typically based on the maximum likelihood estimate
  #(w1 w2 · · · wn−1 wn) / #(w1 w2 · · · wn−1),
where #(·) indicates frequency in a large training corpus.
Standard techniques: interpolation [Jelinek and Mercer 80], backoff [Katz 87] (more later ...)
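A sketch of the maximum-likelihood bigram estimate #(w_{n−1} w_n) / #(w_{n−1}) over a tiny made-up corpus; the zero returned for an unseen bigram previews the sparse data problem of the next section.

```python
# Maximum-likelihood bigram estimation from corpus counts.
from collections import Counter

corpus = "the tutorial was a roaring success . the tutorial rocked .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    """MLE estimate; zero for unseen bigrams (hence the need for smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("the", "tutorial"))      # 2/2 = 1.0
print(p_bigram("tutorial", "was"))      # 1/2 = 0.5
print(p_bigram("tutorial", "bombed"))   # 0.0 -- unseen in the training corpus
```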
V. The Sparse Data Problem
Predicting Probabilities
“It’s hard to recognize speech”
vs.
“It’s hard to wreck a nice beach”
Which is more likely? (both are grammatical)
Applications: speech recognition, handwriting recognition, spelling correction, ...
General problem in statistical NLP: density estimation
  P(“I saw her duck [with a telescope]” → verb attachment)
  P(“L’avocat general” → “the general avocado”)
Maximum-Likelihood Estimation
Training: find parameters maximizing the likelihood of training set T.
A simple model:
  P_true(“informative brown bag seminar”) ≈ #(“informative brown bag seminar”) / |T|
[Screenshot of a Google results page (©2000 Google; 1,060,000,000 web pages indexed): the search “informative brown bag seminar” did not match any documents.]
Sparse Data Problems
Why care about unseen strings?
• For a 350M-word sample of English, an estimated 14% of triples in any new sample would be unseen [Brown et al. 1992].
• A standard corpus of word 4-tuples has a 95% unseen rate for the test set [Collins and Brooks 1995, PP-attachment].
The aggregate probability of unseen events can be very large, so we need to accurately model them.
Sparse Data Problems (cont.)
Chomsky: the sparse data problem is insurmountable!
“It is fair to assume that neither sentence
  (1) Colorless green ideas sleep furiously
nor
  (2) Furiously sleep ideas green colorless
... has ever occurred .... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally “remote” from English.” [Chomsky 1957]
Similarity Information
Key idea: look at information provided by similar words.
  “informative brown bag talk”
  “informative brown bag presentation”
  ⇒ “informative brown bag seminar” is reasonable.
We wish to determine similarity automatically, not with pre-existing thesauri:
• domain variance (apple, sun)
• unknown words
Distributional Similarity
We are interested in distributional similarity:
  x and x′ are similar means P(Y | x) ≈ P(Y | x′)
[Figure: bar charts of P(V | apple), P(V | pear), P(V | veto) over the verbs eat, peel, throw, threaten.]
Not (necessarily) semantic similarity (“You used me!” ⇒ “You utilized me!”)
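As a sketch of measuring distributional similarity, the code below compares P(V | noun) distributions with the Jensen-Shannon divergence (one of several functions compared later); the toy verb-noun counts are invented for illustration.

```python
# Distributional similarity sketch: estimate P(verb | noun) from co-occurrence
# counts and compare nouns with the Jensen-Shannon divergence (smaller = more similar).
import math
from collections import Counter

pairs = [("eat", "apple"), ("eat", "apple"), ("peel", "apple"), ("throw", "apple"),
         ("eat", "pear"), ("peel", "pear"), ("peel", "pear"), ("throw", "pear"),
         ("threaten", "veto"), ("threaten", "veto"), ("throw", "veto")]

verbs = sorted({v for v, _ in pairs})

def p_verb_given_noun(noun):
    counts = Counter(v for v, n in pairs if n == noun)
    total = sum(counts.values())
    return [counts[v] / total for v in verbs]

def js_divergence(p, q):
    """Average KL divergence of p and q to their midpoint distribution."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

apple, pear, veto = (p_verb_given_noun(n) for n in ["apple", "pear", "veto"])
print(js_divergence(apple, pear))   # small: apple and pear are distributionally similar
print(js_divergence(apple, veto))   # larger: apple and veto are not
```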
Distributional Similarity Models
• Clustering [Brown et al. 92; Schütze 92; Pereira-Tishby-Lee 93; Karov-Edelman 96; Li-Abe 97; Rooth et al. 99; Lee-Pereira 99]
  Group words into global clusters; use clusters as models
  Compresses the data
• Nearest neighbors [Dagan-Marcus-Markovitch 93, Dagan-Pereira-Lee 94, Dagan-Lee-Pereira 97, Lee-Pereira 99, Lee 99]
  For each word, use words in its specific local neighborhood as model
Example: two clusters vs. two neighbors
[Figure: clusters A and B vs. local neighborhoods.]
Example: Nearest Neighbors of “Company”
[Table: nearest neighbors of “company” under several similarity functions (JS, jac, conf, cos, tau, euc, var). Neighbors include firm, bank, business, govt., agency, airline, state, group, industry, program, organization, city, nation, people, country, and, for some functions, words such as syndrome, hearing, referendum, hostage, stake, talk, today, year, percent. Underline: unique to a function. Verb-noun pairs from AP newswire.]
VI. Conclusions and References
No Myths...Only a Beginning
“The linguistic content of our program thus far is scant indeed. It is limited to one set of rules for analyzing a string of characters into a string of words, and another set of rules for analyzing a string of words into a string of sentences. Doubtless even these can be recast in terms of some information theoretic objective function. But it is not our intention to ignore linguistics, neither to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems in general and to better machine translation systems in particular.”
[Brown, Della Pietra, Della Pietra, and Mercer, 1993], The Mathematics of Statistical Machine Translation
For Further Information ...
“The $64,000 question in computational linguistics these days is: “What should I read to learn about statistical natural language processing?”” [Magerman 95]
Short Overviews
• C. Cardie and R. Mooney, “Machine Learning and Natural Language”, introduction to Machine Learning 34(1-3), special issue on natural language learning, 1999.
• S. Abney, “Statistical Methods and Linguistics”, in The Balancing Act, J. Klavans and P. Resnik, eds., 1997.
• E. Brill and R. Mooney, “An Overview of Empirical Natural Language Processing”, AI Magazine 18(4), 1997.
• K. Church and R. Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora”, Computational Linguistics 19(1), 1993.
Books
• E. Charniak, Statistical Language Learning, MIT Press, 1993. Reviewed by D. Magerman in Computational Linguistics 21(1), 1995.
• T. Cover and J. Thomas, Elements of Information Theory, Wiley Interscience, 1991.
• F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
• C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999. Reviewed by L. Lee in Computational Linguistics 26(2), 2000.
Conferences
• Empirical Methods in Natural Language Processing (EMNLP).
• Workshop on Very Large Corpora (WVLC).
• General NLP conferences:
  Association for Computational Linguistics (ACL)
  North American Chapter of the ACL (NAACL)
  European Chapter of the ACL (EACL)
  Applied Natural Language Processing (ANLP)
  International Conference on Computational Linguistics (COLING)
Many recent papers are posted on the cmp-lg server, http://xxx.lanl.gov/cmp-lg/, later absorbed into the Computing Research Repository computer science holdings, http://xxx.lanl.gov/archive/cs.