[java-nlp-user] Stanford NER: confidence scores
manning at stanford.edu
Tue Nov 17 17:29:51 PST 2009
Yes and no.
Among the things available at the command line, there are a few options that dump out probabilities. But they dump out the probabilities of class assignments to individual tokens and token pairs, not to full entities. Also, as the code is written now, the options only work with a -testFile, not a -textFile (for no very good reason). But they give some idea.
Using the provided sample document, you can get the following:
java edu.stanford.nlp.process.PTBTokenizer sample.txt > sample.test
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /u/nlp/data/ner/goodClassifiers/conll.distsim.crf.ser.gz -testFile sample.test -printProbs > sample.print
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /u/nlp/data/ner/goodClassifiers/conll.distsim.crf.ser.gz -testFile sample.test -printFirstOrderProbs > sample.print1st
[manning at jamie ~]$ head sample.print
The O=0.99999998135786 ORGANIZATION=2.6567466928612085E-9 MISC=1.356206662374912E-8 PERSON=8.176353978988214E-11 LOCATION=2.3417217380971917E-9
fate O=0.9999999582736345 ORGANIZATION=4.1700953530777644E-8 MISC=3.3655569957340776E-12 PERSON=2.143502478107925E-11 LOCATION=7.103478429908953E-13
of O=0.9999903961249463 ORGANIZATION=9.602401442223853E-6 MISC=1.1608074462933527E-9 PERSON=1.5819472785608723E-10 LOCATION=1.5487870335293751E-10
Lehman O=2.0103282479232179E-7 ORGANIZATION=0.9993291405222918 MISC=2.3464088532112967E-5 PERSON=7.224010968876396E-5 LOCATION=5.74954246837179E-4
Brothers O=9.86544119657211E-5 ORGANIZATION=0.9992345028959849 MISC=2.2664807695279273E-5 PERSON=7.36669467427115E-5 LOCATION=5.705109379181266E-4
, O=0.9999999607256321 ORGANIZATION=3.9139979344041836E-8 MISC=6.337165183393874E-11 PERSON=5.023186106482568E-11 LOCATION=2.1323493706861942E-11
the O=0.9999999999383817 ORGANIZATION=1.625108900198471E-13 MISC=2.5300939551568E-12 PERSON=1.5361712666066918E-11 LOCATION=4.410280856766081E-11
beleaguered O=0.9999999881256373 ORGANIZATION=1.832642399224178E-12 MISC=1.1871420471150292E-8 PERSON=6.821695803231142E-13 LOCATION=1.1632470059359709E-12
investment O=0.9999999985357135 ORGANIZATION=1.4523487401949386E-9 MISC=1.1119518225092833E-11 PERSON=3.0957335968364846E-13 LOCATION=1.0237981738385062E-12
bank O=0.9999997595471375 ORGANIZATION=1.322685284177659E-7 MISC=1.0531526839802884E-7 PERSON=1.890726223114857E-9 LOCATION=9.788457668668991E-10
So you can see here that the classifier is very, very certain that Lehman Brothers is an organization: each of the two tokens gets ORGANIZATION with probability above 0.999. Other assignments will be less certain.
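If you want to post-process that output rather than eyeball it, here is a minimal Python sketch. It assumes each line looks exactly like the ones shown above (a token followed by whitespace-separated LABEL=probability fields); the function names parse_prob_line and best_label are made up for illustration and are not part of the Stanford NER distribution.

```python
def parse_prob_line(line):
    """Split one line of the form 'token LABEL=prob LABEL=prob ...'.

    Assumes the layout shown in the sample output; returns None for blank lines.
    """
    fields = line.split()
    if not fields:
        return None
    token, probs = fields[0], {}
    for field in fields[1:]:
        label, sep, value = field.rpartition("=")
        if sep:  # ignore any column that is not LABEL=prob
            probs[label] = float(value)
    return token, probs


def best_label(probs):
    """Return the most probable label and its probability for one token."""
    label = max(probs, key=probs.get)
    return label, probs[label]


# One line copied from the sample output above.
sample = ("Lehman O=2.0103282479232179E-7 ORGANIZATION=0.9993291405222918 "
          "MISC=2.3464088532112967E-5 PERSON=7.224010968876396E-5 "
          "LOCATION=5.74954246837179E-4")
token, probs = parse_prob_line(sample)
label, prob = best_label(probs)
print(token, label, prob)
```

Note that this still only gives you per-token probabilities, which is all the command-line option reports.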
To do much more than this, you'd need to work with the API. You can make arbitrary queries to the underlying CRF by using the
method, but you'll need to have read a reasonable amount of this book:
before you're likely to understand how.....
That's what there is now. Internally we do have another option that lets you adjust a bias for recognizing or not recognizing each entity class. But it's not presently in the download.
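In the meantime, one rough workaround for thresholding is to derive an entity-level score from the per-token probabilities yourself. The Python sketch below is only a heuristic and not something the tool does: it groups adjacent tokens that share the same non-O label and scores the group by its least confident token. That minimum is an assumption made for illustration and is not the true probability of the entity under the CRF. The function name entities_with_confidence and the threshold value are likewise hypothetical; the token probabilities are copied from the sample output above.

```python
def entities_with_confidence(tagged, background="O"):
    """Group adjacent tokens with the same non-background label.

    `tagged` is a list of (token, best_label, probability) tuples.
    Each group is scored by the minimum token probability (a heuristic,
    not a real entity-level probability).
    """
    entities = []
    current = None  # [tokens, label, score] of the entity being built
    for token, label, prob in tagged:
        if label == background:
            current = None
            continue
        if current is not None and current[1] == label:
            current[0].append(token)
            current[2] = min(current[2], prob)
        else:
            current = [[token], label, prob]
            entities.append(current)
    return [(" ".join(toks), label, score) for toks, label, score in entities]


# Best label and probability per token, taken from the sample output.
tagged = [
    ("of", "O", 0.9999903961249463),
    ("Lehman", "ORGANIZATION", 0.9993291405222918),
    ("Brothers", "ORGANIZATION", 0.9992345028959849),
    (",", "O", 0.9999999607256321),
]

threshold = 0.99  # hypothetical cutoff; raise it for more precision
kept = [e for e in entities_with_confidence(tagged) if e[2] >= threshold]
print(kept)
```

Two adjacent entities of the same class would be merged by this grouping, since the output shown has no begin/inside markers to tell them apart.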
On Nov 17, 2009, at 2:59 PM, s2009 at fastmail.fm wrote:
> is there a way to provide a confidence score for each entity found in a
> text by the NER tool, so that one could adjust for more or less
> precision by setting a threshold on the score and considering only
> extracted entities that have a score higher than the threshold?
> Thank you.
> Sergio Govoni