[java-nlp-user] Stanford NER: confidence scores

Christopher Manning manning at stanford.edu
Tue Nov 17 17:29:51 PST 2009


Yes and no.

Among the options available at the command line, there are a few that dump out probabilities.  But they give the probability of the class assignment for individual tokens and token pairs, not for full entities.  Also, as the code is written now, these options only work with -testFile, not -textFile (for no very good reason).  Still, they give some idea.

Using the provided sample document, you can get the following:

java edu.stanford.nlp.process.PTBTokenizer sample.txt > sample.test
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /u/nlp/data/ner/goodClassifiers/conll.distsim.crf.ser.gz -testFile sample.test -printProbs > sample.print
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /u/nlp/data/ner/goodClassifiers/conll.distsim.crf.ser.gz -testFile sample.test -printFirstOrderProbs > sample.print1st

[manning at jamie ~]$ head sample.print 
The	O=0.99999998135786	ORGANIZATION=2.6567466928612085E-9	MISC=1.356206662374912E-8	PERSON=8.176353978988214E-11	LOCATION=2.3417217380971917E-9
fate	O=0.9999999582736345	ORGANIZATION=4.1700953530777644E-8	MISC=3.3655569957340776E-12	PERSON=2.143502478107925E-11	LOCATION=7.103478429908953E-13
of	O=0.9999903961249463	ORGANIZATION=9.602401442223853E-6	MISC=1.1608074462933527E-9	PERSON=1.5819472785608723E-10	LOCATION=1.5487870335293751E-10
Lehman	O=2.0103282479232179E-7	ORGANIZATION=0.9993291405222918	MISC=2.3464088532112967E-5	PERSON=7.224010968876396E-5	LOCATION=5.74954246837179E-4
Brothers	O=9.86544119657211E-5	ORGANIZATION=0.9992345028959849	MISC=2.2664807695279273E-5	PERSON=7.36669467427115E-5	LOCATION=5.705109379181266E-4
,	O=0.9999999607256321	ORGANIZATION=3.9139979344041836E-8	MISC=6.337165183393874E-11	PERSON=5.023186106482568E-11	LOCATION=2.1323493706861942E-11
the	O=0.9999999999383817	ORGANIZATION=1.625108900198471E-13	MISC=2.5300939551568E-12	PERSON=1.5361712666066918E-11	LOCATION=4.410280856766081E-11
beleaguered	O=0.9999999881256373	ORGANIZATION=1.832642399224178E-12	MISC=1.1871420471150292E-8	PERSON=6.821695803231142E-13	LOCATION=1.1632470059359709E-12
investment	O=0.9999999985357135	ORGANIZATION=1.4523487401949386E-9	MISC=1.1119518225092833E-11	PERSON=3.0957335968364846E-13	LOCATION=1.0237981738385062E-12
bank	O=0.9999997595471375	ORGANIZATION=1.322685284177659E-7	MISC=1.0531526839802884E-7	PERSON=1.890726223114857E-9	LOCATION=9.788457668668991E-10

So you can see here that the classifier is very, very certain (prob > 0.999 for each of the two tokens) that Lehman and Brothers belong to an organization.  Other assignments will be less certain.
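As a rough way to get the thresholding asked about in the original question, one could post-process the -printProbs output.  The sketch below is not part of Stanford NER; the function names and the scoring rule are made up for illustration.  It assumes the tab-separated "token, then LABEL=prob fields" layout shown above, groups adjacent tokens that share the same most probable non-O label, and scores each group by the smallest of its token probabilities.  That score is only a heuristic built from token marginals, not a true probability for the full entity.

```python
from typing import Dict, Iterable, List, Tuple


def parse_print_probs_line(line: str) -> Tuple[str, Dict[str, float]]:
    """Split one tab-separated output line into (token, {label: probability})."""
    token, *fields = line.rstrip("\n").split("\t")
    probs = {}
    for field in fields:
        # Split on the last '=' so the label itself may contain '='.
        label, _, value = field.rpartition("=")
        probs[label] = float(value)
    return token, probs


def extract_entities(
    lines: Iterable[str], threshold: float = 0.9, background: str = "O"
) -> List[Tuple[str, str, float]]:
    """Return (text, label, score) for spans whose score is >= threshold.

    Assumption: a span is a run of adjacent tokens with the same best
    non-background label, so two adjacent entities of one class are merged.
    The score is the minimum token marginal in the span (a heuristic).
    """
    spans = []      # each span is a mutable [tokens, label, score]
    current = None  # the span being extended, if any
    for line in lines:
        token, probs = parse_print_probs_line(line) if line.strip() else ("", {})
        if not probs:
            # Blank or malformed line: treat it as a span boundary.
            current = None
            continue
        label = max(probs, key=probs.get)
        score = probs[label]
        if label == background:
            current = None
        elif current is not None and current[1] == label:
            current[0].append(token)
            current[2] = min(current[2], score)
        else:
            current = [[token], label, score]
            spans.append(current)
    return [(" ".join(toks), lab, sc) for toks, lab, sc in spans if sc >= threshold]


if __name__ == "__main__":
    # sample.print is the file produced by the -printProbs command above.
    with open("sample.print", encoding="utf-8") as handle:
        for text, label, score in extract_entities(handle, threshold=0.9):
            print(f"{text}\t{label}\t{score:.4f}")
```

Raising the threshold trades recall for precision; taking the minimum is a deliberately conservative choice, and a product or mean of the token probabilities would be an equally ad hoc alternative.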

To do much more than this, you'd need to work with the API.  You can make arbitrary queries to the underlying CRF by using the

	getCliqueTrees()

method, but you'll need to have read a reasonable amount of this book:

	http://www.amazon.com/Probabilistic-Graphical-Models-Principles-Computation/dp/0262013193/

before you're likely to understand how.....

That's what there is now.  Internally we do have another option that lets you adjust a bias for recognizing or not recognizing each entity class.  But it's not presently in the download.

Chris.


On Nov 17, 2009, at 2:59 PM, s2009 at fastmail.fm wrote:

> Hello,
> 
> is there a way to provide a confidence score for each entity found in a
> text by the NER tool, so that one could adjust for more or less
> precision by setting a threshold on the score and considering only
> extracted entities that have a score higher than the threshold?
> 
> Thank you.
> 
> Sergio Govoni



