[java-nlp-user] NER and UTF-8

Gerber Daniel dgerber at
Fri May 13 01:15:31 PDT 2011

I ran into a problem regarding UTF-8. I'm querying my Lucene index and trying to NER-tag the results. This works perfectly on my personal laptop (a current MacBook Pro), but when I run the program on the server I get this message for almost every tagged sentence:

Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)

I know that this problem has been discussed before, but those answers didn't help me very much. :(

Kind regards,

My configurations:

Server: Linux 2.6.32-21 / x86_64 / GNU/Linux / Ubuntu 10.04 LTS

Laptop: Mac OS X 10.6.7
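
For anyone comparing the two machines: U+FFFD in the warning above is the Unicode replacement character, which Java substitutes when bytes cannot be decoded in the charset being used. One possible cause (an assumption, not something confirmed in this post) is that the text is decoded with the platform default charset somewhere between the index and the tagger, and that default differs between the laptop and the server. The sketch below is not taken from the program in question; the class and method names are hypothetical, and it needs Java 7 or later for StandardCharsets. It only shows how the same UTF-8 bytes decode cleanly with an explicit UTF-8 charset and turn into U+FFFD under US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Stanford NER or Lucene.
final class ReplacementCharDemo {

    // Decode bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // True if the text contains U+FFFD, the character reported in the warning.
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // The default charset depends on the OS and locale; compare this value on both machines.
        System.out.println("default charset: " + Charset.defaultCharset());

        // "Zürich" written with an escape so the source file encoding does not matter.
        byte[] utf8Bytes = "Z\u00FCrich".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in keeps the text intact.
        System.out.println(hasReplacementChar(decode(utf8Bytes, StandardCharsets.UTF_8)));    // prints false

        // Decoding the same bytes as US-ASCII replaces each non-ASCII byte with U+FFFD.
        System.out.println(hasReplacementChar(decode(utf8Bytes, StandardCharsets.US_ASCII))); // prints true
    }
}
```

If the first line of output differs between the laptop and the server, passing an explicit charset wherever bytes are turned into strings (or starting the JVM with -Dfile.encoding=UTF-8) would be the first thing to try.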
