[java-nlp-user] begin / end indexes in POS tagger
horatio at gmail.com
Thu Apr 7 16:19:50 PDT 2011
Heh... a few months late, but possibly still helpful.
In the next release of the tagger, the TaggedWords produced by the
tagger will have the begin & end offsets attached whenever possible.
In the meantime, giving the tagger the tokenizer option
"americanize=false" should turn off the americanization.
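If driving the tokenizer directly is an option, the offsets can be obtained today by building the tokens with a CoreLabelTokenFactory. A rough sketch against the CoreNLP 3.x API (the input text is illustrative, and the CoreNLP jar must be on the classpath):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class OffsetDemo {
  public static void main(String[] args) {
    String text = "The colour of the labour movement.";
    // americanize=false keeps British spellings intact;
    // invertible=true makes the tokenizer record character
    // offsets on each CoreLabel it produces.
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
        new StringReader(text),
        new CoreLabelTokenFactory(),
        "americanize=false,invertible=true");
    List<CoreLabel> tokens = tokenizer.tokenize();
    for (CoreLabel tok : tokens) {
      System.out.printf("%s [%d,%d)%n",
          tok.word(), tok.beginPosition(), tok.endPosition());
    }
    // Since CoreLabel implements HasWord, the token list can then
    // be handed to MaxentTagger.tagSentence(...) for POS tagging.
  }
}
```
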
On Tue, Oct 12, 2010 at 12:45 PM, John Bauer <horatio at gmail.com> wrote:
> I know the PTBTokenizer has the capability of putting the offset on a word
> it tokenizes, and TaggedWord has the capability to store it, but it's not
> being set anywhere. I'll put that on my list of things to look at.
> On Tue, Oct 12, 2010 at 12:37 PM, John Wiesel <john.wiesel at fu-berlin.de> wrote:
>> Hi John,
>> Yes, exactly. I mean the character offset, which should be available via
>> the beginPosition and endPosition properties but remains -1 when using
>> tokenizeText and tagSentence.
>> Am 12.10.2010 19:59, schrieb John Bauer:
>> > Hi John,
>> > When you say indexes, do you mean the character offset in the text, or
>> > do you mean something else?
>> > -John
>> > On Tue, Oct 12, 2010 at 6:05 AM, John Wiesel
>> > <john.wiesel at fu-berlin.de> wrote:
>> > Dear all,
>> > I am currently working on an IE project for my diploma thesis and
>> > would really appreciate your input
>> > on a problem that currently drives me nuts.
>> > After tokenizing my input using the MaxentTagger.tokenizeText and
>> > then applying tagSentence to
>> > obtain the POS tags, the resulting TaggedWords do not have begin or
>> > end indexes... As a workaround I
>> > am manually keeping track of the indexes, but because my MaxentTagger
>> > always americanizes the input, even if I feed it a TaggerConfig at
>> > initialization, my workaround fails time and again.
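For what it's worth, that kind of manual bookkeeping is easier to keep correct if it searches the original string rather than re-joining tokens. A minimal sketch in plain Java (the helper name is made up; no CoreNLP involved):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ManualOffsets {
  // Finds each token's [begin, end) span by scanning forward through
  // the original text, so tokens that still occur verbatim are located
  // at their true character offsets.
  public static List<int[]> spans(String text, List<String> tokens) {
    List<int[]> result = new ArrayList<>();
    int cursor = 0;
    for (String tok : tokens) {
      int begin = text.indexOf(tok, cursor);
      if (begin < 0) {
        // Token was rewritten (e.g. americanized) and no longer occurs
        // verbatim in the text; record an unknown span rather than fail.
        result.add(new int[] {-1, -1});
        continue;
      }
      int end = begin + tok.length();
      result.add(new int[] {begin, end});
      cursor = end;
    }
    return result;
  }

  public static void main(String[] args) {
    for (int[] span : spans("a bb  ccc", Arrays.asList("a", "bb", "ccc"))) {
      System.out.printf("[%d,%d)%n", span[0], span[1]);
    }
  }
}
```

Note this only sidesteps, rather than solves, the americanization problem: a rewritten token simply comes back with no span.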
>> > First, is there a way of obtaining the indexes, e.g. by forcing
>> > MaxentTagger to use a different tokenizer, such as one using the
>> > CoreLabelTokenFactory?
>> > If not, is there a way of preventing the default tokenizer the
>> > MaxentTagger uses from americanizing
>> > the input, much like here
>> > <https://mailman.stanford.edu/pipermail/java-nlp-user/2008-November/000082.html>?
>> > Like I said, using
>> > "-tokenizerOptions","americanize=false," just doesn't seem to work.
>> > I am using version 3.0 btw.
>> > Thanks for your help, I really appreciate it.
>> > John
>> > _______________________________________________
>> > java-nlp-user mailing list
>> > java-nlp-user at lists.stanford.edu
>> > https://mailman.stanford.edu/mailman/listinfo/java-nlp-user