[java-nlp-user] begin / end indexes in POS tagger

John Bauer horatio at gmail.com
Thu Apr 7 16:19:50 PDT 2011


Heh...  a few months late, but possibly still helpful.

In the next release of the tagger, the TaggedWords produced by the
tagger will have the begin & end offsets attached whenever possible.

In the meantime, giving the tagger the following option turns off the
americanization:

 -tokenizerOptions americanize=false
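
Until the offsets are attached by the tagger itself, the manual bookkeeping
described in the original question can be sketched roughly as below. This is
only an illustration and not part of the tagger: OffsetAligner, Span and align
are hypothetical names, and it assumes the tokens still match the input text
character for character (that is, americanization is off). A token that cannot
be found in the text gets -1 for both offsets, mirroring the unset values
mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the tagger: maps tokens back to character offsets. */
final class OffsetAligner {

    /** One token with its begin (inclusive) and end (exclusive) character offsets. */
    static final class Span {
        public final String token;
        public final int begin;
        public final int end;

        public Span(String token, int begin, int end) {
            this.token = token;
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Scans the text left to right and looks up each token starting at the
     * end of the previously matched token. Assumes the tokens appear in the
     * text in order and unmodified.
     */
    static List<Span> align(String text, List<String> tokens) {
        List<Span> spans = new ArrayList<>();
        int cursor = 0;
        for (String token : tokens) {
            int begin = text.indexOf(token, cursor);
            if (begin < 0) {
                // The token was rewritten (e.g. americanized), so its
                // position is unknown; leave the cursor where it is.
                spans.add(new Span(token, -1, -1));
                continue;
            }
            int end = begin + token.length();
            spans.add(new Span(token, begin, end));
            cursor = end;
        }
        return spans;
    }
}
```

Calling align with the text "The colour is grey." and the words of the
tagged sentence, in order, gives one Span per word. If the tokenizer has
rewritten a word (for instance "colour" to "color"), that word comes back with
-1 offsets, and a later word may be matched too far ahead if it also occurs
further on in the text, so this is only reliable with americanize=false.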

John

On Tue, Oct 12, 2010 at 12:45 PM, John Bauer <horatio at gmail.com> wrote:
> I know the PTBTokenizer has the capability of putting the offset on a word
> it tokenizes, and TaggedWord has the capability to store it, but it's not
> being set anywhere.  I'll put that on my list of things to look at.
>
> John
>
> On Tue, Oct 12, 2010 at 12:37 PM, John Wiesel <john.wiesel at fu-berlin.de> wrote:
>>
>> Hi John,
>>
>> Yes, exactly. I mean the character offset, which can be available via the
>> beginPosition and endPosition properties but remains -1 when using
>> tokenizeText and tagSentence...
>>
>> Thanks,
>> John
>>
>> On 12.10.2010 19:59, John Bauer wrote:
>> > Hi John,
>> >
>> > When you say indexes, do you mean the character offset in the text, or
>> > do you mean something else?
>> >
>> > -John
>> >
>> > On Tue, Oct 12, 2010 at 6:05 AM, John Wiesel <john.wiesel at fu-berlin.de> wrote:
>> >
>> >     Dear all,
>> >
>> >     I am currently working on an IE project for my diploma thesis and
>> >     would really appreciate your input on a problem that is currently
>> >     driving me nuts.
>> >
>> >     After tokenizing my input with MaxentTagger.tokenizeText and then
>> >     applying tagSentence to obtain the POS tags, the resulting
>> >     TaggedWords do not have begin or end indexes. As a workaround I am
>> >     manually keeping track of the indexes, but because my MaxentTagger
>> >     always americanizes the input, even if I feed it a TaggerConfig at
>> >     initialization, my workaround fails time and again.
>> >
>> >     First, is there a way to obtain the indexes, for example by forcing
>> >     MaxentTagger to use a different Tokenizer, such as one using the
>> >     CoreLabelTokenFactory?
>> >
>> >     If not, is there a way to prevent the default tokenizer that
>> >     MaxentTagger uses from americanizing the input, much like here
>> >     <https://mailman.stanford.edu/pipermail/java-nlp-user/2008-November/000082.html>?
>> >     Like I said, using "-tokenizerOptions","americanize=false," just
>> >     doesn't seem to work.
>> >
>> >     I am using version 3.0 btw.
>> >
>> >     Thanks for your help, I really appreciate it.
>> >
>> >     John
>> >
>> >
>>
>
>