[java-nlp-user] text size limits?

John Bauer horatio at gmail.com
Tue May 17 13:31:09 PDT 2011


Hi Mark,

Quadratic running time isn't great, but you can see how it might
occur, since we compare each part of the document to each other part
of the document.
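
Roughly, it boils down to comparing every mention with every earlier
mention, so the number of pairwise checks grows with the square of the
mention count.  A toy back-of-the-envelope sketch (not the actual dcoref
code; "pairs" is just an illustration):

    // pairwise mention comparisons for a document with n mentions
    static long pairs(int n) {
        return (long) n * (n - 1) / 2;  // each mention vs. every earlier one
    }
    // pairs(10000) ->  ~50 million checks
    // pairs(20000) -> ~200 million checks (double the mentions, ~4x the work)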

I found that running on the book you sent me took about 2 days and 8 GB
of memory.  When I have some time, I'll see if I can figure out which
coref sieve is taking the most time and try to reduce the constant
factors some.  This won't be any time soon, though.

John

On Tue, May 17, 2011 at 10:26 AM, Mark Hansel <hansel at mnstate.edu> wrote:
> John,
>
> CoreNLP has been running on the Christie novel for 1300 minutes and is still
> running. Without coref, it took just over 18 minutes, over several runs (v
> 1.0.2/3). Quadratic time .... A quadratic curve can approach vertical!
>
> There are strange constructions in the text (e.g., use of ':' as a
> terminator). These produce unexpected "sentences," but they seem a
> reasonable resolution to the irregular grammar (and I can work with the
> results). There are also many British spellings (many corrected by me --
> "standardized" -- but a few missed). However, CoreNLP seems to catch these
> (lemma tags). In any case, it is hard to believe that either of these
> quirks is even relevant.
>
> If you think this is of broader interest, I can direct emails through the
> list. My interest is not NLP, but crime "mentality" as revealed in popular
> culture, most likely a series of Gothic novels (which have a longer history
> than crime fiction, per se -- say mid 19th C). The Christie novel is a
> convenient development tool and good NLP tools help avoid an otherwise
> prohibitive coding effort.
>
> The methods I will be using are highly modified text-mining methods (no
> document-term matrix (DTM), but a series of sentence-term matrices (STMs)
> for important book characters). I suspect there is little substantive overlap
> with list members.
>
>
> Thanks again.
>
> Mark Hansel
> Emeritus Professor of Sociology and Criminal Justice
> Minnesota State University Moorhead
> hansel at mnstate.edu
> hansel at hanselshire.org
>
>
> On Mon, 16 May 2011, John Bauer wrote:
>
>> We just released a new version of CoreNLP.  I *think* it will handle
>> the file you sent me.  I started a test over a day ago, and although
>> it hasn't finished, it hasn't run out of memory yet either.
>>
>> John
>>
>> On Tue, Apr 5, 2011 at 11:23 AM, John Bauer <horatio at gmail.com> wrote:
>>>
>>> No promises on when it will be available, but we do hope to change the
>>> annotation stored for the coref.  That should help with the memory
>>> requirements.  If there's something else to work on first, it might be
>>> worth doing that for a month or two.
>>>
>>> John
>>>
>>> On Apr 4, 2011 6:13 PM, "Mark Hansel" <hansel at mnstate.edu> wrote:
>>>>
>>>> Using the full novel as input, with 12 GB of additional memory (18 GB
>>>> total), I ran CoreNLP with a 12 GB and then a 17 GB maximum heap. In both
>>>> instances, "top" reported memory stabilized at about 4 GB after running
>>>> overnight. About 18 hours into both runs, two things occurred: a)
>>>> reported memory use declined from 4 GB to (at last look) 2.1 GB/1.9 GB (as
>>>> reported by "top") and b) the System Monitor reported that 99.9% of RAM
>>>> and 100% of swap memory were in use. Apparently, a daemon controlled much
>>>> of the RAM (gvfs -- 14.2 GB).
>>>>
>>>> I am not ready to give up.
>>>>
>>>> (1) Does the stand-alone dcoref work any differently from
>>>> the version integrated into CoreNLP? It would take only trivial
>>>> changes in my software to work with two output files,
>>>> particularly as the coreference report can stand alone, even if
>>>> the markup differs.
>>>>
>>>> (2) If stand-alone coref won't work on my files, I can still work
>>>> with split files, though I don't yet know the limits. As my primary
>>>> initial interest is persons, this would require revisiting
>>>> pronouns.
>>>>
>>>> (3) Part of my project (one that should come last) does
>>>> not require NLP. I can buy significant time by moving that
>>>> forward.
>>>>
>>>> Any insight you can provide would help a lot.
>>>>
>>>> BTW, running the full file without the dcoref annotator completes in 18
>>>> minutes.
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Mark Hansel
>>>> hansel at mnstate.edu
>>>>
>>>> On Tue, 22 Mar 2011, John Bauer wrote:
>>>>
>>>>> It looks like you will need at least 6G to do this file. That's how
>>>>> much memory it's taking on one of our machines right now...
>>>>> Unfortunately, since it uses the whole document to do coref, there's
>>>>> no simple way to reduce that number without changing the code itself.
>>>>>
>>>>> I consulted our coref experts, and the running time is quadratic in
>>>>> the number of mentions. An approximation for this is document length,
>>>>> so if you take X hours to do half of the file, it should take about 4X
>>>>> hours to do the whole file. All bets are off if you don't give it
>>>>> enough memory, though. It sometimes takes a long time to notice it's
>>>>> out of memory.
>>>>>
>>>>> John
>>>>>
>>>>> On Mon, Mar 21, 2011 at 2:40 PM, Mark Hansel <hansel at mnstate.edu>
>>>>> wrote:
>>>>>>
>>>>>> File is attached.
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>> Mark Hansel
>>>>>> Emeritus Professor of Sociology and Criminal Justice
>>>>>> Minnesota State University Moorhead
>>>>>> hansel at mnstate.edu
>>>>>>
>>>>>>
>>>>>> On Mon, 21 Mar 2011, John Bauer wrote:
>>>>>>
>>>>>>> Mind sending me the file you're using?  I'll run it on one of our
>>>>>>> large memory machines and see if that's the problem or not.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>> On Mon, Mar 21, 2011 at 2:13 PM, Mark Hansel <hansel at mnstate.edu>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I raised the memory limit to 5 GB and let CoreNLP run for about 8
>>>>>>>> hours before killing the job, after checking the output file. (I have
>>>>>>>> been assuming that run time is non-linear, but that the curve is not
>>>>>>>> exponential.)
>>>>>>>>
>>>>>>>> I want to use this tool, as it saves me an immense amount of tool
>>>>>>>> building, and other tools I have explored (e.g., GATE) present similar
>>>>>>>> (and some unique) challenges. My options appear to be:
>>>>>>>>
>>>>>>>>        1. Add memory (feasible and likely).
>>>>>>>>        2. Chunk large texts and
>>>>>>>>                a. piece together the coreference chain (challenging)
>>>>>>>>                b. forsake coreference reassembly
>>>>>>>>        3. Suggestions?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Mark Hansel
>>>>>>>> Emeritus Professor of Sociology and Criminal Justice
>>>>>>>> Minnesota State University Moorhead
>>>>>>>> hansel at mnstate.edu
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, 18 Mar 2011, John Bauer wrote:
>>>>>>>>
>>>>>>>>> There's no reason it shouldn't work for the novel.  That said, if you
>>>>>>>>> got an OOM exception, it's possible it was silently caught and the JVM
>>>>>>>>> just gave up.  You could try giving it even more memory and see if
>>>>>>>>> that helps, since it sounds like you have the memory available.
>>>>>>>>>
>>>>>>>>> The extremely long delay is an annoying side effect I've seen when
>>>>>>>>> processing large memory jobs... it garbage collects over and over
>>>>>>>>> before deciding it can't continue.
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>> On Fri, Mar 18, 2011 at 6:00 AM, Mark Hansel <hansel at mnstate.edu>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> How large a text will stanford-corenlp process? After almost 39
>>>>>>>>>> hours, the process terminated with no error messages but an empty
>>>>>>>>>> XML file (I have found no log file).
>>>>>>>>>>
>>>>>>>>>> I fed CoreNLP (with coreference resolution) a complete novel of
>>>>>>>>>> ~10K lines, 76,000+ words, and about 425K characters. Splitting the
>>>>>>>>>> novel in two produces usable output (in reasonable time -- ~1 hour).
>>>>>>>>>>
>>>>>>>>>> Some details: I set -Xmx to 4g. It uses 3.2 GB of RAM and about
>>>>>>>>>> 4400 MB of virtual memory. I made only minor modifications to the
>>>>>>>>>> suggested command line in the CoreNLP distribution (file list,
>>>>>>>>>> output directory). (The workstation is sufficient for the task --
>>>>>>>>>> 6 GB RAM, multiple cores.)
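>>>>>>>>>>
>>>>>>>>>> For reference, the invocation was essentially the suggested one with
>>>>>>>>>> a larger heap (jar and file names below are placeholders):
>>>>>>>>>>
>>>>>>>>>>   java -Xmx4g -cp stanford-corenlp.jar:stanford-corenlp-models.jar \
>>>>>>>>>>     edu.stanford.nlp.pipeline.StanfordCoreNLP \
>>>>>>>>>>     -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
>>>>>>>>>>     -filelist files.txt -outputDirectory out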
>>>>>>>>>>
>>>>>>>>>> Without coreference resolution, the software offers me little
>>>>>>>>>> gain.
>>>>>>>>>>
>>>>>>>>>> Any suggestions? (I am not a Java native; I am fluent in Fortran
>>>>>>>>>> and C. For this project, C.)
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>>
>>>>>>>>>> Mark Hansel
>>>>>>>>>> Emeritus Professor of Sociology and Criminal Justice
>>>>>>>>>> Minnesota State University Moorhead
>>>>>>>>>> hansel at mnstate.edu
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>


