[java-nlp-user] text size limits?

John Bauer horatio at
Tue Apr 5 11:23:44 PDT 2011

No promises on when it will be available, but we do hope to change the
annotation stored for the coref.  That should help with the memory
requirements.  If there's something else to work on first, it might be
worth doing that for a month or two.

On Apr 4, 2011 6:13 PM, "Mark Hansel" <hansel at> wrote:
> Using the full novel as input, with 12G additional memory (18 total), I
> ran the core-nlp with 12GB and then again with 17GB maximum memory. In both
> instances, "top" reported that memory stabilized at about 4G after running
> overnight. About 18 hours into both runs, two things occurred: a)
> reported memory use declined from 4GB to (at last look) 2.1G/1.9GB (as
> reported by "top") and b) the System Monitor reported that 99.9% of RAM
> and 100% of swap memory were in use. Apparently, a daemon controlled much
> of RAM (gvfs -- 14.2GB).
> I am not ready to give up.
> (1) Does the stand-alone dcoref work any differently than
> the version integrated into the core-nlp? It takes trivial
> changes in my software to work with two output files,
> particularly as the coreference report can stand alone, even if
> markup differs.
> (2) If stand-alone coref won't work on my files, I can still work
> with split files, though I don't yet know the limits. As my primary
> initial interest is persons, this would require revisiting
> pronouns.
> (3) Part of my project (one that should come last) does
> not require NLP. I can buy significant time by moving that
> forward.
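For option (2) above, a minimal sketch of splitting a long text at paragraph boundaries before feeding it to the pipeline. The chunk size and the blank-line paragraph convention are assumptions for illustration, not part of CoreNLP; reassembling the coreference chains across chunks is the hard part this sketch does not address.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a text into chunks of roughly targetChars characters, breaking only
// at blank-line paragraph boundaries so no sentence is cut mid-stream.
// Each chunk can then be processed separately with a smaller heap.
public class ParagraphChunker {
    public static List<String> chunk(String text, int targetChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String para : text.split("\n\n")) {
            // Flush the current chunk before it would exceed the target size.
            if (current.length() > 0 && current.length() + para.length() > targetChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append("\n\n");
            current.append(para);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }
}
```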
> Any insight you can provide would help a lot.
> BTW, running the full file without the dcoref annotator completes in 18
> minutes.
> Thank you.
> Mark Hansel
> hansel at
> On Tue, 22 Mar 2011, John Bauer wrote:
>> It looks like you will need at least 6G to do this file. That's how
>> much memory it's taking on one of our machines right now...
>> Unfortunately, since it uses the whole document to do coref, there's
>> no simple way to reduce that number without changing the code itself.
>> I consulted our coref experts, and the running time is quadratic in
>> the number of mentions. An approximation for this is document length,
>> so if you take X hours to do half of the file, it should take about 4X
>> hours to do the whole file. All bets are off if you don't give it
>> enough memory, though. It sometimes takes a long time to notice it's
>> out of memory.
>> John
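The quadratic scaling John describes can be put into a back-of-the-envelope calculation. This is a hedged sketch: the model (time proportional to length squared) comes from the email above, but the class, method, and any constant are illustrative, and only the ratio between two runs matters.

```java
// Illustrative estimate under the quadratic-in-mentions model, using
// document length as a stand-in for mention count: time ~ k * n^2.
public class CorefTimeEstimate {
    // Given the observed hours for a document of baseLen tokens,
    // estimate hours for a document of targetLen tokens.
    public static double estimateHours(double observedHours, double baseLen, double targetLen) {
        double ratio = targetLen / baseLen;
        return observedHours * ratio * ratio; // quadratic scaling
    }

    public static void main(String[] args) {
        // If half the novel takes 1 hour, the whole novel should take
        // about (2)^2 = 4 hours.
        System.out.println(estimateHours(1.0, 38000, 76000));
    }
}
```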
>> On Mon, Mar 21, 2011 at 2:40 PM, Mark Hansel <hansel at> wrote:
>>> File is attached.
>>> Thanks a lot.
>>> Mark Hansel
>>> Emeritus Professor of Sociology and Criminal Justice
>>> Minnesota State University Moorhead
>>> hansel at
>>> On Mon, 21 Mar 2011, John Bauer wrote:
>>>> Mind sending me the file you're using?  I'll run it on one of our
>>>> large memory machines and see if that's the problem or not.
>>>> Thanks,
>>>> John
>>>> On Mon, Mar 21, 2011 at 2:13 PM, Mark Hansel <hansel at>
>>>>> I raised the memory limit to 5GB and let corenlp run for about 8 hours
>>>>> before killing the job, after checking the output file. (I have been
>>>>> assuming that run time is non-linear, but that the curve is not
>>>>> exponential.)
>>>>> I want to use this tool, as it saves me immense tool building, and the
>>>>> tools I have explored (e.g., Gate) present similar (and some unique)
>>>>> challenges. My options appear to be:
>>>>>        1. Add memory (feasible and likely).
>>>>>        2. Chunk large texts and
>>>>>                a. piece together the coreference chain (challenging)
>>>>>                b. forsake coreference reassembly
>>>>>        3. Suggestions?
>>>>> Thank you,
>>>>> Mark Hansel
>>>>> Emeritus Professor of Sociology and Criminal Justice
>>>>> Minnesota State University Moorhead
>>>>> hansel at
>>>>> On Fri, 18 Mar 2011, John Bauer wrote:
>>>>>> There's no reason it shouldn't work for the novel.  Conversely, if it
>>>>>> got an OOM exception, it's possible it was silently caught and the
>>>>>> program just gave up.  You could try giving it even more memory and see if
>>>>>> that helps, since it sounds like you have the memory available.
>>>>>> The extremely long delay is an annoying side effect I've seen when
>>>>>> processing large memory jobs... it garbage collects over and over
>>>>>> before deciding it can't continue.
>>>>>> John
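The "silently caught" failure mode John mentions can be reproduced with a synthetic OutOfMemoryError. This is an illustration of the mechanism, not CoreNLP's actual code: OutOfMemoryError extends Error rather than Exception, so an overly broad catch block can swallow it and the run appears to finish with empty output.

```java
// Demonstrates how an OutOfMemoryError can vanish: a catch (Throwable)
// block intercepts it, and if that block merely returns a default value,
// the caller never sees a stack trace -- just an empty result.
public class SwallowedOom {
    public static String process() {
        try {
            // Stand-in for a real allocation failure deep in a pipeline.
            throw new OutOfMemoryError("Java heap space");
        } catch (Throwable t) {
            // Silently giving up: no log, no rethrow.
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println("output length = " + process().length());
    }
}
```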
>>>>>> On Fri, Mar 18, 2011 at 6:00 AM, Mark Hansel <hansel at>
>>>>>>> How large a text will the stanford-corenlp process? After almost 39
>>>>>>> hours, the process terminated with no error messages, but an empty xml
>>>>>>> file (I have found no log file).
>>>>>>> I fed corenlp (with coreference resolution) a complete novel: 76000+
>>>>>>> words and about 425K characters. Splitting the novel in two produces
>>>>>>> usable output (in reasonable time, ~1H).
>>>>>>> Some details: I set -Xmx to 4g. It uses 3.2G of RAM and some amount
>>>>>>> of virtual memory. I made only minor modifications to the suggested
>>>>>>> command line in the corenlp distribution (file list, output directory).
>>>>>>> (Workstation is sufficient for the task - 6GB RAM, multiple cores.)
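Given the confusion over heap flags in this thread, a quick generic check (plain stdlib, not a CoreNLP feature) confirms what heap ceiling the JVM actually received. A much smaller number than expected usually means the flag was mistyped or ignored.

```java
// Prints the heap ceiling the running JVM actually got. If -Xmx4g was
// applied, maxMemory() should report close to 4 GB.
public class HeapCheck {
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("max heap ~ " + maxHeapMb() + " MB");
    }
}
```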
>>>>>>> Without coreference resolution, the software offers me little gain.
>>>>>>> Any suggestions? (I am not a Java native; I am fluent in Fortran and
>>>>>>> C. For this project, C.)
>>>>>>> Thank you,
>>>>>>> Mark Hansel
>>>>>>> Emeritus Professor of Sociology and Criminal Justice
>>>>>>> Minnesota State University Moorhead
>>>>>>> hansel at
>>>>>>> _______________________________________________
>>>>>>> java-nlp-user mailing list
>>>>>>> java-nlp-user at