[java-nlp-user] How to train chinese word with Stanford segmenter

John Bauer horatio at
Mon May 2 11:58:25 PDT 2011

You need dictionaries: files full of words, probably one word on each
line.  I don't think we can supply those, but here's one place you
might consider looking:  If you download
these files, you will obviously have to postprocess them some yourself
to get them in the one-word-per-line format.  If you have multiple
dictionaries, you can specify them as a comma-separated list of
files.
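
As a rough illustration of that postprocessing step, here is a minimal
Python sketch.  It assumes each line of a downloaded file starts with the
word, followed by whitespace and other fields (frequency, tag, etc.); that
layout is a guess, and the function names are made up for this example, so
adjust them to whatever your files actually look like.

```python
from pathlib import Path


def extract_words(lines):
    """Return the first whitespace-delimited token of each non-empty line,
    dropping duplicates while keeping first-seen order.

    ASSUMPTION: the word is the first field on each line.
    """
    seen = set()
    words = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def convert_dictionary(src, dst, encoding="utf-8"):
    """Rewrite the file at src into one-word-per-line format at dst."""
    lines = Path(src).read_text(encoding=encoding).splitlines()
    words = extract_words(lines)
    Path(dst).write_text("\n".join(words) + "\n", encoding=encoding)


def dictionary_value(paths):
    """Join several dictionary files into a comma-separated list."""
    return ",".join(str(p) for p in paths)
```

For example, dictionary_value(["dict1.txt", "dict2.txt"]) gives
"dict1.txt,dict2.txt", which is the form a multi-dictionary setting would
take.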

You also need a properties file.  For example, you can start from the
one attached (pk-chris3.prop); you will have to change the paths for
"trainFile" and "dictionary".  You can experiment with changing other
settings to get different training behavior.
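
If you would rather script the path changes than edit the file by hand,
something like the following Python sketch could work.  It assumes the
properties file uses simple "key = value" lines; the helper name and the
example paths are hypothetical, and only the "trainFile" and "dictionary"
keys come from the properties file mentioned above.

```python
def set_properties(text, updates):
    """Return the properties text with the given keys set to new values.

    ASSUMPTION: entries look like "key = value" or "key=value".
    Comment lines (starting with # or !) are left untouched, and keys
    not already present are appended at the end.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_entry = (
            stripped
            and not stripped.startswith(("#", "!"))
            and "=" in stripped
        )
        if is_entry:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                line = f"{key} = {remaining.pop(key)}"
        out.append(line)
    for key, value in remaining.items():
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own corpus and dictionaries.
    with open("pk-chris3.prop", encoding="utf-8") as f:
        original = f.read()
    updated = set_properties(original, {
        "trainFile": "/path/to/segmented-corpus.txt",
        "dictionary": "/path/to/dict1.txt,/path/to/dict2.txt",
    })
    with open("my-segmenter.prop", "w", encoding="utf-8") as f:
        f.write(updated)
```

Writing to a new file rather than overwriting pk-chris3.prop keeps the
original settings around for comparison.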

Then, you just run it with

java6 -mx12g -prop
pk-chris3.prop -serializeTo foo.ser.gz

You can put the -serializeTo flag in the properties file if you want,
or you can specify other flags on the command line.  Also, you really
do need a ton of memory.  It might not be 12g, but it will be a lot.

I don't think we use POS as a feature.  You can imagine it would be
hard to tag something if you don't even know how to separate it into
words.
I believe dict-chris6.ser.gz is a serialization of some of the
dictionaries we've used, but I'm not 100% sure of that, and I don't
know how to use it.

This information and the package currently available should get you
started.  We actually are going to update the distribution soon, by
the end of the month, but the models won't substantially change.


On Mon, May 2, 2011 at 3:25 AM, Rueshyna <rueshyna at> wrote:
> hi, everyone
> I want to train a new Chinese dictionary that helps me to segment Chinese
> sentences with white space.
> In other words, all the sentence in my corpus are well-segmented, for
> instance, the following sentences:
> 例如 , 用具 有 广谱 抗微生物活性 的 聚 腈基 丙烯酸酯 膜覆盖 皮肤表面 的 不可 缝合 性 小 伤口 将会 减弱 伤口感染 的 可能 。
> 第一 , 抗微生物剂 在 腈基 丙烯酸酯组合物 内 必须 是 可溶 或 可分散的 , 其 浓度 需要 达到 能 产生 抗微生物性质 。
> I downloaded the example for training the segmenter and tried to
> understand it.
> Why does it use POS as a feature?
> I don't want to train a dictionary with POS feature.
> I saw features for the current character, previous character, next character
> and the conjunction of previous and current in the paper[1].
> How do I use it?
> What is the "dict-chris6.ser.gz"??????
> How do I use it with training dictionary?
> Thanks!
> [1]    P.-C. Chang, et al., "Optimizing Chinese word segmentation for
> machine translation performance," presented at the Proceedings of the Third
> Workshop on Statistical Machine Translation, Columbus, Ohio, 2008.
> by Rueshyna
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pk-chris3.prop
Type: application/octet-stream
Size: 1969 bytes
Desc: not available
URL: <>
