[liberationtech] Complete GFW Rulebook for Wikipedia

夏楚 summer.agony at
Wed Oct 2 16:36:46 PDT 2013

Thanks Collin for your nice words!

I just uploaded the "rulebook" to Google Spreadsheet at
It can be used in any way you like, but do read the paper
(http://<>) first to avoid misinterpretation.


On Tue, Oct 1, 2013 at 12:45 PM, Collin Anderson
<collin at> wrote:

> Congratulations, this is impressive work. I am also completely jealous --
> a colleague and I will be releasing a similar report for Iran in the
> next two weeks. It is intended as part of a broader global project on
> Wikipedia censorship ({{Citation Filtered}}) that I hope will merge well
> with what you are doing.
> On Mon, Sep 30, 2013 at 7:26 PM, 夏楚 <summer.agony at> wrote:
>> To all,
>> I just finished writing up my research on the GFW (Great Firewall of
>> China) blacklist for Wikipedia. Some of you might find it interesting.
>> The paper can be found at (tweeted here<>).
>> I paste excerpts from the Abstract and Conclusions below.
>> *Abstract*
>> In this report, we detail the *complete* and *exact* rulebook that the
>> Great Firewall of China (GFW) applies to Wikipedia. We call it a "rulebook"
>> (instead of the common term "blacklist") because we identify not only the
>> blacklisted terms, but also the exact string matching rules deployed by
>> the GFW. An efficient probing methodology makes this possible.
>> ...
>> Wikipedia contains millions of pages, e.g. more than 700,000 articles for
>> the Chinese version, and more than 4,240,000 articles for the English
>> version. Testing these pages exhaustively seems a daunting and infeasible
>> task, hence there has been no well-known attempt to gather the
>> complete blacklist.
>> While a small sample of the blacklist is useful, the complete picture
>> can be much more powerful in revealing the inner workings of the GFW and
>> its operators. In this study, we devised a methodology which efficiently
>> examines the entire Wikipedia corpus, hence exposing to the world the
>> complete GFW rulebook for Wikipedia for the first time. In total, there
>> are 919 rules (excluding URL terms) which are applicable to Wikipedia,
>> affecting 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>> The revealed rulebook also demonstrates that the GFW operation is
>> haphazard and ill-maintained. At the same time, the Chinese
>> censorship bureaucracy *intends* to be thorough and extensive.
>> To be precise, the findings in this report are on two Wikipedia
>> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
>> English version.
>> *Concluding Remarks*
>> In this study, we examined the entire Wikipedia corpus (Chinese version
>> and English version) and revealed the complete and exact GFW rulebook for
>> Wikipedia (with caveats described in Section 6).
>> A sample of notable findings:
>>    - There are 78 terms for which the GFW blocks a non-standard variant but
>>    not the canonical path. These are cases where the censors intend to block
>>    a page but the block never takes effect, suggesting the censors have a
>>    poor understanding of Wikipedia's serving system.
>>    - Many obscure non-article pages are blocked, which raises suspicion
>>    that these pages were provided to the censorship bureaucrats by Wikipedia
>>    editors who are very familiar with the content (e.g. those who participated
>>    in the edit wars and/or discussions regarding self-censorship proposals).
>>    - GFW string matching rules have a hard size limit of 64 bytes.
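
The first and last findings above can be illustrated with a minimal sketch. It is not the GFW's actual implementation: it assumes a rule is a plain byte substring matched against the request path, and that a rule longer than 64 bytes is truncated (the excerpt states only that the limit exists, not how it is enforced). The names `make_rule` and `is_blocked`, the page title, and the two URL shapes are hypothetical, chosen for illustration.

```python
from urllib.parse import quote

MAX_RULE_BYTES = 64  # hard size limit reported in the findings above


def make_rule(pattern):
    """Encode a hypothetical rule as bytes.

    Assumption: an over-long rule is truncated to the limit. The source
    only says that the limit exists, not how it is enforced.
    """
    return pattern.encode("utf-8")[:MAX_RULE_BYTES]


def is_blocked(request_path, rules):
    """Return True if any rule occurs as a byte substring of the path."""
    raw = request_path.encode("utf-8")
    return any(rule in raw for rule in rules)


# Hypothetical page title, not taken from the rulebook.
title = "Example_Title"
canonical = "/wiki/" + quote(title)             # canonical path
variant = "/w/index.php?title=" + quote(title)  # non-standard variant

# A rule written against the variant form only.
rules = [make_rule("/w/index.php?title=Example_Title")]

print(is_blocked(variant, rules))    # the variant matches the rule
print(is_blocked(canonical, rules))  # the canonical path does not
```

Under these assumptions, a rule written against the variant form never fires for readers who reach the page through its canonical path, which is the kind of ineffective block the first finding describes.
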
>> The biggest lesson from this study, in my opinion, is that the GFW
>> operation is haphazard and ill-maintained. Also, there are many
>> indications that the GFW operators are somewhat disconnected from the
>> censorship bureaucrats.
>> We hope these findings are of interest to internet censorship watchers,
>> Wikipedia researchers, China observers, and ordinary Chinese citizens.
>> --
>> Xia Chu (Twitter: @summer.agony)
>> --
>> Liberationtech is public & archives are searchable on Google. Violations
>> of list guidelines will get you moderated:
>> Unsubscribe, change to digest, or change password by emailing moderator at
>> companys at
> --
> *Collin David Anderson*
> | @cda | Washington, D.C.

Xia Chu (Twitter: @summer.agony; Google+: