[liberationtech] Complete GFW Rulebook for Wikipedia
summer.agony at gmail.com
Mon Sep 30 16:26:01 PDT 2013
I just finished writing up my research on the GFW (Great Firewall of China)
blacklist for Wikipedia. Some of you might find it interesting.
The paper can be found at goo.gl/RnMvG1 (tweeted
Below I paste excerpts from the Abstract and Conclusions.
In this report, we detail the *complete* and *exact* rulebook that the
Great Firewall of China (GFW) applies to Wikipedia. We call it a "rulebook"
(instead of the common term "blacklist") because we identify not only the
blacklisted terms, but also the exact string matching rules deployed by
GFW. An efficient probing methodology makes this possible.
Wikipedia contains millions of pages, e.g. more than 700,000 articles for
the Chinese version, and more than 4,240,000 articles for the English
version. Testing these pages exhaustively seems a daunting and infeasible
task, hence there has been no well-known attempt to gather the
While a small sample of the blacklist is useful, the complete picture
can be much more powerful in revealing the inner workings of GFW and
its operators. In this study, we devised a methodology which efficiently
examines the entire Wikipedia corpus, exposing to the world the complete
GFW rulebook for Wikipedia for the first time. In total, there are 919
rules (excluding URL terms) which are applicable to Wikipedia, affecting
5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
The revealed rulebook also demonstrates that the GFW operation is
haphazard and ill-maintained. At the same time, the Chinese
censorship bureaucracy *intends* to be thorough and extensive.
To be precise, the findings in this report are based on two Wikipedia
snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
English version.
In this study, we examined the entire Wikipedia corpus (Chinese version
and English version) and revealed the complete and exact GFW rulebook for
Wikipedia (with caveats described in Section 6).
Some notable findings are:
- There are 78 terms for which GFW blocks a non-standard variant but not
the canonical path. These are pages the censors intend to block, but the
block does not actually take effect, suggesting the censors have a poor
understanding of Wikipedia's serving system.
- Many obscure non-article pages are blocked, which raises suspicion
that these pages were provided to the censorship bureaucrats by Wikipedia
editors who are very familiar with the content (e.g. those who participated
in the edit wars and/or discussions regarding self-censorship proposals).
- GFW string matching rules have a hard size limit of 64 bytes.
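
To make the first and last findings concrete, here is a minimal sketch of
how a substring rule written against a variant path can miss the canonical
path. Everything in it is an assumption for illustration: the function
names, the page title "Example_Title", the choice of "/zh-cn/" as the
non-standard variant, and the idea that the 64-byte limit is counted in
UTF-8 bytes. It is not the GFW's actual matcher, and it is not taken from
the paper.

```python
MAX_RULE_BYTES = 64  # hard limit on rule size reported in the findings above


def make_rule(pattern: str) -> bytes:
    """Encode a hypothetical substring rule, rejecting anything over the cap.

    Assumption: the limit is counted in UTF-8 bytes of the pattern.
    """
    raw = pattern.encode("utf-8")
    if len(raw) > MAX_RULE_BYTES:
        raise ValueError(f"rule is {len(raw)} bytes, limit is {MAX_RULE_BYTES}")
    return raw


def is_blocked(url: str, rules: list[bytes]) -> bool:
    """Return True if any rule occurs as a byte substring of the URL."""
    data = url.encode("utf-8")
    return any(rule in data for rule in rules)


# Hypothetical rule written against a variant path, not the canonical one.
rules = [make_rule("zh.wikipedia.org/zh-cn/Example_Title")]

variant_url = "zh.wikipedia.org/zh-cn/Example_Title"
canonical_url = "zh.wikipedia.org/wiki/Example_Title"

print(is_blocked(variant_url, rules))    # rule matches the variant path
print(is_blocked(canonical_url, rules))  # canonical path slips through
```

Under these assumptions, a reader who requests the canonical "/wiki/" path
is not affected by the rule at all, which is the kind of gap the first
finding describes.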
The biggest lesson from this study, in my opinion, is that GFW operation
is haphazard and ill-maintained. Also, there are many indications that the
GFW operators are somewhat disconnected from the censorship bureaucrats.
We hope these revelations will be of interest to internet censorship watchers,
Wikipedia researchers, China observers, and ordinary Chinese citizens.
Xia Chu (Twitter: @summer.agony)