
[liberationtech] Version 3.0 Complete GFW Rulebook for Wikipedia plus Comprehensive List for Websites, IPs, IMDB and AppStore (shortcut:

夏楚 summer.agony at
Wed Dec 25 23:09:03 PST 2013

To all,

Happy Holidays!

I just published Version 3.0 of my GFW research.

First of all, I created a "master spreadsheet" for all the findings and
updates. It contains links to the papers and the various lists; I also
tweeted it.

There are several major additions in this version (V3.0):
1. I created a monitoring pipeline which tracks GFW's updates on
Wikipedia. (For updates, one can subscribe to the mailing list
summeragony+subscribe.)

2. I applied the methodology to four more areas:
  A. I examined more than 1 million website names (obtained from Alexa and
several online lists: greatfire, autoproxy) and identified 3644 GFW
filtering rules targeting website names. This list is significantly more
comprehensive and more precise than any previous one.
  B. I applied the methodology to IMDB, examined 4M titles and identified 6
GFW rules.
  C. I examined a big repository of AppStore apps (648,567 items) and
identified 26 GFW rules.
  D. I checked 786,432 IP strings and identified 130 GFW rules.

3. Nine new rules (deployed after 2013-10-01) against Wikipedia were
identified.
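As a rough illustration of the kind of probing behind these rule counts (a minimal sketch of the general idea, not the actual pipeline; the host, path, and query parameter are placeholders): send a request carrying a candidate term toward a server on the far side of the firewall, and treat consistent connection resets as the filtering signal.

```python
import socket

def build_probe(term: str, host: str) -> bytes:
    """Build a plain HTTP/1.1 request carrying the candidate term.
    The path and query parameter are placeholders for illustration."""
    return (f"GET /?q={term} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n\r\n").encode()

def probe(term: str, host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Send one probe and classify the outcome. A single RST is not proof;
    real measurements need repeated trials plus control probes."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(build_probe(term, host))
            s.recv(4096)
        return "ok"
    except ConnectionResetError:
        return "reset"          # consistent resets suggest a filtering rule
    except (socket.timeout, OSError):
        return "inconclusive"   # drops/timeouts require retries to interpret
```

In practice one would pack many candidate terms into each probe and bisect on resets to isolate the exact matching rule; the sketch above only shows the single-term signal.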

For readers who have seen V2.0 of the paper, the new sections are Section 9
(websites), Section 10 (IP strings), Section 11 (IMDB), Section 12
(AppStore) and Appendix C (list of the 3644 websites).

Again, this research is a solo project done in my spare time, and
feedback is greatly appreciated. In particular, if you know of a large
corpus that GFW may filter, I'd love that input. For example, I only
examined 1M website names and ~60% of AppStore apps here; if you have a
bigger collection of website names, or a way to get the full AppStore
list, I'd love to take a look.

Last but not least, as I mentioned in the paper, this study was
originally motivated by Dr Xu Zhiyong (wiki <>, news search <>), whose
Chinese Wikipedia page <> is (surprisingly) accessible in China (it
turned out that GFW had blocked a non-standard variant of the page). Dr
Xu is currently facing trial in Beijing and may be sentenced to several
years in prison for his peaceful efforts to make China a place with a
little bit more freedom, righteousness and love. China's New Citizens'
Movement needs more support from the world!


Xia Chu

On Fri, Oct 18, 2013 at 6:20 PM, 夏楚 <summer.agony at> wrote:

> To all,
> I just wrote up my new study of GFW, and it is available at <>.
> In this new version, I thoroughly studied GFW's HTTP response filtering
> scheme, which has not been well studied in the past. The bulk of the new
> result is in Section 5 (pp 8-12). The following are some excerpts
> regarding the new findings.
> *Abstract*
> In Version 2.0, we studied GFW's filtering rules for HTTP responses
> extensively and identified a comprehensive list (including rules
> affecting Wikipedia and beyond). The list is small (19 items), but its
> rules affect many more pages on Wikipedia and other websites.
> *Section 5.3 Learnings and Mysteries of GFW's HTTP Response Filtering*
>    - GFW's HTTP request filtering and response filtering are two separate
>    systems. First, their filtering rules are entirely different. Second,
>    GFW's HTTP request filtering is homogeneous and has a near-perfect
>    trigger rate, while its HTTP response filtering varies hugely, not
>    only in the triggering rates but also in the filtering rules in
>    effect. For example, CERNET (Chinese Education and Research Network)
>    seems to have all the rules in place, but some other ISPs only have a
>    subset.
>    - One remarkable finding is that GFW does not just look at individual
>    TCP packets; instead, it "remembers" the entire TCP session to look
>    for offenders. This becomes evident when the filtering rule is
>    "$term_A & $term_B" and the two terms show up far apart (hundreds of
>    thousands of bytes from each other) on a webpage: GFW is still able
>    to reset the connection. Achieving this requires significant
>    investment in infrastructure, which is probably also the reason why
>    the rulebook is so much smaller for HTTP response filtering than for
>    HTTP request filtering.
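The session-level matching in the second bullet above can be illustrated with a toy matcher (a sketch of the idea only, not GFW's implementation): a conjunctive rule fires whenever both terms appear anywhere in the reassembled byte stream, regardless of how the payload was split into packets.

```python
def session_matches(rule_terms, packets):
    """Toy illustration of stateful (session-level) matching: the matcher
    accumulates the reassembled TCP payload, so a conjunctive rule
    ("term_A" AND "term_B") fires even when the terms arrive in different
    packets, hundreds of kilobytes apart."""
    stream = b"".join(packets)          # reassembled session payload
    return all(t in stream for t in rule_terms)

# Two terms split across packets, with ~100 KB of filler between them:
pkts = [b"...page...term_A...", b"x" * 100_000, b"...term_B...end"]
print(session_matches([b"term_A", b"term_B"], pkts))      # True
print(session_matches([b"term_A", b"term_B"], pkts[:1]))  # False
```

A purely per-packet matcher would miss this case, which is what makes the observed resets evidence of session-wide state.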
> Best,
> On Mon, Sep 30, 2013 at 4:26 PM, 夏楚 <summer.agony at> wrote:
>> To all,
>> I just finished writing up my research on GFW (Great Firewall of China)
>> blacklist for Wikipedia. Some of you might find it interesting.
>> The paper can be found at <> (tweeted here <>).
>> Here I paste excerpts from the Abstract and Conclusions below.
>> *Abstract*
>> In this report, we detail the *complete* and *exact* rulebook that the
>> Great Firewall of China (GFW) exerts on Wikipedia. We call it a
>> "rulebook" (instead of the common term "blacklist") because we identify
>> not only the blacklisted terms, but also the exact string matching
>> rules deployed by GFW. An efficient probing methodology makes this
>> possible.
>> ...
>> Wikipedia contains millions of pages: more than 700,000 articles in
>> the Chinese version and more than 4,240,000 in the English version.
>> Testing these pages exhaustively seems a daunting, infeasible task,
>> hence there has been no well-known attempt to gather the complete
>> blacklist.
>> While a small sample of the blacklist is useful, the complete picture
>> can be much more powerful in revealing the inner workings of GFW and
>> its operators. In this study, we devised a methodology which
>> efficiently examines the entire Wikipedia corpus, hence exposing the
>> complete GFW rulebook for Wikipedia to the world for the first time.
>> In total, there are 919 rules (excluding URL terms) applicable to
>> Wikipedia, affecting 5336 pages in Chinese Wikipedia and 67 English
>> Wikipedia pages.
>> The revealed rulebook also demonstrates that the GFW operation is
>> haphazard and ill-maintained. At the same time, the Chinese censorship
>> bureaucracy *intends* to be thorough and extensive.
>> To be precise, the findings in this report are on two Wikipedia
>> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
>> English version.
>> *Concluding Remarks*
>> In this study, we examined the entire Wikipedia corpus (Chinese version
>> and English version) and revealed the complete and exact GFW rulebook for
>> Wikipedia (with caveats described in Section 6).
>> A sample of notable findings:
>>    - There are 78 terms for which GFW blocks a non-standard variant
>>    but not the canonical path. These are cases where the censors intend
>>    to block a page but the block does not actually take effect,
>>    suggesting the censors have a poor understanding of Wikipedia's
>>    serving system.
>>    - Many obscure non-article pages are blocked, which raises the
>>    suspicion that these pages were provided to the censorship
>>    bureaucrats by Wikipedia editors who are very familiar with the
>>    content (e.g. those who participated in the edit wars and/or
>>    discussions regarding self-censorship proposals).
>>    - GFW string matching rules have a hard 64-byte size limit.
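The 64-byte limit in the last bullet has a concrete consequence worth spelling out (a toy sketch with made-up strings; GFW internals are not public): a rule can capture at most a 64-byte portion of a longer target, so any payload containing that portion matches, whatever follows it.

```python
RULE_SIZE_LIMIT = 64  # observed hard cap on GFW string-matching rules

def install_rule(pattern: bytes) -> bytes:
    """Sketch of the consequence of a 64-byte rule limit: a censor who
    wants to match a longer string can only install (at most) a 64-byte
    portion of it. Illustrative only."""
    return pattern[:RULE_SIZE_LIMIT]

def triggers(rule: bytes, payload: bytes) -> bool:
    """Plain substring match, as in GFW's string-matching rules."""
    return rule in payload

long_target = b"A" * 60 + b"-sensitive-title-continues-well-past-64-bytes"
rule = install_rule(long_target)
print(len(rule))  # 64
# Any payload containing the first 64 bytes matches, even with a different tail:
print(triggers(rule, b"A" * 60 + b"-sensitive-but-different-ending"))  # True
```

This also means two long targets sharing a 64-byte prefix are indistinguishable to the filter, one plausible source of the overblocking described above.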
>> The biggest lesson from this study, in my opinion, is that the GFW
>> operation is haphazard and ill-maintained. Also, there are many
>> indications that the GFW operators are somewhat disconnected from the
>> censorship bureaucrats.
>> We hope these revelations are of interest to internet censorship
>> watchers, Wikipedia researchers, China observers, and ordinary Chinese
>> citizens.
>> --
>> Xia Chu (Twitter: @summer.agony)
> --
> Xia Chu (Twitter: @summer.agony)

Xia Chu (Twitter: @summer.agony)
