A little Study of the Internet Censorship in China
Written by Julen Madariaga on January 14th, 2009
Last Sunday I did a post on internet censorship in China in which I mixed several different ideas, and I’m afraid the results on Search Engine Censorship didn’t come out as clear as I would have liked. I think it is an important subject, so here are the complete results:
We will be looking at Google.cn, Google.com and Baidu.com, and in each of them we will try three different kinds of search terms.
A- Charter 08: In all its combinations, which are 08宪章 and 零八宪章
B- Political Terms: Tiananmen incidents (天安门六四事件), FLG.
C- Vulgar words: Sex. I will employ the “blog job” and the “chicken bar”.
It is understood that in all cases the search terms are in Simplified Chinese. The browser is Firefox 3.0.5. and the connection is a normal home DSL by China Telecom. The possible results are:
- Free Search - Results look consistent and realistic, like the ones obtained in the West.
- Reset Connection (RC) - This can only be seen in Mainland China. The result is an image like the one below, and the search engine cannot be opened again for a while (I estimate 30 seconds). RC is not done directly by the Search Engine. Wikipedia’s internal search also gives RCs for B Terms.
- Forbidden Message (FM) - This is the message that, with slight variations, looks like the one shown below. It says something along the lines of: “Some results are not displayed according to the local laws, regulations and policies”.
- Manipulated Results (MR) - This is the case where the results are obviously manipulated, for example in the search for 天安门六四事件 (Tiananmen incident) on Baidu, where all the results are official newspapers such as People’s Daily, etc. Sometimes the page also carries an FM at the top.
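For anyone who wants to repeat the test with a script instead of by hand, the four outcomes above could be sorted roughly like this. It is only a minimal sketch: `FM_MARKER` is the English paraphrase quoted above (the real notice is in Chinese), and `OFFICIAL_DOMAINS` is my own illustrative list, not something I measured.

```python
from enum import Enum
from typing import Optional, Sequence, Set


class Outcome(Enum):
    FREE = "Free Search"
    RC = "Reset Connection"
    FM = "Forbidden Message"
    MR = "Manipulated Results"


# Assumption: English paraphrase of the notice; a real test would match the Chinese text.
FM_MARKER = "local laws, regulations and policies"
# Assumption: illustrative list of official outlets (People's Daily, Xinhua).
OFFICIAL_DOMAINS = ("people.com.cn", "xinhuanet.com")


def is_official(host: str) -> bool:
    """True if host is one of the official domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in OFFICIAL_DOMAINS)


def classify(page_text: Optional[str],
             result_hosts: Sequence[str],
             error: Optional[BaseException] = None) -> Set[Outcome]:
    """Map one search attempt to the outcome categories described above."""
    if isinstance(error, ConnectionResetError):
        return {Outcome.RC}
    outcomes = set()
    if page_text and FM_MARKER in page_text:
        outcomes.add(Outcome.FM)
    # Crude heuristic: every result comes from an official outlet.
    if result_hosts and all(is_official(h) for h in result_hosts):
        outcomes.add(Outcome.MR)
    return outcomes or {Outcome.FREE}
```

FM and MR can be returned together because, as noted above, a manipulated page sometimes also carries the Forbidden Message.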
Google.com
A- Free Search (but clicking on some individual results gives RC).
B- Reset Connection.
C- Manipulated Results.
Google.cn
A- Forbidden Message and (sometimes *) Manipulated Results.
B- Reset Connection.
C- Forbidden Message. When the term is put in quotation marks (“”), it gives Manipulated Results.
Baidu.com
A- Manipulated Results. When the term is put in quotation marks (“”), it gives Forbidden Message.
B- Forbidden Message and Manipulated Results.
C- Forbidden Message and Manipulated Results.
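The results above can also be written down as a small lookup table, which makes it easier to compare engines. This is just a sketch of the data in this post: the outcome codes are the abbreviations defined earlier, the quoted and unquoted variants of a term are merged into one set, and the function name is my own.

```python
# Outcomes observed per search engine and term class (A, B, C), as listed above.
# Quoted and unquoted variants of a term are merged into a single set.
RESULTS = {
    "Google.com": {"A": {"Free"}, "B": {"RC"}, "C": {"MR"}},
    "Google.cn": {"A": {"FM", "MR"}, "B": {"RC"}, "C": {"FM", "MR"}},
    "Baidu.com": {"A": {"MR", "FM"}, "B": {"FM", "MR"}, "C": {"FM", "MR"}},
}


def engines_showing(outcome):
    """Return the engines where the given outcome appeared for any term class."""
    return sorted(
        engine
        for engine, by_class in RESULTS.items()
        if any(outcome in seen for seen in by_class.values())
    )


print(engines_showing("RC"))    # ['Google.cn', 'Google.com']
print(engines_showing("Free"))  # ['Google.com']
```

Note that the RC seen when clicking individual results on Google.com is not recorded here, since it happens after the search itself.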
Conclusions
1- The results are somewhat erratic and it is difficult to see a pattern: it all looks like a series of patches on top of each other rather than a systematic implementation. Also, things change over time, as in *, where the Manipulated Result I saw Sunday cannot be seen anymore.
2- Baidu has a different system from Google: it has no Reset Connections. This is very advantageous for Baidu and I consider it unfair competition, as an RC is one of the worst experiences while surfing.
3- This might be due to server location: the Search Engines have no direct involvement in the RC (even Wikipedia has RCs!), whereas Manipulated Results obviously require their action and can more easily attract attention from Advocacy Groups. Of course, in the case of sexual terms (C), this is not a problem, as the Manipulated Results can just be called “Safe Search”.
4- Charter 08 is treated differently from other political terms, but this might just be because it was banned urgently and suddenly, so it is only a quick fix added to the existing structure. It does not provoke RC in any case. It looks like they have decided to leave it alone on Google.com to avoid attention from Western advocacy groups, but in exchange Google has had to give up Google.cn and apply the infamous “porn block” to it, which is active censorship by the SE. Why the FM and not RC? Who knows, I am guessing perhaps RC is more complicated to implement.
5- In any case, and however negative, I think it is always better to show an FM than Manipulated Results, because the former openly admits censorship, whereas the latter is a lie and a distortion of reality. The Forbidden Message does increase transparency, yet it does not justify involvement in political censorship. From this perspective, Google is closer to the truth than Baidu. Baidu seems indeed a more active participant in the government’s information control schemes, and Chinese users of Baidu are clearly the most exposed to Search Engine brainwashing.
UPDATE: Following corrections by international expert Nart Villeneuve below: I have introduced a few changes of my own (in blue). In any case, this post is just a very basic review of the SE Censorship system from the perspective of a normal user. If you really want to understand how the GFW works, you should read proper research papers like this one, or this one.
IMAGES:
1- FORBIDDEN MESSAGE (FM)
2- RESET CONNECTION (RC)
NOTE: If someone is interested in this or has some more information to share please put it in comments. Unfortunately my time is very limited so I only ran 2 or 3 terms for each of the classes A, B and C above. There might be things I overlooked and I would be grateful if you can point them out.
15
PM
You forgot one thing:
“where the Manipulated Result I saw Sunday cannot be seen anymore.”
Based on their new granular filtering system, thanks to Cisco, your searches are actually building up the database. The system is based on a web crawling approach and also on a user based inquiry base.
This is the best combination and it is faster than relying only on web crawling.
15
PM
I should add: VPNs are also monitored or tampered with. Witopia was next to unusable for me in the days of the Olympics.
The company refused to engage in a deep discussion about it, but they clearly told me: China might be tampering with the VPNs, but they can’t decode them (who knows)…
15
PM
Granular system? Mhh. I am not sure I understand the concept. You mean it learns from the searches, and since people searching for the Charter and finding a People’s Daily result don’t usually click on it, then the machine deems it irrelevant and eliminates it from next search?
16
AM
Maybe with enough censorship people will begin using Freenet, Psiphon, onion routing and many more p2p programs to make it a simply daunting task to track everything.
Recipe?
waste networking + twitter + jabber + Drupal/CMS server-client + email/IM/voip + torrent stream-server/client + new DNS table = new internet backbone?
18
PM
@uln
Sorry, my comment was not clear enough. It basically means that the millions of searches that people make are contributing to the database of blocked content.
I have the feeling that their system is based on the following dual approach: crawling the web with bots (similar to Google) in order to intercept the offensive content in advance + blocking new content based on their “offensive terms and words database” when people find new links and they get processed in Beijing. Basically, anything that goes in and out of China is mirrored on Beijing’s servers and then analyzed (automatically at first, of course, and potentially extensively by a human operator if further measures need to be taken).
In the past, they would block whole domains, but now they are able to selectively block subdomains and even specific links inside a website. This reinforces the pervasive notion that it “might just be” a technical problem, since I can access the rest of the website.
Youtube is a very good example of this behavior.
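To make the difference between blocking a whole domain and blocking a single link concrete, here is a toy sketch. It says nothing about how the real filtering is implemented; the hostnames, paths and function name are all hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical blocklists, for illustration only.
BLOCKED_DOMAINS = {"blocked.example"}                       # old style: whole domain
BLOCKED_URL_PREFIXES = {("video.example", "/watch/forbidden")}  # granular: one path


def is_blocked(url: str) -> bool:
    """Return True if the URL matches a domain-level or a link-level rule."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host in BLOCKED_DOMAINS:
        return True
    return any(
        host == blocked_host and parts.path.startswith(prefix)
        for blocked_host, prefix in BLOCKED_URL_PREFIXES
    )


print(is_blocked("http://blocked.example/anything"))            # True
print(is_blocked("http://video.example/watch/forbidden-clip"))  # True
print(is_blocked("http://video.example/watch/cats"))            # False
```

With the granular rule, the rest of the site keeps working, which is exactly why a blocked page can pass for a technical glitch.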
18
PM
Mm. Interesting, I am going to watch the bots that visit my website for weird ones to see if I can find out which is the one of the State Council. If I know my CPC well, it should be easy to identify, it’s probably called “GreatWallbot” or “LiberationBot”
Back to your comment: there are many ways that the authorities censor content, and as you say, they can sometimes only censor one post within a blog (this was the case today when I found out one of the threads at FM had the RC block): it is here.
But there is a reason why I haven’t considered these things in my post. The above little “study” is only focused on “Search Engine censorship” and the extent to which these search engines collaborate with the censors. The examples we are giving here like Youtube and FM are a different aspect, and cannot be controlled by the owners of these sites or by the Search Engines.
19
AM
“But there is a reason why I haven’t considered these things in my post.”
My comment was not a criticism, what you’ve done is interesting. I just wanted to expand a bit on the subject for the fun of it…
19
AM
And of course all this information is covered in the excellent article James Fallows wrote a while ago about it.
19
AM
Oops. Of course, I guess I forgot to say thanks
Yes, I know that Fallows article and it is brilliant, I have linked to it recently in another post.
1
AM
You might be interested in a paper I wrote on search engine filtering.
http://ssrn.com/abstract=1157373
“Baidu has a different system from Google: it has no Reset Connections. ”
This is because you are connecting to Baidu without passing through the filtering system (gfw). If you connected to Baidu from outside China I can trigger the RC. This is also why you get RC when connecting to Google.
The RCs you get are due to the filtering (gfw), not Google(.com). Google.cn has servers inside China, but you can also connect to google.cn servers outside China. I find it best to manually specify the IP, that way you know what/where you are connecting to.
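Specifying the IP manually can be sketched as follows, using only Python’s standard library: open the connection to a chosen IP address and send the site name in the `Host` header. The `/search?q=` path and the function names are assumptions for illustration, and the IP in the usage comment is a documentation placeholder, not a real search engine server.

```python
import http.client
from urllib.parse import quote


def build_path(query: str) -> str:
    """Build the request path; '/search?q=' is an assumed search URL layout."""
    return "/search?q=" + quote(query)


def search_via_ip(ip: str, host: str, query: str, timeout: float = 10.0) -> str:
    """Send the search to a specific server IP while naming the site in Host."""
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        # Passing our own Host header stops http.client from adding one based on the IP.
        conn.request("GET", build_path(query), headers={"Host": host})
        response = conn.getresponse()
        return f"HTTP {response.status}"
    except ConnectionResetError:
        # The connection was reset mid-request: the "RC" outcome.
        return "Reset Connection"
    finally:
        conn.close()


# Hypothetical usage (203.0.113.10 is a reserved documentation address):
# print(search_via_ip("203.0.113.10", "www.google.cn", "08宪章"))
```

Running the same query against an IP inside China and one outside should show whether the reset depends on the path the traffic takes rather than on the search engine.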
Also, there are differences in search engine results for a variety of reasons, one of which is the location of the crawlers: if they are indexing from inside China then sites blocked (gfw) are not indexed and don’t need to be censored by the search engines.
1
PM
Hi, thanks a lot. I downloaded your paper and I find it very helpful.
I am quite surprised by this part of your comment though: “This is because you are connecting to Baidu without passing through the filtering system (gfw). If you connected to Baidu from outside China I can trigger the RC.”
1- Basically what you are saying is that the GFW works in BOTH directions?? So it not only blocks incoming content, but also content going from China to the outside. If this is true, RC blocks on content that is hosted on a server within China can only be seen from outside China, and vice versa. I find this surprising, because it defeats the purpose of Chinese censorship: they want to block content from showing inside China, while giving an image of (relative) openness to the outside. Are you sure of this bidirectionality of the GFW??
2- Also, one related question: are you sure the GFW is ONLY applied to content crossing the border of mainland China, so that it is only a “border control”, as opposed to also blocking content circulating within China? I suspect this is true, as is explained for example in the famous Fallows article. But I don’t have any proof.
3- Thanks for the info on the crawlers too. My approach however is different. I look at censorship from the side of the final user, and my question is: what are Google/Baidu showing users when they perform a search? Whether it is for crawlers/servers or other technical reasons (which Google certainly has the know-how to understand and solve), the essential thing is to find out: are Search Engines consciously giving the final user manipulated information, yes or no?
In this sense, I found very useful the points in your paper about transparency, thanks again for the link.
Check also the other more comprehensive post on censorship I wrote (link below).