Character Encoding in HTTP

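The two earlier posts, "Character Encoding" and "Character Encoding in Python", gave a brief introduction to character encoding. This post turns to encoding issues in HTTP. Once the encoding series is finished I will start a series of posts on HTTP itself; for now, this post assumes some basic HTTP knowledge.

Although I started out as a Java coder, today's HTTP examples are again written in Python; building this kind of thing in Java is too much trouble. Simple is better than complex.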

`Content-Type` in the response headers

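Besides the status line, an HTTP response message consists of two parts: the entity headers (response headers) and the entity body (response body). The headers describe the body and tell the browser how to handle it (text, image, and so on). Here is the code: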

#coding=utf-8

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class MyRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write('hello web')

server = HTTPServer(('127.0.0.1', 9000), MyRequestHandler)
server.serve_forever()

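It doesn't matter if you don't follow the code above. All you need to know is that it is a simple web service (GET only) that returns a piece of text. Run it, then open http://localhost:9000 in a browser.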

[Image: Encoding Img]

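The result above is what we expected. What happens if we add some Chinese text?

# Replace self.wfile.write('hello web') with the line below
self.wfile.write('hello web 编码')

Run it again and reload the page in the browser.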

[Image: Encoding Img]

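Garbled text appears. The body returned to the browser is `hello web 编码`, and the response header is `Content-Type: text/plain`. That header only says the body is text; it does not say which character set should be used to decode it, so the browser falls back to a default (here GBK, the operating system's default charset). Change the header to `Content-Type: text/plain;charset=utf-8`, check the result again, and the garbled text is gone. The `charset` parameter tells the browser how to turn the bytes of the body into characters. By the same reasoning, the program must be encoding the text as UTF-8 bytes before it goes over the network (the source file is declared as `#coding=utf-8`, so in Python 2 the string literal is written out as UTF-8 bytes, assuming the file is saved as UTF-8).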
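To make the role of `charset` concrete, here is a minimal sketch that decodes the same bytes with two different character sets. It is written for Python 3 (unlike the Python 2 listings in this post), and the variable names are only illustrative.

```python
# Python 3 sketch: one byte sequence, two interpretations.
text = 'hello web 编码'

# What goes over the wire when the body is encoded as UTF-8.
body = text.encode('utf-8')

# A browser told "charset=utf-8" recovers the original characters.
as_utf8 = body.decode('utf-8')

# A browser that falls back to GBK maps the same bytes to other characters.
# errors='replace' guards against byte pairs that GBK cannot decode.
as_gbk = body.decode('gbk', errors='replace')

print(repr(body))        # the ASCII part is readable, the Chinese part is raw bytes
print(as_utf8 == text)   # True
print(as_gbk == text)    # False: this is the garbled text seen in the browser
```

The ASCII part survives either way because UTF-8 and GBK agree on those bytes; only the multi-byte characters are misread.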

`Content-Encoding` in the response headers

Common values of Content-Encoding:

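gzip        The entity is encoded with GNU zip (gzip)
compress    The entity is compressed with the Unix compress program
deflate     The entity is compressed in the zlib format
identity    No encoding has been applied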

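The first three are lossless compression algorithms: they reduce the size of the transmitted message without losing any information. Of these, gzip is the one most widely used in practice.
The corresponding request header is Accept-Encoding, which lists the encodings the client is able to accept.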
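As a rough illustration of the gzip and deflate codings, the sketch below compresses one payload with Python 3's standard `gzip` and `zlib` modules and checks that both round-trip without loss. The payload is a made-up, highly repetitive string, so the size reduction it shows is not representative of real pages.

```python
import gzip
import zlib

# Hypothetical payload: repetitive markup compresses unusually well.
body = ('<p>hello web 编码</p>' * 200).encode('utf-8')

gzip_body = gzip.compress(body)      # what "Content-Encoding: gzip" carries
deflate_body = zlib.compress(body)   # zlib format, as used by "deflate"

# Sizes in bytes: original, gzip, deflate.
print(len(body), len(gzip_body), len(deflate_body))

# Both codings are lossless: decompressing gives back the exact bytes.
print(gzip.decompress(gzip_body) == body)
print(zlib.decompress(deflate_body) == body)
```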

#coding=utf-8
'''
http-encode-gzip.py: a simple HTTP server (Python 2)
'''

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

import gzip, cStringIO, urllib

# gzip-compress a byte string in memory and return the compressed bytes
def compressBuf(buf):
    zbuf = cStringIO.StringIO()
    zfile = gzip.GzipFile(mode = 'wb',  fileobj = zbuf, compresslevel = 9)
    zfile.write(buf)
    zfile.close()
    return zbuf.getvalue()

class MyRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Encoding','gzip')  # if this line is commented out, the client cannot make sense of the body
        self.end_headers()

        content = '''<html>
        <head>
            <title>最简单的httpserver</title>
            <meta charset="utf-8"/>
        </head>
        <body>就提供这一个页面</body></html>'''

        # Compress the content returned to the client
        zbuf = compressBuf(content) # must match self.send_header('Content-Encoding','gzip')
        print zbuf
        self.wfile.write(zbuf)

server = HTTPServer(('127.0.0.1', 9000), MyRequestHandler)
server.serve_forever()

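The code above gzips the content before returning it to the browser. The response must carry the header set by self.send_header('Content-Encoding','gzip'); without it, the browser treats the body as uncompressed and displays garbage.

That covers gzip compression on the server side. A real web server must check whether the browser's request headers contain something like Accept-Encoding:gzip,deflate,sdch before deciding whether to gzip the response.

The following code simulates how a browser gunzips the bytes sent by the server: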
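The check described above might look roughly like the following. `client_accepts_gzip` is a hypothetical helper, not part of any library, and it is deliberately simplified: it honours `q=0` but ignores the `*` wildcard and does not rank codings by preference.

```python
def client_accepts_gzip(accept_encoding):
    """Return True if the Accept-Encoding value lists gzip with q > 0."""
    if not accept_encoding:
        return False
    for item in accept_encoding.split(','):
        parts = item.strip().split(';')
        coding = parts[0].strip().lower()
        if coding != 'gzip':
            continue
        q = 1.0  # a coding without an explicit q value counts as fully acceptable
        for param in parts[1:]:
            name, _, value = param.strip().partition('=')
            if name.strip().lower() == 'q':
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        return q > 0
    return False


print(client_accepts_gzip('gzip,deflate,sdch'))   # True
print(client_accepts_gzip('deflate'))             # False
print(client_accepts_gzip('gzip;q=0, identity'))  # False
```

Inside a handler's `do_GET`, the header value could be read with `self.headers.get('Accept-Encoding')` and the body compressed only when the helper returns True.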

#coding=utf-8
'''
http-encode-gzip-client.py
'''
import urllib2, zlib

url = 'http://127.0.0.1:9000'
req  = urllib2.Request(
    url = url,
)
result = urllib2.urlopen(req)
text = result.read()
# gunzip the byte stream sent by the server (16 + MAX_WBITS makes zlib expect a gzip header)
text = zlib.decompress(text, 16+zlib.MAX_WBITS)

# Encoding handling: the chardet module can detect the page's encoding automatically
# http://www.cnblogs.com/CoolRandy/p/3251733.html
#infoencode = chardet.detect(text).get('encoding','utf-8')
#print text.decode(infoencode,'ignore')

print text
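The client above always gunzips, whatever the server actually sent. A slightly more careful client would undo `Content-Encoding` first and only then apply the charset. The sketch below shows that order in Python 3; `decode_body` is a hypothetical helper, and the response is simulated in memory rather than fetched over the network.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Undo Content-Encoding first, then turn the bytes into text."""
    coding = (content_encoding or 'identity').strip().lower()
    if coding == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    elif coding == 'deflate':
        raw = zlib.decompress(raw)
    elif coding != 'identity':
        raise ValueError('unsupported Content-Encoding: %s' % coding)
    return raw.decode(charset)


# Simulated response body: a UTF-8 page, gzipped as the server above would do.
page = '<body>就提供这一个页面</body>'
wire = gzip.compress(page.encode('utf-8'))

print(decode_body(wire, 'gzip', 'utf-8') == page)  # True
```

With `urllib.request`, the two arguments could be taken from the response, for example `resp.headers.get('Content-Encoding')` and `resp.headers.get_content_charset()`; the latter returns None when the server sent no charset, in which case a default has to be chosen.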

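Summary

This post only briefly introduces a couple of HTTP headers; a detailed treatment will come separately in the future series of posts on HTTP.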
