This question provides background on the filename parameter.

I need to write a script to access some files on a web server. The filename contains CJK characters which cannot be encoded in ASCII.

$ curl -I 'http://bj.baidupcs.com/file/f6f258963f3c5daaa154ed441db232e1?xcode=f5a142e99df965f6a3b4c502a3c55a73283ef282da2f5c14&fid=1107408242-250528-2625488475&time=1373046574&sign=FDTAXER-DCb740ccc5511e5e8fedcff06b081203-QSIMrWw%2FICWQuExpdtyijM0vbMM%3D&to=bb&fm=N,Q,U&expires=8h&rt=sh&r=210487178&logid=3893215518&sh=1'
......
Content-Disposition: attachment;filename="【动漫之家汉化组】[最强会长黑神][第192话][黑神目泷依然健在][END].zip"
......

As you can see, cURL shows the filename properly. Firefox can also figure out the correct filename.

I wrote my script in Python. I tried requests first:

>>> import requests
>>> r = requests.head('http://bj.baidupcs.com/file/f6f258963f3c5daaa154ed441db232e1?xcode=f5a142e99df965f6a3b4c502a3c55a73283ef282da2f5c14&fid=1107408242-250528-2625488475&time=1373046574&sign=FDTAXER-DCb740ccc5511e5e8fedcff06b081203-QSIMrWw%2FICWQuExpdtyijM0vbMM%3D&to=bb&fm=N,Q,U&expires=8h&rt=sh&r=210487178&logid=3893215518&sh=1')
>>> r.headers['content-disposition']
'attachment;filename="ã\x80\x90å\x8a¨æ¼«ä¹\x8bå®¶æ±\x89å\x8c\x96ç»\x84ã\x80\x91[æ\x9c\x80å¼ºä¼\x9aé\x95¿é»\x91ç¥\x9e][ç¬¬192è¯\x9d][é»\x91ç¥\x9eç\x9b®æ³·ä¾\x9dç\x84¶å\x81¥å\x9c¨][END].zip"'

The filename looks like a weird representation of Python bytes. The problem is that this whole thing is already a Python string. I can't think of a way to get the actual bytes to decode.

>>> type(r.headers['content-disposition'])
<class 'str'>

Under the hood, requests relies (via urllib3) on the http.client module from the standard library. I tried that directly but got the same thing:

>>> import http.client
>>> conn = http.client.HTTPConnection("bj.baidupcs.com")
>>> conn.request('HEAD', '/file/f6f258963f3c5daaa154ed441db232e1?xcode=f5a142e99df965f6a3b4c502a3c55a73283ef282da2f5c14&fid=1107408242-250528-2625488475&time=1373046574&sign=FDTAXER-DCb740ccc5511e5e8fedcff06b081203-QSIMrWw%2FICWQuExpdtyijM0vbMM%3D&to=bb&fm=N,Q,U&expires=8h&rt=sh&r=210487178&logid=3893215518&sh=1')
>>> r = conn.getresponse()
>>> r.getheader('content-disposition')
'attachment;filename="ã\x80\x90å\x8a¨æ¼«ä¹\x8bå®¶æ±\x89å\x8c\x96ç»\x84ã\x80\x91[æ\x9c\x80å¼ºä¼\x9aé\x95¿é»\x91ç¥\x9e][ç¬¬192è¯\x9d][é»\x91ç¥\x9eç\x9b®æ³·ä¾\x9dç\x84¶å\x81¥å\x9c¨][END].zip"'

I'm using Python 3 on Windows.

  • I had a similar issue with a subprocess. The subprocess printed Unicode text, but it assumed the OS's locale didn't allow anything outside of ASCII and printed something like what you have above. I fixed it by manually setting the locale: os.environ['LANG'] = 'en_US.UTF-8'. LANG might have been LC_ALL; I can't remember. Commented Jul 5, 2013 at 19:18

1 Answer


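Looks like you're getting a UTF-8-encoded byte string back as a Python 3 (Unicode) string. http.client decodes header bytes as ISO-8859-1 (Latin-1), so each byte of the server's UTF-8 data ends up as a separate character. Encoding the string back to Latin-1 recovers the original bytes, which you can then decode as UTF-8. You'll have to do something like...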

>>> s = 'attachment;filename="ã\x80\x90å\x8a¨æ¼«ä¹\x8bå®¶æ±\x89å\x8c\x96ç»\x84ã\x80\x91[æ\x9c\x80å¼ºä¼\x9aé\x95¿é»\x91ç¥\x9e][ç¬¬192è¯\x9d][é»\x91ç¥\x9eç\x9b®æ³·ä¾\x9dç\x84¶å\x81¥å\x9c¨][END].zip"'
>>> s = s.encode('latin-1').decode('utf-8')
>>> s
'attachment;filename="【动漫之家汉化组】[最强会长黑神][第192话][黑神目泷依然健在][END].zip"'
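If you need this in a script rather than at the prompt, the same round trip can be wrapped in a small helper. This is only a sketch: the name redecode_header and its wire_encoding parameter are made up for illustration, and it assumes the server really sent UTF-8. If the value does not survive the round trip (for example, it was genuine Latin-1 all along), it is returned unchanged.

```python
def redecode_header(value: str, wire_encoding: str = "utf-8") -> str:
    """Undo a Latin-1 decoding of header bytes (hypothetical helper).

    Assumes the server actually sent bytes in `wire_encoding`.
    """
    try:
        # Latin-1 maps code points 0-255 to single bytes one-to-one,
        # so this recovers the exact bytes received on the wire.
        raw = value.encode("latin-1")
        return raw.decode(wire_encoding)
    except UnicodeError:
        # Either the string has characters outside Latin-1 (so it was
        # already decoded properly) or the bytes are not valid in
        # wire_encoding: leave the value as it is.
        return value


# Simulate what the client hands back for a UTF-8 header value.
original = 'attachment;filename="【动漫之家汉化组】[END].zip"'
garbled = original.encode("utf-8").decode("latin-1")
print(redecode_header(garbled) == original)
```

With requests, you would pass r.headers['content-disposition'] to the helper. The fallback means a plain ASCII header goes through untouched.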