I'm trying to download an entire website using wget and this is the command I use:

wget --recursive --no-clobber --page-requisites --convert-links --domains example.com --no-parent http://www.example.com/en/ 

It works fine except for one problem. There are files (mainly images) whose names contain Chinese characters, like this:

http://www.example.com/path/to/首页主KV3.jpg

After downloading, the file is saved under this name:

??%96页主KV3.jpg

And it is referenced in the HTML page like this, which results in a 404 error:

�%2596页主KV3.jpg

How can I prevent this inconsistency?

2 Answers

This is a UTF-8 versus ASCII encoding issue. It is discussed at the following link:

https://www.win.tue.nl/~aeb/linux/misc/wget.html

Worth reading, but in essence you have to tell wget not to try and "fix" filenames by specifying --restrict-file-names=nocontrol:

wget -r -np -nc --restrict-file-names=nocontrol URL 
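For reference, here is a minimal sketch that adds the same option to the long-form command from the question. The function name mirror_site is a hypothetical helper, not part of wget, and whether the converted links then match the saved filenames may still depend on your wget version and locale:

```shell
# Sketch: the question's original options plus --restrict-file-names=nocontrol.
# mirror_site is a hypothetical helper name used only for illustration.
# $1 = start URL, $2 = domain to stay within
mirror_site() {
    wget --recursive --no-clobber --page-requisites --convert-links \
         --restrict-file-names=nocontrol \
         --domains "$2" --no-parent "$1"
}
```

It would be called as: mirror_site http://www.example.com/en/ example.com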

I fought with this today as well.

In my case the problem was with German letters like ä, ö, ü.

I fixed it by setting all my language settings to UTF-8.

You can see a tutorial here:

https://perlgeek.de/en/article/set-up-a-clean-utf8-environment
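As a rough sketch of what that looks like for a single shell session (assuming the en_US.UTF-8 locale is installed on your system; `locale -a` lists what is available, and any other UTF-8 locale should do as well):

```shell
# Print the character set of the current locale; "UTF-8" is the goal.
locale charmap 2>/dev/null || echo "locale command not available"

# Assumption: en_US.UTF-8 is installed; substitute any installed UTF-8 locale.
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```

Run wget from the same session afterwards so it inherits these settings. To make the change permanent, follow the tutorial above.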
