utf-8 encode

higorr

Dabbler
Joined
Jan 7, 2020
Messages
12
ok, let's go

- Copy (or screenshot) the full console output here (not only results but whole terminal screen including entered commands)

Captura de tela de 2020-01-27 09-37-14.png

KEEP the connection open, connect with another PuTTY instance but with UTF-8 as remote charset and do:
Captura de tela de 2020-01-27 09-44-56.png

- Now go back to the first PuTTY instance (here you will still have the ISO-8859-1 lang) and, while in /tmp/test, call ls -la again and paste another output here.

Captura de tela de 2020-01-27 09-55-42.png

Output will confirm the conversion was done. Now use ls -la and paste the output. You should see 1, 2 and 4 properly while 3 will be still scrambled.
Captura de tela de 2020-01-27 09-50-30.png


We had a slight difference, but it worked.
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
654
OK, so this just confirms that ISO-8859-1 is just crap these days. For whatever reason it does not render the symbols properly even when the client has the matching charset set. UTF-8 works properly, which is good.
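For anyone wondering what the two charsets actually look like on disk, here is a minimal sketch at the byte level. It assumes iconv and od are available; the name "café" and the helper `hex` are only illustrative and are not taken from the screenshots above.

```shell
# "é" as the single ISO-8859-1 byte 0xE9, written with an octal escape.
latin1_name=$(printf 'caf\351')

# The same name re-encoded as UTF-8: "é" becomes the two bytes 0xC3 0xA9.
utf8_name=$(printf '%s' "$latin1_name" | iconv -f ISO-8859-1 -t UTF-8)

# Hypothetical helper: print the bytes of a string as plain hex.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

echo "ISO-8859-1 bytes: $(hex "$latin1_name")"
echo "UTF-8 bytes:      $(hex "$utf8_name")"
```

A terminal set to one charset will show garbage for names stored in the other one, because the byte sequences differ for every non-ASCII character.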

So consider doing the following steps ... please note that I take no responsibility if something gets screwed up. Make sure you test properly and that you have valid backups!

1] (Optional) Portuguese lang
- Change the ~/.login_conf to one of the Portuguese locales ...
Code:
me:\
    :charset=UTF-8:\
    :lang=pt_PT.UTF-8:

or
Code:
me:\
    :charset=UTF-8:\
    :lang=pt_BR.UTF-8:

- Then rebuild the login capability database with cap_mkdb ~/.login_conf
- Reconnect to the server and check the locale
- This should not break anything as it is still UTF-8, but double-check it by connecting via PuTTY with UTF-8 set and listing the content of /tmp/test . You should still see 1, 2 and 4 properly and 3 broken.
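A minimal sketch of the rebuild-and-check part of step 1, assuming a FreeBSD-based system where cap_mkdb exists. The guard just skips the rebuild elsewhere, and the variable name `charmap` is picked for illustration only.

```shell
# Rebuild the per-user login capability database from ~/.login_conf.
# Skipped when cap_mkdb or the file is missing.
if command -v cap_mkdb >/dev/null 2>&1 && [ -f "$HOME/.login_conf" ]; then
    cap_mkdb "$HOME/.login_conf"
fi

# After reconnecting, ask the session which character map is active.
charmap=$(locale charmap 2>/dev/null || echo unknown)
echo "LANG=${LANG:-unset} charmap=$charmap"

case $charmap in
    UTF-8) echo "UTF-8 locale is active" ;;
    *)     echo "not UTF-8 yet: reconnect, then re-check ~/.login_conf" ;;
esac
```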

2] Test convmv on real data
- Pick one of the directories/files from your SMB shares that have the special characters
- Make a backup or just create some test directories and files with special chars VIA the SMB share ... so from some Windows client
- Try to convert the directory including all subdirs
Code:
convmv -f ISO-8859-1 -t UTF-8 -r ./<name>

where <name> is at least the partial directory name (as you can not type the special chars at this point). Use " * " as a wildcard in place of the special chars. Or if you want to convert everything under the current directory just hit it with ./*
- Validate the output of the test run. If you're satisfied, add --notest at the end of the command for the real conversion
- You should now see all the chars properly in the terminal
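Step 2 can be rehearsed on throwaway data first. The sketch below is only an illustration: it creates a scratch directory with mktemp and one hypothetical file whose name contains an ISO-8859-1 "é", then does the dry run followed by the real run. The file is created locally rather than over SMB, so it only approximates the real case, and the convmv calls are skipped if convmv is not installed.

```shell
# Scratch directory standing in for a backed-up copy of a real share.
target=$(mktemp -d)

# One file whose name holds "é" as the single ISO-8859-1 byte 0xE9.
: > "$target/$(printf 'caf\351').txt"

if command -v convmv >/dev/null 2>&1; then
    # Dry run: convmv only prints what it would rename.
    convmv -f ISO-8859-1 -t UTF-8 -r "$target"

    # Real run: the same command with --notest added at the end.
    convmv -f ISO-8859-1 -t UTF-8 -r "$target" --notest
fi

# In a UTF-8 terminal the name should now display correctly.
ls -la "$target"
```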

3] Validate over SMB
- Now the tricky part. The converted DIR will not be visible over SMB, as SMB still speaks Latin1.
- You will have to change the UNIX charset from ISO-8859-1 to UTF-8 and restart SMB.
- Be aware that doing so will break all of the other shares with special chars. Moreover, if anybody creates new files during this time they will be created with UTF-8 encoding and you will end up with a mix of encodings!
- So ideally do this at a time when nobody works with the shares (outside business hours?) OR temporarily lock out all of the users. Or you can add some AUX parameters to restrict SMB shares to be accessible only from a specific IP address.
- Whatever approach you choose, the converted directory/files should be visible over SMB running with UTF-8.
- Change the SMB back to ISO-8859-1 for now and restart service.
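A small sketch for double-checking step 3 from the shell, assuming Samba's testparm is installed and can find the active config. The IP address in the comment comes from the documentation range and is a placeholder, not a real client; `smb_charset` is just an illustrative variable name.

```shell
# Aux parameter idea from the step above, restricting access to one client:
#   hosts allow = 192.0.2.10

# Report the charset Samba is currently configured with.
if command -v testparm >/dev/null 2>&1; then
    smb_charset=$(testparm -s --parameter-name="unix charset" 2>/dev/null)
fi
smb_charset=${smb_charset:-unknown}
echo "unix charset: $smb_charset"
```

Running it before and after each change makes it easy to confirm that SMB really is back on ISO-8859-1 at the end of the test.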

4] Migrate to UTF-8 (and never come back)
- Inform your users that this is going to happen. Give them time to react (maybe someone is using custom scripts that would break?)
- MAKE A BACKUP of all of your data !!! ... or at least these which are exposed via SMB shares (or other services)
- Ensure you have a backup ... seriously ...
- Prepare a list of the paths of ALL of your SMB shares
- If you have other data/files uploaded some other way (FTP, AFP, NFS, SCP, ...), do your homework and check which charset they use. They will most probably be in Latin1 as well, since that was your previous locale.
- Take the list of paths and prepare convmv commands for converting from ISO-8859-1 to UTF-8. They will look like this (note the -r flag for recursive traversal)
Code:
convmv -f ISO-8859-1 -t UTF-8 -r ./<path>

- Run convmv, ideally redirecting the output to a file so it is easier to read later.
- Carefully review the test runs. You should not see any garbled characters. If something is converted wrongly, check why. (The files might be in a different encoding. They could already be in UTF-8, or maybe plain old ASCII?)
- If everything looks OK and you're ready to hit the red button then ...
- Triple-check you have a fresh backup !!
- Stop the SMB (and other services working with the data)
- Execute the convmv conversion. If you have thousands of files, it is better to run it via nohup with redirected output, or in tmux, so that an unexpected session drop does not interrupt it
- Change the SMB from ISO-8859-1 to UTF-8 and start the service.
- Validate that the shares are intact and working properly from client side
- Adjust other services accordingly and validate
- Inform your users that the move to UTF-8 has been done. Windows users should not be affected. Linux/Mac users might need to change their client charset to UTF-8 (in PuTTY, WinSCP, shell scripts, etc.)
- Call it a day and enjoy the UTF-8 :]
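The "prepare the commands from a list of paths" part of step 4 could be sketched roughly like this. The function name `run_convmv`, the list file smb_shares.txt and the script name convert.sh are all hypothetical; the list is assumed to hold one share path per line.

```shell
# run_convmv LIST [extra convmv options...]
# Runs convmv once per path listed in LIST. Without extra options this is a
# dry run; pass --notest to rename for real.
run_convmv() {
    list=$1
    shift
    while IFS= read -r path; do
        # Skip empty lines in the list.
        [ -n "$path" ] || continue
        echo "== $path"
        # stdin is detached so convmv cannot swallow the rest of the list.
        convmv -f ISO-8859-1 -t UTF-8 -r "$path" "$@" </dev/null
    done < "$list"
}

# Dry run, with the output kept in a file for review:
#   run_convmv ./smb_shares.txt > convmv-testrun.log 2>&1
#
# Real run that survives a dropped session: put the function plus a final
# line  run_convmv "$@"  into a script (convert.sh here) and start it with
#   nohup sh convert.sh ./smb_shares.txt --notest > convmv-real.log 2>&1 &
```

Keeping the dry-run log next to the real-run log gives you something to compare if a share looks wrong afterwards.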

I hope I did not miss anything.
 

higorr

Dabbler
Joined
Jan 7, 2020
Messages
12
Thank you very much for your help. In addition to helping with my problem, you taught me a lot.
I am already planning the backups and working through the encoding folder by folder. Thanks again!
 