In this article, we will see how you can fix error ‘invalid byte sequence for encoding UTF8’ while restoring a PostgreSQL database. At work, I got a task to move DBs which has ASCII encoding to UTF8 encoding. Let me first confess that the ASCII DBs was not created by intention, someone accidentally created it!!! Having a DB ASCII encoded is very dangerous, it should be moved to UTF8 encoding as soon as possible. So the initial plan was to create archive dump of the DB with pg_dump , create a new DB with UTF8 encoding and restore the dump to the new DB using pg_restore . The plan worked for most of the DBs, but failed for one DB with below error.
DETAIL: Proceeding with relation creation anyway. pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 35091; 0 2527787452 TABLE DATA my_table release pg_restore: [archiver (db)] COPY failed for table "my_table": ERROR: invalid byte sequence for encoding "UTF8": 0xa5 CONTEXT: COPY my_table, line 41653 WARNING: errors ignored on res
As the error says, there are some invalid UTF8 characters in table “my_table” which prevents pg_restore from restoring the particular table. I did a lot of research and googling to see what to do. I will list out what all steps I did.
Assume ‘my_db’ and ‘my_table’ is the database name and table name respectively.
Dump the Database excluding particular table ‘my_table’. I would suggest dumping the database in archive format for saving time and disk space.
pg_dump -Fc -T 'my_table' -p 1111 -f dbdump.pgd my_db
Create the new database with UTF8 encoding and restore the dump.
pg_restore -p 2222 -j 8 -d my_new_db dbdump.pgd
The restoration should be successful as we didn’t restore the offending table.
Dump the offending table ‘my_table’ in plain text format.
pg_dump -Fp -t 'my_table' -p 1111 my_db > my_db_table_only.sql
Now we have table data in plain text. Let’s find invalid UTF8 characters in the file by running below command(make sure locale is set to UTF-8,).
# grep -naxv '.*' my_db_table_only.sql 102:2010-03-23 ��ԥ� data1 data2
� represents an invalid UTF8 character and it is present in 102th line of the file.
Find which charset the invalid UTF8 characters belongs to.
#grep -naxv '.*' my_db_table_only.sql > test.txt #file -i test.txt test.txt: text/plain; charset=iso-8859-1
As per the output, those characters belongs to iso-8859-1. The charset may be different in your case.
Let’s convert iso-8859-1 to UTF8 using iconv command.
#grep -naxv '.*' my_db_table_only.sql | iconv --from-code=ISO-8859-1 --to-code=UTF-8 102:2010-03-23 ¥Êԥ¡ data1 data2
Now you got the characters in UTF8 encoding. So you can just replace ��ԥ� with ¥Êԥ¡ in 102th line of dump file(I used nano editor to do this, faced issues with Vim .)
I know that replacing characters manually could be a pain in the ass if there are lot of invalid UTF8 characters. We can run iconv on the whole file as shown below.
iconv --from-code=ISO-8859-1 --to-code=UTF-8 my_db_table_only.sql > my_db_table_only_utf8.sql
But I won’t recommend this as it may change valid characters(eg: Chinese characters ) to some other characters. If you plan to run iconv on the file, just make sure only invalid UTF8 characters are converted by taking diff of both files.
Once the characters are replaced. Restore the table to the database.
psql -p 2222 -d my_new_db -f my_db_table_only.sql
No more “Invalid byte sequence for encoding UTF8” error. Thanks for the time taken to read my blog. Subscribe to this blog so that you don’t miss out anything useful (Checkout Right Sidebar for the Subscription Form and Facebook follow button) . Please also put your thoughts as comments .