Fix for ‘Invalid byte sequence for encoding UTF8’ error while restoring PostgreSQL database


In this article, we will see how you can fix error ‘invalid byte sequence for encoding UTF8’ while restoring a PostgreSQL database. At work, I got a task to move DBs which has ASCII encoding to UTF8 encoding. Let me first confess that the ASCII DBs was not created by intention, someone accidentally created it!!! Having a DB ASCII encoded is very dangerous, it should be moved to UTF8 encoding as soon as possible. So the initial plan was to create archive dump of the DB with  pg_dump , create a new DB with UTF8 encoding and restore the dump to the new DB using  pg_restore . The plan worked for most of the DBs, but failed for one DB with below error.

DETAIL: Proceeding with relation creation anyway.
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 35091; 0 2527787452 TABLE DATA my_table release
pg_restore: [archiver (db)] COPY failed for table "my_table": ERROR: invalid byte sequence for encoding "UTF8": 0xa5
CONTEXT: COPY my_table, line 41653
WARNING: errors ignored on res

As the error says, there are some invalid UTF8 characters in table “my_table” which prevents pg_restore from restoring the particular table. I did a lot of research and googling to see what to do. I will list out what all steps I did.

Assume ‘my_db’ and ‘my_table’ is the database name and table name respectively.

Step 1:

Dump the Database excluding particular table ‘my_table’. I would suggest dumping the database in archive format for saving time and disk space.

pg_dump -Fc -T 'my_table' -p 1111  -f dbdump.pgd my_db

Step 2:

Create the new database with UTF8 encoding and restore the dump.

pg_restore -p 2222 -j 8 -d my_new_db dbdump.pgd

The restoration should be successful as we didn’t restore the offending table.

Step 3:

Dump the offending table ‘my_table’ in plain text format.

pg_dump -Fp -t 'my_table' -p 1111 my_db >  my_db_table_only.sql

Step 4:

Now we have table data in plain text. Let’s find invalid UTF8 characters in the file by running below command(make sure locale is set to UTF-8,).

# grep -naxv '.*'   my_db_table_only.sql
102:2010-03-23 ��ԥ�	data1 data2

� represents an invalid UTF8 character and it is present in 102th line of the file.

Step 5:

Find which charset the invalid UTF8 characters belongs to.

#grep -naxv '.*' my_db_table_only.sql > test.txt 
#file -i test.txt
test.txt: text/plain; charset=iso-8859-1

As per the output, those characters belongs to iso-8859-1. The charset may be different in your case.

Step 6:

Let’s convert  iso-8859-1  to  UTF8  using  iconv  command.

#grep -naxv '.*' my_db_table_only.sql |  iconv --from-code=ISO-8859-1 --to-code=UTF-8
102:2010-03-23 ¥Êԥ¡ data1 data2

Now you got the characters in UTF8 encoding. So you can just replace  ��ԥ� with  ¥Êԥ¡   in 102th line of dump file(I used  nano  editor to do this, faced issues with  Vim .)

 

I know that replacing characters manually could be a pain in the ass if there are lot of invalid UTF8 characters. We can run  iconv  on the whole file as shown below.

iconv --from-code=ISO-8859-1 --to-code=UTF-8 my_db_table_only.sql  > my_db_table_only_utf8.sql

But I won’t recommend this as it may change valid characters(eg: Chinese characters ) to some other characters. If you plan to run iconv on the file, just make sure only invalid UTF8 characters are converted by taking  diff  of both files.

Step7.

Once the characters are replaced. Restore the table to the database.

psql -p 2222 -d my_new_db -f my_db_table_only.sql

No more “Invalid byte sequence for encoding UTF8” error. Thanks for the time taken to read my blog. Subscribe to this blog so that you don’t miss out anything useful   (Checkout Right Sidebar for the Subscription Form and Facebook follow button)  . Please also put your thoughts as comments .

Related Posts

Begin typing your search term above and press enter to search. Press ESC to cancel.

Back To Top
x