Title	Converting a Database to UTF8 Format Using copy() with the nvarchar copy format

URL Name	Converting-a-Database-to-UTF8-Format-Using-copy-with-the-nvarchar-copy-format

Assigned To

Summary Text Converting a Database to UTF8 Format Using copy() with the nvarchar copy format
Migrating a Database from non-UTF8 to UTF8 format requires to unload the data, convert it to UTF8 and then load the data into a new Database.

Resolution

The SQL copy command knows the nchar(0)/nvarchar(0) format. When using this format to copy the data out, the data is converted to UTF8 strings on the fly and the (byte)size of each value in the datafiles is adjusted correctly.
Here is an example for a table with an integer column id, a varchar column txt and a long varchar column longtxt:

copy t1(
        id= c0tab,
        txt= nvarchar(0)tab,
        longtxt= long nvarchar(0)nl with null(']^NULL^['))
into 't1.dat'

The database needs to be Unicode enabled to get the conversion working. The nchar(0)/nvarchar(0) copy formats are available in all 9.x releases too.
Starting with Ingres 10 the copydb command knows a -nvarchar flag, it changes the default copy specification for character types from varchar(0) to nvarchar(0), so there is no need to edit the copy.out script manually, just run:

copydb -nvarchar mydb

However, the -nvarchar flag also affects the generated copy.in file. Within the create table definitions, the char types are replaced by the corresponding n-char types and the nvarchar(0) copy format is used to copy the data in.
When using this copy.in we end up with a database where we actually wouldn't need an UTF8 installation, the data is now stored in nchar/nvarchar columns (so in UTF16 format) with all it's advantages and disadvantages.

There is a solution for this:
When the conversion to the nchar/nvarchar datatypes is not wanted, run copydb twice.
One time with the -nvarchar flag to get the copy.out with conversion, and a second time without that flag, but with the -c flag instead of the -nvarchar flag to get the copy.in. (Don't forget to save the copy.out before running copydb the second time).

Using the copy.out generated with copydb -nvarchar converts the data to UTF8, while the copy.in created by copydb without the -nvarchar flag uses the standard char/varchar datatypes and the standard varchar(0) copy format. In an UTF8 installation any char/varchar is expected to be in UTF8 format, so this exactly what we want.

Obviously, the target installation needs to be an UTF8 installation when applying the data with this copy.in file.

Caution:

When having char(n)/varchar(n) columns in a UTF8 installation, the specified size n describes the number of bytes, not the number of characters. In UTF8 a character can take 1 to 4 bytes. All the characters in the ASCII range take one byte, 4 byte characters are rarely used, but in general you need a bigger size, than used before.

In contrast to this, for nchar(n)/nvarchar(n) columns, the size n specifies the number of characters. Every character takes two bytes as the data is in UCS2 format (a UTF16 subset)

On Windows under rare circumstances this approach may fail, for details and how to solve this check
Specifying a File Type for the COPY Statement

Solution Date 10-Mar-2012

Product Actian Ingres