The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8. Googling for "mysql convert charset to utf8" results in a plethora of sites, each with a slightly different approach, and each broken in some respect. I'll outline those approaches here and show why they don't work, and then present a script that can generically be used to convert a database (or set of tables) to a target character set and collation.
Approach #1:
ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Take the following table as an example of why this approach will not work:
mysql> CREATE TABLE `t1` (
    ->   `c1` text NOT NULL
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)

mysql> ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SHOW CREATE TABLE `t1`\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `c1` mediumtext NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.01 sec)
Notice the implicit conversion of c1 from text to mediumtext. A text column holds up to 65,535 bytes, and a utf8 character can occupy up to three bytes, so MySQL promotes the column to mediumtext to preserve the number of characters it can store. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.
Approach #2 (outlined here):
This approach avoids the issue of implicit conversions by changing each data type to its binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterparts. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, the process is inherently unsuitable for large databases because it requires multiple ALTER statements to be run on each table (sketched below):
1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back
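For the `t1` table above, and assuming a hypothetical FULLTEXT index named `ft_c1`, the round trip would look something like this:

-- 1) Drop FULLTEXT indexes (a binary column cannot be part of one)
ALTER TABLE `t1` DROP INDEX `ft_c1`;
-- 2) Convert the target column to its binary counterpart
ALTER TABLE `t1` MODIFY COLUMN `c1` blob NOT NULL;
-- 3) Convert the table to the target character set (binary columns are skipped)
ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
-- 4) Convert the target column back to its original data type
ALTER TABLE `t1` MODIFY COLUMN `c1` text CHARACTER SET utf8 NOT NULL;
-- 5) Add the FULLTEXT index back
ALTER TABLE `t1` ADD FULLTEXT INDEX `ft_c1` (`c1`);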
For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.
Approach #3:
Dumping the entire database and re-importing it with the appropriate server & client character sets.
This is a three-step process: first dump only the schema and edit it by hand to have the appropriate character sets, then dump the data separately, and finally re-create the schema and import the data. If you’re using replication, this usually isn’t even an option, because you’ll generate a ridiculous amount of binary logs and force a reload of the data on every server in the replication chain (very time/bandwidth/disk-space consuming).
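As a sketch, assuming a database named mydb (the database name and file names here are illustrative), the cycle looks something like:

%> mysqldump --no-data mydb > schema.sql
%> mysqldump --no-create-info mydb > data.sql
# edit schema.sql by hand, e.g. change CHARSET=latin1 to CHARSET=utf8
%> mysql --default-character-set=utf8 mydb < schema.sql
%> mysql --default-character-set=utf8 mydb < data.sql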
Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:
ALTER TABLE `t1` DEFAULT CHARSET=utf8,
  MODIFY COLUMN `c1` text CHARACTER SET utf8 NOT NULL;
This approach changes both the table’s default character set and the target column’s character set, while leaving any FULLTEXT indexes in place. It also requires only a single ALTER statement per table. A Perl script has been put together to parallelize the ALTER statements and is available at:
%> wget http://www.pablowe.net/convert_charset
It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:
- Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
- Allow throttling of the number of threads created (currently creates one per table)
Usage: convert_charset --database=database [options]

Options:
  --askpass           Prompt for a MySQL password
  --charset           The target character set to convert to
  --collate           The target collation to convert to
  --database | d      The target database
  --help | ?          Display this help and exit
  --host | h          The target host
  --ignore-columns    Columns to ignore, useful if you want to keep the
                      existing charset for a target column.
                      Comma-separated. NO SPACES. table.column
  --ignore-tables     A comma-separated list of tables to ignore
  --password | p      The MySQL password to use
  --port              The target port
  --tables            A comma-separated list of tables to convert.
                      All non-named tables will be ignored
  --test              Print the ALTER statements that would be executed
                      without executing them.
  --user | u          The MySQL user
  --version | V       Display version information and exit

defaults are:

ATTRIBUTE           VALUE
--------------------------------------------
askpass             FALSE
charset             utf8
collate             No Default Value
database            No Default Value
help                FALSE
host                localhost
ignore-columns      No Default Value
ignore-tables       No Default Value
password            No Default Value
port                3306
tables              No Default Value
test                FALSE
user                Current User
version             FALSE
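For example, to preview the statements for a database named mydb (the name is illustrative) without executing them, a dry run would look like:

%> convert_charset --database=mydb --charset=utf8 --collate=utf8_general_ci --test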