JAVA实现批量URICode url中的中文并更新

1.背景描述

迁移数据时涉及到带中文的图片地址，类似：https://cccc.cn/cms/中国红_SPU001.jpg
DB在处理数据的时候报错

错误图片.png

需要转换成这样的地址：https://cccc.cn/cms/%E4%B8%AD%E5%9B%BD%E7%BA%A2_SPU001.jpg
才能成功导入到数据库中。

2.需求

把excel中第3-5列中的url地址进行URICode,并更新到该excel中。

3.分析

网上有一些在线转换的工具，由于数量比较多，一个个的转比较费时，还容易出错，写一段代码来完成这个工作。
3.1 处理url的时候先检测到中文，然后把中文进行URICode，变成预期的url地址；
3.2 使用POI读取excel表格的url数据，进行3.1，转换后的url更新到原url在的位置。

4.编码

引入poi相关的jar包（gradle的方式）

 implementation group: 'org.apache.poi', name: 'poi', version: '5.2.2'
 implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '5.2.2'

代码如下：

public class DealURLData {

    private static final Integer INITIAL_COLUMN = 2;
    private static final Integer END_COLUMN = 4;
    private static final Integer INITIAL_ROW = 1;
    private static final String FILE_PATH = "C:\\XXXX\\XXX\\Desktop\\XXX\\lotus.xlsx";
    private static final String XLS_SUFFIX_NAME = "xls";
    private static final String CHECK_CN_CHARACTERS = "[\\u4e00-\\u9fa5]|[\\u3002\\uff1b\\uff0c\\uff1a\\u201c\\u201d\\uff08\\uff09\\u3001\\uff1f\\u300a\\u300b\\u3010\\u3011]";
    private static final String ESCAPE_CHARACTER_OF_NEWLINE = "\\N";

    public static void main(String[] args) throws Exception {

        readAndChangeImageUrl(FILE_PATH);

    }

    private static void readAndChangeImageUrl(String filePath) throws Exception {
        FileInputStream fileInputStream = new FileInputStream(filePath);
        Workbook workbook_des;
        //to obtain the file suffix,xls uses HSSFWorkbook; xlsx uses the XSSFWorkbook
        String suffix = getSuffix(filePath);
        if (XLS_SUFFIX_NAME.equals(suffix)) {
            workbook_des = new HSSFWorkbook(fileInputStream);
        } else {
            workbook_des = new XSSFWorkbook(fileInputStream);
        }

        Iterator iterator = workbook_des.sheetIterator();
        //get the sheet by iterating through the iterator
        while (iterator.hasNext()) {
            Sheet sheet = iterator.next();
            if (ObjectUtils.isEmpty(sheet)) {
                break;
            }
            /* Iterate through columns 3 through 5, since these columns store the url's address*/
            for (int columnNum = INITIAL_COLUMN; columnNum <= END_COLUMN; columnNum++) {
                for (int rowNum = INITIAL_ROW; rowNum <= sheet.getLastRowNum(); rowNum++) {
                    // get the row
                    Row row = sheet.getRow(rowNum);
                    Cell cell = row.getCell(columnNum);
                    if (ObjectUtils.isEmpty(cell) || ESCAPE_CHARACTER_OF_NEWLINE.equals(cell)) {
                        break;
                    }
                    // set the Cell type is String
                    cell.setCellType(CellType.STRING);
                    // gets the data in the cell
                    String data = new DataFormatter().formatCellValue(cell);
                    // The url address is converted and updated to this cell
                    row.createCell(columnNum).setCellValue(urlChange(data));
                }
            }
        }
        FileOutputStream out = new FileOutputStream(filePath);
        out.flush();
        workbook_des.write(out);
        out.close();
    }

    /**
     * get the suffix of the file 
     * @param filepath
     */
    private static String getSuffix(String filepath) {
        if (StringUtils.isEmpty(filepath)) {
            return "";
        }
        int index = filepath.lastIndexOf(".");
        if (index == -1) {
            return "";
        }
        return filepath.substring(index + 1, filepath.length());
    }

    /**
     * @Description convert the Chinese characters in the url
     * @Param url
     **/
    private static String urlChange(String url) {
        Matcher matcher = Pattern.compile(CHECK_CN_CHARACTERS).matcher(url);
        while (matcher.find()) {
            String tmp = matcher.group();
            try {
                url = url.replaceAll(tmp, java.net.URLEncoder.encode(tmp, "utf-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
        return url;
    }

}

5.验证

转换前的样子：

转换前.png

转换后的结果：

转换后的结果.png

6.回炉

当我满怀欣喜执行数据的时候，发现还有问题，像Crème-d’Nude这种字符，里面没有中文，上面的代码也不会转换，在更新数据库中还是有问题。
就是要把url转成前端用 encodeURIComponent 转码后的结果入下图：

FE-encodeURIComponent.png

后端用下面代码执行后的结果为：

BE-encodeUrl.png

在网上找了下述代码，测试可以

public static String urlEncode(String url) throws UnsupportedEncodingException {
       if(url == null) {
           return null;
       }
       final String reserved_char = ";/?:@=&";
       String ret = "";
       for(int i=0; i < url.length(); i++) {
           String cs = String.valueOf( url.charAt(i) );
           if(reserved_char.contains(cs)){
               ret += cs;
           }else{
               ret += URLEncoder.encode(cs, "utf-8");
           }
       }
       return ret.replace("+", "%20");
   }

参考链接：
https://blog.csdn.net/zhemeban/article/details/124895170
https://blog.csdn.net/weixin_50002647/article/details/123986127
https://blog.csdn.net/qq_35856336/article/details/92629290