Detect encoding and make everything UTF-8

I'm reading lots of texts from various RSS feeds and inserting them into my database.

Of course, the feeds use several different character encodings, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

  1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.

  2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.

  3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed wrongly.

What can I do to avoid cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

  1. How do I find out what encoding the text uses?
  2. How do I convert it to UTF-8, whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?
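
One likely reason the function above fails is that the 'auto' candidate list usually does not include ISO 8859-1, so Latin-1 input is never identified as such. A more predictable approach is to check whether the text is already valid UTF-8 and only convert when it is not. The following is a minimal sketch of that idea, not a definitive fix: the name to_utf8() and the ISO-8859-1 fallback are assumptions chosen for illustration, and the mbstring extension is required.

```php
<?php
// Sketch: keep valid UTF-8 as-is, otherwise convert from an assumed fallback encoding.
// to_utf8() and its ISO-8859-1 default are illustrative assumptions, not part of the question.
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (this includes plain ASCII)
    }
    // Not valid UTF-8: assume the fallback encoding and convert once.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8("Fu\xDFball")), "\n";     // Latin-1 input  -> 4675c39f62616c6c
echo bin2hex(to_utf8("Fu\xC3\x9Fball")), "\n"; // UTF-8 input    -> 4675c39f62616c6c
```

This avoids double-encoding text that is already UTF-8, but it cannot tell different single-byte encodings apart; the fallback has to come from knowledge about the feed.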


Reply #1

Reference: https://stackoom.com/question/3owD/检测编码并制作一切UTF


Reply #2

I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:

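// Assumption: the prefix string was lost in this copy. This hack is usually written by
// prepending an XML encoding declaration so the parser treats the markup as UTF-8.
$html = '<?xml encoding="utf-8" ?>' . $html;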

mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations had no effect.


Reply #3

Get the encoding from the headers and convert the text to UTF-8.

$post_url = 'http://website.domain';

// Fetch only the response headers for a URL
function get_headers_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_HEADER,         true);
    curl_setopt($ch, CURLOPT_NOBODY,         true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);

    $r = curl_exec($ch);
    return $r === false ? '' : $r; // empty string if the request failed
}

$the_header = get_headers_curl($post_url);

// Check for a redirect and fetch the headers of the target instead
if (preg_match("/Location:/i", $the_header)) {
    $arr = preg_split('/Location:/i', $the_header); // case-insensitive, like the check above
    $location = explode(chr(10), $arr[1]);
    $the_header = get_headers_curl(trim($location[0]));
}

// Get the charset
$charset = '';
if (preg_match("/charset=/i", $the_header)) {
    $arr = preg_split('/charset=/i', $the_header);
    $charset = explode(chr(10), $arr[1]);
    $charset = trim($charset[0]); // strip the trailing "\r"
}
// echo $charset;

// $html is assumed to already hold the page body, fetched separately
if ($charset && strcasecmp($charset, 'UTF-8') !== 0) {
    $html = iconv($charset, 'UTF-8', $html);
}
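
The charset extraction above can also be written as a single regular expression over the raw header text, which makes it easy to try without a network request. This is only a sketch: charset_from_headers() is a hypothetical helper name, and the sample header block is made up for illustration.

```php
<?php
// Hypothetical helper: pull the charset out of a raw HTTP header block.
function charset_from_headers(string $headers): ?string
{
    // Matches e.g. "Content-Type: text/html; charset=iso-8859-1" (case-insensitive, per line).
    if (preg_match('/^Content-Type:.*?charset=["\']?([A-Za-z0-9._:-]+)/mi', $headers, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset declared
}

// Made-up sample headers, for illustration only.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=iso-8859-1\r\nContent-Length: 10\r\n";
echo charset_from_headers($sample), "\n"; // ISO-8859-1
```

Returning null when no charset is declared lets the caller decide on a default instead of passing an empty encoding name to iconv().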

Reply #4

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

// Convert everything in our vars to UTF-8 for playing nice with the database...
// Use some auto detection here to help us not double-encode...
// Suppress possible warnings with @'s for when encoding cannot be detected
// (each() was removed in PHP 8, so the arrays are walked recursively instead)
function scrub_to_utf8(array $data)
{
    $clean = array();
    foreach ($data as $k => $v) {
        // Convert string keys as well as values; integer keys are kept as-is
        if (is_string($k)) {
            $k = @mb_convert_encoding($k, 'UTF-8', 'auto');
        }
        if (is_array($v)) {
            $clean[$k] = scrub_to_utf8($v);
        } elseif (is_string($v)) {
            $clean[$k] = @mb_convert_encoding($v, 'UTF-8', 'auto');
        } else {
            $clean[$k] = $v;
        }
    }
    return $clean;
}

$_GET     = scrub_to_utf8($_GET);
$_POST    = scrub_to_utf8($_POST);
$_REQUEST = scrub_to_utf8($_REQUEST);

Reply #5

A really nice way to implement an isUTF8 function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
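
Note that the round trip above rejects valid UTF-8 containing characters outside ISO 8859-1 (utf8_decode() replaces them with "?"), and that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A direct validity check is one alternative; the sketch below uses the /u modifier of preg_match(), and the function name is_valid_utf8() is an assumption for illustration.

```php
<?php
// Sketch: validate UTF-8 directly instead of round-tripping through Latin-1.
function is_valid_utf8(string $s): bool
{
    // With the /u modifier, preg_match() returns false when the subject is not valid UTF-8.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("Fu\xC3\x9Fball")); // bool(true)
var_dump(is_valid_utf8("\xE2\x82\xAC"));   // bool(true), the euro sign
var_dump(is_valid_utf8("Fu\xDFball"));     // bool(false), a lone Latin-1 byte
```

Keep in mind that "valid UTF-8" does not prove the text was meant as UTF-8; plain ASCII, for instance, passes this check as well.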

Reply #6

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
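
To make the double-encoding effect concrete, here is a small sketch. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode() (which performs the same conversion but is deprecated as of PHP 8.2); the variable names are illustrative.

```php
<?php
// "ß" as UTF-8 bytes.
$utf8 = "\xC3\x9F";

// Treating those bytes as ISO-8859-1 and converting to UTF-8 again double-encodes them.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Converting back from UTF-8 to ISO-8859-1 undoes exactly one level of encoding.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";     // c39f
echo bin2hex($double), "\n";   // c383c29f
echo bin2hex($restored), "\n"; // c39f
```

The four-byte result is what shows up as two garbled characters in place of a single "ß".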

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data that was all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
