Detect encoding and make everything UTF-8
I'm reading lots of texts from various RSS feeds and inserting them into my database.
Of course, the feeds use several different character encodings, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is "Ÿ", it is displayed correctly.
2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.
3. In other cases, the "ß" is saved as "ß", without any change. Then it is also displayed wrongly.
What can I do to avoid cases 2 and 3?
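To make the cases concrete at the byte level, here is a small sketch. It assumes the mbstring extension is available, and the variable names are illustrative only. It shows the bytes of a correctly encoded "ß", of a "ß" that was already UTF-8 but got converted from ISO 8859-1 a second time, and of a raw ISO 8859-1 "ß" that is not valid UTF-8 at all.

```php
<?php
// Illustrative names; requires the mbstring extension.
$utf8   = "\xC3\x9F"; // "ß" encoded as UTF-8 (two bytes)
$latin1 = "\xDF";     // "ß" encoded as ISO 8859-1 (one byte)

// Double encoding: UTF-8 bytes are wrongly treated as ISO 8859-1
// and converted to UTF-8 a second time.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";          // c39f
echo bin2hex($doubleEncoded), "\n"; // c383c29f
echo bin2hex($latin1), "\n";        // df

// A raw ISO 8859-1 byte is not a valid UTF-8 sequence on its own.
$latin1IsValidUtf8 = mb_check_encoding($latin1, 'UTF-8'); // false
```

Comparing hex dumps like this is often the quickest way to tell which of the situations a stored value is in.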
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?
How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
Would a function like this work?
function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}
I've tested it, but it doesn't work. What's wrong with it?
Reference: https://stackoom.com/question/3owD/检测编码并制作一切UTF
I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
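// Prepend an XML declaration so the parser reads the markup as UTF-8.
// The declaration was lost from this copy; the string below is the usual form and is an assumption.
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;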
mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations had no effect.
Get the encoding from the response headers and convert the content to UTF-8.
$post_url = 'http://website.domain';
// Fetch only the response headers for a URL
function get_headers_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $r = curl_exec($ch);
    return (string) $r; // curl_exec() returns false on failure
}
$the_header = get_headers_curl($post_url);
// Check for a redirect and, if there is one, read the headers of the target
if (preg_match('/^Location:\s*(\S+)/mi', $the_header, $m)) {
    $the_header = get_headers_curl($m[1]);
}
// Get the charset; capturing it with the regex avoids a trailing "\r" in the value
$charset = null;
if (preg_match('/charset=["\']?([\w.:-]+)/i', $the_header, $m)) {
    $charset = $m[1];
}
// echo $charset;
// $html is assumed to hold the page body, fetched separately
if ($charset && strcasecmp($charset, 'UTF-8') !== 0) {
    $html = iconv($charset, 'UTF-8', $html);
}
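The charset extraction step can be isolated into a small function that is easy to test without a network request. The following is only a sketch: the function name charset_from_headers() and the sample header block are made up for illustration, and it only looks at the Content-Type header.

```php
<?php
// Hypothetical helper: pull the charset out of a raw HTTP header block.
// Returns the charset in upper case, or null when none is declared.
function charset_from_headers(string $headers): ?string
{
    // Match "charset=..." on the Content-Type line, with or without quotes.
    if (preg_match('/^Content-Type:.*?charset=["\']?([\w.:-]+)/mi', $headers, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Sample header block (invented for this example).
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=iso-8859-1\r\n\r\n";
echo charset_from_headers($sample), "\n"; // ISO-8859-1
```

Normalising the case up front means the later comparison against 'UTF-8' does not depend on how the server spells the charset. Keep in mind that the header can be missing or wrong, so the result is a hint rather than a guarantee.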
I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.
Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @.
// Convert everything in our vars to UTF-8 for playing nice with the database...
// Use some auto detection here to help us not double-encode...
// Suppress possible warnings with @ for when the encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
    // each() was removed in PHP 8, so walk the (growing) list by index instead
    for ($key = 0; $key < count($process); $key++) {
        $val = $process[$key];
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            $newKey = @mb_convert_encoding($k, 'UTF-8', 'auto');
            if (is_array($v)) {
                // Re-insert the nested array and queue it for processing
                $process[$key][$newKey] = $v;
                $process[] = &$process[$key][$newKey];
            } else {
                $process[$key][$newKey] = @mb_convert_encoding($v, 'UTF-8', 'auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}
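The same scrubbing idea can also be written recursively, which may be easier to follow. This is a sketch under a stated assumption rather than a drop-in replacement: instead of 'auto' detection, it leaves valid UTF-8 untouched and assumes everything else is ISO 8859-1. The function name scrub_to_utf8() is hypothetical.

```php
<?php
// Hypothetical helper; assumes any string that is not valid UTF-8 is ISO 8859-1.
function scrub_to_utf8($value)
{
    if (is_array($value)) {
        $clean = [];
        foreach ($value as $k => $v) {
            // Convert string keys as well as values; integer keys stay as they are.
            $key = is_string($k) ? scrub_to_utf8($k) : $k;
            $clean[$key] = scrub_to_utf8($v);
        }
        return $clean;
    }
    if (!is_string($value) || mb_check_encoding($value, 'UTF-8')) {
        return $value; // not a string, or already valid UTF-8: leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Mixed input: one ISO 8859-1 value, one UTF-8 value, one nested ISO 8859-1 value.
$input = ['name' => "Fu\xDFball", 'tags' => ["caf\xC3\xA9", "\xE9t\xE9"]];
$clean = scrub_to_utf8($input);
echo json_encode($clean, JSON_UNESCAPED_UNICODE), "\n";
```

It could be applied as $_GET = scrub_to_utf8($_GET); and likewise for $_POST and $_REQUEST. Checking validity first is what prevents double encoding of input that is already UTF-8.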
A really nice way to implement an isUTF8 function can be found on php.net:
function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
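One caveat with the round-trip check: utf8_decode() replaces characters that do not exist in ISO 8859-1 with "?", so a valid UTF-8 string containing, say, "€" would be reported as not UTF-8. Both functions are also deprecated as of PHP 8.2. A possible alternative, sketched here with hypothetical function names and assuming the mbstring extension, is to validate with mb_check_encoding() and only convert when validation fails:

```php
<?php
// Hypothetical helpers; require the mbstring extension.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Assumption: anything that is not valid UTF-8 is treated as ISO 8859-1.
function to_utf8_assuming_latin1(string $s): string
{
    return is_valid_utf8($s) ? $s : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

var_dump(is_valid_utf8("Fu\xC3\x9Fball")); // UTF-8 "Fußball"
var_dump(is_valid_utf8("\xE2\x82\xAC"));   // UTF-8 "€", not representable in ISO 8859-1
var_dump(is_valid_utf8("Fu\xDFball"));     // ISO 8859-1 "Fußball"

echo bin2hex(to_utf8_assuming_latin1("\xDF")), "\n";     // c39f
echo bin2hex(to_utf8_assuming_latin1("\xC3\x9F")), "\n"; // c39f (unchanged, no double encoding)
```

Validation cannot tell ISO 8859-1 from other single-byte encodings such as Windows-1252, so the fallback encoding remains an assumption about the data source.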
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
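To illustrate how a string that mixes UTF-8 and Latin1 can be normalised in principle, here is a minimal sketch. It is not the library's implementation: the function name mixed_to_utf8() is hypothetical, it assumes the mbstring extension, and it treats every byte that is not part of a well-formed UTF-8 sequence as ISO 8859-1.

```php
<?php
// Hypothetical sketch, NOT the ForceUTF8 code. Requires the mbstring extension.
function mixed_to_utf8(string $s): string
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        // Work out how long a UTF-8 sequence starting with this byte would be.
        if ($b < 0x80) {
            $n = 1;
        } elseif (($b & 0xE0) === 0xC0) {
            $n = 2;
        } elseif (($b & 0xF0) === 0xE0) {
            $n = 3;
        } elseif (($b & 0xF8) === 0xF0) {
            $n = 4;
        } else {
            $n = 0; // cannot start a UTF-8 sequence
        }
        $chunk = $n > 0 ? substr($s, $i, $n) : '';
        if ($n > 0 && strlen($chunk) === $n && mb_check_encoding($chunk, 'UTF-8')) {
            $out .= $chunk; // well-formed UTF-8: copy as is
            $i += $n;
        } else {
            // Assumption: a stray byte is ISO 8859-1, so convert just that byte.
            $out .= mb_convert_encoding($s[$i], 'UTF-8', 'ISO-8859-1');
            $i++;
        }
    }
    return $out;
}

// One Latin1 "ß" and one UTF-8 "ß" in the same string.
echo bin2hex(mixed_to_utf8("\xDF\xC3\x9F")), "\n"; // c39fc39f
```

The approach is inherently ambiguous: a Latin1 byte pair that happens to form a valid UTF-8 sequence is kept as UTF-8, so some inputs cannot be recovered reliably.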
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ãƒâ€šÃ‚Â©dÃƒÆ’Ãƒâ€šÃ‚Â©ration Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().