How to read and write files in UTF-8 encoding with C++ fopen and CFile


 

How to write UTF-8 file with fprintf in C++

http://stackoverflow.com/questions/10028750/how-to-write-utf-8-file-with-fprintf-in-c

 

You shouldn't need to set your locale or set any special modes on the file if you just want to use fprintf. You simply have to use UTF-8 encoded strings.

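#include <cstdio>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // Convert the wide literal to UTF-8 bytes.
    // std::wstring_convert is C++11 (deprecated since C++17).
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    std::string utf8_string = convert.to_bytes(L"кошка 日本国");

    if (FILE *f = fopen("tmp", "w")) {
        fprintf(f, "%s\n", utf8_string.c_str());
        fclose(f);
    }
}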

 

Save the program as UTF-8 with signature or UTF-16 (i.e. don't use UTF-8 without signature, otherwise VS won't produce the right string literal). The file written by the program will contain the UTF-8 version of that string. Or you can do:

int main() {
    if(FILE *f = fopen("tmp","w"))
        fprintf(f,"%s\n","кошка 日本国");
}

 

In this case you must save the file as UTF-8 without signature, because you want the compiler to think the source encoding is the same as the execution encoding... This is a bit of a hack that relies on the compiler's, IMO, broken behavior.
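One way to sidestep the source-encoding question entirely is to spell the UTF-8 bytes out with hex escapes, so the literal is plain ASCII in the source file. The following is a minimal sketch rather than part of the original answer; the helper name write_utf8_line and the constant kCatUtf8 are hypothetical, and only the word "кошка" is encoded.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: writes `text` plus a newline to `path`, byte for byte.
// Binary mode ("wb") keeps the C runtime from translating '\n' on Windows.
bool write_utf8_line(const char* path, const std::string& text) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fprintf(f, "%s\n", text.c_str()) >= 0;
    return std::fclose(f) == 0 && ok;
}

// "кошка" as explicit UTF-8 bytes: к=D0 BA, о=D0 BE, ш=D1 88, к=D0 BA, а=D0 B0.
// The literal does not depend on how the source file itself is saved.
const std::string kCatUtf8 = "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0";
```

Calling write_utf8_line("tmp", kCatUtf8) should then produce the same bytes regardless of whether the source was saved with or without a signature.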

You can do basically the same thing with any of the other APIs for writing narrow characters to a file, but note that none of these methods work for writing UTF-8 to the Windows console. Because the C runtime and/or the console is a bit broken, you can only write UTF-8 directly to the console by calling SetConsoleOutputCP(65001) and then using one of the puts family of functions.

If you want to use wide characters instead of narrow characters then locale based methods and setting modes on file descriptors could come into play.

#include <cstdio>
#include <fcntl.h>
#include <io.h>

int main() {
    if(FILE *f = fopen("tmp","w")) {
        _setmode(_fileno(f), _O_U8TEXT);
        fwprintf(f,L"%s\n",L"кошка 日本国");
    }
}

 

#include <codecvt>
#include <fstream>
#include <locale>

int main() {
    if(auto f = std::wofstream("tmp")) {
        f.imbue(std::locale(std::locale(),
                new std::codecvt_utf8_utf16<wchar_t>)); // assumes wchar_t is UTF-16
        f << L"кошка 日本国\n";
    }
}

 

The first example uses wstring_convert from C++11, but any other method of obtaining a UTF-8 encoding works too, e.g. WideCharToMultiByte. The last example uses a C++11 codecvt facet for which there is no built-in, pre-C++11 replacement. The other two examples don't use C++11.
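To illustrate that "any other method" really can be anything that yields the right bytes, here is a minimal hand-written encoder. It is a sketch, not taken from the original answer: the names append_utf8 and to_utf8 are hypothetical, the input is assumed to be UTF-32 code points, and invalid code points are simply skipped.

```cpp
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
// Returns false (appending nothing) for surrogates and values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Encodes a UTF-32 string; invalid code points are skipped (a choice of this sketch).
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

The resulting std::string can be passed to fprintf with "%s" exactly like utf8_string in the first example.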

 

How to Read/Write UTF8 text files in C?

http://stackoverflow.com/questions/21737906/how-to-read-write-utf8-text-files-in-c

Instead of

fprintf(fout,"%c ",character);

 

use

fprintf(fout,"%c",character);

 

The second fprintf() does not contain a space after %c which is what was causing out.txt to display weird characters. The reason is that fgetc() is retrieving a single byte (the same thing as an ASCII character), not a UTF-8 character. Since UTF-8 is also ASCII compatible, it will write English characters to the file just fine.
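The difference between bytes and characters can be made concrete with a small sketch. It assumes well-formed UTF-8 input and performs no validation; the function names is_continuation_byte and count_code_points are hypothetical, not from the original answer.

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Counts characters (code points) in well-formed UTF-8 by counting
// every byte that is not a continuation byte.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if (!is_continuation_byte(b)) ++n;
    }
    return n;
}
```

For pure ASCII text the two counts agree, which is why English text survives byte-at-a-time processing; for text such as "日本国" the byte count is larger than the character count, so a space inserted after every byte lands inside a character.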

putchar(character) output the bytes sequentially without the extra space between every byte so the original UTF-8 sequence remained intact. To see what I'm talking about, try

while((character=fgetc(fin))!=EOF){
    putchar(character);
    printf(" "); // This mimics what you are doing when you write to out.txt
    fprintf(fout,"%c ",character);
}

 

If you want to write UTF-8 characters with the space between them to out.txt, you would need to handle the variable length encoding of a UTF-8 character.

#include <stdio.h>
#include <stdlib.h>

/* The first byte of a UTF-8 character
 * indicates how many bytes are in
 * the character, so only check that
 */
int numberOfBytesInChar(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

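int main(){
    FILE *fin;
    FILE *fout;
    int character;
    fin = fopen("in.txt", "r");
    fout = fopen("out.txt","w");
    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return 1;
    }
    while( (character = fgetc(fin)) != EOF) {
        /* Compute the length once, from the lead byte only */
        int remaining = numberOfBytesInChar((unsigned char)character) - 1;
        /* Copy every byte of the character except the last one */
        for (int i = 0; i < remaining && character != EOF; i++) {
            putchar(character);
            fprintf(fout, "%c", character);
            character = fgetc(fin);
        }
        if (character == EOF) {
            break; /* truncated sequence at end of file */
        }
        /* Write the last byte, followed by the separating space */
        putchar(character);
        printf(" ");
        fprintf(fout, "%c ", character);
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}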

 

UTF-8, CString and CFile? (C++, MFC)

http://stackoverflow.com/questions/2318481/utf-8-cstring-and-cfile-c-mfc

When you output data you need to do (this assumes you are compiling in Unicode mode, which is highly recommended):

CString russianText = L"Привет мир";

CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);

CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));

If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know what code page your input text is in and convert it to something you can use. This example shows working with Russian text that is in UTF-16 format, saving it to UTF-8:

// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

More likely, your Russian text is in some other code page, such as KOI-8R. In that case, you need to convert from the other code page into UTF-16. Then convert the UTF-16 into UTF-8. You cannot convert directly from KOI-8R to UTF-8 using the conversion macros because they always try to convert narrow text to the system code page. So the easy way is to do this:

// Example 2: convert from Russian text in KOI-8R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).
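Because the BOM is optional, code that reads UTF-8 files may still encounter one written by other tools. A minimal sketch of dropping it from text that has already been read into a std::string follows; the function name strip_utf8_bom is hypothetical, and the example uses plain C++ rather than MFC.

```cpp
#include <string>

// Returns `text` without a leading UTF-8 byte order mark (EF BB BF), if present.
std::string strip_utf8_bom(const std::string& text) {
    if (text.compare(0, 3, "\xEF\xBB\xBF") == 0) return text.substr(3);
    return text;
}
```

Text without a BOM, including strings shorter than three bytes, is returned unchanged.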

Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use and how not to use it.

Further information:

  • The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
  • The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it will work whether your code is compiled with the _UNICODE flag or not. You could also use CW2A (where W indicates wide characters).
  • The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
  • Finally, the second parameter to CT2CA indicates the code page you are converting to.

To do the reverse conversion (from UTF-8 to LPCTSTR), you could do:

CString myString(CA2CT(russianText, CP_UTF8));

In this case, we are converting from an "ANSI" string in UTF-8 format, to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).

 

 
