I’ve written a program for school that goes through a plain text file and builds a concordance for each word. It takes each word, strips non-alphabetical characters from the front and back, and inserts it into a binary search tree. When the text contains Unicode characters, the output shows the seemingly random characters that make up the multibyte character instead of the character itself: for example, “yarns—and,” is output as “yarnsùand.” I spent hours on this months ago and again this week without solving it. How can I read and print these characters correctly?
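For context, the trimming step works roughly like the sketch below. `TrimNonAlpha` is a name made up for this post, not the exact code from my program, and it assumes that only ASCII letters count as alphabetical and that words are held as narrow byte strings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not my actual program code).
// Strips leading and trailing bytes that are not ASCII letters.
// Assumption: only A-Z and a-z count as "alphabetical".
std::string TrimNonAlpha(const std::string& word)
{
    auto isAsciiLetter = [](unsigned char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    };

    // Skip non-letters at the front.
    std::size_t first = 0;
    while (first < word.size() && !isAsciiLetter(word[first]))
        ++first;

    // Skip non-letters at the back, stopping if we meet 'first'.
    std::size_t last = word.size();
    while (last > first && !isAsciiLetter(word[last - 1]))
        --last;

    return word.substr(first, last - first);
}
```

Only the ends are trimmed, so a dash in the middle of a token such as “yarns—and,” survives and ends up in the tree, which is why the garbled character is visible in the output.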
This article seemed relevant: https://www.codeproject.com/Articles/38242/Reading-UTF-8-with-C-streams#mozTocId353176 However, reading UTF-8 should already be a solved problem, so writing my own facet didn’t seem like the right approach, and I haven’t tried it.
Here is an MRE of the bug.
#include <string>
#include <iostream>
#include <fstream>
#include <windows.h>
#include <consoleapi2.h>
using namespace std;
int main()
{
    wfstream file;
    file.open("Example.txt", ios::in);
    // Changes buffer from char to wchar_t
    wchar_t* buffer = new wchar_t[100];
    file.rdbuf()->pubsetbuf(buffer, 100);
    wchar_t CurrentStreamCharacter = file.get();
    wstring NewWord = L"";
    while (file)
    {
        NewWord.push_back(CurrentStreamCharacter);
        CurrentStreamCharacter = file.get();
    }
    //SetConsoleOutputCP(65001);
    wcout << NewWord << endl;
    wcout << "yarns—and even convictions. The Lawyer—the best of old fellows—had,";
    return 0;
}
Here is the text in Example.txt.
yarns—and even convictions. The Lawyer—the best of old fellows—had,
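One thing that might help narrow this down is checking which bytes Example.txt actually contains, since I am only assuming it was saved as UTF-8. Below is a minimal diagnostic sketch; `ReadBytes` and `ToHex` are hypothetical helper names, and the file is read through a narrow stream in binary mode so that no character conversion takes place.

```cpp
#include <fstream>
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: reads the whole file as raw bytes.
// Binary mode plus a narrow stream means no encoding conversion is applied.
// Returns an empty string if the file cannot be opened.
std::string ReadBytes(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: formats each byte as two uppercase hex digits
// followed by a space, e.g. "E2 80 94 ".
std::string ToHex(const std::string& bytes)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (unsigned char b : bytes)
        out << std::setw(2) << static_cast<int>(b) << ' ';
    return out.str();
}
```

Usage would be something like `std::cout << ToHex(ReadBytes("Example.txt"));`. An em dash is the three bytes E2 80 94 in UTF-8 but the single byte 97 in Windows-1252, and byte 0x97 is displayed as ù in code page 437, so the dump should show which encoding the file is really in.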