Problem running test.js saved with UTF-8 encoding #8

Closed
opened 2016-05-26 04:04:03 +00:00 by gxlmyacc · 5 comments
gxlmyacc commented 2016-05-26 04:04:03 +00:00 (Migrated from github.com)

Hi, here is my JS file, which is saved with UTF-8 encoding:

function test(a) {
  println(a);
  return 1;
}
test('哈哈');

I ran BESENShell.exe test.js to execute it, but '哈哈' is not printed.

BeRo1985 commented 2016-05-26 06:43:07 +00:00 (Migrated from github.com)

The BESENShell.exe in the repo is built with Delphi 7 (which dates from August 2002, before the official Unicode support era in Delphi), and BESENShell uses Write/WriteLn, so this behavior is expected.

For UTF-8 output on the Win32 console with older Delphi versions (such as Delphi 7), you need to call SetConsoleOutputCP(CP_UTF8); change the console font to a Unicode-capable one via GetStdHandle(STD_OUTPUT_HANDLE), GetCurrentConsoleFontEx and SetCurrentConsoleFontEx; and then use something like WriteConsoleW(ConsoleHandle,PWideChar(s),Length(s),Written,nil); instead of Write/WriteLn.

For more details, see MSDN.

And since it isn't a BESEN issue per se, I'll close this issue.
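
To illustrate the byte-versus-character distinction behind that advice, here is a minimal Python sketch. Python is used only to inspect the bytes; it is not part of BESEN or Delphi, and the '哈哈' literal is taken from the report purely as sample data. It shows that the same text is 6 bytes as UTF-8 but only 2 UTF-16 code units, and UTF-16 code units are what a wide-character API such as WriteConsoleW counts in.

```python
# The literal from the report; chosen only for illustration.
text = "哈哈"

utf8_bytes = text.encode("utf-8")       # what a UTF-8 source file stores
utf16_bytes = text.encode("utf-16-le")  # what a wide-character API consumes
utf16_units = len(utf16_bytes) // 2     # each UTF-16 code unit here is 2 bytes

print(utf8_bytes.hex(" "))              # e5 93 88 e5 93 88
print(utf16_bytes.hex(" "))             # c8 54 c8 54
print(len(utf8_bytes), utf16_units)     # 6 2
```

A byte-oriented Write/WriteLn only ever sees the first form, which is why the conversion to a wide string has to happen before the console call.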

gxlmyacc commented 2016-05-26 07:03:49 +00:00 (Migrated from github.com)

But if test.js is ANSI-encoded and I use

BESENConvertToUTF8(BESENGetFileContent(filename))
to convert it to UTF-8, then '哈哈' prints.
And the character codes produced by
BESENConvertToUTF8(BESENGetFileContent(filename))
differ from those in a test.js saved with UTF-8 encoding.

gxlmyacc commented 2016-05-26 09:16:01 +00:00 (Migrated from github.com)

And if I use Delphi XE2, '哈哈' is printed as '1t1t'.

BeRo1985 commented 2016-05-26 09:56:38 +00:00 (Migrated from github.com)

BESENGetFileContent is defined as function BESENGetFileContent(fn:TBESENANSISTRING):TBESENANSISTRING; so it returns an ansistring. BESENConvertToUTF8 converts this "ansi" string into a UTF-8 string (still using ansistring as the container, to support older Delphi versions), but with the ANSI codepoints. So either insert a UTF-8 BOM (0xEF 0xBB 0xBF) at the beginning or remove the BESENConvertToUTF8 call, because in the UTF-8 BOM case BESENConvertToUTF8 just does:

if (length(s)>=3) and (s[1]=#$ef) and (s[2]=#$bb) and (s[3]=#$bf) then begin
  // UTF8
  result:=copy(s,4,length(s)-3);
 end else 
....

So you should use BESENUTF8ToUTF16 to convert an ansistring-misused UTF-8 string into a WideString for the Unicode-capable Win32 API, or simply copy the bytes of the ansistring-misused UTF-8 string into a real UTF8String on newer Delphi versions:

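function FakeConvert(const s:ansistring):UTF8String;
begin
 // Copy the raw bytes unchanged; only the declared string type differs.
 SetLength(result,length(s));
 if length(s)>0 then begin // s[1] is not valid for an empty string
  Move(s[1],result[1],length(s));
 end;
end;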

Also have a look at http://stackoverflow.com/questions/26255148/is-writeln-capable-of-supporting-unicode regarding the general Write/WriteLn Unicode problem.

TL;DR: BESEN misuses the ansistring datatype to hold UTF-8 data so that it also works on older Delphi and FreePascal versions, and you must convert it in a raw way to a real Unicode string datatype before printing it to the screen.
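
The two paths described in this comment can be sketched in Python, used here only to make the bytes visible. `convert_to_utf8_sketch` is a hypothetical, simplified stand-in for BESENConvertToUTF8, not its actual code: it models only the UTF-8 BOM branch quoted above plus one fallback, and the fallback (each byte widened to the code point of the same value) is an assumed reading of "with the ansi codepoints". The real function may handle more cases. Treating "ANSI" as code page 936 (GBK) is also an assumption, based on the sample text.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def convert_to_utf8_sketch(raw: bytes) -> bytes:
    """Hypothetical two-branch model of the behavior discussed above."""
    if raw.startswith(UTF8_BOM):
        # Already UTF-8: drop the 3-byte BOM and keep the rest untouched.
        return raw[len(UTF8_BOM):]
    # No BOM (assumed fallback): widen each byte to the code point of the
    # same value (Latin-1 semantics) and encode those code points as UTF-8.
    return raw.decode("latin-1").encode("utf-8")

utf8_file = UTF8_BOM + "哈哈".encode("utf-8")
ansi_file = "哈哈".encode("gbk")  # assumption: "ANSI" means code page 936

print(convert_to_utf8_sketch(utf8_file).hex(" "))  # e5 93 88 e5 93 88
print(convert_to_utf8_sketch(ansi_file).hex(" "))  # c2 b9 c3 be c2 b9 c3 be
```

In this model the BOM path yields real UTF-8 for U+54C8, while the fallback yields UTF-8 for the code points U+00B9 and U+00FE, so the two inputs end up as different strings even though they started as the same text.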

gxlmyacc commented 2016-05-27 02:13:05 +00:00 (Migrated from github.com)

Thanks for your answer, but what I mean is this:
In Delphi 7,
if test.js is ANSI-encoded and I use BESENConvertToUTF8(BESENGetFileContent('test.js')) to convert '哈哈' to UTF-8, the result is '鹿镁鹿镁', and '哈哈' prints.
But if test.js is UTF-8 encoded (with BOM) and I use BESENConvertToUTF8(BESENGetFileContent('test.js')) to convert '哈哈' to UTF-8, the result is '鍝堝搱', and '哈哈' does not print.
So I think the BESENConvertToUTF8 function may have an issue that leads to the wrong encoding?
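
The two strings reported here can be reproduced outside BESEN. The following Python sketch is an illustration only. It assumes that "ANSI" on this machine means code page 936 (GBK), and that for a file without a BOM each byte is widened to the code point of the same value before UTF-8 encoding. Viewing the resulting bytes as GBK, as an ANSI-only debugger or console would, should then give the two strings quoted above.

```python
text = "哈哈"

# Case 1: ANSI (GBK) file, no BOM. Each byte becomes a code point <= 0xFF,
# and those code points are then encoded as UTF-8.
gbk_bytes = text.encode("gbk")
widened = gbk_bytes.decode("latin-1").encode("utf-8")
seen_case1 = widened.decode("gbk")      # UTF-8 bytes viewed as GBK

# Case 2: UTF-8 file with BOM. Once the BOM is stripped, real UTF-8 remains.
utf8_bytes = text.encode("utf-8")
seen_case2 = utf8_bytes.decode("gbk")   # UTF-8 bytes viewed as GBK

print(seen_case1 == "鹿镁鹿镁")
print(seen_case2 == "鍝堝搱")

# Case 1 narrows back to the original GBK bytes; case 2 contains U+54C8,
# which does not fit in a single byte.
restored = widened.decode("utf-8").encode("latin-1")
print(restored == gbk_bytes)
print(max(ord(c) for c in utf8_bytes.decode("utf-8")) > 0xFF)
```

Under these assumptions neither result is corrupted. In case 1 every code point fits in one byte, so narrowing restores the bytes that a GBK console shows as '哈哈', which may be why that case prints. Case 2 holds correct UTF-8 for '哈哈' and would need a Unicode-aware output path instead of Write/WriteLn.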

Reference
BeRo1985/besen#8