Implementing the GB2312 Encoding

What is GB2312

GB2312 is an encoding for Chinese characters. It is most widely encoded in EUC-CN form.
In EUC-CN, each GB2312 character is represented by two bytes. The first byte ranges from 0xa1 to 0xf7, the second byte ranges from 0xa1 to 0xfe.
In addition, ASCII characters can combine with GB2312 characters in their usual encoding. Specifically, if we read a byte less than 0x7f, it’s considered as ASCII character. Otherwise it’s the first byte of a gb2312 character, and we expect another byte greater than 0x7f to be the second byte of the character.
GB2312 has 7445 characters in total. So only some combinations of two bytes are valid. I looked up the original document (and a link to pdf file here) from Chinese authority for this information. The mapping between GB2312 and Unicode I used is ‘unicode.org-mappings/EASTASIA/GB/GB2312.TXT’ from here.

About Implementing

No simple conversion formula exists between gb2312 and unicode. So we need a conversion table for all gb2312 characters. The corresponding unicode index is less than 65536.
The time complexity for a big switch depends on the complier and can be O(n), where n is the number of characters. So I use a two-dimensional array for conversion table from gb2312 to unicode, and an array for unicode to gb2312. Thus, I speed up the look-up process and reduce the code complexity at a cost of 288KB(~0.29MB) memory.
To reduce code length, I use 0 for non-existent value in the array for unicode to gb2312, and then change it to GB2312_NULL in a function.
There’re only 7445 GB2312 characters, but their corresponding Unicode values have a range of more than 60000. So, there’re large sections of zeros (non-existent characters) in the mapping from Unicode to GB2312, and it’s necessary to compress them and shift the index in the query function. The four largest sections contain 41131 zeros and this saves memory for 41131 int32(~161KB). The remaining sections are small (333, 316, 295, 246, and 234) so further compressing might not compensate increased time coefficient (if-else branches) of the conversion function.
Take a look at the conversion table here, and also the generating code. gb2312_conversion_table_gen.cpp reads from GB2312.TXT and generates two conversion tables.

There’s a ‘replacement’ feature in the encode function. When the .encode reads a unicode character that is not in gb2312, it puts a given string in the position.

About Testing

Testing file is at rakudo/t/spec/S32-str/gb2312-encode-decode.t, you can find it here at github.
The first test decodes gb2312_sample_in_gb2312.txt. This file contains three subtests. The first two consist of only Chinese characters, while the third one is Chinese characters mixed with ASCII character(English letters, punctuations, numbers). This test ensures the decodestream function works correctly.
The second and third test encodes and decodes a string to check encode and decode function.
The last two check whether ‘replacement’ in encode function works correctly.

Current Progress

The most difficult part in this project is testing. I had an overflow bug (described below). Because I use unsigned int for chars, it was not easy to find out. After that, with a comprehensive test suite Roast, the whole testing process goes easier than I expected.

All three functions – decode, encode, decodestream – work well. The next step is to merge them into MoarVM, Rakudo, and Roast.

Conversion table is updated by using a source file – unicode.org-mappings/EASTASIA/GB/GB2312.TXT from the archive. A few codepoints are different from the original python-generated one.

The conversion table is compressed to remain large sections of zeros. The query function is changed accordingly to shift the indexes.

My PRs: MoarVM, Conversion table update in MoarVM, Rakudo, Roast. (All merged)

Some Takeaways

Be careful with types and uses a bigger one when appropriate. When trying to save a gb2312 codepoint into a single variable, I use ‘256*char_1+char_2’ to compress it. This gives values greater than 40000, which exceeds an int16 type. Using uint16 carefully is fine but an int32 type guarantees safety in this case.

When I cloned from my forked repository and tried to build, it gives error messages. Then I cloned from the original repository and built it successfully. So, I gradually changed files in the original repositories to my modified files, and it worked. The reason is there exists new commits to the original repository after I forked it and I didn’t pull it to my forked one.

For slow (or unstable) internet connections, it would be better to use configure SSL for ‘git clone’ commands to avoid getting disconnected with the server.

Be aware of the balance between memory complexity and time complexity!

Advertisements

留下评论

Fill in your details below or click an icon to log in:

WordPress.com 徽标

You are commenting using your WordPress.com account. Log Out /  更改 )

Google photo

You are commenting using your Google account. Log Out /  更改 )

Twitter picture

You are commenting using your Twitter account. Log Out /  更改 )

Facebook photo

You are commenting using your Facebook account. Log Out /  更改 )

Connecting to %s

在 WordPress.com 上创建您自己的网站
开始
%d 博主赞过: