HOWTO: convert Chinese MP3 tags to the ID3 v2.3 standard

The Amarok developers probably barely thought about how Chinese users would respond when they eventually dropped ID3 tag encoding detection and enforced the ID3v2 specification. “Amarok is dead”, claimed a post on linuxfans.org, the community-powered Magic Linux support forum. Why? In China, quite a few MP3 files carry ID3v1 tags encoded in GB2312 and, even worse, some files carry GB2312 text inside ID3 v2.3 tags. What a mess!

I respect their decision: the player has no responsibility to clean up the shit left by lousy encoders. But we still have to face reality. Here is my imperfect life: on Linux I prefer Amarok and occasionally use mpg123 in console mode; on Windows I use foobar2000 and sometimes Windows Media Player; my portable MP3 player is a Creative Zen Micro. No Mac, no iPod. To make things even worse, my Linux locale is UTF-8, while Windows uses UTF-16LE. Last but not least, I do respect the specification.

So ID3v1 is out of the question: it only supports ISO8859-1, which makes it impossible to hold CJK characters. For ID3v2, the most popular version is v2.3, which unfortunately does not support UTF-8. v2.4 does support UTF-8, but hardware manufacturers and application developers have seldom picked it up.
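To see why ISO8859-1 is a dead end for CJK text, here is a small sketch using only the Python standard library. The helper name `can_encode` is my own invention for illustration and is not part of any ID3 library.

```python
def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


title = "中文歌曲"
# ISO8859-1 (latin-1) has no code points for Chinese characters,
# while both UTF-16 and UTF-8 can represent them.
latin1_ok = can_encode(title, "latin-1")
utf16_ok = can_encode(title, "utf-16")
utf8_ok = can_encode(title, "utf-8")
```

Only the last two come out true, which is the whole problem: the one encoding ID3v1 allows cannot represent a Chinese title at all.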

Let’s start with the latest specification, ID3 v2.4:

The first piece of bad news is that the de facto id3v2 implementation, id3v2-0.1.11, does not support v2.4. It cost me several hours to figure out why the newly added v2.4 tags disappeared mysteriously; the answer is that id3v2 cannot even recognize v2.4 tags. eyeD3 is the remedy: this pure Python library provides a very neat command-line utility for manipulating ID3 v2.4 tags. The good news is that the Creative Zen Micro supports v2.4. In fact, I am not quite sure whether the honor goes to Creative Labs or to the libnjb developers.
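When a tool claims a tag does not exist, it helps to check which version is actually in the file. An ID3v2 tag at the start of a file begins with the bytes `ID3` followed by one byte for the major version and one for the revision. The sketch below reads just that; the function name `id3v2_version` is hypothetical, and it only looks at a tag prepended to the file.

```python
def id3v2_version(header):
    """Return (major, revision) from the first bytes of a file, or None.

    An ID3v2 header starts with b"ID3", then a major version byte
    (3 for v2.3, 4 for v2.4) and a revision byte.
    """
    if len(header) < 5 or header[:3] != b"ID3":
        return None
    return header[3], header[4]


# Typical use (foo.mp3 is a placeholder file name):
#     with open("foo.mp3", "rb") as f:
#         print(id3v2_version(f.read(10)))
```

A result of `(4, 0)` means the file carries a v2.4 tag, which a v2.3-only tool may silently ignore.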

The other option is v2.3, the most popular version so far. Unfortunately, it only supports Unicode in little-endian byte order (UTF-16LE, the default in Microsoft Windows), Unicode in big-endian byte order, and Latin-1; there is no UTF-8 support. To make it even worse, id3v2 writes to the tag regardless of the locale, which is really horrible! Here is my effort to address this problem: eyeD3conv. As the name suggests, it depends on the eyeD3 library. This small utility converts wrongly encoded tags to standard UTF-16LE ID3 v2.3 tags.
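The core idea of such a conversion can be sketched with the standard library alone. This is not the code of eyeD3conv and it does not use the eyeD3 API; the function names are made up for illustration. It assumes the tag reader has already decoded the GB2312 bytes as Latin-1, which is what produces the familiar garbage on screen.

```python
import codecs


def fix_mojibake(text, source_encoding="gb2312"):
    """Recover text whose GB2312 bytes were wrongly decoded as Latin-1.

    Heuristic: if the text does not round-trip, return it unchanged.
    """
    try:
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Contains characters outside Latin-1, so it is presumably
        # already proper Unicode.
        return text
    try:
        return raw.decode(source_encoding)
    except UnicodeDecodeError:
        # Not valid GB2312; most likely genuine Latin-1 text.
        return text


def to_id3v23_text(text):
    """Build an ID3 v2.3 text frame payload: the encoding byte 0x01,
    then UTF-16 text with a little-endian byte order mark."""
    return b"\x01" + codecs.BOM_UTF16_LE + text.encode("utf-16-le")


garbled = "中文".encode("gb2312").decode("latin-1")  # what a strict player shows
fixed = fix_mojibake(garbled)
payload = to_id3v23_text(fixed)
```

The fallback to the original text matters: a blind conversion would happily mangle titles that really are Latin-1. Even so, this stays a heuristic, because a few Latin-1 strings also happen to be valid GB2312.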

You also need to apply this patch to fix an encoding bug in eyeD3-0.6.14. The patch has been submitted upstream.

Update: thanks to a quick response from Travis, the author of eyeD3. According to the specification, the URL is supposed to be encoded in ASCII, so we can simply ignore the URLFrame. Forget the patch and use the updated version.

Other mis-encoded frames may throw a UnicodeDecodeError when the frame is read or written, which cancels the subsequent file rename. Here are some pragmatic tips to work around this issue:

# remove all comments
eyeD3 --remove-comments foo.mp3
# remove WXXX frame
eyeD3 --set-text-frame="WXXX:" foo.mp3

No idea which application inserts such crap into the tag.