Saturday, May 23, 2015

Discarding Arabic Diacritics From Text


To strip diacritics from Arabic text, here is a little piece of Python code that I like a lot.

    import unicodedata

    def strip_diacritics(s):
        return ''.join(filter(lambda c: unicodedata.category(c) != 'Mn', s))

Look how elegant and small it is. Just one line with a lambda expression that checks whether each character is a Unicode nonspacing mark (category 'Mn'), which is how diacritics are classified.

The same job can easily be done with regular expressions in JavaScript. Look at this JavaScript code:

    function stripAccents(text) {
        return text.replace(/[\u064B-\u065F]+/g, '');
    }

It is one line that finds every run of one or more diacritics in the range \u064B to \u065F and replaces it with nothing.

Thanks to lambda and regular expressions. Both are really smart and helpful.



Java and Arabic Support

[THIS IS AN OLD ARTICLE; I am republishing it for the benefit of friends]

Java uses Unicode as its native encoding, so any text is converted to Unicode for proper handling. Java already supports almost all known encodings; see: http://java.sun.com/products/jdk/1.1/docs/guide/intl/encoding.doc.html.

Our concern is how to handle the input and the output. The input comes from request parameters (in the case of web development), files, Properties, and JDBC. The output goes to the browser through the HttpServletResponse object, or to a file, and so on.


Converting text strings:

  • The String class already supports conversion to Unicode; see the String constructors that take an encoding as a parameter.
  • The String class can convert to any encoding; see the getBytes() method that takes an encoding parameter.
  • The content of a String must be Unicode at all times, so it can convert non-Unicode input to Unicode, or give you the non-Unicode bytes on request through getBytes().
  • A running example is available at http://java.sun.com/docs/books/tutorial/i18n/text/string.html
  • You can also use Charset, CharsetDecoder and CharsetEncoder.
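The points above can be sketched in a few lines. This is only a minimal illustration, not code from the original article: the class name EncodingDemo and its methods are made up for the example, and it assumes the JDK in use ships the extended charset "windows-1256" (the modern name for Cp1256).

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
final class EncodingDemo {
    // Assumption: "windows-1256" (Cp1256) is available in this JDK.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Decode raw bytes in a known encoding into a (Unicode) String,
    // using the String constructor that takes a charset.
    static String decode(byte[] raw, Charset source) {
        return new String(raw, source);
    }

    // Encode a String into the bytes of the requested encoding,
    // using the getBytes() variant that takes a charset.
    static byte[] encode(String text, Charset target) {
        return text.getBytes(target);
    }

    // Convert bytes from one encoding to another by going through a String.
    static byte[] transcode(byte[] raw, Charset from, Charset to) {
        return encode(decode(raw, from), to);
    }
}
```

For example, EncodingDemo.transcode(bytes, EncodingDemo.CP1256, StandardCharsets.UTF_8) would turn Cp1256 bytes into UTF-8 bytes. Passing Charset objects instead of encoding names also avoids the checked UnsupportedEncodingException.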

Accessing files: (Input/Output)

You can access files using the normal Java classes, but be aware that if you do not specify an encoding, the file input classes read the system property file.encoding and convert the file content to Unicode based on it.

To find out the system file encoding, read System.getProperty("file.encoding").

So if you are writing I18n (internationalization) applications, you should specify the encoding when you are reading from or writing to files:

Use InputStreamReader and OutputStreamWriter to specify the encoding you want; see Character and Byte Streams at: http://java.sun.com/docs/books/tutorial/i18n/text/stream.html
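As a rough sketch of reading and writing with an explicit encoding, something like the following should work. The class FileEncodingDemo and its method names are hypothetical, and the temporary file is only there to keep the example self-contained.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class FileEncodingDemo {
    // Write text with an explicit encoding instead of relying on file.encoding.
    static void write(Path file, String text, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding it with an explicit encoding.
    static String read(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(Files.newInputStream(file), charset)) {
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    // Write to a temporary file and read it back with the same encoding.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text, charset);
                return read(file, charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

As long as the same charset is used for writing and reading, and it can represent the text, the text should come back unchanged whatever file.encoding happens to be on the machine.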

Request Input Arabic Parameters:

If you pass Arabic text in URLs, the parameter arrives as a Unicode string that was decoded using the default system encoding (Cp1252 here) instead of the real one. You should convert the parameter back to Cp1252 bytes, then decode those bytes to Unicode as Cp1256; see the example:

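    // The parameter was wrongly decoded as Cp1252.
    String aParUnicode = request.getParameter("apar");
    // Get the original bytes back.
    byte[] by1252 = aParUnicode.getBytes("Cp1252");
    // Decode them with the real encoding.
    aParUnicode = new String(by1252, "Cp1256");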
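To see why this round trip works, here is a small self-contained sketch that simulates the wrong decoding without a servlet container. The class ParameterRepairDemo and its methods are hypothetical, and it assumes both "windows-1252" and "windows-1256" are available in the JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
final class ParameterRepairDemo {
    // Assumption: both charsets are available in this JDK.
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Simulate a container that decodes Cp1256 bytes as if they were Cp1252.
    static String misdecode(String arabic) {
        return new String(arabic.getBytes(CP1256), CP1252);
    }

    // Undo the damage: get the original bytes back through the wrong
    // charset, then decode them with the right one.
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), CP1256);
    }
}
```

Note that this only recovers characters whose Cp1256 byte also has a mapping in Cp1252. The common Arabic letters do, but a few byte values are undefined in Cp1252, so characters stored in those positions may not survive the round trip.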

Since the introduction of Servlet 2.3, you can set the encoding of the request before reading any parameter, and then get the parameters as correct Unicode.

request.setCharacterEncoding("Cp1256");

Response and Locale Object:

Each running application already has a Locale object. Server applications should specify the Locale that matches their language; this enables automatic conversion of strings based on that locale (please look at the Locale class). HttpServletResponse already has a method for setting the locale, called setLocale.

By default the Locale is set to en (English), even if your default Windows language is Arabic, so all Unicode will be converted to 8859_1 (Latin1). When you set the Locale to ar (Arabic), all Unicode will be converted to 8859_6 (ISO Latin/Arabic Alphabet) and will be displayed correctly as Arabic (ISO).

ISO8859_6 is better than Cp1256, as it is compatible with Unix/Mac and Windows, not just Windows.

Properties class:

The Properties class is hard coded to read files encoded as 8859_1; any other character should be written as a Unicode escape sequence. You can still write Arabic directly in the values, but it will be converted to Unicode as if it were 8859_1, so you should convert the Unicode back to 8859_1 bytes and then decode those bytes as Cp1256, or whatever the real encoding of the file is.

String Arabic1256 = new String(latin8859_1.getBytes("8859_1"), "Cp1256");

The above example converts the string from Unicode back to 8859_1 bytes and then decodes them as Cp1256; it assumes the properties file was written using ANSI Cp1256.

  • An alternative solution is to use the NetBeans IDE to write the Arabic text. NetBeans automatically converts it to Unicode escape sequences and back, so you always see Arabic in the editor while the properties file always stores Unicode escape sequences.
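Both approaches to Properties files can be tried out with a short sketch. The class PropertiesDemo, its method names, and the "greeting" key used in the usage note are hypothetical. It loads the file content from a byte array to stay self-contained, and it assumes "windows-1256" is available in the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
final class PropertiesDemo {
    // Assumption: "windows-1256" (Cp1256) is available in this JDK.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Load properties from the raw bytes of a file.
    // Properties.load(InputStream) decodes the bytes as ISO 8859-1
    // and expands Unicode escape sequences.
    static Properties load(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Re-interpret a value that was read as ISO 8859-1
    // but was really stored in the file as Cp1256.
    static String fixCp1256(String latin1Value) {
        return new String(latin1Value.getBytes(StandardCharsets.ISO_8859_1), CP1256);
    }
}
```

A file line such as greeting=\u0633\u0644\u0627\u0645 should load as Arabic directly, while a value typed as raw Cp1256 text needs fixCp1256() applied after getProperty().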

JDBC:

Once you have the characters as a String, they have already been converted to Unicode, so if that conversion was wrong you can convert back to the original encoding and then convert to Unicode correctly. Some JDBC drivers, such as the MySQL one, give you the option to set the charset so that the driver makes the proper conversion. The SQL Server driver converts to Unicode correctly based on the database encoding that was selected when the database was created.

Be careful if you are using the JDBC-ODBC bridge with an Access database: the driver usually fails to convert Arabic characters to Unicode if your default Windows language is not Arabic.

References:

Java Tutorial Internationalization:
http://java.sun.com/docs/books/tutorial/i18n/index.html

Good ole' ASCII:
http://czyborra.com/charsets/iso646.html

The ISO 8859 Alphabet Soup:
http://czyborra.com/charsets/iso8859.html

J2SDK 1.4.0 documentation and API doc:
http://java.sun.com/j2se/1.4/docs/guide/intl/index.html

Originally written on 22/5/2002

Python Language

With Python, I write one quarter of the equivalent Java code!!
Java code is a waste of life :)

I think the same applies to C#, if it is not worse than Java.

Python is really a simply designed and agile programming language, IMHO.