Unicode Bangla: a few unmeaning arguments
A Unicode chart stand in Ekushey book fair. — Abu Jar M Akkas
AS I walked around inside the Bangla Academy in the second week of the Ekushey book fair this year, two tall stands readily caught my eye. They stood between the Nazrul Mancha and the building that houses Agrani Bank. Set some distance apart, they had the Unicode character code chart for Bangla, probably in Version 11.0, printed on them from end to end, with a large blue circle at the upper end.
Both circles carry text in Bangla that reads ‘Bangla Unicode Paramarsha Grahan’, which could mean either that visitors leave their advice about Bangla Unicode there or that visitors receive advice on Bangla Unicode from there. As I veered left, I saw a stall with a sign reading ‘Bangla Unicode’ and a subtitle reading ‘Enhancement of Bangla Language in ICT through R&D project’, both in Bangla, with the logos and names of the Information and Communications Technology Division and the Bangladesh Computer Council.
It is customary for such agencies, more so if they are government agencies, to give out advice. But the stall, with two young men sitting inside, was there to receive advice from visitors. When I enquired, they both said that the Unicode Consortium had some problems in the code chart for Bangla and that the government wanted to have them corrected, for which it had launched a signature campaign.
When I asked them about the issues that they had with the code chart for Bangla, they handed me a printed sheet carrying the code chart and explanations. I could see that four code points (characters, or even letters, for the less initiated) were marked red, one was marked yellow and two were marked blue. The two young men were by no means experts in Unicode; they were seated there to request people to sign the campaign.
When I pressed them on the issues, one of them, visibly the senior one, said that the Unicode standard allows people to write three Bangla letters, Da-e-shunya ra, Dha-e-shunya rha and Antastha ya, with an under-dot, which is not the right way to write them. Unicode uses the more pervasive term nukta, which is basically the bindu or shunya that we learnt when we read primers in early childhood, and they think that these three letters should not be written with the nukta. What they wanted to say was that these three are single letters, should be treated as such and need not be composed of two characters: Da with nukta or shunya, Dha with nukta and Ya with shunya.
I told them that I could write the three letters with one keystroke each on the keyboard that I designed. Users of all other keyboard systems that use Unicode also write the letters with single keystrokes. Unicode assigns 09DC, 09DD and 09DF to the three letters and explains them, as alternatives, as 09A1, 09A2 and 09AF each followed by the nukta, 09BC. Earlier it was only a letter-plus-nukta affair, but now there are unique codes for the three canonical characters. Storage and rendering may treat the two forms differently, but during search both formats are treated alike. A search on Google with both formats will prove this. If any software does not do that, it can safely be said that the software is not fully Unicode-compliant. They appeared least bothered.
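The equivalence of the two forms can be checked with a small Python sketch that uses only the standard unicodedata module. The names PAIRS and same_text are illustrative assumptions, and the comparison after normalisation only stands in for what Unicode-aware software might do; it does not describe how any particular search engine works.

```python
import unicodedata

# Each letter as a single code point, and as base letter followed by the nukta.
PAIRS = {
    "\u09dc": "\u09a1\u09bc",  # Da-e-shunya ra
    "\u09dd": "\u09a2\u09bc",  # Dha-e-shunya rha
    "\u09df": "\u09af\u09bc",  # Antastha ya
}

def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalisation (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

for single, sequence in PAIRS.items():
    # The raw strings differ, but they compare equal once normalised.
    print(unicodedata.name(single), single == sequence, same_text(single, sequence))
```

A plain comparison of the raw strings fails for each pair, while the normalised comparison succeeds, which is the behaviour one would expect from software that treats both formats alike.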
Apart from the three extended Bangla letters, the other code point in red was the nukta, the dot below Da-e-shunya ra, Dha-e-shunya rha and Antastha ya. The sheet of paper distributed to visitors says that there is no place for the nukta in the Bangla script system and that it should be removed from the chart. But even if the three Bangla consonants can be written with single keystrokes by calling single code points, which is very much the case, there is a strong reason for the nukta to be used in the script system, at least in its deep structure, for the extension of letters. There may be reservations about the title of the section, ‘Additional consonants’, that describes the three consonants in the explanatory notes given after the chart proper. But the three letters are, in fact, the latest additions to the Bangla alphabet, which has historically been modelled on the Devanagari alphabet. The three letters had not even been properly normalised during the initial days of Bangla printing, which started developing in the last quarter of the eighteenth century or the early years of the nineteenth century. They were additionally constructed and then came to be normalised in the alphabet in the course of time.
Buddhadeb Bose devised two characters with one dot and two dots below the ‘ja’ letter: with a single dot to signify the sound of the English letter ‘z’ and with two dots to represent the sound of ‘zha-i-farsi’, which is found in the English word ‘pleasure’, French word ‘je’, meaning I, and the Russian word ‘zhizn’, meaning life. The instances can be found in the translation of Doctor Zhivago, by Minakshi Datta and Manabendra Bandyopadhyay, published in 1960; in Buddhadeb Bose’s translation of Charles Baudelaire’s poems, published in a collection in 1961; and in Hayat Mamud’s Gerasim Stepanovich Lebedev, published in 1985, to name a few. The instance is also there in Buddhadeb Bose’s Japani Journal, published in 1962. Much earlier, John Mendies in his Companion to Johnson’s Dictionary: Bengali and English, published in 1851, used two dots below the letter ‘ja’ to signify the sound that Buddhadeb Bose meant with a single dot.
The use of the under-dot, or dot below, is not rare in dictionaries and scholarly works; it has been profusely used by Jnanendramohan Bandyopadhyay in his benchmark Bangla dictionary Bangala Bhashar Abhidhan. Of late, the Desh magazine, published from Kolkata, and some other publishing houses also almost regularly use ‘ja’ with a dot below to represent the sound of the English letter ‘z’. Desh started using this sparingly in the second half of the 1990s, but increased the frequency of its use by the turn of the century; it has now become almost a regular phenomenon, with a spillover effect on the publications of other houses. The first use of the extended ‘ja’ that I can remember was in a letter published in Desh in 1997; it was about the correct pronunciation of Père Lachaise (I cannot remember if it was the cemetery or the Paris Métro station), which was written with a dot below the letter ‘ja’. A further extension to the Bengali alphabet for the transliteration of other languages may very well call for a more frequent use of the method in the script system, though in the scholarly world. Efforts to dispense with it appear merely to be a futile exercise.
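How the nukta extends a letter in encoded text can be seen in a short Python sketch. It assumes that the dotted ‘ja’ described above is encoded as the letter ja (099C) followed by the nukta (09BC); the variable names are illustrative only.

```python
import unicodedata

JA = "\u099c"     # BENGALI LETTER JA
NUKTA = "\u09bc"  # BENGALI SIGN NUKTA

# 'ja' with a dot below, written as base letter plus combining nukta.
extended_ja = JA + NUKTA

# There is no single code point for this combination, so normalisation
# keeps it as two code points.
composed = unicodedata.normalize("NFC", extended_ja)
print([unicodedata.name(ch) for ch in composed])
```

Without the nukta in the block, such an extended letter would have no encoded representation at all unless a new code point were added for every new letter.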
The code point in yellow in the chart was 09E4, which is as yet unassigned, or reserved, in the Bangla code block. The people in the Unicode stall and others trying to throw their weight behind the project’s view think that there should be a ‘danri’, or ‘danda’, which is the equivalent of the period or full stop in English, at that point. The reason for the demand is that while the Devanagari code block has a ‘danda’ character, there is no ‘danda’ or ‘danri’ for Bangla. The consortium sets out that all other Indic scripts, which follow the Devanagari alphabet scheme, should use the ‘danda’ from the Devanagari block. This is also the case with ‘dui danri’, or double ‘danda’. A few programs do not do so, but this appears to be an issue with the programs and software developers, not with the Unicode scheme. But then, the use of the ‘danri’ from the Devanagari code block happens only in the deep structure; the actual fonts that software uses do have a ‘danda’, designed and crafted in keeping with the shapes, weights and forms of other Bangla letters, just as a Unicode-compliant font keeps several hundred conjunct, or conjoined, letters.
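The status of these code points can be inspected with Python’s unicodedata module, as in the sketch below. The helper name describe and the label for unassigned points are assumptions made for illustration, and the result reflects the Unicode database bundled with the Python release in use.

```python
import unicodedata

def describe(code_point: int) -> str:
    """Return the character name, or a label if the point is unassigned."""
    return unicodedata.name(chr(code_point), "<unassigned>")

# Danda and double danda in Devanagari, and the matching slots in Bangla.
for cp in (0x0964, 0x0965, 0x09E4, 0x09E5):
    print(f"U+{cp:04X}", describe(cp), unicodedata.category(chr(cp)))
```

The two Devanagari points carry names, while the corresponding points in the Bangla block report the category Cn, which marks an unassigned code point.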
It is still not clear if the reservation about using the ‘danda’ from the Devanagari code block, or any consequent resentment, as the project tries to put it, stems from a necessity that is not adequately attended to in the Unicode scheme or from a form of national pride in having a ‘danda’ of Bangla’s own. Is the resentment specific to the speakers of Bangla in Bangladesh or shared by speakers of Bangla across the world? Unicode, which came out in its first version in 1991, has come a long way, now with billions of Bangla words forming a huge unattended corpus online or in other digital forms.
No one has so far claimed, with logic, to have faced issues with writing the Bangla full stop, which is the ‘danri’, in all these years. Besides, text written in Bangla Unicode draws all other punctuation marks, such as the comma, semicolon, colon, quotation marks, hyphens, dashes and ellipsis, from the Basic Latin block, which is meant for languages that use Roman letters, including English, or from the shared General Punctuation block. If text written in Bangla can use punctuation marks from the Basic Latin block, there should be no problem for it to use the ‘danda’ from the Devanagari block. Even if there were a ‘danda’ for Bangla in the code block, text would still need to take the ‘dui danri’, or double ‘danda’, from the Devanagari code block if all texts of Bangla literature from the mediaeval to the early modern age were to be digitised. If the double ‘danda’, which is scarce in modern text except for embellishment, can be drawn from the Devanagari block, the single ‘danda’ can also be so drawn. Users, in both cases, remain unaware of where the ‘dandas’ are coming from and how.
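The mix of blocks in ordinary Bangla text can be illustrated with a Python sketch. The block ranges are taken from the Unicode code charts; the helper names and the sample sentence are assumptions made for illustration. Dashes and the ellipsis, for instance, fall in the General Punctuation block rather than in Basic Latin.

```python
# Block ranges as given in the Unicode code charts.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x2000, 0x206F, "General Punctuation"),
]

def block_of(ch: str) -> str:
    """Return the name of the block a character belongs to."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "Other"

def blocks_used(text: str) -> list:
    """List the blocks that the non-space characters of a text come from."""
    return sorted({block_of(ch) for ch in text if not ch.isspace()})

# A sample sentence with an ASCII comma and a closing danda (U+0964).
sample = "আমি ভাত খাই, তুমি খাও" + "\u0964"
print(blocks_used(sample))
```

A single short sentence already draws on three blocks, and nothing in the typed or rendered text gives the user any sign of it.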
The sheet of printed paper given to visitors at the Unicode stall in the book fair had two code points in blue: the old Bangla taka mark and the taka sign. But the explanation to the chart says that they are the rupee mark and the rupee sign. The word ‘rupee’, with its Hindi variants ‘rupiya’ or ‘rupya’ and Urdu variant ‘rupaya’, derives from Sanskrit ‘rupyaka’, meaning silver coin, which has come from Sanskrit ‘rupya’, meaning wrought gold or silver. Emperor Sher Shah Suri, who ruled the Suri empire in the first half of the 1540s, introduced the coin, and its name later became common for the currencies of areas of the subcontinent and its surroundings. But in the changed context of modern times, the word came to signify the currency of both India and Pakistan. There may be reservations about the word on this point, as the whole of Bengal, even to this day, uses the word ‘taka’ to denote the currency. The nomenclature could, and perhaps should, be changed, but the essence, the mark and the sign intended, still remains the same.
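The names that the chart gives to the two code points can be read from Python’s unicodedata module. The sketch assumes that the two points marked in blue are 09F2 and 09F3, the rupee mark and rupee sign of the Bangla block.

```python
import unicodedata

# The two currency code points in the Bangla block.
for cp in (0x09F2, 0x09F3):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))
```

Both names carry the word ‘rupee’, although the characters themselves are what Bangla text uses for the taka.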
All this renders largely meaningless the efforts to run a signature campaign for the Unicode Consortium to change its policy: to drop the nukta from the block, to incorporate the three consonants at hand as canonical characters when they already are so, each with a single alternative, and to include a ‘danri’ when the block makes no provision for other punctuation marks.
Inside the stall at the book fair stands a wall with a large poster mounted on it, saying that work on the development of 16 pieces of software is under way. Another large poster mounted beside it names them: spell-checking and grammar correction, optical character recognition, the conversion of handwriting into text, machine translation, a treebank, the national keyboard, keyboards for languages of national minorities, a speech-to-text engine, a text-to-speech engine, software for people with disabilities, a screen reader, Bangla Unicode, a Bangla styleguide, sentiment analysis, IPA, and a font converter. The list names a bunch of initiatives, a few of which are still considered unnecessary.
People are badly in need of software for spelling and grammar correction, which could further writing in general to a large extent; optical character recognition, which would help in the digitisation of a huge number of books considered important but no longer accessible; and machine translation, which could automate, or at least help in automating, many urgent tasks. Engines for speech-to-text and text-to-speech, a screen reader and other software for people with disabilities could greatly advance the cause of such people. A Bangla styleguide could keep the preparation of all public, and even private, documents in sync with a preferred set of rules. A treebank, which is a parsed text corpus, and software for sentiment analysis could help further linguistic research on syntactic structures.
Keyboards for languages of national minorities would be of utmost importance if they can be developed on scientific principles. They would speed up typing in the languages and greatly help the preparation of texts in them. But what sounds odd in this is the development of a national keyboard. A national standard keyboard layout, National Keyboard, designed by the Bangladesh Computer Council, has been around since 1993. But it could hardly gain currency among ordinary people and probably not even among people employed in government agencies. It was a fixed keyboard layout, but was hardly designed to be efficient. It was focused more on ease of use in terms of knowing how to type than on how fast text can be typed. The Bangla computing scene is still left wanting an efficient keyboard layout, fixed, designed on scientific principles and tailored to the way the Unicode input mechanism demands.
The project, as the list suggests, aims at developing an engine for Bangla font interportability. The name suggests the use of one set of fonts in programs that do not support them. The government, with the help of the Bangla Academy and graphic design students of the University of Dhaka, released a typeface named Shapla, under a series called Amar Barnamala, in 2013; no new member of the series has turned up since. Shapla was hardly graceful as a text typeface, and nothing further has been heard of the initiative. The government, or the Election Commission for that matter, in 2008 released a typeface called Nikosh, which expands to Nirbachan Commission Secretariat in Bangla. But that too was not graceful enough for use as a text typeface. Efforts to develop an engine to make fonts interportable across software (or platform?) could raise hopes, but such efforts would most likely mean the re-use of already developed typefaces, most of which are neither graceful nor aesthetic. What is badly needed is a bunch of fresh typefaces specifically designed to be used as text types. The metal types that were once cast had also been lost completely by the turn of the century, and no type foundries are now in operation. It is imperative that those faces should be digitally replicated.
Bangla computing in Bangladesh started marching forward in the late 1980s; it did so in West Bengal a little later. But everything related to Bangla computing is still in its infancy, with processes, private and public, having stumbled more often than expected in all these years. There were efforts that failed to work out properly. There were hopes that were only dashed as the days passed. It should not have been so.
Abu Jar M Akkas is deputy editor at New Age.