Building a Cross Script Kashmiri Converter: Issues and Solutions
Aadil Amin Kak, Nazima Mehdi and Aadil Ahmad Lawaye
University of Kashmir

Abstract

Kashmiri is a new entrant in the realm of Natural Language Processing. Efforts in this direction are only now under way, through the development of different NLP tools. This paper describes the development of a converter between the Perso-Arabic and Devanagari scripts. The main focus is on the issues faced while developing the converter and how they were handled.

1.0 Introduction

The Kashmiri language is primarily spoken in the Kashmir province and some parts of the Jammu province of the state of Jammu and Kashmir, and by migrant populations in the rest of India and abroad. Various scripts, such as Sharda, Devanagari, Roman, and Perso-Arabic, have been used for Kashmiri. The earliest script used for writing Kashmiri is the Sharda script, in which a large number of Sanskrit literary works and old Kashmiri works were written; it is now used only by some Kashmiri pundits for writing horoscopes. Presently, the official script of Kashmiri is the modified Perso-Arabic script, with additional diacritic marks to represent Kashmiri-specific sounds. As an alternative, the modified Devanagari script with additional diacritic marks is used by writers and researchers for representing Kashmiri text related to language, literature, and culture in Hindi. In addition to these scripts, the Roman script is also used for writing Kashmiri. The modified Perso-Arabic script is written from right to left. It has two modes: nasakh, the type script, and nastalikh, the handwritten version.

1.1 Need for Developing a Converter

With a converter, no manual work is needed: people who are unfamiliar with either the Perso-Arabic or the Devanagari script can automatically convert text from Perso-Arabic to Devanagari and vice versa. It is especially useful for readers who can use only one of the two scripts and are unaware of the other.
The converter was built by developing rules based on character combinations and character positions (initial, medial and final). There is no one-to-one mapping between the two scripts, i.e.