\documentstyle{article}
\addtolength{\textwidth}{1.5cm}
\addtolength{\textheight}{1.5cm}
\input tlsyllable
\newcommand{\bs}{$\backslash$}
\title{TeluguTeX\footnote{\copyright 1991\ \ Lakshmi V. S. Mukkavilli}}
\author{Lakshmi V. S. Mukkavilli}
\date{}
\begin{document}
\maketitle
%\tableofcontents
This article has seven sections.
The first section describes the Telugu script; it should help someone who
already speaks Telugu to learn the script.
Most computers provide no facility for entering Telugu
text, so in the second section we propose a romanization scheme for
inputting Telugu.
The third section gives examples of Telugu text typeset using our system.
The fourth section explains the problems involved in typesetting
Telugu text.
The fifth section deals with implementation, and the sixth section contains
samples of Telugu text typeset in various sizes and styles.
The last section summarizes our work.
\section{Telugu Script}
Like English, Telugu is written from left to right. Telugu text consists of
sentences. A sentence is a sequence of words. A word is a sequence
of syllables called Aksharas. An Akshara can be of two types.
\begin{enumerate}\label{syldef}
\item a stand-alone vowel
\item Consonant + Consonant +\ldots+Consonant+Vowel
\end{enumerate}
Let $C_1$ be the first consonant, $C_2$ be the second consonant and
$C_n$ be the $n^{th}$ consonant. Let the vowel be denoted by $V$.
An Akshara is composed as follows: $C_1$ is considered the base
consonant. $C_2$ to $C_n$ are called consonant conjuncts. Consonant
conjuncts are optional. The Vowel modifies the base consonant.
The vowel modifier may be absent. Such a situation is indicated
by a symbol called {\em pollu} or {\em halant} ({\tlc\QQQ\zzJJ X}).
When this symbol is attached to a consonant we get what is called
the half consonant or the pure consonant (so called because
of the absence of any vowel sound).
The form of Base Consonant + Vowel (henceforth referred to as C+V)
varies depending on both
the consonant and the vowel. The vowel modifiers can appear
to the right, on the top or at the bottom of the base consonant.
In many cases a completely different symbol is needed. After
Consonant + Vowel is formed, consonant conjuncts are attached
in the order they appear in the syllable. Consonant Conjuncts
are placed to the right or below C+V.
Consonant conjuncts have a form different from both the base consonant
and the C+V forms.
An Akshara can optionally have accent(s). Accents could appear
at the top or on the side or at the bottom of a syllable.
Table~\ref{t31} lists stand-alone vowels and corresponding modifiers.
The blob of ink stands for the consonant base. The table indicates the
placement of vowel modifiers in relation to the base consonant, but
one should keep in mind that there are many exceptions.
\begin{table}
\centering
{\tlc
\begin{tabular}{|lll|}\hline
\rm Vowel&\rm Vowel&\rm Vowel\\
&\rm (in roman)&\rm Modifier\\ \hline
\QQQ a&\rm a&{\QQQ\zzJJ a}\\
\QQQ A&\rm A&{\QQQ\zzJJ A}\\
\QQQ i&\rm i&{\QQQ\zzJJ i}\\
\QQQ I&\rm I&{\QQQ\zzJJ I}\\
\QQQ u&\rm u&{\QQQ\zzJJ u}\\
\QQQ U&\rm U&{\QQQ\zzJJ U}\\
\QQQ rx&\rm ro&{\QQQ\zzJJ rx}\\
\QQQ Rx&\rm Roo&{\QQQ\zzJJ Rx}\\
\QQQ lx&\rm lo&{\QQQ\zzJJ lx}\\
\QQQ Lx&\rm Loo&{\QQQ\zzJJ Lx}\\
\QQQ e&\rm e&{\QQQ\zzJJ e}\\
\QQQ E&\rm E&{\QQQ\zzJJ E}\\
\QQQ y&\rm y&{\QQQ\zzJJ y}\\
\QQQ o&\rm o&{\QQQ\zzJJ o}\\
\QQQ O&\rm O&{\QQQ\zzJJ O}\\
\QQQ ow&\rm ow&{\QQQ\zzJJ ow}\\ \hline
\end{tabular}
}
\caption{Vowels in Telugu\label{t31}}
\end{table}
Table~\ref{t32} presents various forms of consonants. In the first column
consonants appear in pure form. In the second column consonants
appear with an implicit {\tlc\QQQ a}. The last column indicates
the shape of consonant conjuncts and the positioning of consonant
conjuncts. One may recall that a syllable can have several consonant
conjuncts. There is no consonant conjunct form for {\tlc\QQQ Xha}.
\begin{table}
\centering
{\tlc
\begin{tabular}{|llll|}\hline
\rm Consonant&\rm Consonant&\rm Consonant&\rm Consonant\\
\rm (without ``a''&\rm (with ``a''&\rm (in roman)&\rm Conjunct\\
\rm modifier)&\rm modifier)&&\\ \hline
\QQQ kX&\QQQ ka&\rm k&{\QQQ \zzJJ ka}\\
\QQQ khX&\QQQ kha&\rm kh&{\QQQ \zzJJ kha}\\
\QQQ gX&\QQQ ga&\rm g&{\QQQ \zzJJ ga}\\
\QQQ ghX&\QQQ gha&\rm gh&{\QQQ \zzJJ gha}\\
\QQQ NGX&\QQQ NGa&\rm NG&{\QQQ \zzJJ NGa}\\
\QQQ cX&\QQQ ca&\rm c&{\QQQ \zzJJ ca}\\
\QQQ chX&\QQQ cha&\rm ch&{\QQQ \zzJJ cha}\\
\QQQ jX&\QQQ ja&\rm j&{\QQQ \zzJJ ja}\\
\QQQ jhX&\QQQ jha&\rm jh&{\QQQ \zzJJ jha}\\
\QQQ nxX&\QQQ nxa&\rm nx&{\QQQ \zzJJ nxa}\\
\QQQ TX&\QQQ Ta&\rm T&{\QQQ \zzJJ Ta}\\
\QQQ ThX&\QQQ Tha&\rm Th&{\QQQ \zzJJ Tha}\\
\QQQ DX&\QQQ Da&\rm D&{\QQQ \zzJJ Da}\\
\QQQ DhX&\QQQ Dha&\rm Dh&{\QQQ \zzJJ Dha}\\
\QQQ NX&\QQQ Na&\rm N&{\QQQ \zzJJ Na}\\
\QQQ tX&\QQQ ta&\rm t&{\QQQ \zzJJ ta}\\
\QQQ thX&\QQQ tha&\rm th&{\QQQ \zzJJ tha}\\
\QQQ dX&\QQQ da&\rm d&{\QQQ \zzJJ da}\\
\QQQ dhX&\QQQ dha&\rm dh&{\QQQ \zzJJ dha}\\
\QQQ nX&\QQQ na&\rm n&{\QQQ \zzJJ na}\\
\QQQ pX&\QQQ pa&\rm p&{\QQQ \zzJJ pa}\\
\QQQ phX&\QQQ pha&\rm ph&{\QQQ \zzJJ pha}\\
\QQQ bX&\QQQ ba&\rm b&{\QQQ \zzJJ ba}\\
\QQQ bhX&\QQQ bha&\rm bh&{\QQQ \zzJJ bha}\\
\QQQ mX&\QQQ ma&\rm m&{\QQQ \zzJJ ma}\\
\QQQ YX&\QQQ Ya&\rm Y&{\QQQ \zzJJ Ya}\\
\QQQ rX&\QQQ ra&\rm r&{\QQQ \zzJJ ra}\\
\QQQ RX&\QQQ Ra&\rm R&{\QQQ \zzJJ Ra}\\
\QQQ lX&\QQQ la&\rm l&{\QQQ \zzJJ la}\\
\QQQ LX&\QQQ La&\rm L&{\QQQ \zzJJ La}\\
\QQQ vX&\QQQ va&\rm v&{\QQQ \zzJJ va}\\
\QQQ SX&\QQQ Sa&\rm S&{\QQQ \zzJJ Sa}\\
\QQQ ShX&\QQQ Sha&\rm Sh&{\QQQ \zzJJ Sha}\\
\QQQ sX&\QQQ sa&\rm s&{\QQQ \zzJJ sa}\\
\QQQ HX&\QQQ Ha&\rm H&{\QQQ \zzJJ Ha}\\
\QQQ XhX&\QQQ Xha&\rm Xh& \\ \hline
\end{tabular}
}
\caption{Consonants in Telugu\label{t32}}
\end{table}
Table~\ref{t33} presents the accents used in Telugu text. The first column
indicates the shape of accents. The second column indicates where the
accents are placed relative to the syllable. The accents that go at the
top or at the bottom appear over/below the base consonant. A syllable can
have multiple accents. Many of the accents presented in the table do not
appear in ordinary Telugu text. They occur mostly in Sanskrit text
transliterated in Telugu.
There are two symbols ({\tlc \zzFB\ \& \zzFC\ }) that are
used as sentence/line delimiters in poetry
and in Sanskrit text.
Table~\ref{t38} displays all the symbols in the font. The font is available
in several styles and sizes.
All the symbols that also exist in the ASCII character set occupy the same
positions as in ASCII. We chose the codes for the non-ASCII symbols ourselves
because there are no established standards.
These codes can nevertheless be changed. All the codes are
referred to by variable (symbolic) names, and all these variable
assignments are
placed in one file. If a different code assignment is desired,
it is a simple matter of changing the numbers in this file and
rerunning the METAFONT programs. As explained later,
when we generate a font, a file containing character codes
(among other things) is generated.
Text composition software and \TeX\ read character codes from this file.
In summary, if we want to change the codes for symbols in the font, we
need to do that in one place and our implementation propagates the
changes across the system.
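To make the idea concrete, here is a minimal sketch of what a few entries
of such a generated code file might look like on the \TeX\ side. The names
\bs tlcodeka, \bs tlcodekha and \bs tlputka and the octal positions are
assumptions made for this illustration; they are not taken from the
actual files.
\begin{verbatim}
% Hypothetical excerpt of a generated code file (names and
% positions are assumed, not those of the real font).
\chardef\tlcodeka='200    % symbol used for the consonant ka
\chardef\tlcodekha='201   % symbol used for the consonant kha
% Composition macros refer only to the symbolic names:
\def\tlputka{\char\tlcodeka}
\end{verbatim}
Because only symbolic names appear in the composition macros, regenerating
such a file after a change of codes would be enough to keep the font and
the macros in step.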
\begin{table}
\centering
{\tlc
\begin{tabular}{||c||c|c|c|c|c|c|c|c||}\hline\hline
\openup0.2ex
& '0 & '1 & '2 & '3 & '4 & '5 & '6 & '7 \\ \hline\hline
'00\_ & \char'000 & \char'001 & \char'002 & \char'003 & \char'004 & \char'005 &
\char'006 & \char'007 \\ \hline
'01\_ & \char'010 & \char'011 & \char'012 & \char'013 & \char'014 & \char'015 &
\char'016 & \char'017 \\ \hline
'02\_ & \char'020 & \char'021 & \char'022 & \char'023 & \char'024 & \char'025 &
\char'026 & \char'027 \\ \hline
'03\_ & \char'030 & \char'031 & \char'032 & \char'033 & \char'034 & \char'035 &
\char'036 & \char'037 \\ \hline
'04\_ & \char'040 & \char'041 & \char'042 & \char'043 & \char'044 & \char'045 &
\char'046 & \char'047 \\ \hline
'05\_ & \char'050 & \char'051 & \char'052 & \char'053 & \char'054 & \char'055 &
\char'056 & \char'057 \\ \hline
'06\_ & \char'060 & \char'061 & \char'062 & \char'063 & \char'064 & \char'065 &
\char'066 & \char'067 \\ \hline
'07\_ & \char'070 & \char'071 & \char'072 & \char'073 & \char'074 & \char'075 &
\char'076 & \char'077 \\ \hline
'10\_ & \char'100 & \char'101 & \char'102 & \char'103 & \char'104 & \char'105 &
\char'106 & \char'107 \\ \hline
'11\_ & \char'110 & \char'111 & \char'112 & \char'113 & \char'114 & \char'115 &
\char'116 & \char'117 \\ \hline
'12\_ & \char'120 & \char'121 & \char'122 & \char'123 & \char'124 & \char'125 &
\char'126 & \char'127 \\ \hline
'13\_ & \char'130 & \char'131 & \char'132 & \char'133 & \char'134 & \char'135 &
\char'136 & \char'137 \\ \hline
'14\_ & \char'140 & \char'141 & \char'142 & \char'143 & \char'144 & \char'145 &
\char'146 & \char'147 \\ \hline
'15\_ & \char'150 & \char'151 & \char'152 & \char'153 & \char'154 & \char'155 &
\char'156 & \char'157 \\ \hline
'16\_ & \char'160 & \char'161 & \char'162 & \char'163 & \char'164 & \char'165 &
\char'166 & \char'167 \\ \hline
'17\_ & \char'170 & \char'171 & \char'172 & \char'173 & \char'174 & \char'175 &
\char'176 & \char'177 \\ \hline
'20\_ & \char'200 & \char'201 & \char'202 & \char'203 & \char'204 & \char'205 &
\char'206 & \char'207 \\ \hline
'21\_ & \char'210 & \char'211 & \char'212 & \char'213 & \char'214 & \char'215 &
\char'216 & \char'217 \\ \hline
'22\_ & \char'220 & \char'221 & \char'222 & \char'223 & \char'224 & \char'225 &
\char'226 & \char'227 \\ \hline
'23\_ & \char'230 & \char'231 & \char'232 & \char'233 & \char'234 & \char'235 &
\char'236 & \char'237 \\ \hline
'24\_ & \char'240 & \char'241 & \char'242 & \char'243 & \char'244 & \char'245 &
\char'246 & \char'247 \\ \hline
'25\_ & \char'250 & \char'251 & \char'252 & \char'253 & \char'254 & \char'255 &
\char'256 & \char'257 \\ \hline
'26\_ & \char'260 & \char'261 & \char'262 & \char'263 & \char'264 & \char'265 &
\char'266 & \char'267 \\ \hline
'27\_ & \char'270 & \char'271 & \char'272 & \char'273 & \char'274 & \char'275 &
\char'276 & \char'277 \\ \hline
'30\_ & \char'300 & \char'301 & \char'302 & \char'303 & \char'304 & \char'305 &
\char'306 & \char'307 \\ \hline
'31\_ & \char'310 & \char'311 & \char'312 & \char'313 & \char'314 & \char'315 &
\char'316 & \char'317 \\ \hline
'32\_ & \char'320 & \char'321 & \char'322 & \char'323 & \char'324 & \char'325 &
\char'326 & \char'327 \\ \hline
'33\_ & \char'330 & \char'331 & \char'332 & \char'333 & \char'334 & \char'335 &
\char'336 & \char'337 \\ \hline
'34\_ & \char'340 & \char'341 & \char'342 & \char'343 & \char'344 & \char'345 &
\char'346 & \char'347 \\ \hline
'35\_ & \char'350 & \char'351 & \char'352 & \char'353 & \char'354 & \char'355 &
\char'356 & \char'357 \\ \hline
'36\_ & \char'360 & \char'361 & \char'362 & \char'363 & \char'364 & \char'365 &
\char'366 & \char'367 \\ \hline
'37\_ & \char'370 & \char'371 & \char'372 & \char'373 & \char'374 & \char'375 &
\char'376 & \char'377 \\ \hline
\hline
\end{tabular} }
\caption{\label{t38}Font Table}
\end{table}
\section{Transliteration of Telugu Text\label{c2}}
Since most computers are designed to work with English input and have no
facilities for entering Telugu text, we have developed a scheme
for inputting Telugu text in English.
Typically, text entry for Indian languages (most of which have a similar
phonetic structure) is done in one of two ways: the graphical
approach and the phonetic approach. We do not
provide a detailed discussion of these approaches because they
are widely debated and documented. In the graphical approach,
each syllable is viewed as a collage of various symbols. Keystrokes
are used to place constituent symbols on the screen. This method
is clumsy, inefficient and language specific.
This is particularly true of Telugu. In the phonetic
approach, text is entered the way we speak. We do not worry about
composition at all; that is left to the computer.
In other words, we key in the constituent consonants and vowel of the
syllables that we speak. This approach to inputting Indian (Asian) languages
is elegant, simple and language independent.
We have chosen the phonetic approach to inputting Telugu text. This approach
makes life easy for the user, but it leaves considerable work for the
software developer.
We slightly modified the transliteration scheme for inputting
Telugu proposed by Prof. Donald Becker
at the University of Wisconsin, Madison.
The modifications result mainly from our decision not to use any non-alphabetic
characters in the scheme
and from the inclusion of many new characters.
Table~\ref{t35} presents the transliteration system used by us for inputting
Telugu text. The English letters are chosen so that they are close
to their Telugu equivalents in pronunciation.
But one should remember that English
is not a phonetic language and hence the same letter can be pronounced
in several ways.
No transliteration is provided for some rarely used symbols. They can
be obtained by using control sequences given in the table.
In the next section we provide several examples of Telugu text.
In the transliteration scheme, some letters require inputting two
or more characters. But this does not mean that multiple
keystrokes are really needed. Most editors/word processors
provide facilities for keyboard macros. We can define macros with
one letter names. In Telugu, we have 36 consonants and 16 vowels.
That means a total of 52 letters. We can remap the alphabet part
of a keyboard to facilitate Telugu input.
One can buy a keyboard overlay (or skin) and mark the keytops
with Telugu letters. Thus we can have a Telugu keyboard (almost!).
One can map control characters to accents. By using a pair of macros we
can switch between Hindu-Arabic and Telugu digits. So regardless
of which digits one wants one would use the usual numeric keys.
\TeX\ ignores spaces after control words. So if a word ends in a control
word then we should type a control space
(the escape character followed by a blank space) following the control word.
This rule is useful for us particularly when we are entering Sanskrit
text in Telugu. Many words in Sanskrit end in control sequences denoting
various accents.
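The following input, taken from the first line of Figure~\ref{egt1},
shows both cases. Inside a word the blank after \bs zzFJ merely ends the
control word; at the end of a word the control space supplies the gap
between words. The comments are ours.
\begin{verbatim}
Sa\zzFJ taM          % one word: the blank only ends \zzFJ
radO\zzFJ\ vardha    % two words: the control space separates them
\end{verbatim}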
Use the macro \bs QQQ to enable transliteration and the macro \bs Q to
disable transliteration. Use the macro \bs zzCB to switch to roman digits (this
is the default) and the macro \bs zzBC to switch to Telugu digits.
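For instance, the fragment below selects the Telugu font, turns
transliteration on and typesets one group of digits as Telugu digits.
It is a sketch: the word {\tt telugu} is simply a sample input, the digit
group is the one used in Figure~\ref{egt1}, and we assume that the effect
of \bs zzBC is confined to the enclosing group, as its use in that figure
suggests.
\begin{verbatim}
{\tlc\QQQ          % Telugu font, transliteration on
 telugu            % entered phonetically
 {\zzBC 8-8-20}    % digits in this group are Telugu digits
 8-8-20            % roman digits again (the default)
}
\end{verbatim}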
What if somebody does not like the scheme that we have used? This is
really no problem, since the software that interprets the user's input
is separate from text composition software. So it is easy to
adapt to another unambiguous transliteration scheme.
\section{Examples}
In this section we will give several examples of Telugu input/output. We use (,)
to delimit syllables or Aksharas.
We will also provide transliteration of each example.
Table~\ref{t39} contains several examples of transliteration of Telugu
words using the proposed transliteration scheme.
\begin{figure}
{\tlc\QQQ
Hariah OmX\zzFC\ Sa\zzFJ taM jI\zzBFva Sa\zzFJ radO\zzFJ\ vardha\zzBF mAnaSSa\zzFJ taM
HE\zzBFma\zzFJ ntAnxcha\zzFJtamu\zzBF\ vasa\zzFJ ntA\zzCF\zzFB\
Sa\zzFJ tami\zzBF ndhrA\zzFJ gnI na\zzBF vi\zzFJ tA brx\zzBF Ha\zzFJ npatiSSa\zzFJ\
tAYu\zzBF ShA Ha\zzFJ viShE\zzFJ maM puna\zzBF rduah\zzFC\
rxksaMHitA\zzFC\
{\zzBC 8-8-20} vaga\zzCJ\zzFC\
asYa mantrasYa niruktamX. ---
SataM jIva SadadO vardhamAna itYapi nigamO bhavati\zzFB\
Satamiti SataM dIgha\zzCJ mAYuma\zzCJruta EnA vadha\zzCJ Yanti
Satamonamona SatATmAnaM bhavati SatamanantaM bhavati SatamySvarYaM
bhavati Satamiti SataM dEgha\zzCJ mAYuah\zzFB\ Hariah OmX\zzFC
}
\caption{\label{egt1}Text from {\em rigveda}- in Telugu, transliteration in
Figure~\protect\ref{egr1}}
\end{figure}
\begin{figure}
\begin{verbatim}
{\tlc\QQQ
Hariah OmX\zzFC\ Sa\zzFJ taM jI\zzBFva Sa\zzFJ radO\zzFJ\ vardha\zzBF mAnaSSa\zzFJ taM
HE\zzBFma\zzFJ ntAnxcha\zzFJtamu\zzBF\ vasa\zzFJ ntA\zzCF\zzFB\
Sa\zzFJ tami\zzBF ndhrA\zzFJ gnI na\zzBF vi\zzFJ tA brx\zzBF Ha\zzFJ npatiSSa\zzFJ\
tAYu\zzBF ShA Ha\zzFJ viShE\zzFJ maM puna\zzBF rduah\zzFC\
rxksaMHitA\zzFC\
{\zzBC 8-8-20} vaga\zzCJ\zzFC\
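asYa mantrasYa niruktamX. ---
SataM jIva SadadO vardhamAna itYapi nigamO bhavati\zzFB\
Satamiti SataM dIgha\zzCJ mAYuma\zzCJruta EnA vadha\zzCJ Yanti
Satamonamona SatATmAnaM bhavati SatamanantaM bhavati SatamySvarYaM
bhavati Satamiti SataM dEgha\zzCJ mAYuah\zzFB\ Hariah OmX\zzFC
}
\end{verbatim}
\caption{\label{egr1}Transliteration of the text in
Figure~\protect\ref{egt1}}
\end{figure}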
\section{Problems in Typesetting Telugu Text}
When we try to adapt technologies developed for English to a vastly
different language, we face many difficulties, and more so in the case of Telugu.
We are somewhat spoiled by the simplicity of English text.
In this section we enumerate the difficulties faced in developing
a typesetting system for Telugu.
\begin{description}
\item[Tradition]English has a long tradition of typography.
English typography is widely studied and researched. There are well
established styles of typefaces. This is not the case with Telugu.
Much of the composition in Telugu is still done by hand. Use of
computers to typeset Telugu is really a recent phenomenon.
\item[Character Set]In English the character set is well defined.
In fact there are effective standards in place.
The Telugu character set is still evolving. Some letters that were used
a few decades ago are not in use any more.
Some symbols have been replaced with new symbols.
There is no unanimity with regard
to symbols used in transliteration of Sanskrit text. Our
intent has been to provide all the symbols that have been used over
the past 200 years. We were fortunate to be able to obtain facsimiles
of type specimens of a Telugu font cut in 1802 from a museum in London.
We also looked at various publications produced since then.
\item[Complexity of Script]English text is composed by laying letter after
letter in a linear manner. Composition of Telugu is very different.
Composition is done by laying syllable after syllable. But composing
a syllable (see page~\pageref{syldef} for the definition
of a syllable) is very complex. The total number of possible syllables is
very large. So we cannot store the images of all the syllables.
Syllables need to be composed
on the fly. Composing a syllable means juggling symbols, stacking symbols,
juxtaposing symbols.
Composing a syllable requires a lookahead of several characters. The first step in
composing a syllable is to modify the base consonant (first consonant) by
the vowel which appears after the intervening consonants (consonant conjuncts).
This problem of building up a syllable brings
us to the next item.
\item[Standards] In the case of Telugu, we should distinguish between the
coding system for text storage/transmission (we will call this the
Information Interchange Code, IIC) and the coding system for the font
that is used for rendering (printing/displaying) text. The font
contains graphical elements that can be used to compose
syllables. Many of the symbols in the font may not look like any
letter in the alphabet. IIC contains codes for consonants, vowels,
accents, digits and punctuation marks. Each syllable is stored
as a sequence of constituent consonants, vowels and accents.
In Telugu there are 16 vowels,
36 consonants and 11 accents. There is a standard for Telugu
information interchange. In fact it is the same for most of the languages
used in India since they all have the same phonetic structure.
But there is no standard for the font. It is very difficult to
define a standard for assigning codes to elements of a font.
It is difficult to decide what should be in the font.
The structure of each font depends on the font designer.
A font designer (for Telugu) would use various criteria, such as the number
of symbols permitted in a font, adherence to tradition, choice
of alphabet, complexity of composition,
features available in the font development system and features available in
the composition system, to determine what primitive strokes
or forms should be included in the font.
In English, the information interchange code corresponds directly
to the encoding used in the font. This makes life a lot easier. It also
explains why it is very difficult to adapt a text composition
system developed primarily for English to Telugu.
\item[Letterforms]Most of the existing Telugu letterforms are heavy along
the baseline, which makes them hard on the eyes of the reader. Many letterforms
lack consistency. This may be because Telugu fonts
are typically designed by metalsmiths/calligraphers who are not well versed in
modern principles of type design and do not have access to modern
technology.
We have designed our letter forms from scratch. No attempt was made to
imitate an existing font. In fact we have not found any font that we thought
was good. We are fortunate in being able to use a very sophisticated
font development system called METAFONT.
\item[Parametrization]One of the major strengths of METAFONT is its ability
to generate fonts for a variety of devices in a variety of styles and
sizes. In order to realize this facility we need to identify key parameters
of the font. Then we can play with these parameters to obtain various
fonts. We have tried to define various parameters for Telugu script.
\item[Font Metric Data]In the case of Telugu we need to provide certain measurements
about the location of some key points in a symbol. These measurements need to be
stored along with the font. The measurements are used by the text composition
system to align/position the symbols during syllable construction.
Font metric files generally do not have any provision for storing optional
font metric data. To overcome this problem we write the measurements in
the form of \TeX\ macros to the log file. After font generation the macros are
extracted and placed in a separate file (called the offsets file). The measurements
(offsets) are generated in relative units ({\it em} units). This means that
only one offsets file is needed (regardless of size/style).
\item[Speed]Syllable building is a very CPU intensive job. Each syllable is
pieced together from constituent parts. Typesetting Telugu text is a slow process.
\item[Hyphenation]
\TeX\ hyphenation algorithms are not applicable to Telugu text.
We use the rule that a syllable should not be broken
across lines. Since a syllable is built in a box and boxes are not
broken up, this rule is satisfied. But then we are not giving \TeX\
much freedom to do line breaking. We insert a discretionary hyphen
between two
syllables. Our experience indicates that this works fine.
\end{description}
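The line-breaking rule described under Hyphenation above can be sketched in
a few lines of \TeX. This is only an illustration: the roman strings stand
in for composed syllables, and the macros of the actual system are not shown.
\begin{verbatim}
% Each syllable sits in an \hbox, so it can never be split;
% \- between the boxes marks the only permitted break points.
\leavevmode\hbox{te}\-\hbox{lu}\-\hbox{gu}
\end{verbatim}
If a line ends at one of these points, \TeX\ typesets a hyphen there;
otherwise the discretionary leaves no trace in the output.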
\section{Implementation}
The system has two components. First is the font developed using METAFONT.
These symbols are used for composing Telugu text.
The second component is used for Telugu text composition. It is
written in the \TeX\ macro language. Telugu is a phonetic language. The user enters
Telugu as he/she would speak it (i.e. phonetically). But the script is not
linear. Unlike English, the keys on the keyboard do not correspond to the
symbols in the font. We need a mechanism that interprets the phonetic input
and composes aksharas (or syllables). This requires a considerable amount of
processing, and it is what the second component does: the phonetic
transcription is interpreted and syllables are formed.
As mentioned earlier,
there are two very
different approaches to inputting Telugu (and other Indian languages). One
is phonetic (as ours is) and the other is called the graphical approach. In the
latter, the user picks individual symbols from the font and composes the
syllable. For Telugu, the graphical approach would be tedious and very
difficult.
We have tried to make the second component applicable for other Indian
languages as well. Since all Indian languages have similar phonetic structure,
this component can be easily adapted to other Indian languages.
Next we provide an outline of our implementation.
Section~\ref{fona} deals with font development and Section~\ref{composea}
deals with text composition.
\subsection{Font Development\label{fona}}
The font is developed using the METAFONT program. The font has over 230 symbols.
In addition to symbols used to compose Telugu text, the font includes
Telugu as well as Hindu-Arabic digits, punctuation symbols, non-alphabetic
characters from ASCII, accents and symbols needed for transliteration
of Sanskrit.
One may wonder why we worry about symbols that are available in Computer Modern
Roman. The answer is that we want the composed text to look consistent
and harmonious. Since Telugu characters have a rounded appearance,
we decided to use circular pens. In olden days Telugu
used to be written on palm leaves using a metal stylus with a round nib.
Following are the steps in developing the font:
\begin{enumerate}
\item Identify all the symbols that are needed to compose syllables.
\item Define a grid framework for drawing the symbols. In this grid
we define a baseline, ascender height, descender depth,
top shoulder, bottom shoulder and a few other lines.
\item Assign codes to each symbol. All code references are
made symbolically and codes are defined in one separate file.
It would be very easy to change code assignment. We have used the
same code as ASCII for the symbols that also occur in ASCII.
\item Make rough sketches of all these symbols and other elements of
the font on graph paper using a consistent scale.
\item Enumerate the parameters of the font. This is a delicate job. The
parameters are modified to obtain various styles/sizes of the font.
\item Define font dimensions (see {\em The METAFONTbook}).
\item Identify all the strokes/shapes that occur more than once. Write
macros to draw these strokes/shapes.
\item Write a METAFONT program for each symbol in the font taking care
to use macros whenever possible and refer to parameters.
First we would look at the sketch and identify control points.
Specify coordinates in relative terms. That is, the coordinates
are defined in terms of font parameters and box dimensions (width,
height and depth). Box dimensions are defined in terms of parameters,
u (unit width) and uh (unit height).
Parameters are either unit free (e.g. ratios, scaling factors, etc.) or
are defined in terms of other parameter(s) and/or u and/or uh.
u and uh are defined in terms of the design size. One exception
to this rule is pen\_width (diameter of the circular pen). This
is specified as an absolute quantity.
In order to generate the font we need to provide the design size,
pen size/shape, slant and optionally, the magnification factor.
\item METAFONT produces dimensions of the bounding box of each symbol in the
font and these dimensions are used by text composition software. For
Telugu we have some special requirements. When a consonant is modified
by a vowel, vowel modifier is attached to the consonant base. The point
of contact varies from consonant to consonant. The offset to this
point should be known to the syllable building mechanism. This offset
(in em units) is generated as a \TeX\ macro and written onto the log file.
Similarly, when various symbols are stacked on the top/bottom of a
C+V, the syllable building mechanism needs to be aware of the offset
by which to shift the symbols. This offset is
also generated as a \TeX\ macro
and written onto the log file. All the macros are extracted from
the log file and copied into another file that is read by syllable
building software. The offsets are very sensitive to changes to the
font. Because of the way we handle offsets, we are free to change the font
and new offsets are automatically recorded. As stated earlier,
there is no standard for the
assignment of codes in the Telugu font. We have decided on a particular
encoding scheme. But the codes are not hardwired into the syllable
building mechanism. The codes are generated as \TeX\ macros and written
onto the log file. If somebody wants to change the codes, all that
he/she needs to do is alter the entries in the code file and generate
the font.
\item Assign space on the left and/or on the right of symbols. Since
the symbols are pieced together to form syllables, we need to
be very careful about placing white space on the sides.
\item Prepare ligature tables.
\end{enumerate}
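The offset mechanism described in the list above might look as follows on
the \TeX\ side. This is a sketch under stated assumptions: the macro name
\bs tloffsetka, the register \bs tlshift and the value 0.42 are invented
for illustration and do not come from the actual offsets file.
\begin{verbatim}
% Hypothetical entry, as it might be extracted from the log file:
\def\tloffsetka{0.42}    % contact point on ka, in em units
% Hypothetical use while building a syllable:
\newdimen\tlshift
\tlshift=\tloffsetka em  % resolved in the current font
\end{verbatim}
Since {\tt em} is taken from whichever font is current, the same entry
yields a proportionally larger shift in a larger font, which is consistent
with the remark that a single offsets file suffices.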
By varying the font size, the pen size/shape and the slant
we can obtain several
fonts. All font file names have ``tel'' prefix. This prefix is followed
by a number indicating the font size (in points). Font size is defined
as the distance between the top edge of the top shoulder and the
bottom edge of the bottom shoulder.
If the size in a font filename is followed by `s' then it indicates
a slanted font.
If the size in a font filename is followed by `b' then it indicates
a bold font.
\subsection{Text Composition\label{composea}}
The programs for syllable building are written in the \TeX\ macro language.
There are two modules. One is used to interpret user input. This module
implements the transliteration scheme proposed in Section~\ref{c2} (see
page~\pageref{t35}). If a different transliteration scheme or input
system is to be used then this module will need to be modified.
The second module identifies the components of a syllable and builds the
syllable. Every syllable is output as a horizontal box. Each syllable
box may have many constituent boxes. The main activity in this part
involves box manipulation. Boxes are stacked, shifted, attached to obtain
the complete syllable.
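As a rough illustration of this kind of box manipulation, the macro below
places one symbol centred beneath another. It is a sketch, not the code of
the syllable module: \bs tlbelow is a hypothetical name, and a real
implementation would presumably take its shifts from the offsets file
instead of centring.
\begin{verbatim}
% #1 = base (C+V), #2 = symbol to be placed below it
\def\tlbelow#1#2{\leavevmode
  \setbox0=\hbox{#1}\setbox2=\hbox{#2}%
  \hbox{\copy0               % the base, kept in box 0
    \kern-\wd0               % back up to its left edge
    \lower\ht2               % top of #2 meets the baseline
      \hbox to\wd0{\hss\box2\hss}}}
\end{verbatim}
The resulting box has the width of the base, so the following syllable
starts where it would have started without the symbol underneath.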
\subsubsection{Transliteration Module (tlxlate.tex)}
This module reads one or two characters and identifies the corresponding
letter (vowel or consonant or accent) in Telugu and invokes corresponding
macro in the Syllable module. The logic is:
\begin{enumerate}
\item Read a character
\item If more than one transliteration code can begin with this
character then look ahead one character.
\item Identify the letter (vowel/consonant/accent).
\item Invoke the corresponding macro.
\end{enumerate}
The principal mechanism for implementing transliteration is the concept of
{\em active} characters. The main advantage of this technique is that it does
not require any extra effort on the part of the user.
All the characters that can begin a transliteration code are declared
as active characters. There is a macro for each active character.
These macros take one of the following two forms:
\begin{enumerate}
\item If the character cannot begin another transliteration code,
invoke the macro corresponding to the letter.
\item Otherwise, look at the next character to identify the letter.
\end{enumerate}
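The second form can be sketched with an active character and
\bs futurelet. This is a minimal illustration for the pair k/kh only; the
names \bs tlconsonant, \bs tlnext, \bs tlvelar and \bs tlaspirate are
hypothetical and are not the macros of the transliteration module.
\begin{verbatim}
\def\tlconsonant#1{[#1]}   % stand-in for a Syllable-module macro
\def\tlvelar{%             % runs after an active k has been read
  \ifx\tlnext h%           % next character is h: the code is kh
    \expandafter\tlaspirate
  \else
    \tlconsonant{k}%       % otherwise the code is plain k
  \fi}
\def\tlaspirate#1{\tlconsonant{kh}}  % #1 swallows the h
% Names used inside this group must not contain the letter k,
% because k is no longer a letter there.
{\catcode`\k=\active       % k is active only while we define it
 \gdef k{\futurelet\tlnext\tlvelar}}
% Usage: {\catcode`\k=\active kha ka} gives [kh]a [k]a
\end{verbatim}
The user types ordinary letters throughout; all of the decision making
happens in the expansion of the active character.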
To input a non-alphabetic character, one need only enter the
corresponding symbol
(subject to \TeX\ input rules).
The user can choose between Hindu-Arabic digits and Telugu digits. It is easy
to switch between these two kinds of digits. Two macros are provided
to make this switch. To start with, Hindu-Arabic digits are provided. The
macro used to switch to Telugu digits does the following:
\begin{enumerate}
\item Declare all the digits as active characters.
\item In the macro for a digit, output the corresponding Telugu digit.
\end{enumerate}
The macro for switching to Hindu-Arabic digits declares all digits
to be non-active characters. The transliteration module does not bother
about non-alphanumeric characters because \TeX\ would automatically
handle those symbols (just as in case of English text). All
non-alphanumeric characters
and Hindu-Arabic digits in Telugu font appear at the same positions
as in ASCII.
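A minimal sketch of this switching mechanism for a single digit is given
below. The names \bs tlteludigits, \bs tlromandigits and \bs tldigitone
and the font position are assumptions made for the illustration; the
system's own macros are \bs zzBC and \bs zzCB.
\begin{verbatim}
% The code is stored with \chardef first: a literal '261 inside
% the definition below would itself contain the active digit 1.
\chardef\tldigitone='261   % assumed position of Telugu digit 1
{\catcode`\1=\active       % 1 is active while we define it
 \gdef1{\tldigitone}}      % active 1 prints the Telugu digit
\def\tlteludigits{\catcode`\1=\active}
\def\tlromandigits{\catcode`\1=12 }
% Usage: {\tlteludigits 1} 1
\end{verbatim}
Outside the group in the last line the digit keeps its ordinary meaning,
so the user types the same numeric keys in both cases.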
\subsubsection{Syllable Module (tlsyllable.tex)}
In this module we build a syllable from its constituent
consonant(s) and vowel.
First we give a brief introduction
to syllables in Telugu script and then
we outline our implementation.
\subsubsection*{Introduction}
We may recall that a syllable (or Akshara) can take one of the two forms.
$C_1$ is considered the base
consonant. $C_2$ to $C_n$ are called consonant conjuncts. Consonant
conjuncts are optional. The Vowel modifies the base consonant.
The vowel modifier may be absent.
This is indicated by a special code which appears in place of the vowel
modifier. For the purpose of implementation we will treat this code
as another vowel modifier.
The form of Base Consonant + Vowel (henceforth referred to as C+V)
varies depending on both
the consonant and the vowel. The vowel modifier can appear
to the right, on top or at the bottom of the base consonant, and
in many cases a completely different symbol is needed.
The way the consonants are modified by
the vowel is not governed by any uniformly applicable rules, and this lack of
consistency contributes to the complexity of the Telugu script.
After C+V is formed, consonant conjuncts are attached
in the order in which they appear in the syllable.
For each consonant, the conjunct form differs from both the base form
and the C+V forms. Some consonant
conjuncts are stacked beneath C+V, whereas others appear
to its right.
An Akshara can optionally have accent(s). Accents can appear
at the top, on the side or at the bottom of a syllable.
The logic to process a syllable is:
\begin{enumerate}
\item Identify the base consonant ($C_1$)
\item Scan for the next vowel while saving the intervening consonant conjuncts
($C_2$\ldots$C_n$).
\item Once the vowel modifier is identified, compose $C_1$+$V$.
This is the messy part. Though many of the consonants are modified
uniformly by a given vowel, there are too many exceptions.
\item Generate consonant conjuncts.
\item Apply accents.
\item Add space following a syllable.
\item If preceded by a box then insert a discretionary hyphen.
\item Release the syllable box.
\end{enumerate}
\subsubsection*{Implementation}
Next we will discuss the implementation of the syllable
building mechanism.
The main strategy to build a box containing the syllable involves
piecing together boxes containing constituent symbols from the font.
Boxes are shifted, lowered, raised and stacked to obtain the right
appearance for the syllable. In summary, syllable building is
an exercise in box juggling or box manipulation.
All shifts and kerns must, however, be carefully
controlled.
All box dimensions must be accurately specified, and no part of a
symbol may spill over its bounding box; otherwise the syllables will
be malformed. This
calls for extreme care in specifying box dimensions in the METAFONT
programs.
In this module we have an extensive collection of macros to
shift/lower/raise boxes and to recompute box dimensions affected
by shifts/kerns.
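As an illustration of this kind of box manipulation, the following sketch
places one glyph beneath another and lets \TeX\ recompute the depth of the
result. It is a simplified example and not a macro from the module: the
names \bs exsyl, \bs excc, \bs exshift and \bs exstackbelow are
hypothetical, and Roman letters stand in for Telugu glyphs.
\begin{verbatim}
\newbox\exsyl   % the syllable built so far
\newbox\excc    % the glyph to be placed underneath
\newdimen\exshift
\def\exstackbelow#1#2{%
  \setbox\exsyl=\hbox{#1}%
  \setbox\excc=\hbox{#2}%
  % Lower the second glyph until its top meets the bottom of the first.
  \exshift=\dp\exsyl
  \advance\exshift by \ht\excc
  \setbox\exsyl=\hbox{%
    \rlap{\lower\exshift\box\excc}% takes up no width
    \box\exsyl}%
  \leavevmode\box\exsyl}
\end{verbatim}
A call such as {\tt\bs exstackbelow\{A\}\{b\}} produces a box that has the
width of the first glyph and a depth that includes the lowered second glyph.
If the lowered glyph is wider than the base it will overhang, which is one
reason why the dimensions have to be controlled so carefully.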
After processing, \bs syl box contains the syllable. The consonant conjunct
tokens are built up in the token list \bs cctok. The name of the base consonant
is saved in \bs cbtok. In this module various letters of Telugu are identified
by unique names.
These names are local to this module.
The names we use for various letters are significant. They
cannot be changed arbitrarily. The names are used in defining macro names.
For example, the macro to process consonant conjunct x is \bs tlccx. The macro
to process C(x)+V(y) is \bs tlcvxy.
It is
unlikely that the user will invoke the macros defined here directly.
When a vowel x is to be processed, the macro \bs tlvox is invoked. When
a consonant x is to be processed, the macro \bs tlcbx is invoked.
These macros are invoked from the transliteration module.
The consonant processing macro determines if the consonant is the base
consonant or a conjunct; if base consonant, the syllable processing is
initiated. If conjunct, the macro to form the conjunct is added to the token
list \bs cctok.
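The accumulation of conjunct macros in a token list can be sketched as
follows. The register \bs exconj and the macros \bs exaddconj and
\bs exrunconj are hypothetical names chosen for this illustration; only the
\bs tlcc prefix is taken from the naming convention described above.
\begin{verbatim}
\newtoks\exconj
% Append the call for conjunct #1 to the list without executing it.
\def\exaddconj#1{%
  \exconj=\expandafter{\the\exconj \csname tlcc#1\endcsname}}
% Execute the saved calls in order, then empty the list.
\def\exrunconj{\the\exconj \exconj={}}
\end{verbatim}
In this sketch the macro name is only assembled by \bs csname when the list
is executed, so a conjunct whose macro is undefined would silently produce
nothing.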
The vowel processing macro determines if the vowel is a stand-alone vowel
or a modifier. If it is a stand-alone vowel, it just outputs the character.
If it is a modifier, then the following actions are taken:
\begin{itemize}
\item If the C+V combination needs special treatment, the macro to process
consonant x and vowel y (\bs tlcvxy) is called. Otherwise the macro to process
the vowel modifier (\bs tlvmy) is called.
\item Process consonant conjuncts.
\end{itemize}
How does the syllable module know which (C+V) combinations need special
treatment? Vowels are partitioned into groups such that if a
consonant needs special handling when combined with a vowel in a
group then the consonant also needs special treatment when combined
with other vowels in the same group. The groups are {\tlc\QQQ
\{A\}, \{i,I\}, \{u,U\}, \{o,O\}, \{ow\}} and others. We assign a prime number to
each group. We also assign a number to each consonant: the
product of the prime numbers associated with the vowel groups that
need special treatment when combined with this particular consonant.
If a consonant needs no special treatment then it gets a distinct
prime number (one not assigned to any vowel group). Given a consonant (C)
and a vowel (V), if the number
for V divides the number for C evenly then this C+V combination
needs special care.
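As a worked example with purely hypothetical numbers, suppose the groups
\{A\}, \{i,I\} and \{u,U\} are assigned 2, 3 and 5. A consonant that needs
special treatment with the second and third groups gets $3 \times 5 = 15$.
For the vowel i the test is $15 = 5 \times 3$ with no remainder, so the
combination is special; for the vowel A we get $15 = 7 \times 2 + 1$, so the
generic vowel modifier is used. The test and the resulting dispatch could be
written as in the sketch below, where \bs exquot, \bs exrem, \bs exdivides
and \bs excompose are hypothetical names and only the \bs tlcv and \bs tlvm
prefixes come from the naming convention of the module.
\begin{verbatim}
\newcount\exquot
\newcount\exrem
% \exdivides{v}{c}{yes}{no}: do <yes> if v divides c evenly, else <no>.
\def\exdivides#1#2#3#4{%
  \exquot=#2\relax
  \divide\exquot by #1\relax
  \exrem=\exquot
  \multiply\exrem by #1\relax
  \ifnum\exrem=#2\relax #3\else #4\fi}
% #1 = consonant name, #2 = number of the consonant,
% #3 = vowel name,     #4 = prime of the vowel's group.
\def\excompose#1#2#3#4{%
  \exdivides{#4}{#2}%
    {\csname tlcv#1#3\endcsname}%
    {\csname tlvm#3\endcsname}}
\end{verbatim}
The test relies on the fact that \bs divide discards the remainder, so
multiplying the quotient back gives the original number only when the
division is exact.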
This module contains macros to process accents. There is a macro for
each accent. Processing accents involves retrieving the last syllable
box and placing the accent. The placement could be at the top or at the bottom
or on the side depending on the particular accent. A syllable can have
multiple accents. It is the responsibility
of the user to ensure that a proper combination
of accents is used.
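Retrieving the last syllable box and attaching an accent to it might look
like the following sketch. We assume here that the accent macro is called
immediately after the syllable box has been appended to the current
horizontal list, so that \bs lastbox can recover it; \bs exprev and
\bs exaccentabove are hypothetical names, and the accent is simply centred
above the syllable.
\begin{verbatim}
\newbox\exprev
\def\exaccentabove#1{%
  \setbox\exprev=\lastbox % take the syllable back off the list
  \setbox\exprev=\hbox{%
    \rlap{\raise\ht\exprev\hbox to \wd\exprev{\hss#1\hss}}%
    \box\exprev}%
  \box\exprev}
\end{verbatim}
Because the rebuilt box is put back on the list, a second accent macro can
retrieve it again, which is how a syllable could carry multiple accents in
this sketch.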
\ifundefined{tla}\else{\vspace{0.5cm}Specimen of tla (tel10)\\\tla
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlanx}\else{\vspace{0.5cm}Specimen of tlanx (tel10nx)\\\tlanx
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlany}\else{\vspace{0.5cm}Specimen of tlany (tel10ny)\\\tlany
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlab}\else{\vspace{0.5cm}Specimen of tlab (tel10b)\\\tlab
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlas}\else{\vspace{0.5cm}Specimen of tlas (tel10s)\\\tlas
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlb}\else{\vspace{0.5cm}Specimen of tlb (tel11)\\\tlb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlbnx}\else{\vspace{0.5cm}Specimen of tlbnx (tel11nx)\\\tlbnx
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlbny}\else{\vspace{0.5cm}Specimen of tlbny (tel11ny)\\\tlbny
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlbb}\else{\vspace{0.5cm}Specimen of tlbb (tel11b)\\\tlbb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlbs}\else{\vspace{0.5cm}Specimen of tlbs (tel11s)\\\tlbs
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlc}\else{\vspace{0.5cm}Specimen of tlc (tel12)\\\tlc
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlcnx}\else{\vspace{0.5cm}Specimen of tlcnx (tel12nx)\\\tlcnx
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlcny}\else{\vspace{0.5cm}Specimen of tlcny (tel12ny)\\\tlcny
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlcb}\else{\vspace{0.5cm}Specimen of tlcb (tel12b)\\\tlcb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlcs}\else{\vspace{0.5cm}Specimen of tlcs (tel12s)\\\tlcs
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tld}\else{\vspace{0.5cm}Specimen of tld (tel15)\\\tld
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tldb}\else{\vspace{0.5cm}Specimen of tldb (tel15b)\\\tldb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlds}\else{\vspace{0.5cm}Specimen of tlds (tel15s)\\\tlds
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tle}\else{\vspace{0.5cm}Specimen of tle (tel18)\\\tle
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tleb}\else{\vspace{0.5cm}Specimen of tleb (tel18b)\\\tleb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tles}\else{\vspace{0.5cm}Specimen of tles (tel18s)\\\tles
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlf}\else{\vspace{0.5cm}Specimen of tlf (tel20)\\\tlf
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlfb}\else{\vspace{0.5cm}Specimen of tlfb (tel20b)\\\tlfb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlfs}\else{\vspace{0.5cm}Specimen of tlfs (tel20s)\\\tlfs
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlg}\else{\vspace{0.5cm}Specimen of tlg (tel25)\\\tlg
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlgb}\else{\vspace{0.5cm}Specimen of tlgb (tel25b)\\\tlgb
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlgs}\else{\vspace{0.5cm}Specimen of tlgs (tel25s)\\\tlgs
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlh}\else{\vspace{0.5cm}Specimen of tlh (tel30)\\\tlh
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlhs}\else{\vspace{0.5cm}Specimen of tlhs (tel30s)\\\tlhs
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tli}\else{\vspace{0.5cm}Specimen of tli (tel35)\\\tli
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlis}\else{\vspace{0.5cm}Specimen of tlis (tel35s)\\\tlis
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ceDipOtAru''.}}\fi
\ifundefined{tlj}\else{\vspace{0.5cm}Specimen of tlj (tel40)\\\tlj
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ''}}\fi
\ifundefined{tljs}\else{\vspace{0.5cm}Specimen of tljs (tel40s)\\\tljs
{\QQQ ``parulanu mosagiMci HAnicEYa talapeTTinavAru tAmE ''}}\fi
\ifundefined{tlk}\else{\vspace{0.5cm}Specimen of tlk (tel55)\\\tlk
{\QQQ ``parulanu mosagiMci ''}}\fi
\ifundefined{tlks}\else{\vspace{0.5cm}Specimen of tlks (tel55s)\\\tlks
{\QQQ ``parulanu mosagiMci ''}}\fi
\ifundefined{tll}\else{\vspace{0.5cm}Specimen of tll (tel72)\\\tll
{\QQQ ``parulanu ''}}\fi
\ifundefined{tlm}\else{\vspace{0.5cm}Specimen of tlm (tel100)\\\tlm
{\QQQ ``parulanu ''}}\fi
\ifundefined{tln}\else{\vspace{0.5cm}Specimen of tln (tel172)\\\tln
{\QQQ parula }}\fi
\ifundefined{tlspa}\else{\vspace{0.5cm}Specimen of tlspa (telspa)\\\tlspa
{\QQQ ``parulanu ''.}}\fi
\ifundefined{tlspb}\else{\vspace{0.5cm}Specimen of tlspb (telspb)\\\tlspb
{\QQQ ``parulanu ''.}}\fi
\ifundefined{tlspc}\else{\vspace{0.5cm}Specimen of tlspc (telspc)\\\tlspc
{\QQQ ``parulanu ''.}}\fi
}
\section{Summary}
In this project, we have undertaken to produce a typesetting system for
a very complex script. We have used two top quality
programs (\TeX\ and METAFONT) developed
by Prof. Donald E. Knuth of Stanford University. There are two parts
to the project. One is the development of fonts using METAFONT and
the other is writing programs (i.e. macros) in \TeX. METAFONT is a very
difficult language to program in. \TeX\ is not really a good programming
language. To realize the full potential of these programs
we need to do extensive groundwork: parameters need to be defined,
macros written for frequently occurring shapes, and a
program written for each symbol of the font.
This is an iterative process.
All of these are time-consuming and highly skilled tasks. When developing
the font, we had to ensure that our font would blend harmoniously with
Roman fonts when used in a document that contains both Roman and Telugu
fonts. This is a very delicate job because we are dealing with two
radically different scripts. Printers producing Telugu use hundreds of
different types (metal blocks) with the image of a symbol cut on the
face. But METAFONT and \TeX\ allow no more than 256 symbols in a font.
By identifying the independent forms and exploiting the capabilities
of \TeX\ we have managed to keep the font size well within this limit.
The quality of the font we have developed is satisfactory but will
need to be refined as we gain experience with the system. Telugu input to
\TeX/\LaTeX\ is presented in the form of Roman transliteration. Our
macros interpret this romanized text, identify the syllables and compose
them. All these tasks are CPU intensive, and hence typesetting
Telugu takes considerably more time than typesetting English text.
Kannada is another language spoken in South India. Kannada script is very
similar to Telugu script. Our typesetting system can be easily adapted
to Kannada. Parts of our system can be used for extending \TeX\ to other
South Asian \& Southeast Asian languages.
Our system consists of well defined and independent modules. This
confers a lot of flexibility on the typesetting system.