Part C - Computers

Character Sets and Fonts

Present a brief overview of the history of computers
Describe the elements of textual communication




Computer technology has evolved dramatically over the last half-century.  The machines used during the infancy of computing are now found in museums, and today's computing environment is substantially different. 

Initially, the character sets used in computing were limited to those that represent English-language scripts.  This has also changed substantially.  To render characters in all of the scripts used throughout the world, we rely on a variety of character sets and fonts.  Each character set maps letters and other symbols to numeric codes.  We store characters and transmit them across networks using these numeric codes.  We use fonts to convert the numeric codes into the visual symbols that appear on screens and in print.  Several different fonts may represent the same character or numeric code. 

Each character set may be described by a code page.  The terms character set, character encoding, character map, and code page are synonymous in HCI.  The character sets that represent the scripts of the world are all tied to the structure of computer memory:

  • ASCII
  • EBCDIC
  • ISO 8859
  • UCS
  • Unicode
  • Windows

In this chapter, we very briefly review the history of computers.  We describe the variety of character sets that have become available and the differences between them.  We distinguish the character sets from the fonts used to represent various characters. 


History of Computers

In the 1960s, computers operated in batch environments.  They filled entire rooms and rented for hundreds of thousands of dollars per month.  Computer operators entered data and programs using card readers that accepted decks of punched cards and readers that accepted paper tape.  Jobs queued to run in priority sequence.  Several hours elapsed before execution results queued for printing.  Computer operators removed the printouts from line printers and filed them by user name; only then could the programmer or user collect the results.  The turnaround time was typically between four and eight hours.  Card readers, paper tape readers, and line printers were the standard interfaces.  The computers themselves were housed in air-conditioned rooms, and only trained operators were allowed access to the consoles, disk drives, tape drives, and core memory.  The lease of a computer often included operator staff trained by the manufacturer to run and maintain these machines.

On-site batch processing was followed by timeshared batch processing, in which users at different locations shared time on the same remote computer and dialed into that computer over telephone lines.  Timesharing used the already well-established teletype technology of the day.  An operator prepared a batch job on a teletypewriter that punched paper tape locally.  Once the paper tape was ready, the operator sent the job to the remote mainframe computer over a dedicated telephone line.  Depending upon the timesharing contract, the remote computer would return the results in minutes or hours.  The teletypewriter punched the results on paper tape, which the operator then fed back through the same machine to print a readable hardcopy. 

The introduction of personal computers in the 1980s reduced the turnaround time from hours to seconds.  IBM introduced its 8-bit PC, the model 5150, in 1981.  The 5150 included two 5-1/4" floppy drives, a monochrome monitor, and a standard IBM keyboard modelled on the IBM Selectric typewriter.  One floppy contained the operating system and the other contained the program along with its input and output data.  The system included 128 KB of primary memory. 

Within two years, IBM produced and shipped a newer model called the XT, which included a 10 MB hard drive.  The hard drive replaced the second floppy drive and could store several programs and their data alongside the operating system.  The XT cost on the order of $10,000 at the time.  Two years later, IBM produced and shipped its first 16-bit computer, called the AT (which stood for Advanced Technology). 

By the 1990s, the prices of personal computers had dropped significantly, response time had narrowed to fractions of a second and the available memory had expanded many times over.

Although some of the devices introduced over these years were considered peripherals, many are now integral parts of our modern-day systems.


ASCII

The American Standard Code for Information Interchange (ASCII) was the first standard character set widely adopted in computing.  It was published in 1963 and originally used with teleprinters and paper tape.  In 1968, President Johnson mandated that all computers purchased by the United States government support the ASCII character set. 

The ASCII character set uses the seven lower-order bits of a byte to represent 128 English letters, digits, and punctuation characters.  The eighth, or highest-order, bit was used for parity checking, since transmission errors were quite frequent. 

The ASCII table covers 94 printable characters, 33 non-printable control characters, and the space character.  Most modern encoding schemes are compatible with ASCII. 

When 16- and 32-bit computers replaced 18- and 36-bit computers, some manufacturers introduced extensions to ASCII that made use of all 8 bits of a byte and could represent 256 characters.  ISO 8859, described below, is one family of such extensions.
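
A short Python sketch illustrates these points: ord() returns the numeric code assigned to a character, and every character in a 7-bit ASCII string encodes to a byte value below 0x80.

    # Inspecting ASCII code points (a minimal sketch).
    for ch in ['A', 'a', '0', ' ', '~']:
        print(f"{ch!r} -> {ord(ch):3d} (0x{ord(ch):02X})")   # decimal and hexadecimal code

    # Every ASCII character fits in the seven lower-order bits of a byte.
    assert all(b < 0x80 for b in "Hello, world!".encode("ascii"))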


EBCDIC

IBM introduced the Extended Binary Coded Decimal Interchange Code (EBCDIC) in 1963 for pragmatic shipping reasons: it extended the binary-coded decimal encoding already used by IBM's punched-card equipment, which allowed the System/360 to ship with existing peripherals.  The System/360 was hugely successful, and hence so was EBCDIC.  EBCDIC remains the encoding used by IBM mainframes and midrange systems such as the AS/400. 

The EBCDIC table uses all 8 bits of a byte so that it represents 256 characters.  Hence, parity checking is not available.  There are four main blocks in the EBCDIC code page:

  • 0000 0000 to 0011 1111 - control characters
  • 0100 0000 to 0111 1111 - punctuation
  • 1000 0000 to 1011 1111 - lowercase characters
  • 1100 0000 to 1111 1111 - uppercase characters and numbers

EBCDIC is incompatible with ASCII and with ASCII derivatives such as Unicode.  EBCDIC's primary advantage over ASCII was the increased durability of punched cards: it limited the number of hole punches for uppercase letters and numbers to two per column.  EBCDIC includes the cent sign (¢), which standard ASCII does not.  Standard versions of EBCDIC do not include the ASCII characters []\{}^~|.

There are many different dialects of EBCDIC.  Most tend to differ in punctuation coding. 
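
As a quick illustration, the sketch below (using Python's built-in cp037 codec, one common EBCDIC dialect) shows that the same letters map to very different byte values under ASCII and EBCDIC.

    # Comparing ASCII and EBCDIC byte values for the same characters.
    for ch in "Aa0":
        ascii_byte = ch.encode("ascii")[0]
        ebcdic_byte = ch.encode("cp037")[0]    # cp037 is an EBCDIC (US/Canada) dialect
        print(f"{ch!r}: ASCII 0x{ascii_byte:02X}  EBCDIC 0x{ebcdic_byte:02X}")
    # 'A': ASCII 0x41  EBCDIC 0xC1 - note that 0xC1 lies in the EBCDIC uppercase block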


ISO 8859

The European Computer Manufacturers Association (ECMA) designed what became ISO 8859 to address ASCII's inability to represent some characters in non-English European languages that use the Latin alphabet, such as German, Spanish, Swedish, and Hungarian. 

ISO 8859 describes a family of 8-bit character sets for representing different language groups and was the default character set for use on the web until 2004.  In June 2004, the ISO 8859 working group was disbanded. 

ISO 8859 identifies 16 numbered language groupings, of which 15 were published (part 12 was abandoned).  Each grouping supports languages that often borrow from one another:

Number | Group Name | Languages
1 | Latin-1 (Western Europe) | Danish, Dutch (partial), English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, and Swedish
2 | Latin-2 (Central Europe) | Bosnian, Polish, Croatian, Czech, Slovak, Slovene, Serbian, and Hungarian
3 | Latin-3 (Southern Europe) | Turkish, Maltese, and Esperanto
4 | Latin-4 (Northern Europe) | Estonian, Latvian, Lithuanian, Greenlandic, and Sami
5 | Latin/Cyrillic | Belarusian, Bulgarian, Macedonian, Russian, Serbian, and Ukrainian
6 | Latin/Arabic | Arabic
7 | Latin/Greek | Greek
8 | Latin/Hebrew | Hebrew
9 | Latin-5 (Turkish) | Icelandic, Turkish
10 | Latin-6 (Nordic) | Nordic languages
11 | Latin/Thai | Thai
12 | Latin/Devanagari | non-existent (abandoned)
13 | Latin-7 (Baltic Rim) | Baltic languages
14 | Latin-8 (Celtic) | Gaelic and Breton
15 | Latin-9 | French, Finnish, Estonian
16 | Latin-10 (South-Eastern European) | Albanian, Croatian, Hungarian, Italian, Polish, Romanian, and Slovene
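
Because every part of ISO 8859 reuses the same 8-bit range, a single byte value can stand for different characters in different parts.  The following minimal sketch, using Python's built-in ISO 8859 codecs, decodes one byte under two different parts.

    # The same byte decodes to different characters under different ISO 8859 parts.
    b = bytes([0xA3])
    print(b.decode("iso8859-1"))   # '£' in Latin-1 (Western Europe)
    print(b.decode("iso8859-2"))   # 'Ł' in Latin-2 (Central Europe)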

Universal Character Set

The Universal Character Set (UCS) is defined by ISO 10646, whose working group assumed responsibility for character encoding when the ISO 8859 working group was disbanded.  UCS has over 1.1 million code points available for use, currently assigns nearly one hundred thousand letters, numbers, symbols, ideograms, and logograms, and continues to evolve as more characters are added. 

Code Points

A code point is a numerical value that is part of a code space.  For example, ASCII comprises 128 code points from 0x00 to 0x7F.  A code space consists of one or more code pages.

Code points are assigned to abstract characters, which are units of textual data rather than particular glyphs.  The term code point distinguishes the numerical value from its encoding as a sequence of bits, and the abstract character from any particular graphical representation. 
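
A brief Python sketch makes the distinction concrete: one code point, several possible encodings.

    # One code point, several encodings (a minimal sketch).
    ch = 'é'
    print(hex(ord(ch)))           # 0xe9        - the code point assigned to the abstract character
    print(ch.encode("latin-1"))   # b'\xe9'     - one encoding of that code point
    print(ch.encode("utf-8"))     # b'\xc3\xa9' - a different encoding of the same code point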

Basic Multilingual Plane

The first set of 65,536 code points from 0x0000 to 0xFFFF covers most of the common languages and is called the Basic Multilingual Plane (BMP). 

The simplest encoding form that permits a binary representation of every code point in the BMP is called UCS-2.  UCS-2 is a fixed-width format where the 2 stands for the number of bytes.
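
For code points inside the BMP, UCS-2 and UTF-16 coincide, so Python's utf-16-be codec can serve as a sketch of this fixed two-byte format.

    # Every BMP character occupies exactly two bytes in UCS-2.
    for ch in "A€中":                      # all three characters lie inside the BMP
        print(f"U+{ord(ch):04X} -> {ch.encode('utf-16-be').hex()}")
    # U+0041 -> 0041, U+20AC -> 20ac, U+4E2D -> 4e2d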

UCS-4

In 2000, and again in 2005, the People's Republic of China required that all computer systems intended for sale in the PRC support the GB18030 code page (Chinese National Standard GB18030-2005, "Information technology - Chinese coded character set").  This standard supports both simplified and traditional Chinese characters.  Some of its code points lie outside the BMP, which means that software can no longer work only with 16-bit fixed-width encodings (UCS-2): it must either process data in a variable-width format or move to a larger fixed-width format such as UCS-4. 

UCS-4, which uses four bytes for every character, encodes all characters in a simple, fixed-width form.  Because UCS-4 is a fixed-width, four-byte format, it is not backwardly compatible with ASCII.  Moreover, since non-BMP characters are extremely rare in most texts, UCS-4 wastes a great deal of space. 
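
The space cost is easy to demonstrate with Python's built-in utf-32-be codec, which is equivalent to UCS-4 for this purpose (the gb18030 codec is included only to show that the Chinese standard is itself variable-width).

    # UCS-4/UTF-32 spends four bytes on every character, even plain ASCII.
    text = "Hello"
    print(len(text.encode("ascii")))        # 5 bytes
    print(len(text.encode("utf-32-be")))    # 20 bytes - four per character
    print(len("中".encode("gb18030")))      # 2 bytes - GB18030 is a variable-width format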


Unicode

Unicode is a standard developed in parallel with ISO 10646 and originating in the US.  Unicode and ISO 10646 have an identical repertoire and identical numbering: the same characters with the same numbers exist in both standards.  Unicode differs from ISO 10646 in that Unicode adds rules that are outside the scope of ISO 10646, including rules for collation, normalization forms, and a bidirectional algorithm for scripts like Hebrew and Arabic. 

Unicode consists of 17 planes of 65,536 code points each, for a total of 1,114,112 code points.  Unicode has various encodings depending upon the amount of storage available.  The standard 16-bit encoding accommodates every character in the BMP in a single code unit. 

Mapping tables from the ISO 8859 character sets to Unicode are published by the Unicode Consortium.

Unicode simplifies many problems with internationalization.  It is supported by most modern software platforms (Java, XML, modern operating systems).

UTF-8

The eight-bit Unicode Transformation Format (UTF-8) is a variable-width encoding for Unicode with the special property of being backwardly compatible with ASCII.  It is the dominant character encoding for files, e-mail, web pages, and software.  UTF-8 represents every Unicode character in 1 to 4 bytes; a prefix code in the first byte indicates how many bytes the sequence occupies. 

For text dominated by ASCII characters, UTF-8 is more compact than either 16-bit or 32-bit encodings.  The first 128 characters need 1 byte, the next 1,920 characters need 2 bytes, the rest of the BMP needs 3 bytes, and the supplementary planes of Unicode, which include less common Chinese, Japanese, and Korean (CJK) ideographs and various historical scripts, need 4 bytes. 
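
These byte counts are easy to verify in Python; in the minimal sketch below, U+10348 is an arbitrary character chosen only because it lies outside the BMP.

    # UTF-8 uses 1 to 4 bytes per character, depending on the code point.
    for ch in ["A",      # U+0041  ASCII                 -> 1 byte
               "é",      # U+00E9  next 1,920 characters -> 2 bytes
               "€",      # U+20AC  rest of the BMP       -> 3 bytes
               "𐍈"]:     # U+10348 outside the BMP       -> 4 bytes
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)  {encoded.hex()}")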

UTF-16

The sixteen-bit Unicode Transformation Format (UTF-16) is a variable-width encoding for Unicode that is capable of encoding the entire 17-plane repertoire.  For characters in the BMP, the encoding is a single 16-bit code unit equal to the code point.  For characters outside the BMP, UTF-16 combines two code units drawn from a range of the BMP reserved for this purpose (a surrogate pair). 
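
A minimal Python sketch of both cases, again using U+10348 as an arbitrary non-BMP character:

    # BMP characters need one 16-bit code unit; non-BMP characters need a surrogate pair.
    print("€".encode("utf-16-be").hex())    # 20ac     - a single code unit
    print("𐍈".encode("utf-16-be").hex())    # d800df48 - a surrogate pair for U+10348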


Windows

Windows has its own character sets, which differ slightly from the ISO 8859/10646 character sets.

Number | Group Name
1250 | Central Europe
1251 | Cyrillic
1252 | Western Languages
1253 | Greek
1254 | Turkish
1255 | Hebrew
1256 | Arabic
1257 | Baltic
1258 | Vietnamese
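
One visible difference, sketched below with Python's built-in codecs: Windows-1252 assigns printable characters to byte values that ISO 8859-1 reserves for control codes.

    # Byte 0x80 is the euro sign in Windows-1252 but a control code in ISO 8859-1.
    b = bytes([0x80])
    print(b.decode("cp1252"))             # '€'
    print(repr(b.decode("iso8859-1")))    # '\x80' - an invisible C1 control character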



Fonts

A character set connects an abstract description of a character with a numeric code.  We specify a font to connect the numeric code with a glyph. 

A wide variety of fonts have been created since Johannes Gutenberg first assembled the printing press around 1440. 

There can be any number of fonts associated with a character set, allowing for different shapes and styles of symbols.  Different languages have fonts with different shapes and requirements. 

Glyphs

A glyph is an element of writing: a mark that contributes meaning to what is written.  Glyphs include not only the various representations of characters, but also the diacritics that differentiate those characters from similar ones.  While the dot above an 'i' is not a glyph on its own, the cedilla in French and the ogonek and stroke (as in 'ł') in Polish are. 

A sequence of two characters is sometimes represented by a single glyph: for example, the ligatures 'æ' (from 'ae') and 'œ' (from 'oe'), or some Roman numerals. 

Typography

In typography, a glyph is an element of a typeface: a single graphical unit.  In both typography and computing, the range of glyphs is broader than in written languages.

Categories

Font categories include the following:

  • fixed-pitch (monospace) - each letter has the same width
  • variable-pitch (proportional) - each letter has its own width
  • serif - with small decorations to aid the eye
  • sans serif - without small decorations
  • cursive - decorative, hard to read



Serif Fonts (source: Stannered Wikipedia 2007 CC-BY-SA)



Proportional and Monospace Fonts (source: BANZ11 Wikipedia 2007 PD)

Readability

  • variable-pitch fonts are easier to read than fixed-pitch
  • serif fonts are better for the body text of a document - they reduce eye strain and improve reading speed - but only work well on high-resolution displays
  • words in lower case are easier to read than in UPPER CASE
  • some text - such as flight numbers and transit routes - is easier to read in capitals than in lower case
  • cursive - decorative fonts are difficult to read

Some Recommendations

  • provide enough space between lines for different ascenders and descenders
  • do not use ornate fonts that will obscure accents used by some languages
  • choose fonts that support all of the required accents
  • do not assume that any particular font will be available on the target platform
  • many non-Latin languages require proportional spacing

Exercises



