This paper derives from the time that Berkeley Associates sued IBM for
$300 million. Because it involved the ALT key on PCs, which is fairly
similar to an Escape key, I was scheduled to be IBM's star witness.
A question arose on how to explain the dispute to a jury that, at least
at that time, might be unfamiliar with computers. I put the following
explanations down to help do this.
NOTE: This story is not finished yet.
THEORY OF CODE SETS (for people not in that business)
Codes are things that stand for other things. Remember the old gag
about the comedian's club where everyone laughed when one of them said
"46", because they all knew what joke number 46 was in the comedian's
jokebook? To children the number 25 may stand for the letter "Y", it
being the 25th letter in the Roman alphabet. A picture of a heart may
stand for love, a picture of a dagger for hate. That's the basis of the
old rebus puzzles.
Codes are used for many reasons -- secrecy, ability to represent things
in other ways, and (opposite to secrecy) ensuring that everyone
understands some things in the same way.
Token is another name, perhaps more general, for a code. Tokens also
stand for items, actions, anything. Macintosh users sometimes call a
picture token an icon, which is really a Russian religious painting.
Webster's Dictionary says that a white flag is a token of surrender (an
action).
Computers usually handle data in fixed sizes. One hears the term
"byte" for a group of (so many) bits. For 8 bits it is better to
call it an "octet". Each byte can hold a coded
character. So what is the difference between a coded character
representation and a token? A token may be made up of several coded
representations (character bytes). Like Chinese ideographs. For
example, putting three characters (two characters for "woman" under one
character for "roof") gives an ideograph for "trouble".
Sets are groups of things with certain properties in common. Such as
paper money of the United States. We have bills for 1, 5, 10, 20, 50,
100, 500 (etc.) dollars. That's the "paper money set". Another set is
the set of $1 money; it has two members, the dollar bill and the dollar
coin. Note that the number of members in a set is not the number of
examples of any one that you have. With five one-dollar bills and four
one-dollar coins I have nine dollars, but the set still has only two
members.
The more members a set has, the easier it is to find subsets of the
members. If we only had bills for one, fifty, and a thousand dollars,
paying a debt of $45 would take 45 copies of the dollar bill,
as opposed to two twenties and a five. So when variety is needed,
larger sets (those with more members) are easier to use.
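
For readers who like to check such things, here is a minimal Python sketch
of the $45 example. The function name fewest_bills and the two lists are
made up for this illustration; the second list is just the lower end of the
paper money set described above.

```python
def fewest_bills(amount, denominations):
    """Smallest number of bills from `denominations` that add up to `amount`."""
    unreachable = float("inf")
    # best[t] = fewest bills needed to pay exactly t dollars
    best = [0] + [unreachable] * amount
    for total in range(1, amount + 1):
        for bill in denominations:
            if bill <= total and best[total - bill] + 1 < best[total]:
                best[total] = best[total - bill] + 1
    return best[amount]


small_set = [1, 50, 1000]             # the three-member set from the text
larger_set = [1, 5, 10, 20, 50, 100]  # part of the real paper money set

print(fewest_bills(45, small_set))    # 45 (all one-dollar bills)
print(fewest_bills(45, larger_set))   # 3  (two twenties and a five)
```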
Other examples of sets include alphabets (Roman, Cyrillic, Hebrew,
etc.), the Arabic digits 0 through 9, punctuation marks, dominoes, suits
in a deck of cards, Mah Jongg tiles, Monopoly houses and hotels and
properties.
To use computers, and typewriters, we must have a set of keys/codes that
includes letters of the alphabet, digits, punctuation, and (not least) a
space. Children's typewriters often have only capital letters. Adults'
typewriters have capitals and lower case (not separately on the keyboard,
but achieved by means of a shift key). The old Linotype for newspapers had
even larger sets, with italic and bold letters, and even letters of
different sizes and design.
The size of a set (its number of members) plays a great part in the
flexibility of using it. In the old Linotype, with a larger set, we
used italic letters for emphasis; on a typewriter we "make do" with the
smaller set by underlining the regular letters.
EXTENSION and EXPANSION of SETS
When one "makes do" with an existing set to create more combinations,
that is called "SET EXTENSION". In the case of coded sets, it is called
"CODE EXTENSION". When the existing set is felt to be totally
inadequate, new members may be added. This is called "SET EXPANSION",
or "CODE EXPANSION". For example, the government may promote the two
dollar bill, and add a new bill for two hundred dollars. The paper
money set is expanded.
For another example of this distinction, imagine a set consisting of
thirteen cards -- Ace, 2, 3, 4 ... 10, J, Q, K. Each is marked with a
number or letter, and each has a number of black circles on it to match
the count. Not the pips that you normally associate with cards, but
just black circles. Suppose we have many of these sets of thirteen cards.
How can we play bridge or poker?
We will have to "make do", by "extending the set". This can be done in
several ways (but not by using different colors on the backs, which
would be a giveaway to the other players). One way would be to write a
big "S" for "spades" on one set, and "H", "D", and "C" on another three
sets. Then put all four sets together. Another way would be to mark
the upper left corner for spades, upper right for hearts, lower left ...
etc. Still other methods could be devised.
But that 13-card set is awkward, which is why we use the current set of
52 cards, where the pips have colors and shapes for uniqueness. Note
that here we doubled the set size twice. Could it be done by expansion
that doubles the set size only once? Yes. Have a set of 26 cards, 13
with black circles and 13 with red circles. Now we need just two sets of
those cards, and in each case we need to distinguish only between
spades-clubs (for the black circles) and hearts-diamonds (for the red).
SETS FOR COMPUTERS
Most everyone knows or has heard that computers work by recognizing 1's
and 0's, as usually represented by ON-OFF, punch/no-punch in a hole
position, or some other means having only two states. No pictures, no
colors, etc., enter into the encoding of information (although the
reverse is true -- the bits (2-state items) can create colors and
shapes, as in video games and movies).
Years ago, 6 bits were used to create the character set. Nowadays 8
are common. But those bits are NEVER arranged in anything other than
what is effectively a straight line. When sent by telephone line or
satellite they go either a) one after the other in "serial"
transmission, or else b) side-by-side on multiple lines in
"parallel" transmission. In other words, each bit travels either in its
own position in time (the second bit sent, say) or on its own channel (the
fourth parallel wire, say). In army terms, by file, or by rank and file.
Not like dominoes. In dominoes you can turn around a 2-4 tile and get
a 4-2 tile. In computers the 11001111 is different from 11110011. So
the number of members in any set defined by two states (like ON-OFF) is
2 times 2 times 2 ... as many times as you have bits. 6 bits give 64
members in the set; 8 bits give 256 members.
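
The same arithmetic as a small Python sketch (set_size is just a name chosen
for this illustration). It also shows that the two bit patterns mentioned
above really are different members of the 8-bit set.

```python
def set_size(bits):
    """Number of distinct members a code of `bits` two-state positions can have."""
    return 2 ** bits


print(set_size(6))   # 64
print(set_size(8))   # 256

# Unlike a domino, a bit pattern cannot be turned around: position matters.
a = int("11001111", 2)
b = int("11110011", 2)
print(a, b, a == b)  # 207 243 False
```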
Now for some surprises about set extension and expansion. Computers
and communications were not always so interconnected. Once we had
TeleTypes and TELEXes that were not themselves computers, but used sets
of codes in the same way. But until about 1960 they used codes of 5
bits, coded in 5 tracks of punched round holes in paper tape. By our
previous formula, how many members to the set? 32. But wait, how did
they send 26 letters and 10 digits? That's 36 (not to mention the space
and a few others), which is more than 32. It was done by "code
extension", using the fact that hole combinations can be assigned to
represent "control" as well as "text" characters.
For plain typewriters, ask "if this key is depressed will it print
something?". If it does, it is a text character. Not so the "backspace"
key, which is a control. We used it for underlining, which extends the
set. And for overstriking, which also extends the set. Nor does the
shift key itself print. It just changes lower case to upper (capitals),
digits to punctuation, etc.
The shift key is quite special. It can serve to affect only the next
key you type, or it can remain shifted, affecting all the keys until it
is released. That is why there are two keys, SHIFT and CAPS LOCK. The
first is "nonlocking" and the second is "locking". Remember this for
later.
Teletypewriters also use a backspace key to back up both print element
and paper tape to hit the "delete" character, which does an "editing"
function by overstriking any printed character with a black rectangle,
and also by punching holes in all tracks. Then, no matter what character
used to be there, it is now just the single delete character, which the
reader on the other end, driving the receiving teletypewriter, just
ignores. OK, our set really has 29 members, not 32, because Delete,
BackSpace, and Shift can't count.
If one code is assigned to shift the ribbon color from black to red,
the control set increases, and the text set decreases to 28. But in
small sets, a useful balance must be struck. The lesson is: if
controls are available, each may be used to "extend" the set in some
way. Such a color code essentially doubles the set size, for now we have
a black "A" and red "A".
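
A rough tally of this bookkeeping, as a Python sketch. The helper
text_capacity is hypothetical, and it assumes (simplistically) that a
two-state control such as shift or ribbon color doubles every non-control
code.

```python
def text_capacity(bits, controls, states=1):
    """Text characters available: non-control codes times the number of states."""
    return (2 ** bits - controls) * states


print(text_capacity(5, 3))      # 29: Delete, BackSpace and Shift set aside
print(text_capacity(5, 3, 2))   # 58: shift doubles it, enough for 26 + 10
print(text_capacity(5, 4))      # 28: one more code given up for ribbon color
print(text_capacity(5, 4, 2))   # 56: but now each comes in black and red
```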
PICTURES ARE EASIER TO UNDERSTAND
Here is a made-up paper tape code on 5 tracks, as we might have adapted
today's internal code of personal computers.
<------ direction of paper tape movement
/-------------------------------------------------------------------/
/                               o o o o o o o o o o o o o o  o  o   /  16
/               o o o o o o o o                 o o o o o o  o  o   /   8
/       o o o o         o o o o         o o o o         o o  o  o   /   4
/   o o     o o     o o     o o     o o     o o     o o      o  o   /   2
/ o   o   o   o   o   o   o   o   o   o   o   o   o   o   o     o   /   1
/-------------------------------------------------------------------/
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ; _ BS SH DEL
  " # $ % & ? ( ) * + , - . / ! 0 1 2 3 4 5 6 7 8 9 : < > BS SH DEL
direction of reading the paper tape ------>
See those powers of 2 at the right? Give each punched hole the shown
value, and add those values -- e.g., "G" has a value 7 (1+2+4), so it's
the 7th letter in the alphabet -- the basis for this sample code. Suppose
we want to send:
/-----------------------------------------------------/
/ o o o /
/ o o o o o o o o /
/ o o o o o /
/ o o o o o /
/ o o o o o o o o o o o o o o /
/-----------------------------------------------------/
H A V E A N I C E D A Y , M A Y 3 1
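
The arithmetic behind the holes can be sketched in Python. The names
LETTERS, WEIGHTS, value and render are invented for this illustration, only
the unshifted letters are handled, and treating blank tape (no holes) as the
space is my assumption; the made-up code above does not say how a space is
punched.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ;_"   # values 1..28 in the sample code
WEIGHTS = (16, 8, 4, 2, 1)                 # one weight per track, top to bottom


def value(ch):
    """Sum of the hole weights for one unshifted character."""
    if ch == " ":
        return 0                           # assumption: blank tape means space
    return LETTERS.index(ch) + 1


def render(values):
    """Draw the tape: one text row per track, one column per character."""
    rows = []
    for weight in WEIGHTS:
        rows.append(" ".join("o" if v & weight else " " for v in values))
    return "\n".join(rows)


codes = [value(ch) for ch in "HAVE A NICE DAY"]
print(value("G"))   # 7, that is 4 + 2 + 1
print(codes)        # [8, 1, 22, 5, 0, 1, 0, 14, 9, 3, 5, 0, 4, 1, 25]
print(render(codes))
```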
Assume the SHIFT key is LOCKING. First we have to shift out to get the
comma. Then the shift back into the alphabet. Finally shift out again
for the two digits:
/----------------------------------------------------------/
/ o o o o o o /
/ o o o o o o o o o o o /
/ o o o o o o o o /
/ o o o o o o o o /
/ o o o o o o o o o o o o o o /
/----------------------------------------------------------/
H A V E A N I C E D A Y , M A Y 3 1
: : :
SHIFT SHIFT SHIFT
OUT IN OUT
It took three codes here to get the comma. You can see why most early
teletypewriters did not use extension. Remember old style messages?
HAPPY BIRTHDAY STOP SEND MONEY STOP -- (no shifting necessary)
Now assume the SHIFT key is NON-LOCKING. We shift for the comma and
two digits:
/--------------------------------------------------------/
/ o o o o o o /
/ o o o o o o o o o o o /
/ o o o o o o o o /
/ o o o o o o o o /
/ o o o o o o o o o o o o o o /
/--------------------------------------------------------/
H A V E A N I C E D A Y , M A Y 3 1
: : :
SHIFT SHIFT SHIFT
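
Both ways of shifting can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the tables follow the sample code
above, a single SH code (value 30) toggles the locking shift in both
directions, and blank tape (value 0) is taken to be a space in either shift
state. The function names are mine.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ;_"    # unshifted meanings of values 1..28
SHIFTED = '"#$%&?()*+,-./!0123456789:<>'    # shifted meanings of the same values
SH = 30                                     # the shift code of the sample table


def lookup(ch):
    """Return (value, needs_shift); needs_shift is None for a space."""
    if ch == " ":
        return 0, None                      # assumption: blank tape is a space
    if ch in LETTERS:
        return LETTERS.index(ch) + 1, False
    return SHIFTED.index(ch) + 1, True


def encode_locking(text):
    """SH flips the state, which then stays put until the next SH."""
    out, shifted = [], False
    for ch in text:
        val, needs = lookup(ch)
        if needs is not None and needs != shifted:
            out.append(SH)
            shifted = needs
        out.append(val)
    return out


def encode_nonlocking(text):
    """SH affects only the single character that follows it."""
    out = []
    for ch in text:
        val, needs = lookup(ch)
        if needs:
            out.append(SH)
        out.append(val)
    return out


msg = "HAVE A NICE DAY, MAY 31"
print(encode_locking(msg).count(SH), encode_nonlocking(msg).count(SH))          # 3 3
print(encode_locking("12345").count(SH), encode_nonlocking("12345").count(SH))  # 1 5
```

For this particular message the two modes cost the same three shift codes.
A long run of digits favors the locking shift; isolated punctuation favors
the non-locking one.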
All of this had to do with Code EXTENSION. Now let's do it with Code
EXPANSION. We redesign the tape reader and its logic so as to have SIX
tracks, not FIVE. Now we follow the NON-LOCKING mode, but instead of
preceding those three characters by the SHIFT, we indicate that shifted
quality by a hole in the 6th track. That looks like this:
/-------------------------------------------------/
/ o o o / <--- the shift
/ o o o / track
/ o o o o o o o o /
/ o o o o o /
/ o o o o o /
/ o o o o o o o o o o o o o o /
/-------------------------------------------------/
H A V E A N I C E D A Y , M A Y 3 1
Now we shall have to rethink seriously. The 5-track code had 29
combinations devoted to TEXT characters, and 3 (Shift, Backspace, and
Delete) devoted to CONTROL characters. So each row (character) was
either TEXT or CONTROL.
But in our 6-track example the 6-bit characters are split into two parts
-- 5 tracks for the TEXT part and 1 track for the CONTROL part. The
five tracks were punched according to what text key was down, and the
sixth track was punched whenever the shift key was down. And when read
at the other end, the sixth track was processed ahead of time to actuate
the shift key before the real text character was printed.
The shift key code in the 5-track examples is called a "precedence"
code. It is a signal to "treat the next character in another way".
ASCII (American Standard Code for Information Interchange), the code of
Personal Computers (and more), has lots of these precedence codes in it
-- ESCape, CANcel, four Device Control codes, Data Link Escape, four
information separators, etc. That is because ASCII was for a long time
a 7-bit code, with 128 combinations, and a lot of extension had to be
done. Now there are many 8-bit variants of ASCII, and it shows again
the general principle of:
"What is done by a precedence code in an extended set may be done
exactly the same way by an added bit in an expanded set."
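
A minimal Python sketch of that principle, using the sample tape code from
the pictures above. The names expand and extend are invented here; SH is the
non-locking shift code (value 30) and SHIFT_BIT stands for the hole in the
added sixth track.

```python
SH = 30          # precedence (shift) code in the 5-track extended set
SHIFT_BIT = 32   # the added sixth track in the expanded set


def expand(codes5):
    """5-track stream with non-locking SH codes -> 6-track stream."""
    out, pending = [], False
    for code in codes5:
        if code == SH:
            pending = True               # remember it for the next character
        else:
            out.append(code | SHIFT_BIT if pending else code)
            pending = False
    return out


def extend(codes6):
    """6-track stream -> 5-track stream with SH ahead of each shifted character."""
    out = []
    for code in codes6:
        if code & SHIFT_BIT:
            out.append(SH)
        out.append(code & 0b11111)       # keep only the five text tracks
    return out


five_track = [25, SH, 11, 0, 13]         # Y, shift, comma (value 11), blank, M
six_track = expand(five_track)
print(six_track)                         # [25, 43, 0, 13]
print(extend(six_track) == five_track)   # True
```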
WHAT DOES THIS MEAN?
1. Code extension has been practiced since the beginning of the Chinese
   language!
2. Devices like shift keys have been used for code extension since the
   first use of typewriters!
3. Reserved characters have been used for code extension since the first
   use of teletypewriters!
4. Characters have been split into two classes, TEXT and CONTROL, since
   at least 1957. For computers, such characters had an effect upon
   programmed branching, thus allowing alternate actions.
5. A precedence code can be mapped into the extra bits of a larger set,
   keeping the text or control meaning identical. Mapping in the reverse
   direction is also equivalent.
6. Reserved characters for precedence codes, with meaning dependent upon
   the character following them, were defined in the pre-ASCII proposals
   of 1960. Among those meanings for following characters could be:
   a) a different text character,
   b) a different control character, or
   c) to put an ENTIRELY NEW set of text and control characters
      into force.
7. The rationale for point 6 was published in 1960 by Bob Bemer [1], while
   employed by IBM. It was not patented by IBM then. Even if a patent
   application had been submitted, it would have had to refer to previous
   precedence code technology, although this was the furthest that the
   precedence code concept had been extended!
8. The first standards proposals (mid-1969) for controls for video
   terminals and keyboards enumerated these controls (as we see them
   today) in the 8-bit expanded set, but with the clear proviso that
   they could also be done with ESCape sequences in the extended
   (smaller) set.
APPENDIX
Information Separators of ASCII
Paper tape and teletypewriters have been used in our examples, but
similar methods were also used in computers. An early example is the
IBM 1401, circa 1957. It had a 6-bit internal code, but there were
actually seven bits for each character. The seventh was called the Word
Mark. When it was 1, not 0, the computer circuitry was signalled that
it was both a TEXT and a CONTROL character, and that the character was
the last one in the word being read. This self-delimiting process
permitted variable length words.
In fact, when ASCII was in the standardization process, I was quite
familiar with how this worked. So the four information separators of
ASCII were derived from the Word Mark principle, in the REVERSE
process, going from a bit in a larger set to a separate character in a
smaller set.
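
That reverse mapping can be sketched in Python. This is an illustration
only, not the 1401's actual code: for readability it uses ordinary ASCII
characters with the mark carried as an extra high bit (0x80), and it picks
the ASCII Unit Separator (US, hex 1F) as the separate character. The
function names are made up, and every word is assumed to be non-empty.

```python
MARK = 0x80   # stand-in for the Word Mark bit (assumption: an extra high bit)
US = 0x1F     # ASCII Unit Separator, one of the four information separators


def mark_words(words):
    """Larger set: the last character of each (non-empty) word carries MARK."""
    out = []
    for word in words:
        codes = [ord(ch) for ch in word]
        codes[-1] |= MARK
        out.extend(codes)
    return out


def marks_to_separators(stream):
    """Smaller set: drop the mark bit and emit a separator character instead."""
    out = []
    for code in stream:
        out.append(code & 0x7F)
        if code & MARK:
            out.append(US)
    return bytes(out)


marked = mark_words(["AB", "C"])
print([hex(code) for code in marked])   # ['0x41', '0xc2', '0xc3']
print(marks_to_separators(marked))      # b'AB\x1fC\x1f'
```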
Note on Limited Sets
This quote was found as a telegram in a novel [2]:
OUR SAFE MANUFACTURED BY EMPIRE SAFE CABINET COMPANY, MODEL G-23,
DATED 1887 STOP SCOTLAND YARD INVENTORY ITEMS IN PAMELA'S ROOM
MORNING AFTER BURGLARY ... IMMEDIATELY IF HELPFUL STOP
Why isn't it authentic?
Well, if there was no period in the telegraph code, thus forcing two uses
of "STOP" to represent the period, where did the apostrophe, hyphen, and
comma come from?
A Reminder of the 8x16 code "ASCII"
NUL DLE SP  0   @   P   `   p
SOH DC1 !   1   A   Q   a   q
STX DC2 "   2   B   R   b   r
ETX DC3 #   3   C   S   c   s
EOT DC4 $   4   D   T   d   t
ENQ NAK %   5   E   U   e   u
ACK SYN &   6   F   V   f   v
BEL ETB '   7   G   W   g   w
BS  CAN (   8   H   X   h   x
HT  EM  )   9   I   Y   i   y
LF  SUB *   :   J   Z   j   z
VT  ESC +   ;   K   [   k   {
FF  FS  ,   <   L   \   l   |
CR  GS  -   =   M   ]   m   }
SO  RS  .   >   N   ^   n   ~
SI  US  /   ?   O   _   o   DEL
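
The layout of that chart follows directly from the code values: the column
is given by the three high bits and the row by the four low bits, so the
value is 16 times the column plus the row. A short Python sketch that
rebuilds the chart from that rule (the helper names are mine):

```python
CONTROLS = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def name(code):
    """Printable label for a 7-bit ASCII code."""
    if code < 32:
        return CONTROLS[code]
    if code == 32:
        return "SP"
    if code == 127:
        return "DEL"
    return chr(code)


def chart():
    """16 rows by 8 columns; the cell at (row, col) holds code 16 * col + row."""
    lines = []
    for row in range(16):
        cells = [name(16 * col + row) for col in range(8)]
        lines.append(" ".join(f"{cell:<3}" for cell in cells).rstrip())
    return "\n".join(lines)


print(chart())
```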
REFERENCES
[1] R.W. Bemer, "ESCape - a proposal for character code compatibility",
    Commun. ACM 3, No. 2, 71-72 (1960 Feb)
[2] Elliott Roosevelt, "Murder and the First Lady", Reader's Digest
    Condensed Books, 1984 Vol. 4, p. 326
[3] Jukka Korpela, "A tutorial on character code issues". See it on the Web.
A superb paper. The best I've seen. You'll have to use your brain,
your English language training, and all your education to understand
this master teacher from Finland. But he gives all the other references
you could need.