Bob Bemer, Interviewer
It is kind of you to say that I might have had some responsibility for e-mails. I suppose that might be so, taking into account not only the character coding for interchange, but also what I did in BSI, ISO and CCITT on data transmission, open systems interconnection, and 'messaging'. But those are the foundations. What I so greatly dislike is the edifice that whizz-kids have built upon them.
So when I try to compose an email to you, in reply to yours, it suddenly disappears. I suppose some idiot has perpetuated the disaster that when one keys CAP+C expecting a capital C but by mistake hits CONTR+C it clears the screen. Or, with YAHOO, when I key "Dear Bob NEWLINE" it sends off only "Dear Bob" to you. So I now turn to my dear old computer with 15-year-old software made in the UK, which is sufficiently foolproof for me to use.
... Hugh, I get asked hundreds of times -- "Why is the letter A in the second position?" In a lengthy (yet unpublished) story about ASCII I said:
"Many have wondered why Ross should have proposed the alphabet to start there. I think it derived from the change the computing field caused to start numbering with 0 instead of 1, as we all do now. Thus hex numbering is 0-9, A-F. And if "A" is the 1st letter it should go into the 1st position, not the 0th! Specious? Perhaps."
This is why the coded alphabet starts at one. In all European alphabets A(a) was the first letter; I later found the same for Greek, Turkish and Russian. Letters are thought of as first, second, third... ninth, tenth, eleventh... More recently I learnt that all Indic alphabets say first, second, etc.; in fact all alphabets do it. There never has been a 'noneth' letter. At that time the only idea that would have occurred to anyone was to code letters as binary-one, binary-two, ... binary-nine, binary-ten, binary-eleven ...
What is much more interesting is why numerals start at zero in our codes. At that time everyone thought numerals started at one and went nine, ten, eleven. In Britain, our currency made us think one, two, ... nine, ten, eleven, twelve... nineteen, twenty. All typewriters, telephone dials, card punches, started with one and put zero after nine. Today, the keyboard given to the world by IBM does it twice, push-button telephones and mobile 'phones do it. That is how people thought, and still think, decimal numerals are arranged.
My coding standardization work started from my colleagues in Ferranti. I don't know whether it was them or the academics at Manchester who decided this feature. They were then mostly interested in technical applications, and worked with mathematicians and boffins who knew about radices. So they put zero before one, and coded zero at binary-zero.
Our coding of letters was common-sense, but for numerals required skill.
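The column arithmetic behind these two decisions can be sketched briefly (in Python, which of course postdates this work; in the ISO 646/ASCII table the 'column' and 'position' are simply the high and low four bits of each 7-bit code):

```python
# In the ISO 646 / ASCII code table, the high bits give the column and the
# low four bits the position within it. Digits sit in column 3 starting at
# position 0 (zero coded at binary-zero); letters start in column 4 at
# position 1 -- position 0 of that column holds '@', not a letter.
for ch in ("0", "9", "@", "A", "Z"):
    code = ord(ch)
    print(f"{ch!r}: 0x{code:02X}  column {code >> 4}, position {code & 0x0F}")
```

So 'A', the first letter, lands at position 1 of its column, while '0' lands at position 0 of its column, exactly the split described above.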
I also worked a lot on keyboards, and wrote some ISO standards for them. I had to follow the public inertia, and have only last week realised that I could have done much better to promote two keys for zero, one to the left of one and the other to the right of nine.
I have recently looked at the histories of Unicode that their people have written for Internet consumption. For some reason I got an uneasy suspicion. I found your name there only once. Yet I remember you vividly as a major force in the total creation and promulgation of codes of more than 8 bits.
Recently the erudite member Paul Hill of the CALENDR-L discussion group came up with the phrase " ... so again we wander about in the lunacy of the world of the self-published Internet." Examples:
- That I myself was deeply involved in the early COBOL world is undeniable. Yet the canned history that goes along with any COBOL standard is far from reflecting reality to me.
- The official history that Computer Sciences Corporation puts on the Internet would have you believe that Fletcher Jones and Roy Nutt were their only two founders. But Bob Patrick was the third founder, and by luck he had some money to carry them for a while. CSC may not know of this, but I was there!
With this in mind I adjoined the Unicode history material and made counts of who was mentioned and how often. Which, one might think, would indicate the real drivers of the work. The majors were: Lee Collins (30), Joe Becker (23), Mark Davis (16), down to you and a Peter Fenwick at one mention each. Can I believe this?
You ask about my role with Unicode. My recollection is that that work started from Peter Fenwick, who saw that a 16-bit code would be necessary. However, he also saw that it must be extensible. Just a 16-bit code would not do, primarily to permit the correct writing of Chinese ideograms. I recall his telling me of long transatlantic 'phone calls with Mark Davis (I think) to convince him. At that time I was still concentrating on 8-bit coded sets, devised to suit regional uses. But within two or three months I joined with Peter's work.
The most important element in British coding work was that, whenever one of us had a good idea, the whole team backed it up. All our work was directed towards ISO 10646, which the Unicode people later picked up -- that of course being immensely important. The only significant difference I had with Peter was that he thought software would easily accommodate 'construction' of accented letters, whereas my experience with ISO 6937, where we had done just that but it was never taken up, had taught me that there are significant advantages in specifying and coding each accented letter as a single character. I extended this to all syllabic scripts, and supported Microsoft when they demanded it for Korean Hangul.
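The difference between 'constructed' and single-character accented letters is still visible in today's Unicode; a minimal sketch using Python's standard unicodedata module:

```python
import unicodedata

# Two codings of the same letter 'é': the single precomposed character
# argued for above, and the 'constructed' form of a base letter followed
# by a combining accent.
precomposed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
constructed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == constructed)          # False: the code sequences differ
print(len(precomposed), len(constructed))  # 1 vs 2 code points
# Software must normalize before the two can be compared reliably:
print(unicodedata.normalize("NFC", constructed) == precomposed)  # True
```

That every comparing or searching program must perform such normalization is precisely the kind of software complication referred to above.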
I was the first editor of what became ISO 10646 (negotiating that number as a nostalgic reminder of ISO 646) and a few years ago noted that 75% of the definitive text of ISO 10646 (i.e. excluding the explanatory appendices) is still mine, and that excluding the Chinese, Japanese and Korean characters (which I never concerned myself with), 80% of the characters had been identified and coded by me or in active co-operation with others. I also went through the papers of SC2/WG2 at that time and, having excluded all the administrative ones, I found that about half of the total had been done by our British group, and that I had submitted twice as many as any other single author.
No more need be said. I see why you are mostly ignored by the Unicode history. My research shows that the ISO work generally preceded the Unicode movement by about two years, and at some later time the two efforts were, if not merged, at least subjected to agreement that they would not be in contradiction to each other, and that they would and could coexist without overt contrivance.
- Unicode to remain basically a few sets of 16-bit codes, mainly for PCs.
- ISO 10646 also to handle up to 32-bit characters, mainly as needed for larger equipment.
And being the prior work, your ISO 10646 was treated as old hat, with the new Unicode work exuding the glamour of reinvention.
Compare to COBOL subsuming UNIVAC's FlowMatic and IBM's (my) Commercial Translator. The Picture Clause I invented for COMTRAN was taken into COBOL as an "old-hat" device, with no specific mention as to source. Had it been "newly" derived by the COBOL Committee, I suspect that the inventor would have been named in their history.
My aim was to establish the characters needed for the correct spelling of each script and language -- no less (or the language could not be correctly written) and no more (so code-space and presentation systems were not overloaded). It was important to take into account only present-day usage (for that is what most users want). For some languages this was difficult, and I hoped that ISO 10646 would become a useful reference source for other users.
You might like me to expand on this idea of the characters needed for correct spelling. The topic may be of interest because it has never previously been explained, nor has it been told how we did the work. The concept and activity started with ISO 6937 where we were tackling the characters needed for correct spelling of European languages. Loek Zeckendorf of Holland was a primary contributor, and we were unofficially taking advantage of work done by bibliographers in the great libraries of France, Germany and Britain, who also shared this concept about correct spelling. I extended this to the accented letters of Greek, at that time the most difficult script we had tackled. For ISO 10646 we could subsequently take advantage of all that.
For ISO 10646 John Clews, a valued member of the British team, established a more formal dialogue with the bibliographic experts, and did much to break down a very strange antipathy between character coders for computers and for bibliography.
Apart from using all that in an orderly way in ISO 10646, my interest extended to many other languages and scripts. I collected data about the characters used for their writing from a great variety of sources, often being helped by Peter Fenwick. I cannot of course now remember them all, but the details will be in my voluminous working papers that still exist in my office. I was especially appreciative of the help given by experts in SOAS, London University, over my special interests in writing African, Indic, and Southeast Asian languages.
Because there is ever so much junk out there written by eccentric authors, it was of vital importance to assess for each individual source its authenticity or validity. As an example, for correctly spelled British English I scanned the full Oxford English Dictionary (in 12 volumes!), together with Fowler's 'Modern English Usage', for foreign words now assimilated into English. This showed that -- whereas American English uses just 2 x 26 letters (a thru z, and capitals) -- British English uses a further 32 special or accented letters. Fortunately, all those were already in ISO 10646, being used by one or another of the European languages.
It was necessary to discern whether a character was really 'new' (i.e. not already in ISO 10646), to avoid duplication. For example, Vietnamese uses many of the European accented vowels, but adds many others of its own.
Another tricky matter was to determine whether a character was 'new' or was merely a graphical variant. This was made more difficult because many sources for minority languages show badly printed or even hand-written glyphs.
This work also extended to the symbols used for the correct writing of the various languages. For example, British English uses four different lengths of a horizontal line: hyphen, en-dash, minus, em-dash. French and Greek use a longer dash for quotations. And the correct writing of quotation marks is a veritable trap for the unwary.
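For reference, each of those distinct strokes was eventually given its own code; a quick check against the character names in today's Unicode/ISO 10646 (U+2015 HORIZONTAL BAR being the character commonly used for the quotation dash):

```python
import unicodedata

# The four lengths of horizontal line used in British English, plus the
# longer quotation dash, with their code points and official names.
for ch in "\u2010\u2013\u2212\u2014\u2015":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Coding each of these separately, rather than lumping them onto the typewriter's single hyphen-minus, is an instance of the 'no less and no more' principle described above.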
We have now established a unique body of new work that you prepared and contributed to Unicode by way of the ISO 10646 work preceding Unicode. Could any of your work be ascribed uniquely to the Unicode project?
I have never contributed anything to the text of the Unicode standard document itself, which I greatly admire for its informative content, it being much more than I could do.
One thing that does still rather please me is the concept of 'levels' in ISO 10646, which was devised independently and simultaneously by an IBM man and myself; I then worked it up in detail and got it into ISO 10646. It uses your concept of Escape sequences. My experience with ISO 646, where we allowed 'alternative' characters to be allocated to specific code values (an idea got from dear old telex ITA2), showed it to be calamitous for reliable interchange -- which is after all the prime purpose of making character code standards. ISO 10646 and hence Unicode got rid of that.
However, it has always seemed to me that it is merely the other side of the same coin to permit alternative coding of any character as used by human beings. Inevitably it leads to complications in software, and for some scripts these can be quite severe.
Level 2 of ISO 10646 rules out those troubles and gives truly non-ambiguous coding. That levels feature is omitted from Unicode, and there does not yet seem to be recognition of the importance of avoiding what seems to me a calamitous path.
Having got somewhere with each of the character coding standards, I've always been active in promoting them by publications. For the ISO 646 (et al) work I published:
- 'Considerations in choosing a character code ...', Comp J 1961, p 202
- 'The I.S.O. Character Code', Comp J 1964, p 19
- 'The British Standard Data Code and how to exploit it', Comp J 1970, p 223
- 'Character Sets and Coding', NCC Guide No 9, 1981, ISBN 0850123089 (also covering ISO 6937)
When ISO 10646 became established, I and a colleague ran a bi-monthly journal, 'Universe of Characters'. It ran for three years, and contained much technical data and information about writing systems, scripts and character sets that is not available elsewhere.
It is rather strange that, although character sets and their coding are of such fundamental importance for computers, for hardware (your 8-bit codes), for software and for interchange, they have always been a minority interest. One thing that pleases me is that Michael Everson is still active, for he has all the right ideas.
P.S. I think in one respect you may have a greater disappointment with your coding work than I ever did. You devised that concept of Escape sequences. But it seems your American colleagues must have let you down, for whenever anyone later proposed a character set under the Registration Scheme the Americans invariably voted against it. It put the rest of us in all other countries in the position of having to disregard the Americans, and it was quite difficult for us to reach that attitude of mind.