Chen-Ho Encoding and Densely Packed Decimal (original) (raw)
[Next] [Up/Previous]
In addition to speed, another objection to the use of decimal arithmetic on computers has been storage efficiency; this can be addressed through using Chen-Ho encoding, which we will now discuss.
One form of Chen-Ho encoding allows three decimal digits to be represented in ten binary bits, which may have up to 1,024 possible different values, and which therefore can encode the 1,000 possibilities for three digits with only a little waste.
Of course, one can do this simply by converting a three-digit number to binary; but Chen-Ho encoding avoids the need for performing multiplication or even addition.
Most of the time, a decimal digit can fit into three bits, when its value is from 0 through 7. Only the two digits 8 and 9 would not fit. And three times three is nine, which is one less than ten. Could there be a way to use that extra bit to indicate that one is handling a three-digit number containing some 8s and 9s as a special case?
Let us take a three-digit number and represent its digits either as aaa, bbb, and ccc respectively, if they are from 0 through 7, or as A, B, or C respectively, if they are either 8 and 9, according to the following code:
aaa/bbb/ccc A/B/C
0 000 1 001 2 010 3 011 4 100 5 101 6 110 7 111 8 0 9 1
then, the following chart indicates a possible form of Chen-Ho encoding that would successfully encode all the possibilities for three digits in ten bits:
0aaabbbccc (3 digits with 3 bits - 1 way) 100aaabbbC (2 digits with 3 bits, 1 digit with 1 bit - 3 ways) 101aaaBccc 110Abbbccc 11100aaaBC (1 digit with 3 bits, 2 digits with 1 bit - 3 ways) 11101AbbbC 11110ABccc 1111100ABC (3 digits with 1 bit - 1 way)
with 24 unused combinations remaining.
This method of encoding was patented by IBM, in U.S. patent 3,842,414. This patent was filed on June 18, 1973, and granted on October 15, 1974. Also, it was described in a paper in the January 1975 issue of_Communications of the Association for Computing Machinery_. (I thought I first encountered this method in a paper in the IBM Systems Journal, but my memory must be playing tricks on me.)
It was also in 1973 that a 7070/7074 emulation feature for the 370/165 and 168 used a form of Chen-Ho encoding to convert decimal addresses to binary addresses, thus minimizing wasted storage.
The form shown preserves the order of digits, so as to be easier to understand. In practice, to cut down on the circuitry required by minimizing the changes in how a bit is used, a scheme like this might be used instead:
0aaabbbccc 100Abbbccc 101Baaaccc 110Caaabbb 11100ABccc 11101ACbbb 11110BCaaa 1111100ABC
In fact, the actual coding used by IBM simplifies the required circuitry further, at the cost of making the format somewhat more difficult to understand. The format given above is similar, although not identical, to that described by Dr. T. C. Chen in his 1971 memo first describing the basic technique.
IBM has since proposed a modification of Chen-Ho encoding which allows truncating the leading three bits from the 10-bit code to make a 7-bit code for two digits, and truncating the leading six bits from the 10-bit code to make a 4-bit code for one digit, to allow taking an arbitrary number of the least-significant digits from an encoded number without re-encoding. This scheme is called Densely Packed Decimal, and it will be more fully explained below.
Here is a table showing IBM's official formats for Densely Packed Decimal and Chen-Ho encoding:
BCD digits Chen-Ho encoding Densely Packed Decimal encoding
0abc 0pqr 0uvw 0abpquvcrw abcpqr0uvw 0abc 0pqr 100W 110pqabcrW abcpqr100W 0abc 100R 0uvw 101abuvcRw abcuvR101w 100C 0pqr 0uvw 100pquvCrw uvCpqr110w 0abc 100R 100W 11110abcRW abc10R111W 100C 0pqr 100W 11101pqCrW pqC01r111W 100C 100R 0uvw 11100uvCRw uvC00R111w 100C 100R 100W 1111100CRW 00C11R111W
0pqr 0uvw 0pqruvw pqr0uvw
0pqr 100W 111rpqW pqr100W
100R 0uvw 100Ruvw uvR101w
100R 100W 110R00W 10R111W
Densely Packed Decimal
As noted above, a modification of Chen-Ho encoding has been produced at IBM which is regarded as an improvement; this improved coding is the subject of U.S. patent 6,525,579 which is currently in force. Its inventor, Michael Cowlishaw, an IBM Fellow, is responsible for aninteresting web page with quite a bit of information about decimal arithmetic.
To explain the principle behind Densely Packed Decimal encoding, a different formulation than the official one shown in the table above, but which is equivalent in terms of the desired properties, is used for illustrative purposes.
Single digits can, of course, be represented in binary using BCD encoding:
0 0000 5 0101 1 0001 6 0110 2 0010 7 0111 3 0011 8 1000 4 0100 9 1001
The original paper describing Chen-Ho encoding noted that the same principle can be used to encode two digits in 100 of the 128 possible combinations of seven bits:
Once again, let ooo or ppp represent a digit from 0 to 7 represented by the following code:
000 0 001 1 010 2 011 3 100 4 101 5 110 6 111 7
and let d or e represent a digit from 8 to 9 in the following code:
0 8 1 9
Then, two digits can be represented in seven bits (in Chen-Ho encoding, not in the Densely Packed Decimal coding yet to be described) as:
0oooppp 100dppp 101oooe 11000de
In the Densely Packed Decimal coding, the format of the Chen-Ho codes are modified so that the prefix bits indicating the coding are no longer located at the front. In this way, the last four bits of the ten-bit coding for a number of the form 00x are the BCD coding of the digit x, with all preceding bits being zero, and the last seven bits of the ten-bit coding for a number of the form 0xy are a seven-bit coding (also modified from the seven-bit Chen-Ho encoding shown above to permit nesting within it) for the digits xy, with the preceding three bits equal to zero.
Given a three digit number, whose digits are ooo/d, ppp/e, and qqq/f, a coding of the Densely Packed Decimal type (not precisely the one specified by IBM, as noted above) can be arranged as follows:
oooppp0qqq oooppp100f oooqqq101e pppqqq110d ooo00e111f ppp01d111f qqq10d111e 00d11e111f
Note that despite this coding expanding gradually to the left, it still does not preserve the collating sequence of the decimal digits it encodes.
The advantage of this coding is that decimal numbers which do not fill the full length of a field can always be copied to a shorter field in which they would fit by truncation while still in coded form, without requiring reconversion.
Thus, two digits, represented by ppp/e and qqq/f, are encoded to a seven bit sequence using the following coding:
ppp0qqq ppp100f qqq101e 00e111f
which is included in the 10-bit coding above on the right side, and the following coding for a single bit, included in the two codings above,
0qqq 100f
produces the normal BCD code.
Of course, this applies to positive numbers only, so it assumes that negative decimal numbers are always indicated using sign-magnitude format, never by ten's complement notation.
Modified Densely Packed Decimal
Can the Densely Packed Decimal encoding be modified so as to be applicable to decimal numbers in a complement notation? The same encoding should work for both nine's complement and ten's complement decimal numbers, so the design will be derived as if I had nine's complement numbers in mind, although in practice it would be ten's complement decimal numbers which would be the application of such a coding.
Given the type of symmetry that would be needed by such a modified encoding, it seems reasonable to encode a single four-bit digit at the beginning of an encoded number, when the remainder of the number is entirely composed of complete groups of three digits, as follows:
0 0000 5 1011 1 0001 6 1100 2 0010 7 1101 3 0011 8 1110 4 0100 9 1111
To allow truncation of short negative numbers as well as short positive numbers, it is intended that, where the digit x encodes to the four bits yyyy using the code above, numbers of the form 0x shall encode to 000yyyy in binary, and numbers of the form 9x shall encode to 111yyyy in binary, when two digits are encoded to seven bits, and, similarly, when three digits are encoded to ten bits, numbers of the form 0xx shall encode to 000zzzzzzz in binary and numbers of the form 9xx shall encode to 111zzzzzzz in binary, where zzzzzzz is the seven-bit encoding of the two digits xx.
This suggests that the basis encoding of decimal digits should be such that the first, second, and third digit will encode to fff, ggg, or hhh respectively from this table,
0 000 1 001 2 010 3 011 6 100 7 101 8 110 9 111
or to p, q, or r respectively from this table:
4 0 5 1
First, let us develop the encoding of two digits to seven bits with the desired property.
This encoding can be described as follows:
ggg00hh ggg0100 (?4) qhh0101 q000111 (?4) q111000 (?5) qHH1010 ggg1011 (?5) ggg11HH
where hh and HH refer to two halves of the table for hhh:
hh HH 0 00 6 00 1 01 7 01 2 10 8 10 3 11 9 11
and the items in parentheses show the value for the last digit which is in four cases represented by a zero-length code, being determined instead from the prefix bits.
This clearly enough has the desired property, since it precedes four bit codes that correspond to the four-bit code for digits given above with ggg in the four applicable cases of the six given.
The encoding of three digits to ten bits can then be derived as an expansion of the seven bit encoding, so that both properties are obtained:
fffggg00hh [0..3,6..9][0..3,6..9][0..3] fffggg0100 (??4) [0..3,6..9][0..3,6..9][4] fffqhh0101 [0..3,6..9][4..5] [0..3] pggghh0110 [4..5] [0..3,6..9][0..3] fffq000111 (??4) [0..3,6..9][4..5] [4] pggg010111 (??4) [4..5] [0..3,6..9][4] pqhh100111 [4..5] [4..5] [0..3] pq00101111 (??4) [4..5] [4..5] [4] pq11010000 (??5) [4..5] [4..5] [5] pqHH011000 [4..5] [4..5] [6..9] pggg101000 (??5) [4..5] [0..3,6..9][5] fffq111000 (??5) [0..3,6..9][4..5] [5] pgggHH1001 [4..5] [0..3,6..9][6..9] fffqHH1010 [0..3,6..9][4..5] [6..9] fffggg1011 (??5) [0..3,6..9][0..3,6..9][5] fffggg11HH [0..3,6..9][0..3,6..9][6..9]
in which table, fff, which is 000 for 0, and 111 for 9, precedes all the combinations in the table for the 7 bit code, and p precedes code combinations which are, in their last 7 bits, distinctive from anything defined in the 7-bit code.
Sign-Magnitude Numbers
The modified form of Densely Packed Decimal shown above was an attempt to deal with one problem in the original Densely Packed Decimal encoding, the fact that it made no provision for the sign of a number. In a computer the storage cells of which contain three decimal digits, being ten bits in length, if an additional bit is needed to encode sign, then one sign bit would need to be provided for the smallest possible signed numeric data format and extra sign bits would be wasted for signed numbers with a higher precision.
Storage efficiency, however, was only one of the reasons for using Chen-Ho encoding, or a related method of storing numbers, in a computer. The reason for using decimal arithmetic, instead of binary arithmetic, is so that data stored in a computer remains readily and directly comprehensible. The use of ten's complement form for negative numbers (or nine's complement form) tends to compromise the achievement of this goal.
Given that sign-magnitude encoding of signed numbers, because it most closely resembles how numbers are displayed in printed form, is the desired representation, can we produce an encoding which has the truncation property of Densely Packed Decimal, but which is applicable to sign-magnitude numbers that are wholly contained in raw decimal digits, without an auxilliary sign bit?
The basic coding for individual digits to be used suggests itself immediately as the following:
0 +0 000 1 +1 001 2 +2 010 3 +3 011 4 +4 0 5 -0 100 6 -1 101 7 -2 110 8 -3 111 9 -4 1
based on our experience with the encodings above, where the three-bit codes include the digits to be truncated, and the one-bit codes are at the opposite end of the series.
Our desired truncation behavior is that +0 0 2 would truncate to +0 2 and then to +2, while -0 0 2 would truncate to -0 2 and then to -2. Thus, when we are performing truncation, the overall sign stays the same, but the next digit after the area of truncation may change to incorporate within itself the sign of the number.
The coding I am going to describe below will work by elision instead of truncation. If it was rotated one bit to the left, it would have the desired truncation property, but without the rotation, one can either remove the three or six bits following the first bit of the code to shorten a signed number, or remove the first three or six bits to shorten an unsigned number.
First, the four-bit code for single digits would look like this:
0000 0 +0 0001 1 +1 0010 2 +2 0011 3 +3 0100 4 +4 1000 5 -0 1001 6 -1 1010 7 -2 1011 8 -3 1100 9 -4
The seven-bit code, being three bits longer, is to reduce to the four-bit code by removal of the second, third, and fourth bits for a signed number, and, because the sign is kept in the first place, it should also reduce to the four-bit code by removal of the first three digits for an unsigned number. This is not too challenging a requirement, because what we are aiming at is that the final digit will change by being given the sign bit of the first digit during signed truncation, and only digits from 0 through 4 can do this in a valid manner.
Again using ggg and hhh for three-bit codes, and q and r for one-bit codes, for the two digits, the arrangement:
gggh0hh gggr100 qhhh101 q00r111
would allow any combination of two decimal digits to be distinguished, and, for the top two lines, where ggg is 000 or 100, allow, by the removal of the first three bits, the second digit to be revealed in the four-bit code, and, by removal of the three bits after the first bit, allow the sign of the first digit to be affixed to the second digit, which, provided that digit is in the range from 0 to 4, produces a useful result.
Since only the first two lines are critical for the desired property, there is some flexibility in how the code is designed.
The full ten-bit code, where the three digits can either be fff, ggg, or hhh, or p, q, or r respectively, then becomes:
fffgggh0hh fffgggr100 fffqhhh101 pggghhh110 fffq00r111 pggg01r111 pqhh10h111 p00q11r111
Some examples of how this works are shown below:
1000000011 -003 1000111010 -037 1000000100 -004 1000011 -03 1111010 -37 1000100 -04 1011 -3 1100 -4
1000011100 -029 1000101101 -046 1011100 -29 1101101 -46
Replacing the leading 1 by a leading 0 gives examples for positive numbers.
And What About Floating-Point?
If a computer that uses decimal digits internally is to be a fully general-purpose computer, in addition to working with integers, it will work with floating-point numbers.
Floating-point numbers are generally thought of as being in one of a limited number of fixed formats, and so the need to shorten a floating-point number by removing three bits or six bits, instead of by removing a complete ten-bit group, is clearly less severe.
However, if floating-point numbers are to be so shortened, they would be shortened by removal of the least significant digits. And this suggests making room for a sign bit by forcing the last digit of the mantissa portion of a floating-point number to be even. Thus, the following coding for digits suggests itself:
0 0+ 000 1 0- 001 2 2+ 010 3 2- 011 4 4+ 100 5 4- 101 6 6+ 110 7 6- 111 8 8+ 0 9 8- 1
One could develop a code in which trailing zeroes could be truncated by reversing the pattern of the code discussed immediately above. However, unlike an integer where removing the first few digits only produces the same result if the digits are removed are zero, what is desired when a floating-point number is shortened is to be able to remove any nonzero digits.
This cannot be achieved through truncation or elision of bits, as with Densely Packed Decimal and the modification of it shown above, and instead requires explicit conversion to decimal digits, followed by re-encoding. Thus, a special coding for the mantissa portions of floating-point numbers, even if it would still be applicable to _lengthening_short floating-point numbers by padding them with trailing zeroes, does not seem to be required.
In addition, as now noted above in the discussion of IBM's decimal floating-point format, the use of decimal digits is being taken advantage of for other purposes besides avoiding the time it takes to convert to and from binary, and, therefore, unnormalized values are put to use in this format.
But here the coding is anyways:
0fffggghhh 100pggghhh 101fffqhhh 110fffgggr 11100pqhhh 11101pgggr 11110fffqr
Yes, as it turns out, an encoding that lends itself to truncation (or elision for the sign-bit at the end case) from the right is very similar to the classic Chen-Ho encoding.
And the seven-bit code would be:
0fffggg 100pggg 101fffq 11100pq
with the four-bit code being:
0000 0 0+ 0001 1 0- 0010 2 2+ 0011 3 2- 0100 4 4+ 0101 5 4- 0110 6 6+ 0111 7 6- 1000 8 8+ 1001 9 8-
thus, 0100000001 at the end of a signed mantissa stands for 400-, and by elision becomes 0100001, standing for 40-, and by elision again becomes 0101, standing for 4-. Just as this fails if a digit being truncated is not zero, it fails if the last bit of the preceding digit, which is also truncated, due to where elision takes place, is not zero. Once again, the second line of the format is important to ensure it works properly even for the trailing digit 8: thus, 1000000001 is 800-, 1000001 is 80-, and 1001 is 8-.
A Single Code?
Is it possible to have a code which allows padding with zeroes on the right as well as truncation of zeroes on the left?
Perhaps even if it is not quite possible, some simple bit manipulation, simpler than translating the code into decimal digits, and then back again to the binary representation, could accomplish shortening from both ends.
My first attempt to construct a bidirectional code:
Digit Leftmost Middle Rightmost
0 +0 000 0000 000 0+ 1 +1 001 0001 001 0- 2 +2 010 0110 010 2+ 3 +3 011 0111 011 2- 4 0010 100 4+ 5 -0 100 1001 101 4- 6 -1 101 1000 110 6+ 7 -2 110 1111 111 6- 8 -3 111 1110 9 1101
was not successful; attempting to allow truncation and elision from both directions was simply too complicated, however I might try to attempt it.
However, elision from the right is only needed if one is going to keep the sign of the mantissa with the mantissa. Since it would be desirable for the exponent to occupy a single three-digit code group, rather than being only two digits long, while an exponent range from -49 to +49 might be acceptable, one from -499 to +499 is clearly more than satisfactory, so placing the sign of the number together with the sign of the exponent, and making the exponent range -249 to +249 would be entirely acceptable.
Then, a compression format ideally should support left truncation, left elision, and right truncation. Since 10 is not divisible by 4, there seems no way to permit left double elision, to permit exponents to be shortened, but left single elision, which truncates the exponent, would work if the exponent were, as is the commonest convention, in excess-250 format, which would shrink to excess-25 format after left elision which would preserve the sign of the number.
Let us try to construct such a code, beginning with the following table of substitutions:
Digit Leftmost Middle Rightmost
0 +0 000 0000 000 1 +1 001 0010 001 2 +2 010 0100 010 3 +3 011 0110 011 4 0111 5 -0 100 1000 100 6 -1 101 1010 111 7 -2 110 1100 110 8 -3 111 1110 111 9 1111
To allow truncation from the left, the code, in the middle digit, for 0, must be 0000; and that is the same value as is required for truncation from the right. For elision from the left, 5 must be represented by 1000 in the middle digit. So far, so good.
In the case where the rightmost digit is a 4 or a 9, to allow elision and truncation from the left, we will need to reserve the codes 0001 and 1001; this can be done.
In the case where the leftmost digit is a 4 or a 9, to allow truncation from the right without special logic, the code 1000 will have to be reserved, and that is not possible. So it appears that a bidirectional code may be possible, but for truncation only. Since the versatility of handling floating-point numbers requires the ability to handle signed numbers as a precondition, this does not appear useful.
But here it is anyways:
Digit Leftmost Middle Rightmost
0 000 0000 000 1 001 0010 001 2 010 0011 010 3 011 0100 011 4 100 0101 100 5 101 1001 101 6 110 1010 110 7 111 1011 111 8 1100 9 1101
0 000 0001 00
1 001 0001 01
2 010 0001 10
3 011 0001 11
4 100 0110 00
5 101 0110 01
6 110 0110 10
7 111 0110 11
8 0111 00 0
9 0111 01 1
0 00 1000 000 1 01 1000 001 2 10 1000 010 3 11 1000 011 4 00 1110 100 5 01 1110 101 6 10 1110 110 7 11 1110 111 8 0 00 1111 9 1 01 1111
0 00 0111 10 1 01 0111 10 2 10 0111 10 3 11 0111 10 4 00 0111 11 5 10 1111 00 6 10 1111 01 7 10 1111 10 8 0 10 1111 11 0 9 1 11 1111 00 1
With this code, I think one can get unique 7-bit codes for each two-digit group, and unique 4-bit codes for each digit, trimming off one or two zeroes respectively, where each zero is represented by 000, from the right or the left. But the shorter codes will be different for each direction.
In that case, though, a code suitable to a modified form of elision, where the positions of the bits to be removed when truncating or eliding from the right are not the rightmost three and six bits necessarily, should be possible, by shrinking each digit through sequestering one bit from each digit, somewhat the same basic technique as was used to create the code that allowed truncating a ten's complement number from the left. In this regard, it might be noted that the official form of Densely Packed Decimal sequesters the _least_significant bit of each digit; if the digits were recoded so that their bit sequence is reversed, a scheme of handling signed numbers with it may be possible.
Of course, storing the digits of the mantissa in reverse order, in a code already supporting left truncation, would be much simpler than using a bidirectional code that was significantly more complex than the unidirectional code supporting both truncation and elision given above. And complexities and exceptions appear unavoidable even when supporting right truncation only in addition to left truncation.
[Next] [Up/Previous]