=================================
a flexible numeric literal syntax
=================================

by Andrew Main (Zefram)
2005-09-17

abstract
--------

Many computer programs require the user to be able to input numeric
values in a textual format. A wide variety of numeric syntaxes exist
for this purpose; the variability is detrimental to user comprehension.
Many programs have a numeric syntax that is inconveniently restrictive.
As a remedy for these problems, this paper proposes a single, flexible,
general-purpose syntax for the textual representation of numbers. The
use of this syntax, for both input and output, is encouraged in all
situations for which it is relevant.

table of contents
-----------------

0. introduction
    0.0. rationale
    0.1. scope
    0.2. semantics
1. syntax
    1.0. lexemes
    1.1. unsigned integer syntax
    1.2. signed integer syntax
    1.3. fractional number syntax
2. examples
    2.0. unsigned integers
    2.1. fractional numbers
3. references

0. introduction
===============

0.0. rationale
--------------

There is a need in many languages and applications for a syntax for
numeric literals. Historically these syntaxes have varied enormously.
It seems preferable to design a single flexible syntax once, and
subsequently reuse that syntax.

0.1. scope
----------

The objective is to represent numerical values as strings of graphic
ISO-646 invariant characters [ISO-646]. The restriction to ISO-646
invariants, a common subset character set, ensures that the syntax does
not need to be varied on systems with unusual character sets.

For general use, we need to express real numbers, both positive and
negative. We wish to be able to express any integral value, and
arbitrarily close approximations to non-integral values. Some
applications require only integers, or other subsets of these
capabilities, so a distinct syntax for the different categories allows
useful implementation of subsets of the syntax.

This paper defines distinct syntaxes for:

  * unsigned integer
  * signed integer
  * unsigned fractional number
  * signed fractional number

This paper does not address the exact representation of arbitrary
non-integral rational values, any irrational values, infinities, or any
non-real numbers. These are less common requirements, and the
specialised programs that require them are justified in having a
specialised syntax for them. Using this paper's syntax as a base for
more specialised syntaxes is encouraged.

0.2. semantics
--------------

This paper specifies only the mapping between character strings and the
numerical value expressed thereby. Semantics beyond this are determined
by the application. In particular, the syntax presented imposes no
limit on the magnitude or precision of numbers expressed, though most
applications will impose some limit of their own.

1. syntax
=========

1.0. lexemes
------------

The syntax defined herein expresses real numerical values as character
strings. To maximise usability, only graphical ISO-646 invariant
characters are employed. The characters used are:

    signs:            + -
    decimal digits:   0 1 2 3 4 5 6 7 8 9
    Latin majuscules: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    Latin minuscules: a b c d e f g h i j k l m n o p q r s t u v w x y z
    underscore:       _
    period:           .

The Latin letters are recognised without regard to case; case will
henceforth be ignored. Particular applications may choose to restrict
the syntax to either single case if using the syntax in circumstances
that justify it. If there is no overriding technical requirement, case
should be ignored on input, and lowercase is preferred on output.

The underscore is insignificant in the representation, intended for use
as a visual separator. It is permitted anywhere within a number except
certain places at the beginning, as noted below.

The syntax is partly specified using the augmented BNF defined in
[ABNF], including the `core rules' given in appendix A thereof.
The ABNF does not express all the syntactic rules; the accompanying
text must be consulted.

1.1. unsigned integer syntax
----------------------------

The syntax of an unsigned integer consists of a decimal digit followed
by zero or more alphanumeric characters.

If the string consists solely of digits, then the digits represent the
value in conventional decimal place value. The most significant digit
comes first, and the units digit last.

    uint = 1*DIGIT

If there is at least one letter, then the first such letter and the
preceding digits determine the base in which the integer is written;
the remaining alphanumerics after the first letter (of which there must
be at least one) are then digits in the chosen base. The first letter
may be:

    b    binary (base 2)
    d    decimal (base 10)
    o    octal (base 8)
    x    hexadecimal (base 16)
    r    arbitrary radix (base 2 to 36)

If the letter is "b", "d", "o" or "x", then the preceding digits must
all be "0". If the letter is "r", then the preceding digits are read
as a decimal number, which must have a value in the range [2, 36], and
that value is the radix that is used.

Following the initial digits and letter which specify the radix, the
remainder of the string must consist of one or more digits of the radix
chosen. The decimal digits "0" to "9" have their usual values 0 to 9,
and the letters "a" to "z" have the values 10 to 35 respectively.
Digits out of range for the selected radix are illegal.

    uint =/ 1*"0" "b" 1*BIT
    uint =/ 1*"0" "d" 1*DIGIT
    uint =/ 1*"0" "o" 1*("0"/"1"/"2"/"3"/"4"/"5"/"6"/"7")
    uint =/ 1*"0" "x" 1*HEXDIG
    uint =/ 1*DIGIT "r" 1*(DIGIT/ALPHA)

Underscores are permitted anywhere except the very beginning.

It follows that the first character of a uint is always a digit, and
the remaining characters are alphanumerics plus underscore. Prepending
a "0" to any valid uint yields a semantically identical uint.

1.2. signed integer syntax
--------------------------

In contexts where the sign of a number must be represented as part of
the numeric literal, a sign is optionally prepended to the unsigned
integer format to yield the signed integer format.

    opt-sign = [ "+" / "-" ]
    sint = opt-sign uint

Underscores are not permitted before the sign or between the sign and
the first digit of the uint.

1.3. fractional number syntax
-----------------------------

Where non-integral numbers must be expressed, there is a fractional
syntax available. Here the integer syntax is augmented by appending a
radix point (".") and one or more digits in the radix determined by the
start of the integer syntax.

Where unusually large or small magnitudes are required, an exponent is
permitted. This is a signed integer, appended to the fractional
number, separated by another ".". The base for exponentiation is the
radix in which the fractional number is expressed. The exponent itself
may be expressed in any radix supported by the signed integer syntax,
independent of the radix of the fractional number.

A leading sign is also permitted in cases where it is useful.

    ufrac = uint "." 1*(DIGIT/ALPHA) [ "." sint ]
    sfrac = opt-sign ufrac

Underscores are permitted anywhere except the very beginning or between
sign and first digit.

Note on semantics: whether "-0.0" and "+0.0" (which have the same
numerical value) are treated as semantically distinct is an application
matter. Similarly, whether an integer expressed in fractional syntax
is treated distinctly from the same integer expressed in integer syntax
is an application matter.

2. examples
===========

2.0. unsigned integers
----------------------

The following are all valid uints and represent the same value:

    64009403
    0d64009403
    10r64009403
    0__6__4__0__0__9__4__0__3__
    36r123xyz
    3_6_r_123XYz
    036r0123xyz
    000x3d0b4bb
    0x3d0b4bb
    0b11110100001011010010111011
    0o364132273

The following are not valid uints (but the first two are valid sints):

    +64009403
    -64009403
    _64009403
    10d64009403
    0r0
    1r0
    37r0
    123xyz
    5rz
    0x

2.1. fractional numbers
-----------------------

The following are all valid sfracs and represent the same value:

    384.0
    384.0.0
    384.00000.-0
    38.4.1
    +0__3__8__.__4__.__+__1__
    0.00384.0b101
    3840.0.-1
    0x180.0
    +0x1.8.2
    0b11.0.7
    0b1.1.0o10

3. references
=============

[ABNF]     D. Crocker, Ed., P. Overell, "Augmented BNF for Syntax
           Specifications: ABNF", RFC 2234, November 1997.

[ISO-646]  International Organization for Standardization,
           "Information technology -- ISO 7-bit coded character set for
           information interchange", ISO/IEC 646:1991.
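
The rules of section 1 can be cross-checked against the examples of
section 2 with a short program. The following is a minimal Python
sketch, not a reference implementation. The names parse_literal,
_uint, _digits_value and _split_sign, and the allow_sign and
allow_fraction parameters, are hypothetical and do not come from this
paper. Returning exact values through the standard fractions module is
likewise an assumption; as a consequence "-0.0" and "+0.0" are not
distinguished, and no limit is placed on magnitude or precision.

```python
import re
from fractions import Fraction

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
_FIXED_RADIX = {"b": 2, "d": 10, "o": 8, "x": 16}


def _digits_value(digits, radix):
    """Value of a non-empty digit string written in the given radix."""
    if not digits:
        raise ValueError("at least one digit is required")
    value = 0
    for ch in digits:
        digit = _DIGITS.find(ch)
        if digit < 0 or digit >= radix:
            raise ValueError(f"{ch!r} is not a digit in radix {radix}")
        value = value * radix + digit
    return value


def _uint(body):
    """Parse a lower-case, underscore-free uint; return (radix, value)."""
    match = re.fullmatch(r"([0-9]+)([a-z])(.*)", body)
    if match is None:
        return 10, _digits_value(body, 10)  # no letter: plain decimal
    prefix, letter, digits = match.groups()
    if letter == "r":
        radix = int(prefix)  # the radix itself is written in decimal
        if not 2 <= radix <= 36:
            raise ValueError(f"radix {radix} is outside [2, 36]")
    elif letter in _FIXED_RADIX and set(prefix) == {"0"}:
        radix = _FIXED_RADIX[letter]
    else:
        raise ValueError(f"invalid radix prefix in {body!r}")
    return radix, _digits_value(digits, radix)


def _split_sign(text):
    """Split off an optional leading sign; return (+1 or -1, rest)."""
    if text[:1] in ("+", "-"):
        return (-1 if text[0] == "-" else 1), text[1:]
    return 1, text


def parse_literal(text, allow_sign=True, allow_fraction=True):
    """Return the exact value of a literal as an int or a Fraction."""
    if not text.isascii():
        raise ValueError("non-ASCII characters are not allowed")
    text = text.lower()  # letters are recognised without regard to case
    sign, rest = _split_sign(text) if allow_sign else (1, text)
    if not rest[:1].isdigit():
        raise ValueError("a digit must come first (after any sign)")
    parts = rest.replace("_", "").split(".")  # underscores are insignificant
    if len(parts) == 1:
        return sign * _uint(parts[0])[1]
    if not allow_fraction or len(parts) > 3:
        raise ValueError(f"unexpected radix point in {text!r}")
    radix, whole = _uint(parts[0])
    scale = radix ** len(parts[1])
    value = Fraction(whole * scale + _digits_value(parts[1], radix), scale)
    if len(parts) == 3:  # exponent: a sint, in a radix of its own
        exp_sign, exp_body = _split_sign(parts[2])
        value *= Fraction(radix) ** (exp_sign * _uint(exp_body)[1])
    return sign * value


if __name__ == "__main__":
    print(parse_literal("36r123xyz"))  # 64009403
    print(parse_literal("+0x1.8.2"))   # 384
```

Calling parse_literal with allow_sign=False and allow_fraction=False
restricts the sketch to the unsigned integer syntax; allow_sign=True
with allow_fraction=False gives the signed integer syntax. This
mirrors the subsetting described in section 0.1.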