                     =================================
                     a flexible numeric literal syntax
                     =================================

                 by Andrew Main (Zefram) <zefram@fysh.org>

                                2005-09-17

abstract
--------

Many computer programs require the user to be able to input numeric
values in a textual format.  A wide variety of numeric syntaxes exist
for this purpose; the variability is detrimental to user comprehension.
Many programs have a numeric syntax that is inconveniently restrictive.
As a remedy for these problems, this paper proposes a single, flexible,
general-purpose syntax for the textual representation of numbers.
The use of this syntax, for both input and output, is encouraged in all
situations for which it is relevant.

table of contents
-----------------

0. introduction
0.0. rationale
0.1. scope
0.2. semantics
1. syntax
1.0. lexemes
1.1. unsigned integer syntax
1.2. signed integer syntax
1.3. fractional number syntax
2. examples
2.0. unsigned integers
2.1. fractional numbers
3. references

0. introduction
===============

0.0. rationale
--------------

There is a need in many languages and applications for a syntax for
numeric literals.  Historically these syntaxes have varied enormously.
It seems preferable to design a single flexible syntax once, and
subsequently reuse that syntax.

0.1. scope
----------

The objective is to represent numerical values as strings of graphic
ISO-646 invariant characters [ISO-646].  The restriction to ISO-646
invariants, a common subset character set, ensures that the syntax does
not need to be varied on systems with unusual character sets.

For general use, we need to express real numbers, both positive
and negative.  We wish to be able to express any integral value,
and arbitrarily close approximations to non-integral values.
Some applications require only integers, or some other subset of these
capabilities, so a distinct syntax is defined for each category; an
application can then implement only the subsets that it needs.

This paper defines distinct syntaxes for:

    * unsigned integer
    * signed integer
    * unsigned fractional number
    * signed fractional number

This paper does not address the exact representation of arbitrary
non-integral rational values, any irrational values, infinities, or any
non-real numbers.  These are less common requirements, and the specialised
programs that require them are justified in having a specialised syntax
for them.  Using this paper's syntax as a base for more specialised
syntaxes is encouraged.

0.2. semantics
--------------

This paper specifies only the mapping between character strings and the
numerical value expressed thereby.  Semantics beyond this are determined
by the application.  In particular, the syntax presented imposes no
limit on the magnitude or precision of numbers expressed, though most
applications will impose some limit of their own.

1. syntax
=========

1.0. lexemes
------------

The syntax defined herein expresses real numerical values as character
strings.  To maximise usability, only graphical ISO-646 invariant
characters are employed.  The characters used are:

    signs: + -
    decimal digits: 0 1 2 3 4 5 6 7 8 9
    Latin majuscules: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    Latin minuscules: a b c d e f g h i j k l m n o p q r s t u v w x y z
    underscore: _
    period: .

The Latin letters are recognised without regard to case; case will
henceforth be ignored.  An application may restrict the syntax to a
single case (either one) where circumstances justify it.  If there is
no overriding technical requirement, case should be ignored on input,
and lowercase is preferred on output.

The underscore is insignificant in the representation; it is intended
for use as a visual separator.  It is permitted anywhere within a number
except certain places at the beginning, as noted below.

The syntax is partly specified using the augmented BNF defined in [ABNF],
including the `core rules' given in appendix A thereof.  The ABNF does not
express all the syntactic rules; the accompanying text must be consulted.

1.1. unsigned integer syntax
----------------------------

The syntax of an unsigned integer consists of a decimal digit followed by
zero or more alphanumeric characters.  If the string consists solely of
digits, then the digits represent the value in conventional decimal place
value.  The most significant digit comes first, and the units digit last.

    uint     =  1*DIGIT

If there is at least one letter, then the first such letter and the
preceding digits determine the base in which the integer is written; the
remaining alphanumerics after the first letter (of which there must be at
least one) are then digits in the chosen base.  The first letter may be:

    b  binary (base 2)
    d  decimal (base 10)
    o  octal (base 8)
    x  hexadecimal (base 16)
    r  arbitrary radix (base 2 to 36)

If the letter is "b", "d", "o" or "x", then the preceding digits must
all be "0".  If the letter is "r", then the preceding digits are read
as a decimal number, which must have a value in the range [2, 36],
and that value is the radix that is used.

Following the initial digits and letter which specify the radix,
the remainder of the string must consist of one or more digits of the
radix chosen.  The decimal digits "0" to "9" have their usual values 0
to 9, and the letters "a" to "z" have the values 10 to 35 respectively.
Digits out of range for the selected radix are illegal.

    uint     =/ 1*"0" "b" 1*BIT
    uint     =/ 1*"0" "d" 1*DIGIT
    uint     =/ 1*"0" "o" 1*("0"/"1"/"2"/"3"/"4"/"5"/"6"/"7")
    uint     =/ 1*"0" "x" 1*HEXDIG
    uint     =/ 1*DIGIT "r" 1*(DIGIT/ALPHA)

Underscores are permitted anywhere except the very beginning.  It follows
that the first character of a <uint> is always a digit, and the remaining
characters are alphanumerics plus underscore.

Prepending a "0" to any valid <uint> yields a semantically identical
<uint>.
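
The rules of this section can be sketched as a short Python function.
This is only an illustration, not part of the specification: the name
parse_uint, the helper tables, and the choice of ValueError to report
an invalid string are all assumptions made for the example.

```python
import string

# "0"-"9" have the values 0-9 and "a"-"z" the values 10-35.
DIGIT_VALUES = {ch: v for v, ch in enumerate(string.digits + string.ascii_lowercase)}
# Radix letters with a fixed meaning; "r" is handled separately.
FIXED_RADIX = {"b": 2, "d": 10, "o": 8, "x": 16}


def parse_uint(text: str) -> int:
    """Return the value of a <uint>; raise ValueError if it is invalid."""
    # The first character must be a digit (so no leading underscore or sign).
    if not text or text[0] not in string.digits:
        raise ValueError("a <uint> must begin with a decimal digit")
    if not text.isascii():
        raise ValueError("illegal character in <uint>")
    # Case is ignored and underscores are insignificant.
    s = text.lower().replace("_", "")
    if any(ch not in DIGIT_VALUES for ch in s):
        raise ValueError("illegal character in <uint>")
    # The first letter, if there is one, selects the radix.
    pos = next((i for i, ch in enumerate(s) if ch in string.ascii_lowercase), None)
    if pos is None:
        radix, body = 10, s  # digits only: conventional decimal
    else:
        prefix, letter, body = s[:pos], s[pos], s[pos + 1:]
        if letter in FIXED_RADIX:
            if set(prefix) != {"0"}:
                raise ValueError("digits before b/d/o/x must all be 0")
            radix = FIXED_RADIX[letter]
        elif letter == "r":
            radix = int(prefix)
            if not 2 <= radix <= 36:
                raise ValueError("radix must be in the range [2, 36]")
        else:
            raise ValueError("unknown radix letter")
        if not body:
            raise ValueError("at least one digit must follow the radix letter")
    value = 0
    for ch in body:
        digit = DIGIT_VALUES[ch]
        if digit >= radix:
            raise ValueError("digit out of range for the radix")
        value = value * radix + digit
    return value
```

Because Python integers are unbounded, this sketch imposes no limit on
magnitude; an application with fixed-width integers would add its own
range check.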

1.2. signed integer syntax
--------------------------

In contexts where the sign of a number must be represented as part of
the numeric literal, a sign is optionally prepended to the unsigned
integer format to yield the signed integer format.

    opt-sign =  [ "+" / "-" ]
    sint     =  opt-sign uint

Underscores are not permitted before the sign or between the sign and
the first digit of the <uint>.
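
As an illustration only, the signed form can be layered on an unsigned
parser as in the following Python sketch.  The names parse_sint and
parse_uint, the regular expression, and the use of ValueError are
assumptions of the example, not part of this specification.  The
regular expression mirrors the ABNF for <uint> loosely; digit ranges
are checked separately.

```python
import re

# Optional radix prefix, then the digits (underscores already removed).
_UINT_RE = re.compile(r"(?:(0+)([bdox])|([0-9]+)r)?([0-9a-z]+)")
_FIXED = {"b": 2, "d": 10, "o": 8, "x": 16}
_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def parse_uint(text: str) -> int:
    """Compact <uint> parser; raises ValueError on invalid input."""
    if not text or not text.isascii() or text[0] not in "0123456789":
        raise ValueError("a <uint> must begin with a decimal digit")
    m = _UINT_RE.fullmatch(text.lower().replace("_", ""))
    if m is None:
        raise ValueError("malformed <uint>")
    _zeros, letter, radix_digits, body = m.groups()
    if letter:
        radix = _FIXED[letter]
    elif radix_digits:
        radix = int(radix_digits)
    else:
        radix = 10
    if not 2 <= radix <= 36:
        raise ValueError("radix must be in the range [2, 36]")
    value = 0
    for ch in body:
        digit = _DIGITS.index(ch)
        if digit >= radix:
            raise ValueError("digit out of range for the radix")
        value = value * radix + digit
    return value


def parse_sint(text: str) -> int:
    """Return the value of a <sint>: an optional sign, then a <uint>."""
    sign = 1
    if text[:1] in ("+", "-"):
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    # parse_uint insists on a leading digit, which also rejects an
    # underscore placed between the sign and the first digit.
    return sign * parse_uint(text)
```

An underscore before the sign is rejected as well, since the string
then begins with neither a sign nor a digit.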

1.3. fractional number syntax
-----------------------------

Where non-integral numbers must be expressed, there is a fractional
syntax available.  Here the integer syntax is augmented by appending a
radix point (".") and one or more digits in the radix determined by the
start of the integer syntax.

Where unusually large or small magnitudes are required, an exponent
is permitted.  This is a signed integer, appended to the fractional
number and separated from it by another ".".  The base for exponentiation
is the radix in which the fractional number is expressed: the value is
the fractional number multiplied by that radix raised to the power of the
exponent, so "0x1.8.2" denotes 1.5 * 16^2 = 384.  The exponent itself
may be expressed in any radix supported by the signed integer syntax,
independent of the radix of the fractional number.

A leading sign is also permitted in cases where it is useful.

    ufrac    =  uint "." 1*(DIGIT/ALPHA) [ "." sint ]
    sfrac    =  opt-sign ufrac

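Underscores are permitted anywhere except the very beginning or between
the leading sign and the first digit.  (As the examples in section 2.1
show, underscores may appear around the sign of the exponent.)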

Note on semantics: whether "-0.0" and "+0.0" (which have the same
numerical value) are treated as semantically distinct is an application
matter.  Similarly, whether an integer expressed in fractional syntax
is treated distinctly from the same integer expressed in integer syntax
is an application matter.
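
The fractional syntax can be evaluated exactly with rational
arithmetic.  The Python sketch below is illustrative only: the name
parse_sfrac, the helper functions, and the use of ValueError and of
fractions.Fraction are assumptions of the example.  It treats
underscores around the exponent's sign as insignificant, following the
examples in section 2.1.

```python
import re
from fractions import Fraction

_UINT_RE = re.compile(r"(?:(0+)([bdox])|([0-9]+)r)?([0-9a-z]+)")
_FIXED = {"b": 2, "d": 10, "o": 8, "x": 16}
_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def _digits(body, radix):
    """Value of a non-empty digit string in the given radix."""
    if not body:
        raise ValueError("at least one digit is required")
    value = 0
    for ch in body:
        digit = _DIGITS.find(ch)
        if digit < 0 or digit >= radix:
            raise ValueError("digit out of range for the radix")
        value = value * radix + digit
    return value


def _uint(s):
    """Return (radix, value) for a lowercase, underscore-free <uint>."""
    m = _UINT_RE.fullmatch(s)
    if m is None:
        raise ValueError("malformed <uint>")
    _zeros, letter, radix_digits, body = m.groups()
    if letter:
        radix = _FIXED[letter]
    elif radix_digits:
        radix = int(radix_digits)
    else:
        radix = 10
    if not 2 <= radix <= 36:
        raise ValueError("radix must be in the range [2, 36]")
    return radix, _digits(body, radix)


def _sign(s):
    """Split an optional leading sign off s; return (+1 or -1, rest)."""
    if s[:1] in ("+", "-"):
        return (-1 if s[0] == "-" else 1), s[1:]
    return 1, s


def parse_sfrac(text: str) -> Fraction:
    """Return the exact value of an <sfrac>; raise ValueError if invalid."""
    if not text.isascii():
        raise ValueError("illegal character")
    sign, rest = _sign(text)
    if not rest or rest[0] not in "0123456789":
        raise ValueError("a digit must follow the optional leading sign")
    parts = rest.lower().replace("_", "").split(".")
    if len(parts) not in (2, 3):
        raise ValueError("expected integer.fraction or integer.fraction.exponent")
    radix, whole = _uint(parts[0])
    value = whole + Fraction(_digits(parts[1], radix), radix ** len(parts[1]))
    if len(parts) == 3:
        exp_sign, exp_text = _sign(parts[2])
        _exp_radix, exponent = _uint(exp_text)  # exponent radix is independent
        value *= Fraction(radix) ** (exp_sign * exponent)
    return sign * value
```

A Fraction has no signed zero, so this sketch does not distinguish
"-0.0" from "+0.0"; an application that needs the distinction would
have to carry the sign separately.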

2. examples
===========

2.0. unsigned integers
----------------------

The following are all valid <uint>s and represent the same value:

    64009403
    0d64009403
    10r64009403
    0__6__4__0__0__9__4__0__3__
    36r123xyz
    3_6_r_123XYz
    036r0123xyz
    000x3d0b4bb
    0x3d0b4bb
    0b11110100001011010010111011
    0o364132273
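
Several of the spellings above can be produced mechanically.  The
following Python sketch shows one possible output routine; the name
format_uint and its conventions (lowercase digits, the "0b", "0o" and
"0x" prefixes where they apply, "r" notation otherwise, and no prefix
for decimal) are assumptions of the example rather than requirements
of this paper, apart from the stated preference for lowercase output.

```python
_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
_PREFIX = {2: "0b", 8: "0o", 16: "0x"}


def format_uint(value: int, radix: int = 10) -> str:
    """Format a non-negative integer as a <uint> in the given radix."""
    if value < 0:
        raise ValueError("a <uint> cannot express a negative value")
    if not 2 <= radix <= 36:
        raise ValueError("radix must be in the range [2, 36]")
    digits = ""
    while True:
        # Peel off the least significant digit and prepend it.
        value, digit = divmod(value, radix)
        digits = _DIGITS[digit] + digits
        if value == 0:
            break
    if radix == 10:
        return digits  # plain decimal needs no prefix
    # Use the short prefix where one exists, else "<radix>r".
    return _PREFIX.get(radix, f"{radix}r") + digits
```

For instance, format_uint(64009403, 16) yields "0x3d0b4bb" and
format_uint(64009403, 36) yields "36r123xyz".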

The following are not valid <uint>s (but the first two are valid <sint>s):

    +64009403
    -64009403
    _64009403
    10d64009403
    0r0
    1r0
    37r0
    123xyz
    5rz
    0x

2.1. fractional numbers
-----------------------

The following are all valid <sfrac>s and represent the same value:

    384.0
    384.0.0
    384.00000.-0
    38.4.1
    +0__3__8__.__4__.__+__1__
    0.00384.0b101
    3840.0.-1
    0x180.0
    +0x1.8.2
    0b11.0.7
    0b1.1.0o10

3. references
=============

[ABNF]       D. Crocker, Ed., P. Overell, "Augmented BNF for Syntax
             Specifications: ABNF", RFC 2234, November 1997.

[ISO-646]    International Organization for Standardization, "Information
             technology -- ISO 7-bit coded character set for information
             interchange", ISO/IEC 646:1991.