std.encoding - Phobos documentation

License: Boost License 1.0.
Authors: Janice Caron
Source: std/encoding.d

enum AsciiChar: ubyte;
alias AsciiString = immutable(AsciiChar)[];

Defines various character sets.

enum Latin1Char: ubyte;

Defines a Latin1-encoded character.

alias Latin1String = immutable(Latin1Char)[];

Defines a Latin1-encoded string (as an array of immutable(Latin1Char)).

enum Windows1252Char: ubyte;

Defines a Windows1252-encoded character.

alias Windows1252String = immutable(Windows1252Char)[];

Defines a Windows1252-encoded string (as an array of immutable(Windows1252Char)).

bool isValidCodePoint(dchar c);

Returns true if c is a valid code point

Note that this includes the non-character code points U+FFFE and U+FFFF, since these are valid code points (even though they are not valid characters).

Supersedes:

This function supersedes std.utf.startsValidDchar().

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

dchar c

the code point to be tested
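
The note about U+FFFE and U+FFFF can be checked directly. The following is a minimal sketch, assuming the usual Unicode rule that surrogates (U+D800 to U+DFFF) and values above U+10FFFF are not code points; the casts are used only to avoid relying on how the compiler treats such values in character literals.

```d
import std.encoding;

unittest
{
    // Ordinary characters and the non-characters U+FFFE / U+FFFF
    // are all valid code points.
    assert(isValidCodePoint('A'));
    assert(isValidCodePoint(cast(dchar) 0xFFFE));
    assert(isValidCodePoint(cast(dchar) 0xFFFF));

    // Assumption: surrogates and values beyond U+10FFFF are rejected.
    assert(!isValidCodePoint(cast(dchar) 0xD800));
    assert(!isValidCodePoint(cast(dchar) 0x110000));
}
```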

string encodingName(T)();

Returns the name of an encoding.

The type of encoding cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Examples

assert(encodingName!(Latin1Char) == "ISO-8859-1");

bool canEncode(E)(dchar c);

Returns true if and only if the specified code point can be represented in the encoding.

The type of encoding cannot be deduced. Therefore, it is necessary to explicitly specify the encoding type.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Examples

assert(canEncode!(Latin1Char)('A'));

bool isValidCodeUnit(E)(E c);

Returns true if the code unit is legal. For example, the byte 0x80 would not be legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

E c	the code unit to be tested
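
As an illustration of the ASCII case described above, the sketch below tests a few individual code units. The encoding is deduced from the argument type, so each value is cast to the code unit type of interest; the Windows-1252 and UTF values are assumptions based on those encodings' definitions rather than on this page.

```d
import std.encoding;

unittest
{
    // ASCII code units must lie in 0x00 .. 0x7F.
    assert( isValidCodeUnit(cast(AsciiChar) 0x7F));
    assert(!isValidCodeUnit(cast(AsciiChar) 0x80));

    // Windows-1252 leaves a handful of bytes undefined (for example 0x81).
    assert( isValidCodeUnit(cast(Windows1252Char) 0x80));
    assert(!isValidCodeUnit(cast(Windows1252Char) 0x81));

    // 0xFF can never appear in UTF-8; a lone surrogate is still a
    // legal UTF-16 code unit, but not a legal UTF-32 code unit.
    assert(!isValidCodeUnit(cast(char) 0xFF));
    assert( isValidCodeUnit(cast(wchar) 0xD800));
    assert(!isValidCodeUnit(cast(dchar) 0xD800));
}
```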

bool isValid(E)(const(E)[] s);

Returns true if the string is encoded correctly

Supersedes:

This function supersedes std.utf.validate(). Note, however, that this function returns a bool indicating whether the input was valid, whereas the older function would throw an exception.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

const(E)[] s

the string to be tested
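
A short sketch of the bool-returning behaviour, assuming UTF-8 input given as an ordinary string; the byte 0xFF is used as the malformed example because it never occurs in well-formed UTF-8.

```d
import std.encoding;

unittest
{
    // Well-formed UTF-8 (the euro sign followed by ASCII digits).
    assert(isValid("\u20AC100"));

    // A stray 0xFF byte makes the string invalid; no exception is thrown,
    // the function simply returns false.
    assert(!isValid("\xFF"));
}
```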

size_t validLength(E)(const(E)[] s);

Returns the length of the longest possible substring, starting from the first code unit, which is validly encoded.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

const(E)[] s

the string to be tested
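
For example, in the sketch below the six bytes of "hello " are valid UTF-8, while the following 0xF0 0x80 pair is a truncated multi-byte sequence, so the valid prefix is expected to stop there.

```d
import std.encoding;

unittest
{
    // Entirely valid input: the whole length is returned.
    assert(validLength("hello") == 5);

    // "hello " (6 code units) is valid; the next sequence is malformed.
    assert(validLength("hello \xF0\x80world") == 6);
}
```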

immutable(E)[] sanitize(E)(immutable(E)[] s);

Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. The result is guaranteed to be valid for this encoding.

If the input string is already valid, this function returns the original. Otherwise, it constructs a new string by replacing all illegal code unit sequences with the encoding's replacement character: the Unicode replacement character (U+FFFD) if the character repertoire contains it, or '?' otherwise.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

immutable(E)[] s

the string to be sanitized
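
The two cases described above (valid input returned unchanged, malformed input rewritten) might look like this for UTF-8, where the replacement character U+FFFD is encoded as the bytes EF BF BD. This is a sketch, not an exhaustive description of how many replacement characters a given malformed run produces.

```d
import std.encoding;

unittest
{
    // Already valid: the very same slice comes back.
    string ok = "hello world";
    assert(sanitize(ok) is ok);

    // The malformed pair F0 80 is replaced by U+FFFD.
    string fixed = sanitize("hello \xF0\x80world");
    assert(fixed == "hello \xEF\xBF\xBDworld");
    assert(isValid(fixed));
}
```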

size_t firstSequence(E)(const(E)[] s);

Returns the length of the first encoded sequence.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

const(E)[] s

the string to be sliced
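
A small sketch for UTF-8 input: the euro sign occupies three code units, whereas an ASCII letter occupies one.

```d
import std.encoding;

unittest
{
    // U+20AC is encoded as three UTF-8 code units.
    assert(firstSequence("\u20AC1000") == "\u20AC".length);
    assert(firstSequence("\u20AC1000") == 3);

    // A leading ASCII character is a one-unit sequence.
    assert(firstSequence("hel") == 1);
}
```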

size_t lastSequence(E)(const(E)[] s);

Returns the length of the last encoded sequence.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

const(E)[] s

the string to be sliced
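
The mirror image of the firstSequence case, again sketched for UTF-8 input:

```d
import std.encoding;

unittest
{
    // The trailing euro sign is three UTF-8 code units long.
    assert(lastSequence("1000\u20AC") == "\u20AC".length);

    // A trailing ASCII character is a one-unit sequence.
    assert(lastSequence("hello") == 1);
}
```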

ptrdiff_t index(E)(const(E)[] s, int n);

Returns the array index at which the (n+1)th code point begins.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Supersedes:

This function supersedes std.utf.toUTFindex().

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

const(E)[] s	the string to be counted
int n	the current code point index
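
To make the "(n+1)th code point" wording concrete, the sketch below uses a UTF-8 string whose first code point is three code units long, so the second code point (n = 1) begins at array index 3.

```d
import std.encoding;

unittest
{
    // n = 0 refers to the first code point, which starts at index 0.
    assert(index("\u20AC100", 0) == 0);

    // n = 1 refers to the second code point ('1'), which starts after
    // the three code units of the euro sign.
    assert(index("\u20AC100", 1) == 3);
}
```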

dchar decode(S)(ref S s);

Decodes a single code point.

This function removes one or more code units from the start of a string, and returns the decoded code point which those code units represent.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Supersedes:

This function supersedes std.utf.decode(). Note, however, that the function codePoints() supersedes it more conveniently.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

S s	the string whose first code point is to be decoded
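
Because the string is taken by reference, the slice itself is advanced past the decoded code point. A minimal sketch, assuming valid UTF-8 input and importing only std.encoding so that the name decode is unambiguous:

```d
import std.encoding;

unittest
{
    string s = "\u20AC100";

    // Decoding consumes the three code units of the euro sign...
    dchar c = decode(s);
    assert(c == '\u20AC');

    // ...and leaves the remainder in s.
    assert(s == "100");
}
```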

dchar decodeReverse(E)(ref const(E)[] s);

Decodes a single code point from the end of a string.

This function removes one or more code units from the end of a string, and returns the decoded code point which those code units represent.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

const(E)[] s

the string whose last code point is to be decoded
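
A sketch of decoding from the end of a string. The variable is declared as const(char)[] because the parameter is a ref const(E)[], and a string lvalue would not bind to it.

```d
import std.encoding;

unittest
{
    const(char)[] s = "100\u20AC";

    // The trailing euro sign is removed from the end of the slice.
    dchar c = decodeReverse(s);
    assert(c == '\u20AC');
    assert(s == "100");
}
```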

dchar safeDecode(S)(ref S s);

Decodes a single code point. The input does not have to be valid.

This function removes one or more code units from the start of a string, and returns the decoded code point which those code units represent.

This function will accept an invalidly encoded string as input. If an invalid sequence is found at the start of the string, this function will remove it, and return the value INVALID_SEQUENCE.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

S s	the string whose first code point is to be decoded
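
The following sketch contrasts valid and invalid input. It deliberately checks only that some code units were consumed from the malformed string, since the exact number removed for a given malformed sequence is not stated here.

```d
import std.encoding;

unittest
{
    // Valid input behaves like decode().
    string good = "A!";
    assert(safeDecode(good) == 'A');
    assert(good == "!");

    // Invalid input yields INVALID_SEQUENCE and still makes progress.
    string bad = "\xFFabc";
    immutable before = bad.length;
    assert(safeDecode(bad) == INVALID_SEQUENCE);
    assert(bad.length < before);
}
```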

size_t encodedLength(E)(dchar c);

Returns the number of code units required to encode a single code point.

The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

dchar c

the code point to be encoded
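
Since the encoding must be given explicitly, the same code point can be measured against several encodings. A brief sketch:

```d
import std.encoding;

unittest
{
    // U+20AC: three UTF-8 units, one UTF-16 unit, one UTF-32 unit.
    assert(encodedLength!(char)('\u20AC') == 3);
    assert(encodedLength!(wchar)('\u20AC') == 1);
    assert(encodedLength!(dchar)('\u20AC') == 1);

    // A code point outside the BMP needs a surrogate pair in UTF-16.
    assert(encodedLength!(wchar)('\U0001F600') == 2);
}
```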

E[] encode(E)(dchar c);

Encodes a single code point.

This function encodes a single code point into one or more code units. It returns a string containing those code units.

The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

Supersedes:

This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

dchar c

the code point to be encoded
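
A minimal sketch of the array-returning overload, with the encoding supplied as the template argument:

```d
import std.encoding;

unittest
{
    // Encode U+20AC as UTF-8: three code units.
    char[] utf8 = encode!(char)('\u20AC');
    assert(utf8 == "\u20AC");
    assert(utf8.length == 3);

    // The same code point as UTF-16: a single code unit.
    wchar[] utf16 = encode!(wchar)('\u20AC');
    assert(utf16 == "\u20AC"w);
}
```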

size_t encode(E)(dchar c, E[] array);

Encodes a single code point into an array.

This function encodes a single code point into one or more code units. The code units are stored in a user-supplied fixed-size array, which must be passed by reference.

The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

Supersedes:

This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

dchar c	the code point to be encoded
E[] array	the destination array

Returns

the number of code units written to the array
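
A sketch of encoding into a caller-owned buffer. The four-element buffer size is an assumption chosen to be large enough for any UTF-8 sequence; the return value tells how much of it was filled.

```d
import std.encoding;

unittest
{
    // Scratch buffer owned by the caller.
    char[4] buf;

    // Write the UTF-8 encoding of U+20AC into the buffer.
    size_t n = encode!(char)('\u20AC', buf[]);

    assert(n == 3);
    assert(buf[0 .. n] == "\u20AC");
}
```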

void encode(E)(dchar c, void delegate(E) dg);

Encodes a single code point to a delegate.

This function encodes a single code point into one or more code units. The code units are passed one at a time to the supplied delegate.

The input to this function MUST be a valid code point. This is enforced by the function's in-contract.

The type of the output cannot be deduced. Therefore, it is necessary to explicitly specify the encoding as a template parameter.

Supersedes:

This function supersedes std.utf.encode(). Note, however, that the function codeUnits() supersedes it more conveniently.

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

dchar c	the code point to be encoded
void delegate(E) dg	the delegate to invoke for each code unit
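
The delegate form can be sketched as follows. The delegate variable collect and the array units are illustrative names, and the delegate is given an explicit type so that it matches the documented parameter exactly.

```d
import std.encoding;

unittest
{
    char[] units;

    // Explicitly typed delegate that appends each code unit as it arrives.
    void delegate(char) collect = (char u) { units ~= u; };

    encode!(char)('\u20AC', collect);

    // The delegate was invoked once per UTF-8 code unit.
    assert(units.length == 3);
    assert(units == "\u20AC");
}
```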

CodePoints!E codePoints(E)(immutable(E)[] s);

Returns a foreachable struct which can bidirectionally iterate over all code points in a string.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

You can foreach either with or without an index. If an index is specified, it will be initialized at each iteration with the offset into the string at which the code point begins.

Supersedes:

This function supersedes std.utf.decode().

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

immutable(E)[] s

the string to be decoded

Examples

string s = "hello world";
foreach (c; codePoints(s))
{
    // do something with c (which will always be a dchar)
}

Note that, currently, foreach(c;codePoints(s)) is superior to foreach(c;s) in that the latter will fall over on encountering U+FFFF.
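
The indexed and reverse forms of iteration mentioned above might look like the sketch below. The arrays offsets, points and reversed are illustrative names used only to record what the loops observe.

```d
import std.encoding;

unittest
{
    string s = "\u20ACa";

    // Indexed iteration: i is the code unit offset of each code point.
    size_t[] offsets;
    dchar[] points;
    foreach (i, c; codePoints(s))
    {
        offsets ~= i;
        points ~= c;
    }
    assert(offsets == [0, 3]);
    assert(points == "\u20ACa"d);

    // Bidirectional: the same code points, last to first.
    dchar[] reversed;
    foreach_reverse (c; codePoints(s))
        reversed ~= c;
    assert(reversed == "a\u20AC"d);
}
```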

CodeUnits!E codeUnits(E)(dchar c);

Returns a foreachable struct which can bidirectionally iterate over all code units in a code point.

Supersedes:

This function supersedes std.utf.encode().

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

dchar c

the code point to be encoded

Examples

import std.stdio : writefln;

dchar d = '\u20AC';
foreach (c; codeUnits!(char)(d))
{
    writefln("%X", c);
}
// will print
// E2
// 82
// AC

size_t encode(Tgt, Src, R)(in Src[] s, R range);

Encodes the contents of s in units of type Tgt and writes the result to the output range range. Returns the number of Tgt elements written.

void transcode(Src, Dst)(immutable(Src)[] s, out immutable(Dst)[] r);

Convert a string from one encoding to another. (See also to!() below).

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Supersedes:

This function supersedes std.utf.toUTF8(), std.utf.toUTF16() and std.utf.toUTF32() (but note that to!() supersedes it more conveniently).

Standards

Unicode 5.0, ASCII, ISO-8859-1, WINDOWS-1252

Parameters

immutable(Src)[] s	the source string
immutable(Dst)[] r	the destination string

Examples

wstring ws;
// transcode from UTF-8 to UTF-16
transcode("hello world", ws);

Latin1String ls;
// transcode from UTF-16 to ISO-8859-1
transcode(ws, ls);

class EncodingException: object.Exception;

The base class for exceptions thrown by this module

abstract class EncodingScheme;

Abstract base class of all encoding schemes

static void register(string className);

Registers a subclass of EncodingScheme.

This function allows user-defined subclasses of EncodingScheme to be declared in other modules.

Examples

class Amiga1251 : EncodingScheme
{
    shared static this()
    {
        EncodingScheme.register("path.to.Amiga1251");
    }
}

static EncodingScheme create(string encodingName);

Obtains a subclass of EncodingScheme which is capable of encoding and decoding the named encoding scheme.

This function is only aware of EncodingSchemes which have been registered with the register() function.

Examples

auto scheme = EncodingScheme.create("Amiga-1251");

abstract const string toString();

Returns the standard name of the encoding scheme

abstract const string[] names();

Returns an array of all known names for this encoding scheme

abstract const bool canEncode(dchar c);

Returns true if the character c can be represented in this encoding scheme.

abstract const size_t encodedLength(dchar c);

Returns the number of ubytes required to encode this code point.

The input to this function MUST be a valid code point.

Parameters

dchar c

the code point to be encoded

Returns

the number of ubytes required.
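
The query methods above can be exercised on one of the built-in schemes. This sketch assumes the UTF-8 scheme has been registered under the name "UTF-8", as listed for EncodingSchemeUtf8.

```d
import std.encoding;

unittest
{
    // Look up a scheme by name at run time.
    auto scheme = EncodingScheme.create("UTF-8");

    assert(scheme.toString() == "UTF-8");
    assert(scheme.canEncode('\u20AC'));

    // Unlike the templated functions, sizes here are counted in ubytes.
    assert(scheme.encodedLength('\u20AC') == 3);
}
```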

abstract const size_t encode(dchar c, ubyte[] buffer);

Encodes a single code point into a user-supplied, fixed-size buffer.

This function encodes a single code point into one or more ubytes. The supplied buffer must be code unit aligned. (For example, UTF-16LE or UTF-16BE must be wchar-aligned, UTF-32LE or UTF-32BE must be dchar-aligned, etc.)

The input to this function MUST be a valid code point.

Parameters

dchar c	the code point to be encoded
ubyte[] buffer	the destination array

Returns

the number of ubytes written.

abstract const dchar decode(ref const(ubyte)[] s);

Decodes a single code point.

This function removes one or more ubytes from the start of an array, and returns the decoded code point which those ubytes represent.

The input to this function MUST be validly encoded.

Parameters

const(ubyte)[] s

the array whose first code point is to be decoded
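
A round trip through encode() and decode() on a run-time scheme could be sketched as follows. The buffer size of four ubytes is an assumption that is sufficient for UTF-8.

```d
import std.encoding;

unittest
{
    auto scheme = EncodingScheme.create("UTF-8");

    // Encode U+20AC into a caller-supplied byte buffer.
    ubyte[4] buf;
    size_t n = scheme.encode('\u20AC', buf[]);
    ubyte[] expected = [0xE2, 0x82, 0xAC];
    assert(buf[0 .. n] == expected);

    // Decode it again; the slice is advanced past the consumed ubytes.
    const(ubyte)[] bytes = buf[0 .. n];
    assert(scheme.decode(bytes) == '\u20AC');
    assert(bytes.length == 0);
}
```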

abstract const dchar safeDecode(ref const(ubyte)[] s);

Decodes a single code point. The input does not have to be valid.

This function removes one or more ubytes from the start of an array, and returns the decoded code point which those ubytes represent.

This function will accept an invalidly encoded array as input. If an invalid sequence is found at the start of the string, this function will remove it, and return the value INVALID_SEQUENCE.

Parameters

const(ubyte)[] s

the array whose first code point is to be decoded

abstract const @property immutable(ubyte)[] replacementSequence();

Returns the sequence of ubytes to be used to represent any character which cannot be represented in the encoding scheme.

Normally this will be a representation of some substitution character, such as U+FFFD or '?'.

bool isValid(const(ubyte)[] s);

Returns true if the array is encoded correctly

Parameters

const(ubyte)[] s

the array to be tested

size_t validLength(const(ubyte)[] s);

Returns the length of the longest possible substring, starting from the first element, which is validly encoded.

Parameters

const(ubyte)[] s

the array to be tested

immutable(ubyte)[] sanitize(immutable(ubyte)[] s);

Sanitizes an array by replacing malformed ubyte sequences with valid ubyte sequences. The result is guaranteed to be valid for this encoding scheme.

If the input array is already valid, this function returns the original, otherwise it constructs a new array by replacing all illegal sequences with the encoding scheme's replacement sequence.

Parameters

immutable(ubyte)[] s

the string to be sanitized
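
The validation and sanitizing methods can be combined as in this sketch, which uses the ASCII scheme and a three-byte input whose last byte (0x80) lies outside the ASCII range. The exact replacement bytes are not asserted, only that the result validates.

```d
import std.encoding;

unittest
{
    auto scheme = EncodingScheme.create("ASCII");

    // "hi" followed by a byte that is not legal ASCII.
    immutable(ubyte)[] input = [0x68, 0x69, 0x80];

    assert(!scheme.isValid(input));
    assert(scheme.validLength(input) == 2);

    // After sanitizing, the array is valid and the good prefix is kept.
    auto cleaned = scheme.sanitize(input);
    assert(scheme.isValid(cleaned));
    assert(cleaned[0 .. 2] == input[0 .. 2]);
}
```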

size_t firstSequence(const(ubyte)[] s);

Returns the length of the first encoded sequence.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Parameters

const(ubyte)[] s

the array to be sliced

size_t count(const(ubyte)[] s);

Returns the total number of code points encoded in a ubyte array.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Parameters

const(ubyte)[] s

the string to be counted

ptrdiff_t index(const(ubyte)[] s, size_t n);

Returns the array index at which the (n+1)th code point begins.

The input to this function MUST be validly encoded. This is enforced by the function's in-contract.

Parameters

const(ubyte)[] s	the string to be counted
size_t n	the current code point index
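
A sketch of count() and index() on raw bytes, assuming the bytes hold valid UTF-8 (here obtained by casting a string literal):

```d
import std.encoding;

unittest
{
    auto scheme = EncodingScheme.create("UTF-8");

    // Six ubytes: three for the euro sign, then "100".
    const(ubyte)[] bytes = cast(const(ubyte)[]) "\u20AC100";
    assert(bytes.length == 6);

    // Four code points in total.
    assert(scheme.count(bytes) == 4);

    // The second code point (n = 1) begins at byte offset 3.
    assert(scheme.index(bytes, 1) == 3);
}
```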

class EncodingSchemeASCII: std.encoding.EncodingScheme;

EncodingScheme to handle ASCII

This scheme recognises the following names: "ANSI_X3.4-1968", "ANSI_X3.4-1986", "ASCII", "IBM367", "ISO646-US", "ISO_646.irv:1991", "US-ASCII", "cp367", "csASCII", "iso-ir-6", "us"

class EncodingSchemeLatin1: std.encoding.EncodingScheme;

EncodingScheme to handle Latin-1

This scheme recognises the following names: "CP819", "IBM819", "ISO-8859-1", "ISO_8859-1", "ISO_8859-1:1987", "csISOLatin1", "iso-ir-100", "l1", "latin1"

class EncodingSchemeWindows1252: std.encoding.EncodingScheme;

EncodingScheme to handle Windows-1252

This scheme recognises the following names: "windows-1252"

class EncodingSchemeUtf8: std.encoding.EncodingScheme;

EncodingScheme to handle UTF-8

This scheme recognises the following names: "UTF-8"

class EncodingSchemeUtf16Native: std.encoding.EncodingScheme;

EncodingScheme to handle UTF-16 in native byte order

This scheme recognises the following names: "UTF-16LE" (little-endian architecture only), "UTF-16BE" (big-endian architecture only)

class EncodingSchemeUtf32Native: std.encoding.EncodingScheme;

EncodingScheme to handle UTF-32 in native byte order

This scheme recognises the following names: "UTF-32LE" (little-endian architecture only), "UTF-32BE" (big-endian architecture only)

dchar INVALID_SEQUENCE;

Special value returned by safeDecode() when it encounters an invalid sequence.
