Contents

Binary
Textual

Structured text
Semi-structured text
Unstructured text

Text selectors

regex
bounds
scanf

Filing

Contexts

Context providers are structs which provide a new view of the data. They are packable structs themselves which have their own keys for extracting the context, then they transform it (probably) and provide the new data for its children to use. They define locator keys based on what kind of context they provide.

Because contexts completely change how the structs work, they block all inheritance by their children. That is, their children cannot inherit keys from the context provider or higher. However, a context provider's grandchild can inherit keys from its parent just fine.

While there could be any number of context types, we currently define three: binary, textual, and filing. The root may be any type the applcation requests. For Exchange, it will be binary if the user interacts with a ROM (or whatever) and filing if the user interacts with a folder.

Binary

As the name implies, they provide access to binary data. Struct types which can be used in a binary context are called serializable types. They must define the methods serialize and unserialize in order to pack and unpack, respectively. The basic binary context provider is the bin type.

The locator keys defined by binary contexts are:

base: The address to start reading from. It can be specified in three ways:

[number, relative marker] - of which the marker can be b, c, or e. In this case, it'd usually be written like $000001:c, for instance.

b - The number is the absolute position.
c - The number is the offset from end of elder sibling or start of parent.
e - The number is the offset, backwards, from absolute end of context.

number - it's relative to the start of the context. That is, it assumes the marker b.
If unset, it defaults to the previous (by linear ordering) struct's end. That is, the default is 0:c.
When retrieved, it returns the absolute position as a number.

size: A size, which accepts a size type declaration. Defined as follows:

"number units" - Any string type with a number, whitespace, then a unit name. The number may contain a decimal point.

For now the units may be bits or bytes, others may be defined in the future.
If a decimal point is used with bytes it's assumed to be the number of bits extra, so 1.1 is 1 byte and the first bit (MSB) of the next byte. There may be a way to invert the bit order in the future.
bits does not accept a decimal point.

number - It assumes the units is "bytes", but does not allow a decimal point.
When retrieved as a number, it should return the number based on the units used to declare it, rounded up.

Requesting types which are knowledgable about the size type (or units in general) may choose a specific unit.

When retrieved as a string, it should return the original string declared, or the number plus bytes if it was declared as a number.
TODO: size type proposal & link

end: Specify the address one byte after the final byte included in the serialized data for this struct. Accepts the same definitions as base.

The locators have the following constraints, from which calculations of missing values can potentially be drawn:

base + size = end
From context's perspective: size.min = sum(c.size for c in children)

Assumed to be the real size if no size of the context is specified. But size, of course, maxes out at the end of the context, which will often be set when doing conversions to binary from another context type.

Textual

Textual contexts provide string data. They may either be structured text (like XML) or unstructured text (like TXT). Struct types which can be used in a textual context are called stringifiable types. They must define the methods stringify and parse in order to pack and unpack, respectively.

Structured text

Structured text refers largely to object notation and markup, such as JSON, YAML, and XML. The basic structured text context provider is the markup type.

Each type will define its own selectors which are natural for that format and provide them through the base locator key. It may (and probably will) accept multiple types of selectors, though. Because it's common to match multiple entities, these contexts also provide a limit locator key. These is no linear ordering.

XML will use XPATH selectors (and others should also accept this, since it's so general), JSON can use jq or JSON Pointers, HTML could use CSS Selectors probably with jQuery extensions, etc.

base: A selector of some sort, to find a chunk of text to interpret. What's accepted is defined per context.
limit: Defines how many may be accepted. If the value is strictly 1 (and not a reference), the struct will not be cloned. Otherwise, it will be cloned regardless of how many are found.

[min, max] - Min must be a number and max may be a number or the word any.
number - The strict value; exactly that many.
Defaults to 1:any.
When retrieved, it returns the actual value found.

Semi-structured text

Semi-structured text includes formatted text (RTF, LaTeX) and possibly even programming or annotated natural languages. The text content is unstructured, but it has structuring around it. These are not a priority but they should be considered during design.

Unstructured text

This works similarly to binary data except instead of being a sequence of bytes, it's a sequence of decoded characters. As such, it can accept a base similar to binary contexts, but also ones similar to regular text contexts. The basic unstructured text context provider is the string type.

The context itself must specify how it's interpreting the individual characters. It could perform one of the forms of Unicode normalization, it might accept combining diacritics as part of the preceding character, or it might just provide the raw codepoints (as text, though).

Children can either use base, length, and end where base and end act like in binary contexts; or they can use just base where it's a selector and may define a limit. Both forms are linearly ordered. If no locator keys are specified at all, it assumes the positional offset method.

The context itself may declare a key ignore which accepts a selector which must correctly match all unselected text between children. If it is declared and it fails to match, an error should be raised.

[prefix, infix, suffix] - All three are selectors, prefix must match the text before the first child, infix must match the text between each child, and suffix must match the text after the last child (if the length of the context extends past the final child's end).

In order to have all three be the same use "selector"*3 TODO: are string ranges like this okay somehow?

[prefix, infix] - For a list with two elements, it applies like this.

In order to define just infix and suffix use [, infix, suffix]

infix - A single string applies like this.

Offset form:

base: Positional offset.

[offset, relative marker] - Offset from the relative marker, same as in binary contexts.

b - The number is the absolute position.
c - The number is the offset from end of elder sibling or start of parent.
e - The number is the offset, backwards, from absolute end of context.

number - it's relative to the start of the context. That is, it assumes the marker b.
If unset, it defaults to the previous (by linear ordering) struct's end. That is, the default is 0:c.
When retrieved, it returns the absolute position as a number.

length: The number of characters it can contain. No units can be specified.

It may be possible to request certain units, but this isn't currently defined.

end: Specify the position one character after the final character included in the packed data for this struct. Accepts the same definitions as base.

This form also has the same constraints as binary contexts, and inferrences can be made for missing data:

base + length = end
From context's perspective: length.min = sum(c.length for c in children)

Assumed to be the real length if no length of the context is specified. But length, of course, maxes out at the end of the context, which will often be set when doing conversions to text from another context type.

Selector form:

base: Can be a generic text selector. There is no default interpretation, the selector must be specified (keystruct). It will find all sequential matches while respecting limit.
limit: Defines how many matches may be accepted. If the value is strictly 1 (and not a reference), the struct will not be cloned. Otherwise, it will be cloned regardless of how many are found.

[min, max] - Min must be a number and max may be a number or the word any.
number - The strict value; exactly that many.
Defaults to 1.
When retrieved, it returns the actual value found.

ignore: If declared, contains a selector which must correctly match all of the text between matches of base in order for the interpretation to be valid. If the match fails, an error should be raised.

Text selectors

These are keystructs so they are declared in MPRL like type(basic) or type[arg1, arg2] or type.subtype(basic) etc.

regex

By default (without subtyping) it constructs a Perl Regular Expression. The author may also specify regex.perl specifically.

Other subtypes include:

sed - GNU sed utility style, see here.
grep - GNU grep utility style, see here.
egrep - GNU egrep (grep extended) utility style, see here.
Maybe more GNU stuff? From here.
lua - Lua pattern matching, see here.

bounds

Declared in MPRL like bounds[start, end] which selects text between (not including) the first start it finds and the nearest end it finds after.

If start or end is itself a list, it can start and end at any of the respective matches. That is, it will end with any option, regardless of what it started at. start and end may also be regex by using it like bounds[regex(start expression), end] or around end or both separately.

In this example, Element's value is xyz d="

string Example {
	data: 'stuff <xyz d=">"> aa'
	string Element {
		base: bounds["<", ">"]
	}
}

Another option is bounds.recurse[start, end, others] which selects text between (not including) start and the matching end, but only selects the top level ones. others is optional, but is a varadic list of oStart:oEnd markers between which start and end matches are ignored.

In this example, Element's value is xyz d=">"

string Example {
	data: 'stuff <xyz d=">"> aa'
	string Element {
		base: bounds.recurse["<", ">", '"':'"', "'":"'"]
	}
}

scanf

From C, uses form scanf(format) where the format is as defined here. %p is not supported.

While this may be used as the basing for numbers, the format must contain a single %i, %d, %u, %o, %x, or %n. If it contains multiple of those, it must raise an error. Anything else in the format will be ignored. For some theoretical floating point struct type, it works the same, but with %f, %e, %g, and %a. For any string types, it will validate the match but not interpret the values.

scanf itself provides a list basic value of what it matched, such that index 0 is the first matching %-specifier, etc.

Filing

Filing contexts offer whole files. This can be thought of as essentially an archive or a folder. Struct types which can be used in a filing context are called stringifiable types. They must define the methods to_file and from_file in order to pack and unpack, respectively. The basic filing context provider is the folder type.

It provides the locator key base which is the filename (including path as appropriate) and the optional encoding key type which specifies the mimetype. If the context has a strict ordering of files, then the children are linearly ordered.