Pure data

Western Calculus' corollary to pure functions.

Apr 17, 2024

Today I have the joy of formally describing something I have been unconsciously practising for years in my structure and schema designs: pure data. What is it? Why does it matter?

Pure data is a formulation of data that, by virtue of its structure, makes it impossible to communicate an invalid value. The most basic form of data structure is the enumeration. Given an enumeration and a numeric coding (typically binary, so base 2), the data is pure if its enumerated values completely fill that coding: every possible code corresponds to exactly one valid value.

The most trivial case of this is the boolean, which is an enumeration of two values, false and true. What makes it valuable is that communicating it takes exactly one bit, and both states of that bit are meaningful, leaving no room for ambiguity in its constitution.

This can then be extended to enumerations of arbitrary size: an enumeration coded in a 5-bit field would be pure if it had exactly 2⁵ = 32 valid values, with no bit patterns reserved, prohibited or undefined. It cannot have 33 values, as they would not fit, and it cannot have 31 values, as that would leave one bit pattern without a defined meaning, which a receiver would have to treat as invalid.
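The condition above can be sketched in C. This is a minimal illustration, not a definitive implementation: the `direction` enumeration, `is_pure` and `direction_decode` are hypothetical names invented for the example, and the 2-bit width is an assumption chosen to keep it short.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical 2-bit enumeration: four values fill all 2^2 bit patterns. */
enum direction { DIR_NORTH = 0, DIR_EAST = 1, DIR_SOUTH = 2, DIR_WEST = 3 };

enum { DIRECTION_BITS = 2, DIRECTION_COUNT = 4 };

/* Purity condition from the text, checked at compile time:
 * the number of valid values equals 2^bits exactly. */
static_assert(DIRECTION_COUNT == (1 << DIRECTION_BITS),
              "direction must fill its 2-bit coding");

/* Returns 1 if an enumeration of `count` values exactly fills `bits` bits. */
static int is_pure(uint32_t count, unsigned bits)
{
    return bits < 32 && count == (UINT32_C(1) << bits);
}

/* Decoding is total: every 2-bit pattern maps to a valid direction,
 * so there is no error return. Only the low two bits of `raw` belong
 * to the field; the mask discards the rest of the carrier byte. */
static enum direction direction_decode(uint8_t raw)
{
    return (enum direction)(raw & 0x3u);
}
```

With these definitions, `is_pure(32, 5)` holds while `is_pure(31, 5)` and `is_pure(33, 5)` do not, matching the 5-bit case described above.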

Beyond this, one can extend the concept to the full breadth of plain old data by using structs and unions in the same old C fashion. We define a pure data type as above, and conclude that a pure structure is one composed entirely of pure data types or other pure structures. The same compositional rule applies to unions: as long as your primitive bits are pure, all the way down, your complex bits are pure, too.
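As a rough sketch of that composition, consider a hypothetical one-byte record made of three pure fields of 1, 2 and 5 bits. The `signal` type and its field names are assumptions for illustration only. Because C leaves the in-memory layout of bit-fields to the implementation, the byte coding is done with explicit shifts.

```c
#include <stdint.h>

/* Hypothetical pure record: 1 + 2 + 5 = 8 bits, each field pure.
 * Bit-fields keep each member to exactly its valid range. */
struct signal {
    unsigned int enabled   : 1;  /* 0..1  */
    unsigned int direction : 2;  /* 0..3  */
    unsigned int level     : 5;  /* 0..31 */
};

/* Encodes the record into one byte: [enabled | direction | level]. */
static uint8_t signal_pack(struct signal s)
{
    return (uint8_t)((s.enabled << 7) | (s.direction << 5) | s.level);
}

/* Decodes any byte into a record; there is no failure case. */
static struct signal signal_unpack(uint8_t byte)
{
    struct signal s;
    s.enabled   = (byte >> 7) & 0x1u;
    s.direction = (byte >> 5) & 0x3u;
    s.level     = byte & 0x1Fu;
    return s;
}

/* Counts the byte values that survive unpack-then-pack unchanged.
 * For a pure layout every one of the 256 patterns round-trips. */
static int signal_roundtrip_count(void)
{
    int ok = 0;
    for (int b = 0; b < 256; b++) {
        if (signal_pack(signal_unpack((uint8_t)b)) == b)
            ok++;
    }
    return ok;
}
```

The round-trip count is the useful property here: since each field fills its bits and the fields together fill the byte, decoding and encoding are exact inverses over all 256 values.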

Why does this matter? Two reasons: it is maximally efficient, and it eliminates deniability.

The first is more self-explanatory: no space is wasted in the information coding. We discussed this above with the 5-bit field providing exactly 32 values: any fewer and you still have room for more information, while any more would overflow the medium. It is therefore canonical in its structure.
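One way to put a number on this, offered as an illustration rather than a definition from this post, is to compare the information carried by N equally likely valid values with the b bits used to carry them. The symbol η and the name "coding efficiency" are assumptions introduced here.

```latex
\[
\eta = \frac{\log_2 N}{b},
\qquad
\eta\big|_{N=32,\;b=5} = \frac{5}{5} = 1,
\qquad
\eta\big|_{N=31,\;b=5} = \frac{\log_2 31}{5} \approx \frac{4.954}{5} \approx 0.991
\]
```

Under this measure a pure enumeration is exactly the case η = 1, and N = 33 is excluded outright because it would require η > 1.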

The second perhaps has more profound implications: if I have a prior understanding with you that the form of something I may say to you in the future is either A or B, and nothing else, I have made myself maximally transparent to you about my intentions and ability to communicate that thing whenever I may do so. I cannot later choose to send C, if only because I left myself no means to do so. The only option besides A and B is to refrain from sending anything at all.

This is powerful because it minimises the surface area for more sophisticated forms of lying. Thirty years of personal computing have shown us all the immense corrosive power of the colloquial “computer error” in popular culture, and unfortunately the utility of this as a basis for lies has only grown in proportion to the relevance of computers. I contend that this is only made possible by a defect in the understanding of informatics that informs our technical culture: until now, we nerds simply did not know how to make computers more robust in their honesty.

Historically we overlooked the propensity for chaos inherent to any field that is documented along the lines of “Prohibited”, “Reserved” or “Do Not Modify”, and this has led to much discord that the world could do without. Even ignoring adversarial interpretations of protocols and implementations, it is a danger in itself to have invalid values in a concrete data type, as they will, at a minimum, demand special handling that distorts the weight of the data in relation to its companion algorithm. Sorting through a data set may be a fine and clean affair, but suddenly it’s not so nice when you have to check for invalid values.

Working around that is a problem I contend is better solved at the data design stage than at the implementation stage. So, the concept of pure data is a more fully generalised solution to these problems.
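The cost of an invalid value can be sketched by tallying the same 2-bit samples under two hypothetical designs: one that reserves the bit pattern 3, and one that assigns it a meaning. The function names and the choice of a tally as the companion algorithm are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Impure design: 3 valid values in a 2-bit field; pattern 3 is reserved. */
enum { IMPURE_COUNT = 3 };

/* Tallies samples per value. The reserved pattern forces a validity
 * branch and an error path: the return value is the number of
 * samples that had to be rejected. */
static size_t tally_impure(const uint8_t *samples, size_t n, size_t counts[3])
{
    size_t rejected = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t v = samples[i] & 0x3u;
        if (v >= IMPURE_COUNT) {
            rejected++;
            continue;
        }
        counts[v]++;
    }
    return rejected;
}

/* Pure design: 4 valid values in the same 2-bit field.
 * Every pattern is a valid index, so nothing can be rejected. */
static void tally_pure(const uint8_t *samples, size_t n, size_t counts[4])
{
    for (size_t i = 0; i < n; i++)
        counts[samples[i] & 0x3u]++;
}
```

The pure version has no branch and no failure mode to report, which is the sense in which the problem is settled at the data design stage.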

You can make your intended messages as clear as possible, but it will only get you so far if the medium underlying those messages is ambiguous in its own right. Pure data is about creating mediums that lack that kind of vulnerable ambiguity. Since it solves for efficiency and security in one fell swoop, I am confident that it is a very true and powerful concept that will come to inform much of informatics for centuries to come.
