A gentle introduction to C* I

An imperative walk-through of what makes this language different from others.

Jan 15, 2022

Hello, and welcome to the first of many guides explaining the basics of C*, a systems programming language. We’ll assume no more than that you know the basic syntax of C and understand general programming patterns like functions, loops and so on.

Here’s a basic Hello, World in C*:

#include <stdio.h>
#include <ctypes.hst>

int main( int ac, char **av )
{
   printf( "Good morning, Vietnam!\n" );

   return 0;
}

Simple, right?

Other than line 2, this program is syntactically and semantically identical to its ANSI C equivalent. So, what’s different?

The #include on line 2 brings in a suite of typedefs that provide the built-in types you’d expect from C. Here’s an abbreviated, imagined version of what the contents of ctypes.hst could look like:

/* ... */

typedef struct
{
   bit _[32] { "signed2" };
}
s32;

typedef s32 int;

typedef struct
{
   bit _[8];
}
char;

There are a lot of new things here. What is that underscore variable doing? What are those braced strings? What exactly is a bit? Let’s start with the structs. C has structs too, and there they also serve to lay out complex data structures, like so:

struct foo
{
   unsigned flags;
   /* ... */
   char * bar;
};

In C, structs are for “macro size” member variables. The ANSI and ISO standards never mandate a specific layout, but they permit padding in between these members, and compilers routinely insert it to keep members on byte and sometimes word boundaries.

This implies a lower limit on the level of control a programmer has over the exact layout of their data. The ANSI and ISO standards don’t provide a way to override this behaviour, either, so C code that needs an exact layout falls back on compiler extensions to ask for the struct to be “packed”, that is, laid out without padding, even though this may cause alignment issues.

This is obviously a very painful and backwards way to approach this kind of thing. It’s worse for the fact that in C, manually handling the positioning of binary data in structs is often a very necessary task. So, in C*, we have changed a few things and added a couple more things to make this all much more intuitive.

Data layouts are always explicit

One thing C* does very differently is that it performs no padding or alignment of struct fields at all. Padding data is not a difficult or time-consuming task for programmers, so compared to the cost of having to override it, it simply isn’t worth automating. Everything is tightly bit-packed, and programmers are expected to insert their own padding bits as desired. So, struct foo from earlier, compiled as C, might be laid out like so in the memory of a 64-bit computer:

A = flags
B = bar
- = implicit padding
+ = discarded field

AA AA AA AA -- -- -- -- BB BB BB BB BB BB BB BB

While in C*, the same struct, by default, would be laid out very differently:

A = flags
B = bar
- = implicit padding
+ = discarded field

AA AA AA AA BB BB BB BB BB BB BB BB

Out of the box, this might cause performance problems on major ISAs such as x86-64 or AArch64, where pointers are 8 bytes wide and need to be aligned. The solution is trivial:

struct foo
{
   unsigned flags;
   unsigned _;
   char * bar;
};

This provides explicit padding, replicating the implied struct layout generated automatically in C:

A = flags
B = bar
- = implicit padding
+ = discarded field

AA AA AA AA ++ ++ ++ ++ BB BB BB BB BB BB BB BB

Nifty, isn’t it?

The underscore and its meaning

You might have seen the underscore before in languages like Go, where it’s used in logic to syntactically receive outputs and discard them, forever anonymous.

If you’ve been following along, you have already seen what else it does in the context of struct definitions: it also discards whatever would have been in its place. In struct definitions, this has the important effect of “leaving space” where the field would have been in the layout, providing the function of padding explicitly to the programmer without the mess of having field names like padding0 and padding1 that can be accessed and messed with by anyone.

In general, the underscore is a pronoun – that is, it can be reused multiple times in the same semantic context without introducing ambiguity. After all, if you cannot refer to a discarded field in the first place, there would also be no issue of differentiating between them.

Here’s one more interesting case of the underscore and how it plays with structs:

struct bla
{
   bit _[16];
};

A struct that contains exactly one member, named with the underscore pronoun, is what we call an inline struct. Variables of such types are treated as if they were primitive types in C, without the dot notation:

/* continuing from above... */

/* function body */
struct bla a; /* it is a 16-bit integer, unsigned */
a = 17;
/* ... */
a &= bar;

Indeed, this is exactly how the aforementioned <ctypes.hst> provides the data types C programmers are surely familiar with.

And finally, the attributes

What are those string literals encased in curly braces? Those are attributes. They are furnished by the compiler, allowing the programmer to add semantics from the language as context to a given data type.

For instance, the trivialised <ctypes.hst> file above declares the s32 type with the "signed2" attribute; this tells the compiler that the integer (made up of the static bit array shown) is a signed number, specifically signed using two’s complement.

The compiler knows about many kinds of data representations that have built-in meaning in most programming languages. This includes two’s complement, one’s complement, IEEE floating-point formats of all kinds, other forms of floating-point numbers, and even fixed-point numbers, just to name a few. Attributes are a pragmatic way to furnish context about a type of data in a way that is neither awkward nor dependent upon generic meta-programming tools cluttering up the language.

Anything qualifies for an attribute if it (a) is non-trivially complex and (b) requires a close relationship to the language’s semantic building blocks. Furthermore, attributes are implementation-specific, so your compiler vendor may provide the means for you to add anything to augment the language, with the right amount of effort.

This has been part one of a gentle introduction to the C* programming language. Expect part two to release soon, where we will go over the new facilities of laws and marshalling with a fresh mind.

If you’d like to read more about those things now, check out Law & Order in Programming with C*, where I touch on the basics of those.

And finally, please consider subscribing. I strive to keep about one third of my content open to public reading, and at least the first three parts of this series will be public. At $5.55/month, it’s hardly more than your average trinket from Amazon with Prime shipping. It means a lot to me.
