Tuesday, February 15, 2011

Understanding quirks of C: Structures

Targeted audience
=============
C programmers

System information
==============
Linux ubuntu 2.6.32-25-generic #44-Ubuntu SMP i686 GNU/Linux
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

Structures are the most popular user-defined data type in C. They allow a user to create a new data type by packing different data types together and using them under a single name. In this article we will discuss the fundamental operations provided by C structures and their internal behavior. We will use the following structure to explain the concepts.

----CODE----

struct test{
  char c;
  int a;
  char d;
};

----CODE----

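A variable of type “struct test” would have the following memory layout on the process stack, assuming the compiler pads members to a 4-byte boundary: 'c' at offset 0 followed by 3 bytes of padding, 'a' at offset 4, and 'd' at offset 8 followed by 3 bytes of padding, for a total of 12 bytes.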


 Structure layout in memory

Let's assign values to this structure variable.

----CODE----

struct test var = {'a', 10, 'b'};
struct test *ptr;

ptr = &var;

printf("%c", var.d);
printf("%c", ptr->d);

----CODE----

How do the member access expressions “var.d” and “ptr->d” work?

“var.d”: This expression is converted by the compiler into two components:
          a) Base address of symbol 'var' and,
          b) Offset of symbol 'd' in the structure memory layout

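Assume the base address of the structure variable is “0x0000” in our example. Next, the compiler finds the distance of 'd' from the base. The distance is 8 bytes (1 byte for 'c' + 3 bytes of padding + 4 bytes for 'a').

Similarly, “ptr->d” is factored in the same manner: the base address is the value held in 'ptr', and the member 'd' is located at that address plus its offset.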

How to calculate a member's byte offset in a given structure

In a trivial manner, we can calculate the offset of member 'd' by taking the address of the first element of the structure, i.e. 'c', and finding its difference from the address of 'd'.

----CODE----

offset_of_d = (&var.d - &var.c);

----CODE----

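The difference between these two addresses gives us the offset of 'd'. “&(var.c)” gives us the base address of the structure, because 'c' is its first member. Both members are of type char, so the pointer subtraction counts in bytes; for members of other types, cast both addresses to “char *” first.

The expression “&(var.d)” is treated as (base address of variable 'var' + distance of 'd' from the base).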

Efficient method to find the offset of a structure member
If we could set the base address to zero in the above equation, we would directly get the offset of 'd', i.e. 8.

We can accomplish this as follows:

----CODE----

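#include <stdio.h>

#define POINTER ((struct test*)(0))

int main(void)
{
  /* Cast the address to an integer type before storing it;
     struct test is the structure defined earlier */
  unsigned long x = (unsigned long)&(POINTER->d);
  printf("\nOffset of d = %lu\n", x);
  return 0;
}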

----CODE----

The expression “&(POINTER->d)” expands to “&(((struct test*)(0))->d)”.

We are typecasting '0' to an address that points to data of type 'struct test'. Now when you write “&(((struct test*)(0))->d)”, it yields the address of 'd' as:
                      (Base address + offset of d in the structure).

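Since the base address is set to zero, we are left with the offset of 'd'. We are tricking the compiler by giving the base address as zero. On compilers that accept it, this trick yields the right offset irrespective of the compiler's padding scheme, because the compiler itself supplies the offset.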

Neither of the approaches discussed is strictly portable, though, and they may produce incorrect output on different hardware or compilers. A portable way to accomplish the same is the “offsetof” macro, defined in the header file stddef.h. The macro 'offsetof' accepts two arguments:
          a) The structure type, so you don't need to create a structure variable
          b) The member for which the offset is to be calculated

This macro also accounts for the padding that the compiler inserts between structure members.

------CODE------

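#include <stddef.h>
#include <stdio.h>

int main(void)
{
  /* offsetof yields a size_t; struct test is the structure defined earlier */
  size_t x = offsetof(struct test, d);
  printf("\nOffset of d = %zu\n", x);
  return 0;
}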

------CODE--------

How does “sizeof” behave where, syntactically, you see what looks like an invalid pointer dereference?

Consider the following code:

----CODE-----

#include <stdio.h>

struct test{
  char c;
  int a;
  char d;
};

int main(void)
{
  // Define an uninitialized pointer of type struct test
  struct test *p;

  // Getting the size of the structure with an uninitialized pointer
  printf("%zu\n", sizeof(*p)); // Size of struct test
  printf("%zu\n", sizeof(p));  // Size of the pointer itself

  // Try it with one of the built-in types like 'char'
  char *i = NULL;
  printf("%zu\n", sizeof(*i)); // Displays 1, as the size needed to store a character
  return 0;
}

----CODE----

If you dereference a pointer inside the “sizeof” operator, it yields the size of the data type the pointer points to, i.e. 12 bytes for “struct test”.

This is because the sizeof operator works on the “data type”, not the data: its operand is not evaluated at run time (variable-length arrays being the one exception). And interestingly, it works for both built-in types and user-defined types. So, when you dereference a pointer inside sizeof, you are asking for the number of bytes the pointee would need in memory.

----CODE----
char *c;

// It would fetch you result as 1 byte.
sizeof(*c);

----CODE----

Remember that it is illegal to dereference an uninitialized or dangling pointer otherwise. In the code above, the address carried by the pointer is never actually dereferenced, and hence you never get the classic segmentation fault.


Hope you enjoyed these interesting facts of C. Please share your comments and suggestions.
