c++ - What's the rationale for null terminated strings?

Thursday, 20 July 2017

c++ - What's the rationale for null terminated strings?

Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.

void add_element_to_next(arr, offset)
  char[] arr;
  int offset;
{
  arr[offset] += arr[offset+1];
}
char array[40];
void test()
{
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

versus

void add_element_to_next(ptr)
  char *p;
{
  p[0]+=p[1];
}
char array[40];
void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(arr+i);
}

the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.

While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.

PS--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;
  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length;
    if (s.src_length < copy_length)
      copy_length = s.src_length;
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.

/* Concatenate 10th through 24th characters from src to dest */
void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO *inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

Note that the lifetime of the string returned by temp_substring would be limited by those of s and src, which ever was shorter (which is why the method requires inf to be passed in--if it was local, it would die when the method returned).

In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).

Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.

Blog

Thursday, 20 July 2017