Discussion
delta_p_delta_x: The zero-terminated string is by far C's worst design decision. It is single-handedly the cause of most performance, correctness, and security bugs, including many high-profile CVEs. I really do wish Pascal strings had caught on earlier and that platform/kernel APIs used them, instead of an unqualified pointer-to-char that hides an O(n) string traversal (by the platform) to find the null byte. There are then questions about the length prefix, with a simple solution: make this a platform-specific detail. 16-bit platforms get strings of length up to ~2^16, 32-bit platforms get 2^32 (a 4 GB string, more than 1000× as long as the entire Lord of the Rings trilogy), and 64-bit platforms get 2^64 (~10^19).
david2ndaccount: Pascal strings might be the only string design worse than C strings. C strings at least let you take a zero-copy substring of the tail. Pascal strings require a copy for any substring! Strings should be two machine words: length + pointer (what is commonly called a string view). This is no different from any other array view. Strings are not a special case.
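For illustration, a minimal sketch of the two-word "length + pointer" view described above. `StrView`, `view_of`, and `slice` are hypothetical names invented for this example, not part of any library; in real C++ code `std::string_view` plays this role.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical two-word string view: a pointer plus a length.
// No terminator is needed, so any sub-range of a buffer is a valid view.
struct StrView {
    const char* ptr;
    std::size_t len;
};

// Build a view over a C string. The O(n) strlen happens once, at the boundary.
inline StrView view_of(const char* cstr) {
    assert(cstr != nullptr);
    return StrView{cstr, std::strlen(cstr)};
}

// Zero-copy substring: clamps pos and count to the view's bounds and returns
// a new view into the same buffer. Works for the head, the tail, or the middle.
inline StrView slice(StrView s, std::size_t pos, std::size_t count) {
    if (pos > s.len) pos = s.len;
    if (count > s.len - pos) count = s.len - pos;
    return StrView{s.ptr + pos, count};
}
```

Note that the view does not own the buffer, so it is only valid for as long as the underlying storage is.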
gizmo686: C strings also allow you to do a zero-copy split by replacing all instances of the delimiter with null (although you need to keep track of the end of the list separately).
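A rough sketch of that in-place split, assuming the caller owns a writable, null-terminated buffer. `split_in_place` is a hypothetical helper name; the returned vector of pointers is one way to track where the list of pieces ends.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `buf` in place: every `delim` byte is overwritten with '\0' and a
// pointer to the start of each piece is recorded. No piece is copied.
// `buf` must be writable, owned by the caller, and already null-terminated.
inline std::vector<char*> split_in_place(char* buf, char delim) {
    assert(buf != nullptr);
    std::vector<char*> pieces;
    pieces.push_back(buf);              // first piece starts at the buffer start
    for (char* p = buf; *p != '\0'; ++p) {
        if (*p == delim) {
            *p = '\0';                  // terminate the current piece
            pieces.push_back(p + 1);    // next piece starts after the delimiter
        }
    }
    return pieces;
}
```

Each element of the result is an ordinary C string pointing into the original buffer, so the pieces become invalid as soon as that buffer is freed or overwritten.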
breuwi: This post is pretty confused about std::[w]string::data(). It definitely does not guarantee that the resulting buffer is null-terminated: https://cplusplus.com/reference/string/string/data/ The article says that c_str() is to make things _clearer_ (italics theirs), but that is critically incorrect. c_str() is to make relying on the null terminators "not undefined behavior." It is true that many implementations will always maintain the null-terminated buffer, so that data() will normally return a buffer that is null-terminated. But that's a really bad assumption to make. That said, I would generally agree with the author that "optimizing" or "modernizing" the code to convert to std::string_view is at least a waste of time and at worst dangerous. But this mostly comes down to it being so much easier to get use-after-free bugs.
Matheus28: Since C++11, data() is also required to be null-terminated. Per your own source and cppreference.
amluto: > query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits. I have also done this, but I would argue that, even at the time, the design was very poor. A much, much better solution would have been wide pointers: pass around the length of the string separately from the pointer, much like string_view or Rust’s &str. Then you could skip the NULL-writing part. Maybe C strings made sense on even older machines which had severely limited registers: if you have an accumulator and one register usable as a pointer, you want to minimize the number of variables involved in a computation.
beached_whale: std::string since C++11 guarantees the buffer is zero-terminated. The reasoning being thread safety of const members. https://eel.is/c++draft/basic.string#general-3
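A small sketch of what that guarantee means in practice, assuming a C++11 or later standard library. `data_is_terminated` and `strlen_matches_size` are hypothetical helper names used only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Since C++11, data() and c_str() return the same pointer, and the character
// at index size() is '\0', so reading one past the last character is defined.
inline bool data_is_terminated(const std::string& s) {
    return s.data() == s.c_str() && s.data()[s.size()] == '\0';
}

// For a string without embedded '\0' bytes, passing data() to a C API such as
// strlen therefore gives the same length as size().
inline bool strlen_matches_size(const std::string& s) {
    return std::strlen(s.data()) == s.size();
}
```

The guarantee covers std::string only; a std::string_view's data() makes no such promise, since a view may end in the middle of a larger buffer.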
Joker_vD: Yeah, I too feel that storing the array's length glued to the array's data is not that good of an idea; it should be stored next to the pointer to the array, i.e. in the array view. But the allure of having to pass around only a single pointer is quite a strong one.
breuwi: LOL, I need to learn to click on the more modern tabs. Will delete comment.
vlovich123: If I recall correctly, a Pascal string has the length before the string, i.e. to get the length you dereference the pointer and look backwards N bytes. A Pascal string is still a single pointer. You cannot cheaply take an arbitrary view of the interior string; you can only truncate cheaply (and out-of-bounds checks are easier to automate). That’s why pointer + length is important: it’s a generic view. For arrays it’s more complicated because you can have a stride, which is important for multidimensional arrays.
jmyeet: The C string and C++'s backwards compatibility supporting it is why I think both C and C++ are irredeemable. Beyond the bounds overflow issue, there's no concept of ownership. Like if you pass a string to a C function, who is responsible for freeing it? You? The function you called? What if freeing it is conditional somehow? How would you know? What if an error prevents that free? C++ strings had no choice but to copy the underlying string because of this unknown ownership, and then added more ownership issues by letting you take the naked pointer within to pass it to C functions. In fact, that's an issue with pretty much every C++ container, including the smart pointers: you can just call get() and break out of the lifecycle management in unpredictable ways. string_view came much later onto the scene and doesn't have ownership, so you avoid a sometimes unnecessary copy, but honestly it just makes things more complex. I honestly think that as long as we continue to use C/C++ for crucial software and operating systems, we'll be dealing with buffer overflow CVEs until the end of time.
masklinn: “Sure your software crashes and your machines get owned, but at least they’re not-working very fast!”
masklinn: You also need to own the buffer; otherwise you’re corrupting someone else’s data, or straight up segfaulting.
dh2022: I think the concern was conserving memory (which was scarce back then) and not iterating through each substring.
masklinn: > Isn't length + pointer... Basically a Pascal string? Unless I am mistaken. Length + pointer is a record string; a Pascal string has the length at the head of the buffer, behind the pointer.
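To make the layout difference concrete, here is a hedged sketch of a length-prefixed buffer in the style described above, with the length stored in the bytes just before the characters. The names `make_prefixed`, `chars_of`, and `prefixed_length` are hypothetical, and the platform-sized `std::size_t` prefix is an assumption taken from the suggestion earlier in the thread, not from any real Pascal implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical length-prefixed buffer: the first sizeof(std::size_t) bytes
// hold the length, and the characters follow immediately after.
inline std::vector<unsigned char> make_prefixed(const char* text, std::size_t len) {
    assert(text != nullptr);
    std::vector<unsigned char> buf(sizeof(std::size_t) + len);
    std::memcpy(buf.data(), &len, sizeof len);        // write the prefix
    std::memcpy(buf.data() + sizeof len, text, len);  // write the characters
    return buf;
}

// The single pointer that gets passed around: it points at the characters.
inline const unsigned char* chars_of(const std::vector<unsigned char>& buf) {
    return buf.data() + sizeof(std::size_t);
}

// O(1) length lookup: read the prefix stored just behind the pointer.
inline std::size_t prefixed_length(const unsigned char* chars) {
    std::size_t len;
    std::memcpy(&len, chars - sizeof len, sizeof len);
    return len;
}
```

A pointer into the middle of such a buffer is not a valid string of this kind, because the bytes behind it are characters rather than a length. That is why an interior substring needs either a copy or a separate pointer + length view.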