18.5 String as a collection
You've used String since chapter 1, but this lesson finally examines it as the collection it is: a growable buffer of UTF-8 bytes. That one fact (the contents are UTF-8 bytes, not characters) explains every surprising thing about Rust strings: why s[0] doesn't compile, why .len() counts bytes, why chars() and bytes() give different answers, and why slicing can panic. This is the lesson every Rust learner needs and few explanations deliver gently, so we'll go carefully.
A String is UTF-8 bytes
Internally, a String wraps a Vec<u8>: a sequence of bytes, with the guarantee that those bytes are valid UTF-8. UTF-8 is an encoding that represents text as bytes, and its crucial property is that different characters take different numbers of bytes. An ASCII letter like a is one byte; an accented letter like é is two bytes; many emoji are four. You met this in lesson 5.3: "café".len() is 5, not 4, because é is two bytes. That wasn't a quirk; it's the whole story of how strings work, and now we follow it to its consequences.
fn main() {
    let s = String::from("café");
    println!("{}", s.len()); // 5 bytes, not 4 characters
}
5
Why s[0] doesn't compile
In many languages you index a string to get its first character: s[0]. In Rust, that doesn't compile:
fn main() {
    let s = String::from("hello");
    let first = s[0];
}
error[E0277]: the type `str` cannot be indexed by `{integer}`
 --> src/main.rs:3:17
  |
3 |     let first = s[0];
  |                 ^^^^ string indices are ranges of `usize`
The reason is exactly the byte-versus-character problem. If s[0] returned something, what would it be? A byte? That's not a character (in a string that starts with é, byte 0 is only half the character). A character? Then s[0] would look cheap but be secretly expensive, because finding the n-th character means scanning the bytes from the start (you can't jump straight to it when characters vary in size). Rather than pick a confusing answer, Rust refuses indexing by integer entirely. The error even points at the alternative: string indices are ranges (slicing, below), not single positions.
Key insight
s[0] is forbidden because there's no good answer: a byte isn't a character, and a character can't be found in constant time when characters vary in byte-length. Rust won't pretend an O(1) index exists when the data doesn't support one. Instead it makes you say what you mean, bytes or characters, with bytes() or chars(), each of which is honest about what it gives you. The forbidden s[0] isn't a missing feature; it's Rust refusing to hide UTF-8's nature behind a misleading syntax.
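To see concretely why a single byte is not a character, the sketch below looks at the bytes of é directly. The helper name first_byte_is_whole_char is hypothetical, introduced only for illustration; as_bytes and std::str::from_utf8 are standard-library calls.

```rust
/// Hypothetical helper: returns true if the first byte of `s`, taken alone,
/// is valid UTF-8 (that is, the first character is exactly one byte long).
fn first_byte_is_whole_char(s: &str) -> bool {
    match s.as_bytes().first() {
        // Try to interpret that one byte, by itself, as UTF-8 text.
        Some(b) => std::str::from_utf8(&[*b]).is_ok(),
        // Empty string: there is no first byte at all.
        None => false,
    }
}

fn main() {
    // é is encoded as two bytes in UTF-8.
    println!("{:x?}", "é".as_bytes()); // [c3, a9]

    println!("{}", first_byte_is_whole_char("hello")); // true
    println!("{}", first_byte_is_whole_char("école")); // false
}
```

For "école", the first byte on its own is rejected as invalid UTF-8, which is exactly the "half a character" that a byte-returning s[0] would hand back.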
chars() and bytes()
To go through a string's contents, you choose your unit explicitly. chars() yields each Unicode character (technically each char, lesson 4.8); bytes() yields each raw u8:
fn main() {
    let s = String::from("café");
    for c in s.chars() {
        print!("{c} ");
    }
    println!();
    println!("char count: {}", s.chars().count()); // 4
    println!("byte count: {}", s.bytes().count()); // 5
}
c a f é
char count: 4
byte count: 5
chars() gives the four characters a human sees; bytes() gives the five bytes the computer stores. They differ precisely because é is one character but two bytes. When you want "the characters," use chars(); when you're doing byte-level work, use bytes(). To get the first character, call s.chars().next(). It returns Option<char>, since the string might be empty (lesson 11.3), and it is the honest replacement for the forbidden s[0]. There's no integer index because there's no single right unit; you state the unit and iterate.
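The gap between the two units can be made visible with char_indices(), which pairs each character with its starting byte offset, and char::len_utf8(), which reports how many bytes a character occupies. This is a minimal sketch; the helper first_char is a hypothetical name used for illustration.

```rust
/// Hypothetical helper: the first character of `s`, or None if `s` is empty.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

fn main() {
    let s = String::from("café");

    // Each character with its starting byte offset and its width in bytes.
    for (offset, c) in s.char_indices() {
        println!("byte {offset}: {c} ({} byte(s))", c.len_utf8());
    }

    println!("{:?}", first_char(&s)); // Some('c')
    println!("{:?}", first_char("")); // None
}
```

The loop reports offsets 0, 1, 2, and 3, with é starting at byte 3 and occupying two bytes, which accounts for the fifth byte that len() counts.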
Why slicing can panic
You can slice a string by a byte range, &s[0..3], but it comes with a sharp edge: the range must start and end on character boundaries, or the slice panics at runtime. This is the byte-versus-character issue one last time, now applied to the slices you met in lesson 9.6:
fn main() {
    let s = String::from("café");
    println!("{}", &s[0..3]); // "caf": valid, byte 3 is a boundary
    println!("{}", &s[0..4]); // panics: byte 4 splits the é
}
caf
thread 'main' panicked at src/main.rs:4:21:
byte index 4 is not a char boundary; it is inside 'é' (bytes 3..5) of `café`
&s[0..3] is "caf", fine, because byte 3 is a clean boundary between characters. But &s[0..4] tries to cut at byte 4, which is inside the two-byte é, splitting a character in half. A &str must always be valid UTF-8, and half a character isn't, so Rust panics rather than produce a broken string. The panic message is admirably specific: it names the byte index, the character it's inside, and that character's byte range.
So slicing by literal byte indices is risky on any text that might contain multi-byte characters. The safe approaches: slice at boundaries you computed from the text itself (like first_word scanning for a space, lesson 9.7, which always lands on a boundary), or work with chars() when you think in characters. Slicing with hardcoded byte numbers on arbitrary user text is the thing to avoid.
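Here is a sketch of the boundary-derived approach, along with a non-panicking alternative. find returns the byte offset where a match begins, which is always a char boundary, and get with a range returns None instead of panicking when the range would split a character. The name first_word echoes the earlier lesson, but the body here is an assumption written for illustration, not necessarily the version from lesson 9.7.

```rust
/// Returns the text before the first space, or the whole string if there
/// is no space. (Illustrative body; an assumption, not the lesson's code.)
fn first_word(s: &str) -> &str {
    match s.find(' ') {
        // `i` is the byte offset of the space, so it is a char boundary.
        Some(i) => &s[..i],
        None => s,
    }
}

fn main() {
    let s = "café au lait";
    println!("{}", first_word(s)); // café

    // `get` checks the boundaries and returns an Option instead of panicking.
    println!("{:?}", s.get(0..3)); // Some("caf")
    println!("{:?}", s.get(0..4)); // None: byte 4 is inside é

    // `is_char_boundary` answers the question directly.
    println!("{}", s.is_char_boundary(3)); // true
    println!("{}", s.is_char_boundary(4)); // false
}
```

When the offsets come from outside the text (user input, a stored number), get lets the caller handle a bad range as an ordinary None rather than a crash.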
Best practice
Think in the right unit. For "process each character," use chars(). For byte-level work (parsing a protocol, counting bytes), use bytes(). To get the character at a position, use chars().nth(n), which returns Option<char> and walks the string from the start, rather than indexing. To slice, derive the boundaries from the text (search for a delimiter) rather than hardcoding byte offsets, or you risk the char-boundary panic. The discipline feels strict at first, but it's the price of a string type that never silently corrupts non-English text, which many languages that allow s[0] cannot promise.
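As one way to apply this advice, the sketch below takes the first n characters of a string without hardcoding byte offsets. The helper first_n_chars is a hypothetical name introduced here; it uses char_indices().nth(n) to find the byte offset where character n begins. Like chars().nth(n), this walks the string from the start, so the cost grows with n.

```rust
/// Hypothetical helper: the prefix of `s` holding its first `n` characters
/// (or all of `s` if it has fewer than `n` characters).
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `end` is the byte offset where character `n` starts: a boundary.
        Some((end, _)) => &s[..end],
        None => s,
    }
}

fn main() {
    let s = "café!";
    println!("{:?}", s.chars().nth(3)); // Some('é')
    println!("{:?}", s.chars().nth(9)); // None
    println!("{}", first_n_chars(s, 4)); // café
    println!("{}", first_n_chars(s, 2)); // ca
}
```

Because the cut point is computed from the text, the slice inside first_n_chars can never land in the middle of a character.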
Quiz time
Question #1
Why does s.len() return 5 for "café", and why won't s[0] compile?
Solution
len() returns the number of bytes, and "café" is 5 bytes because é takes 2 bytes in UTF-8 (the other three are 1 each). s[0] won't compile because there's no good single answer: a byte isn't a character (it might be half of one), and finding the n-th character isn't a constant-time operation when characters vary in byte-length. Rust refuses rather than provide a misleading index.
Question #2
What's the difference between s.chars() and s.bytes() for the string "café"?
Solution
s.chars() yields the 4 Unicode characters (c, a, f, é); s.bytes() yields the 5 raw bytes. They differ because é is one character but two bytes. Use chars() when you mean characters, bytes() when you mean raw bytes. To get the first character, s.chars().next() returns Option<char>.
Question #3
Why does &s[0..4] panic for "café", and how do you slice strings safely?
Solution
Byte 4 falls inside the two-byte é (which occupies bytes 3..5), so the slice would split a character and produce invalid UTF-8. A &str must always be valid UTF-8, so Rust panics rather than create a broken string. Slice safely by using boundaries derived from the text itself (e.g. searching for a delimiter, which always lands on a char boundary, lesson 9.7) or by working with chars(), rather than hardcoding byte offsets on text that may contain multi-byte characters.
A String holds its text as a sequence of bytes, reached by position or by iteration. The next collection maps explicit keys to values: HashMap, for when you want to look something up by name rather than by position.