Strings in Go and Rust

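This week at Go Meetup, we talked briefly about how strings in Go are UTF-8, but not really. Go source code is UTF-8, so string literals usually are too, but a string value is just a sequence of bytes and nothing forces those bytes to be valid UTF-8. What I mean is, on the one hand, we can write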

s := "Hello, 世界!"
fmt.Println(s)

and it prints out

Hello, 世界!

as expected. But on the other hand, we can put an invalid UTF-8 sequence into a string as well

s := "\x67\x72\xfc\xdf\x65"

It will compile just fine, but print out junk.

gr��e

If we accept strings from an external source, we probably don’t want to do stringy things with them without first checking that they’re valid. For example, this code

package main

import (
    "fmt"
    "os"
)

func main() {
    for _, s := range os.Args {
        fmt.Println(s)
    }
}

just prints whatever we give it

$ ./garbage foo bär $(echo -en "\x67\x72\xfc\xdf\x65") baz
./garbage
foo
bär
gr��e
baz

while this one

package main

import (
    "fmt"
    "os"
    "unicode/utf8"
)

func main() {
    for _, s := range os.Args {
        if utf8.ValidString(s) {
            fmt.Println(s)
        } else {
            fmt.Println("not valid")
        }
    }
}

only prints valid strings

$ go build valid_string.go 
$ ./valid_string foo bär $(echo -en "\x67\x72\xfc\xdf\x65") baz
./valid_string
foo
bär
not valid
baz

In Rust, strings are UTF-8 as well. We can write

let s = "Hello, 世界!";
println!("{}", s);

and it prints out

Hello, 世界!

as expected. But unlike Go, we can’t put an invalid UTF-8 sequence in a string literal. This

let s = "\x67\x72\xfc\xdf\x65";

doesn’t even compile

error: this form of character escape may only be used with characters in the range [\x00-\x7f]

However, we still need to be careful. This

let v = vec![0x67, 0x72, 0xfc, 0xdf, 0x65];
let t = String::from_utf8(v);
println!("{:?}", t);

compiles fine, but at run time String::from_utf8 returns an Err instead of a String

Err(FromUtf8Error { bytes: [103, 114, 252, 223, 101], error: Utf8Error { valid_up_to: 2 } })
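The Err value carries enough detail to decide what to do next: where the valid prefix ends, and the original bytes themselves. Here is a minimal sketch of using both. The helper name describe is made up for illustration, and falling back to String::from_utf8_lossy is just one possible choice, not the only reasonable one.

```rust
// Hypothetical helper (not part of the standard library): report what
// String::from_utf8 says about a byte vector.
fn describe(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(s) => format!("valid: {}", s),
        Err(e) => {
            // Number of leading bytes that were valid UTF-8.
            let valid_up_to = e.utf8_error().valid_up_to();
            // The error hands the original bytes back, so nothing is lost.
            let raw = e.into_bytes();
            // Lossy decoding replaces each invalid sequence with U+FFFD.
            let lossy = String::from_utf8_lossy(&raw);
            format!("invalid after {} bytes; lossy: {}", valid_up_to, lossy)
        }
    }
}

fn main() {
    println!("{}", describe(vec![0x67, 0x72, 0xfc, 0xdf, 0x65]));
    println!("{}", describe("Hello, 世界!".as_bytes().to_vec()));
}
```

For the bytes above, the first line reports that only the first 2 bytes are valid, which matches the valid_up_to: 2 in the error.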

So once again, if we accept strings from an external source, we probably don’t want to do stringy things with them without first checking that they’re valid. But, unlike in Go, we can’t even put them in a string until we check, at least not in safe Rust. This code

use std::env;

fn main() {
    for arg in env::args() {
        println!("{}", arg);
    }
}

panics if any arguments are not valid UTF-8

$ ./valid_string_panic foo bär $(echo -en "\x67\x72\xfc\xdf\x65") baz
./valid_string_panic
foo
bär
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "gr��e"', ../src/libcore/result.rs:837
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Instead of std::env::args, we can use std::env::args_os to collect the arguments

use std::env;

fn main() {
    for arg in env::args_os() {
        println!("{:?}", arg);

        //println!("{}", arg);
        // does not compile
    }
}

This gives us an OsString instead of a String. Right away, we can see it’s different because it won’t even compile if we try to print it with “{}”: OsString does not implement Display. When we change to “{:?}”, we get junk for invalid UTF-8

$ ./valid_string_garbage foo bär $(echo -en "\x67\x72\xfc\xdf\x65") baz
"./valid_string_garbage"
"foo"
"bär"
"gr��e"
"baz"

To check that it’s valid, we can try to view the OsString as a string slice. The to_str method returns an Option<&str>, which is None if the contents are not valid UTF-8, and we can check that

use std::env;

fn main() {
    for arg in env::args_os() {
        match arg.to_str() {
            Some(s) => println!("{}", s),
            None => println!("not valid"),
        }
    }
}

Thus we get

$ rustc valid_string.rs
$ ./valid_string foo bär $(echo -en "\x67\x72\xfc\xdf\x65") baz
./valid_string
foo
bär
not valid
baz

just as in Go.
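If we want to keep the rejected arguments around rather than just report them, OsString::into_string may be a better fit than to_str: it returns a Result, and the Err case hands back the original OsString. Here is a rough sketch, where the helper name split_args is invented for illustration. Printing the rejects with to_string_lossy is an assumption about what we would want to show, not a requirement.

```rust
use std::env;
use std::ffi::OsString;

// Hypothetical helper: separate arguments that are valid UTF-8 from
// those that are not, keeping the raw value of each rejected argument.
fn split_args<I: IntoIterator<Item = OsString>>(args: I) -> (Vec<String>, Vec<OsString>) {
    let mut valid = Vec::new();
    let mut invalid = Vec::new();
    for arg in args {
        match arg.into_string() {
            Ok(s) => valid.push(s),
            // On failure we get the untouched OsString back.
            Err(raw) => invalid.push(raw),
        }
    }
    (valid, invalid)
}

fn main() {
    let (valid, invalid) = split_args(env::args_os());
    for s in &valid {
        println!("{}", s);
    }
    for raw in &invalid {
        // to_string_lossy replaces invalid sequences with U+FFFD.
        eprintln!("not valid: {}", raw.to_string_lossy());
    }
}
```

With this approach the valid arguments are ordinary Strings from then on, and the invalid ones are still available as raw OsString values if we need to pass them along to the operating system unchanged.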

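So even though both Go and Rust use UTF-8 for strings, they do not follow the same model. A Go string can hold arbitrary bytes and we check validity when we care to, while a Rust String is guaranteed to be valid and invalid bytes have to be dealt with before they get in. There’s more to it. When it comes to encodings, there’s always more to it!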
