Engineering Blog

Base 64 - padding and URLs

Written by Juraj Martinka | Jun 15, 2022 11:14:03 AM

If you know the size of your input beforehand, it should be safe to leave out the padding character ('=') so you don’t need to percent-encode it for safe usage in URLs.



The spec

My colleague added a spec for Base 64 values - initially, it looked like this:

base64 - clojure.spec

That made me check what is actually a valid set of Base 64 characters.



Two variants of Base 64 encoding

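RFC 4648 lists two common variants:

  • The standard "base64" alphabet (section 4): A-Z, a-z, 0-9, + and /

  • The URL and filename safe "base64url" alphabet (section 5): A-Z, a-z, 0-9, - and _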

Notice that both of them define = for padding.
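
To see the difference between the two alphabets, here is a small sketch using the JDK's java.util.Base64. The class name Base64Variants and the two input bytes are my own choices for illustration; the bytes are picked so that the output lands on the last two positions of the alphabet, which is the only place the variants differ.

```java
import java.util.Base64;

class Base64Variants {

    // Standard alphabet: positions 62 and 63 are '+' and '/'
    static String standard(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // URL and filename safe alphabet: positions 62 and 63 are '-' and '_'
    static String urlSafe(byte[] data) {
        return Base64.getUrlEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // 0xfb 0xff splits into the 6-bit groups 62, 63 and 60
        byte[] data = {(byte) 0xfb, (byte) 0xff};
        System.out.println(standard(data)); // +/8=
        System.out.println(urlSafe(data));  // -_8=
    }
}
```

Both encoders pad by default, which matches the RFC: the padding character is the same in both variants.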



To pad or not to pad?

The thing is that = is not really safe to use in URLs, so you either need to encode it (additional hassle) or leave it out (and make sure you don’t break decoding). The RFC says:

The pad character "=" is typically percent-encoded when used in an URL.
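
A rough sketch of what that means in practice, assuming Java 10 or newer for the URLEncoder.encode(String, Charset) overload. The class and method names are made up for this example, and URLEncoder implements form encoding, so treat it as an approximation of URL percent-encoding rather than the final word.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class PaddingInUrls {

    // Form-style percent-encoding; '=' becomes %3D
    static String percentEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "abc1230901".getBytes(StandardCharsets.UTF_8);

        String padded = Base64.getUrlEncoder().encodeToString(data);
        String unpadded = Base64.getUrlEncoder().withoutPadding().encodeToString(data);

        // The padded value changes when percent-encoded, the unpadded one does not
        System.out.println(padded + " -> " + percentEncode(padded));
        System.out.println(unpadded + " -> " + percentEncode(unpadded));
    }
}
```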

 

Hence I asked, in the pull request comment, if we should include = too.

However, section 3.2 also says this:

In some circumstances, the use of padding ("=") in base-encoded data is not required or used. In the general case, when assumptions about the size of transported data cannot be made, padding is required to yield correct decoded data.


In fact, as the API Security in Action book tells us, it’s common to exclude padding because you often know the whole encoded value before the decoding process starts.
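
Why is the padding redundant once you hold the whole value? The length alone tells you how the last group ends. Here is a minimal sketch of that arithmetic; the class UnpaddedLength and its methods are hypothetical helpers of mine, not something from the book or the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class UnpaddedLength {

    // How many '=' characters a padding encoder would have appended.
    // Only meaningful for valid lengths (remainder 0, 2 or 3).
    static int missingPadding(int unpaddedLength) {
        return (4 - unpaddedLength % 4) % 4;
    }

    // How many bytes an unpadded value of this length decodes to.
    static int decodedLength(int unpaddedLength) {
        // A single leftover character carries only 6 bits, less than one byte
        if (unpaddedLength % 4 == 1) {
            throw new IllegalArgumentException("not a valid Base 64 length: " + unpaddedLength);
        }
        return unpaddedLength * 3 / 4;
    }

    public static void main(String[] args) {
        String encoded = Base64.getUrlEncoder().withoutPadding()
                .encodeToString("abc1230901".getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded + " has " + encoded.length() + " characters");
        System.out.println("padding it would have had: " + missingPadding(encoded.length()));
        System.out.println("bytes it decodes to: " + decodedLength(encoded.length()));
    }
}
```

For the 14-character value this gives 2 padding characters and 10 decoded bytes, so a decoder that sees the complete string loses nothing when the = characters are dropped.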

The book contains this class for encoding/decoding:

import java.nio.charset.Charset;
import java.util.Base64;

public class Base64Url {

    // For more about padding and when it's needed see: https://stackoverflow.com/questions/4080988/why-does-base64-encoding-require-padding-if-the-input-length-is-not-divisible-by
    private static final Base64.Encoder encoder = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder decoder = Base64.getUrlDecoder();

    public static String encode(byte[] data) {
        // Note that this uses ISO-8859-1 - should be safe since Base64 uses only ASCII characters anyway
        return encoder.encodeToString(data);
    }

    public static byte[] decode(String encoded) {
        return decoder.decode(encoded);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println(encode("ahojľščáýíô".getBytes()));
        System.out.println("decoded: " + new String(decode(encode("ahojľščáýíô".getBytes()))));
    }
}

 

 

Base 64 with Clojure and Buddy Library

In our code, we are using buddy.core.codecs/bytes->b64u. Let’s look at an example:

(-> "abc1230901" .getBytes codecs/bytes->b64u String.)
;;=> "YWJjMTIzMDkwMQ"

 

The docstring of the function is rather sparse:

Encode data to base64 byte array (using url-safe variant).


There’s also a function for the standard version of the encoding: bytes->b64. Notice the difference (the padding characters at the end):

(-> "abc1230901" .getBytes codecs/bytes->b64 String.)
;;=> "YWJjMTIzMDkwMQ=="

 

What the docstring of bytes->b64u doesn’t say, but it’s obvious when looking at the code, is that the function doesn’t use padding:

(defn bytes->b64u
  "Encode data to base64 byte array (using url-safe variant)."
  {:added "1.8.0"}
  [^bytes data]
  (let [^Base64$Encoder encoder (-> (java.util.Base64/getUrlEncoder)
                                    (.withoutPadding))] ;; (1)
    (.encode encoder data)))


Notice the .withoutPadding method call (1), the same one used in the Base64Url class from API Security in Action.

So the end result is immediately safe to use in URLs without any additional encoding of = characters. But you shouldn’t rely on unpadded values when decoding incomplete inputs, because you may get wrong results.
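
One way this can go wrong, as an illustration: if the decoder cannot tell where one value ends, unpadded data gives it nothing to go on. The sketch below glues two unpadded values together, which is the kind of situation the RFC has in mind when it says that assumptions about the size of the data cannot be made. The class name is mine, and the decoder is the plain JDK one.

```java
import java.util.Arrays;
import java.util.Base64;

class UnpaddedConcatenation {

    static String encode(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    static byte[] decode(String encoded) {
        return Base64.getUrlDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        String one = encode(new byte[] {'a'}); // "YQ"

        // Two complete values back to back, with nothing marking the boundary
        String glued = one + one; // "YQYQ"

        System.out.println(Arrays.toString(decode(one)));   // [97]
        // Decodes without any error, but not to two 'a' bytes
        System.out.println(Arrays.toString(decode(glued))); // [97, 6, 16]
    }
}
```

The decoder happily treats "YQYQ" as one full group of four characters, so the result is silently wrong instead of failing.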



What about spec?

What I would do is the following:

  • Rename the spec to ::base64url-without-padding to make it clear that we deliberately don’t use padding, and add an explanation of when this is safe to use

  • Remove + and / from the set of valid characters

  • Add - as it’s a valid character in the URL safe version

Alternatively, we could have a more generic spec but it would need to allow all the alphanumeric characters, +, /, -, _, and = (the pad character).
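
To make the stricter option concrete, here is the same character-set rule written as a Java regular expression. This is only a sketch of the idea, not our actual clojure.spec definition: the class name, the pattern, the non-empty requirement, and the extra length check are all assumptions of mine.

```java
import java.util.regex.Pattern;

class Base64UrlShape {

    // Hypothetical pattern: URL-safe alphabet only, so no '+', '/' or '='
    private static final Pattern UNPADDED_URL_SAFE = Pattern.compile("[A-Za-z0-9_-]+");

    static boolean looksLikeUnpaddedBase64Url(String s) {
        // Length remainder 1 can never occur: one leftover character holds only 6 bits
        return UNPADDED_URL_SAFE.matcher(s).matches() && s.length() % 4 != 1;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUnpaddedBase64Url("YWJjMTIzMDkwMQ"));   // true
        System.out.println(looksLikeUnpaddedBase64Url("YWJjMTIzMDkwMQ==")); // false, padded
        System.out.println(looksLikeUnpaddedBase64Url("a+b/"));             // false, standard alphabet
    }
}
```

A check like this only validates the shape of the string. It says nothing about whether the decoded bytes are what you expect.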