Cryptopals Set 1

This blog now supports annotations! Highlight a field to comment, and if I'm wrong, I will pay you.

Cryptopals is a set of cryptography challenges. Pretty straightforward– just do them.

I first stumbled across this sophomore year (nearly 4 years ago, I’m old :)) and I’ve been meaning to finish it up for years, only getting up to ~halfway set 3. I’ve made excuse after excuse, like school or internships or other projects, but no more! After my resolution and stumbling across long-lost Set 8 I’ve resolved to finish this up, plus I’ve definitely forgotten lots of these solutions by now.

Frankly, Cryptopals walkthroughs are a dime a dozen and already has thousands of samples. Still though, I think publicly posting this challenge on my site will be motivating enough to hold me accountable and finish this damn thing.

All of my solutions will be in Python3. I choose this because 1) the infrastructure around Python3 is fantastic and 2) personally, I find I can iterate over a solution incredibly quickly in Python, more so than other interpreted languages frankly and 3) Lots of solutions are already in Python2! Again, because this is a pretty often-done challenge released years ago.

Please forgive the noob Python– but hey, if it gets the job done, it gets the job done!

Set 1 Challenge 1 (Number 1)

Puzzle

The string:

49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d

Should produce:

SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t

Solution

Pretty straightforward, no explanation necessary.

import codecs

def hex_to_base64(hex_str):
    return codecs.encode( codecs.decode(hex_str, 'hex'), 'base64').decode()

Set 1 Challenge 2 (Number 2)

Puzzle

Write a function that takes two equal-length buffers and produces their XOR combination.

If your function works properly, then when you feed it the string:

1c0111001f010100061a024b53535009181c

… after hex decoding, and when XOR’d against:

686974207468652062756c6c277320657965

… should produce:

746865206b696420646f6e277420706c6179

Solution

Also quite straightforward. I feel that no explanation is needed.

I did see a solution that looped over every byte in the hex string using chr(), but… why, when XOR-ing an integer is the same as XOR-ing every bit?

def xor_hex_strings(hex_str_one, hex_str_two):
    assert(len(hex_str_one) == len(hex_str_two))

    return hex(int(hex_str_one, 16) ^ int(hex_str_two, 16) )

assert('746865206b696420646f6e277420706c6179' == xor_hex_strings('1c0111001f010100061a024b53535009181c', '686974207468652062756c6c277320657965')[2:])

Set 1 Challenge 3 (Number 3)

Puzzle

The hex encoded string:

1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736

… has been XOR’d against a single character. Find the key, decrypt the message.

You can do this by hand. But don’t: write code to do it for you.

How? Devise some method for “scoring” a piece of English plaintext. Character frequency is a good metric. Evaluate each output and choose the one with the best score.

Solution

I’ll admit– I spent an embarassing long time on this, maybe 2 hours? I was very resistant on looking up the solution, indeed trying to come up with my own “scoring” scheme (and was actually very proud of myself initially!). Because I think it’s important to acknowledge and learn from mistakes, I’ll talk about my thought process and why I took so long.

My first iteration of a “scoring” scheme looked something like this:

def decrypt_xor(ciphertext, guess_key_byte):
    for c in ciphertext:
        decrypt_c = c ^ guess_key_byte
        if decrypt_c > 31 and decrypt_c < 127: score += 5

My logic for this was because any plaintext would have to be “human”, i.e. “typeable”– no message you get would ever have a visible EOL.

Obviously, the above code didn’t work. Then I looked at the hint closer– “English plaintext”. I put further constraints on the if condition, and “subtracted” score/likelihood if the decrypted byte was “weird”, such as < or *, since 90% of the time people don’t use those in normal speech.

After killing a good hour wasted playing with the if conditions, I took a shower and thought :). I thought more about what made speech “English” or a “likely” word, when I remembered context-free grammars. Obviously I didn’t implement a full CFG library, but the idea stuck– I could assign probabilities based off of letters.

Despite this idea, I went down another rabbit hole– I created my own map of letters and assigned my own “score” again, such as giving “E” a score of “26” since it was a common letter, and “Z” a “1” for being uncommon. I gave up and googled letter frequencies.

There are several lessons to take away– take advantage of Google, take a step back when you hit a wall, rely on previously built work, etc. Overall, fun challenge, and I think effort well spent.

The full solution is a little long to post, and so I may throw solutions up on my Github, but here’s the core snippet–

englishLetterFreq = {'e': 12.70, 't': 9.06, 'a': 8.17, 'o': 7.51, 'i': 6.97, 'n': 6.75, 's': 6.33, 'h': 6.09, 'r': 5.99, 'd': 4.25, 'l': 4.03, 'c': 2.78, 'u': 2.76, 'm': 2.41, 'w': 2.36, 'f': 2.23, 'g': 2.02, 'y': 1.97, 'p': 1.93, 'b': 1.29, 'v': 0.98, 'k': 0.77, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07, ' ': 1}

...     # insert code to format string properly

for c in hex_str:
    if chr(c) in englishLetterFreq:
        score += englishLetterFreq[chr(c)]
    else:
        score -= 10     # kinda assuming if it's not an alphanumeric character, it "loses" score
return score

Set 1 Challenge 4 (Problem 4)

Puzzle

One of the 60-character strings in this file has been encrypted by single-character XOR.

Find it.

(Your code from #3 should help.)

Solution

This puzzle is straightfoward, especially since most of the “scoring” code was done previously. Simply put, we can iterate over every line in the file and check what line has the best score.

def decrypt_file(filename):
    with open(filename, 'r') as f:
        score = 0
        best_decrypt_string = ''
        best_key = 0
        for line in f:
            (decrypt, temp_score, key) = xor_single_byte(line.strip())
            if temp_score > score:
                score = temp_score
                best_decrypt_string = decrypt
                best_key = key

        return (best_decrypt_string, score, best_key)

Set 1 Challenge 5 (Problem 5)

Puzzle

Burning ’em, if you ain’t quick and nimble I go crazy when I hear a cymbal

Encrypt it, under the key “ICE”, using repeating-key XOR.

In repeating-key XOR, you’ll sequentially apply each byte of the key; the first byte of plaintext will be XOR’d against I, the next C, the next E, then I again for the 4th byte, and so on.

It should come out to:

0b3637272a2b2e63622c2e69692a23693a2a3c6324202d623d63343c2a26226324272765272 a282b2f20430a652e2c652a3124333a653e2b2027630c692b20283165286326302e27282f

Solution

This is just as simple as challenge two where we XOR-encrypted with a single byte. We can simple pad the key ICE to the length of our message and XOR the two together.

from chall_two import xor_hex_strings
from math import ceil

def xor_encrypt_string(plaintext, key):
    # need to pad key, and cut key, for proper xor-ing. Only works if key.len < plaintext.len
    padded_key = key*( ceil( len(plaintext)/len(key) ) )
    if len(padded_key) > len(plaintext):
        padded_key = padded_key[:len(plaintext)]
        # assuming the plaintext and key are in hex
        hex_plain = plaintext.encode().hex()
        hex_key = padded_key.encode().hex()
    return xor_hex_strings(hex_plain, hex_key)

#plaintext = """Burning 'em, if you ain't quick and nimble
#I go crazy when I hear a cymbal"""
#key = "ICE"
#print( xor_encrypt_string(plaintext, key)[2:] )

Set 1 Challenge 6 (Problem 6)

Puzzle

There’s a file here. It’s been base64’d after being encrypted with repeating-key XOR.

Decrypt it.

Here’s how:

Let KEYSIZE be the guessed length of the key; try values from 2 to (say) 40.

Write a function to compute the edit distance/Hamming distance between two strings. The Hamming distance is just the number of differing bits. The distance between:

this is a test and wokka wokka!!! is 37. Make sure your code agrees before you proceed.

For each KEYSIZE, take the first KEYSIZE worth of bytes, and the second KEYSIZE worth of bytes, and find the edit distance between them. Normalize this result by dividing by KEYSIZE.

The KEYSIZE with the smallest normalized edit distance is probably the key. You could proceed perhaps with the smallest 2-3 KEYSIZE values. Or take 4 KEYSIZE blocks instead of 2 and average the distances.

Now that you probably know the KEYSIZE: break the ciphertext into blocks of KEYSIZE length.

Now transpose the blocks: make a block that is the first byte of every block, and a block that is the second byte of every block, and so on.

Solve each block as if it was single-character XOR. You already have code to do this.

For each block, the single-byte XOR key that produces the best looking histogram is the repeating-key XOR key byte for that block. Put them together and you have the key.

This code is going to turn out to be surprisingly useful later on. Breaking repeating-key XOR (“Vigenere”) statistically is obviously an academic exercise, a “Crypto 101” thing. But more people “know how” to break it than can actually break it, and a similar technique breaks something much more important.

Solution

As the last sentence of the puzzle says, this is stuff that’s taught on the first day of class. However, simply being aware of how an attack works doesn’t mean you can actually do it– so let’s prove it!

The full code is certainly too long to put on a page, so I’ll highlight snippets and throw it up on Github later.

Step 1), compute the Hamming Distance, is easy. As Wikipedia literally says one can just compute a ^ b and sum the number of resulting ones. This is because Hamming Distance is a measure of how “different” two strings are, and XOR intrinsically has this property (it ain’t called EXCLUSIVE-or for nothing!).

    xor_strings = xor_hex_strings(hex_str1, hex_str2)
    num_ones = sum([1 for c in bin(int(xor_strings, 16))[2:] if c == '1'])

On the rest of the steps: Looking at some other solutions, they tend to simply say “This is ECB and ECB sucks” without giving a great explanation, not one I found satisfying anyways, so here’s my attempt at explaining why this works.

As Wikipedia says about ECB mode.

Because ECB encrypts identical plaintext blocks into identical ciphertext blocks, it does not hide data patterns well

It’s considered “common knowledge” to never use ECB. It’s gotten to the point where some clients I’ve dealt with shied away from ECB. I find this “advice” has gotten slightly misconstrued and has lost a good explanation, such as this link that seems to imply that usage of ECB itself is a vulnerability. WHAT???? Not at all. It’s a case where what we’re taught in school is correct 90% of the time, but that 10% is where all the fun nuance is.

Going back to the problem– each block is encrypted with the same key and the plaintext is in English. One characteristic of the English language is that its ASCII representation tends to fall under a specific bit range– ~31 to ~127.

This is vastly different than a properly encrypted byte, which would have a normal distribution across 0-255. Thereforce, A ^ B would result in a smaller Hamming Distance if A and B were English than anything else, and this fact is exposed further the more samples of A and B you use. It’s because of the predictability of both 1) key usage and 2) plaintext distribution that makes usage of ECB in this way insecure.

Below is the major snippet of code that computes the Hamming Distance. The puzzle recommends doing 2 or 4 blocks, but I did them all anyways. I believe when I first did this I got sizes 18, 29, and 5 as “being small”.

for i in range(2, 40):
    block_array = [message[j*i:(j+1)*i] for j in range(0, len(message), i) if len(message[j*i:(j+1)*i]) == i]
        
    total = [ hamming_distance(block_array[c], block_array[c+1])/i for c in range(0, len(block_array)-1) ]
    keysz_scores[i] = sum(total)/len(total)

Next, we group the N-th byte of each block together. This is because that N-th byte was encrypted with the same N-th byte of the key. This enables us to use the xor_single_byte function wrote in a previous challenge pretty easily.

My partitioned_blocks function does just that (ommitted for brevity).

for key_sz, score in best_key_scores:
         transposed_blocks = transpose_blocks(message, key_sz)

         #print(f"\n------\nUsing KEYSIZE = {key_sz}\n------")
         probable_key = ''
         for block in transposed_blocks:
             (decrypted_str, score, key_byte) = xor_single_byte(block.encode().hex())
             if score != 0:
                probable_key += chr(key_byte)

         if key_sz == len(probable_key):
             potential_keys.append(probable_key)

Then, looping over the suspected keys, we can attempt decrypting the message. From there, it’s pretty obvious what works and what doesn’t judging by what’s English!

I hope the Terminator brings the noise next time :)

Set 1 Challenge 7 (Problem 7)

Problem

The Base64-encoded content in this file has been encrypted via AES-128 in ECB mode under the key

"YELLOW SUBMARINE".

(case-sensitive, without the quotes; exactly 16 characters; I like “YELLOW SUBMARINE” because it’s exactly 16 bytes long, and now you do too).

Decrypt it. You know the key, after all.

Solution

I used python’s cryptography. No reason. Simple stuff. I believe other solutions used PyCrypto but that’s quite inactive (though Cryptodome is an active fork for this). Too many crypto libraries these days.

def aes_ecb_encrypt_with_key(plaintext, key):
    CipherObj = Cipher(algorithms.AES(key), modes.ECB(), backend=default_backend() )
    encryptor = CipherObj.encryptor()
    return (encryptor.update(plaintext) + encryptor.finalize(), CipherObj)

def aes_ecb_decrypt(CipherObj, ciphertext):
    decryptor = CipherObj.decryptor()
    return decryptor.update(ciphertext) + decryptor.finalize()

Set 1 Challenge 8 (Problem 8)

Problem

In this file are a bunch of hex-encoded ciphertexts.

One of them has been encrypted with ECB.

Detect it.

Remember that the problem with ECB is that it is stateless and deterministic; the same 16 byte plaintext block will always produce the same 16 byte ciphertext.

Solution

When I first did this, I think I overkilled it by using the same functions I wrote in 6.

I realized now I wasted a good amount of time by not fully understanding the hint– The same plaintext block will always produce the same ciphertext block. Basically, when you split all the ciphertexts in the file into 16B blocks, you’ll find that some blocks are literally simply repeated. The following code determines the number of repeated blocks in a ciphertext.

for line in f:
    hexbytes = bytes.fromhex(line.strip())
    blocks = [hexbytes[i:(i+16)].hex() for i in range(0, len(hexbytes), 16)]
    temp_score = len(set(blocks))
    if temp_score < likely_score:
        likely_score = temp_score
        likely_text = hexbytes.hex()
        found_line = i
    i += 1

Frankly I didn’t love this challenge, especially since it would’ve been a chance to use what we learned in six, but whatever– easy stuff.

That’s it for set 1! I probably spent as much time writing this as I did solving this. Writing it, however, definitely got me motivated to do further sets.

2020/01/21

Set 1 Challenge 1 (Number 1)

Puzzle

Solution

Set 1 Challenge 2 (Number 2)

Puzzle

Solution

Set 1 Challenge 3 (Number 3)

Puzzle

Solution

Set 1 Challenge 4 (Problem 4)

Puzzle

Solution

Set 1 Challenge 5 (Problem 5)

Puzzle

Solution

Set 1 Challenge 6 (Problem 6)

Puzzle

Solution

Set 1 Challenge 7 (Problem 7)

Problem

Solution

Set 1 Challenge 8 (Problem 8)

Problem

Solution