Machine Learning and Cryptography: Getting to Know CipherGAN
Machine learning is now used, to one degree or another, across many industries, and cryptanalysis is no exception. In this article, we will look at CipherGAN, a generative adversarial network used to recover the underlying encryption mapping from unpaired banks of ciphertext and plaintext.
CycleGAN
At the heart of CipherGAN is another generative adversarial network — CycleGAN. This type of network is used for image style transfer: for example, CycleGAN can be trained to convert images from one domain (e.g., Fortnite) to another, such as PUBG. Training is unsupervised, meaning there is no one-to-one pairing between images of the two domains.
This opens up the possibility of performing many interesting tasks, such as improving photo quality, coloring images, style transfer, etc. All you need is a source and target dataset (which is just a catalog of images). The general principle of this network is shown in the following diagram.
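The key idea that makes unpaired training work is cycle consistency: translating an image from X to Y and back again should reproduce the original, which is enforced by a reconstruction penalty. The sketch below illustrates that loss with toy linear "generators" standing in for CycleGAN's real convolutional networks (the matrices `G` and `F` are my own illustrative stand-ins, not part of any library):

```python
import numpy as np

# Toy "generators": linear maps standing in for CycleGAN's networks.
# G maps domain X -> Y, F maps Y -> X. In the real model both are deep
# networks trained jointly with adversarial discriminators.
rng = np.random.default_rng(0)
G = np.array([[2.0, 0.0], [0.0, 2.0]])  # X -> Y (here: scaling by 2)
F = np.linalg.inv(G)                    # Y -> X (here: the exact inverse)

def cycle_consistency_loss(x, G, F):
    """Mean L1 distance between x and F(G(x)) -- the cycle loss |F(G(x)) - x|."""
    x_reconstructed = (x @ G.T) @ F.T
    return np.abs(x_reconstructed - x).mean()

x = rng.normal(size=(4, 2))  # a batch of points from domain X
loss = cycle_consistency_loss(x, G, F)
print(loss < 1e-9)  # True: F inverts G exactly, so the cycle loss vanishes
```

During real training, `G` and `F` are imperfect, so this term is added to the adversarial losses and pushes the two generators toward being mutual inverses.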
But let's return to the main topic of our article — CipherGAN. As mentioned above, this network is used to determine the basic encryption mapping from banks of unpaired ciphertext and plaintext.
Substitution ciphers exhibit the property of confusion, which obscures the relationship between the encryption key and the resulting ciphertext. Confusion, together with diffusion, is a basic building block of modern symmetric cryptography. Thus, understanding the structure and vulnerabilities of substitution ciphers can form a basis for the cryptanalysis of modern ciphers. Table 2 presents various AI-based cryptanalysis studies.
CipherGAN, however, is capable of breaking language data encrypted with shift ciphers and the Vigenère cipher with high accuracy, and for vocabularies significantly larger than previously achieved.
Let's recall what the Vigenère cipher is. A table of alphabets, called the tabula recta or the Vigenère square (table), can be used for encryption. For the Latin alphabet, the Vigenère table consists of 26 rows of 26 characters, with each subsequent row cyclically shifted by one position relative to the previous one.
At each stage of encryption, different alphabets are used, chosen depending on the character of the keyword. For example, suppose the original text looks like this:
ATTACKATDAWN
The person sending the message writes the keyword ("LEMON") cyclically until its length matches the length of the original text:
LEMONLEMONLE
The first character of the plaintext ("A") is encrypted using row L, since "L" is the first character of the key: the first character of the ciphertext ("L") is at the intersection of row L and column A in the Vigenère table. Similarly, the second character of the plaintext uses the second character of the key; that is, the second character of the ciphertext ("X") is found at the intersection of row E and column T. The rest of the plaintext is encrypted in the same way.
Original text: ATTACKATDAWN
Key: LEMONLEMONLE
Encrypted text: LXFOPVEFRNHR
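The table lookup above is equivalent to adding the key letter to the plaintext letter modulo 26. A short Python sketch reproduces the worked example:

```python
def vigenere_encrypt(plaintext: str, key: str) -> str:
    """Classic Vigenère: shift each plaintext letter by the matching key letter."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(key[i % len(key)]) - ord('A')
        out.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
    return ''.join(out)

def vigenere_decrypt(ciphertext: str, key: str) -> str:
    """Inverse operation: subtract the key shift modulo 26."""
    out = []
    for i, ch in enumerate(ciphertext):
        shift = ord(key[i % len(key)]) - ord('A')
        out.append(chr((ord(ch) - ord('A') - shift) % 26 + ord('A')))
    return ''.join(out)

print(vigenere_encrypt("ATTACKATDAWN", "LEMON"))  # LXFOPVEFRNHR
print(vigenere_decrypt("LXFOPVEFRNHR", "LEMON"))  # ATTACKATDAWN
```

Note that with a one-character key this degenerates into a simple shift (Caesar) cipher, the other cipher family CipherGAN is trained on.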
Of course, by modern standards, this is, to put it mildly, not the most powerful encryption algorithm, but for educational purposes it can be quite useful.
Installing CipherGAN
CipherGAN is written in Python, which makes working with this network much easier. To install, run:
pip install -r CipherGAN/requirements.txt
To train our neural network, we use data generators. Special attention should be paid to the cipher_generator (https://github.com/for-ai/CipherGAN/blob/master/data/data_generators/cipher_generator.py), which can be used to generate data for shift and Vigenère ciphers.
The settings required for generation are passed to the script as flags. For example, to generate word-level Vigenère cipher data (key "345") over the Brown corpus with a vocabulary of 200 words, run:
python CipherGAN/data/data_generators/cipher_generator.py \
  --cipher=vigenere \
  --vigenere_key=345 \
  --percentage_training=0.9 \
  --corpus=brown \
  --vocab_size=200 \
  --test_name=vigenere345-brown200-eval \
  --train_name=vigenere345-brown200-train \
  --output_dir=tmp/data \
  --vocab_filename=vigenere345_brown200_vocab.txt
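At the word level, the cipher operates on vocabulary indices rather than letters: each word is mapped to a token id, and the key shifts those ids modulo the vocabulary size. The sketch below is only an illustration of that idea under my own assumptions (the function name `word_level_vigenere` is hypothetical, not the repository's API — see cipher_generator.py for the actual implementation):

```python
def word_level_vigenere(token_ids, key_digits, vocab_size):
    """Illustrative sketch: shift each token id by the matching key digit,
    wrapping around modulo the vocabulary size."""
    return [(t + key_digits[i % len(key_digits)]) % vocab_size
            for i, t in enumerate(token_ids)]

# A toy sentence as word ids in a 200-word vocabulary, key "345"
# as in the generator flags above.
ids = [17, 42, 199, 3]
print(word_level_vigenere(ids, [3, 4, 5], 200))  # [20, 46, 4, 6]
```

This is why `--vocab_size` matters for the cipher itself, not just for tokenization: it defines the modulus of the substitution.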
Starting training
To train the neural network, you can use the train.py script shown below.
import shutil
import os

import tensorflow as tf

from .hparams.registry import get_hparams
from .models.registry import _MODELS
from .data.registry import _INPUT_FNS, get_dataset
from .metrics.registry import get_metrics
from .train_utils.lr_schemes import get_lr
from .train_utils.vocab_utils import read_vocab

tf.flags.DEFINE_string("model", "cycle_gan", "Which model to use.")
tf.flags.DEFINE_string("data", "cipher", "Which data to use.")
tf.flags.DEFINE_string("hparam_sets", "cipher_default", "Which hparams to use.")
tf.flags.DEFINE_string("hparams", "", "Run-specific hparam settings to use.")
tf.flags.DEFINE_string("metrics", "xy_mse",
                       "Dash separated list of metrics to use.")
tf.flags.DEFINE_string("output_dir", "tmp/tf_run", "The output directory.")
tf.flags.DEFINE_string("data_dir", "tmp/data", "The data directory.")
tf.flags.DEFINE_integer("train_steps", 1e4,
                        "Number of training steps to perform.")
tf.flags.DEFINE_integer("eval_steps", 1e2,
                        "Number of evaluation steps to perform.")
tf.flags.DEFINE_boolean("overwrite_output", False,
                        "Remove output_dir before running.")
tf.flags.DEFINE_string("train_name", "data-train*",
                       "The train dataset file name.")
tf.flags.DEFINE_string("test_name", "data-eval*", "The test dataset file name.")

FLAGS = tf.app.flags.FLAGS
tf.logging.set_verbosity(tf.logging.INFO)


def _run_locally(train_steps, eval_steps):
  """Run training, evaluation and inference locally.

  Args:
    train_steps: An integer, number of steps to train.
    eval_steps: An integer, number of steps to evaluate.
  """
  hparams = get_hparams(FLAGS.hparam_sets)
  hparams = hparams.parse(FLAGS.hparams)
  hparams.total_steps = FLAGS.train_steps

  if "vocab_file" in hparams.values():
    hparams.vocab = read_vocab(hparams.vocab_file)
    hparams.vocab_size = len(hparams.vocab)
    hparams.vocab_size += int(hparams.vocab_size % 2 == 1)

  hparams.input_shape = [hparams.sample_length, hparams.vocab_size]

  output_dir = FLAGS.output_dir
  if os.path.exists(output_dir) and FLAGS.overwrite_output:
    shutil.rmtree(FLAGS.output_dir)
  if not os.path.exists(output_dir):
    os.makedirs(output_dir)

  def model_fn(features, labels, mode):
    lr = get_lr(hparams)
    return _MODELS[FLAGS.model](hparams, lr)
Training is then launched with the following command:

python -m CipherGAN.train \
  --output_dir=runs/vig345 \
  --test_name="vigenere345-brown200-eval*" \
  --train_name="vigenere345-brown200-train*" \
  --hparam_sets=vigenere_brown_vocab_200
Conclusion
Of course, artificial intelligence still cannot crack AES or GOST. The CipherGAN network presented in this article can currently break only classical, educational ciphers such as the Vigenère cipher. In the future, however, neural networks may well become an effective assistant for cryptanalysts.
The article was prepared in anticipation of the start of the course "Cryptographic Information Protection".