GaCrackEn

========================================================   
 This work is licensed under a Creative Commons License 
 See http://creativecommons.org/licenses/by-nc-nd/2.0/ 
========================================================   
 
Cracking codes with Genetic Algorithms 
by +mala 
200412 
 
1.  Introduction 
1.1 Note 
 
2.  GA Basics 
2.1 What is a GA? 
2.2 how does a GA work? 
 
3.  GA Techniques 
3.1 Choosing the Problem 
3.2 Model and chromosomes 
3.3 Fitness 
3.4 Reproduction e Mutation 
3.5 Genetic Algorithms and probability
 
4.  Programming 
4.1 GAlib 
4.2 Generic structure of a GA in C++ 
 
5.  Examples 
5.1 NSCE 
5.2 Mono-alphabetic Ciphers 
 
6.  Bibliography 
 
 
=============================================================================   
1.  Introduction
----------------------------------------------------------------------------- 
 
It was February 2003 when, on an anti-crack forum, I read the thread (that
you can still find at http://board.anticrack.de/viewtopic.php?t=1108)
entitled "Genetic Algorithms". The+Q wrote: 
 
<<This made me think, is there a limit to GA? More and more "un-solvable" 
problems are solved by GA, and im starting to feel its only a matter of 
time until hard crypto algorithms are cracked by such meens..>>  
 
The+Q wasn't the only one to  nurture such strong enthusiasm for genetic
algorithms  
: in the forum several reversers started to ask information on how to
develop programs that used them and I soon started to study and experiment
with them. Everyone thought that by using GA's we could radically change the
approach to reversing and  crypto-analysis. 
 
Now, two years later not much has changed. Instead, the old threads
On genetic algorithms are drying up and very little new material on GA has
come up: its enough to know that, even now, the tutorial that 
The+Q wrote in 2003 is cited as an example of genetic algorithm applied to
cracking. 
 
Why did this happen? Why did people stop talking about how powerful genetic
algorithms were soon after trying to use them? 
Is it not sufficient to just understand the theory in order to be able to
resolve every problem? 
I too, during my experiments, found myself more times incapable of finding a
solution  to problems that didn't seem at all so complicated, and every time
I asked myself where's the limit of these techniques. 
 
Even now I feel that I'm a long way from having learnt everything that
genetic algorithms 
Ask us to comprehend before being able to use them in a truly efficient way,
but at least I got past the stage of  "I don't know how they work, 
But they usually work, therefore they must work". 
I need this personal test to come to a conclusion, it's also for whoever
else that has the patience to follow me and start to work with GA's knowing
already (more or less) what to expect. 
 
 
1.1 Note 
----------------------------------------------------------------------------- 
 
I imagine you're already asking why this text has such a particular title
and if it will be of any real use, even if you're not reverse engineers or
crypto-analysts. 
 
Well, The response to the first question is fairly simple: given my dual
interest in both reversing and cryptography, it appeared interesting to me
to conduct experiments in both fields, accumulating them under the name of
"code cracking". In reality, the examples on which I worked are more varied:

not all of them regard code cracking, not all of them had a positive result.
How and ever , from every one of them I learnt something that I'll try to
transmit through these pages. 
 
As for the usefulness of the text your reading I believe that it's a
sufficient quantity of suggestions and principal ideas to permit anyone to
write genetic algorithms for whatever problem they want to resolve.
Naturally, a greater preparation on the specific argument to which you're
dedicating yourselves will noticeably simplify your work.
 
Lastly, if at the end of reading this text you're still alive and you're
still interested in genetic algorithms, try taking a look at 
http://mala.homeip.net/malawiki/GeneticAlgos for updates, source code, 
comments on the text, suggestions and voodoo curses. 
 
 
=============================================================================   
2.  GA Basics 
----------------------------------------------------------------------------- 
 
In this section I'll talk about genetic algorithms as simply and generally
as possible, introducing some definitions that will be then used in the rest
of this text. If you are already familiar with GA you can skip this section, 
but I'd advise you to take a look at it anyway, if for nothing else then
to verify that your terminology is exactly the same as mine. 
 
 
2.1 What is a GA? 
----------------------------------------------------------------------------- 
 
A GA is a Genetic Algorithm, that is an algorithm genetic (if you're asking
why they call it GA and not AG, do a search with google and see which gives
the most results). 
 
If we consult the relative  literature, we discover that  
Genetic Algorithms are a family of  computational models that try to evolve.
These models put into practice the Darwinian concept of "the  survival of
the fittest", parting from a population of possible solutions (usually
casual) to a given problem and leaving them to evolve on the basis of some
measurements of their success.  

Academics are still discussing trying to find a precise definition for
Genetic Algorithms. In the meantime, the same term is used with two
different exceptions: following a literal interpretation, this refers to a
literal model introduced and studied by John Holland (1975) and from his
students; 
With a meaning a little bit more ample, a Genetic Algorithms is a "model 
Based on a population that uses selection operators and recombination
To generate new points in a search space". 
 
Seeing as these definitions are as much rigorous as they are unclear, here
are a few examples that will help you better understand the significance of
what was written
A simple case (if not completely commonplace, but its one of the principles
that the literature offers us) could be that in which the PROBLEM consists
in  filling up a line with ones. The POPULATION, that is the group of
possible solutions, could then be represented by a series of strings of
initially casual "bits"  as long as the row desires to be. 
The MEASURE OF SUCCESS of a generic element of the population, which from
now on we'll refer to as CHROMOSOME, will be for example the number of bits
in a string that have a value of 1. You can easily guess that the best
chromosomes will be those closer to the solution, and it's these that will
have a better possibility of "surviving".
 
In a board game like, for example, Lights Out or Chinese draughts (which 
You can find  respectively, at http://www.2flashgames.com/f/f-35.htm and 
http://www.math.it/damacinese/dama.htm), the PROBLEM consists in ending the
game with a particular configuration: In Lights Out this deals with
switching off all the lights, whereas In Chinese draughts you need to end
with a single piece in the centre of the game board. The POPULATION, in both
cases,  
could be made up of CHROMOSOMES that represent sequences of moves, 
and the MEASURE OF SUCCESS could be the closeness to the configuration that
resolves the problem, that is how much one gets close to the configuration
with the least lights on or a single piece in the centre. 
 
In another example (that we'll describe in detail further on) The PROBLEM  
Consists in finding a password to register a program, The POPOLATION
Is made up from CHROMOSOMES that represent the _potentially_ valid password
(that is, satisfying some necessary conditions) an the MEASURE OF SUCCESS is
given by the closeness of the password to the correct one, calculated  
On the basis of the same algorithm used inside the program. 
 
A last example is that of Mono-alphabetic Ciphers (in which for every letter
of text there is only one other letter that always corresponds to another:
just so we understand each other, 
The same one used in the Puzzle Week). In this case, the PROBLEM 
Consists in finding a substitution of letters that resolves the cryptogram, 
the POPULATION is made up of CHROMOSOMES that corresponds to the various
cipher alphabets and the MEASURE OF SUCCESS is given from how much an
alphabet is good, 
that is , in how much the resulting deciphered text using that alphabet is a
valid test. 
 
 
2.2 How does a GA work? 
----------------------------------------------------------------------------- 
 
From the Wikipedia web page (http://it.wikipedia.org/wiki/Algoritmo): 
 
"In its most simple and intuitive definition an algorithm is a sequence  
 ordered into simple passes that have the purpose of bringing a more complex
task to an end (a cooking recipe, for example, could be considered as an  
 algorithm that parting from a group of single basic food elements
 and following a sequence of passes, produces as a result a prepared dish).
In a more formal mode, we can therefore define an algorithm  
 as an ordered sequence furnished with instructions that, given one or a
series of elements in input, produces one or a series of results in output."

 
In what sequence of passes, then, does a genetic algorithm follow?  
At a fairly abstract level  (later on we'll increase the detail), 
The operations to follow are the following : 
 
1) Initialization
   As a first step for the Algorithm a startup population is created , 
Made of casual solutions (chromosomes). When, in Genetic Algorithm theory we
talk about "solution to a problem", naturally, its meaning shouldn't be
intended as a "unique solution", or even a "correct solution". Just as a
mathematical function, on the basis of different input values, can give
different results, in the same way we can think about the presence of
different possible solutions. 
And naturally, as in the case of mathematics we can have a MAXIUM or an
OPTIMUM of a problem, 
The solution that were searching for in the end is an OPTIMUM. 
 
2) Calculation of the Fitness 
   A value of "fitness" is assigned to every  chromosome on the basis of its
   efficiency with respect to the problem.
   Usually the fitness value corresponds to the value of the function
objective
   (that is, that for which one desires to find the MAXIUM), but its not
   indispensable that its always like that: for example , in the
documentation
   of the GAlib (and , to be more precise , at the following address:
   http://lancet.mit.edu/galib-2.4/Overview.html)this distinction 
   is made note explicitly. 
 
3) Reproduction
   The chromosomes that have a higher fitness are those that more easily
become
   parents of the next solutions during the phases of reproduction.
   The choice of the parents however comes out in a casual manner , 
   All together the better members of a population are given a better    
   probability of being chosen. 
   Through operators that can change from one problem to another, the 
   chosen chromosomes can generate children , who in turn inherit part
   of their own genes from one parent and part from another. 
   Through a mutation operator, in the end, some casual changes are
performed
   on some of their children's genes before these enter to become part of
the
   new population. 
 
4) Successive Generation   
   If the optimal solution has been found (or if some other condition
imposed
   from outside has been obtained , for example: a max limit for execution
   time), the algorithm terminates.  
   In the contrary case, the old assembly of chromosomes is substituted by
the
   newly generated children: the new generation is complete, and we turn to
   point 2. 
 
 
2.3 Summary (and glossary) 
----------------------------------------------------------------------------- 
 
PROBLEM 
    The problem that you desire to resolve. As you will see further on ,
from
    the point of view of GA It will always deal with  a MODEL of your
problem,
    made by you in its use and consumption. 
 



POPOLATION
    An assembly of chromosomes. The dimension of a population can vary from 
    A problem to another and that variation can change the performance of a 
    genetic algorithm noticeably. 
 
CROMOSOME/GENOME 
    The generic element in a population. These are the possible solutions to
the 
    problem , not necessarily the most optimum ones but at least solutions
that
    are compatible with the problem. For example , if the solution is a set
of
    moves to win a board game, a chromosome will be a sequence of moves,
perhaps 
    not winning but at least valid. 
 
FITNESS 
    It's the value that's assigned to every chromosome in order to evaluate
its suitability for the problem to be resolved. The bigger the fitness the
better the presence of the chromosome in the gene pool and therefore all the
more likelihood that it will reproduce. 
 
FUNCTION OBJECTIVE 
    This is the function for which one tries to find the MAXIMUM to resolve
the problem. Its value is often used as a fitness value. 
 
SELECTION
    The selection procedure is actuated before reproduction. This consists
in 
    the selection (Casual) of chromosomes that will reproduce, the bigger
their 
    fitness value the higher their probability of selection. 
 
RIPRODUCTION
    This is the process through which children chromosomes are generated 
    Parting from parent chromosomes selected in precedence. The mode with
which
    this comes to place and the operators that are used are described in the

    next chapter. 
 
MUTAATION
    The Mutation consists in the change (casual)of some genes in a freshly
generated child chromosome. Not all the new chromosomes mutate : even in
this case the selection is casual. 
 
GENERATION
    Once the selection, reproduction and mutation procedures have
terminated, 
    A new population of chromosomes is created. This constitutes a new
generation. As you are sure to notice, the number of generations  
    Also corresponds to the number of algorithm's iterations and is often
used to give an estimate of the efficiency of the same algorithm. 
 
 
=============================================================================   
3.  GA Techniques 
----------------------------------------------------------------------------- 
 
In the previous chapter we made several references to the "goodness" and the

Performance of a genetic algorithm . Naturally we weren't referring to the 
Characteristics of an algorithm  at a high level of extraction, but to those
Of a GA complete in all its specifics, parting from the model of the problem

Up to arriving at the selection of the probability of reproduction and
mutation 
Of single chromosomes. 
 
How do we construct, then, a genetic algorithm? What are the steps that WE 
Must carry out to render it complete? But above all, how can we create a
good genetic algorithm? The chapter that you are about to read will try to
respond to these very demands.  
 
 
3.1 Choice of the problem 
----------------------------------------------------------------------------- 
 
If you really want to make genetic algorithms that work, pay great attention

To the problems that you choose! It's clear, unfortunately, that the exact
opposite usually occurs: you have a problem and you have to decide how to
resolve it. Well, in this case think well before choosing to implement a GA.

 
Talking with several people I asked if there existed a problem that a
genetic algorithm couldn't possibly resolve. The responses were more often
than not, 
Very vague, showing me among other things what was the average knowledge of
the argument. All together some of them were able to clarify my ideas
somewhat:
Usually they started with an honest "I don't know" which was closely
followed with a enlightened "perhaps". 
 
Trying to unite the "perhaps" clarifications and joining them with my (still
scarce) relative experience, I arrived at these conclusions: 
 
- Genetic algorithms are perfectly indicated for filling rows or squares
with 
  ones. All together , whatever problem to which a solution is found by hand

  in less time than that required by a GA tends to loose our interest,
  especially when you read it for the ninety-seventh time. 
  In general, however, it's a good thing to evaluate the time that could be

  required to hit a solution (for example brute force) and confront it to
  That which it is effectively taking. 
 
- often genetic algorithms give the impression of functioning insofar as,  
  reflecting on it, the population cant do anything other than evolve. 
  In reality, especially in the case of problems with a certain complexity, 
  These evolve with an incredible velocity in the beginning, then they slow
down
  More and more as they approach the solution until they almost stop. 
  Therefore Yes, sooner or later they resolve the problem, but we need to
see
  WHEN they do it: if a brute force algorithm ,or worse, by shooting casual

  numbers you have a higher probability of finishing in less time then
forget   
  about using GA's. 
 
- in some cases a GA may not be optimum solution, because maybe it requires
more
  elaboration time than another algorithm ,but may, at the same time, be the
  most rapid solution to develop , in so much as it may require a different
  knowledge of the problem. For example, 
  sometimes reverse engineering an algorithm (hopefully a one-way algo) its
more
  simply to use it "without bothering to check" and allow a GA to work on
it. 
 
- in some cases, the calculation of the fitness can be extremely boring in
terms
  of time. If there's no other way to evaluate a chromosome, it may be time
to
  start considering another algorithm (or another model). 
 
- genetic algorithms aren't very good at resolving problems for which "a
small
  change in input corresponds to big variations in output" (did you ever
hear
  talk about hash?). As a matter of fact, reproduction and mutation usually
give
  rise to offspring that aren't very different from the parents: the result
of  
  which gives rise to objective function values that are very far away from
  those anticipated (and as a consequence, worse). For the same reason, the
  problems for which it's guaranteed that "for small changes in input there
  is a corresponding small change in output" are particularly suitable for
  resolution with genetic algorithms. 
  An alternative that was suggested to me for "hash type" problems was to
study
  a different model, that's able to cancel, ignore, compensate or at least
  render this property of theirs less evident: for this reason I'm inviting
you
  to read the next point and the next section. 
 
- Since, as you'll see shortly, for a good result from a genetic algorithm
having a good model is indispensable, a discriminating factor for deciding
if the problem is resolvable or not is in who constructs the model.
Essentially speaking: if you're not capable of modeling the problem in an
adequate manner , maybe you're better off looking for another method.

...