======================================================== This work is licensed under a Creative Commons License See http://creativecommons.org/licenses/by-nc-nd/2.0/ ======================================================== Cracking codes with Genetic Algorithms by +mala 200412 1. Introduction 1.1 Note 2. GA Basics 2.1 What is a GA? 2.2 how does a GA work? 3. GA Techniques 3.1 Choosing the Problem 3.2 Model and chromosomes 3.3 Fitness 3.4 Reproduction e Mutation 3.5 Genetic Algorithms and probability 4. Programming 4.1 GAlib 4.2 Generic structure of a GA in C++ 5. Examples 5.1 NSCE 5.2 Mono-alphabetic Ciphers 6. Bibliography ============================================================================= 1. Introduction ----------------------------------------------------------------------------- It was February 2003 when, on an anti-crack forum, I read the thread (that you can still find at http://board.anticrack.de/viewtopic.php?t=1108) entitled "Genetic Algorithms". The+Q wrote: <<This made me think, is there a limit to GA? More and more "un-solvable" problems are solved by GA, and im starting to feel its only a matter of time until hard crypto algorithms are cracked by such meens..>> The+Q wasn't the only one to nurture such strong enthusiasm for genetic algorithms : in the forum several reversers started to ask information on how to develop programs that used them and I soon started to study and experiment with them. Everyone thought that by using GA's we could radically change the approach to reversing and crypto-analysis. Now, two years later not much has changed. Instead, the old threads On genetic algorithms are drying up and very little new material on GA has come up: its enough to know that, even now, the tutorial that The+Q wrote in 2003 is cited as an example of genetic algorithm applied to cracking. Why did this happen? Why did people stop talking about how powerful genetic algorithms were soon after trying to use them? Is it not sufficient to just understand the theory in order to be able to resolve every problem? I too, during my experiments, found myself more times incapable of finding a solution to problems that didn't seem at all so complicated, and every time I asked myself where's the limit of these techniques. Even now I feel that I'm a long way from having learnt everything that genetic algorithms Ask us to comprehend before being able to use them in a truly efficient way, but at least I got past the stage of "I don't know how they work, But they usually work, therefore they must work". I need this personal test to come to a conclusion, it's also for whoever else that has the patience to follow me and start to work with GA's knowing already (more or less) what to expect. 1.1 Note ----------------------------------------------------------------------------- I imagine you're already asking why this text has such a particular title and if it will be of any real use, even if you're not reverse engineers or crypto-analysts. Well, The response to the first question is fairly simple: given my dual interest in both reversing and cryptography, it appeared interesting to me to conduct experiments in both fields, accumulating them under the name of "code cracking". In reality, the examples on which I worked are more varied: not all of them regard code cracking, not all of them had a positive result. How and ever , from every one of them I learnt something that I'll try to transmit through these pages. As for the usefulness of the text your reading I believe that it's a sufficient quantity of suggestions and principal ideas to permit anyone to write genetic algorithms for whatever problem they want to resolve. Naturally, a greater preparation on the specific argument to which you're dedicating yourselves will noticeably simplify your work. Lastly, if at the end of reading this text you're still alive and you're still interested in genetic algorithms, try taking a look at http://mala.homeip.net/malawiki/GeneticAlgos for updates, source code, comments on the text, suggestions and voodoo curses. ============================================================================= 2. GA Basics ----------------------------------------------------------------------------- In this section I'll talk about genetic algorithms as simply and generally as possible, introducing some definitions that will be then used in the rest of this text. If you are already familiar with GA you can skip this section, but I'd advise you to take a look at it anyway, if for nothing else then to verify that your terminology is exactly the same as mine. 2.1 What is a GA? ----------------------------------------------------------------------------- A GA is a Genetic Algorithm, that is an algorithm genetic (if you're asking why they call it GA and not AG, do a search with google and see which gives the most results). If we consult the relative literature, we discover that Genetic Algorithms are a family of computational models that try to evolve. These models put into practice the Darwinian concept of "the survival of the fittest", parting from a population of possible solutions (usually casual) to a given problem and leaving them to evolve on the basis of some measurements of their success. Academics are still discussing trying to find a precise definition for Genetic Algorithms. In the meantime, the same term is used with two different exceptions: following a literal interpretation, this refers to a literal model introduced and studied by John Holland (1975) and from his students; With a meaning a little bit more ample, a Genetic Algorithms is a "model Based on a population that uses selection operators and recombination To generate new points in a search space". Seeing as these definitions are as much rigorous as they are unclear, here are a few examples that will help you better understand the significance of what was written A simple case (if not completely commonplace, but its one of the principles that the literature offers us) could be that in which the PROBLEM consists in filling up a line with ones. The POPULATION, that is the group of possible solutions, could then be represented by a series of strings of initially casual "bits" as long as the row desires to be. The MEASURE OF SUCCESS of a generic element of the population, which from now on we'll refer to as CHROMOSOME, will be for example the number of bits in a string that have a value of 1. You can easily guess that the best chromosomes will be those closer to the solution, and it's these that will have a better possibility of "surviving". In a board game like, for example, Lights Out or Chinese draughts (which You can find respectively, at http://www.2flashgames.com/f/f-35.htm and http://www.math.it/damacinese/dama.htm), the PROBLEM consists in ending the game with a particular configuration: In Lights Out this deals with switching off all the lights, whereas In Chinese draughts you need to end with a single piece in the centre of the game board. The POPULATION, in both cases, could be made up of CHROMOSOMES that represent sequences of moves, and the MEASURE OF SUCCESS could be the closeness to the configuration that resolves the problem, that is how much one gets close to the configuration with the least lights on or a single piece in the centre. In another example (that we'll describe in detail further on) The PROBLEM Consists in finding a password to register a program, The POPOLATION Is made up from CHROMOSOMES that represent the _potentially_ valid password (that is, satisfying some necessary conditions) an the MEASURE OF SUCCESS is given by the closeness of the password to the correct one, calculated On the basis of the same algorithm used inside the program. A last example is that of Mono-alphabetic Ciphers (in which for every letter of text there is only one other letter that always corresponds to another: just so we understand each other, The same one used in the Puzzle Week). In this case, the PROBLEM Consists in finding a substitution of letters that resolves the cryptogram, the POPULATION is made up of CHROMOSOMES that corresponds to the various cipher alphabets and the MEASURE OF SUCCESS is given from how much an alphabet is good, that is , in how much the resulting deciphered text using that alphabet is a valid test. 2.2 How does a GA work? ----------------------------------------------------------------------------- From the Wikipedia web page (http://it.wikipedia.org/wiki/Algoritmo): "In its most simple and intuitive definition an algorithm is a sequence ordered into simple passes that have the purpose of bringing a more complex task to an end (a cooking recipe, for example, could be considered as an algorithm that parting from a group of single basic food elements and following a sequence of passes, produces as a result a prepared dish). In a more formal mode, we can therefore define an algorithm as an ordered sequence furnished with instructions that, given one or a series of elements in input, produces one or a series of results in output." In what sequence of passes, then, does a genetic algorithm follow? At a fairly abstract level (later on we'll increase the detail), The operations to follow are the following : 1) Initialization As a first step for the Algorithm a startup population is created , Made of casual solutions (chromosomes). When, in Genetic Algorithm theory we talk about "solution to a problem", naturally, its meaning shouldn't be intended as a "unique solution", or even a "correct solution". Just as a mathematical function, on the basis of different input values, can give different results, in the same way we can think about the presence of different possible solutions. And naturally, as in the case of mathematics we can have a MAXIUM or an OPTIMUM of a problem, The solution that were searching for in the end is an OPTIMUM. 2) Calculation of the Fitness A value of "fitness" is assigned to every chromosome on the basis of its efficiency with respect to the problem. Usually the fitness value corresponds to the value of the function objective (that is, that for which one desires to find the MAXIUM), but its not indispensable that its always like that: for example , in the documentation of the GAlib (and , to be more precise , at the following address: http://lancet.mit.edu/galib-2.4/Overview.html)this distinction is made note explicitly. 3) Reproduction The chromosomes that have a higher fitness are those that more easily become parents of the next solutions during the phases of reproduction. The choice of the parents however comes out in a casual manner , All together the better members of a population are given a better probability of being chosen. Through operators that can change from one problem to another, the chosen chromosomes can generate children , who in turn inherit part of their own genes from one parent and part from another. Through a mutation operator, in the end, some casual changes are performed on some of their children's genes before these enter to become part of the new population. 4) Successive Generation If the optimal solution has been found (or if some other condition imposed from outside has been obtained , for example: a max limit for execution time), the algorithm terminates. In the contrary case, the old assembly of chromosomes is substituted by the newly generated children: the new generation is complete, and we turn to point 2. 2.3 Summary (and glossary) ----------------------------------------------------------------------------- PROBLEM The problem that you desire to resolve. As you will see further on , from the point of view of GA It will always deal with a MODEL of your problem, made by you in its use and consumption. POPOLATION An assembly of chromosomes. The dimension of a population can vary from A problem to another and that variation can change the performance of a genetic algorithm noticeably. CROMOSOME/GENOME The generic element in a population. These are the possible solutions to the problem , not necessarily the most optimum ones but at least solutions that are compatible with the problem. For example , if the solution is a set of moves to win a board game, a chromosome will be a sequence of moves, perhaps not winning but at least valid. FITNESS It's the value that's assigned to every chromosome in order to evaluate its suitability for the problem to be resolved. The bigger the fitness the better the presence of the chromosome in the gene pool and therefore all the more likelihood that it will reproduce. FUNCTION OBJECTIVE This is the function for which one tries to find the MAXIMUM to resolve the problem. Its value is often used as a fitness value. SELECTION The selection procedure is actuated before reproduction. This consists in the selection (Casual) of chromosomes that will reproduce, the bigger their fitness value the higher their probability of selection. RIPRODUCTION This is the process through which children chromosomes are generated Parting from parent chromosomes selected in precedence. The mode with which this comes to place and the operators that are used are described in the next chapter. MUTAATION The Mutation consists in the change (casual)of some genes in a freshly generated child chromosome. Not all the new chromosomes mutate : even in this case the selection is casual. GENERATION Once the selection, reproduction and mutation procedures have terminated, A new population of chromosomes is created. This constitutes a new generation. As you are sure to notice, the number of generations Also corresponds to the number of algorithm's iterations and is often used to give an estimate of the efficiency of the same algorithm. ============================================================================= 3. GA Techniques ----------------------------------------------------------------------------- In the previous chapter we made several references to the "goodness" and the Performance of a genetic algorithm . Naturally we weren't referring to the Characteristics of an algorithm at a high level of extraction, but to those Of a GA complete in all its specifics, parting from the model of the problem Up to arriving at the selection of the probability of reproduction and mutation Of single chromosomes. How do we construct, then, a genetic algorithm? What are the steps that WE Must carry out to render it complete? But above all, how can we create a good genetic algorithm? The chapter that you are about to read will try to respond to these very demands. 3.1 Choice of the problem ----------------------------------------------------------------------------- If you really want to make genetic algorithms that work, pay great attention To the problems that you choose! It's clear, unfortunately, that the exact opposite usually occurs: you have a problem and you have to decide how to resolve it. Well, in this case think well before choosing to implement a GA. Talking with several people I asked if there existed a problem that a genetic algorithm couldn't possibly resolve. The responses were more often than not, Very vague, showing me among other things what was the average knowledge of the argument. All together some of them were able to clarify my ideas somewhat: Usually they started with an honest "I don't know" which was closely followed with a enlightened "perhaps". Trying to unite the "perhaps" clarifications and joining them with my (still scarce) relative experience, I arrived at these conclusions: - Genetic algorithms are perfectly indicated for filling rows or squares with ones. All together , whatever problem to which a solution is found by hand in less time than that required by a GA tends to loose our interest, especially when you read it for the ninety-seventh time. In general, however, it's a good thing to evaluate the time that could be required to hit a solution (for example brute force) and confront it to That which it is effectively taking. - often genetic algorithms give the impression of functioning insofar as, reflecting on it, the population cant do anything other than evolve. In reality, especially in the case of problems with a certain complexity, These evolve with an incredible velocity in the beginning, then they slow down More and more as they approach the solution until they almost stop. Therefore Yes, sooner or later they resolve the problem, but we need to see WHEN they do it: if a brute force algorithm ,or worse, by shooting casual numbers you have a higher probability of finishing in less time then forget about using GA's. - in some cases a GA may not be optimum solution, because maybe it requires more elaboration time than another algorithm ,but may, at the same time, be the most rapid solution to develop , in so much as it may require a different knowledge of the problem. For example, sometimes reverse engineering an algorithm (hopefully a one-way algo) its more simply to use it "without bothering to check" and allow a GA to work on it. - in some cases, the calculation of the fitness can be extremely boring in terms of time. If there's no other way to evaluate a chromosome, it may be time to start considering another algorithm (or another model). - genetic algorithms aren't very good at resolving problems for which "a small change in input corresponds to big variations in output" (did you ever hear talk about hash?). As a matter of fact, reproduction and mutation usually give rise to offspring that aren't very different from the parents: the result of which gives rise to objective function values that are very far away from those anticipated (and as a consequence, worse). For the same reason, the problems for which it's guaranteed that "for small changes in input there is a corresponding small change in output" are particularly suitable for resolution with genetic algorithms. An alternative that was suggested to me for "hash type" problems was to study a different model, that's able to cancel, ignore, compensate or at least render this property of theirs less evident: for this reason I'm inviting you to read the next point and the next section. - Since, as you'll see shortly, for a good result from a genetic algorithm having a good model is indispensable, a discriminating factor for deciding if the problem is resolvable or not is in who constructs the model. Essentially speaking: if you're not capable of modeling the problem in an adequate manner , maybe you're better off looking for another method. ...