Sep 18, 2007, Copyright 2007, Randy Strauss

When is a garbled string the one I'm looking for?


The Problem

For reasons too complex to mention, I need to have a program run a telnet client on a Windows 2003 box to a terminal session on a linux box that has a custom CLI (command-line interpreter.) I'm using the XP's telnet.exe because it will log the session for me, and it's a relatively small executable (75kb).

Unfortunately both the telnet program and the CLI echo what is typed, mixing the two, so I get a second copy of the string mixed into the first, a bit later. My command-sending program waits a fraction of a second between sending the characters and sending the end-of-line character, yet some of the echoed characters often end up on the next line.

Thus the problem is, given one of these doubly echoed lines containing a string and all or part of itself echoed, how do I determine which string it was? (Note that someone may have typed in the commands, so I can't even rely on the command sequence in the file being what my program generates.)

For ease of discussion, lets call one of these doubly-echoed lines an echoMess, and the original line is the source. The echoMess might be the same as the source if the characters were echoed right after the end-of-line was echoed, or it might be two copies of the source if they were echoed completely after all the input text but before the end-of-line was echoed, or some mixing of the two.

Basically, we want a function:
    boolean echoEqual(String source, String echoMess)

Ideas

One idea is to simply have my program wait after each character is typed for it to be echoed. Then I just have to check if echoMess is twice as long as source and each character echoMess[i] == source[i/2]. This isn't a bad idea, but I decided it would take longer to try.

Another idea is to make a set of all the characters in source and then see if the echoMess has all the same characters. This may return true erroneously, but it can quickly weed out most things that aren't equal.

Or, instead of making a set, I could put a count of all the source characters in a small array. If a character in the source appears N times, the count of the character in the echoMess must be between N and 2N for them to be echoEqual, but again, it doesn't guarantee it.

Another approach is to remove the first occurrence of each letter in source from the echoMess. The E characters left in the echoMess must be the first E characters in the source, though they may be out of order.
    Example: abac in the mess: ababac
    Removing the first abac leaves: ba, not ab.

So after removing the first set of letters from the echoMess, remove the rest of the echoMess from the earliest part of the source string. If what's left of the source string is just part of the end of the source string, it was almost certainly a match- very few cases will be false-positives. (When I tried this on the set of 33 commands I need, only one occurred in the wrong order.)

In all of these solutions, it's still (theoretically?) possible that an echoMess could echoEqual a source string that didn't cause it.
    Example: abac in the mess: ababacca
    Removing the first abac leaves: baca
The echoed c must occur at the end if it was generated by abac
Maybe this was generated by the source string:   abacc

Solution

The perfect solution would find a match if and only if the source string and an overlapping full or partial echo could actually generate the echoMess string.

To write such an algorithm, we'll look at four questions.
How will we iterate?
What happens at the start?
What would it do in the middle?
Are we sure it will end?

The first way I thought of is to iterate on each character of the mess. Each character is part of the source string or the echo string. So it sounds like I'll have 3 parameters- the src (source) string, its echo, and the echoMess.

At the start, it's going to ask if the first character of the mess is part of the source string.

In the middle, it will have either said no, it the src string can't generate the echoMess, or it will have used up some of the src string, some (or none) of the echo, and some of the echoMess. At that point it'll ask if the next character of the echoMess is part of the src string or its echo.

At the end, unless it has decided it's impossible (such as if it finds an echoMess character not in the source string), it will have used up the echoMess and the source string and all, none or part of the echo.

I'm just about ready to write it. Java allocates a new object every time it makes a substring, so I originally wrote this passing indexes, but the substring-passing version is easier to read:

    static int ctr = 0;  // counts procedure calls for statistics

    /**
     * When telnet'ing, sometimes characters are echoed twice after a bit.
     * So you get two identical strings overlapped, with each 
     * char of the second after its corresponding char of the first.
     * This routing tells if the echo string is such a string.
     * @param src The source command string. 
     * @param echoMess The string containing good + some/all of good's echo
     * @return  true if echoMess can be generated by src and a full/partial echo
     */
    public static boolean echoEquals(String src, String echoMess) {
        ctr = 0;  // for statistics
        return echoEquals(echoMess, src, src);
    }

    /**
     * Returns true if
     * a) the first char of (what's left of) mess matches the first char of src
     * b) or the first char of mess matches the first char of the echo
     * and c) the src is longer than the echo 
     * 
     * @param mess
     * @param src
     * @param echo
     * @return  true if src and its echo can generate the mess
     */
    private static boolean echoEquals(String mess, String src, String echo) {
        if (mess.length() == 0) // got to the end of the input string! 
            return (src.length()==0);  // at end of first is good, short is bad
        if (src.length() > echo.length()) // don't allow src shorter than echo
            return false;
        if (src.length() == 0) { // got to the end of the first string
            // what's left of the mess must be part/all of the echo
            return echo.startsWith(mess);
        }
        ctr++;  // for statistics
        char m = mess.charAt(0),  s = src.charAt(0),  e = echo.charAt(0);
        // Here we try it two ways- if mess[0] matches src[0] or if that
        //   doesn't work, if it matches echo[0]
        if (m==s && echoEquals(mess.substring(1), src.substring(1), echo))
            return true;
        return m==e && echoEquals(mess.substring(1), src, echo.substring(1));
    }

Note that in each recursive call, we've chopped the first character off of the mess and one of the other strings, so it can't run forever. Also note that I've added a counter (ctr) to count the number of recursive calls. I count it after the initial checks- these could be done before the recursive calls, but it makes the code harder to read.

In my 40 actually-seen examples and test cases, the algorithm used a maximum of 1.83 calls per character.

On the other hand, if we check if aaaa can generate aaaaaaab, it'll try matching many times. If this is a possibility, use one of the other algorithms first to minimize the chance this will take so long. Below I show a few of these examples- it's going up approximately by a factor of 3 each time:

  compareEcho(a,       aac)              =false -- 1: 1.00
  compareEcho(aa,      aaaac)            =false -- 3: 1.50
  compareEcho(aaa,     aaaaaac)          =false -- 8: 2.67
  compareEcho(aaaa,    aaaaaaaac)        =false -- 22: 5.50
  compareEcho(aaaaa,   aaaaaaaaaac)      =false -- 64: 12.80
  compareEcho(aaaaaa,  aaaaaaaaaaaac)    =false -- 196: 32.67
  compareEcho(aaaaaaa, aaaaaaaaaaaaaac)  =false -- 625: 89.29
  compareEcho(aaaaaaaa,aaaaaaaaaaaaaaac) =false -- 2055: 256.88

  compareEcho(aaaabbbb,aaaaaaaabbbbbbbc) =false -- 358: 44.75
  compareEcho(abababac,aabbaabbababaacc) =true -- 16: 2.00

In the 40 true example test casts I tried, the worst I got was 1.83 procedure calls per source string character- good enough.

(The real problem was that occasionally, I only got a partial echo of a single instance of the source string. Sigh...)


To reach me, use the form at the bottom of my home page.
-r


(The telnet program has an option to turn on echoing, but no way to turn it off- strange, but that's why the phrase "Microsoft Quality" is an oxymoron.) My program needs to process the log whether it's produced by machine or by hand, so it's not an option just to remember the order. On the other hand, it would be very simple to check if one of the simpler algorithms is sufficient for my input. But how often do I get to write one of these papers?)