Tuesday, August 27, 2013

[Algorithms] Suffix Automata (DAWG)

KeyWords: string-matching, pattern-matching, indexing, inverted text, finite-automata, suffix-trees, suffix-automata,suffix-tries, factor-automata, music identification, Directed Acycle Word Graph (DAWG).

Let me name some common problems on string.

finding Least Common String of strings, Sub String search, Palindromes, Angarams, Dictionary search, Commonly occuring string (Atom String), unique significant substring/pattern, least common prefix, least common suffix, whats not in the list.

The key point for solving all of these problems is , maintain all data about a string.
All data about a string means {All prefix substrings,All suffix substrings, All sub-strings} .

What matters here ??

You can solve string related problems using, either Brute Force method or using some Data Structure which will allow you to store data about string, considering your time/space complexities.

    the way your data-structure stores data about  the string? -Time and space taken for processing of the string.
    the way your data-structure retrieves data about the string?- Query response time /time taken to retrieve the data.
If you are going to use data-structures, here are the few structure that you'll obvisouly look out for.



Data about strings can be comprised in either a Suffix-Tree/Trie or Suffx-Array or Finite-Automata or Factor-Oracle.

So many people did a vast research on the area of string processing.If this area interests you as well, then you should go throught the sources mentione at the end of this post.

Since, space-time are driving factors in dealing with large data, i prefer to construct a suffx automata(DAWG), for the problems on strings.

Examples of DAWGs:

For String ""

For String "a"

For String "aa"

For String "ab"

For String "aba"

For String "abb"

For String "abbb"


Explanation(Example String: "GCAGAGAG"):


State-3 is cloned as State-6 while inserting 2nd 'A', after state-4, to allow DAG Flow for the suffix string AGAGAG,AGXXXX
State-4 is cloned as State-8 while inserting 2nd 'G', after state-6, to allow DAG Flow for the suffix string GAGAG,GAXXX.
...Just do some work around on this string further, you will understand.

Constructing Suffix-Automata in O(N)-Time / O(N logK) space(Forward DAWG Matching algorithm):

Forward DAWG Matching algorithm, which is smallest suffix automation with Deterministic Finite Automaton.
How to Construct:
1. Define the state
    typedef struct state{
        int length;             //Length of the DAWG till this state
        state* prev;            //Link to previous transition state in the DAWG
        state* next[ALPHA_MAX]; //Links to next transition state in the DAWG
    } state;
2. Automata is constructed with set of states and transition from state-to-state. So define your Automata which comprises of the states, variables to track states and methods that enables to do state transitions.
    typedef struct SAM{
        /*FSM Variables*/
        state pool[MAX];        //You need a pool of states, from where you can pick a raw state to include in the DAWG.    
        int count;              //You need to track number of states present in the DAWG.
        state* head,last;       //You need to track your head of the DAWG and last State of the DAWG.

        /*FSM Methods*/
        void SAM_init();            //Initialize your Finite state Machine.
        voidSAM_extend(int input);  //Based on the  input extend states in your FSM(Finite State Machine)
    } SAM;
3. Initialization of FSM.
    void SAM::SAM_init(){
            memset(pool,0,sizeof(pool));
            count=0;
            head=last=&pool[count++];
    }
4. Extending the FSM.
    void SAM::SAM_extend(int input){
        state* new_state = &pool[count++];
        new_state->length = 1 + last->length;
        state* p = last;
        for( ; p && !p->next[input]; p->next[input]=new_state,p=p->prev);
        if(!p){
            new_state->prev = head;
        }else{
            state* q = p->next[input];
            if(p->length +1 == q->length){
                new_state->prev = q;
            }else{
                state* clone = &pool[total++];
                *clone = *q;
                clone->length = p->length +1;
                for(;p && p->next[input]==q;p->next[input]=clone,p=p->prev);
                new_state->prev = q->prev = clone;
            }
        }
        last = new_state;
    }
5. You can include other methods, If required (like below method, which i used it for the traversal of the DAWG.)
    void SAM::SAM_traverse(state** p,int& len,int input){
        while(*p!=head && !(*p)->next[input]){
            *p = (*p)->link;
            len = (*p)->length;
        }
        if((*p)->next[input]){
            *p = (*p)->next[input];
            len++;
        }
    }
Sources: http://www.cs.nyu.edu/~mohri/pub/nfac.pdf http://cbse.soe.ucsc.edu/sites/default/files/smallest_automaton1985.pdf http://www-igm.univ-mlv.fr/~lecroq/string/fdm.html#SECTION00220 http://e-maxx.ru/algo/suffix_automata http://www-igm.univ-mlv.fr/~lecroq/string/node1.html Practise problems: http://www.spoj.com/problems/LCS/ http://www.spoj.com/problems/LCS2/

Thursday, August 1, 2013

SPOJ-2916::Can you answer these queries V

http://www.spoj.com/problems/GSS5/

Typical problem statement can be seen as below.

Problem: Given a array of numbers a[1...n] , and a Query(x1,y1,x2,y2).
Query(x1,y1,x2,y2) = Max { A[i]+A[i+1]+...+A[j] ; x1 <= i <= y1 , x2 j <= y2 and x1 <= x2 , y1 <= y2 }.

Lets analyze this query...
Wait a minute...!!
    Have you read my post on GSS1 problem ??
    Have you ever Solved GSS1 problem ??
If your answer is NO, then I suggest you to do that problem first.
http://code.karumanchi.me/2013/07/spoj-1043can-you-answer-these-queries-i.html


Here two cases araises based on {x1 <= i <= y1 , x2 j <= y2 and x1 <= x2 , y1 <= y2 }.
 Case-1: (x1,y1) and (x2,y2) doesn't overlap.
 Case-2: (x1,y1) and (x2,y2) overlaps.

Case-1::No Overlapping
 result would be (x1,y1).bestRightSum + (y1+1,x2-1).Sum +(x2,y2).bestLeftSum;

Case-2::No Overlapping
 result would be max of
   {
   (x1,x2-1).bestRightSum  + (x2,y2).bestLeftSum,
   (x1,y1).bestRightSum  + (y1+1,y2).bestLeftSum,
   (x2,y1).bestSum
   }



Implementation of the Same - Solution
// =====================================================================================
//       Filename:  GSS5.cpp
//    Description:
//         Author:  BrOkEN@!
// =====================================================================================
#include<cassert>
#include<cctype>
#include<climits>
#include<cmath>
#include<cstdio>
#include<cstdlib>
#include<cstring>
#include<ctime>
#include<bitset>
#include<deque>
#include<list>
#include<map>
#include<queue>
#include<set>
#include<stack>
#include<vector>
#include<fstream>
#include<iostream>
#include<sstream>
#include<streambuf>
#include<algorithm>
#include<iterator>
#include<new>
#include<string>
#include<utility>

template< class T >
inline T maxOfThree(T a, T b, T c){
    return max(max(a,b),c);
}

#define FOR(i,a,b) for(typeof((a)) i = (a); i <= (b) ; ++i )
#define REV_FOR(i,b,a) for(typeof((b)) i = (b); i >= (a) ; --i )
#define FOREACH(it,x) for(typeof((x).begin()) it=((x).begin()); it != ((x).end()); ++it)
#define REV_FOREACH(it,x) for(typeof((x).rbegin()) it=((x).rbegin()); it != ((x).rend()); ++it)


using namespace std;

typedef pair<int,int> PI;
typedef vector<PI> VI;

typedef struct node{
    int bestLeftSum,bestRightSum,Sum,bestSum;
   
    node merge(node& l,node& r){
        bestLeftSum = max(l.bestLeftSum,l.Sum+r.bestLeftSum);
        bestRightSum = max(l.bestRightSum+r.Sum,r.bestRightSum);
        Sum = l.Sum + r.Sum;
        bestSum = maxOfThree(l.bestSum,r.bestSum,l.bestRightSum+r.bestLeftSum);
    }
   
    node setNode(int val){
        bestLeftSum = bestRightSum = Sum = bestSum = val;
    }
       
} node;


const int MAX = 1<<14;

node T[MAX<<1];
int A[MAX];

void init(int Node,int i,int j){
    if(i==j){
        T[Node].setNode(A[i]);
        return;
    }else{
        int m = (i+j)/2;
        init(2*Node,i,m);
        init(2*Node+1,m+1,j);
        T[Node].merge(T[2*Node],T[2*Node+1]);
    }
}

node range_query(int Node,int i,int j,int L,int R){
    if(L > R) return T[0];
    if(i==L && R==j){
        return T[Node];
    }else{
        int m = (i+j)/2;
        if(R<=m){
            return range_query(2*Node,i,m,L,R);
        }else if(L>m){
            return range_query(2*Node+1,m+1,j,L,R);
        }else{
            node resultNode,left,right;
            left = range_query(2*Node,i,m,L,m);
            right = range_query(2*Node+1,m+1,j,m+1,R);
            resultNode.merge(left,right);
            return resultNode;
        }
    }
}

int query(int N,int x1,int y1,int x2,int y2){
    int result =0;
    if(y1<x2){
        result += range_query(1,0,N-1,x1,y1).bestRightSum;
        result += range_query(1,0,N-1,y1+1,x2-1).Sum;
        result += range_query(1,0,N-1,x2,y2).bestLeftSum;
    }else{
        result += maxOfThree(
                    range_query(1,0,N-1,x1,x2-1).bestRightSum  + range_query(1,0,N-1,x2,y2).bestLeftSum,
                    range_query(1,0,N-1,x1,y1).bestRightSum  + range_query(1,0,N-1,y1+1,y2).bestLeftSum,
                    range_query(1,0,N-1,x2,y1).bestSum
                    );
           
    }
    return result;
}

int main(){
    int T=0;
    scanf("%d",&T);

    int N=0,Q=0,x1=0,y1=0,x2=0,y2=0;
    FOR(t,1,T){
        scanf("%d",&N);
        FOR(i,0,N-1){scanf("%d",&A[i]);}
        init(1,0,N-1);
        scanf("%d",&Q);
        FOR(q,1,Q){
            scanf("%d%d%d%d",&x1,&y1,&x2,&y2);
            --x1;--y1;--x2;--y2;
            printf("%d\n",query(N,x1,y1,x2,y2));
        }
    }

    return 0;
}

SPOJ-1043::Can you answer these queries I

http://www.spoj.com/problems/GSS1/

Typical problem statement can be seen as below.

Problem: Given a array of numbers a[1...n] , and a query range [x,y].
query(x,y) should return the sub-sequence sum, whose sum is maximum in the interval [x,y].

Lets analyze this query.
Query(x,y) is the maximum of below values where sum(i,j) = a[i]+a[i+1]+......+a[j];
sum(x,x) sum(x,x+1) sum(x,x+2) .......... sum(x,y-1) sum(x,y)
  sum(x+1,x+1) sum(x+1,x+2) .......... sum(x+1,y-1) sum(x+1,y)
    sum(x+2,x+2) .......... sum(x+2,y-1) sum(x+2,y)
      .......... .......... ..........
        sum(y-1,y-1) sum(y-1,y)
          sum(y,y)
Seriously whats wrong with coloring -we will come to that part.
Blue::The Color of Left Sum.
Where the range starts at 'x' but ends at any where in [x,y].
sum(x,x) sum(x,x+1) sum(x,x+2) .......... sum(x,y-1) sum(x,y)
  sum(x+1,x+1) sum(x+1,x+2) .......... sum(x+1,y-1) sum(x+1,y)
    sum(x+2,x+2) .......... sum(x+2,y-1) sum(x+2,y)
      .......... .......... ..........
        sum(y-1,y-1) sum(y-1,y)
          sum(y,y)

Red::The Color of Right Sum.
Where the range starts at somewhere in [x,y] and ends at 'y'.
sum(x,x) sum(x,x+1) sum(x,x+2) .......... sum(x,y-1) sum(x,y)
  sum(x+1,x+1) sum(x+1,x+2) .......... sum(x+1,y-1) sum(x+1,y)
    sum(x+2,x+2) .......... sum(x+2,y-1) sum(x+2,y)
      .......... .......... ..........
        sum(y-1,y-1) sum(y-1,y)
          sum(y,y)

Grey::The Color of Sum of the Elements.
Simply sum of all elements in interval [x,y].
sum(x,x) sum(x,x+1) sum(x,x+2) .......... sum(x,y-1) sum(x,y)
  sum(x+1,x+1) sum(x+1,x+2) .......... sum(x+1,y-1) sum(x+1,y)
    sum(x+2,x+2) .......... sum(x+2,y-1) sum(x+2,y)
      .......... .......... ..........
        sum(y-1,y-1) sum(y-1,y)
          sum(y,y)

Green::The Color of nested Query ;).
Well, you can see this as the Query(x+1,y-1).
sum(x,x) sum(x,x+1) sum(x,x+2) .......... sum(x,y-1) sum(x,y)
  sum(x+1,x+1) sum(x+1,x+2) .......... sum(x+1,y-1) sum(x+1,y)
    sum(x+2,x+2) .......... sum(x+2,y-1) sum(x+2,y)
      .......... .......... ..........
        sum(y-1,y-1) sum(y-1,y)
          sum(y,y)
Now, you can understand why i had to use these colors.

Lets maintain 4 values. bestLeftSum - Best of all the Left Sums bestRightSum - Best of all the Right Sums Sum - Sum of all the elements bestSum - well we need this to store resultSum of each query (Nested Query ;)) Logically, Query(x,y).resultSum = max (bestLeftSum,bestRightSum,sum,Query(x+1,y-1).bestSum); But still the nested query stuff isn't so good O(N^2) :'(. Why can't we use a tree(SegmentTree) ?? Why not a O(logN) for query?? Lets call the above info {bestLeftSum,bestRightSum,Sum,bestSum} as QueryNode. and given information about QueryNode(L,M) and QueryNode(M+1,R). Can't you be able to determine the QueryNode(L,R) with above information?? QueryNode(L,M) -> l QueryNode(M+1,R) -> r Then For QueryNode(L,R), bestLeftSum = max (l.bestLeftSum,l.Sum+r.bestLeftSum); bestRightSum = max (l.bestRightSum+r.Sum,r.bestRightSum); Sum = l.Sum+r.Sum; bestSum = max(l.bestSum,r.bestSum,l.bestRightSum+r.bestLeftSum); How?? - Check below graphical representation
Implementation of the Same - Solution
// =====================================================================================
//       Filename:  GSS1.cpp
//    Description:  
//        Created:  05/23/2013 06:41:42 PM
//         Author:  BrOkEN@!
// =====================================================================================
#include<fstream>
#include<iostream>
#include<sstream>
#include<bitset>
#include<deque>
#include<list>
#include<map>
#include<queue>
#include<set>
#include<stack>
#include<vector>
#include<algorithm>
#include<iterator>
#include<string>
#include<cassert>
#include<cctype>
#include<climits>
#include<cmath>
#include<cstddef>
#include<cstdio>
#include<cstdlib>
#include<cstring>
#include<ctime>

#define FOR(i,a,b) for(typeof((a)) i=(a); i <= (b) ; ++i)
#define FOREACH(it,x) for(typeof((x).begin()) it=(x).begin(); it != (x).end() ; ++it)

using namespace std;

typedef pair<int,int> PI;
typedef vector<PI> VI;

inline int max2(int a, int b) {return ((a > b)? a : b);}
inline int max3(int a, int b, int c) {return max2(a, max2(b, c));}

const int MAX = 1 << 16;
int a[MAX];

struct node
{
    int Sum,bestLeftSum,bestRightSum,bestSum;
    
    node split(node& l, node& r){
    } // No Need of Split Function
    
    node merge(node& l, node& r)
    {
        Sum = l.Sum + r.Sum;
        bestLeftSum = max( l.Sum + r.bestLeftSum , l.bestLeftSum );
        bestRightSum = max( r.Sum + l.bestRightSum , r.bestRightSum );
        bestSum = max( max( l.bestSum , r.bestSum) , l.bestRightSum + r.bestLeftSum );
    }
    
    node setValue(int val){
        Sum = bestLeftSum = bestRightSum = bestSum = val;
    }
};

node T[MAX << 1];

void init(int Node, int i, int j) {
    if(i==j) { // Initialize Leaf Nodes
        T[Node].setValue(a[i]);
        return;
    }else{ // Summerize Descendant Nodes Nodes
        int m = (i+j)/2;
        init(2*Node, i, m);
        init(2*Node+1, m+1, j);
        T[Node].merge(T[2*Node],T[2*Node+1]);
    }
}

void update(int Node, int i, int j,int idx, int val) { 
// Update element with index 'idx' in the range [i,j]
    if(i==j && i == idx) { // Update the LeafNode idx
        T[Node].setValue(val);
        return;
    }else{ // Summerize Descendant Nodes Nodes
        int m = (i+j)/2;
        if(idx <=m)
            update(2*Node, i, m, idx, val);
        else
            update(2*Node+1, m+1, j, idx, val);
        T[Node].merge(T[2*Node],T[2*Node+1]);
    }
}

void range_query(node& resultNode,int Node,int i, int j, int L, int R){ // Search for Node having interval info [L,R] in [i,j]; (i<=L<R<=j)
    if(i==L && j==R){
        resultNode = T[Node];
        return;
    }else{
        int m = (i+j)/2;
        if(R<=m)
            range_query(resultNode, 2*Node,  i, m, L, R);
        else if(L>m)
            range_query(resultNode, 2*Node+1, m+1, j, L, R);
        else{
            node left, right;
            range_query(left, 2*Node,  i, m, L, m);
            range_query(right, 2*Node+1, m+1, j, m+1, R);
            resultNode.merge(left,right);
        }
    }
}


int solve(){

    return 0;
}


int main(){

    int N=0,M=0;
    node res;
    scanf("%d",&N);
    FOR(i,0,N-1){
            scanf("%d", &a[i]);
    }
    init(1, 0, N-1);
    scanf("%d",&M);
    int x,y;
    FOR(i,0,M-1){
            scanf("%d%d",&x,&y);
                range_query(res, 1, 0, N-1, --x, --y);
                printf("%d\n", res.bestSum);
    }

    return 0;
}