String Matching Algorithms

Danushka Herath

1 year ago

Reading Time: 3 minutes

What is String Match?

String Match is a useful activity in our day today lives. If you want to search something that you don’t know you just google the word or set of words that describes what you want to know. And in the studies if you want to find a phrase in a book you search for it one page by page. The manual efficient String Search Algorithm is the dictionary. Dictionary has ordered every word in ascending order of the alphabet. If you want to find a word from the dictionary you can move to the exact page very fast.

But Just imaging if you have a book with 1000 pages and just text paragraphs and you want to find a word. You will have to read the entire book to find what you search for. But how computer find a string so quickly from a thousands of words. I’m going to explain some algorithms works inside the computer from this set of posts. These are the algorithms:

Naive Search Algorithms
Boyer – moore Algorithm
Robin – karp Algorithm

1. Naive Search Algorithm

Native Search algorithms is kind of idiot algorithm that use in computers to search words. I call it idiot because the run time is very high. However this is the simplest algorithm for implementation

#include <stdio.h>
#include <memory.h>

void bruteForce(char text[],char pattern[]);

int main() {
char text[]="abbcabab";
char pattern[]="ab";
bruteForce(text,pattern);
}

void bruteForce(char text[],char pattern[]){

int maxShift= strlen(text)- strlen(pattern);
for (int i=0;i<maxShift;i++){
int count_of_match=0;
for (int j=0;j< strlen(pattern);j++) {
if (text[i + j] == pattern[j]) {
count_of_match++;
}
}
if (count_of_match== strlen(pattern)){
printf("%s%d\n","pattern matched at index ",i);
}
}
}

For this kind of search It will take O(mn) kind of complexity for find a single word. It’s heavily use the system resources because it compares character by character again and again. Therefore we have some other algorithm to make it Faster.

2. Boyer Moore Algorithm

This is a kind of fast algorithm when comparing with naive search. Because this can shift the pattern fast and find the index where the pattern exactly positioned.

You can more refer about this algorithm from here : https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

In this algorithm the most considered thing is bad characteristic table. First we go thorough all characters in the text and make and array with key as the index and the value as the index character last occurred index of the pattern. If the selected character is not occurred in the pattern we put -1 as the value . This is the reprocessing part of the algorithm. Once this table is generated you can search for a string with O(n) complexity.

Ok let’s go for the Implementation.

#include <stdio.h>
#include <string.h>

int findchar(char pat[], char letter);
int boyer_moore(char txt[], char pat[]);
void setTable(char text[],char pattern[]);
int index_table[256];

int main(){
char text[] = "ratcatbatnatfat";
char pattern[] = "cat";
int shift= boyer_moore(text,pattern);
if (shift!=-1){
printf("%s%d\n","Match Found at index",shift);
}else{
printf("%s\n", "Not found a Match");
}
return 0;
// findchar(pattern,'c');
}

int boyer_moore (char txt[], char pat[]){
setTable(txt,pat);
int i,j,n=strlen(txt),m=strlen(pat),lastch;
i=m-1;
j=m-1;
while (i<n){
if (txt[i]==pat[j]){
// printf("%c%s%d\n",txt[i]," i= ",i);
if (j==0){
//printf("%s%d\n"," ",i);
return i;
}else{
j--;
i--;
}
}else{
lastch=index_table[(int)txt[i]];
// printf("%s%c%d","Last char of ",txt[i],lastch );
if(lastch==-1){
i+=m;
}else{
i+=(j-lastch);
}
// printf("%s%d\n"," i= ",i);
j=m-1;
}
}
return -1;
}

int findchar(char pat[], char letter){
int pat_size= strlen(pat)-1;
for (int i=0; i<pat_size+1; i++){
// printf("%c%s%c\n", pat[pat_size-i]," > ",letter);
if (pat[pat_size-i]==letter){
// printf("%s%c%d","letter ",letter,(pat_size-i));
return (pat_size-i);
}
}
return -1;
}

void setTable(char text[],char pattern[]){
for (int i=0;i<strlen(text);i++) {
int current= findchar(pattern,text[i]);
index_table[(int)text[i]]=current;
}
// for (int i=0;i<256;i++){
// printf("%d\n",index_table[i] );
// }
}

I’m not going to explain whole code. But If you don’t understand what kind of method that I have used simply just put a comment about your problem. Then I would like to move in to the rabin karp algorithm.

3. Rabin Karp Algorithm

Rabin karp is kind of method that based on some calculation. Actually humans can’t go with this process because we are poor in calculations. But a computer can do it very fast. So in this method we use a Hashing method to compare two strings. Hash is generated by a mathematical formula. I will paste the Implementation and reference link for find more information about this algorithm.

Reference : https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

#include <stdio.h>
#include <string.h>

#define d 256

int rabincarp_search(char pattern[], char text[], int q){
int m = strlen(pattern);
int n= strlen(text);
int i,j,t=0,p=0,h=1;

for (i=0; i<m-1; i++){
h=(h*d)%q;
}

for (i=0; i<m ; i++){
p=(p*d + pattern[i])%q;
t=(t*d + text[i])%q;
}

for (i=0; i<=n-m; i++){
if (p==t){
for (j=0; j<m; j++){
if (text[i+j]!=pattern[j]){
break;
}
}
if(j==m){
printf("%s%d\n", "Pattern has found in index ",i);
}
}
if (i<n-m){
t=(d*(t-text[i]*h) + text[i+m])%q;
if (t < 0)
t = (t + q);
}
}
return 0;
}

int main(){
char text[]="carratcappatnat";
char pattern[]="rat";
int q=101;
int shift=rabincarp_search(pattern,text,q);
return shift;
}

This will little weird if you don’t try it yourself. Just try and post your questions here. I will be there to help you. Thank You.