Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Rabin–Karp algorithm #413

Merged
merged 8 commits into from
Oct 12, 2021
Merged

Added Rabin–Karp algorithm #413

merged 8 commits into from
Oct 12, 2021

Conversation

pratikgl
Copy link
Member

Added Rabin–Karp algorithm to find API

References to other Issues or PRs or Relevant literature

Fixes #402

Brief description of what is fixed or changed

In the find API implemented in the string algorithm. I have added Rabin–Karp algorithm which finds the occurrence of a query string in a text using string hashing.

Other comments

Reference: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

@codecov
Copy link

codecov bot commented Oct 12, 2021

Codecov Report

Merging #413 (823fc7b) into master (5cf53a2) will increase coverage by 0.013%.
The diff coverage is 100.000%.

❗ Current head 823fc7b differs from pull request most recent head 98209e4. Consider uploading reports for the commit 98209e4 to get more accurate results

@@              Coverage Diff              @@
##            master      #413       +/-   ##
=============================================
+ Coverage   98.622%   98.635%   +0.013%     
=============================================
  Files           26        26               
  Lines         3411      3444       +33     
=============================================
+ Hits          3364      3397       +33     
  Misses          47        47               
Impacted Files Coverage Δ
pydatastructs/strings/algorithms.py 100.000% <100.000%> (ø)

Impacted file tree graph

Comment on lines 125 to 128
if q == 0:
for i in range(t):
positions.append(i)
return positions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if q == 0:
for i in range(t):
positions.append(i)
return positions
if q == 0 or t == 0:
return positions

@@ -107,3 +109,38 @@ def _do_match(string, query, kmp_table):
k = k + 1

return positions

def hash_str(string, p=137, m=1e9 + 7):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def hash_str(string, p=137, m=1e9 + 7):
def _hash_str(string, p=137, m=1e9 + 7):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the implementation, I've used a similar hash function but not the exact same. Instead of doing T[i] * p^(j-i) I've done T[i] * p^(i). So instead of iterating from len(T) to 0, it is iterating from 0 to len(T)
Here is the reference https://cp-algorithms.com/string/string-hashing.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of repeating 1e9 + 7, let's define it as a constant (PRIME_NUMBER = 10000000007) somewhere at the starting of this file and then use it else where. We also need a Note in documentation of find to explain the details of hash function like what value of p and M we are using and why.
Rest it's fine.

@@ -71,7 +73,7 @@ def _knuth_morris_pratt(text, query):

def _build_kmp_table(query):
pos, cnd = 1, 0
kmp_table = OneDimensionalArray(int, len(query) + 1)
kmp_table = OneDimensionalArray(int, len(query) + 5) # hacky fix to avoid error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is kmp_table = OneDimensionalArray(int, len(query) + 1) then for the case where len(query) is 0 the line kmp_table[pos] = cnd after the while statement will pop out an error index out of range as pos is intialised with 1. I've increased the array size so it won't go out of range. kmp_table = OneDimensionalArray(int, len(query) + 2) will also work though ;).
Rest the increased array size will not affect anything else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do this at the entry point of _kmp.

@pratikgl
Copy link
Member Author

Changes added:

  1. Added _p_pow function to decrease redundancy in the code
  2. Defined global variables PRIME_NUMBER, MOD

@czgdp1807
Copy link
Member

Thanks @pratikgl

@pratikgl pratikgl deleted the rka branch October 12, 2021 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rabin Karp Algorithm (GSSoC)
2 participants