Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update rules for removing table markup. Fix #1710 #1954

Merged
merged 3 commits into from
Mar 15, 2018

Conversation

chaitaliSaini
Copy link
Contributor

Updated the regex rules for removing table markup.

@@ -185,10 +186,14 @@ def remove_markup(text, promote_remaining=True, simplify_links=True):
if simplify_links:
text = re.sub(RE_P6, '\\2', text) # simplify links, keep description only
# remove table markup

text = text.replace('||', '\n|') # each table cell on a separate line
text = text.replace("!!", "\n|") # leave only cell content(in table head)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a performance difference after this changes?

@@ -645,6 +645,18 @@ def test_max_token_len_set(self):
corpus = self.corpus_class(self.enwiki, processes=1, token_max_len=16, lemmatize=False)
self.assertTrue(u'collectivization' in next(corpus.get_texts()))

def test_removed_table_markup(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like CI stuck on this test (and this failed), please debug it.

@menshikh-iv
Copy link
Contributor

ping @chaitaliSaini, how is going?

@chaitaliSaini
Copy link
Contributor Author

chaitaliSaini commented Mar 12, 2018

Time taken before changes: 2hrs, after changes: 2hrs 22mins.
Corpus used: Wikipedia dump(~14.6GB)

"""Table cell formatting."""
RE_P14 = re.compile(r'\[\[Category:[^][]*\]\]', re.UNICODE)
"""Categories."""
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)
"""Remove File and Image templates."""
RE_P16 = re.compile(r'\[{2}(.*?)\]{2}', re.UNICODE)
"""Capture interlinks text and article linked"""

RE_P17 = re.compile(r'(\n.{0,4}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|(^.{0,2}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))', re.UNICODE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: too long line https://travis-ci.org/RaRe-Technologies/gensim/jobs/351361148#L510, please fix this (we using 120 characters limit)

@menshikh-iv
Copy link
Contributor

@piskvorky this change ~20% slowdown segment_wiki (but fix bug with tables markup), I think we should merge this (because removing table markup is important, despite the loss in performance), wdyt?

@piskvorky
Copy link
Owner

piskvorky commented Mar 12, 2018

Of course. Correctness comes before speed.

@chaitaliSaini is this fix complete, or are there still markup situations where the parsing fails?

@menshikh-iv
Copy link
Contributor

@piskvorky 👍
@chaitaliSaini I need some time for check diff in output files before/after (I want to be sure that all works as expected), sorry for waiting

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 15, 2018

I compared results before and after this change

  • this remove markup
  • sometimes this produce too much \n (but I don't think that this is a problem).

LGTM, good work @chaitaliSaini 👍

@menshikh-iv menshikh-iv merged commit 47d995a into piskvorky:develop Mar 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants