Wikitext headings are considered part of the sentence #2

Open
prtksxna opened this issue Aug 12, 2020 · 2 comments

@prtksxna
Member

In Wikitext, headings are marked up by surrounding the title with runs of = characters, and they are separated from the body text by newlines. When breaking the text into sentences, wink-nlp doesn't treat the heading as a separate sentence; it merges the heading into the sentence that follows it.

const text = `He spoke of a five-year freeze in domestic spending, eliminating 
tax breaks for oil companies and reversing tax cuts for the wealthiest Americans, 
banning congressional earmarks, and reducing healthcare costs. He promised the 
United States would have one million electric vehicles on the road by 2015 and 
be 80% reliant on \"clean\" electricity.\n\n\n==== LGBT rights ====\nOn October 
8, 2009, Obama signed the Matthew Shepard and James Byrd Jr. Hate Crimes 
Prevention Act, a measure that expanded the 1969 United States federal hate-crime 
law to include crimes motivated by a victim's actual or perceived gender, sexual 
orientation, gender identity, or disability.On October 30, 2009, Obama lifted the 
ban on travel to the United States by those infected with HIV, which was celebrated 
by Immigration Equality.On December 22, 2010, Obama signed the Don't Ask, Don't 
Tell Repeal Act of 2010, which fulfilled a key promise made in the 2008 
presidential campaign to end the Don't ask, don't tell policy of 1993 that had 
prevented gay and lesbian people from serving openly in the United States Armed 
Forces. In 2016, the Pentagon also ended the policy that barred transgender 
people from serving openly in the military.`;
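// Setup assumed by this report: wink-nlp loaded with the English lite model.
const winkNLP = require( 'wink-nlp' );
const model = require( 'wink-eng-lite-model' );
const nlp = winkNLP( model );
const doc = nlp.readDoc( text );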

console.log( doc.sentences().itemAt(2).out() );

The output for this was:

==== LGBT rights ====
On October 8, 2009, Obama signed the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act, a measure that expanded the 1969 United States federal hate-crime law to include crimes motivated by a victim's actual or perceived gender, sexual orientation, gender identity, or disability.

The expected outcome would be that ==== LGBT rights ==== and the rest of the text are in two separate sentences. This might be too specific a use case to actually solve for.
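
Until this is handled in the library or the model, one possible workaround is to split the heading lines out of the wikitext before sentence detection. The sketch below is only an illustration under stated assumptions: `splitWikitextHeadings` is a hypothetical helper, not part of wink-nlp, and it assumes each heading sits on its own line with matching runs of one to six = characters on both sides.

```javascript
// Hypothetical helper, NOT part of wink-nlp.
// Splits raw wikitext into heading and body segments so that each body
// segment can be sentence-split on its own.
function splitWikitextHeadings( text ) {
  // A heading line: a run of 1-6 "=" signs, the title, then the same run again.
  const headingLine = /^(={1,6})\s*(.+?)\s*\1\s*$/;
  const segments = [];
  let body = [];

  // Emit the body lines collected so far as one segment (if non-empty).
  const flushBody = () => {
    const joined = body.join( '\n' ).trim();
    if ( joined ) segments.push( { type: 'body', text: joined } );
    body = [];
  };

  for ( const line of text.split( '\n' ) ) {
    const match = headingLine.exec( line );
    if ( match ) {
      flushBody();
      segments.push( { type: 'heading', level: match[ 1 ].length, text: match[ 2 ] } );
    } else {
      body.push( line );
    }
  }
  flushBody();
  return segments;
}

// Shortened sample in the same shape as the text in this report.
const sample = 'First paragraph.\n\n\n==== LGBT rights ====\nOn October 8, 2009, Obama signed the Act.';
const segments = splitWikitextHeadings( sample );
console.log( segments );
```

Each `body` segment could then be passed to `nlp.readDoc()` separately, so the heading text never reaches the sentence boundary detector. This sidesteps the problem rather than fixing it, and it would not cover headings that share a line with other text.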

@prtksxna
Member Author

Should this be raised in https://github.com/winkjs/wink-eng-lite-model instead?

@sanjayaksaxena
Member

sanjayaksaxena transferred this issue from winkjs/wink-nlp on Aug 13, 2020