fix: word count #118

Merged

Conversation

gentlementlegen
Member

@gentlementlegen gentlementlegen commented Sep 12, 2024

Resolves #82
QA: https://github.com/ubiquibot/conversation-rewards/actions/runs/10845258477/job/30095719413

Changes

  • added a logLevel option so that log verbosity can be changed through the configuration
  • only the node's own content is now evaluated, ignoring the content of its children
  • HTML comments are stripped from the result

# Conflicts:
#	tests/__mocks__/results/content-evaluator-results.json
#	tests/__mocks__/results/formatting-evaluator-results.json
#	tests/__mocks__/results/github-comment-results.json
#	tests/__mocks__/results/output-reward-split.html
#	tests/__mocks__/results/output.html
#	tests/__mocks__/results/permit-generation-results.json
#	tests/__mocks__/results/reward-split.json
@gentlementlegen gentlementlegen marked this pull request as ready for review September 13, 2024 07:55
@@ -10,6 +11,7 @@ import { userExtractorConfigurationType } from "./user-extractor-config";

export const incentivesConfigurationSchema = T.Object(
  {
    logLevel: T.Enum(LOG_LEVEL, { default: LOG_LEVEL.INFO }),
Collaborator

Shouldn't the default level be error?

Member Author

I think it is nice to get info-level logs by default, because some interesting information is logged during the process.
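The trade-off discussed here can be sketched as follows. This is a minimal illustration, not the project's logger: the `LogLevel` enum, `resolveLogLevel`, and `shouldLog` are hypothetical names, and the real `LOG_LEVEL` enum may have different members. It shows how a default of info keeps error and info messages while suppressing debug output.

```typescript
// Hypothetical stand-in for the LOG_LEVEL enum referenced in the schema;
// the real enum comes from the project's logger and may differ.
enum LogLevel {
  Error = "error",
  Info = "info",
  Debug = "debug",
}

// Lower number means more severe.
const severity: Record<LogLevel, number> = {
  [LogLevel.Error]: 0,
  [LogLevel.Info]: 1,
  [LogLevel.Debug]: 2,
};

interface Config {
  logLevel?: LogLevel;
}

// Mirrors the schema default under discussion: fall back to info when unset.
function resolveLogLevel(config: Config): LogLevel {
  return config.logLevel ?? LogLevel.Info;
}

// A message is emitted when it is at least as severe as the configured level.
function shouldLog(message: LogLevel, configured: LogLevel): boolean {
  return severity[message] <= severity[configured];
}
```

With an error default, the info messages mentioned above would only appear after an explicit configuration change.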

@@ -26,6 +26,8 @@ export class DataPurgeModule implements Module {
.replace(/^>.*$/gm, "")
// Remove commands such as /start
.replace(/^\/.+/g, "")
// Remove HTML comments
.replace(/<!--[\s\S]*?-->/g, "")
Collaborator

@0x4007 0x4007 Sep 13, 2024

If we use a virtual DOM creator like jsdom or mdast, we should be able to query element.textContent, and it should handle this and other situations in a robust manner. Since this is already finished, it's fine. But if there are any problems with the implementation, or more operations you need to perform, consider the virtual DOM approach.

Member Author

The body comes in as plain text and is then transformed from Markdown to HTML, so it is rendered as text. Typically, when fetched from GitHub, the body looks like
Resolves #23 <!-- comment -->
so I don't see a way to skip it other than removing it.

Collaborator

My point is that HTML comments shouldn't be included in element.textContent.

Member Author

Yes, but there is no way to know it is a comment before converting it to HTML, since the MD renderer is run first:
https://github.com/gentlementlegen/conversation-rewards/blob/cf5ecb6a9f1bb551fa01c866d6f8ccaefa7e1804/src/parser/formatting-evaluator-module.ts#L112

(It is actually the same for v1.) So it is first converted to a p element that contains the comment.
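As a small illustration of the approach taken in the diff, the sketch below applies the same regular expression to the raw body before any rendering happens. The `stripHtmlComments` helper and the sample strings are assumptions made for the example, not code from the repository.

```typescript
// Same pattern as in the diff: a lazy match that can span multiple lines.
const HTML_COMMENT = /<!--[\s\S]*?-->/g;

// Hypothetical helper: removes HTML comments from the raw (pre-render) body.
function stripHtmlComments(body: string): string {
  return body.replace(HTML_COMMENT, "");
}

// The example body from the discussion above.
const body = "Resolves #23 <!-- comment -->";
const cleaned = stripHtmlComments(body).trim();

// The lazy quantifier keeps text that sits between two separate comments.
const multi = stripHtmlComments("a<!-- one\ntwo -->b<!-- three -->c");
```

Because the quantifier is lazy, each comment is removed on its own instead of everything between the first opening and the last closing delimiter.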

@@ -139,17 +141,36 @@ export class FormattingEvaluatorModule implements Module {

for (const element of elements) {
  const tagName = element.tagName.toLowerCase();
  const wordCount = this._countWords(this._multipliers[commentType].regex, element.textContent || "");
  // We cannot use textContent otherwise we would duplicate counts, so instead we extract text nodes
Collaborator

textContent of the top level parent element will do the right thing.

Member Author

Correct, will be part of #92
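To make the duplicate-count problem concrete, here is a rough sketch using a tiny hand-rolled tree instead of a real DOM. All types and helpers (`TreeNode`, `textContent`, `ownText`, `allElements`, `countWords`) are hypothetical, and the whitespace-based word regex is an assumption; the module takes its regex from the configured multipliers.

```typescript
// Minimal stand-in for a DOM tree so the sketch runs without a DOM library.
type TextNode = { kind: "text"; value: string };
type ElementNode = { kind: "element"; tagName: string; children: TreeNode[] };
type TreeNode = TextNode | ElementNode;

// Like DOM textContent: the text of the node and all of its descendants.
function textContent(node: TreeNode): string {
  return node.kind === "text" ? node.value : node.children.map(textContent).join("");
}

// Only the text nodes that are direct children of the element.
function ownText(element: ElementNode): string {
  return element.children
    .filter((child): child is TextNode => child.kind === "text")
    .map((child) => child.value)
    .join(" ");
}

// The element itself followed by every nested element.
function allElements(root: ElementNode): ElementNode[] {
  const found: ElementNode[] = [root];
  for (const child of root.children) {
    if (child.kind === "element") found.push(...allElements(child));
  }
  return found;
}

// Assumed word definition: runs of non-whitespace characters.
function countWords(text: string): number {
  return text.match(/\S+/g)?.length ?? 0;
}

// Represents: <p>Hello <strong>bold world</strong> again</p>
const paragraph: ElementNode = {
  kind: "element",
  tagName: "p",
  children: [
    { kind: "text", value: "Hello " },
    { kind: "element", tagName: "strong", children: [{ kind: "text", value: "bold world" }] },
    { kind: "text", value: " again" },
  ],
};

const elements = allElements(paragraph);
// Summing textContent over every element counts "bold world" twice.
const viaTextContent = elements.reduce((sum, el) => sum + countWords(textContent(el)), 0);
// Summing each element's own text nodes counts every word once.
const viaOwnText = elements.reduce((sum, el) => sum + countWords(ownText(el)), 0);
// textContent of the top-level parent alone also counts every word once.
const topLevelOnly = countWords(textContent(paragraph));
```

Both remedies give the same total here; the per-element variant additionally keeps a breakdown by tag, which is the part the reviewer considers out of scope.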

score,
};
logger.debug("Tag content results", { tagName, symbols, text: element.textContent });
// If we already had that tag included in the result, merge them and update total count
Collaborator

I suppose that for the statistics it might be interesting to count words per element, but honestly it's out of scope and doesn't add business value while complicating the code.

Member Author

Will be changed in #92
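For readers unfamiliar with the per-tag bookkeeping being discussed, the merge step described by the code comment could look roughly like this. The `TagResult` shape and the `addTagResult` helper are hypothetical; the actual result type in the module may carry more fields.

```typescript
// Hypothetical shape of a per-tag entry.
interface TagResult {
  symbols: number; // words counted for this tag so far
  score: number; // score assigned to this tag
}

type TagResults = Record<string, TagResult>;

// If the tag was already seen, merge by adding to its count;
// otherwise create a new entry for it.
function addTagResult(results: TagResults, tagName: string, symbols: number, score: number): TagResults {
  const existing = results[tagName];
  if (existing) {
    existing.symbols += symbols;
  } else {
    results[tagName] = { symbols, score };
  }
  return results;
}

const tagResults: TagResults = {};
addTagResult(tagResults, "p", 3, 1);
addTagResult(tagResults, "strong", 2, 1);
addTagResult(tagResults, "p", 4, 1);
```

Counting the top-level textContent once, as suggested in the review, would remove the need for this merge entirely.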

@0x4007 0x4007 merged commit 3120201 into ubiquity-os-marketplace:development Sep 13, 2024
6 checks passed
@ubiquity-os ubiquity-os bot mentioned this pull request Sep 13, 2024
@gentlementlegen gentlementlegen deleted the fix/token-count branch September 13, 2024 08:38
Development

Successfully merging this pull request may close these issues.

Unexpected Word Count
2 participants