Add Wiki retreval service #324

Merged
merged 17 commits into from
Aug 14, 2024

Conversation

PengfeiHePower
Contributor



Description

wiki.py implements Wikipedia retrieval, including text, category lists, infoboxes, images and tables.
wiki_test.py implements the unit tests.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

Collaborator

@ZiTao-Li ZiTao-Li left a comment

See the inline comment. Other LGTM

Inline review comment (outdated, resolved): src/agentscope/service/web/wiki.py
@ZiTao-Li ZiTao-Li requested a review from DavdGao July 3, 2024 22:32
Collaborator

@DavdGao DavdGao left a comment

I'm confused about the functions provided in this PR:

  1. Why do we provide wiki_get_infobox, wiki_get_all_wikipedia_tables, wiki_get_page_images_with_captions and wiki_get_page_content_by_paragraph separately, rather than combining them into one function that provides Wikipedia search functionality and returns all available information? For example:
{
    "infobox": ...
    "content": ...  # the tables should be in the content
    "image": ...
}

If I were a user who wanted to search Wikipedia, a simple wikipedia_search function would be enough; too much categorization can create a barrier to understanding. For example:

  • What do the infobox and tables stand for?
  • How should users decide which function to use?

  2. The functionality partly overlaps with that of the service function digest_webpage. Try to reuse it rather than reimplementing it.

  3. I wonder how we handle fuzzy search in Wikipedia. For example, Wikipedia sometimes returns a candidate list rather than jumping to the corresponding page, as follows.

image
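
The combined return value suggested in the review above could be sketched roughly as follows. This is only an illustration of the proposed shape, not code from the PR: the three `fake_*` retrievers are hypothetical stand-ins for whatever actually queries Wikipedia, and the `wikipedia_search` signature shown here is an assumption.

```python
from typing import Any, Callable, Dict, List

# A retriever takes an entity name and returns one kind of page data.
Retriever = Callable[[str], Any]


def fake_infobox(entity: str) -> Dict[str, str]:
    """Hypothetical stand-in: infobox fields as key-value pairs."""
    return {"Name": entity}


def fake_content(entity: str) -> List[str]:
    """Hypothetical stand-in: page text split into paragraphs."""
    return [f"First paragraph about {entity}."]


def fake_images(entity: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in: images with captions (none here)."""
    return []


def wikipedia_search(entity: str, retrievers: Dict[str, Retriever]) -> Dict[str, Any]:
    """Call every retriever once and collect the results under one dict."""
    return {key: retrieve(entity) for key, retrieve in retrievers.items()}


result = wikipedia_search(
    "Python",
    {"infobox": fake_infobox, "content": fake_content, "image": fake_images},
)
print(list(result))
```

With this shape the caller only needs one entry point and picks the keys it cares about from the returned dictionary.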

Inline review comments (outdated, resolved): 1 on setup.py, 9 on src/agentscope/service/web/wiki.py
@PengfeiHePower
Contributor Author

  1. I have added a new function, "wiki_page_retrieval", which returns all types of information as a dictionary, excluding the category list, because the category list is a totally different type of content.
  2. We would like to keep a key-value structure for the contents of the 'infobox', but the 'parse_html_to_text' function in web_digest only provides plain text, so it is not convenient to reuse.
  3. We have provided a function, '_check_entity_existence', which first checks whether the entity exists on Wikipedia. If the entity does not have a page (as in your example), it returns the first 5 entities listed (i.e. Tong Dawei, Gao Lu, ... in your example).
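
The fallback described in the last point could look something like the sketch below. It is not the PR's `_check_entity_existence`; `resolve_entity`, its parameters and its return shape are hypothetical, and the page lookup and search are replaced by an in-memory set and a stub function so the example runs offline. The candidate names beyond the two mentioned above are placeholders.

```python
from typing import Callable, Dict, List, Set

MAX_CANDIDATES = 5  # the reply above mentions returning the first 5 entities


def resolve_entity(
    entity: str,
    existing_titles: Set[str],
    search: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Return the page title if it exists, otherwise the top search candidates."""
    if entity in existing_titles:
        return {"exists": True, "title": entity, "candidates": []}
    # No exact page: fall back to the first few search results.
    return {
        "exists": False,
        "title": None,
        "candidates": search(entity)[:MAX_CANDIDATES],
    }


def fake_search(query: str) -> List[str]:
    """Hypothetical stub standing in for a Wikipedia search call."""
    return ["Tong Dawei", "Gao Lu", "Candidate 3", "Candidate 4",
            "Candidate 5", "Candidate 6"]


titles = {"Tong Dawei", "Gao Lu"}
hit = resolve_entity("Gao Lu", titles, fake_search)
miss = resolve_entity("Gao", titles, fake_search)
print(hit["exists"], miss["candidates"])
```

Returning the candidates instead of raising lets the calling agent decide which page to request next.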

Collaborator

@DavdGao DavdGao left a comment

lgtm

@DavdGao DavdGao merged commit 01530ee into modelscope:main Aug 14, 2024
13 checks passed