Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse DataPack on backend #232

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

mylibrar
Copy link
Collaborator

@mylibrar mylibrar commented Jul 5, 2022

This PR fixes #230 .

Description of changes

The DataPack parsing is moved from frontend to backend.

Possible influences of this PR

We still maintain support to the previous DataPack format so that the example projects won't break.

Copy link
Member

@hunterhector hunterhector left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still too many low-level calls and reliance on the internal serialization format. This does not utilize the power of parsing the data pack from backend, since at the back end we should use all Forte interfaces to make this task more standard. The usage of DataStore is making the problem even worse.

@@ -371,10 +351,9 @@ def new_link(request, document_id):
link = received_json_data.get('data')
link["py/state"]['_tid'] = link_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like here we use a certain data structure very similar to the Forte format, and need to use some internal understanding. I feel like this would make the system still depend on some certain Forte version.

"""

ANNOTATION_LIST = "annotations"
LINK_LIST = "links"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why only Annotation and Link have the special class variable, how about Group or other top level ontologies?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because Stave users can only add/edit/delete these two types of top ontologies. Looks like we haven't design any mechanism for users to manage Group or other types of ontologies through Stave UI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we can, by enabling the Group extension


# Clear the previous type info to avoid conflicts in onto
# definitions
DataStore._type_attributes = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to access the DataStore interface? I think this is a low level interface and if we access this that means we actually don't have backward incompatibility.

Btw, not having backward incompatibility should be fine, as long as we know which Forte-Stave version pairs work, but having access to the low level implementation details here looks problematic.

"""
if self._pack is None:
# Add entry to DataPack with legacy format
if self._data_store._is_subclass(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many usages of low-level details here, can we use the public API of data pack only?

Note that using the public API we should know whether a type_name is a subclass of Annotation.

if self._data_store._is_subclass(
type_name=entry_dict["py/object"], cls=Annotation
):
self._pack_json['py/state']["annotations"].append(entry_dict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we have to keep the py/state structure here? These are stuff used by jsonpickle.

def _transform_pack_json(self):
"""
Transform a DataPack dictionary with legacy format to a json for
frontend rendering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This transform still looks manual to me. If we want to support legacy format we should directly use Forte data pack API. For example:

for annotation in pack.get("Annotation"):
  annotations.append(
     "span": {
            "begin": annotation.begin,
            "end": annotation.end
      },
  )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes most of this part is still manual parsing. To support legacy format, we don't have access to the pack object since the it's not deserializable using the latest forte. We only have access to a json object like this which is derived from old version of forte.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we can feel free to directly use Forte to parse the pack object. So for Stave to work, we simply need to install the correct Forte version. For example, we can update the example db with the new Forte format in the master branch now. To use the old forte format we simply downgrade Stave to a prior version.

Cross-version compatibility is another issue that we can deal with separately. My opinion is that that's not very important right now. Even if we want to deal with that, it probably should happen in Forte repo not Stave repo

@mylibrar
Copy link
Collaborator Author

Block by asyml/forte#881.
There is still too much manual parsing in this PR and we should not directly interact with low level data structures like DataStore. To achieve that, we need add new interfaces in DataPack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate DataPack parsing to backend
2 participants