Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept recap uploads for docket iQuery pages #2374

Merged
merged 3 commits into from
Nov 27, 2022

Conversation

albertisfu
Copy link
Contributor

As requested in freelawproject/recap#251

This PR adds support for a new recap upload type: IQUERY_PAGE = 12

So now we can store Docket Iquery pages and process them.

As commented in freelawproject/juriscraper#589 a Docket Iquery page can be parsed by Juriscraper DocketHistoryReport.

A new task was added process_docket_iquery_page this task adds or updates the docket mentioned in the uploaded docket Iquery page.

@mlissner I have one question, should we add/update the SOLR index for dockets added/updated via a Docket Iquery page using add_or_update_recap_docket?

I saw that for Dockets and Docket history reports uploads the Solr indexes for all its related documents are updated if there are new docket entries/recap documents, or if the docket date_last_index is older than 1 hour ago.

I also read within add_or_update_recap_docket docstring that if the case name changes all the docket entries must be updated.

Since Docket Iquery pages don't have docket entries and Iquery pages might be uploaded many times (due to they are free). I think we only need to update Solr indexes in case we detected the case name have changed after an Iquery page upload?

@mlissner
Copy link
Member

I think we only need to update Solr indexes in case we detected the case name have changed after an Iquery page upload?

Yeah, I think that's a good way to do it.

mark_pq_status(pq, msg, PROCESSING_STATUS.QUEUED_FOR_RETRY)
raise self.retry(exc=exc)

report = DocketHistoryReport(map_cl_to_pacer_id(pq.court_id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the PR in Juriscraper, this surprises me.

Comment on lines 828 to 831
@app.task(
bind=True, max_retries=3, interval_start=5 * 60, interval_step=5 * 60
)
def process_docket_iquery_page(self, pk):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks fine (aside from my one other comment), but did you look around to see if we had other code that'd do this instead? Alternatively, if we have other tasks that are very similar, can/should they be redone to use this code? I just fear repetition, though it's not the end of the world in this case.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back at you. A couple questions.

@albertisfu albertisfu force-pushed the iquery-pages-support-for-recap-uploads branch from 42d47e6 to 5aaf1fd Compare November 22, 2022 23:01
@albertisfu
Copy link
Contributor Author

Thanks! now using CaseQuery to parse iquery page uploads.

We have already tasks for CaseQuery like do_case_query_by_pacer_case_id and make_docket_by_iquery but they query PACER to get the iquery page so I think they are not too similar to the new one process_case_query_page that processes the upload from a ProcessingQueue.

I think this could be merged into a task like process_recap_docket_history_report however I'm not sure since right now each upload_type has its own task, but it could be a new task like: process_case_query_page_or_docket_history_report. Does it sound good?

  • Also, added add_bankruptcy_data_to_docket in case the iquery page contains bankruptcy_data.

  • Added the logic to detect if this is a new docket or if the case_name has changed, so content_updated is assigned as True in order to add or update in Solr.

Let me know what you think.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. One suggestion. If you think it makes sense, please accept it and merge.

For your question about using the docket history parser, that's interesting. I have to admit I never noticed that the docket history page has the same info as the case query page! I was confused why you were using the docket history parser and had to go look at a docket history page before I figured out what you were thinking!

I think for now there's no need to merge them. I kind of like that they're isolated in case there's some minor difference we don't know about.

I also think we could refactor some of our other case_query code to generate a processing queue item (and then process it), but for now it doesn't seem worth the effort.

To summarize:

  • One suggestion for you to consider.
  • Woah, Case Query and Docket History are same?
  • We could refactor code, but let's worry about that another time.

Thank you.

Comment on lines 873 to 881
content_updated = False
current_case_name = d.case_name
d.add_recap_source()
update_docket_metadata(d, data)

if current_case_name != d.case_name or not d.pk:
# This docket should be added to Solr or updated since is new or the
# case name has changed.
content_updated = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things. The search engine only works if we have RECAPDocument items in a docket. Without a RECAPDocument object, we can't add it to Solr, unfortunately. All case query pages do not have enough information to make a RECAP Document for the docket, so if this is a new docket, that's great, BUT no reason to add it to solr (it lacks any RECAP Documents).

Second, we should put content_updated closer to where it's doing work.

So:

Suggested change
content_updated = False
current_case_name = d.case_name
d.add_recap_source()
update_docket_metadata(d, data)
if current_case_name != d.case_name or not d.pk:
# This docket should be added to Solr or updated since is new or the
# case name has changed.
content_updated = True
current_case_name = d.case_name
d.add_recap_source()
update_docket_metadata(d, data)
content_updated = False
if current_case_name != d.case_name:
# Update the docket in PACER since is new or the
# case name has changed.
content_updated = True

@albertisfu
Copy link
Contributor Author

Thanks for your suggestion!

I applied it, just changed it a bit since we need to only update in SOLR dockets that are not new, we need to confirm it has a pk already. Due to if it's new the current_case_name would be blank that's different from the updated case_name.

Since you commented that the docket only should be updated in SOLR if it contains RECAP Documents I added an additional condition to check if the docket contains docket entries.

content_updated = False
if current_case_name != d.case_name and d.pk:
    if d.docket_entries.exists():
        content_updated = True

Does it make sense?

About your question:

Case Query and Docket History are same?

Yeah, seems that they are pretty similar, the difference I could see is that the Docket History report contains some docket entries.

@mlissner
Copy link
Member

Great. Merging, thank you.

@mlissner mlissner merged commit 188c575 into main Nov 27, 2022
@mlissner mlissner deleted the iquery-pages-support-for-recap-uploads branch November 27, 2022 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants