Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ai-form-recognizer] Lazy iterator for words of a line #18444

Merged

Conversation

witemple-msft
Copy link
Member

@witemple-msft witemple-msft commented Oct 29, 2021

This PR adds support for accessing the words of a DocumentLine. The computation is lazy, supported by a fast algorithm that assumes the words and spans are sorted. When the first element is iterated, a binary search is performed to find the first possible child element n the words array. Each subsequent iteration is nearly constant-time, as the elements are sorted and the algorithm optimizes the iteration by only comparing spans/words for as long as they are possible candidates for inclusion. The tests compare the results of this algorithm with a naive implementation.

It also adds samples and a CHANGELOG entry.

@witemple-msft witemple-msft added Client This issue points to a problem in the data-plane of the library. Cognitive - Form Recognizer labels Oct 29, 2021
@witemple-msft witemple-msft self-assigned this Oct 29, 2021
@azure-sdk
Copy link
Collaborator

API changes have been detected in @azure/ai-form-recognizer. You can review API changes here

API changes

+     words?: () => IterableIterator<DocumentWord>;

@witemple-msft witemple-msft marked this pull request as ready for review November 5, 2021 20:45
@azure-sdk
Copy link
Collaborator

API changes have been detected in @azure/ai-form-recognizer. You can review API changes here

API changes

+     words: () => IterableIterator<DocumentWord>;

@witemple-msft
Copy link
Member Author

/azp run js - ai-form-recognizer - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@deyaaeldeen deyaaeldeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall but I left a few comments that I believe worth considering before merging.

Also, could you please run the release script?

sdk/formrecognizer/ai-form-recognizer/src/lro/analyze.ts Outdated Show resolved Hide resolved
* @returns a convenience layer DocumentLine
*/
function toDocumentLineFromGenerated(
generated: GeneratedDocumentLine,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: perhaps the generated param can be typed as DocumentLine instead of being cast to it twice in the body.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't satisfy DocumentLine until after the words method is added, so I'm reluctant to do this.

Comment on lines +122 to +130
it("search finds correct index", () => {
for (const datum of TEST_DATA) {
const testSpan = { offset: datum.span.offset, length: 1 };
assert.strictEqual(
(iteratorFromFirstMatchBinarySearch(testSpan, TEST_DATA).next().value as TestData)?.id,
naiveFindFirst(testSpan, TEST_DATA)?.id
);
}
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: did you consider a quickcheck-style tests? I am not sure if JS has a good lib for it. Found https://github.com/gampleman/quick_check.js/ but not sure how good it is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't consider this, but it's an interesting idea. When we support more generic kinds of element relationships, I'd like to add some fuzzing to this test matrix, but I will explore that during MQ.

sdk/formrecognizer/ai-form-recognizer/src/lro/analyze.ts Outdated Show resolved Hide resolved
Copy link
Member

@deyaaeldeen deyaaeldeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@witemple-msft witemple-msft merged commit 9fa6c41 into Azure:main Nov 5, 2021
sadasant pushed a commit to sadasant/azure-sdk-for-js that referenced this pull request Nov 6, 2021
* [ai-form-recognizer] Lazy iterator for words of a line

* Use method instead of property

* Regenerate API

* Polished, wrote changelog, added some more tests, samples

* Updated API MD

* Improved docs

* Apply changes from review
danieljurek pushed a commit that referenced this pull request Nov 10, 2021
* [ai-form-recognizer] Lazy iterator for words of a line

* Use method instead of property

* Regenerate API

* Polished, wrote changelog, added some more tests, samples

* Updated API MD

* Improved docs

* Apply changes from review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client This issue points to a problem in the data-plane of the library. Cognitive - Form Recognizer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants