Adding QR codes support in the ImageRedactorEngine #1036

vpvpvpvp · 2023-02-17T18:49:42Z

Change Description

This PR adds to the Presidio Image Redactor the ability to analyze the content of QR codes in an image.

Summary of Changes

Added abstract class QRRecognizer for QR code recognizers
Added concrete OpenCVQRRecongnizer which uses OpenCV to recognize QR codes
Added QRImageAnalyzerEngine which uses QRRecognizer for QR code recognition and AnalyzerEngine to analyze its contents for PII entities
Modified ImagePiiVerifyEngine and ImageRedactorEngine to allow using QRImageAnalyzerEngine as an alternative to ImageAnalyzerEngine

Issue reference

This PR fixes issue #1035

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

vpvpvpvp · 2023-02-17T18:50:29Z

@microsoft-github-policy-service agree

SharonHart · 2023-02-19T18:16:59Z

/azp run

azure-pipelines · 2023-02-19T18:17:13Z

Azure Pipelines successfully started running 1 pipeline(s).

SharonHart · 2023-02-20T12:09:31Z

@vpvpvpvp
Seems like the unit tests are failing in the CI, what version of Tesseract OCR did you use to generate the baseline images for the tests?

vpvpvpvp · 2023-02-20T17:34:32Z

Tesseract OCR - 5.2.0
pytesseract - 0.3.10
OS - MacOS Ventura 13.2

Indeed, I noticed that in different test environments the output of ImagePiiVerifyEngine can differ by a few pixels. For example, the first image below is the result on Mac, the next is the result on Ubuntu, and the last is their difference. In both environments the recognized text and the box coordinates are the same.

omri374 · 2023-02-20T19:46:44Z

Hi @vpvpvpvp, before going deeper into the code, what are your thoughts on having the QR code analyzer potentially work in parallel with OCR? Something like this:

stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    redact_image --> return_image

Then we could always extend it to more types of detectors in the future, similar to the text analyzer architecture, e.g.:

stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    read_image --> extract_faces
    read_image --> extract_license_plates
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    extract_faces --> redact_image
    extract_license_plates --> redact_image
    redact_image --> return_image

One way to achieve this is to have QRImageAnalyzerEngine extend ImageAnalyzerEngine, and then we could later create a composable ImageAnalyzerEngine which holds multiple image analyzers.
WDYT?
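
The composable design described above could be sketched roughly as follows. This is a hypothetical illustration only: `Detection`, `CompositeImageAnalyzer`, and the two stub analyzers are made-up names that do not exist in Presidio, and the stubs return fixed results where real analyzers would run OCR or QR decoding.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol


@dataclass(frozen=True)
class Detection:
    """Hypothetical result type: an entity label plus a bounding box."""

    entity_type: str
    left: int
    top: int
    width: int
    height: int


class ImageAnalyzer(Protocol):
    """Anything that can turn an image into a list of detections."""

    def analyze(self, image: Any) -> List[Detection]: ...


class CompositeImageAnalyzer:
    """Runs several analyzers on the same image and merges their results."""

    def __init__(self, analyzers: List[ImageAnalyzer]):
        self.analyzers = list(analyzers)

    def analyze(self, image: Any) -> List[Detection]:
        detections: List[Detection] = []
        for analyzer in self.analyzers:
            detections.extend(analyzer.analyze(image))
        return detections


class StubOcrAnalyzer:
    """Stand-in for an OCR-based analyzer (returns a fixed detection)."""

    def analyze(self, image: Any) -> List[Detection]:
        return [Detection("PERSON", 10, 10, 80, 20)]


class StubQrAnalyzer:
    """Stand-in for a QR-based analyzer (returns a fixed detection)."""

    def analyze(self, image: Any) -> List[Detection]:
        return [Detection("URL", 200, 50, 100, 100)]


composite = CompositeImageAnalyzer([StubOcrAnalyzer(), StubQrAnalyzer()])
detections = composite.analyze(image=None)
```

Because the redaction step only needs bounding boxes, further detectors (faces, license plates) could be added to the list without changing the code that consumes the merged result.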

vpvpvpvp · 2023-02-21T00:43:21Z

Hi @omri374, that sounds great! In the current PR you can choose between QRImageAnalyzerEngine and ImageAnalyzerEngine, but it would be great to be able to run them and other analyzers in parallel. At first, I wanted to extend ImageAnalyzerEngine a bit, so that it would also accept QRRecognizer as a parameter in addition to OCR. Something like this:

class ImageAnalyzerEngine:
    """ImageAnalyzerEngine class.

    :param analyzer_engine: The Presidio AnalyzerEngine instance
        to be used to detect PII in text
    :param ocr: the OCR object to be used to detect text in images.
    :param qr: the QRRecognizer object to detect and decode text in QR codes
    """
    def __init__(
        self,
        analyzer_engine: Optional[AnalyzerEngine] = None,
        ocr: Optional[OCR] = None,
        qr: Optional[QRRecognizer] = None,
    ):

And then, in the analyze function, extract the text and its coordinates first with self.ocr and then with self.qr. But then I decided not to change ImageAnalyzerEngine this time, to make fewer edits to the original source code.

SharonHart · 2023-02-21T08:38:04Z

I would suggest using the original image as the baseline (not a screenshot of it or of the screen). If it's still failing, let's see how to add thresholding to the comparison.
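
One way thresholding could be added to such a comparison is sketched below. It is not the test helper used in this repository: `images_match`, `channel_tolerance`, and `max_mismatch_ratio` are hypothetical names, and the images are assumed to have already been converted into flat sequences of pixel tuples by whatever image library is in use.

```python
from typing import Sequence

Pixel = Sequence[int]  # e.g. an (R, G, B) tuple


def images_match(
    actual: Sequence[Pixel],
    expected: Sequence[Pixel],
    channel_tolerance: int = 5,
    max_mismatch_ratio: float = 0.01,
) -> bool:
    """Return True if the two pixel sequences are equal within a tolerance.

    A pixel counts as mismatched when any channel differs by more than
    ``channel_tolerance``. The images match when the share of mismatched
    pixels does not exceed ``max_mismatch_ratio``.
    """
    if len(actual) != len(expected):
        return False
    if not expected:
        return True
    mismatched = sum(
        1
        for a, e in zip(actual, expected)
        if any(abs(ca - ce) > channel_tolerance for ca, ce in zip(a, e))
    )
    return mismatched / len(expected) <= max_mismatch_ratio
```

The two thresholds separate small per-channel rendering noise from a handful of genuinely different pixels, which is the kind of platform-dependent difference described earlier in the thread.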

vpvpvpvp · 2023-02-21T09:47:08Z

Updated the test images, locally the tests passed.

SharonHart · 2023-02-21T11:35:54Z

/azp run

SharonHart · 2023-02-21T11:51:43Z

@vpvpvpvp It seems to work now, but it is failing on another test that should be resolved by #1032. Try rebasing once that is merged.

omri374 · 2023-02-28T12:36:00Z

/azp run

azure-pipelines · 2023-02-28T12:36:12Z

Azure Pipelines successfully started running 1 pipeline(s).

presidio-image-redactor/Pipfile

omri374 · 2023-03-01T05:35:35Z

/azp run

azure-pipelines · 2023-03-01T05:35:50Z

Azure Pipelines successfully started running 1 pipeline(s).

presidio-image-redactor/setup.py

omri374 · 2023-03-01T07:32:03Z

/azp run

azure-pipelines · 2023-03-01T07:32:17Z

Azure Pipelines successfully started running 1 pipeline(s).

SharonHart · 2023-03-01T09:56:40Z

/azp run

azure-pipelines · 2023-03-01T09:56:54Z

Azure Pipelines successfully started running 1 pipeline(s).

SharonHart · 2023-03-02T07:31:26Z

/azp run

azure-pipelines · 2023-03-02T07:31:41Z

Azure Pipelines successfully started running 1 pipeline(s).

SharonHart · 2023-03-02T07:56:08Z

@vpvpvpvp you have a green build 🎊
Will try to review the code later today

SharonHart · 2023-03-06T06:30:05Z

presidio-image-redactor/tests/integration/test_qr_image_analyzer_engine_integration.py

+from tests.integration.methods import get_resource_image
+
+
+def test_given_qr_image_then_text_entities_are_recognized_correctly(


very nice scenarios

Should I add more complex/realistic scenarios?

SharonHart · 2023-03-06T06:48:46Z

presidio-image-redactor/presidio_image_redactor/qr_recognizer.py

+
+                recognized.append(
+                    QRRecognizerResult(
+                        text=text, bbox=[x, y, w, h], polygon=[*p.flatten(), *p[0]]


A list is passed for bbox, but it is declared as a tuple. Also, in some places bbox is a dictionary. I'm not sure which is better, but at some point I think we should use a common bbox class.
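
A common bbox class of the kind mentioned here might look like the following minimal sketch. `BBox` and its methods are hypothetical, and the dictionary keys `left`, `top`, `width`, and `height` are an assumption about how dictionary-style boxes are keyed.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class BBox:
    """Immutable bounding box that accepts the list, tuple, and dict forms."""

    left: int
    top: int
    width: int
    height: int

    @classmethod
    def from_sequence(cls, values: Sequence[int]) -> "BBox":
        # Accepts [x, y, w, h] or (x, y, w, h).
        left, top, width, height = values
        return cls(int(left), int(top), int(width), int(height))

    @classmethod
    def from_dict(cls, values: Dict[str, int]) -> "BBox":
        # Assumed key names; adjust to whatever the dict form really uses.
        return cls(
            int(values["left"]),
            int(values["top"]),
            int(values["width"]),
            int(values["height"]),
        )

    def as_tuple(self) -> Tuple[int, int, int, int]:
        return (self.left, self.top, self.width, self.height)
```

A frozen dataclass keeps the box immutable and comparable by value, so results built from a list and from a dictionary compare equal when they describe the same region.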

SharonHart · 2023-03-06T06:56:12Z

presidio-image-redactor/presidio_image_redactor/qr_recognizer.py

+            for text, p in zip(decoded, points):
+                (x, y, w, h) = cv2.boundingRect(p)
+
+                recognized.append(
+                    QRRecognizerResult(
+                        text=text, bbox=[x, y, w, h], polygon=[*p.flatten(), *p[0]]
+                    )
+                )


Prefer immutability.

Suggested change

-            for text, p in zip(decoded, points):
-                (x, y, w, h) = cv2.boundingRect(p)
-
-                recognized.append(
-                    QRRecognizerResult(
-                        text=text, bbox=[x, y, w, h], polygon=[*p.flatten(), *p[0]]
-                    )
-                )
+            recognized = [
+                QRRecognizerResult(
+                    text=text,
+                    bbox=cv2.boundingRect(point),
+                    polygon=[*point.flatten(), *point[0]],
+                )
+                for text, point in zip(decoded, points)
+            ]

(If you find it too complex, for readability's sake, extract parts into private helpers, e.g. _get_polygon.)

I will add these changes, thanks for the suggestion.
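
The helper extraction suggested in this thread could be sketched as below. To stay dependency-free, this version uses plain tuples and a simple min/max rectangle in place of NumPy arrays and `cv2.boundingRect`, so its width and height conventions may differ from OpenCV's. `QRResult`, `_get_bbox`, and `build_results` are hypothetical names, with `QRResult` standing in for `QRRecognizerResult`.

```python
from typing import List, NamedTuple, Sequence, Tuple

Point = Tuple[int, int]


class QRResult(NamedTuple):
    """Hypothetical, immutable stand-in for QRRecognizerResult."""

    text: str
    bbox: Tuple[int, int, int, int]
    polygon: Tuple[int, ...]


def _get_bbox(points: Sequence[Point]) -> Tuple[int, int, int, int]:
    """Axis-aligned (x, y, w, h) rectangle around the corner points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def _get_polygon(points: Sequence[Point]) -> Tuple[int, ...]:
    """Flatten the corner points and repeat the first one to close the polygon."""
    flat = [coord for point in points for coord in point]
    return (*flat, *points[0])


def build_results(
    decoded: Sequence[str], points: Sequence[Sequence[Point]]
) -> List[QRResult]:
    """Pair each decoded text with the geometry of its QR code."""
    return [
        QRResult(text=text, bbox=_get_bbox(corners), polygon=_get_polygon(corners))
        for text, corners in zip(decoded, points)
    ]
```

With the geometry moved into small private helpers, the comprehension stays short enough to read at a glance.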

omri374

Thanks again for the contribution! Left some points for discussion, hopefully we can simplify the design and decouple the QR code analysis from downstream classes.

omri374 · 2023-03-02T13:08:35Z

presidio-image-redactor/presidio_image_redactor/qr_recognizer.py

+import numpy as np
+
+
+class QRRecognizerResult:


Can we have that class inherit from ImageRecognizerResult? https://github.com/vpvpvpvp/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/entities/image_recognizer_result.py

I'm not so sure about that. QRRecognizerResult is needed to represent the results of QR code recognition (bboxes and raw text without PII analysis). In this sense, QRRecognizerResult is closer to the dictionary returned by the perform_ocr method of the TesseractOCR(OCR) class. At the same time, ImageRecognizerResult already includes the results of text analysis by the presidio_analyzer.

I see, thanks for the clarification

omri374 · 2023-03-06T07:08:24Z

presidio-image-redactor/presidio_image_redactor/qr_image_analyzer_engine.py

+from presidio_image_redactor.qr_recognizer import OpenCVQRRecongnizer
+
+
+class QRImageAnalyzerEngine:


Can this class be inherited from ImageAnalyzerEngine? Just a question, to see if we can simplify the design instead of extending it to a new set of independent classes.

Was thinking exactly the same, see below.

Yes, it can be inherited from ImageAnalyzerEngine. My concern is that in this case, QRImageAnalyzerEngine will also inherit the logic of working with ocr tools not related to QR code recognition.

Yes, that's my concern too. As the package is still in beta, we should (carefully) consider breaking backward compatibility. We'll do some thinking on this and get back to you. We can also have a quick design session together over video if you're interested.

Yes, that sounds interesting. If you have time, we could do that.

Sure. To avoid putting personal emails on GH, could you please email [email protected] and we'll continue the discussion over email?

omri374 · 2023-03-06T07:09:22Z

presidio-image-redactor/presidio_image_redactor/image_redactor_engine.py

-    def __init__(self, image_analyzer_engine: ImageAnalyzerEngine = None):
+    def __init__(
+        self,
+        image_analyzer_engine: Union[ImageAnalyzerEngine, QRImageAnalyzerEngine] = None,


If QRImageAnalyzerEngine inherits from ImageAnalyzerEngine, then this class could be independent of the QR implementation

SharonHart · 2023-03-06T07:56:13Z

presidio-image-redactor/presidio_image_redactor/image_redactor_engine.py

-        bboxes = self.image_analyzer_engine.analyze(
-            image, ocr_kwargs, **text_analyzer_kwargs
-        )
+        if isinstance(self.image_analyzer_engine, QRImageAnalyzerEngine):


@omri374 @vpvpvpvp Any ideas on making it more open-closed?
Maybe a single ImageAnalyzerEngine that we inherit from with optional **ocr_kwargs?

If QRImageAnalyzerEngine inherits directly from ImageAnalyzerEngine, we would only need to add ocr_kwargs to the analyze method of QRImageAnalyzerEngine. This is probably the easiest way.

Potentially, the best implementation seems to be one where ImageAnalyzerEngine orchestrates different recognizers (OCR recognizer, QR recognizer, etc.), in the vein of what was suggested earlier in #1036 (comment).
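
The inheritance option discussed in this thread can be illustrated with a small sketch. All class names here are hypothetical stand-ins, not the Presidio classes, and the analyzers return placeholder strings instead of bounding boxes. The point is that the caller makes the same call for every analyzer type, so no `isinstance` check is needed.

```python
from typing import Any, Dict, List, Optional


class BaseImageAnalyzer:
    """Hypothetical common interface shared by all image analyzers."""

    def analyze(
        self,
        image: Any,
        ocr_kwargs: Optional[Dict[str, Any]] = None,
        **text_analyzer_kwargs: Any,
    ) -> List[str]:
        raise NotImplementedError


class OcrAnalyzer(BaseImageAnalyzer):
    def analyze(self, image, ocr_kwargs=None, **text_analyzer_kwargs):
        options = ocr_kwargs or {}
        return [f"ocr(lang={options.get('lang', 'eng')})"]


class QrAnalyzer(BaseImageAnalyzer):
    def analyze(self, image, ocr_kwargs=None, **text_analyzer_kwargs):
        # QR decoding has no use for OCR options: accepted, then ignored.
        return ["qr"]


class Redactor:
    """Depends only on the base interface, not on a concrete analyzer."""

    def __init__(self, image_analyzer: BaseImageAnalyzer):
        self.image_analyzer = image_analyzer

    def find_regions(self, image, ocr_kwargs=None, **text_analyzer_kwargs):
        # Same call for every analyzer type; no isinstance branching.
        return self.image_analyzer.analyze(image, ocr_kwargs, **text_analyzer_kwargs)
```

The trade-off raised in the review still applies: the QR subclass carries an `ocr_kwargs` parameter it never uses, which is the cost of keeping a single call signature.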

SharonHart

Looks great, left some minor comments

vpvpvpvp added 18 commits February 8, 2023 12:43

Add abstract QRRecognizer

5600a52

Add OpenCVQRRecongnizer

4a6832a

Move

6112513

Add multiple qr codes support

9511480

Convert PIL to numpy

72a51f3

Fix empty points list

ad81ade

Fix multiple qr codes detection and decoding

3977ebb

Update

2e3ca74

Update qr_recognizer.py

ecd403c

Update image redactor

b78dffd

Update verify engine

a5a32fd

Create QRImageAnalyzerEngine

3c3fee9

Change detection order

8f170c8

Add tests

984286f

Update __init__.py

2a53628

Lock deps

ac97022

Fix Dockerfile

2bc39bb

Update docs

007e0d9

vpvpvpvp requested a review from a team as a code owner February 17, 2023 18:49

Update test qr images

1de5f94

vpvpvpvp and others added 3 commits February 25, 2023 14:24

fix decode duplication

f248581

Merge branch 'main' into main

843ef17

Merge branch 'main' into main

17aff14

omri374 reviewed Feb 28, 2023

presidio-image-redactor/Pipfile

Update dependencies

e29f0e0

omri374 reviewed Mar 1, 2023

presidio-image-redactor/setup.py (outdated)

Update presidio-image-redactor/setup.py

bb8095c

Update presidio-image-redactor/setup.py

8ac62fb

SharonHart reviewed Mar 6, 2023

omri374 reviewed Mar 6, 2023

SharonHart reviewed Mar 6, 2023

SharonHart previously approved these changes Mar 6, 2023

vpvpvpvp added 2 commits March 7, 2023 00:39

Update recognize method

4975682

Merge branch 'main' of https://github.com/vpvpvpvp/presidio

98a7d1a

vpvpvpvp dismissed SharonHart’s stale review via 98a7d1a March 6, 2023 21:42

pabloromeo approved these changes Jun 29, 2024
