
Commit e102738

Merge pull request #55 from IGNF/dev

Integration of Entropy in decision process

CharlesGaydon authored Mar 28, 2022 · 2 parents 27d048d + e23043b
Showing 22 changed files with 271 additions and 2,529 deletions.
27 changes: 19 additions & 8 deletions .github/workflows/cicd.yaml
@@ -25,17 +25,28 @@ jobs:
run: docker run lidar_prod_im pytest --ignore=actions-runner --ignore="notebooks"

- name: Full module run on LAS subset
run: docker run -v /var/data/cicd/CICD_github_assets:/CICD_github_assets lidar_prod_im

- name: Evaluate decisions using optimization code on a single, corrected LAS
run: >
docker run -v /var/data/cicd/CICD_github_assets:/CICD_github_assets lidar_prod_im
python lidar_prod/run.py print_config=true +task='optimize'
docker run
-v /var/data/cicd/CICD_github_assets/M8.4/inputs/:/inputs/
-v /var/data/cicd/CICD_github_assets/M8.4/outputs/:/outputs/ lidar_prod_im
python lidar_prod/run.py
print_config=true
paths.src_las=/inputs/730000_6360000.subset.prototype_format202.las
paths.output_dir=/outputs/
- name: Evaluate decisions using optimization task (debug mode, on a single, corrected LAS)
run: >
docker run
-v /var/data/cicd/CICD_github_assets/M8.4/inputs/evaluation/:/inputs/
-v /var/data/cicd/CICD_github_assets/M8.4/outputs/evaluation/:/outputs/ lidar_prod_im
python lidar_prod/run.py
print_config=true
+task='optimize'
+building_validation.optimization.debug=true
building_validation.optimization.todo='prepare+evaluate+update'
building_validation.optimization.paths.input_las_dir=/CICD_github_assets/M8.0/20220204_building_val_V0.0_model/20211001_buiding_val_val/
building_validation.optimization.paths.results_output_dir=/CICD_github_assets/opti/
building_validation.optimization.paths.building_validation_thresholds_pickle=/CICD_github_assets/M8.3B2V0.0/optimized_thresholds.pickle
building_validation.optimization.paths.input_las_dir=/inputs/
building_validation.optimization.paths.results_output_dir=/outputs/
building_validation.optimization.paths.building_validation_thresholds_pickle=/inputs/optimized_thresholds.pickle
- name: clean the server for further uses
if: always() # always do it, even if something failed
19 changes: 8 additions & 11 deletions dockerfile → Dockerfile
@@ -14,7 +14,6 @@ RUN apt-get update && apt-get upgrade -y && apt-get install -y \
wget \
git \
postgis \
pdal \
libgl1-mesa-glx libegl1-mesa libxrandr2 libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6 # packages needed for anaconda

# install anaconda
@@ -38,17 +37,15 @@ SHELL ["conda", "run", "-n", "lidar_prod", "/bin/bash", "-c"]
RUN echo "Make sure pdal is installed:"
RUN python -c "import pdal"

# the entrypoint guarantees that all commands will be run in the conda environment
ENTRYPOINT ["conda", \
"run", \
"-n", \
# the entrypoint guarantees that all commands will be run in the conda environment
ENTRYPOINT ["conda", \
"run", \
"-n", \
"lidar_prod"]

# cmd for a normal run (non-evaluation)
CMD ["python", \
"lidar_prod/run.py", \
CMD ["python", \
"lidar_prod/run.py", \
"print_config=true", \
"paths.src_las=/CICD_github_assets/M8.0/20220204_building_val_V0.0_model/subsets/871000_6617000_subset_with_probas.las", \
"paths.output_dir=/CICD_github_assets/app/", \
"data_format.codes.building.candidates=[202]", \
"building_validation.application.building_validation_thresholds_pickle=/CICD_github_assets/M8.3B2V0.0/optimized_thresholds.pickle"]
"paths.src_las=your_las.las", \
"paths.output_dir=./path/to/outputs/"]
31 changes: 19 additions & 12 deletions README.md
@@ -35,23 +35,29 @@ Goal: Confirm or refute groups of candidate building points when possible, mark

1) Clustering of _candidate buildings points_ into connected components.
2) Point-level decision
1) Decision at the point-level based on probabilities: `confirmed` if p>=`C1` / `refuted` if (1-p)>=`R1`
2) Identification of points that are `overlayed` by a building vector from the database.
1) Identification of points with ambiguous probability: `high entropy` if entropy $\geq$ `E1`
2) Identification of points that are `overlayed` by a building vector from the database.
3) Decision at the point-level based on probabilities:
1) `confirmed` if:
1) p$\geq$`C1`, or
2) `overlayed` and p$\geq$ (`C1` * `Cr`), where `Cr` is a relaxation factor that lowers the confidence required to confirm a point when it is overlayed by a building vector.
2) `refuted` if (1-p)$\geq$`R1`
3) Group-level decision:
1) Confirmation: if proportion of `confirmed` points >= `C2` OR if proportion of `overlayed` points >= `O1`
2) Refutation: if proportion of `refuted` points >= `R2` AND proportion of `overlayed` points < `O1`
3) Uncertainty: otherwise.
1) Uncertain due to high entropy: if proportion of `high entropy` points $\geq$ `E2`
2) Confirmation: if proportion of `confirmed` points $\geq$ `C2` OR if proportion of `overlayed` points $\geq$ `O1`
3) Refutation: if proportion of `refuted` points $\geq$ `R2` AND proportion of `overlayed` points < `O1`
4) Uncertainty: otherwise (this is a safeguard: uncertain groups should already have been captured via their entropy)
4) Update of the point cloud classification

Decision thresholds `C1`, `C2`, `R1`, `R2`, `O1` are chosen via a multi-objective hyperparameter optimization that aims to maximize automation, precision, and recall of the decisions. Right now we have automation=90%, precision=98%, recall=98% on a validation dataset. Illustration comes from older version.
Decision thresholds `E1`, `E2`, `C1`, `C2`, `R1`, `R2`, `O1` are chosen via a multi-objective hyperparameter optimization that aims to maximize automation, precision, and recall of the decisions. We currently reach automation=91%, precision=98.5%, recall=98.1% on a validation dataset. The illustration below comes from an older version.

![](assets/img/LidarBati-BuildingValidationM7.1V2.0.png)
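For readers who prefer code, here is a minimal sketch of the decision rules above, assuming NumPy arrays and using the optimized thresholds from `configs/building_validation/application/default.yaml` (rounded); the function and argument names are illustrative, not the module's actual API:

```python
import numpy as np

def group_decision(p, entropy, overlayed,
                   E1=0.88, E2=0.73,           # min_entropy_uncertainty, min_frac_entropy_uncertain
                   C1=0.64, Cr=0.59, C2=0.78,  # min_confidence_confirmation, relaxation factor, min_frac_confirmation
                   R1=0.75, R2=0.80,           # min_confidence_refutation, min_frac_refutation
                   O1=0.50):                   # min_uni_db_overlay_frac
    """Decide the fate of one cluster of candidate building points."""
    p = np.asarray(p, dtype=float)
    high_entropy = np.asarray(entropy, dtype=float) >= E1
    overlayed = np.asarray(overlayed, dtype=bool)
    # Point-level decisions; overlayed points get a relaxed confirmation bar.
    confirmed = (p >= C1) | (overlayed & (p >= C1 * Cr))
    refuted = (1.0 - p) >= R1
    # Group-level decisions, in order of precedence.
    if high_entropy.mean() >= E2:
        return "unsure (high entropy)"
    if confirmed.mean() >= C2 or overlayed.mean() >= O1:
        return "confirmed"
    if refuted.mean() >= R2 and overlayed.mean() < O1:
        return "refuted"
    return "unsure"  # safeguard for groups not caught by the entropy rule
```

Here `entropy` is the per-point prediction entropy output by the deep learning model alongside the building probability; for class probabilities $p_c$ it is typically $-\sum_c p_c \log p_c$, highest when the model hesitates between classes.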

#### B) Building Completion

Goal: Confirm points that were too isolated to make up a group but that nevertheless have a high enough probability (e.g. walls).

Identify _candidate buildings points_ that have not been clustered in the previous step AND have high enough probability (p>=0.5).
Among _candidate buildings points_ that were not clustered in the previous step, identify those which nevertheless meet the requirements to be `confirmed`.
Cluster them together with previously confirmed building points in a relaxed, vertical fashion (higher tolerance, XY plane).
For each cluster, if some points were confirmed, the others are considered to belong to the same building, and are
therefore confirmed as well.
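A minimal sketch of that propagation rule, assuming the relaxed clustering has already produced a `ClusterID`-style index (0 meaning unclustered, as in pdal's `filters.cluster`); the function name is illustrative:

```python
import numpy as np

def complete_buildings(cluster_id, confirmed):
    """Propagate confirmation within clusters of isolated + confirmed points.

    cluster_id: per-point cluster index (0 = not clustered).
    confirmed:  per-point boolean, True if already confirmed.
    Returns an updated mask: any point sharing a cluster with at least one
    confirmed point becomes confirmed too.
    """
    cluster_id = np.asarray(cluster_id)
    out = np.asarray(confirmed, dtype=bool).copy()
    for cid in np.unique(cluster_id):
        if cid == 0:
            continue  # 0 means "no cluster" in pdal's ClusterID convention
        members = cluster_id == cid
        if out[members].any():
            out[members] = True
    return out
```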
@@ -63,7 +69,9 @@ therefore confirmed as well.

Goal: Highlight potential buildings that were missed by the rule-based algorithm, for human inspection.

Clustering of points that have a probability of beind a building p>=`C1` AND are **not** _candidate buildings points_. This clustering defines a LAS extra dimensions (default name `Group`).
Among points that were **not** _candidate buildings points_, identify those which meet the requirements to be `confirmed`, and cluster them.

This clustering defines a LAS extra dimension (`Group`) which indexes the newly found clusters that may be missed buildings.

![](assets/img/LidarBati-BuildingIdentification.png)
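As an illustration only, such a clustering could be expressed as a python-pdal pipeline along these lines — the file names, `tolerance` and `min_points` values are assumptions, and the actual `BuildingIdentifier` keeps all points rather than filtering the rest away:

```python
import pdal

C1 = 0.64  # min_confidence_confirmation from the application config
pipeline = pdal.Reader(type="readers.las", filename="prepared.las")
# Keep points that meet the confirmation bar and are NOT rule-based candidates.
pipeline |= pdal.Filter(type="filters.range", limits=f"building[{C1}:1]")
pipeline |= pdal.Filter(type="filters.range", limits="F_CandidateB[0:0]")
# Cluster what remains; filters.cluster writes a ClusterID dimension.
pipeline |= pdal.Filter(type="filters.cluster", min_points=10, tolerance=0.5)
# Expose the result under the configured output dimension name.
pipeline |= pdal.Filter(type="filters.ferry", dimensions="ClusterID => Group")
pipeline |= pdal.Writer(
    type="writers.las", filename="identified.las", forward="all",
    extra_dims="all", minor_version=4, dataformat_id=8,
)
pipeline.execute()
```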

@@ -100,7 +108,7 @@ To run the module from anywhere, you can install it as a package in your virtual
conda activate lidar_prod

# install the package
pip install --upgrade https://github.com/IGNF/lidar-prod-quality-control/tarball/main # from github directly
pip install --upgrade https://github.com/IGNF/lidar-prod-quality-control/tarball/prod # from github directly, using production branch
pip install -e . # from local sources
```

@@ -153,13 +161,12 @@ conda activate lidar_prod
python lidar_prod/run.py +task=optimize building_validation.optimization.todo='prepare+evaluate+update' building_validation.optimization.paths.input_las_dir=[path/to/labelled/test/dataset/] building_validation.optimization.paths.results_output_dir=[path/to/save/results] building_validation.optimization.paths.building_validation_thresholds_pickle=[path/to/optimized_thresholds.pickle]
```

### CICD, Releases and versions
### CICD and versions

New features are staged in the `dev` branch, and the CICD workflow runs when a pull request to merge is created.
In Actions, check the output of a full evaluation on a single LAS to spot potential regressions. The app is also run
on a subset of a LAS, which can be visually inspected before merging - there can always be surprises.

Package version follows semantic versioning conventions and is defined in `setup.py`.

Releases are generated when new high-level functionalities are implemented (e.g. a new step in the production process) or
when key parameters are changed. Generally speaking, the latest release `Vx.y.z` is the one to use in production.
Releases are generated when new high-level functionalities are implemented (e.g. a new step in the production process), and serve a documentation role. Production-ready code is fast-forwarded into the `prod` branch when needed.
3 changes: 2 additions & 1 deletion bash/setup_environment/requirements.yml
@@ -11,7 +11,8 @@ dependencies:
- isort # import sorting
- flake8 # code analysis
# --------- geo --------- #
- conda-forge:python-pdal
- conda-forge:pdal==2.3.*
- conda-forge:python-pdal==3.0.*
- conda-forge:laspy==2.1.*
- numpy
- scikit-learn
14 changes: 8 additions & 6 deletions configs/building_validation/application/default.yaml
@@ -22,9 +22,11 @@ bd_uni_request:

# TODO: update min_frac_confirmation_factor_if_bd_uni_overlay and others after optimization...
thresholds:
min_confidence_confirmation: 0.697
min_frac_confirmation: 0.384
min_frac_confirmation_factor_if_bd_uni_overlay: 0.808
min_uni_db_overlay_frac: 0.508
min_confidence_refutation: 0.973
min_frac_refutation: 0.285
min_confidence_confirmation: 0.6400365762003571 # min proba to confirm a point
min_frac_confirmation: 0.779844069887882 # min fraction of confirmed points per group for confirmation
min_frac_confirmation_factor_if_bd_uni_overlay: 0.5894477997785892 # relaxation factor on min proba when a point is under a BDUni vector
min_uni_db_overlay_frac: 0.5041941489707767 # min fraction of points under a BDUni vector per group for confirmation
min_confidence_refutation: 0.7477148092712739 # min proba to refute a point
min_frac_refutation: 0.7979734453001499 # min fraction of refuted points per group for refutation
min_entropy_uncertainty: 0.884546947499147 # min entropy to flag a point as uncertain
min_frac_entropy_uncertain: 0.7271206406484895 # min fraction of uncertain points (based on entropy) per group to flag the group as uncertain
6 changes: 3 additions & 3 deletions configs/building_validation/optimization/default.yaml
@@ -33,10 +33,10 @@ study:
directions: ["maximize","maximize","maximize"]
sampler:
_target_: optuna.samplers.NSGAIISampler
population_size: 30
population_size: 50
mutation_prob: 0.25
crossover_prob: 0.8
swapping_prob: 0.5
crossover_prob: 0.1
swapping_prob: 0.1
seed: 12345
constraints_func:
_target_: functools.partial
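For context, here is roughly what this sampler configuration resolves to once instantiated by Hydra; the objective below is a placeholder, not the module's real `prepare+evaluate` loop, and the `constraints_func` entry (truncated above) is omitted:

```python
import optuna

sampler = optuna.samplers.NSGAIISampler(
    population_size=50,
    mutation_prob=0.25,
    crossover_prob=0.1,
    swapping_prob=0.1,
    seed=12345,
)
study = optuna.create_study(
    directions=["maximize", "maximize", "maximize"],  # automation, precision, recall
    sampler=sampler,
)

def objective(trial):
    # Suggest the decision thresholds, run the evaluation on the corrected
    # LAS dataset, and return the three objectives. Placeholder values here.
    C1 = trial.suggest_float("min_confidence_confirmation", 0.0, 1.0)
    return 0.9, 0.98, 0.98

study.optimize(objective, n_trials=100)
```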
7 changes: 0 additions & 7 deletions configs/data_format/cleaning/default.yaml

This file was deleted.

35 changes: 25 additions & 10 deletions configs/data_format/default.yaml
@@ -5,35 +5,50 @@ crs: 2154
# Those names connect the logics between successive tasks
las_dimensions:
# input
classification: classification #las format
classification: classification # las format

# Extra dims
# ATTENTION: If extra dimensions are added, you may want to add them to the cleaning.input parameter as well.
ai_building_proba: building # user-defined - output by deep learning model
entropy: entropy # user-defined - output by deep learning model

# intermediary channels
# Intermediary channels
cluster_id: ClusterID # pdal-defined -> created by clustering operations
uni_db_overlay: BDTopoOverlay # user-defined -> a 0/1 flag for presence of a BDUni vector
candidate_buildings_flag: F_CandidateB # -> a 0/1 flag identifying candidate buildings found by rules-based classification
ClusterID_candidate_building: CID_CandidateB # -> Cluster index from BuildingValidator, 0 if no cluster, 1-n otherwise
ClusterID_isolated_plus_confirmed: CID_IsolatedOrConfirmed # -> Cluster index from BuildingCompletor, 0 if no cluster, 1-n otherwise


# additionnal output channel
# Additional output channel
ai_building_identified: Group

cleaning:
input:
_target_: lidar_prod.tasks.cleaning.Cleaner
extra_dims:
- "${data_format.las_dimensions.ai_building_proba}=float"
- "${data_format.las_dimensions.entropy}=float"
output:
# Extra dims that are kept when cleaning dimensions
# You can override with "all" to keep all extra dimensions at development time.
_target_: lidar_prod.tasks.cleaning.Cleaner
extra_dims:
- "${data_format.las_dimensions.ai_building_identified}=uint"
- "${data_format.las_dimensions.ai_building_proba}=float"

codes:
building:
candidates: [202] # found by rules-based classification (TerraScan)
detailed: # used for detailed output when doing threshold optimization
unsure_by_entropy: 200 # unsure (based on entropy)
unclustered: 202 # refuted
ia_refuted: 110 # refuted
ia_refuted_and_db_overlayed: 111 # unsure
both_unsure: 112 # unsure
ia_refuted_but_under_db_uni: 111 # unsure
both_unsure: 112 # unsure (otherwise)
ia_confirmed_only: 113 # confirmed
db_overlayed_only: 114 # confirmed
both_confirmed: 115 # confirmed
final: # used at the end of the building process
unsure: 214 # unsure
not_building: 208 # refuted
building: 6 # confirmed

defaults:
- cleaning: default.yaml
building: 6 # confirmed
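For intuition, a minimal sketch of what a `Cleaner` matching these `cleaning` entries could look like — an assumption about the implementation, not the actual source of `lidar_prod.tasks.cleaning.Cleaner`:

```python
import pdal

class Cleaner:
    """Keep only the listed extra dimensions when copying a LAS file."""

    def __init__(self, extra_dims):
        # e.g. ["building=float", "entropy=float"], or "all" at development time
        self.extra_dims = extra_dims

    def run(self, in_f: str, out_f: str):
        keep = (self.extra_dims if isinstance(self.extra_dims, str)
                else ",".join(self.extra_dims))
        pipeline = pdal.Reader(type="readers.las", filename=in_f)
        pipeline |= pdal.Writer(
            type="writers.las", filename=out_f, forward="all",
            extra_dims=keep, minor_version=4, dataformat_id=8,
        )
        pipeline.execute()
```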
42 changes: 0 additions & 42 deletions flake8_output.txt

This file was deleted.

20 changes: 14 additions & 6 deletions lidar_prod/application.py
@@ -29,23 +29,31 @@ def apply(config: DictConfig):
"""
assert os.path.exists(config.paths.src_las)
in_f = config.paths.src_las
out_f = osp.join(config.paths.output_dir, osp.basename(in_f))
IN_F = config.paths.src_las
OUT_F = osp.join(config.paths.output_dir, osp.basename(IN_F))

with TemporaryDirectory() as td:
# Temporary LAS file for intermediary results.
temp_f = osp.join(td, osp.basename(in_f))
temp_f = osp.join(td, osp.basename(IN_F))

# Removes unnecessary input dimensions to reduce memory usage
cl: Cleaner = hydra.utils.instantiate(config.data_format.cleaning.input)
cl.run(IN_F, temp_f)

# Validate buildings (unsure/confirmed/refuted) on a per-group basis.
bv: BuildingValidator = hydra.utils.instantiate(
config.building_validation.application
)
bv.run(in_f, temp_f)
bv.run(temp_f, temp_f)

# Complete buildings with non-candidates that were nevertheless confirmed
bc: BuildingCompletor = hydra.utils.instantiate(config.building_completion)
bc.run(temp_f, temp_f)

# Define groups of confirmed building points among non-candidates
bi: BuildingIdentifier = hydra.utils.instantiate(config.building_identification)
bi.run(temp_f, temp_f)

cl: Cleaner = hydra.utils.instantiate(config.data_format.cleaning)
cl.run(temp_f, out_f)
# Remove unnecessary intermediary dimensions
cl: Cleaner = hydra.utils.instantiate(config.data_format.cleaning.output)
cl.run(temp_f, OUT_F)
7 changes: 6 additions & 1 deletion lidar_prod/tasks/building_completion.py
@@ -119,7 +119,12 @@ def prepare(self, in_f: str, out_f: str):
value=f"{self.data_format.las_dimensions.cluster_id} = 0"
)
pipeline |= pdal.Writer(
type="writers.las", filename=out_f, forward="all", extra_dims="all"
type="writers.las",
filename=out_f,
forward="all",
extra_dims="all",
minor_version=4,
dataformat_id=8,
)
os.makedirs(osp.dirname(out_f), exist_ok=True)
pipeline.execute()
7 changes: 6 additions & 1 deletion lidar_prod/tasks/building_identification.py
@@ -86,7 +86,12 @@ def prepare(self, in_f: str, out_f: str) -> None:
)

pipeline |= pdal.Writer(
type="writers.las", filename=out_f, forward="all", extra_dims="all"
type="writers.las",
filename=out_f,
forward="all",
extra_dims="all",
minor_version=4,
dataformat_id=8,
)
os.makedirs(osp.dirname(out_f), exist_ok=True)
pipeline.execute()
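The new `minor_version=4` / `dataformat_id=8` writer options (here and in `building_completion.py`) matter for the classification codes above: LAS point formats 0-5 store Classification on 5 bits (values up to 31), while LAS 1.4 formats 6-10 use a full byte. A quick way to check this, with assumed file names:

```python
import numpy as np
import pdal

# One synthetic point carrying a detailed code (214 = final "unsure").
arr = np.zeros(1, dtype=[("X", np.float64), ("Y", np.float64),
                         ("Z", np.float64), ("Classification", np.uint8)])
arr["Classification"] = 214
# Without minor_version=4 and dataformat_id >= 6, a value above 31 cannot be
# represented in the LAS Classification field.
writer = pdal.Writer(type="writers.las", filename="check.las",
                     minor_version=4, dataformat_id=8)
writer.pipeline(arr).execute()
```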