WIP: Implementing a metadatavalidator for <meta> tags #731

Open
wants to merge 107 commits into
base: main
107 commits
7533bd0
First skeleton of metadatavalidator
tomschr May 17, 2024
68bbb70
Remove comments
tomschr May 17, 2024
82b73c3
Define the config directories to read config
tomschr May 17, 2024
e3d244a
Introduce config parsing and exception handling
tomschr May 17, 2024
11220a1
Snapshot
tomschr May 22, 2024
9499b9a
pyproject.toml: Add tool.pytest.ini_option section
tomschr May 22, 2024
0d06751
Move script into bin/ folder
tomschr May 22, 2024
9aadc40
Correct pyproject.toml using bin directory
tomschr May 22, 2024
856612d
Switch to a src layout
tomschr May 22, 2024
d69646b
Add simple test cases
tomschr May 22, 2024
1185655
Define custom exceptions
tomschr May 22, 2024
2a6e0b3
Add process skeleton
tomschr May 22, 2024
4bd2a12
Add dependencies and optional-dependencies
tomschr May 22, 2024
6fd948d
Start with a two small check functions
tomschr May 22, 2024
24f4d68
Validate and convert ConfigParser obj into dict
tomschr May 23, 2024
c92a579
Find checks_* functions semi-automatically
tomschr May 23, 2024
ca644bc
Introduce exception classes for config errors
tomschr May 23, 2024
55ae26d
pyproject.toml: Use correct coverage module
tomschr May 24, 2024
c969c8a
Introduce valid_languages
tomschr May 24, 2024
28a7394
Define __str__ for exception classes
tomschr May 24, 2024
0b118d3
Catch BaseConfigError in cli.py
tomschr May 24, 2024
32b9448
Collect the result of each XML file
tomschr May 24, 2024
2f609e0
Improve error output
tomschr May 24, 2024
aef1db2
Add test cases and check for info element
tomschr May 24, 2024
5543e6d
Add test_missing_config_files()
tomschr May 24, 2024
bd12e7e
Add basic test for CLI/parsecli()
tomschr May 24, 2024
a439fd6
Avoid reading config files two times
tomschr May 24, 2024
9696093
Test more info failures, add revhistory check
tomschr May 24, 2024
4eb8fa6
Use log.debug for each check, use base XML name
tomschr May 24, 2024
7a06c73
Add checks and tests for revhistory/@xml:id
tomschr May 24, 2024
5a45b3b
Add comment about order in checks.__all__
tomschr May 27, 2024
1011939
Remove check_info_revhistory_xmlid()
tomschr May 27, 2024
be1817c
Introduce metadata section
tomschr May 28, 2024
a452721
Introduce util.py and getfullxpath()
tomschr May 29, 2024
e7aa370
Improve exception with __str__
tomschr May 29, 2024
11d4cac
Check and test revhistory/revision/date
tomschr May 29, 2024
369b42b
Add parse_date
tomschr May 29, 2024
34a2103
Move basic_xmlcontent and improve tests
tomschr May 29, 2024
74ba8ca
Check for correct order of revision/date
tomschr May 29, 2024
3ebfb22
Use xmlparser fixture
tomschr May 29, 2024
aee1f64
Catch MissingAttributeWarning
tomschr May 29, 2024
af88a31
Fix a bug in validatedatevalue
tomschr May 29, 2024
5bdf684
Add test_check_info_revhistory_revision_order_one_invalid_date
tomschr May 29, 2024
2fa1fdd
Add first & second date in comparing dates
tomschr Jun 3, 2024
35d34ee
Check <meta name="title">
tomschr Jun 3, 2024
bf5cc62
Check <meta name="description">
tomschr Jun 3, 2024
34a0455
Remove obsolete structures to raise coverage
tomschr Jun 3, 2024
db22009
Support --config option for a config file
tomschr Jun 3, 2024
782b0c6
Group existing tests under tests/unit
tomschr Jun 3, 2024
f007573
Add first integration test
tomschr Jun 3, 2024
af46b9a
Don't check date when there is no revhistory
tomschr Jun 3, 2024
a43d1e5
Allow -C for --config option
tomschr Jun 3, 2024
2dcb70f
Use strategy pattern to format output as JSON or text
tomschr Jun 3, 2024
6ce94f1
Use JSON output for first integration test
tomschr Jun 3, 2024
83c4d71
Replace namespace URL with prefix, use d:info/...
tomschr Jun 3, 2024
94d970c
Function: format_results -> format_results_text
tomschr Jun 3, 2024
78874fa
Test wrong --format argument
tomschr Jun 3, 2024
f411721
Check <meta name="series">
tomschr Jun 3, 2024
c2d76ac
Check <meta name="techpartner">
tomschr Jun 3, 2024
cc5d92e
tests/integration/{case1 => goodcase1}
tomschr Jun 3, 2024
c03abca
Amend README with Installation section
tomschr Jun 3, 2024
662098b
Correct etree._ElementTree (with the underscore)
tomschr Jun 4, 2024
6ad5372
Check <meta name="platform">
tomschr Jun 4, 2024
5a28541
Check <meta name="platform">
tomschr Jun 4, 2024
05be611
Add line breaks in function argument list
tomschr Jun 4, 2024
a66cffe
Correct a typo in require_meta_series
tomschr Jun 4, 2024
1d4a61e
Check <meta name="architecture">
tomschr Jun 4, 2024
1f4839e
Add comments in config INI file
tomschr Jun 5, 2024
3afed36
Fix config options in test_check_meta_architecture
tomschr Jun 5, 2024
78f3d3a
Check <meta name="category">
tomschr Jun 13, 2024
a902366
Refactor validate_and_convert_config
tomschr Jun 21, 2024
1df2894
Add additional test cases
tomschr Jun 21, 2024
b2b3a5f
Check <meta name="task">
tomschr Jun 27, 2024
54787cc
Simplify tests
tomschr Jun 27, 2024
d533ca3
Add missing output of script in README
tomschr Jun 27, 2024
f92de25
Add return type
tomschr Jun 27, 2024
4ababb3
Introduce getinfo() and info_or_fail()
tomschr Jun 27, 2024
5f42c1a
Use tree fixture and add subelement of <info>
tomschr Jun 27, 2024
86e56a4
Use info_or_fail() and NAMESPACES dict
tomschr Jun 27, 2024
b1a48b8
Use ElementMaker and refactor tests
tomschr Jun 29, 2024
a2f9df0
Correct geinfo() to get merge from <assembly>
tomschr Jun 29, 2024
364b3b7
Remove <info/> from assemblystr fixture
tomschr Jun 29, 2024
dbdea35
Add source line in error message
tomschr Jun 29, 2024
4dc4d9a
Refactor process_xml_file()
tomschr Jun 29, 2024
393f474
Move red() & green() function to util.py
tomschr Jun 29, 2024
4046e0a
Rename test_badcase1.py -> test_badcase.py
tomschr Jun 29, 2024
488149a
Rename test_integration.py -> test_goodcase.py
tomschr Jun 29, 2024
f9b6f2a
Add another goodcase: entity in doctype
tomschr Jun 29, 2024
dc7936f
goodcase1 -> goodcases
tomschr Jun 29, 2024
a7dbd95
badcase1 -> badcases
tomschr Jun 29, 2024
6b498de
Replace basic_xmlcontent with tree fixture
tomschr Jun 29, 2024
33bcf35
Rename cfg name revhistory -> require_revhistory
tomschr Jun 29, 2024
90c4e85
Split big test_check_meta.py into other files
tomschr Jun 29, 2024
6d8153e
Split test_check_info.py
tomschr Jun 29, 2024
265cd44
Add missing tests for <meta name="task">
tomschr Jun 29, 2024
4593b35
Correct and improve test/checks for revhistory
tomschr Jun 30, 2024
473f218
Update the README
tomschr Jul 1, 2024
bb14462
Require some tags
tomschr Jul 1, 2024
39539dc
Add .code-workspace for VSCode
tomschr Jul 1, 2024
d6cbbe3
Use plural forms of valid_meta_*
tomschr Jul 1, 2024
56c0cbe
Improve help output and use module docstring
tomschr Jul 1, 2024
b626c3c
Bump version to 0.3.0
tomschr Jul 1, 2024
93445ff
Improve CLI epilog
tomschr Jul 16, 2024
7f6707b
Fix check_meta_task()
tomschr Jul 16, 2024
38c481f
Fix empty text in meta[@name='title']
tomschr Jul 16, 2024
59208f4
Add statement about current directory from to install
tomschr Jul 25, 2024
6f06cd3
Use plural form in INI file
tomschr Sep 10, 2024
177 changes: 177 additions & 0 deletions python-scripts/metadatavalidator/.gitignore
@@ -0,0 +1,177 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env*
.venv*
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets

# Local History for Visual Studio Code
.history/

# Built Visual Studio Code Extensions
*.vsix


#### Specific to this project
201 changes: 201 additions & 0 deletions python-scripts/metadatavalidator/README.rst
@@ -0,0 +1,201 @@
Metadata validator for DocBook
==============================

The script in this directory checks several metadata definitions for DocBook.
Metadata can be found in the ``<meta>`` and ``<info>`` tags.


Requirements
------------

* lxml (the more recent, the better)
* Python >=3.11 (required only for installing with :file:`pyproject.toml`)


Preparing the environment
-------------------------

It's recommended to create a Python virtual environment before you proceed.
A virtual environment is a self-contained Python environment that keeps the
script's dependencies separate from the system Python installation.

To create a virtual environment, execute the following steps:

1. Create a virtual environment with Python 3.11:

.. code-block:: bash

$ python3.11 -m venv .venv311

2. Activate the virtual environment:

.. code-block:: bash

$ source .venv311/bin/activate

Your prompt changes to show the active virtual environment:

.. code-block:: bash

(.venv311) $

3. Upgrade ``pip`` and ``setuptools`` to their latest versions:

.. code-block:: bash

(.venv311) $ pip install --upgrade pip setuptools

This makes your virtual environment ready for the next steps.

If you don't need the virtual environment anymore, you can deactivate it:

.. code-block:: bash

(.venv311) $ deactivate



Installing the script
---------------------

Before you install the script, change to the directory
``python-scripts/metadatavalidator/``.
To install the script, run the following command:

.. code-block:: bash

$ pip install .


For development, install the script in "editable" mode:

.. code-block:: bash

$ pip install -e .[test]


Setting the configuration
-------------------------

Before you call the script, check the values in the configuration file.
The configuration file is an INI file. The script searches for it in the following order (highest priority first):

* Command line with :option:`--config`. This doesn't search for other configuration files.
* Environment variable :envvar:`METAVALIDATOR_CONFIG`.
* In the current directory: :file:`metadatavalidator.ini`
* In the user's home directory: :file:`~/.config/metadatavalidator/config.ini`
* In the system: :file:`/etc/metadatavalidator/config.ini`

Boolean values are case-insensitive. The values ``true``, ``yes``, ``on``, and ``1`` count as true.
Everything else, including ``off`` and ``0``, is considered ``false``.
List values are separated by commas.
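
The value rules above can be sketched in Python as follows. This is a minimal
sketch, not the validator's actual code: the helper names ``as_bool`` and
``as_list`` are hypothetical, and it assumes that ``true``, ``yes``, ``on``,
and ``1`` are the only values treated as true.

.. code-block:: python

    # Hypothetical helpers for illustration only; not part of the script.
    TRUE_VALUES = {"true", "yes", "on", "1"}


    def as_bool(value: str) -> bool:
        """Return True for a known "true" spelling, False for anything else."""
        return value.strip().lower() in TRUE_VALUES


    def as_list(value: str) -> list[str]:
        """Split a comma-separated value and drop empty entries."""
        return [item.strip() for item in value.split(",") if item.strip()]

With these helpers, ``as_bool("Yes")`` is true, ``as_bool("maybe")`` is false,
and ``as_list("a, b ,c")`` yields three items.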

All config files are merged together. If a key is defined in multiple files,
the last one wins. This way you can have a global configuration in the
system directory and a local one in the current directory.
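
The merge behaviour can be illustrated with Python's ``configparser`` module.
This is a sketch under assumptions, not the script's implementation: the
function name ``load_config`` is hypothetical, and it assumes the files are
read from lowest to highest priority so that the highest-priority file is
read last and wins.

.. code-block:: python

    import configparser
    from pathlib import Path


    def load_config(paths: list[Path]) -> configparser.ConfigParser:
        """Merge INI files; ``paths`` is ordered from highest to lowest priority."""
        parser = configparser.ConfigParser()
        # read() skips files that do not exist. A key read later overrides
        # the same key read earlier, so the lowest priority goes first.
        parser.read(list(reversed(paths)), encoding="utf-8")
        return parser


    # The search locations named in this README, highest priority first.
    SEARCH_PATHS = [
        Path("metadatavalidator.ini"),
        Path.home() / ".config" / "metadatavalidator" / "config.ini",
        Path("/etc/metadatavalidator/config.ini"),
    ]

Calling ``load_config(SEARCH_PATHS)`` returns a parser in which a key from the
current directory overrides the same key from the home or system directory.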


Calling the script
------------------

Call the script with the following command:

.. code-block:: bash

$ metadatavalidator PATH_TO_DOCBOOK_FILES

The script will show all problems with metadata:

.. code-block::

$ metadatavalidator a.xml b.xml
==== RESULTS ====
[1] a.xml:
1.1: check_info_revhistory_revision: Missing recommended attribute in /d:article/d:info[2]/d:revhistory[12]/d:revision/@xml:id

[2] b.xml:
2.1: check_meta_task: Invalid value in metadata Unknown task(s) {'Clusering'}. Allowed are ...

The output shows:

* The filename.
* The name of the check that failed.
* A description of the problem.
* In some cases a line number.


To use your own configuration file, pass it with the option :option:`--config`:

.. code-block:: bash

$ metadatavalidator --config /path/to/config.ini PATH_TO_DOCBOOK_FILES

For machine-readable output of the result, use the option :option:`--format`:

.. code-block:: bash

$ metadatavalidator --format json PATH_TO_DOCBOOK_FILES
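
The JSON output can be consumed from another Python program. The following is
a minimal sketch: the function name ``run_validator_json`` is hypothetical,
and it assumes that the script writes the JSON document to standard output.
The structure of the JSON is not described here, so the sketch returns the
parsed data unchanged.

.. code-block:: python

    import json
    import subprocess


    def run_validator_json(command: list[str]):
        """Run a command and parse its standard output as JSON."""
        # check=False: the exit code for files with problems is not
        # documented here, so a non-zero exit code is not treated as an error.
        completed = subprocess.run(
            command, capture_output=True, text=True, check=False
        )
        return json.loads(completed.stdout)

For example, ``run_validator_json(["metadatavalidator", "--format", "json", "a.xml"])``
returns the parsed result for :file:`a.xml`.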


Configuration
-------------

The configuration file is searched in the following order (highest priority first):

1. Command line with :option:`--config`. This doesn't search for other configuration files.

2. Environment variable :envvar:`METAVALIDATOR_CONFIG`.

3. In the current directory: :file:`metadatavalidator.ini`

4. In the user's home directory: :file:`~/.config/metadatavalidator/config.ini`

5. In the system: :file:`/etc/metadatavalidator/config.ini`


Configuration values
--------------------

The following values are recognized:

* :var:`validator`: Global options to configure the validator.
* :var:`file_extension`: The file extension to search for. Default is
``.xml``.

* :var:`check_root_elements`: List of allowed root elements (space-separated local DocBook names). Default is ``assembly article book topic``.

* :var:`valid_languages`: List of valid languages (space-separated codes in the form ``language-country``, for example ``en-us``). Default is ``ar-ar cs-cz de-de en-us es-es fr-fr hu-hu it-it ja-jp ko-kr nl-nl pl-pl pt-br ru-ru sv-se zh-cn zh-tw``.

* :var:`metadata`: Options to change behaviour of specific ``<meta>`` tags.
* :var:`require_revhistory`: Requires a ``<revhistory>`` tag or not.

* :var:`require_xmlid_on_revision`: Requires an ``xml:id`` attribute on each ``<revision>`` tag or not.

* :var:`require_meta_title`: Requires a ``<meta name="title">`` tag or not.

* :var:`meta_title_length`: Checks the length of the text content in ``<meta name="title">``. Default is 55.

* :var:`require_meta_description`: Requires a ``<meta name="description">`` tag or not.

* :var:`meta_description_length`: Checks the length of the text content in ``<meta name="description">``. Default is 155.

* :var:`require_meta_series`: Requires a ``<meta name="series">`` tag or not.

* :var:`valid_meta_series`: Lists the valid series names for ``<meta name="series">``.

* :var:`require_meta_techpartner`: Requires a ``<meta name="techpartner">`` tag or not.

* :var:`require_meta_platform`: Requires a ``<meta name="platform">`` tag or not.

* :var:`require_meta_architecture`: Requires a ``<meta name="architecture">`` tag or not.

* :var:`valid_meta_architectures`: Lists the valid architecture names for ``<meta name="architecture">/<phrase>``.

* :var:`require_meta_category`: Requires a ``<meta name="category">`` tag or not.

* :var:`valid_meta_categories`: Lists the valid category names for ``<meta name="category">/<phrase>``.

* :var:`require_meta_task`: Requires a ``<meta name="task">`` tag or not.

* :var:`valid_meta_tasks`: Lists the valid task names for ``<meta name="task">/<phrase>``.
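
A small configuration can be tried out with Python's ``configparser`` module.
This is an illustrative sketch, not a shipped default file: it assumes that
``validator`` and ``metadata`` are INI section names, it uses the defaults
stated above for the lengths and root elements, and the boolean values are
chosen only for the example.

.. code-block:: python

    import configparser

    # Illustrative configuration; the section names are an assumption.
    SAMPLE = """
    [validator]
    file_extension = .xml
    check_root_elements = assembly article book topic

    [metadata]
    require_meta_title = true
    meta_title_length = 55
    require_meta_description = true
    meta_description_length = 155
    """

    config = configparser.ConfigParser()
    config.read_string(SAMPLE)

    # Numeric and space-separated values need an explicit conversion.
    title_limit = config.getint("metadata", "meta_title_length")
    root_elements = config.get("validator", "check_root_elements").split()

Note that ``configparser.getboolean()`` raises an error for unknown spellings,
which differs from the rule in this README that everything else counts as false.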
@@ -0,0 +1,8 @@
{
"folders": [
{
"path": "."
}
],
"settings": {}
}