-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add ensembl mysql table wrapper #3103
Conversation
…biomart-table wrapper
…cking, do previously created options as fallback), structure error messages better
@tedil As you already had a look, just wanted to let you know about this latest change: I just updated the build number checking code, to actually use the Also, I improved the error messages. Mainly by structuring them a bit more, to make all the reported values in them easier to scan over quickly. |
…kemake/snakemake-wrappers into feat/add-ensembl-mysql-table-wrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only minor stuff, looks good to me (and really useful!)
WalkthroughThe recent updates enhance a bioinformatics framework, improving reproducibility and user experience in data analysis. New configuration files establish Conda environments with necessary package dependencies, while a Snakemake workflow supports generating annotation tables from Ensembl databases. An R script acts as a wrapper for streamlined database interactions, facilitating efficient data handling for researchers. Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (7)
bio/reference/ensembl-mysql-table/environment.yaml (1)
9-9
: Add a newline at the end of the file.The file should end with a newline character to adhere to POSIX standards.
9c9 < - r-rmariadb =1.2 \ No newline at end of file --- > - r-rmariadb =1.2Tools
yamllint
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
bio/reference/ensembl-mysql-table/test/Snakefile (1)
8-8
: Consider restructuringmain_tables
for consistency.Using a consistent structure for
main_tables
andjoin_tables
might enhance readability and maintainability, especially if more sub-keys are added in the future.main_tables={"repeat_feature": {"database": "core"}}bio/reference/ensembl-mysql-table/meta.yaml (2)
9-9
: Remove trailing spaces.Trailing spaces can cause formatting issues and should be removed.
9c9 < `main_tables`, to allow for `left_join'ing their columns onto them. --- > `main_tables`, to allow for `left_join'ing their columns onto them.Tools
yamllint
[error] 9-9: trailing spaces
(trailing-spaces)
50-50
: Add a newline at the end of the file.The file should end with a newline character to adhere to POSIX standards.
50c50 < https://tavareshugo.github.io/data_carpentry_extras/dbplyr_ensembl/dbplyr_ensembl.html \ No newline at end of file --- > https://tavareshugo.github.io/data_carpentry_extras/dbplyr_ensembl/dbplyr_ensembl.htmlTools
yamllint
[error] 50-50: no new line character at the end of file
(new-line-at-end-of-file)
bio/reference/ensembl-mysql-table/wrapper.R (3)
184-199
: Consider implementing table filtering.The TODO comment suggests that filtering could be implemented to reduce data transfer. Consider whether this is necessary and how it could be implemented efficiently.
# Example of potential filtering implementation: # filter_condition <- snakemake@params[["filter_condition"]] # table <- tbl(table_connection, table_name) |> filter(!!rlang::sym(filter_condition)) |> collect()
32-35
: Ensure clarity in error messages.The error messages use
cli_abort
to provide detailed information. Verify that the messages are clear, informative, and guide the user towards resolving the issue.Consider rephrasing error messages for improved clarity and user guidance.
Also applies to: 128-157, 163-180, 238-243, 251-257
19-40
: Add explicit parameter validation.The parameters
species
,release
, andbuild
retrieved from Snakemake should be explicitly validated before use to prevent potential issues with malformed or incorrect input.
- File:
bio/reference/ensembl-mysql-table/wrapper.R
- Lines: 19-40
Analysis chain
Verify parameter validation.
Ensure that the parameters retrieved from Snakemake are validated before use. The logic for setting the database port based on the genome build should be verified for correctness.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the validation of parameters retrieved from Snakemake. # Test: Search for parameter validation logic in the repository. Expect: Proper validation. rg --type r $'snakemake@params'Length of output: 8200
Script:
#!/bin/bash # Description: Check for validation or error handling related to specific parameters in the file. # Test: Search for validation or error handling for 'species', 'release', and 'build' parameters. rg --type r 'species|release|build' bio/reference/ensembl-mysql-table/wrapper.R -A 5 -B 5Length of output: 7605
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- bio/reference/ensembl-mysql-table/environment.linux-64.pin.txt (1 hunks)
- bio/reference/ensembl-mysql-table/environment.yaml (1 hunks)
- bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
- bio/reference/ensembl-mysql-table/test/Snakefile (1 hunks)
- bio/reference/ensembl-mysql-table/wrapper.R (1 hunks)
- test.py (1 hunks)
Files skipped from review due to trivial changes (1)
- bio/reference/ensembl-mysql-table/environment.linux-64.pin.txt
Additional context used
yamllint
bio/reference/ensembl-mysql-table/environment.yaml
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
bio/reference/ensembl-mysql-table/meta.yaml
[error] 9-9: trailing spaces
(trailing-spaces)
[error] 50-50: no new line character at the end of file
(new-line-at-end-of-file)
Additional comments not posted (9)
bio/reference/ensembl-mysql-table/wrapper.R (7)
223-257
: Ensure correct output file format handling.The code handles different output file formats and compression types. Verify that the logic correctly identifies and processes supported formats.
Verification successful
Output file format handling is correctly implemented.
The code correctly identifies and processes supported output file formats and compression types using
write_tsv
andwrite_parquet
. Error handling for unsupported formats is also in place. No issues found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the handling of output file formats and compression types. # Test: Search for the handling of different file formats and compression types. Expect: Correct handling. rg --type r $'write_tsv|write_parquet'Length of output: 871
202-209
: Ensure efficient data binding.The code binds rows from multiple tables into a single tibble. Verify that this process is efficient and handles large datasets appropriately.
52-161
: Verify database name retrieval logic.Ensure that the logic for retrieving and verifying the database name is correct. The function uses multiple methods to check the build number, which should be verified for accuracy.
6-8
: Ensure proper logging setup.The logging setup redirects both standard output and error messages to a log file. Ensure that the log file path is correctly specified in the Snakemake configuration.
211-220
: Verify join conditions.Ensure that the join conditions specified in the parameters are correct. Verify that the joins are performed efficiently and handle potential mismatches gracefully.
Verification successful
The
rg
command output indicates that the only occurrence ofleft_join
is within thebio/reference/ensembl-mysql-table/wrapper.R
file, which is the same file and section we are reviewing. This means there are no other uses ofleft_join
in the repository, and thus no other join conditions in the codebase to a different part of the code.Given this information, the focus should be on ensuring that the join conditions within this section of the code are correctly defined and that the function
left_join
is used appropriately. The review comment suggested verifying the join conditions, but since there are no other occurrences, we need to ensure that the current usage is correct.To further verify, we should check the definitions and uses of the
join_tables
structure and theget_and_check_db_name
andget_table
functions to ensure that they are providing the correct information for the joins.Let's proceed to gather more information about these entities. We will search for their definitions and usages in the codebase.
The search results provide insights into the usage and definition of the key components involved in the join operation:
join_tables
: This variable is initialized fromsnakemake@params[["join_tables"]]
. It is used to determine the join conditions and the databases involved in the join operations.
get_and_check_db_name
: This function is defined in the same file and is responsible for retrieving and validating the database name based on the connection, species, database, release, and build parameters.
get_table
: This function is also defined in the same file and is used to retrieve a table from a specified database and port.The join operation in the code snippet is dependent on the structure of
join_tables
and the correctness of the database names and tables retrieved byget_and_check_db_name
andget_table
. Given that all these components are defined and used within the same file, it suggests a self-contained logic for handling joins.To ensure correctness, it would be beneficial to verify the structure of
join_tables
and the logic withinget_and_check_db_name
andget_table
. However, without further context or issues reported, the implementation appears to be consistent with the intended functionality.Conclusion: The join conditions and related logic in the code appear to be defined and used correctly within the file. No issues were found with the current implementation.
- File:
bio/reference/ensembl-mysql-table/wrapper.R
- Lines: 211-220
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the correctness and efficiency of join conditions. # Test: Search for the usage of join conditions in the repository. Expect: Correct and efficient joins. rg --type r $'left_join'Length of output: 84
Script:
#!/bin/bash # Search for the definition and usage of the `join_tables` variable. rg --type r 'join_tables' # Search for the definition and usage of the `get_and_check_db_name` function. rg --type r 'get_and_check_db_name' # Search for the definition and usage of the `get_table` function. rg --type r 'get_table'Length of output: 1400
10-18
: Verify library usage.Ensure that all imported libraries are necessary for the script's functionality. The
rlang::global_entrace()
call is unusual and should be verified for its intended purpose.
42-50
: Ensure secure database connection.The database connection uses an anonymous user with no password. Verify that this is secure and intended for the use case. Consider potential security implications.
test.py (2)
5676-5688
: Verify the setup and dependencies for the test.Ensure that the
run
function is correctly implemented and that the necessary files for the Snakemake workflow are present.Verification successful
Setup and dependencies for the test are correct.
The
run
function is implemented correctly, and the necessary files for the Snakemake workflow are present in the repository. The test setup appears to be properly configured.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the presence of necessary files and the correct setup of the `run` function for the test. # Test: Check for the presence of the required Snakemake workflow files. fd -e yaml -e smk -e py -e R -e Rmd -e sh -e txt --search-path bio/reference/ensembl-mysql-table # Test: Check the implementation of the `run` function. ast-grep --lang python --pattern $'def run($_, $_, check_log=None):\n $$$\n'Length of output: 5362
5691-5703
: Verify the setup and dependencies for the test.Ensure that the
run
function is correctly implemented and that the necessary files for the Snakemake workflow are present.Verification successful
Setup and dependencies for the test are correctly implemented.
The
run
function is well-implemented and handles the execution of the Snakemake workflow with necessary checks and cleanup. The required files for the workflow are present in the specified directory. No issues were found with the test setup.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the presence of necessary files and the correct setup of the `run` function for the test. # Test: Check for the presence of the required Snakemake workflow files. fd -e yaml -e smk -e py -e R -e Rmd -e sh -e txt --search-path bio/reference/ensembl-mysql-table # Test: Check the implementation of the `run` function. ast-grep --lang python --pattern $'def run($_, $_, check_log=None):\n $$$\n'Length of output: 5362
Co-authored-by: Till Hartmann <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (1)
- bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
Additional context used
yamllint
bio/reference/ensembl-mysql-table/meta.yaml
[error] 9-9: trailing spaces
(trailing-spaces)
[error] 50-50: no new line character at the end of file
(new-line-at-end-of-file)
Additional comments not posted (2)
bio/reference/ensembl-mysql-table/meta.yaml (2)
13-15
: URL and author information look good.The URL and author information are correctly specified.
16-22
: Output formats are well-defined.The output formats and documentation links are correctly specified.
…kemake/snakemake-wrappers into feat/add-ensembl-mysql-table-wrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (4)
- bio/reference/ensembl-mysql-table/environment.yaml (1 hunks)
- bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
- bio/reference/ensembl-mysql-table/test/Snakefile (1 hunks)
- bio/reference/ensembl-mysql-table/wrapper.R (1 hunks)
Additional comments not posted (7)
bio/reference/ensembl-mysql-table/environment.yaml (2)
1-4
: Ensure the channel order is optimal.The
nodefaults
channel is listed, which means only the specified channels will be used. Ensure that this is intentional and that the order ofconda-forge
andbioconda
is optimal for resolving dependencies.
6-9
: Verify the compatibility of R package versions.Ensure that the specified versions of
r-nanoparquet
,r-tidyverse
,r-dbplyr
, andr-rmariadb
are compatible with each other and with the intended use case.bio/reference/ensembl-mysql-table/test/Snakefile (2)
31-46
: Ensure parameter consistency across rules.The
create_regulatory_annotations_parquet
rule uses similar parameters to the previous rule. Ensure that parameter names and formats are consistent across rules for maintainability.
26-26
: Review caching strategy.The
cache: "omit-software"
directive is used to save space and time. Ensure that this caching strategy aligns with the workflow requirements and does not affect reproducibility.Also applies to: 60-60
bio/reference/ensembl-mysql-table/meta.yaml (1)
28-28
: Correct grammatical error.Change "consults" to "consult" for grammatical correctness.
- For full valid species names, consults the respective table for the + For full valid species names, consult the respective table for theLikely invalid or redundant comment.
bio/reference/ensembl-mysql-table/wrapper.R (2)
6-8
: Ensure proper log handling.The log file is opened and sinks are set up for both standard output and messages. Ensure that the log file is properly closed at the end of the script to avoid file descriptor leaks.
211-220
: Verify join column consistency.Ensure that the join columns specified in the
join_tables
parameter exist in both the main table and the join tables to prevent runtime errors.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai review |
Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (1)
- bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
Additional comments not posted (3)
bio/reference/ensembl-mysql-table/meta.yaml (3)
10-15
: URL and authors section are well-structured.The URL and authors sections are correctly formatted and provide necessary information.
16-22
: Output section is informative and well-structured.The output section clearly describes the supported file formats and provides relevant documentation links.
50-50
: Add a newline at the end of the file.Ensure the file ends with a newline character for better compatibility with Unix-based systems.
+
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
🤖 I have created a release \*beep\* \*boop\* --- ## [4.0.0](https://www.github.com/snakemake/snakemake-wrappers/compare/v3.14.2...v4.0.0) (2024-08-14) ### ⚠ BREAKING CHANGES * switch to nonpareil utils script for plotting (#3100) ### Features * add ensembl mysql table wrapper ([#3103](https://www.github.com/snakemake/snakemake-wrappers/issues/3103)) ([169a315](https://www.github.com/snakemake/snakemake-wrappers/commit/169a31508f330079799842a21d3b555714365041)) * switch to nonpareil utils script for plotting ([#3100](https://www.github.com/snakemake/snakemake-wrappers/issues/3100)) ([d8ee6cb](https://www.github.com/snakemake/snakemake-wrappers/commit/d8ee6cb1d15537cd8a42c38b9f2116cc77fa2880)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
QC
snakemake-wrappers
.While the contributions guidelines are more extensive, please particularly ensure that:
test.py
was updated to call any added or updated example rules in aSnakefile
input:
andoutput:
file paths in the rules can be chosen arbitrarilyinput:
oroutput:
)tempfile.gettempdir()
points tometa.yaml
contains a link to the documentation of the respective tool or command underurl:
Summary by CodeRabbit
New Features
Bug Fixes