Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ensembl mysql table wrapper #3103

Merged
merged 26 commits into from
Aug 14, 2024

Conversation

dlaehnemann
Copy link
Contributor

@dlaehnemann dlaehnemann commented Aug 7, 2024

QC

While the contributions guidelines are more extensive, please particularly ensure that:

  • test.py was updated to call any added or updated example rules in a Snakefile
  • input: and output: file paths in the rules can be chosen arbitrarily
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:)
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to
  • the meta.yaml contains a link to the documentation of the respective tool or command under url:
  • conda environments use a minimal amount of channels and packages, in recommended ordering

Summary by CodeRabbit

  • New Features

    • Added a Conda environment setup file for Linux 64-bit systems to simplify package management.
    • Introduced a specialized Conda environment for bioinformatics projects with key R package dependencies.
    • Launched a configuration schema to facilitate the creation of biological annotation tables from Ensembl databases.
    • Implemented new rules in Snakemake for generating repeat and regulatory annotations from Ensembl databases.
    • Released an R script wrapper for easier interaction with Ensembl databases to retrieve genomic data.
  • Bug Fixes

    • Enhanced testing coverage with new automated tests for annotation file creation processes.

@dlaehnemann dlaehnemann marked this pull request as draft August 7, 2024 16:35
@dlaehnemann dlaehnemann marked this pull request as ready for review August 7, 2024 19:50
@dlaehnemann dlaehnemann requested a review from tedil August 8, 2024 18:27
@dlaehnemann
Copy link
Contributor Author

@tedil As you already had a look, just wanted to let you know about this latest change:

I just updated the build number checking code, to actually use the assembly.name and assembly.default meta_key: meta_values from the meta table in the core database. This should be even more robust than the checks I had previously come up with. But I kept those previous versions as fallbacks.

Also, I improved the error messages. Mainly by structuring them a bit more, to make all the reported values in them easier to scan over quickly.

…kemake/snakemake-wrappers into feat/add-ensembl-mysql-table-wrapper
Copy link
Contributor

@tedil tedil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor stuff, looks good to me (and really useful!)

bio/reference/ensembl-mysql-table/meta.yaml Outdated Show resolved Hide resolved
bio/reference/ensembl-mysql-table/meta.yaml Outdated Show resolved Hide resolved
bio/reference/ensembl-mysql-table/test/Snakefile Outdated Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
Copy link
Contributor

coderabbitai bot commented Aug 12, 2024

Walkthrough

The recent updates enhance a bioinformatics framework, improving reproducibility and user experience in data analysis. New configuration files establish Conda environments with necessary package dependencies, while a Snakemake workflow supports generating annotation tables from Ensembl databases. An R script acts as a wrapper for streamlined database interactions, facilitating efficient data handling for researchers.

Changes

Files Change Summary
bio/reference/.../environment.linux-64.pin.txt New file for explicit Conda package dependencies for Linux 64-bit setups.
bio/reference/.../environment.yaml New file defining a Conda environment for bioinformatics, specifying channels and key R dependencies.
bio/reference/.../meta.yaml New configuration file outlining metadata for generating Ensembl annotation tables.
bio/reference/.../test/Snakefile New rules added for creating repeat and regulatory annotations using Snakemake workflow.
bio/reference/.../wrapper.R New R script that provides a wrapper for accessing genomic data from Ensembl databases.

Poem

🐰 In the garden where data flows,
New tools arise like springtime rows.
With R and Snakemake by our side,
We hop through bioinformatics with pride.
Conda environments, a cozy burrow,
For every scientist, a path to tomorrow! 🌱


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (7)
bio/reference/ensembl-mysql-table/environment.yaml (1)

9-9: Add a newline at the end of the file.

The file should end with a newline character to adhere to POSIX standards.

9c9
<   - r-rmariadb =1.2
\ No newline at end of file
---
>   - r-rmariadb =1.2
Tools
yamllint

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

bio/reference/ensembl-mysql-table/test/Snakefile (1)

8-8: Consider restructuring main_tables for consistency.

Using a consistent structure for main_tables and join_tables might enhance readability and maintainability, especially if more sub-keys are added in the future.

main_tables={"repeat_feature": {"database": "core"}}
bio/reference/ensembl-mysql-table/meta.yaml (2)

9-9: Remove trailing spaces.

Trailing spaces can cause formatting issues and should be removed.

9c9
<     `main_tables`, to allow for `left_join'ing their columns onto them. 
---
>     `main_tables`, to allow for `left_join'ing their columns onto them.
Tools
yamllint

[error] 9-9: trailing spaces

(trailing-spaces)


50-50: Add a newline at the end of the file.

The file should end with a newline character to adhere to POSIX standards.

50c50
<       https://tavareshugo.github.io/data_carpentry_extras/dbplyr_ensembl/dbplyr_ensembl.html
\ No newline at end of file
---
>       https://tavareshugo.github.io/data_carpentry_extras/dbplyr_ensembl/dbplyr_ensembl.html
Tools
yamllint

[error] 50-50: no new line character at the end of file

(new-line-at-end-of-file)

bio/reference/ensembl-mysql-table/wrapper.R (3)

184-199: Consider implementing table filtering.

The TODO comment suggests that filtering could be implemented to reduce data transfer. Consider whether this is necessary and how it could be implemented efficiently.

# Example of potential filtering implementation:
# filter_condition <- snakemake@params[["filter_condition"]]
# table <- tbl(table_connection, table_name) |> filter(!!rlang::sym(filter_condition)) |> collect()

32-35: Ensure clarity in error messages.

The error messages use cli_abort to provide detailed information. Verify that the messages are clear, informative, and guide the user towards resolving the issue.

Consider rephrasing error messages for improved clarity and user guidance.

Also applies to: 128-157, 163-180, 238-243, 251-257


19-40: Add explicit parameter validation.

The parameters species, release, and build retrieved from Snakemake should be explicitly validated before use to prevent potential issues with malformed or incorrect input.

  • File: bio/reference/ensembl-mysql-table/wrapper.R
  • Lines: 19-40
Analysis chain

Verify parameter validation.

Ensure that the parameters retrieved from Snakemake are validated before use. The logic for setting the database port based on the genome build should be verified for correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the validation of parameters retrieved from Snakemake.

# Test: Search for parameter validation logic in the repository. Expect: Proper validation.
rg --type r $'snakemake@params'

Length of output: 8200


Script:

#!/bin/bash
# Description: Check for validation or error handling related to specific parameters in the file.

# Test: Search for validation or error handling for 'species', 'release', and 'build' parameters.
rg --type r 'species|release|build' bio/reference/ensembl-mysql-table/wrapper.R -A 5 -B 5

Length of output: 7605

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 26a097d and 0244a59.

Files selected for processing (6)
  • bio/reference/ensembl-mysql-table/environment.linux-64.pin.txt (1 hunks)
  • bio/reference/ensembl-mysql-table/environment.yaml (1 hunks)
  • bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
  • bio/reference/ensembl-mysql-table/test/Snakefile (1 hunks)
  • bio/reference/ensembl-mysql-table/wrapper.R (1 hunks)
  • test.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • bio/reference/ensembl-mysql-table/environment.linux-64.pin.txt
Additional context used
yamllint
bio/reference/ensembl-mysql-table/environment.yaml

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

bio/reference/ensembl-mysql-table/meta.yaml

[error] 9-9: trailing spaces

(trailing-spaces)


[error] 50-50: no new line character at the end of file

(new-line-at-end-of-file)

Additional comments not posted (9)
bio/reference/ensembl-mysql-table/wrapper.R (7)

223-257: Ensure correct output file format handling.

The code handles different output file formats and compression types. Verify that the logic correctly identifies and processes supported formats.

Verification successful

Output file format handling is correctly implemented.

The code correctly identifies and processes supported output file formats and compression types using write_tsv and write_parquet. Error handling for unsupported formats is also in place. No issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the handling of output file formats and compression types.

# Test: Search for the handling of different file formats and compression types. Expect: Correct handling.
rg --type r $'write_tsv|write_parquet'

Length of output: 871


202-209: Ensure efficient data binding.

The code binds rows from multiple tables into a single tibble. Verify that this process is efficient and handles large datasets appropriately.


52-161: Verify database name retrieval logic.

Ensure that the logic for retrieving and verifying the database name is correct. The function uses multiple methods to check the build number, which should be verified for accuracy.


6-8: Ensure proper logging setup.

The logging setup redirects both standard output and error messages to a log file. Ensure that the log file path is correctly specified in the Snakemake configuration.


211-220: Verify join conditions.

Ensure that the join conditions specified in the parameters are correct. Verify that the joins are performed efficiently and handle potential mismatches gracefully.

Verification successful

The rg command output indicates that the only occurrence of left_join is within the bio/reference/ensembl-mysql-table/wrapper.R file, which is the same file and section we are reviewing. This means there are no other uses of left_join in the repository, and thus no other join conditions in the codebase to a different part of the code.

Given this information, the focus should be on ensuring that the join conditions within this section of the code are correctly defined and that the function left_join is used appropriately. The review comment suggested verifying the join conditions, but since there are no other occurrences, we need to ensure that the current usage is correct.

To further verify, we should check the definitions and uses of the join_tables structure and the get_and_check_db_name and get_table functions to ensure that they are providing the correct information for the joins.

Let's proceed to gather more information about these entities. We will search for their definitions and usages in the codebase.


The search results provide insights into the usage and definition of the key components involved in the join operation:

  1. join_tables: This variable is initialized from snakemake@params[["join_tables"]]. It is used to determine the join conditions and the databases involved in the join operations.

  2. get_and_check_db_name: This function is defined in the same file and is responsible for retrieving and validating the database name based on the connection, species, database, release, and build parameters.

  3. get_table: This function is also defined in the same file and is used to retrieve a table from a specified database and port.

The join operation in the code snippet is dependent on the structure of join_tables and the correctness of the database names and tables retrieved by get_and_check_db_name and get_table. Given that all these components are defined and used within the same file, it suggests a self-contained logic for handling joins.

To ensure correctness, it would be beneficial to verify the structure of join_tables and the logic within get_and_check_db_name and get_table. However, without further context or issues reported, the implementation appears to be consistent with the intended functionality.

Conclusion: The join conditions and related logic in the code appear to be defined and used correctly within the file. No issues were found with the current implementation.

  • File: bio/reference/ensembl-mysql-table/wrapper.R
  • Lines: 211-220
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness and efficiency of join conditions.

# Test: Search for the usage of join conditions in the repository. Expect: Correct and efficient joins.
rg --type r $'left_join'

Length of output: 84


Script:

#!/bin/bash
# Search for the definition and usage of the `join_tables` variable.
rg --type r 'join_tables'

# Search for the definition and usage of the `get_and_check_db_name` function.
rg --type r 'get_and_check_db_name'

# Search for the definition and usage of the `get_table` function.
rg --type r 'get_table'

Length of output: 1400


10-18: Verify library usage.

Ensure that all imported libraries are necessary for the script's functionality. The rlang::global_entrace() call is unusual and should be verified for its intended purpose.


42-50: Ensure secure database connection.

The database connection uses an anonymous user with no password. Verify that this is secure and intended for the use case. Consider potential security implications.

test.py (2)

5676-5688: Verify the setup and dependencies for the test.

Ensure that the run function is correctly implemented and that the necessary files for the Snakemake workflow are present.

Verification successful

Setup and dependencies for the test are correct.

The run function is implemented correctly, and the necessary files for the Snakemake workflow are present in the repository. The test setup appears to be properly configured.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the presence of necessary files and the correct setup of the `run` function for the test.

# Test: Check for the presence of the required Snakemake workflow files.
fd -e yaml -e smk -e py -e R -e Rmd -e sh -e txt --search-path bio/reference/ensembl-mysql-table

# Test: Check the implementation of the `run` function.
ast-grep --lang python --pattern $'def run($_, $_, check_log=None):\n  $$$\n'

Length of output: 5362


5691-5703: Verify the setup and dependencies for the test.

Ensure that the run function is correctly implemented and that the necessary files for the Snakemake workflow are present.

Verification successful

Setup and dependencies for the test are correctly implemented.

The run function is well-implemented and handles the execution of the Snakemake workflow with necessary checks and cleanup. The required files for the workflow are present in the specified directory. No issues were found with the test setup.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the presence of necessary files and the correct setup of the `run` function for the test.

# Test: Check for the presence of the required Snakemake workflow files.
fd -e yaml -e smk -e py -e R -e Rmd -e sh -e txt --search-path bio/reference/ensembl-mysql-table

# Test: Check the implementation of the `run` function.
ast-grep --lang python --pattern $'def run($_, $_, check_log=None):\n  $$$\n'

Length of output: 5362

Co-authored-by: Till Hartmann <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

Commits

Files that changed from the base of the PR and between 0244a59 and 5be2d85.

Files selected for processing (1)
  • bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
Additional context used
yamllint
bio/reference/ensembl-mysql-table/meta.yaml

[error] 9-9: trailing spaces

(trailing-spaces)


[error] 50-50: no new line character at the end of file

(new-line-at-end-of-file)

Additional comments not posted (2)
bio/reference/ensembl-mysql-table/meta.yaml (2)

13-15: URL and author information look good.

The URL and author information are correctly specified.


16-22: Output formats are well-defined.

The output formats and documentation links are correctly specified.

bio/reference/ensembl-mysql-table/meta.yaml Outdated Show resolved Hide resolved
bio/reference/ensembl-mysql-table/meta.yaml Outdated Show resolved Hide resolved
bio/reference/ensembl-mysql-table/meta.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

Commits

Files that changed from the base of the PR and between 5be2d85 and 56dd9b1.

Files selected for processing (4)
  • bio/reference/ensembl-mysql-table/environment.yaml (1 hunks)
  • bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
  • bio/reference/ensembl-mysql-table/test/Snakefile (1 hunks)
  • bio/reference/ensembl-mysql-table/wrapper.R (1 hunks)
Additional comments not posted (7)
bio/reference/ensembl-mysql-table/environment.yaml (2)

1-4: Ensure the channel order is optimal.

The nodefaults channel is listed, which means only the specified channels will be used. Ensure that this is intentional and that the order of conda-forge and bioconda is optimal for resolving dependencies.


6-9: Verify the compatibility of R package versions.

Ensure that the specified versions of r-nanoparquet, r-tidyverse, r-dbplyr, and r-rmariadb are compatible with each other and with the intended use case.

bio/reference/ensembl-mysql-table/test/Snakefile (2)

31-46: Ensure parameter consistency across rules.

The create_regulatory_annotations_parquet rule uses similar parameters to the previous rule. Ensure that parameter names and formats are consistent across rules for maintainability.


26-26: Review caching strategy.

The cache: "omit-software" directive is used to save space and time. Ensure that this caching strategy aligns with the workflow requirements and does not affect reproducibility.

Also applies to: 60-60

bio/reference/ensembl-mysql-table/meta.yaml (1)

28-28: Correct grammatical error.

Change "consults" to "consult" for grammatical correctness.

-      For full valid species names, consults the respective table for the
+      For full valid species names, consult the respective table for the

Likely invalid or redundant comment.

bio/reference/ensembl-mysql-table/wrapper.R (2)

6-8: Ensure proper log handling.

The log file is opened and sinks are set up for both standard output and messages. Ensure that the log file is properly closed at the end of the script to avoid file descriptor leaks.


211-220: Verify join column consistency.

Ensure that the join columns specified in the join_tables parameter exist in both the main table and the join tables to prevent runtime errors.

bio/reference/ensembl-mysql-table/meta.yaml Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
bio/reference/ensembl-mysql-table/wrapper.R Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@dlaehnemann
Copy link
Contributor Author

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Aug 13, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

Commits

Files that changed from the base of the PR and between 56dd9b1 and dc74b91.

Files selected for processing (1)
  • bio/reference/ensembl-mysql-table/meta.yaml (1 hunks)
Additional comments not posted (3)
bio/reference/ensembl-mysql-table/meta.yaml (3)

10-15: URL and authors section are well-structured.

The URL and authors sections are correctly formatted and provide necessary information.


16-22: Output section is informative and well-structured.

The output section clearly describes the supported file formats and provides relevant documentation links.


50-50: Add a newline at the end of the file.

Ensure the file ends with a newline character for better compatibility with Unix-based systems.

+  

Likely invalid or redundant comment.

bio/reference/ensembl-mysql-table/meta.yaml Show resolved Hide resolved
bio/reference/ensembl-mysql-table/meta.yaml Show resolved Hide resolved
Copy link
Contributor

@tedil tedil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@dlaehnemann dlaehnemann merged commit 169a315 into master Aug 14, 2024
8 checks passed
@dlaehnemann dlaehnemann deleted the feat/add-ensembl-mysql-table-wrapper branch August 14, 2024 11:39
dlaehnemann pushed a commit that referenced this pull request Aug 14, 2024
🤖 I have created a release \*beep\* \*boop\*
---
##
[4.0.0](https://www.github.com/snakemake/snakemake-wrappers/compare/v3.14.2...v4.0.0)
(2024-08-14)


### ⚠ BREAKING CHANGES

* switch to nonpareil utils script for plotting (#3100)

### Features

* add ensembl mysql table wrapper
([#3103](https://www.github.com/snakemake/snakemake-wrappers/issues/3103))
([169a315](https://www.github.com/snakemake/snakemake-wrappers/commit/169a31508f330079799842a21d3b555714365041))
* switch to nonpareil utils script for plotting
([#3100](https://www.github.com/snakemake/snakemake-wrappers/issues/3100))
([d8ee6cb](https://www.github.com/snakemake/snakemake-wrappers/commit/d8ee6cb1d15537cd8a42c38b9f2116cc77fa2880))
---


This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot mentioned this pull request Sep 24, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants