Summary

Efficiently query a curated library of open-access computational biology datasets.

Use data frame verbs (filter, select) to push down row filters and column selections to the storage backend, downloading only the data you need into your working session.
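
To make the idea of pushdown concrete, here is a toy sketch in plain Python. It is not this library's implementation: `LazyQuery` and `STORED_ROWS` are hypothetical names invented for illustration, and an in-memory list stands in for the storage backend. The sketch only shows the general pattern, in which `filter` and `select` record a plan and no data is read until `collect` runs that plan at the source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]

# Hypothetical stand-in for a remote table: each dict is one stored row.
STORED_ROWS: List[Row] = [
    {"ancestry": "EUR", "protein": "P43220", "position": 101, "neg_log_10_p_value": 12.5},
    {"ancestry": "EUR", "protein": "P43220", "position": 202, "neg_log_10_p_value": 3.1},
    {"ancestry": "AFR", "protein": "P43220", "position": 101, "neg_log_10_p_value": 8.0},
    {"ancestry": "EUR", "protein": "Q99999", "position": 303, "neg_log_10_p_value": 20.0},
]


@dataclass(frozen=True)
class LazyQuery:
    """Records filter/select steps; nothing is read until collect()."""

    source: List[Row]
    predicates: Tuple[Callable[[Row], bool], ...] = ()
    columns: Optional[Tuple[str, ...]] = None

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyQuery":
        # Only extend the plan; no rows are touched here.
        return LazyQuery(self.source, self.predicates + (predicate,), self.columns)

    def select(self, *columns: str) -> "LazyQuery":
        # Only record the wanted columns; no rows are touched here.
        return LazyQuery(self.source, self.predicates, columns)

    def collect(self) -> List[Row]:
        # The "backend" applies the recorded plan while scanning, so only
        # matching rows and requested columns are handed back to the caller.
        result = []
        for row in self.source:
            if all(predicate(row) for predicate in self.predicates):
                keep = self.columns if self.columns is not None else tuple(row)
                result.append({column: row[column] for column in keep})
        return result


hits = (
    LazyQuery(STORED_ROWS)
    .filter(lambda row: row["ancestry"] == "EUR")
    .filter(lambda row: row["protein"] == "P43220")
    .select("position", "neg_log_10_p_value")
    .collect()
)
print(hits)
```

Of the four stored rows and four columns, only two rows and two columns reach the caller. A real backend would apply the same plan before any data crosses the network.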

Write iterative queries to incorporate massive, otherwise unwieldy datasets into your workflow, with no bulk downloading or preprocessing:

R:

# top 10 GLP1R hits per ancestry
# (assumes dplyr is attached for filter/select/arrange/collect)
for (anc in c("AFR", "AMR", "CSA", "EAS", "EUR", "MID")) {
    hits <- load_dataset("ukb_ppp/pqtls") |>
        filter(
            ancestry == anc,
            protein == "P43220"  # UniProt accession for GLP1R
        ) |>
        select(
            ancestry,
            chromosome,
            position,
            effect_allele,
            other_allele,
            neg_log_10_p_value
        ) |>
        arrange(
            desc(neg_log_10_p_value)
        ) |>
        head(10) |>
        collect()
    print(hits)
}

Python:

# top 10 GLP1R hits per ancestry
# (assumes `bb` is this library's Python module, already imported)
import polars as pl

for anc in ['AFR', 'AMR', 'CSA', 'EAS', 'EUR', 'MID']:
    hits = (
        bb.load_dataset('ukb_ppp/pqtls')
        .filter(
            pl.col('ancestry') == anc,
            pl.col('protein') == 'P43220',  # UniProt accession for GLP1R
        )
        .select(
            'ancestry',
            'chromosome',
            'position',
            'effect_allele',
            'other_allele',
            'neg_log_10_p_value',
        )
        .top_k(10, by='neg_log_10_p_value')
        .collect()
    )
    print(hits)

Datasets

Use the links in the sidebar to browse the available datasets and their contents.

To request the addition of a new dataset to the library, open a GitHub issue.