GitHub - aertslab/create_cisTarget_databases: Create cisTarget databases (original) (raw)

Create cisTarget databases

ℹ️ The SCENIC+ public motif collection PWMs can be found at https://resources.aertslab.org/cistarget/motif_collections/.

You can find precomputed databases for human, mouse and fly at https://resources.aertslab.org/cistarget/.

Installation

Clone `create_cisTarget_databases` source code

Clone git repo.

git clone https://github.com/aertslab/create_cisTarget_databases

cd create_cisTarget_databases

Display to which value ${create_cistarget_databases_dir} variable should be set.

echo "create_cistarget_databases_dir='""${PWD}""'"

Create conda environment

Create conda environment.

conda create -n create_cistarget_databases \
    'python=3.10' \
    'numpy=1.21' \
    'pandas>=1.4.1' \
    'pyarrow>=7.0.0' \
    'numba>=0.55.1' \
    'python-flatbuffers'

Install Cluster-Buster

Install Cluster-Buster for scoring regulatory regions with motifs.

Install precompiled binary

Activate conda environment.

conda activate create_cistarget_databases

cd "${CONDA_PREFIX}/bin"

Download precompiled Cluster-Buster binary.

wget https://resources.aertslab.org/cistarget/programs/cbust

Make downloaded binary executable.

chmod a+x cbust

Compile from source

Clone Cluster-Buster repo.

#git clone https://github.com/weng-lab/cluster-buster/
git clone -b change_f4_output https://github.com/ghuls/cluster-buster/

cd cluster-buster

Compile Cluster-Buster.

make cbust

Compile Cluster-Buster with AMD Math Library (LibM).

This can be much faster (roughly twice as fast) than using the glibc math library on older distributions.

https://developer.amd.com/amd-aocl/amd-math-library-libm/

mkdir ../aocl-libm

cd ../aocl-libm

AMD_LIB_VERSION=3.1.0

tar xzf aocl-libm-linux-aocc-${AMD_LIB_VERSION}.tar.gz

mv amd-libm amd-libm-aocc

make cbust_amd_libm_aocc

Activate conda environment.

conda activate create_cistarget_databases

Copy the Cluster-Buster binary of your choice into the conda environment (run only one of the two commands).

cp -a cbust "${CONDA_PREFIX}/bin/cbust"
cp -a cbust_amd_libm_aocc "${CONDA_PREFIX}/bin/cbust"

Install UCSC tools

Install some UCSC tools:

liftOver: Move regulatory regions from one assembly to another. Needed when generating regulatory regions in other species when creating cross-species regulatory regions.
bigWigAverageOverBed: Compute average/max score of bigWig files for each regulatory region. Used for scoring TF ChIP-seq bigWig files.

Activate conda environment.

conda activate create_cistarget_databases

cd "${CONDA_PREFIX}/bin"

Download liftOver.

wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver

Download bigWigAverageOverBed.

wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed

Make downloaded binaries executable.

chmod a+x liftOver bigWigAverageOverBed

Activate environment

Before running any of the scripts, activate the create_cistarget_databases conda environment and set the create_cistarget_databases_dir variable to the directory that contains the cloned repo.

Activate conda environment.

conda activate create_cistarget_databases

Set ${create_cistarget_databases_dir} variable to path where the repo was cloned to.

create_cistarget_databases_dir=""

Memory requirements

Creating cisTarget databases can be a very memory intensive job as it needs to create/store a 2D matrix with dimensions (number of motifs/tracks vs number of regions/genes) or vice versa. Besides this, it needs (relatively little) memory to store motif/tracks and regions/genes names.

Memory size of cisTarget scores database when loaded in memory:

4 bytes x number of regions/genes x number of motifs/tracks
memory needed to store region/genes names and motifs/tracks names

Memory size of cisTarget rankings database when loaded in memory:

32768 regions/genes or less:
- 2 bytes x number of regions/genes x number of motifs/tracks
- memory needed to store region/genes names and motifs/tracks names
more than 32768 regions/genes:
- 4 bytes x number of regions/genes x number of motifs/tracks
- memory needed to store region/genes names and motifs/tracks names

Examples:

cisTarget database type	byte size (1 element)	#genes/regions	#motifs/tracks	RAM requirement for 2D matrix
cisTarget scores database	4 bytes	20000 genes	10000 motifs	4 bytes x 20000 x 10000 = 0.745 GiB
cisTarget rankings database	2 bytes	20000 genes	10000 motifs	2 bytes x 20000 x 10000 = 0.373 GiB
cisTarget scores database	4 bytes	1000000 regions	10000 motifs	4 bytes x 1000000 x 10000 = 37.253 GiB
cisTarget rankings database	4 bytes	1000000 regions	10000 motifs	4 bytes x 1000000 x 10000 = 37.253 GiB
cisTarget scores database	4 bytes	1000000 regions	1000 tracks	4 bytes x 1000000 x 1000 = 3.725 GiB
cisTarget rankings database	4 bytes	1000000 regions	1000 tracks	4 bytes x 1000000 x 1000 = 3.725 GiB

When running the scripts in this repo, you might need around 3 times the amount of RAM of the actual database.
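The rules above can be turned into a quick estimate. The following is a minimal sketch, not part of the repo: the function names are made up for illustration, and it only covers the 2D matrix (the memory for region/gene and motif/track names is ignored).

```python
# Rough RAM estimate for the 2D matrix of a cisTarget database,
# following the rules listed under "Memory requirements".
# Function names are hypothetical; they do not exist in the repo.

GIB = 1024 ** 3  # bytes per GiB


def matrix_bytes(db_type, nbr_regions_or_genes, nbr_motifs_or_tracks):
    """Return the size in bytes of the 2D matrix only (names are not counted)."""
    if db_type == "scores":
        bytes_per_element = 4
    elif db_type == "rankings":
        # 2 bytes per element for 32768 regions/genes or less, otherwise 4 bytes.
        bytes_per_element = 2 if nbr_regions_or_genes <= 32768 else 4
    else:
        raise ValueError(f"unknown database type: {db_type!r}")
    return bytes_per_element * nbr_regions_or_genes * nbr_motifs_or_tracks


def matrix_gib(db_type, nbr_regions_or_genes, nbr_motifs_or_tracks):
    """Same as matrix_bytes(), expressed in GiB."""
    return matrix_bytes(db_type, nbr_regions_or_genes, nbr_motifs_or_tracks) / GIB


if __name__ == "__main__":
    examples = [
        ("scores", 20000, 10000),
        ("rankings", 20000, 10000),
        ("scores", 1000000, 10000),
        ("rankings", 1000000, 1000),
    ]
    for db_type, n_rows, n_cols in examples:
        size = matrix_gib(db_type, n_rows, n_cols)
        # The scripts might need around 3 times the size of the database itself.
        print(f"{db_type:8s} {n_rows:>8d} x {n_cols:>6d}: "
              f"{size:7.3f} GiB matrix, ~{3 * size:.1f} GiB while running")
```

Multiplying the result by 3 gives a ballpark for the peak RAM mentioned above; treat it as a rough guide rather than a guarantee.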

Scripts overview

script	description
create_cistarget_motif_databases.py	Create cisTarget motif databases.
create_cistarget_track_databases.py	Create cisTarget track databases.
combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py	Combine partial cisTarget motifs or tracks vs regions or genes scores databases to: 1) a complete cisTarget motifs or tracks vs regions or genes scores database and 2) a complete cisTarget regions or genes vs motifs or tracks scores database.
combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py	Combine partial cisTarget regions or genes vs motifs or tracks scores databases to: 1) a complete cisTarget regions or genes vs motifs or tracks scores database and 2) a complete cisTarget motifs or tracks vs regions or genes scores database.
convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py	Convert cisTarget motifs or tracks vs regions or genes scores database to cisTarget rankings database.
create_cross_species_motifs_rankings_db.py	Create cisTarget cross-species motifs rankings databases.

Usage

create_cistarget_motif_databases.py

❯ ${create_cistarget_databases_dir}/create_cistarget_motif_databases.py --help
usage: create_cistarget_motif_databases.py [-h] -f FASTA_FILENAME
        [-F ORIGINAL_SPECIES_FASTA_FILENAME] -M MOTIFS_DIR
        -m MOTIFS_LIST_FILENAME [-5 MOTIF_MD5_TO_MOTIF_ID_FILENAME]
        -o DB_PREFIX [-c CLUSTER_BUSTER_PATH] [-t NBR_THREADS]
        [-p CURRENT_PART NBR_TOTAL_PARTS]
        [-g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE]
        [-b BG_PADDING] [--min MIN_NBR_MOTIFS] [--max MAX_NBR_MOTIFS]
        [-l] [-s SEED] [-r SSH_COMMAND]

Create cisTarget motif databases.

options:
  -h, --help
      show this help message and exit
  -f FASTA_FILENAME, --fasta FASTA_FILENAME
      FASTA filename which contains the regions/genes to score with Cluster-Buster for each motif. When creating a cisTarget species database from regions/genes lifted over from a different species, provide the original FASTA file for that species to -F.
  -F ORIGINAL_SPECIES_FASTA_FILENAME, --fasta-original-species ORIGINAL_SPECIES_FASTA_FILENAME
      FASTA filename which contains all the regions/genes of the original species. The fasta file provided to -f can contain less regions (not all regions could be lifted over) than the one provided to -F, but to create a cisTarget cross-species database later, all individual cisTarget species databases need to contain the same amount of regions/genes.
  -M MOTIFS_DIR, --motifs_dir MOTIFS_DIR
      Path to directory with Cluster-Buster motifs.
  -m MOTIFS_LIST_FILENAME, --motifs MOTIFS_LIST_FILENAME
      Filename with list of motif IDs or motif MD5 names to be scored from directory specified by "--motifs_dir".
  -5 MOTIF_MD5_TO_MOTIF_ID_FILENAME, --md5 MOTIF_MD5_TO_MOTIF_ID_FILENAME
      Filename with motif MD5 to motif ID mappings to map Cluster-Buster motif MD5 filenames to motif IDs.
  -o DB_PREFIX, --output DB_PREFIX
      Feather database prefix output filename.
  -c CLUSTER_BUSTER_PATH, --cbust CLUSTER_BUSTER_PATH
      Path to Cluster-Buster (https://github.com/weng-lab/cluster-buster/). Default: "cbust".
  -t NBR_THREADS, --threads NBR_THREADS
      Number of threads to use when scoring motifs. Default: 1.
  -p CURRENT_PART NBR_TOTAL_PARTS, --partial CURRENT_PART NBR_TOTAL_PARTS
      Divide the motif list in a number of total parts (of similar size) and score only the part defined by current_part. This allows creating partial databases on machines which do not have enough RAM to score all motifs in one iteration. This will only create a partial regions/genes vs motifs scoring database ({db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.regions_vs_motifs.scores.feather or {db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.genes_vs_motifs.scores.feather).
  -g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE, --genes EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE
      Take top CRM score for a gene by taking the maximum CRM score of multiple regions for that gene. Define a regex which will remove the non-gene part of the region ID, so only the gene ID remains. Examples: "gene_id#some_number": "#[0-9]+$" or "region_id@@gene_id": "^.+@@".
  -b BG_PADDING, --bgpadding BG_PADDING
      Background padding in bp that was added for each sequence in FASTA file. Default: 0.
  --min MIN_NBR_MOTIFS
      Minimum number of motifs needed per Cluster-Buster motif file to be considered for scoring (filters motifs list). Default: 1.
  --max MAX_NBR_MOTIFS
      Maximum number of motifs needed per Cluster-Buster motif file to be considered for scoring (filters motifs list). Default: None.
  -l, --mask
      Consider masked (lowercase) nucleotides as Ns.
  -s SEED, --seed SEED
      Random seed used for breaking ties when creating rankings for a range of tied scores. When setting this seed to a specific value and running this script with the same input, will result in the same rankings databases as output.
  -r SSH_COMMAND, --ssh SSH_COMMAND
      If defined, run Cluster-Buster over ssh by running the provided command to make the connection before running Cluster-Buster itself. Example: 'ssh -o ControlMaster=auto -o ControlPath=/tmp/ssh-control-path-%l-%h-%p-%r -o ControlPersist=600 '
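To make the -g / --genes option more concrete, here is a minimal sketch of what the help text describes: a regex strips the non-gene part of each region ID, and the maximum score over all regions of a gene is kept. The function name and the toy scores are invented for illustration, and applying the regex with re.sub is an assumption about how the script uses it.

```python
import re


def max_score_per_gene(region_scores, regex):
    """Collapse region scores to gene scores by keeping the maximum per gene.

    region_scores: dict mapping region ID -> score (toy data in this sketch).
    regex: pattern that matches the non-gene part of the region ID, which is
           removed so that only the gene ID remains (assumed to work like -g).
    """
    pattern = re.compile(regex)
    gene_scores = {}
    for region_id, score in region_scores.items():
        gene_id = pattern.sub("", region_id)
        if gene_id not in gene_scores or score > gene_scores[gene_id]:
            gene_scores[gene_id] = score
    return gene_scores


# Region IDs of the form "gene_id#some_number".
scores_a = {"GeneA#1": 0.5, "GeneA#2": 2.0, "GeneB#1": 1.25}
genes_a = max_score_per_gene(scores_a, r"#[0-9]+$")

# Region IDs of the form "region_id@@gene_id".
scores_b = {
    "chr1:100-200@@GeneA": 3.0,
    "chr1:300-400@@GeneA": 1.0,
    "chr2:5-50@@GeneB": 0.0,
}
genes_b = max_score_per_gene(scores_b, r"^.+@@")

print(genes_a)
print(genes_b)
```

Testing the regex on a handful of real region IDs this way before a long scoring run is a cheap way to catch a pattern that strips too much or too little.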

create_cistarget_track_databases.py

❯ ${create_cistarget_databases_dir}/create_cistarget_track_databases.py --help
usage: create_cistarget_track_databases.py [-h] -b BED_FILENAME -T TRACKS_DIR
        -d TRACKS_LIST_FILENAME -o DB_PREFIX
        [-a BIGWIG_AVERAGE_OVER_BED_PATH] [-t NBR_THREADS]
        [-p CURRENT_PART NBR_TOTAL_PARTS]
        [-g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE]
        [-s SEED] [-r SSH_COMMAND]

Create cisTarget track databases.

options:
  -h, --help
      show this help message and exit
  -b BED_FILENAME, --bed BED_FILENAME
      BED filename which contains the regions/genes to score with bigWigAverageOverBed for each bigwig track (ChIP-seq) files.
  -T TRACKS_DIR, --tracks_dir TRACKS_DIR
      Path to directory with bigwig track (ChIP-seq) files.
  -d TRACKS_LIST_FILENAME, --tracks TRACKS_LIST_FILENAME
      Filename with list of track IDs to be scored from directory specified by "--tracks_dir".
  -o DB_PREFIX, --output DB_PREFIX
      Feather database prefix output filename.
  -a BIGWIG_AVERAGE_OVER_BED_PATH, --bwaob BIGWIG_AVERAGE_OVER_BED_PATH
      Path to bigWigAverageOverBed (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed). Default: "bigWigAverageOverBed".
  -t NBR_THREADS, --threads NBR_THREADS
      Number of threads to use when scoring tracks. Default: 1.
  -p CURRENT_PART NBR_TOTAL_PARTS, --partial CURRENT_PART NBR_TOTAL_PARTS
      Divide the tracks list in a number of total parts (of similar size) and score only the part defined by current_part. This allows creating partial databases on machines which do not have enough RAM to score all tracks in one iteration. This will only create a partial regions/genes vs tracks scoring database ({db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.regions_vs_tracks.scores.feather or {db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.genes_vs_tracks.scores.feather).
  -g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE, --genes EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE
      Take top score for a gene by taking the maximum score of multiple regions for that gene. Define a regex which will remove the non-gene part of the region ID, so only the gene ID remains. Examples: "gene_id#some_number": "#[0-9]+$" or "region_id@@gene_id": "^.+@@".
  -s SEED, --seed SEED
      Random seed used for breaking ties when creating rankings for a range of tied scores. When setting this seed to a specific value and running this script with the same input, will result in the same rankings databases as output.
  -r SSH_COMMAND, --ssh SSH_COMMAND
      If defined, run bigWigAverageOverBed over ssh by running the provided command to make the connection before running bigWigAverageOverBed itself. Example: 'ssh -o ControlMaster=auto -o ControlPath=/tmp/ssh-control-path-%l-%h-%p-%r -o ControlPersist=600 '

combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py

❯ ${create_cistarget_databases_dir}/combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py --help
usage: combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py [-h] -i INPUT -o OUTPUT_DIR

Combine partial cisTarget motifs or tracks vs regions or genes scores databases to: 1) a complete cisTarget motifs or tracks vs regions or genes scores database and 2) a complete cisTarget regions or genes vs motifs or tracks scores database.

options:
  -h, --help
      show this help message and exit
  -i INPUT, --input INPUT
      Input directory or database prefix with partial cisTarget motif or track vs regions or genes scores database Feather files.
  -o OUTPUT_DIR, --output OUTPUT_DIR
      Output directory to which the 1) complete cisTarget motifs or tracks vs regions or genes scores database Feather files and 2) complete cisTarget regions or genes vs motif or track scores database Feather files will be written.

combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py

❯ ${create_cistarget_databases_dir}/combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py --help
usage: combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py [-h] -i INPUT -o OUTPUT_DIR

Combine partial cisTarget regions or genes vs motifs or tracks scores databases to: 1) a complete cisTarget regions or genes vs motifs or tracks scores database and 2) a complete cisTarget motifs or tracks vs regions or genes scores database.

options:
  -h, --help
      show this help message and exit
  -i INPUT, --input INPUT
      Input directory or database prefix with partial cisTarget regions or genes vs motif or track scores database Feather files.
  -o OUTPUT_DIR, --output OUTPUT_DIR
      Output directory to which the 1) complete cisTarget regions or genes vs motif or track scores database Feather files and 2) complete cisTarget motifs or tracks vs regions or genes scores database Feather files will be written.

convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py

❯ ${create_cistarget_databases_dir}/convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py --help
usage: convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py [-h] -i CT_SCORES_DB_MOTIFS_OR_TRACKS_VS_REGIONS_OR_GENES_FILENAME [-s SEED]

Convert cisTarget motifs or tracks vs regions or genes scores database to cisTarget rankings database.

optional arguments:
  -h, --help
      show this help message and exit
  -i CT_SCORES_DB_MOTIFS_OR_TRACKS_VS_REGIONS_OR_GENES_FILENAME, --db CT_SCORES_DB_MOTIFS_OR_TRACKS_VS_REGIONS_OR_GENES_FILENAME
      cisTarget motifs or tracks vs regions or genes scores database filename. The cisTarget rankings database Feather file will be written to the same directory.
  -s SEED, --seed SEED
      Random seed used for breaking ties when creating rankings for a range of tied scores. When setting this seed to a specific value and running this script with the same input, will result in the same cisTarget rankings databases as output.

create_cross_species_motifs_rankings_db.py

❯ ${create_cistarget_databases_dir}/create_cross_species_motifs_rankings_db.py -h
usage: create_cross_species_motifs_rankings_db.py [-h] -i INPUT -o OUTPUT_DIR

Create cisTarget cross-species motifs rankings databases.

optional arguments:
  -h, --help
      show this help message and exit
  -i INPUT, --input INPUT
      Input directory or database prefix with cisTarget motifs vs regions or genes rankings databases per species.
  -o OUTPUT_DIR, --output OUTPUT_DIR
      Output directory to which the cisTarget cross-species motifs rankings database files will be written.

Creating cisTarget databases: details

To create cisTarget databases, you need:

FASTA file with regulatory regions:
- gene-based
- region-based
motifs or TF ChIP-seq tracks:
- motifs: in Cluster-Buster format
- tracks: bigWig files of TF ChIP-seq data

Creating cisTarget motif databases

cisTarget motif databases can be created in 2 ways:

Score all motifs at once and create rankings
Score motifs in different parts and generate rankings in a separate step

Score all motifs at once and create rankings

Create cisTarget motif databases:

create_cistarget_motif_databases.py
- for each motif score all regulatory regions and create a cisTarget motifs vs regions/genes scores db:
  * *.motifs_vs_regions.scores.feather
  * *.motifs_vs_genes.scores.feather
- transpose cisTarget motifs vs regions/genes scores db to cisTarget regions/genes vs motifs scores db:
  * *.regions_vs_motifs.scores.feather
  * *.genes_vs_motifs.scores.feather
- create a ranking for each regulatory region per motif based on the CRM score of the motif for that region and create a cisTarget motifs vs regions/genes rankings db:
  * *.motifs_vs_regions.rankings.feather
  * *.motifs_vs_genes.rankings.feather
- transpose cisTarget motifs vs regions/genes rankings db to cisTarget regions/genes vs motifs rankings db:
  * *.regions_vs_motifs.rankings.feather
  * *.genes_vs_motifs.rankings.feather

FASTA file with sequences per region IDs / gene IDs.

fasta_filename=

Directory with motifs in Cluster-Buster format.

motifs_dir=

File with motif IDs (base name of motif file in ${motifs_dir}).

motifs_list_filename=

cisTarget motif database output prefix.

db_prefix=

nbr_threads=22

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"
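The motifs list file contains one motif ID per line, where each ID is the base name of a motif file in the motifs directory. A minimal sketch for generating such a list is given below. It assumes that the Cluster-Buster motif files carry a ".cb" extension, which is not stated in this README, so adjust the extension to your motif collection. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_motif_ids(motifs_dir, extension=".cb"):
    """Return sorted motif IDs: file names in motifs_dir without the extension.

    The ".cb" default is an assumption about how the motif files are named.
    """
    return sorted(
        path.name[: -len(extension)]
        for path in Path(motifs_dir).iterdir()
        if path.is_file() and path.name.endswith(extension)
    )


def write_motifs_list(motifs_dir, motifs_list_filename, extension=".cb"):
    """Write one motif ID per line and return the list of IDs written."""
    motif_ids = list_motif_ids(motifs_dir, extension)
    Path(motifs_list_filename).write_text(
        "".join(f"{motif_id}\n" for motif_id in motif_ids)
    )
    return motif_ids


# Small self-contained demo in a temporary directory with empty dummy files.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_motifs_dir = Path(tmp_dir) / "motifs"
    demo_motifs_dir.mkdir()
    for name in ("motif_b.cb", "motif_a.cb", "README.txt"):
        (demo_motifs_dir / name).write_text("")
    demo_list_file = Path(tmp_dir) / "motifs.lst"
    demo_ids = write_motifs_list(demo_motifs_dir, demo_list_file)
    demo_file_content = demo_list_file.read_text()

print(demo_ids)
```

The resulting file can then be passed to -m, with the directory passed to -M.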

Score all tracks at once and create rankings

Create cisTarget tracks databases:

create_cistarget_track_databases.py
- for each track score all regulatory regions and create a cisTarget tracks vs regions/genes scores db:
  * *.tracks_vs_regions.scores.feather
  * *.tracks_vs_genes.scores.feather
- transpose cisTarget tracks vs regions/genes scores db to cisTarget regions/genes vs tracks scores db:
  * *.regions_vs_tracks.scores.feather
  * *.genes_vs_tracks.scores.feather
- create a ranking for each regulatory region per track based on the track score of the track for that region and create a cisTarget tracks vs regions/genes rankings db:
  * *.tracks_vs_regions.rankings.feather
  * *.tracks_vs_genes.rankings.feather
- transpose cisTarget tracks vs regions/genes rankings db to cisTarget regions/genes vs tracks rankings db:
  * *.regions_vs_tracks.rankings.feather
  * *.genes_vs_tracks.rankings.feather

BED file with regions to score.

regions_bed_filename=

Directory with bigWig tracks of TF-ChIP-seq data.

tracks_dir=

File with track IDs (base names of bigWig files in ${tracks_dir}).

tracks_list_filename=

cisTarget track database output prefix.

db_prefix=

nbr_threads=22

"${create_cistarget_databases_dir}/create_cistarget_track_databases.py" \
    -b "${regions_bed_filename}" \
    -T "${tracks_dir}" \
    -d "${tracks_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"

Score motifs in different parts and generate rankings in a separate step

Create cisTarget motif databases:

create_cistarget_motif_databases.py:
- score the whole list of motifs in several parts by running create_cistarget_motif_databases.py with the -p ${current_part} ${nbr_total_parts} option with ${current_part} set from 1 to ${nbr_total_parts}. Each run will create motif scores (for the current subset of motifs) for all regulatory regions and create a partial cisTarget motifs vs regions/genes scores db:
 * *.part_${current_part}_of_${nbr_total_parts}.motifs_vs_regions.scores.feather
 * *.part_${current_part}_of_${nbr_total_parts}.motifs_vs_genes.scores.feather
combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py:
- Combine partial cisTarget motifs vs regions/genes scores db to:
  * a complete cisTarget motifs vs regions/genes scores database:
  * *.motifs_vs_regions.scores.feather
  * *.motifs_vs_genes.scores.feather
  * a complete cisTarget regions/genes vs motifs scores database:
  * *.regions_vs_motifs.scores.feather
  * *.genes_vs_motifs.scores.feather
  * partial cisTarget scores databases can be deleted afterwards.
convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py
- create a ranking for each regulatory region per motif based on the CRM score of the motif for that region and create a cisTarget motifs vs regions/genes rankings db:
  * *.motifs_vs_regions.rankings.feather
  * *.motifs_vs_genes.rankings.feather
- transpose cisTarget motifs vs regions/genes rankings db to cisTarget regions/genes vs motifs rankings db:
  * *.regions_vs_motifs.rankings.feather
  * *.genes_vs_motifs.rankings.feather

Step 1

Using -p or --partial of create_cistarget_motif_databases.py (see Usage) divides the motif list into ${nbr_total_parts} parts of similar size and scores only the part defined by ${current_part}.

This allows creating partial databases on machines which do not have enough RAM to score all motifs in one iteration and/or running the motif scoring on multiple nodes in parallel (where each node runs the motif scoring with a different value for ${current_part}). This is especially useful when the number of regions/genes is high.

Each run only creates partial cisTarget motifs vs regions/genes scores database files:

${output_db_prefix}.part_0*${current_part}_of_0*${nbr_total_parts}.motifs_vs_regions.scores.feather
${output_db_prefix}.part_0*${current_part}_of_0*${nbr_total_parts}.motifs_vs_genes.scores.feather
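The idea of dividing the motif list into parts of similar size can be sketched as follows. This is only an illustration of the concept: the exact way create_cistarget_motif_databases.py assigns motifs to parts is not described here, so the contiguous split below is an assumption, and the function names and motif IDs are hypothetical.

```python
def split_into_parts(items, nbr_total_parts):
    """Split items into contiguous parts whose sizes differ by at most one."""
    if nbr_total_parts < 1:
        raise ValueError("nbr_total_parts must be at least 1")
    base_size, extra = divmod(len(items), nbr_total_parts)
    parts = []
    start = 0
    for part_idx in range(nbr_total_parts):
        # The first `extra` parts get one additional item.
        size = base_size + (1 if part_idx < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def get_part(items, current_part, nbr_total_parts):
    """Return the items for a 1-based current_part, mirroring -p semantics."""
    if not 1 <= current_part <= nbr_total_parts:
        raise ValueError("current_part must be between 1 and nbr_total_parts")
    return split_into_parts(items, nbr_total_parts)[current_part - 1]


# 23 hypothetical motif IDs divided over 10 parts.
motif_ids = [f"motif_{i}" for i in range(1, 24)]
parts = split_into_parts(motif_ids, 10)
part_sizes = [len(part) for part in parts]
first_part = get_part(motif_ids, 1, 10)

print(part_sizes)
print(first_part)
```

Every motif ends up in exactly one part, so running all values of ${current_part} from 1 to ${nbr_total_parts} covers the whole list once.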

FASTA file with sequences per region IDs / gene IDs.

fasta_filename=

Directory with motifs in Cluster-Buster format.

motifs_dir=

File with motif IDs (base name of motif file in ${motifs_dir}).

motifs_list_filename=

cisTarget motif database output prefix.

db_prefix=

nbr_threads=22
nbr_total_parts=10

Create a partial directory, so partial cisTarget database files can be deleted easily afterwards.

mkdir partial

Each iteration of the for loop (with a different ${current_part}) can also be submitted to a different node to speed up the motif scoring.

for current_part in $(seq 1 ${nbr_total_parts}) ; do
    "${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
        -f "${fasta_filename}" \
        -M "${motifs_dir}" \
        -m "${motifs_list_filename}" \
        -p "${current_part}" "${nbr_total_parts}" \
        -o "partial/${db_prefix}" \
        -t "${nbr_threads}"
done

When creating cross-species databases, motif scoring should be done with regulatory regions after liftover (fasta_filename), while the original regulatory regions fasta file (original_species_fasta_filename) should also be provided. The latter is only used to make sure that all regions/genes of the original regulatory regions are in the generated cisTarget database as some regions might get lost after liftover. cisTarget motif score databases need to be generated for each species of interest.

FASTA file with sequences per region IDs / gene IDs.

fasta_filename=

FASTA file with sequences per region IDs / gene IDs of the original species.

original_species_fasta_filename=

Directory with motifs in Cluster-Buster format.

motifs_dir=

File with motif IDs (base name of motif file in ${motifs_dir}).

motifs_list_filename=

cisTarget motif database output prefix.

db_prefix=

nbr_threads=22

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -F "${original_species_fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"
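To see which regions were lost during liftover, and therefore why the original FASTA file is needed, the sequence IDs of both FASTA files can be compared. The sketch below is a standalone illustration with hypothetical function names and toy sequences. It assumes that lifted-over regions keep the region ID of the original species, and it does not reproduce how the script itself handles the missing regions.

```python
def fasta_ids(lines):
    """Return sequence IDs (first word after '>') in order of appearance."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def regions_lost_in_liftover(original_lines, lifted_lines):
    """Return IDs present in the original FASTA but absent after liftover."""
    lifted_ids = set(fasta_ids(lifted_lines))
    return [
        region_id
        for region_id in fasta_ids(original_lines)
        if region_id not in lifted_ids
    ]


# Toy FASTA content; real input would be the files given to -F and -f.
original = """>region1
ACGT
>region2
GGCC
>region3
TTAA
""".splitlines()

lifted = """>region1
ACGA
>region3
TTAG
""".splitlines()

lost = regions_lost_in_liftover(original, lifted)
print(lost)
```

With real files, open file handles can be passed instead of the toy lists, for example `regions_lost_in_liftover(open(original_path), open(lifted_path))`, where both path variables are placeholders.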

Step 2

See Memory requirements for a rough estimate of the amount of memory needed in case you have problems running this step.

When all partial cisTarget motifs vs regions/genes scores database files are created:

${db_prefix}.part_0*_of_0*${nbr_total_parts}.motifs_vs_regions.scores.feather
${db_prefix}.part_0*_of_0*${nbr_total_parts}.motifs_vs_genes.scores.feather

they can be combined with combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py (see Usage) to:

a complete cisTarget motifs vs regions/genes scores database:
- *.motifs_vs_regions.scores.feather
- *.motifs_vs_genes.scores.feather
a complete cisTarget regions/genes vs motifs scores database:
- *.regions_vs_motifs.scores.feather
- *.genes_vs_motifs.scores.feather

Partial cisTarget scores databases can be deleted afterwards.

"${create_cistarget_databases_dir}/combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py" \
    -i partial/ \
    -o .

Partial cisTarget databases can be removed.

#rm -r partial
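Before combining (and certainly before deleting the partial directory), it can be worth checking that every part was actually produced, especially when parts were scored on different nodes. The following sketch parses the "part_X_of_Y" portion of the file names. The helper name is hypothetical, and the 4-digit zero padding in the demo file names is only illustrative.

```python
import re

# Matches ".part_<current>_of_<total>." with any amount of zero padding.
PART_PATTERN = re.compile(r"\.part_(\d+)_of_(\d+)\.")


def missing_parts(filenames):
    """Return the sorted part numbers for which no partial file was found."""
    seen_parts = set()
    totals = set()
    for name in filenames:
        match = PART_PATTERN.search(name)
        if match:
            seen_parts.add(int(match.group(1)))
            totals.add(int(match.group(2)))
    if not totals:
        raise ValueError("no partial database files found")
    if len(totals) != 1:
        raise ValueError(f"inconsistent number of total parts: {sorted(totals)}")
    nbr_total_parts = totals.pop()
    return [
        part for part in range(1, nbr_total_parts + 1) if part not in seen_parts
    ]


# Demo: 10 expected parts, part 7 was never written.
demo_names = [
    f"db.part_{part:04d}_of_0010.motifs_vs_regions.scores.feather"
    for part in range(1, 11)
    if part != 7
]
demo_missing = missing_parts(demo_names)
print(demo_missing)
```

For a real run, the file names could be collected with something like `[p.name for p in pathlib.Path("partial").glob("*.scores.feather")]`.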

Step 3

See Memory requirements for a rough estimate of the amount of memory needed in case you have problems running this step.

Create rankings from a complete cisTarget motifs vs regions/genes scores database with convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py (see Usage):

create a ranking for each regulatory region per motif based on the CRM score of the motif for that region and create a cisTarget motifs vs regions/genes rankings db:
- *.motifs_vs_regions.rankings.feather
- *.motifs_vs_genes.rankings.feather
transpose cisTarget motifs vs regions/genes rankings db to cisTarget regions/genes vs motifs rankings db:
- *.regions_vs_motifs.rankings.feather
- *.genes_vs_motifs.rankings.feather

cisTarget motifs vs regions/genes scores database filename (input):

- *.motifs_vs_regions.scores.feather

- *.motifs_vs_genes.scores.feather

db_filename=

"${create_cistarget_databases_dir}/convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py" \
    -i "${db_filename}"
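The ranking step for a single motif can be sketched as below: regions are ordered from highest to lowest score, and regions with identical scores are put in a random order that is reproducible for a given seed. This is a plain-Python illustration with a hypothetical function name. The actual script may use a different random number generator and rank convention (the 0-based ranks here are an assumption), so the concrete tie order will not match its output.

```python
import random


def rank_scores(scores, seed=None):
    """Return 0-based ranks (0 = highest score); ties are broken at random.

    With the same seed and the same input, the same ranks are returned.
    """
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    # Shuffle first, then sort with a stable sort: items with equal scores
    # keep their shuffled (random) relative order.
    rng.shuffle(indices)
    indices.sort(key=lambda idx: scores[idx], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(indices):
        ranks[idx] = rank
    return ranks


# Toy scores of one motif for four regions; regions 0 and 2 are tied.
scores = [0.2, 3.5, 0.2, 1.0]
ranks = rank_scores(scores, seed=42)
print(ranks)
```

Setting the seed only matters for tied scores; regions with distinct scores always receive the same rank.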

Create cisTarget cross-species motifs rankings database

To create cisTarget cross-species motifs rankings database:

Lift over the regulatory regions of the main species to each species of interest.
Create cisTarget motif databases for each lifted-over regulatory regions FASTA file, converting scores to rankings with convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py where needed.
Create cisTarget cross-species motifs rankings from the individual (per species) cisTarget motifs rankings databases with create_cross_species_motifs_rankings_db.py.

cisTarget database prefix which matches the common part of all cisTarget rankings databases (or just the directory that contains them).

db_prefix=

Output directory.

output_dir=

"${create_cistarget_databases_dir}/create_cross_species_motifs_rankings_db.py" \
    -i "${db_prefix}" \
    -o "${output_dir}"
