Create cisTarget databases
ℹ️ The SCENIC+ public motif collection PWMs can be found at https://resources.aertslab.org/cistarget/motif_collections/.
You can find precomputed databases for human, mouse and fly at https://resources.aertslab.org/cistarget/.
Installation
Clone create_cisTarget_databases source code
# Clone git repo.
git clone https://github.com/aertslab/create_cisTarget_databases
cd create_cisTarget_databases
# Display the value to which the ${create_cistarget_databases_dir} variable should be set.
echo "create_cistarget_databases_dir='""${PWD}""'"
Create conda environment
# Create conda environment.
conda create -n create_cistarget_databases \
    'python=3.10' \
    'numpy=1.21' \
    'pandas>=1.4.1' \
    'pyarrow>=7.0.0' \
    'numba>=0.55.1' \
    'python-flatbuffers'
Install Cluster-Buster
Install Cluster-Buster for scoring regulatory regions with motifs.
Install precompiled binary
# Activate conda environment.
conda activate create_cistarget_databases
cd "${CONDA_PREFIX}/bin"
# Download precompiled Cluster-Buster binary.
wget https://resources.aertslab.org/cistarget/programs/cbust
# Make downloaded binary executable.
chmod a+x cbust
Compile from source
# Clone Cluster-Buster repo.
#git clone https://github.com/weng-lab/cluster-buster/
git clone -b change_f4_output https://github.com/ghuls/cluster-buster/
cd cluster-buster
# Compile Cluster-Buster.
make cbust
# Compile Cluster-Buster with AMD Math Library (LibM).
# This can be much faster (like twice as fast) than using the
# glibc math library on older distributions.
#   https://developer.amd.com/amd-aocl/amd-math-library-libm/
# The tar command below expects the AOCL-LibM tarball from that page to be present in ../aocl-libm.
mkdir ../aocl-libm
cd ../aocl-libm
AMD_LIB_VERSION=3.1.0
tar xzf aocl-libm-linux-aocc-${AMD_LIB_VERSION}.tar.gz
mv amd-libm amd-libm-aocc
# Return to the Cluster-Buster source directory before building.
cd ../cluster-buster
make cbust_amd_libm_aocc
# Activate conda environment.
conda activate create_cistarget_databases
# Copy the Cluster-Buster binary of your choice into the conda environment (run only one of the two commands).
cp -a cbust "${CONDA_PREFIX}/bin/cbust"
#cp -a cbust_amd_libm_aocc "${CONDA_PREFIX}/bin/cbust"
Install UCSC tools
Install some UCSC tools:
- liftOver: Move regulatory regions from one assembly to another. Needed to generate regulatory regions in other species when creating cross-species regulatory regions.
- bigWigAverageOverBed: Compute average/max score of bigWig files for each regulatory region. Used for scoring TF ChIP-seq bigWig files.
# Activate conda environment.
conda activate create_cistarget_databases
cd "${CONDA_PREFIX}/bin"
# Download liftOver.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
# Download bigWigAverageOverBed.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed
# Make downloaded binaries executable.
chmod a+x liftOver bigWigAverageOverBed
Activate environment
Before running any of the scripts, activate the create_cistarget_databases conda environment and set the create_cistarget_databases_dir variable to the directory that contains the cloned repo.
# Activate conda environment.
conda activate create_cistarget_databases
# Set ${create_cistarget_databases_dir} variable to the path where the repo was cloned to.
create_cistarget_databases_dir=""
Memory requirements
Creating cisTarget databases can be a very memory intensive job as it needs to create/store a 2D matrix with dimensions (number of motifs/tracks vs number of regions/genes) or vice versa. Besides this, it needs (relatively little) memory to store motif/tracks and regions/genes names.
Memory size of cisTarget scores database when loaded in memory:
4 bytes x number of regions/genes x number of motifs/tracks
+ memory needed to store regions/genes names and motifs/tracks names
Memory size of cisTarget rankings database when loaded in memory:
- 32768 regions/genes or less:
2 bytes x number of regions/genes x number of motifs/tracks
+ memory needed to store regions/genes names and motifs/tracks names
- more than 32768 regions/genes:
4 bytes x number of regions/genes x number of motifs/tracks
+ memory needed to store regions/genes names and motifs/tracks names
Examples:
cisTarget database type | byte size (1 element) | #genes/regions | #motifs/tracks | RAM requirement for 2D matrix |
---|---|---|---|---|
cisTarget scores database | 4 bytes | 20000 genes | 10000 motifs | 4 bytes x 20000 x 10000 = 0.745 GiB |
cisTarget rankings database | 2 bytes | 20000 genes | 10000 motifs | 2 bytes x 20000 x 10000 = 0.373 GiB |
cisTarget scores database | 4 bytes | 1000000 regions | 10000 motifs | 4 bytes x 1000000 x 10000 = 37.253 GiB |
cisTarget rankings database | 4 bytes | 1000000 regions | 10000 motifs | 4 bytes x 1000000 x 10000 = 37.253 GiB |
cisTarget scores database | 4 bytes | 1000000 regions | 1000 tracks | 4 bytes x 1000000 x 1000 = 3.725 GiB |
cisTarget rankings database | 4 bytes | 1000000 regions | 1000 tracks | 4 bytes x 1000000 x 1000 = 3.725 GiB |
When running the scripts in this repo, you might need around 3 times the amount of RAM of the actual database.
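The rules above can be turned into a small calculator. The following is a minimal sketch, not part of this repo: the function names are hypothetical, it only covers the 2D matrix (not the names of regions/genes and motifs/tracks), and it reports sizes in binary gigabytes (1024^3 bytes), which is the unit that reproduces the figures in the examples table.

```python
def matrix_bytes(n_regions_or_genes: int, n_motifs_or_tracks: int, db_type: str) -> int:
    """Size in bytes of the 2D matrix of a cisTarget database (names excluded)."""
    if db_type == "scores":
        bytes_per_element = 4
    elif db_type == "rankings":
        # 2 bytes per element for 32768 regions/genes or less, 4 bytes otherwise.
        bytes_per_element = 2 if n_regions_or_genes <= 32768 else 4
    else:
        raise ValueError(f"unknown database type: {db_type!r}")
    return bytes_per_element * n_regions_or_genes * n_motifs_or_tracks


def to_gib(n_bytes: int) -> float:
    """Convert bytes to binary gigabytes (GiB)."""
    return n_bytes / 1024**3


if __name__ == "__main__":
    examples = [
        ("scores", 20_000, 10_000),
        ("rankings", 20_000, 10_000),
        ("scores", 1_000_000, 10_000),
        ("rankings", 1_000_000, 1_000),
    ]
    for db_type, n_rows, n_cols in examples:
        size = to_gib(matrix_bytes(n_rows, n_cols, db_type))
        # The scripts may need around 3 times the size of the database itself.
        print(f"{db_type:8s} {n_rows:>8d} x {n_cols:>6d}: {size:7.3f} GiB (plan for ~{3 * size:.1f} GiB)")
```

Multiplying the result by 3 gives a rough planning figure for the RAM needed when running the scripts.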
Scripts overview
script | description |
---|---|
create_cistarget_motif_databases.py | Create cisTarget motif databases. |
create_cistarget_track_databases.py | Create cisTarget track databases. |
combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py | Combine partial cisTarget motifs or tracks vs regions or genes scores databases to: 1) a complete cisTarget motifs or tracks vs regions or genes scores database and 2) a complete cisTarget regions or genes vs motifs or tracks scores database. |
combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py | Combine partial cisTarget regions or genes vs motifs or tracks scores databases to: 1) a complete cisTarget regions or genes vs motifs or tracks scores database and 2) a complete cisTarget motifs or tracks vs regions or genes scores database. |
convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py | Convert cisTarget motifs or tracks vs regions or genes scores database to cisTarget rankings database. |
create_cross_species_motifs_rankings_db.py | Create cisTarget cross-species motifs rankings databases. |
Usage
create_cistarget_motif_databases.py
❯ ${create_cistarget_databases_dir}/create_cistarget_motif_databases.py --help
usage: create_cistarget_motif_databases.py [-h] -f FASTA_FILENAME [-F ORIGINAL_SPECIES_FASTA_FILENAME]
                                           -M MOTIFS_DIR -m MOTIFS_LIST_FILENAME
                                           [-5 MOTIF_MD5_TO_MOTIF_ID_FILENAME] -o DB_PREFIX
                                           [-c CLUSTER_BUSTER_PATH] [-t NBR_THREADS]
                                           [-p CURRENT_PART NBR_TOTAL_PARTS]
                                           [-g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE]
                                           [-b BG_PADDING] [--min MIN_NBR_MOTIFS] [--max MAX_NBR_MOTIFS]
                                           [-l] [-s SEED] [-r SSH_COMMAND]
Create cisTarget motif databases.
options:
  -h, --help
      show this help message and exit
  -f FASTA_FILENAME, --fasta FASTA_FILENAME
      FASTA filename which contains the regions/genes to score with Cluster-Buster for each motif. When creating a cisTarget species database from regions/genes lifted over from a different species, provide the original FASTA file for that species to -F.
  -F ORIGINAL_SPECIES_FASTA_FILENAME, --fasta-original-species ORIGINAL_SPECIES_FASTA_FILENAME
      FASTA filename which contains all the regions/genes of the original species. The fasta file provided to -f can contain less regions (not all regions could be lifted over) than the one provided to -F, but to create a cisTarget cross-species database later, all individual cisTarget species databases need to contain the same amount of regions/genes.
  -M MOTIFS_DIR, --motifs_dir MOTIFS_DIR
      Path to directory with Cluster-Buster motifs.
  -m MOTIFS_LIST_FILENAME, --motifs MOTIFS_LIST_FILENAME
      Filename with list of motif IDs or motif MD5 names to be scored from directory specified by "--motifs_dir".
  -5 MOTIF_MD5_TO_MOTIF_ID_FILENAME, --md5 MOTIF_MD5_TO_MOTIF_ID_FILENAME
      Filename with motif MD5 to motif ID mappings to map Cluster-Buster motif MD5 filenames to motif IDs.
  -o DB_PREFIX, --output DB_PREFIX
      Feather database prefix output filename.
  -c CLUSTER_BUSTER_PATH, --cbust CLUSTER_BUSTER_PATH
      Path to Cluster-Buster (https://github.com/weng-lab/cluster-buster/). Default: "cbust".
  -t NBR_THREADS, --threads NBR_THREADS
      Number of threads to use when scoring motifs. Default: 1.
  -p CURRENT_PART NBR_TOTAL_PARTS, --partial CURRENT_PART NBR_TOTAL_PARTS
      Divide the motif list in a number of total parts (of similar size) and score only the part defined by current_part. This allows creating partial databases on machines which do not have enough RAM to score all motifs in one iteration. This will only create a partial regions/genes vs motifs scoring database ({db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.regions_vs_motifs.scores.feather or {db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.genes_vs_motifs.scores.feather).
  -g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE, --genes EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE
      Take top CRM score for a gene by taking the maximum CRM score of multiple regions for that gene. Define a regex which will remove the non-gene part of the region ID, so only the gene ID remains. Examples: "gene_id#some_number": "#[0-9]+$" or "region_id@@gene_id": "^.+@@".
  -b BG_PADDING, --bgpadding BG_PADDING
      Background padding in bp that was added for each sequence in FASTA file. Default: 0.
  --min MIN_NBR_MOTIFS
      Minimum number of motifs needed per Cluster-Buster motif file to be considered for scoring (filters motifs list). Default: 1.
  --max MAX_NBR_MOTIFS
      Maximum number of motifs needed per Cluster-Buster motif file to be considered for scoring (filters motifs list). Default: None.
  -l, --mask
      Consider masked (lowercase) nucleotides as Ns.
  -s SEED, --seed SEED
      Random seed used for breaking ties when creating rankings for a range of tied scores. When setting this seed to a specific value and running this script with the same input, will result in the same rankings databases as output.
  -r SSH_COMMAND, --ssh SSH_COMMAND
      If defined, run Cluster-Buster over ssh by running the provided command to make the connection before running Cluster-Buster itself. Example: 'ssh -o ControlMaster=auto -o ControlPath=/tmp/ssh-control-path-%l-%h-%p-%r -o ControlPersist=600 '
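To make the -g/--genes option more concrete, here is a minimal sketch of what "remove the non-gene part of the region ID and keep the maximum score per gene" amounts to. It assumes the regex is applied as a plain substitution with an empty string (the script's internals are not shown in this README), and the function name and the example region IDs and scores are made up for illustration.

```python
import re


def max_score_per_gene(region_scores: dict, regex_remove: str) -> dict:
    """Collapse region scores to gene scores by keeping the top score per gene.

    regex_remove matches the non-gene part of each region ID; what is left
    after removing the match is used as the gene ID.
    """
    pattern = re.compile(regex_remove)
    gene_scores = {}
    for region_id, score in region_scores.items():
        gene_id = pattern.sub("", region_id)
        if gene_id not in gene_scores or score > gene_scores[gene_id]:
            gene_scores[gene_id] = score
    return gene_scores


# Region IDs of the form "gene_id#some_number".
hash_style = {"GeneA#1": 0.5, "GeneA#2": 2.0, "GeneB#1": 1.0}
print(max_score_per_gene(hash_style, r"#[0-9]+$"))

# Region IDs of the form "region_id@@gene_id".
at_style = {"chr1:100-200@@GeneA": 0.3, "chr1:500-900@@GeneA": 0.1}
print(max_score_per_gene(at_style, r"^.+@@"))
```

Testing a regex on a handful of region IDs this way before starting a long scoring run is a cheap way to catch a pattern that leaves part of the region ID attached to the gene ID.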
create_cistarget_track_databases.py
❯ ${create_cistarget_databases_dir}/create_cistarget_track_databases.py --help
usage: create_cistarget_track_databases.py [-h] -b BED_FILENAME -T TRACKS_DIR -d TRACKS_LIST_FILENAME
                                           -o DB_PREFIX [-a BIGWIG_AVERAGE_OVER_BED_PATH] [-t NBR_THREADS]
                                           [-p CURRENT_PART NBR_TOTAL_PARTS]
                                           [-g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE]
                                           [-s SEED] [-r SSH_COMMAND]
Create cisTarget track databases.
options:
  -h, --help
      show this help message and exit
  -b BED_FILENAME, --bed BED_FILENAME
      BED filename which contains the regions/genes to score with bigWigAverageOverBed for each bigwig track (ChIP-seq) files.
  -T TRACKS_DIR, --tracks_dir TRACKS_DIR
      Path to directory with bigwig track (ChIP-seq) files.
  -d TRACKS_LIST_FILENAME, --tracks TRACKS_LIST_FILENAME
      Filename with list of track IDs to be scored from directory specified by "--tracks_dir".
  -o DB_PREFIX, --output DB_PREFIX
      Feather database prefix output filename.
  -a BIGWIG_AVERAGE_OVER_BED_PATH, --bwaob BIGWIG_AVERAGE_OVER_BED_PATH
      Path to bigWigAverageOverBed (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed). Default: "bigWigAverageOverBed".
  -t NBR_THREADS, --threads NBR_THREADS
      Number of threads to use when scoring tracks. Default: 1.
  -p CURRENT_PART NBR_TOTAL_PARTS, --partial CURRENT_PART NBR_TOTAL_PARTS
      Divide the tracks list in a number of total parts (of similar size) and score only the part defined by current_part. This allows creating partial databases on machines which do not have enough RAM to score all tracks in one iteration. This will only create a partial regions/genes vs tracks scoring database ({db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.regions_vs_tracks.scores.feather or {db_prefix}.part_000{current_part}_of_000{nbr_total_parts}.genes_vs_tracks.scores.feather).
  -g EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE, --genes EXTRACT_GENE_ID_FROM_REGION_ID_REGEX_REPLACE
      Take top score for a gene by taking the maximum score of multiple regions for that gene. Define a regex which will remove the non-gene part of the region ID, so only the gene ID remains. Examples: "gene_id#some_number": "#[0-9]+$" or "region_id@@gene_id": "^.+@@".
  -s SEED, --seed SEED
      Random seed used for breaking ties when creating rankings for a range of tied scores. When setting this seed to a specific value and running this script with the same input, will result in the same rankings databases as output.
  -r SSH_COMMAND, --ssh SSH_COMMAND
      If defined, run bigWigAverageOverBed over ssh by running the provided command to make the connection before running bigWigAverageOverBed itself. Example: 'ssh -o ControlMaster=auto -o ControlPath=/tmp/ssh-control-path-%l-%h-%p-%r -o ControlPersist=600 '
combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py
❯ ${create_cistarget_databases_dir}/combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py --help
usage: combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py [-h] -i INPUT -o OUTPUT_DIR
Combine partial cisTarget motifs or tracks vs regions or genes scores databases to: 1) a complete cisTarget motifs or tracks vs regions or genes scores database and 2) a complete cisTarget regions or genes vs motifs or tracks scores database.
options:
  -h, --help
      show this help message and exit
  -i INPUT, --input INPUT
      Input directory or database prefix with partial cisTarget motif or track vs regions or genes scores database Feather files.
  -o OUTPUT_DIR, --output OUTPUT_DIR
      Output directory to which the 1) complete cisTarget motifs or tracks vs regions or genes scores database Feather files and 2) complete cisTarget regions or genes vs motif or track scores database Feather files will be written.
combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py
❯ ${create_cistarget_databases_dir}/combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py --help
usage: combine_partial_regions_or_genes_vs_motifs_or_tracks_scores_cistarget_dbs.py [-h] -i INPUT -o OUTPUT_DIR
Combine partial cisTarget regions or genes vs motifs or tracks scores databases to: 1) a complete cisTarget regions or genes vs motifs or tracks scores database and 2) a complete cisTarget motifs or tracks vs regions or genes scores database.
options:
  -h, --help
      show this help message and exit
  -i INPUT, --input INPUT
      Input directory or database prefix with partial cisTarget regions or genes vs motif or track scores database Feather files.
  -o OUTPUT_DIR, --output OUTPUT_DIR
      Output directory to which the 1) complete cisTarget regions or genes vs motif or track scores database Feather files and 2) complete cisTarget motifs or tracks vs regions or genes scores database Feather files will be written.
convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py
❯ ${create_cistarget_databases_dir}/convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py --help
usage: convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py [-h]
           -i CT_SCORES_DB_MOTIFS_OR_TRACKS_VS_REGIONS_OR_GENES_FILENAME [-s SEED]
Convert cisTarget motifs or tracks vs regions or genes scores database to cisTarget rankings database.
optional arguments:
  -h, --help
      show this help message and exit
  -i CT_SCORES_DB_MOTIFS_OR_TRACKS_VS_REGIONS_OR_GENES_FILENAME, --db CT_SCORES_DB_MOTIFS_OR_TRACKS_VS_REGIONS_OR_GENES_FILENAME
      cisTarget motifs or tracks vs regions or genes scores database filename. The cisTarget rankings database Feather file will be written to the same directory.
  -s SEED, --seed SEED
      Random seed used for breaking ties when creating rankings for a range of tied scores. When setting this seed to a specific value and running this script with the same input, will result in the same cisTarget rankings databases as output.
create_cross_species_motifs_rankings_db.py
❯ ${create_cistarget_databases_dir}/create_cross_species_motifs_rankings_db.py -h
usage: create_cross_species_motifs_rankings_db.py [-h] -i INPUT -o OUTPUT_DIR
Create cisTarget cross-species motifs rankings databases.
optional arguments:
  -h, --help
      show this help message and exit
  -i INPUT, --input INPUT
      Input directory or database prefix with cisTarget motifs vs regions or genes rankings databases per species.
  -o OUTPUT_DIR, --output OUTPUT_DIR
      Output directory to which the cisTarget cross-species motifs rankings database files will be written.
Creating cisTarget databases: details
To create cisTarget databases, the following inputs are needed:
- FASTA file with regulatory regions:
  - gene-based
  - region-based
- motifs or TF ChIP-seq tracks:
  - motifs: in Cluster-Buster format
  - tracks: bigWig files of TF ChIP-seq data
Creating cisTarget motif databases
cisTarget motif databases can be created in 2 ways:
- Score all motifs at once and create rankings
- Score motifs in different parts and generate rankings in a separate step
Score all motifs at once and create rankings
Create cisTarget motif databases:
- create_cistarget_motif_databases.py
  - for each motif, score all regulatory regions and create a cisTarget motifs vs regions/genes scores db:
    - *.motifs_vs_regions.scores.feather
    - *.motifs_vs_genes.scores.feather
  - transpose cisTarget motifs vs regions/genes scores db to cisTarget regions/genes vs motifs scores db:
    - *.regions_vs_motifs.scores.feather
    - *.genes_vs_motifs.scores.feather
  - create a ranking for each regulatory region per motif based on the CRM score of the motif for that region and create a cisTarget motifs vs regions/genes rankings db:
    - *.motifs_vs_regions.rankings.feather
    - *.motifs_vs_genes.rankings.feather
  - transpose cisTarget motifs vs regions/genes rankings db to cisTarget regions/genes vs motifs rankings db:
    - *.regions_vs_motifs.rankings.feather
    - *.genes_vs_motifs.rankings.feather
# FASTA file with sequences per region IDs / gene IDs.
fasta_filename=
# Directory with motifs in Cluster-Buster format.
motifs_dir=
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename=
# cisTarget motif database output prefix.
db_prefix=
nbr_threads=22
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"
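Because the motifs list must contain base names of motif files in ${motifs_dir}, a quick check before a long scoring run can save time. The sketch below is not part of this repo; the function name is hypothetical, and the ".cb" file extension for Cluster-Buster motif files is an assumption that should be adjusted to match the files in your motif collection.

```python
import tempfile
from pathlib import Path


def find_missing_motifs(motifs_dir, motifs_list_filename, extension=".cb"):
    """Return motif IDs from the list that have no matching file in motifs_dir.

    ASSUMPTION: motif files are named "<motif_id><extension>"; ".cb" is a guess
    at the Cluster-Buster file extension and may differ for your collection.
    """
    motifs_dir = Path(motifs_dir)
    missing = []
    with open(motifs_list_filename) as fh:
        for line in fh:
            motif_id = line.strip()
            if not motif_id:
                continue  # Skip blank lines.
            if not (motifs_dir / f"{motif_id}{extension}").is_file():
                missing.append(motif_id)
    return missing


if __name__ == "__main__":
    # Self-contained demo with throwaway files; replace with real paths.
    with tempfile.TemporaryDirectory() as tmp:
        demo_dir = Path(tmp, "motifs")
        demo_dir.mkdir()
        (demo_dir / "motif_one.cb").write_text(">motif_one\n")
        demo_list = Path(tmp, "motifs.lst")
        demo_list.write_text("motif_one\nmotif_two\n")
        print(find_missing_motifs(demo_dir, demo_list))
```

An empty result means every listed motif ID has a corresponding file under the assumed naming scheme.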
Score all tracks at once and create rankings
Create cisTarget track databases:
- create_cistarget_track_databases.py
  - for each track, score all regulatory regions and create a cisTarget tracks vs regions/genes scores db:
    - *.tracks_vs_regions.scores.feather
    - *.tracks_vs_genes.scores.feather
  - transpose cisTarget tracks vs regions/genes scores db to cisTarget regions/genes vs tracks scores db:
    - *.regions_vs_tracks.scores.feather
    - *.genes_vs_tracks.scores.feather
  - create a ranking for each regulatory region per track based on the track score of the track for that region and create a cisTarget tracks vs regions/genes rankings db:
    - *.tracks_vs_regions.rankings.feather
    - *.tracks_vs_genes.rankings.feather
  - transpose cisTarget tracks vs regions/genes rankings db to cisTarget regions/genes vs tracks rankings db:
    - *.regions_vs_tracks.rankings.feather
    - *.genes_vs_tracks.rankings.feather
# BED file with regions to score.
regions_bed_filename=
# Directory with bigWig tracks of TF ChIP-seq data.
tracks_dir=
# File with track IDs (base names of bigWig files in ${tracks_dir}).
tracks_list_filename=
# cisTarget track database output prefix.
db_prefix=
nbr_threads=22
"${create_cistarget_databases_dir}/create_cistarget_track_databases.py" \
    -b "${regions_bed_filename}" \
    -T "${tracks_dir}" \
    -d "${tracks_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"
Score motifs in different parts and generate rankings in a separate step
Create cisTarget motif databases:
- create_cistarget_motif_databases.py:
  - score the whole list of motifs in several parts by running create_cistarget_motif_databases.py with the -p ${current_part} ${nbr_total_parts} option, with ${current_part} set from 1 to ${nbr_total_parts}. Each run will create motif scores (for the current subset of motifs) for all regulatory regions and create a partial cisTarget motifs vs regions/genes scores db:
    - *.part_${current_part}_of_${nbr_total_parts}.motifs_vs_regions.scores.feather
    - *.part_${current_part}_of_${nbr_total_parts}.motifs_vs_genes.scores.feather
- combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py:
  - combine partial cisTarget motifs vs regions/genes scores dbs to:
    - a complete cisTarget motifs vs regions/genes scores database:
      - *.motifs_vs_regions.scores.feather
      - *.motifs_vs_genes.scores.feather
    - a complete cisTarget regions/genes vs motifs scores database:
      - *.regions_vs_motifs.scores.feather
      - *.genes_vs_motifs.scores.feather
  - partial cisTarget scores databases can be deleted afterwards.
- convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py:
  - create a ranking for each regulatory region per motif based on the CRM score of the motif for that region and create a cisTarget motifs vs regions/genes rankings db:
    - *.motifs_vs_regions.rankings.feather
    - *.motifs_vs_genes.rankings.feather
  - transpose cisTarget motifs vs regions/genes rankings db to cisTarget regions/genes vs motifs rankings db:
    - *.regions_vs_motifs.rankings.feather
    - *.genes_vs_motifs.rankings.feather
Step 1
Using -p or --partial of create_cistarget_motif_databases.py (see Usage) will divide the motif list in a number of total parts (${nbr_total_parts}) of similar size and score only the part defined by ${current_part}.
This allows creating partial databases on machines which do not have enough RAM to score all motifs in one iteration, and/or running the motif scoring on multiple nodes in parallel (where each node runs the motif scoring with a different value for ${current_part}). This is especially useful when the number of regions/genes is high.
This only creates partial cisTarget motifs vs regions/genes scores database files:
${output_db_prefix}.part_0*${current_part}_of_0*${nbr_total_parts}.motifs_vs_regions.scores.feather
${output_db_prefix}.part_0*${current_part}_of_0*${nbr_total_parts}.motifs_vs_genes.scores.feather
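As an illustration of dividing a list into parts of similar size, the sketch below selects the items belonging to one part, with parts numbered from 1 as in the -p option. This is a generic contiguous split and only an assumption about the idea; the script may partition the motif list differently, so partial databases should always be produced by the script itself. The function name is hypothetical.

```python
def get_part(items: list, current_part: int, nbr_total_parts: int) -> list:
    """Return the items of part current_part (1-based) out of nbr_total_parts.

    Part sizes differ by at most one; the first (len(items) % nbr_total_parts)
    parts get one extra item.
    """
    if not 1 <= current_part <= nbr_total_parts:
        raise ValueError("current_part must be between 1 and nbr_total_parts")
    base_size, extra = divmod(len(items), nbr_total_parts)
    start = (current_part - 1) * base_size + min(current_part - 1, extra)
    end = start + base_size + (1 if current_part <= extra else 0)
    return items[start:end]


motif_ids = [f"motif_{i}" for i in range(10)]
for part in range(1, 4):
    print(part, get_part(motif_ids, part, 3))
```

Every motif lands in exactly one part, which is what makes it safe to run each part on a different node and combine the partial databases afterwards.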
# FASTA file with sequences per region IDs / gene IDs.
fasta_filename=
# Directory with motifs in Cluster-Buster format.
motifs_dir=
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename=
# cisTarget motif database output prefix.
db_prefix=
nbr_threads=22
nbr_total_parts=10
# Create a partial directory, so partial cisTarget database files can be deleted easily afterwards.
mkdir partial
# Each iteration of the for loop (with a different ${current_part}) can also be submitted to a different node
# to speed up the motif scoring.
for current_part in $(seq 1 ${nbr_total_parts}) ; do
    "${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
        -f "${fasta_filename}" \
        -M "${motifs_dir}" \
        -m "${motifs_list_filename}" \
        -p "${current_part}" "${nbr_total_parts}" \
        -o "partial/${db_prefix}" \
        -t "${nbr_threads}"
done
When creating cross-species databases, motif scoring should be done with the regulatory regions after liftover (fasta_filename), while the original regulatory regions FASTA file (original_species_fasta_filename) should also be provided. The latter is only used to make sure that all regions/genes of the original regulatory regions are in the generated cisTarget database, as some regions might get lost after liftover. cisTarget motif score databases need to be generated for each species of interest.
# FASTA file with sequences per region IDs / gene IDs.
fasta_filename=
# FASTA file with sequences per region IDs / gene IDs of the original species.
original_species_fasta_filename=
# Directory with motifs in Cluster-Buster format.
motifs_dir=
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename=
# cisTarget motif database output prefix.
db_prefix=
nbr_threads=22
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -F "${original_species_fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"
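To see how many regions were lost during liftover, the sequence IDs of the two FASTA files can be compared. The following is a minimal sketch, not part of this repo: the function names are hypothetical, and it assumes that the sequence ID is the first whitespace-delimited token of each FASTA header and that IDs are kept unchanged by the liftover workflow.

```python
import tempfile
from pathlib import Path


def read_fasta_ids(fasta_filename) -> list:
    """Return sequence IDs (first token of each '>' header line) in file order."""
    ids = []
    with open(fasta_filename) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.append(fields[0])
    return ids


def regions_lost_in_liftover(original_fasta, lifted_over_fasta) -> list:
    """Return IDs present in the original FASTA but absent after liftover."""
    lifted_ids = set(read_fasta_ids(lifted_over_fasta))
    return [seq_id for seq_id in read_fasta_ids(original_fasta) if seq_id not in lifted_ids]


if __name__ == "__main__":
    # Self-contained demo with throwaway files; replace with real paths.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp, "original.fa")
        lifted = Path(tmp, "lifted.fa")
        original.write_text(">region1\nACGT\n>region2\nGGCC\n>region3\nTTAA\n")
        lifted.write_text(">region1\nACGA\n>region3\nTTAG\n")
        print(regions_lost_in_liftover(original, lifted))
```

A large number of lost regions for a species is worth knowing before scoring, since those regions carry no sequence information for that species in the resulting database.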
Step 2
See Memory requirements for a rough estimate of the amount of memory needed in case you have problems running this step.
When all partial cisTarget motifs vs regions/genes scores database files are created:
${db_prefix}.part_0*_of_0*${nbr_total_parts}.motifs_vs_regions.scores.feather
${db_prefix}.part_0*_of_0*${nbr_total_parts}.motifs_vs_genes.scores.feather
they can be combined with combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py (see Usage) to:
- a complete cisTarget motifs vs regions/genes scores database:
  - *.motifs_vs_regions.scores.feather
  - *.motifs_vs_genes.scores.feather
- a complete cisTarget regions/genes vs motifs scores database:
  - *.regions_vs_motifs.scores.feather
  - *.genes_vs_motifs.scores.feather
Partial cisTarget scores databases can be deleted afterwards.
"${create_cistarget_databases_dir}/combine_partial_motifs_or_tracks_vs_regions_or_genes_scores_cistarget_dbs.py" \
    -i partial/ \
    -o .
# Partial cisTarget databases can be removed.
#rm -r partial
Step 3
See Memory requirements for a rough estimate of the amount of memory needed in case you have problems running this step.
Create rankings from a complete cisTarget motifs vs regions/genes scores database with convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py (see Usage):
- create a ranking for each regulatory region per motif based on the CRM score of the motif for that region and create a cisTarget motifs vs regions/genes rankings db:
  - *.motifs_vs_regions.rankings.feather
  - *.motifs_vs_genes.rankings.feather
- transpose cisTarget motifs vs regions/genes rankings db to cisTarget regions/genes vs motifs rankings db:
  - *.regions_vs_motifs.rankings.feather
  - *.genes_vs_motifs.rankings.feather
# cisTarget motifs vs regions/genes scores database filename:
#   - *.motifs_vs_regions.scores.feather
#   - *.motifs_vs_genes.scores.feather
db_filename=
"${create_cistarget_databases_dir}/convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py" \
    -i "${db_filename}"
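The idea of turning scores into rankings with seeded random tie-breaking can be sketched in a few lines. This is an illustration only, not the repo's implementation: the function name is hypothetical, and it assumes rank 0 is the best (highest) score, which fits the statement that 32768 regions/genes or less can be stored in 2 bytes but is still an assumption.

```python
import random


def rank_scores(scores: list, seed=None) -> list:
    """Rank scores for one motif from best (0) to worst, breaking ties randomly.

    Using the same seed with the same input gives the same rankings.
    """
    rng = random.Random(seed)
    # Sort by descending score; the random number decides the order within ties.
    keyed = sorted((-score, rng.random(), idx) for idx, score in enumerate(scores))
    rankings = [0] * len(scores)
    for rank, (_, _, idx) in enumerate(keyed):
        rankings[idx] = rank
    return rankings


# Scores of one motif for four regions; the middle two are tied.
crm_scores = [0.1, 3.0, 3.0, 7.5]
print(rank_scores(crm_scores, seed=123))
print(rank_scores(crm_scores, seed=123))  # Same seed, same tie order.
```

Random tie-breaking avoids systematically favouring regions that happen to come first in the input, while a fixed seed keeps the output reproducible.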
Create cisTarget cross-species motifs rankings database
To create a cisTarget cross-species motifs rankings database:
- Lift over the regulatory regions of the main species to each species of interest.
- Create cisTarget motif databases for each lifted-over regulatory regions FASTA file, through to rankings (convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py).
- Create cisTarget cross-species motifs rankings from the individual (per species) cisTarget motifs rankings databases with create_cross_species_motifs_rankings_db.py
# cisTarget database prefix which matches the common part of all cisTarget rankings databases
# (or just the directory that contains them).
db_prefix=
# Output directory.
output_dir=
"${create_cistarget_databases_dir}/create_cross_species_motifs_rankings_db.py" \
    -i "${db_prefix}" \
    -o "${output_dir}"