
5 Genuinely Useful Bash Scripts for Data Science
Image by author

Python, R, and SQL are often cited as the most-used languages for processing, modeling, and exploring data. While that may be true, there is no reason other languages can't be used (and aren't already being used) for this work.

Bash is a shell for Unix and Unix-like operating systems, along with the commands and scripting language that accompany it. Bash scripts are programs written in this scripting language. They are executed sequentially by the Bash interpreter and can include all of the constructs typically found in other programming languages, including conditional statements, loops, and variables.
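As a quick illustrative sketch (the threshold and values here are invented for the example), here is a tiny script that uses all three of those constructs:

```shell
#!/bin/bash

threshold=3                          # a variable

classify() {                         # a conditional, wrapped in a function
    if [ "$1" -gt "$threshold" ]; then
        echo "above"
    else
        echo "at or below"
    fi
}

for i in 1 2 3 4 5; do               # a loop
    echo "$i is $(classify "$i") the threshold"
done
```

Everything in the five scripts below is built from these same basic pieces, plus calls out to standard command-line tools.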

Common Bash script uses include automating repetitive tasks, manipulating files and text, chaining programs together, and scheduling jobs.

Bash scripting is also used to orchestrate the deployment and management of complex distributed systems, making it an incredibly useful skill in the arenas of data engineering, cloud computing environments, and DevOps.

In this article, we will look at five scripting-friendly, data science-related tasks, and see along the way how flexible and useful Bash can be.

Clean and Format Raw Data

Here is an example Bash script for cleaning and formatting raw data files:

#!/bin/bash

# Set the input and output file paths
input_file="raw_data.csv"
output_file="clean_data.csv"

# Remove any leading or trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//' "$input_file" > "$output_file"

# Replace the quoted field separators ("," between fields) with a placeholder
# (assumes the | character does not appear anywhere in the data)
sed -i 's/","/|/g' "$output_file"

# Remove the remaining quote characters at the start and end of each line
sed -i 's/"//g' "$output_file"

# Restore the placeholder to a plain comma separator
# (note: literal commas inside fields become indistinguishable from separators)
sed -i 's/|/,/g' "$output_file"

echo "Data cleaning and formatting complete. Output file: $output_file"

This script:

- Trims leading and trailing whitespace from every line
- Swaps the quoted field separators for a temporary placeholder
- Strips the remaining quote characters
- Restores plain commas as field separators

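To see the whitespace-trimming step on its own, here is a tiny self-contained demo; the sample rows are invented purely for illustration (this uses GNU sed's `\t` support inside bracket expressions):

```shell
#!/bin/bash

# Write a small sample file with messy leading/trailing whitespace
printf '  alice , 1  \n\tbob,2\t\n' > raw_data.csv

# Trim leading and trailing spaces/tabs from each line
sed 's/^[ \t]*//;s/[ \t]*$//' raw_data.csv > clean_data.csv

cat clean_data.csv
```

The cleaned file keeps interior whitespace intact; only the edges of each line are touched.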
Automate Data Visualization

Here is an example Bash script for automating data visualization tasks:

#!/bin/bash

# Set the input file path
input_file="data.csv"

# Create a line chart of column 1 vs column 2
gnuplot -e "set datafile separator ','; set term png; set output 'line_chart.png'; plot '$input_file' using 1:2 with lines"

# Create a bar chart of column 3
gnuplot -e "set datafile separator ','; set term png; set output 'bar_chart.png'; plot '$input_file' using 3:xtic(1) with boxes"

# Create a scatter plot of column 4 vs column 5
gnuplot -e "set datafile separator ','; set term png; set output 'scatter_plot.png'; plot '$input_file' using 4:5 with points"

echo "Data visualization complete. Output files: line_chart.png, bar_chart.png, scatter_plot.png"

The above script:

- Creates a line chart of column 1 vs. column 2
- Creates a bar chart of column 3, labeled along the x-axis with column 1
- Creates a scatter plot of column 4 vs. column 5

All three charts are rendered to PNG files with gnuplot.

Note that for this script to work with your data, you will need to adjust the column numbers and chart types to match it.

Statistical Analysis

Here is an example Bash script for performing statistical analysis on a dataset:

#!/bin/bash

# Set the input file path
input_file="data.csv"

# Set the output file path
output_file="statistics.txt"

# Use awk to calculate the mean of column 1
mean=$(awk -F',' '{sum+=$1} END {print sum/NR}' "$input_file")

# Use awk to calculate the standard deviation of column 1
stddev=$(awk -F',' '{sum+=$1; sumsq+=$1*$1} END {print sqrt(sumsq/NR - (sum/NR)^2)}' "$input_file")

# Write the results to the output file (> creates/overwrites it on each run)
echo "Mean of column 1: $mean" > "$output_file"
echo "Standard deviation of column 1: $stddev" >> "$output_file"

# Use awk to calculate the mean of column 2
mean=$(awk -F',' '{sum+=$2} END {print sum/NR}' "$input_file")

# Use awk to calculate the standard deviation of column 2
stddev=$(awk -F',' '{sum+=$2; sumsq+=$2*$2} END {print sqrt(sumsq/NR - (sum/NR)^2)}' "$input_file")

# Append the results to the output file
echo "Mean of column 2: $mean" >> "$output_file"
echo "Standard deviation of column 2: $stddev" >> "$output_file"

echo "Statistical analysis complete. Output file: $output_file"

This script:

- Uses awk to compute the mean of column 1 (the sum divided by the row count)
- Uses awk to compute the standard deviation of column 1
- Repeats both calculations for column 2
- Writes all four results to statistics.txt

Note that you can add more awk commands to calculate other statistical values or for more columns.
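As one hedged sketch of that idea, here is how the minimum and maximum of column 1 could be computed in a single awk pass (the sample file is written inline purely for demonstration; in practice data.csv would already exist):

```shell
#!/bin/bash

input_file="data.csv"

# For demonstration only: write a tiny sample file
printf '3,10\n1,20\n7,5\n' > "$input_file"

# Track the running minimum and maximum of column 1
read -r min max <<< "$(awk -F',' '
    NR == 1 { min = $1; max = $1 }
    $1 < min { min = $1 }
    $1 > max { max = $1 }
    END { print min, max }' "$input_file")"

echo "Min of column 1: $min"    # 1
echo "Max of column 1: $max"    # 7
```

The same pattern extends naturally to other columns or to additional running statistics.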

Manage Python Package Dependencies

Here is an example Bash script for managing and updating the dependencies and packages required for data science projects:

#!/bin/bash

# Set the path of the virtual environment
venv_path="venv"

# Activate the virtual environment
source $venv_path/bin/activate

# Update pip
pip install --upgrade pip

# Install required packages from requirements.txt
pip install -r requirements.txt

# Deactivate the virtual environment
deactivate

echo "Dependency and package management complete."

This script:

- Activates the project's virtual environment
- Upgrades pip to the latest version
- Installs the packages listed in requirements.txt
- Deactivates the virtual environment when finished

Run this script whenever you want to update your dependencies or install new packages for a data science project.
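One refinement worth noting: if the virtual environment might not exist yet, the script can create it on first run. A minimal sketch, assuming `python3` is on your PATH:

```shell
#!/bin/bash

venv_path="venv"

# Create the virtual environment on first run; reuse it afterwards,
# so repeated runs of the dependency script are safe
if [ ! -d "$venv_path" ]; then
    python3 -m venv "$venv_path"
    echo "Created new virtual environment at $venv_path"
else
    echo "Reusing existing virtual environment at $venv_path"
fi
```

With this guard in place, the activate/install/deactivate steps above can follow unchanged.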

Manage Jupyter Notebook Execution

Here is an example Bash script for automating the execution of a Jupyter Notebook or other interactive data science environment:

#!/bin/bash

# Set the path of the notebook file
notebook_file="analysis.ipynb"

# Set the path of the virtual environment
venv_path="venv"

# Activate the virtual environment
source $venv_path/bin/activate

# Start Jupyter Notebook (this blocks until the server is shut down)
jupyter notebook "$notebook_file"

# Deactivate the virtual environment
deactivate

echo "Jupyter Notebook execution complete."

The above script:

- Activates the project's virtual environment
- Launches Jupyter Notebook with the specified notebook open
- Deactivates the virtual environment once the notebook server is shut down

Run this script whenever you want to open a Jupyter Notebook, or another interactive data science environment, from within its project's virtual environment.
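For fully unattended runs, where no browser should open at all, one alternative sketch (assuming jupyter and its nbconvert component are installed) executes the notebook headlessly and saves the executed copy:

```shell
#!/bin/bash

notebook_file="analysis.ipynb"

# Only attempt execution if the notebook actually exists
if [ -f "$notebook_file" ]; then
    # Run every cell top to bottom without opening a browser,
    # writing the executed result to a separate file
    jupyter nbconvert --to notebook --execute \
        --output "executed_$(basename "$notebook_file")" "$notebook_file"
fi
```

This variant is better suited to cron jobs or CI pipelines, where the interactive server in the script above would simply hang.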

I hope these simple scripts were enough to show you the simplicity and power of scripting with Bash. It might not be your go-to solution for every situation, but it certainly has its place. Best of luck in your scripting.

Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.