Dataproc optional Jupyter component

You can install additional components like Jupyter when you create a Dataproc cluster using the Optional components feature. This page describes the Jupyter component.

The Jupyter component is a Web-based single-user notebook for interactive data analytics and supports the JupyterLab Web UI. The Jupyter Web UI is available on port 8123 on the cluster's first master node.

Launch notebooks for multiple users. You can create a Dataproc-enabled Vertex AI Workbench instance or install the Dataproc JupyterLab plugin on a VM to serve notebooks to multiple users.

Configure Jupyter. You can configure Jupyter by providing dataproc:jupyter cluster properties. To reduce the risk of remote code execution over unsecured notebook server APIs, the default dataproc:jupyter.listen.all.interfaces cluster property setting is false, which restricts connections to localhost (127.0.0.1) when Component Gateway is enabled (Component Gateway activation is required when you install the Jupyter component).

The Jupyter notebook provides a Python kernel to run Spark code, and a PySpark kernel. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. The location can be changed at cluster creation time using the dataproc:jupyter.notebook.gcs.dir cluster property.
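For example, the following command is a minimal sketch that points saved notebooks at a custom Cloud Storage location when the cluster is created; the cluster name, region, and gs://my-bucket/notebooks/jupyter path are placeholders:

gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --region=region \
    --properties="dataproc:jupyter.notebook.gcs.dir=gs://my-bucket/notebooks/jupyter"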

Work with data files. You can use a Jupyter notebook to work with data files that have been uploaded to Cloud Storage. Since the Cloud Storage connector is pre-installed on a Dataproc cluster, you can reference the files directly in your notebook. Here's an example that accesses CSV files in Cloud Storage:

df = spark.read.csv("gs://bucket/path/file.csv")
df.show()

See Generic Load and Save Functions for PySpark examples.
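If your CSV files include a header row, or you want to write results back to Cloud Storage, the following is a minimal sketch along the same lines; the bucket paths and the some_column column name are placeholders:

# Read a headered CSV file from Cloud Storage into a DataFrame.
df = spark.read.option("header", "true").csv("gs://bucket/path/file.csv")

# Filter out rows with a missing value, then write the result back
# to Cloud Storage in Parquet format.
df.filter(df["some_column"].isNotNull()) \
    .write.mode("overwrite") \
    .parquet("gs://bucket/path/output/")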

Install Jupyter

Install the component when you create a Dataproc cluster. The Jupyter component requires activation of the Dataproc Component Gateway.

Console

  1. Enable the component.
    • In the Google Cloud console, open the Dataproc Create a cluster page. The Set up cluster panel is selected.
    • In the Components section:
      * Under Optional components, select the Jupyter component.
      * Under Component Gateway, select Enable component gateway (see Viewing and Accessing Component Gateway URLs).

gcloud CLI

To create a Dataproc cluster that includes the Jupyter component, use the gcloud dataproc clusters create cluster-name command with the --optional-components flag.

Latest default image version example

The following example installs the Jupyter component on a cluster that uses the latest default image version.

gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --region=region \
    --enable-component-gateway \
    ... other flags

REST API

The Jupyter component can be installed through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
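For illustration, a minimal sketch of the relevant portion of a clusters.create request; project-id, region, and my-cluster are placeholders, and the field names follow the v1 API's SoftwareConfig and EndpointConfig messages:

POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "my-cluster",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["JUPYTER"]
    },
    "endpointConfig": {
      "enableHttpPortAccess": true
    }
  }
}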

Open the Jupyter and JupyterLab UIs

Click the Component Gateway links in the Google Cloud console to open the Jupyter notebook or JupyterLab UI running on the cluster master node in your local browser.

Select "GCS" or "Local Disk" to create a new Jupyter Notebook in either location.

Attach GPUs to master and worker nodes

You can add GPUs to your cluster's master and worker nodes (see the example command after this list) when using a Jupyter notebook to:

  1. Preprocess data in Spark, then collect a DataFrame onto the master and run TensorFlow
  2. Use Spark to orchestrate TensorFlow runs in parallel
  3. Run TensorFlow-on-YARN
  4. Support other machine learning scenarios that use GPUs
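The following command is a minimal sketch of attaching GPUs at cluster creation time; the nvidia-tesla-t4 accelerator type and count of 1 are placeholder values, and installing the GPU drivers (for example, through an initialization action) is assumed to be handled separately:

gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --region=region \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --worker-accelerator type=nvidia-tesla-t4,count=1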