Azure Cloud
New in version 21.04.0.
Azure Blob Storage
Nextflow has built-in support for Azure Blob Storage. Files stored in an Azure blob container can be accessed transparently in your pipeline script like any other file in the local file system.
The Blob storage account name and key need to be provided in the Nextflow configuration file as shown below:
azure {
storage {
accountName = "<YOUR BLOB ACCOUNT NAME>"
accountKey = "<YOUR BLOB ACCOUNT KEY>"
}
}
Alternatively, the Shared Access Token can be specified with the sasToken
option instead of accountKey
.
Tip
When creating the Shared Access Token, make sure to allow the resource types Container
and Object
and allow the permissions: Read
, Write
, Delete
, List
, Add
, Create
.
Tip
The value of sasToken
is the token stripped by the character ?
from the beginning of the token.
Once the Blob Storage credentials are set, you can access the files in the blob container like local files by prepending the file path with az://
followed by the container name. For example, a blob container named my-data
with a file named foo.txt
can be specified in your Nextflow script as az://my-data/foo.txt
.
Azure Batch
Tip
This section describes how to manually set up and use Nextflow with Azure Batch. You may be interested in using Batch Forge in Seqera Platform, which automatically creates the required Azure infrastructure for you with minimal intervention.
Azure Batch is a managed computing service that allows the execution of containerised workloads in the Azure cloud infrastructure.
Nextflow provides built-in support for Azure Batch, allowing the seamless deployment of Nextflow pipelines in the cloud, in which tasks are offloaded as Batch jobs.
Read the Azure Batch executor section to learn more about the azurebatch
executor in Nextflow.
Get started
Create a Batch account in the Azure portal. Take note of the account name and key.
Make sure to adjust your quotas to the pipeline’s needs. There are limits on certain resources associated with the Batch account. Many of these limits are default quotas applied by Azure at the subscription or account level. Quotas impact the number of Pools, CPUs and Jobs you can create at any given time.
Create a Storage account and, within that, an Azure Blob Container in the same location where the Batch account was created. Take note of the account name and key.
If you plan to use Azure Files, create an Azure File share within the same Storage account and upload your input data.
Associate the Storage account with the Azure Batch account.
Make sure every process in your pipeline specifies one or more Docker containers with the container directive.
Make sure all of your container images are published in a Docker registry that can be accessed by your Azure Batch environment, such as Docker Hub, Quay, or Azure Container Registry .
A minimal Nextflow configuration for Azure Batch looks like the following snippet:
process {
executor = 'azurebatch'
}
azure {
storage {
accountName = "<YOUR STORAGE ACCOUNT NAME>"
accountKey = "<YOUR STORAGE ACCOUNT KEY>"
}
batch {
location = '<YOUR LOCATION>'
accountName = '<YOUR BATCH ACCOUNT NAME>'
accountKey = '<YOUR BATCH ACCOUNT KEY>'
autoPoolMode = true
}
}
In the above example, replace the location placeholder with the name of your Azure region and the account placeholders with the values corresponding to your configuration.
Tip
The list of Azure regions can be found by executing the following Azure CLI command:
az account list-locations -o table
Finally, launch your pipeline with the above configuration:
nextflow run <PIPELINE NAME> -w az://YOUR-CONTAINER/work
Replacing <PIPELINE NAME>
with a pipeline name e.g. nextflow-io/rnaseq-nf
and YOUR-CONTAINER
with a blob container in the storage account defined in the above configuration.
See the Batch documentation for further details about the configuration for Azure Batch.
Pools configuration
When using the autoPoolMode
option, Nextflow automatically creates a pool
of compute nodes appropriate for your pipeline.
By default, the cpus
and memory
directives are used to find the smallest machine type that fits the requested resources in the Azure machine family, specified by machineType
. If memory
is not specified, 1 GB of memory is allocated per CPU. When no options are specified, it only uses one compute node of the type Standard_D4_v3
.
To specify multiple Azure machine families, use a comma separated list with glob (*
) values in the machineType
directive. For example, the following will select any machine size from D or E v5 machines, with additional data disk, denoted by the d
suffix:
process.machineType = "Standard_D*d_v5,Standard_E*d_v5"
For example, the following process will create a pool of Standard_E4d_v5
machines based when using autoPoolMode
:
process EXAMPLE_PROCESS {
machineType "Standard_E*d_v5"
cpus 16
memory 8.GB
script:
"""
echo "cpus: ${task.cpus}"
"""
}
Note when creating tasks that use fewer than 4 CPUs, Nextflow will create a pool with machines that have 4 times the number of CPUs required in order to pack more tasks onto each machine. This means the pipeline spends less time waiting for machines to be created, startup and join the Azure Batch pool. Similarly, if a process requires fewer than 8 CPUs Nextflow will use a machine with double the number of CPUs required. If you wish to override this behaviour you can use a specific machineType
directive, e.g. using a machineType
directive of Standard_E2d_v5
will use always use a Standard_E2d_v5 machine.
The pool is not removed when the pipeline terminates, unless the configuration setting deletePoolsOnCompletion = true
is added in your Nextflow configuration file.
Pool specific settings should be provided in the auto
pool configuration scope. If you wish to specify a single machine size for all processes, you can specify a fixed vmSize
for the auto
pool.
azure {
batch {
pools {
auto {
vmType = 'Standard_D2_v2'
}
}
}
}
Warning
To avoid any extra charges in the Batch account, remember to clean up the Batch pools or use auto scaling.
Warning
Make sure your Batch account has enough resources to satisfy the pipeline’s requirements and the pool configuration.
Warning
Nextflow uses the same pool ID across pipeline executions, if the pool features have not changed. Therefore, when using deletePoolsOnCompletion = true
, make sure the pool is completely removed from the Azure Batch account before re-running the pipeline. The following message is returned when the pool is still shutting down:
Error executing process > '<process name> (1)'
Caused by:
Azure Batch pool '<pool name>' not in active state
Named pools
If you want to have more precise control over the compute node pools used in your pipeline, such as using a different pool depending on the task in your pipeline, you can use the queue directive in Nextflow to specify the ID of a Azure Batch compute pool that should be used to execute that process.
The pool is expected to be already available in the Batch environment, unless the setting allowPoolCreation = true
is provided in the azure.batch
config scope in the pipeline configuration file. In the latter case, Nextflow will create the pools on-demand.
The configuration details for each pool can be specified using a snippet as shown below:
azure {
batch {
pools {
foo {
vmType = 'Standard_D2_v2'
vmCount = 10
}
bar {
vmType = 'Standard_E2_v3'
vmCount = 5
}
}
}
}
The above example defines the configuration for two node pools. The first will provision 10 compute nodes of type Standard_D2_v2
, the second 5 nodes of type Standard_E2_v3
. See the Azure configuration section for the complete list of available configuration options.
Warning
The pool name can only contain alphanumeric, hyphen and underscore characters.
Warning
If the pool name includes a hyphen, make sure to wrap it with single quotes. For example:
azure {
batch {
pools {
'foo-2' {
...
}
}
}
}
Requirements on pre-existing named pools
When Nextflow is configured to use a pool already available in the Batch account, the target pool must satisfy the following requirements:
The pool must be declared as
dockerCompatible
(Container Type
property).The task slots per node must match the number of cores for the selected VM. Otherwise, Nextflow will return an error like “Azure Batch pool ‘ID’ slots per node does not match the VM num cores (slots: N, cores: Y)”.
Unless you are using Fusion, all tasks must have AzCopy available in the path. If
azure.batch.copyToolInstallMode = 'node'
this will require every node to have the azcopy binary located at$AZ_BATCH_NODE_SHARED_DIR/bin/
.
Pool autoscaling
Azure Batch can automatically scale pools based on parameters that you define, saving you time and money. With automatic scaling, Batch dynamically adds nodes to a pool as task demands increase, and removes compute nodes as task demands decrease.
To enable this feature for pools created by Nextflow, add the option autoScale = true
to the corresponding pool configuration scope. For example, when using the autoPoolMode
, the setting looks like:
azure {
batch {
pools {
auto {
autoScale = true
vmType = 'Standard_D2_v2'
vmCount = 5
maxVmCount = 50
}
}
}
}
Nextflow uses the formula shown below to determine the number of VMs to be provisioned in the pool:
// Get pool lifetime since creation.
lifespan = time() - time("{{poolCreationTime}}");
interval = TimeInterval_Minute * {{scaleInterval}};
// Compute the target nodes based on pending tasks.
// $PendingTasks == The sum of $ActiveTasks and $RunningTasks
$samples = $PendingTasks.GetSamplePercent(interval);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max( $PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes/2);
targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));
// For first interval deploy 1 node, for other intervals scale up/down as per tasks.
$TargetDedicatedNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
$NodeDeallocationOption = taskcompletion;
The above formula initialises a pool with the number of VMs specified by the vmCount
option, and scales up the pool on-demand, based on the number of pending tasks, up to maxVmCount
nodes. If no jobs are submitted for execution, it scales down to zero nodes automatically.
If you need a different strategy, you can provide your own formula using the scaleFormula
option. See the Azure Batch documentation for details.
Pool nodes
When Nextflow creates a pool of compute nodes, it selects:
the virtual machine image reference to be installed on the node
the Batch node agent SKU, a program that runs on each node and provides an interface between the node and the Batch service
Together, these settings determine the Operating System and version installed on each node.
By default, Nextflow creates pool nodes based on CentOS 8, but this behavior can be customised in the pool configuration. Below are configurations for image reference/SKU combinations to select two popular systems.
Ubuntu 20.04 (default):
azure.batch.pools.<name>.sku = "batch.node.ubuntu 20.04" azure.batch.pools.<name>.offer = "ubuntu-server-container" azure.batch.pools.<name>.publisher = "microsoft-azure-batch"
CentOS 8:
azure.batch.pools.<name>.sku = "batch.node.centos 8" azure.batch.pools.<name>.offer = "centos-container" azure.batch.pools.<name>.publisher = "microsoft-azure-batch"
In the above snippet, replace <name>
with the name of your Azure node pool.
See the Azure configuration section and the Azure Batch nodes documentation for more details.
Private container registry
New in version 21.05.0-edge.
A private container registry for Docker images can be specified as follows:
azure {
registry {
server = '<YOUR REGISTRY SERVER>' // e.g.: docker.io, quay.io, <ACCOUNT>.azurecr.io, etc.
userName = '<YOUR REGISTRY USER NAME>'
password = '<YOUR REGISTRY PASSWORD>'
}
}
The private registry is an addition, not a replacement, to the existing configuration. Public images from other registries will still be pulled as normal, if they are requested.
Note
When using containers hosted in a private registry, the registry name must also be provided in the container name specified via the container directive using the format: [server]/[your-organization]/[your-image]:[tag]
. Read more about fully qualified image names in the Docker documentation.
Virtual Network
New in version 23.03.0-edge.
Sometimes it might be useful to create a pool in an existing Virtual Network. To do so, the
virtualNetwork
option can be added to the pool settings as follows:
azure {
batch {
pools {
auto {
autoScale = true
vmType = 'Standard_D2_v2'
vmCount = 5
virtualNetwork = '<YOUR SUBNET ID>'
}
}
}
}
The value of the setting must be the identifier of a subnet available in the virtual network to join. A valid subnet ID has the following form:
/subscriptions/<YOUR SUBSCRIPTION ID>/resourceGroups/<YOUR RESOURCE GROUP NAME>/providers/Microsoft.Network/virtualNetworks/<YOUR VIRTUAL NETWORK NAME>/subnets/<YOUR SUBNET NAME>
Warning
Batch Authentication with Shared Keys does not allow to link external resources (like Virtual Networks) to the pool. Therefore, Active Directory Authentication must be used in conjunction with the virtualNetwork
setting.
Microsoft Entra (formerly Active Directory Authentication)
New in version 22.11.0-edge.
Service Principal credentials can optionally be used instead of Shared Keys for Azure Batch and Storage accounts.
The Service Principal should have the at least the following role assignments:
Contributor
Storage Blob Data Reader
Storage Blob Data Contributor
Note
To assign the necessary roles to the Service Principal, refer to the official Azure documentation.
The credentials for Service Principal can be specified as follows:
azure {
activeDirectory {
servicePrincipalId = '<YOUR SERVICE PRINCIPAL CLIENT ID>'
servicePrincipalSecret = '<YOUR SERVICE PRINCIPAL CLIENT SECRET>'
tenantId = '<YOUR TENANT ID>'
}
storage {
accountName = '<YOUR STORAGE ACCOUNT NAME>'
}
batch {
accountName = '<YOUR BATCH ACCOUNT NAME>'
location = '<YOUR BATCH ACCOUNT LOCATION>'
}
}
Managed identities
New in version 24.05.0-edge.
An Azure managed identity can be used to authenticate with Azure Blob storage without using secrets such as a storage account key.
The managed identity can be specified as follows:
azure {
managedIdentity {
clientId = '<YOUR MANAGED IDENTITY>'
tenantId = '<YOUR TENANT ID>'
}
storage {
accountName = '<YOUR STORAGE ACCOUNT NAME>'
}
batch {
accountName = '<YOUR BATCH ACCOUNT NAME>'
location = '<YOUR BATCH ACCOUNT LOCATION>'
}
}
Alternatively, you can set azure.managedIdentity.system = true
to authenticate using the system-assigned identity.
Advanced configuration
Read the Azure configuration section to learn more about advanced configuration options.