Copy directly from Azure Blob Storage to Azure Data Lake

A short snippet for copying files directly from Azure Blob Storage to Azure Data Lake.

Neither the Azure Data Lake client nor the Azure Storage client supports copying files directly between the two systems, even when both are used in parallel. However, by combining the blob client's ability to write a blob to a stream with the Data Lake client's open-file function, we can copy a blob without downloading it to local storage first.

First, get the blob client:

from azure.storage.blob import BlockBlobService

# The storage account credentials are assumed to live in a secrets dict.
container = "mycontainer"
account_name = secrets["blob_account_name"]
account_key = secrets["blob_key"]
block_blob_service = BlockBlobService(account_name, account_key)
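
As an optional sanity check (not needed for the copy itself), you can list the blobs in the container to confirm the client is authenticated:

# Optional: verify the connection by listing the container's blobs.
for blob in block_blob_service.list_blobs(container):
    print(blob.name)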

Then, get the data lake client:

from azure.datalake.store import core, lib

# The service principal credentials are assumed to live in config.secrets.
tenant = config.secrets["tenant"]
resource = config.secrets["datalakeresource"]
client_id = config.secrets["datalakeclientid"]
client_secret = config.secrets["datalakeclientsecret"]

credentials = lib.auth(tenant_id=tenant,
                       client_secret=client_secret,
                       client_id=client_id,
                       resource=resource)

adlfsClient = core.AzureDLFileSystem(credentials, store_name=config.secrets["datalakestorename"])
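
Again, an optional sanity check: listing the root of the lake confirms the authentication worked:

# Optional: verify the connection by listing the lake's root directory.
print(adlfsClient.ls('/'))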

Now, we can create the following useful function:

def copy_blob_to_lake(adlfsClient, block_blob_service, container_name, blob_name, lake_path):
    # Stream the blob straight into a file in the lake. max_connections=1 forces a
    # sequential download, which is required since the lake file stream is not seekable.
    with adlfsClient.open(lake_path, 'wb') as stream:
        block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=blob_name,
                                              stream=stream, max_connections=1)
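
With that in place, a copy is a one-liner. Here is a hypothetical usage example; the blob name and lake path are made up for illustration:

# Copy myblob.csv from the container into /staging in the lake (example names).
copy_blob_to_lake(adlfsClient, block_blob_service, container, "myblob.csv", "/staging/myblob.csv")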
