If you’re looking for a quick, serverless, no-code solution to load data into your Cosmos DB database, look no further than Data Factory. With its multitude of connectors for sourcing data and its built-in sink connector for CosmosDB, you can quickly create a data pipeline to ingest, refresh, or keep your Cosmos database in sync.
Read the rest of the article for a quick step-by-step guide on setting this up. One caveat: for now, this only works with a SQL API CosmosDB account.
Create Data Factory
Create and configure secrets in a KeyVault
Add the Connection String and Access Keys for your CosmosDB account and your storage account (link to article where we add that in the ARM template). This allows for a secure connection between your Data Factory and your CosmosDB account.
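As a sketch of what this looks like in an ARM template, a secret can be declared as a child resource of the vault; the vault name, secret name, and connection string here are illustrative placeholders:

```json
{
  "type": "Microsoft.KeyVault/vaults/secrets",
  "apiVersion": "2019-09-01",
  "name": "my-vault/CosmosDbConnectionString",
  "properties": {
    "value": "AccountEndpoint=https://my-account.documents.azure.com:443/;AccountKey=<access-key>;Database=my-database"
  }
}
```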
Create Data Factory Connection to KeyVault
In order to be able to use the secrets from the KeyVault, we need to create a connection to it.
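Under the hood, that connection is just an Azure Key Vault linked service. A minimal definition in Data Factory's JSON authoring format looks roughly like this (the names are placeholders):

```json
{
  "name": "AzureKeyVaultLinkedService",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://my-vault.vault.azure.net"
    }
  }
}
```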
In order for Data Factory to be able to access the KeyVault, its identity must be added to the KeyVault Access Policies; otherwise, you will get an error when trying to use secrets from it:
Error When Data Factory identity does not have KeyVault access
To obtain the Data Factory identity, you can check the info box on the Key Vault connection you just created in the previous step.
Get Data Factory Identity for assigning permissions in the Key Vault
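With the identity in hand, the access policy entry on the Key Vault only needs read access to secrets. As an ARM-template sketch, with the tenant and object IDs as placeholders for your own values:

```json
{
  "tenantId": "<tenant-id>",
  "objectId": "<data-factory-managed-identity-object-id>",
  "permissions": {
    "secrets": [ "get", "list" ]
  }
}
```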
Alternatively, you can use a manual connection string and only fetch the Access Key from the KeyVault.
Create connections to Cosmos DB
Once the KeyVault has been configured, it’s time to add the CosmosDB connection, following the same steps as for the Key Vault connection. There are a couple of ways to configure the connection: either use the full connection string from the KeyVault (in which case your connection string must include the Database), or use a manual connection string where only the Access Key is fetched from the Key Vault and the Account URL and Database are entered manually.
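For reference, the first variant (full connection string pulled from the Key Vault) corresponds to a linked service definition along these lines; the linked service and secret names are assumptions, so match them to your own setup:

```json
{
  "name": "CosmosDbLinkedService",
  "properties": {
    "type": "CosmosDb",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "CosmosDbConnectionString"
      }
    }
  }
}
```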
Create any additional Connections for your data sources
You will probably need extra connections to your data sources, so configure them in the same way you did the CosmosDB one. In my case, I am using a Storage Account.
Publish
Publishing is the Data Factory equivalent of a Save operation; it ensures that the changes you have made so far are stored in the Data Factory configuration.
Configure datasets
With the connections to your sources configured, it is now time to configure the data sets.
Add a JSON Dataset
For my source of data, I am using a JSON file containing the list of items I want to insert. Creating a new file-based dataset is simple: select the format of the file, point it to the right connection (a storage account connection in my case), and provide a path to the file, or a filename match pattern in case there is more than one file you want to process.
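As a sketch, the resulting dataset definition looks something like this; the container, file name, and linked service name are illustrative:

```json
{
  "name": "SourceItemsJson",
  "properties": {
    "type": "Json",
    "linkedServiceName": {
      "referenceName": "StorageAccountLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "imports",
        "fileName": "items.json"
      }
    }
  }
}
```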
Add a CosmosDB dataset
Datasets are used both for sources of data and for destinations. Configuring a CosmosDB dataset creates a sink for the data to be copied into. The process is similar to any other dataset: just pick the connection and the collection.
Configuring a CosmosDB Dataset
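A minimal CosmosDB (SQL API) dataset definition, with placeholder names for the collection and linked service, would look roughly like:

```json
{
  "name": "CosmosDbSinkDataset",
  "properties": {
    "type": "CosmosDbSqlApiCollection",
    "linkedServiceName": {
      "referenceName": "CosmosDbLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "collectionName": "Items"
    }
  }
}
```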
Create the copy pipeline
Once the Connections and Datasets are created, the last thing to do is to create the copy job.
Use a simple Move and Transform -> Copy Data job for that. The source will be one of the input sources, and the sink is the CosmosDB dataset. For best performance, ensure that your source data matches exactly what you need to get into CosmosDB so that there is no mapping to be done in between.
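Put together, the Copy Data activity wiring a JSON source dataset into the CosmosDB sink can be sketched as follows (all names here are placeholders):

```json
{
  "name": "CopyJsonToCosmosDb",
  "properties": {
    "activities": [
      {
        "name": "CopyItems",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceItemsJson", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "CosmosDbSinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "JsonSource" },
          "sink": { "type": "CosmosDbSqlApiSink", "writeBehavior": "insert" }
        }
      }
    ]
  }
}
```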
Stats
Since Data Factory uses the CosmosDB Bulk Executor library behind the scenes, the insert performance is more than decent. Here are some stats for a collection provisioned with 1000 RUs:
4000 items (~ 1KB / doc), unique Partition Keys - total processing time: 27 seconds
100000 items (~0.3KB / doc), distributed across 4k Partition Keys - total processing time: 600 seconds
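As a rough sanity check on these numbers: a ~1 KB single-document write in Cosmos DB costs on the order of 5–6 RU, so 1000 RU/s should sustain somewhere in the range of 170–200 inserts per second, which lines up with the observed throughput:

```python
# Observed runs from the stats above: (item count, total seconds).
runs = {
    "4000 x ~1KB docs": (4000, 27),
    "100000 x ~0.3KB docs": (100_000, 600),
}

for name, (items, seconds) in runs.items():
    # Effective throughput the pipeline achieved against 1000 provisioned RUs.
    print(f"{name}: {items / seconds:.0f} items/s")
```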
Integration
A Data Factory pipeline is very easy to integrate into your existing workflows. Some of the best methods I’ve used are the Logic App connector, which lets you call your pipeline from an Azure Logic App job, or hooking it directly to an Azure Storage blob-creation event via the built-in triggers.
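The blob-creation hookup, for instance, is a blob events trigger pointing at the copy pipeline. A sketch of its definition, where the scope, path, and names are placeholders for your own resources:

```json
{
  "name": "OnBlobCreated",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/imports/blobs/",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyJsonToCosmosDb", "type": "PipelineReference" } }
    ]
  }
}
```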
Conclusion
There’s a lot of flexibility and capability in Azure Data Factory that was not covered in this article, and it is worth exploring if you need more complex scenarios. For more details, I recommend checking out these more in-depth articles: Migrating Relational Data into Azure Cosmos DB using Azure Data Factory and Azure Databricks, and Copy and transform data in Azure Cosmos DB (SQL API) by using Azure Data Factory.