If you’re looking for a quick, serverless, no-code solution to load data into your Cosmos DB database, look no further than Data Factory. With its multitude of connectors for sourcing data and its built-in sink connector for CosmosDB, you can quickly create a data pipeline to ingest, refresh, or keep your Cosmos database in sync.
Read the rest of the article for a quick step-by-step guide on setting this up. One caveat: for now, this only works with a SQL API CosmosDB account.
Create Data Factory
Create and configure secrets in a KeyVault
Add the Connection String and Access Keys for your CosmosDB account and your storage account (link to article where we add that in the ARM template). This allows for a secure connection between your Data Factory and your CosmosDB account.
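As a sketch of what this looks like in an ARM template, a secret can be declared as a child resource of the vault; the vault name, secret name, and connection string here are illustrative placeholders:

```json
{
  "type": "Microsoft.KeyVault/vaults/secrets",
  "apiVersion": "2019-09-01",
  "name": "my-vault/CosmosDbConnectionString",
  "properties": {
    "value": "AccountEndpoint=https://my-account.documents.azure.com:443/;AccountKey=<access-key>;Database=my-database"
  }
}
```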
Create Data Factory Connection to KeyVault
In order to be able to use the secrets from the KeyVault, we need to create a connection to it.
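Under the hood, that connection is just an Azure Key Vault linked service. A minimal definition in Data Factory's JSON authoring format looks roughly like this (the names are placeholders):

```json
{
  "name": "AzureKeyVaultLinkedService",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://my-vault.vault.azure.net"
    }
  }
}
```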
In order for Data Factory to be able to access the KeyVault, its identity must be added to the KeyVault Access Policies; otherwise, you will get an error when trying to use secrets from it:
Error When Data Factory identity does not have KeyVault access
To obtain the Data Factory identity, you can check the info box on the Key Vault connection you just created in the previous step.
Get Data Factory Identity for assigning permissions in the Key Vault
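With the identity in hand, the access policy entry on the Key Vault only needs read access to secrets. As an ARM-template sketch, with the tenant and object IDs as placeholders for your own values:

```json
{
  "tenantId": "<tenant-id>",
  "objectId": "<data-factory-managed-identity-object-id>",
  "permissions": {
    "secrets": [ "get", "list" ]
  }
}
```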
Alternatively, you can use a manual connection string and only fetch the Access Key from the KeyVault.
Create connections to Cosmos DB
Once the KeyVault has been configured, it’s time to add the CosmosDB connection, following the same steps as for the Key Vault connection. There are a couple of ways to configure the connection: either use the full connection string from the KeyVault (in which case your connection string must include the Database), or use a manual connection string where only the Access Key is fetched from the Key Vault and the Account URL and Database are entered manually.
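For reference, the first variant (full connection string pulled from the Key Vault) corresponds to a linked service definition along these lines; the linked service and secret names are assumptions, so match them to your own setup:

```json
{
  "name": "CosmosDbLinkedService",
  "properties": {
    "type": "CosmosDb",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "CosmosDbConnectionString"
      }
    }
  }
}
```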
Create any additional Connections for your data sources
You will probably need extra connections to your data sources, so configure them in the same way you did the CosmosDB one. In my case, I am using a Storage Account.
Publish
Publishing is the Data Factory equivalent of a Save operation; it ensures that the changes you have made so far are stored in the Data Factory configuration.
Configure datasets
With the connections to your sources configured, it is now time to configure the data sets.
Add a JSON Dataset
For my source of data, I am using a JSON file containing the list of items I want to insert. Creating a new file-based dataset is simple: select the format of the file, point it to the right connection (a storage account connection in my case), and provide a path to the file, or a filename match pattern in case there is more than one file you want to process.
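As a sketch, the resulting dataset definition looks something like this; the container, file name, and linked service name are illustrative:

```json
{
  "name": "SourceItemsJson",
  "properties": {
    "type": "Json",
    "linkedServiceName": {
      "referenceName": "StorageAccountLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "imports",
        "fileName": "items.json"
      }
    }
  }
}
```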
Add a CosmosDB dataset
Datasets are used both for sources of data and for destinations. Configuring a CosmosDB dataset creates a sink for the data to be copied into. The process is similar to any other dataset: just pick the connection and the collection.
Configuring a CosmosDB Dataset
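A minimal CosmosDB (SQL API) dataset definition, with placeholder names for the collection and linked service, would look roughly like:

```json
{
  "name": "CosmosDbSinkDataset",
  "properties": {
    "type": "CosmosDbSqlApiCollection",
    "linkedServiceName": {
      "referenceName": "CosmosDbLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "collectionName": "Items"
    }
  }
}
```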
Create the copy pipeline
Once the Connections and Datasets are created, the last thing to do is to create the copy job.
Use a simple Move and Transform -> Copy Data job for that. The source will be one of the input sources, and the sink is the CosmosDB dataset. For best performance, ensure that your source data matches exactly what you need to get into CosmosDB so that there is no mapping to be done in between.
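Put together, the Copy Data activity wiring a JSON source dataset into the CosmosDB sink can be sketched as follows (all names here are placeholders):

```json
{
  "name": "CopyJsonToCosmosDb",
  "properties": {
    "activities": [
      {
        "name": "CopyItems",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceItemsJson", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "CosmosDbSinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "JsonSource" },
          "sink": { "type": "CosmosDbSqlApiSink", "writeBehavior": "insert" }
        }
      }
    ]
  }
}
```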
Stats
Since Data Factory uses the CosmosDB Bulk Executor library behind the scenes, the insert performance is more than decent. Here are some stats for a collection provisioned with 1000 RUs:
4000 items (~ 1KB / doc), unique Partition Keys - total processing time: 27 seconds
100000 items (~0.3KB / doc), distributed across 4k Partition Keys - total processing time: 600 seconds
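As a rough sanity check on these numbers: a ~1 KB single-document write in Cosmos DB costs on the order of 5–6 RU, so 1000 RU/s should sustain somewhere in the range of 170–200 inserts per second, which lines up with the observed throughput:

```python
# Observed runs from the stats above: (item count, total seconds).
runs = {
    "4000 x ~1KB docs": (4000, 27),
    "100000 x ~0.3KB docs": (100_000, 600),
}

for name, (items, seconds) in runs.items():
    # Effective throughput the pipeline achieved against 1000 provisioned RUs.
    print(f"{name}: {items / seconds:.0f} items/s")
```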
Integration
A Data Factory pipeline is very easy to integrate into your existing workflows. Some of the best methods I’ve used are the Logic App connector, which lets you call your pipeline from an Azure Logic App job, or hooking it directly to an Azure Storage blob-creation event via the built-in triggers.
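The blob-creation hookup, for instance, is a blob events trigger pointing at the copy pipeline. A sketch of its definition, where the scope, path, and names are placeholders for your own resources:

```json
{
  "name": "OnBlobCreated",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/imports/blobs/",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyJsonToCosmosDb", "type": "PipelineReference" } }
    ]
  }
}
```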
Conclusion
There’s a lot of flexibility and capability in Azure Data Factory that was not covered in this article, and it is worth exploring if you need more complex scenarios. For more details, I recommend checking out these more in-depth articles: Migrating Relational Data into Azure Cosmos DB using Azure Data Factory and Azure Databricks, and Copy and transform data in Azure Cosmos DB (SQL API) by using Azure Data Factory.