CosmosDB graph and PowerBI - Setting up refreshes

In previous articles I’ve shown how to connect a CosmosDB Graph database to PowerBI for easy reporting and analytics visualization.
An important piece of any dashboard is that it’s up to date and to do that, we need to refresh the data periodically to ensure that we’re visualizing the latest available information.
By default, when you hit refresh in PowerBI, what happens is that all queries behind the dataset are being re-run essentially grabbing all data again. This is certanly not advisable with larger databases both from time and performance perspectives. In order to deal with this problem, PowerBI came with a feature called incremental refresh however it requires a PowerBI Premium license and won’t work with all types of source data.

In this article I’ll be going tghough a technique that mimics the incremental refresh feature and works with our CosmosDB Graph database.

Breaking up the dataset

The crux of our incremental refresh is to break our datasets into 2 or more portions. One that will always be refreshed, and that only contains a small amount of data (current month, current week, depending on your volume) and one (or more) datasets for historical records that won’t be refreshed.

If your data model contains a property that you can rely on to determine the freshness of the data (i.e. LastUpdatedAt) you can use that to split the query but if you don’t, you’re still covered because CosmosDB contains a built in property that tracks the latest change done to a document (graph or not) in a queriable property called _ts.

To demonstrate, we’ll pick up the same example from the initial post
Starting from the initial query:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
let
Source = DocumentDB.Contents("https://alex-poc.documents.azure.com:443", "dev", "core2", [Query = "
select
c.id,
c.PartitionKey,
c.OrderDateTime[0]._value as OrderDateTime,
c.OrderStatus[0]._value as OrderStatus,
c.TransactionNumber[0]._value as TransactionNumber,
c.Paid[0]._value as IsPaid
from c
where
c.label = 'Order'"]),
#"Expanded Document" = Table.ExpandRecordColumn(Source, "Document", {"id", "PartitionKey", "OrderDateTime", "OrderStatus", "TransactionNumber", "IsPaid"}, {"id", "PartitionKey", "OrderDateTime", "OrderStatus", "TransactionNumber", "IsPaid"})
in
#"Expanded Document"

we can update that with a time condition:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
let
Source = DocumentDB.Contents("https://alex-poc.documents.azure.com:443", "dev", "core2", [Query = "
select
c.id,
c.PartitionKey,
c.OrderDateTime[0]._value as OrderDateTime,
c.OrderStatus[0]._value as OrderStatus,
c.TransactionNumber[0]._value as TransactionNumber,
c.Paid[0]._value as IsPaid
from c
where
c.label = 'Order' and c.OrderDateTime[0]._value >= '2019-02-01' "]),
#"Expanded Document" = Table.ExpandRecordColumn(Source, "Document", {"id", "PartitionKey", "OrderDateTime", "OrderStatus", "TransactionNumber", "IsPaid"}, {"id", "PartitionKey", "OrderDateTime", "OrderStatus", "TransactionNumber", "IsPaid"})
in
#"Expanded Document"

This query will now return only orders that are newer that March 1st. If we did not have that date field available for filtering, we could have used the built in _ts property in a similar fashion: and c._ts >= 1548979200. The _ts is an epoch time representation hence the “weird” number. I’m using https://www.epochconverter.com/ to easily convert dates into the epoch time. Notice that we’re passing that as a number not a string, that will be important later when we will parametrize the queries.

Ok, now we are done with what we’re going to call the “current” query - because it’s meant to always refresh and bring the latest / current data only. Next we need to duplicate the query and modify the filter to return the previous data, a.k.a the rest of the data or the historical.

The query will be identical, just change the condition to return data before the date and we will disable this query to run on every refresh by right clicking on the dataset and un-checking the “include in every refresh” option. From now on, every month, or quarter or on a cadence depending on your data volume you’ll be able to change this value in both the queries and always keep a small current dataset that refreshes every day, and a larger historical dataset that does not.

Now, depending on the quantity of data you are dealing with, you might choose to have multiple historical queries. Because you still need to refresh this historical query every time you update the current one, you might want to have an yearly or monthly historical dataset.

Merging the datasets

We’ve solved the refresh performance issue but now we have two or more datasets but we still want to present this to BI consumers as one unified dataset. To do that we will merge the datasets together into one by using the Append Queries feature (and remove eventual duplicates - this could happen if the a record already present in a historical dataset was updated and is not present in the current dataset).

Once this is done, you have once again, one dataset that abstracts away an incremental refresh strategy behind the scenes.

Setting up parameters

Lastly, we want to create a parameter and replace the hard-coded time value in the query so that it’s easier to manage and update in the future (It will also allow for easier modification if you publish the report on PowerBI online).

Creating a parameter is easy, from the Manage Parameters menu:

Creating a parameter

Once the parameter is in, we need to update the queries to use it. The syntax to replace a parameter value in a query looks like this “&parameterName&” which results in our query looking like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
let
Source = DocumentDB.Contents("https://alex-poc.documents.azure.com:443", "dev", "core2", [Query = "
select
c.id,
c.PartitionKey,
c.OrderDateTime[0]._value as OrderDateTime,
c.OrderStatus[0]._value as OrderStatus,
c.TransactionNumber[0]._value as TransactionNumber,
c.Paid[0]._value as IsPaid
from c
where
c.label = 'Order' and c._ts >= "&Historical_Timestamp&" "]),
#"Expanded Document" = Table.ExpandRecordColumn(Source, "Document", {"id", "PartitionKey", "OrderDateTime", "OrderStatus", "TransactionNumber", "IsPaid"}, {"id", "PartitionKey", "OrderDateTime", "OrderStatus", "TransactionNumber", "IsPaid"})
in
#"Expanded Document"
Share Comments