How to get a predicate pushed to a workspace table?

Martin Smith 1 Reputation point
2025-01-05T12:31:52.1566667+00:00

In my Synapse workspace I have created a spark table linked to a Cosmos analytical store instance with the following details (as described here)

create table if not exists  mytable using cosmos.olap options (spark.synapse.linkedService 'my_linked_service',spark.cosmos.container 'mycontainer',spark.cosmos.autoSchemaMerge 'true');

In the dataflow if I choose "Source type" of "Workspace DB" I see this table listed as an available table in the "Lake" tables and can successfully select it and access the data from it.

This works fine when the underlying collection is small but I am now trying to use it on a very large collection (> 5TB) for which I only want a very small subset of rows (a window filtered by _ts).

There does not appear to be any option to specify a Query to run against a workspace DB table and it only has an option to select the whole table.

I have tried adding a filter on _ts inside the dataflow itself but this does not seem to get pushed down and it appears to be returning all rows and then subsequently filtering them in a later step.

The documentation https://zcusa.951200.xyz/en-us/azure/data-factory/data-flow-source#workspace-db-synapse-workspaces-only implies that the workspace DB option is just doing something that we can do ourselves with linked services.

Is this the case for these tables? If so what kind of datasource do I need to select to access these? The linked service type called "Azure Synapse Analytics" appears to connect to the SQL Serverless endpoint and so not show tables from the "lake" database.

NB: I am aware of the article Copy and transform data in Azure Cosmos DB analytical store by using Azure Data Factory but unfortunately this uses the CDC feature and I have found that this has some bugs making it unusable for this case.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,117 questions
Azure Cosmos DB
Azure Cosmos DB
An Azure NoSQL database service for app development.
1,721 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,100 questions
{count} votes

2 answers

Sort by: Most helpful
  1. Pinaki Ghatak 5,485 Reputation points Microsoft Employee
    2025-01-06T09:22:20.6266667+00:00

    Hello @Martin Smith

    Please know that currently the Workspace DB option in Synapse data flow source transformation does not support specifying a query to run against a workspace DB table. It only has an option to select the whole table. As you mentioned, the documentation implies that the workspace DB option is just doing something that we can do ourselves with linked services.

    In this case, you can use a linked service to connect to the Cosmos analytical store instance and then use a source transformation to read data from the linked service.

    To do this, you can create a linked service of type "Azure Cosmos DB (SQL API)" and provide the necessary connection details. Then, in the data flow source transformation, you can select the linked service as the source and specify the query to filter the data.

    Here is an example of how to create a linked service for Cosmos analytical store instance:

    { 	
    	"name": "my_cosmosdb_linked_service", 
    	"type": "Microsoft.DataFactory/factories/linkedservices", 
    	"properties": { 
    		"type": "AzureCosmosDB", 
    		"typeProperties": { 
    			"connectionString": "AccountEndpoint=;AccountKey=;Database=;"
    		}
    	}
    }
    
    

    Once you have created the linked service, you can use it in the data flow source transformation and specify the query to filter the data. For example:

    source(output( col1 as string, col2 as string, col3 as string ), 
    	allowSchemaDrift: true, 
    	validateSchema: false, 
    	ignoreNoFilesFound: false, 
    	partitionOption: 'None', 
    	query: 'SELECT * FROM c WHERE c._ts > 1234567890', format: 'json', 
    	options: map( 'connectionString', 'my_cosmosdb_linked_service' )
    )
    

    I hope this helps.


  2. Ganesh Gurram 3,025 Reputation points Microsoft Vendor
    2025-01-07T14:34:00.66+00:00

    Hi @Martin Smith

    I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others ", I'll repost your solution in case you'd like to accept the answer

    Ask: How to get a predicate pushed to a workspace table?

    Solution: As a workaround, an alternative approach was implemented by adding code to a notebook for delta table creation and merging and scheduling its execution via a pipeline.

    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information. 

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue. 

     ..................................................................................

    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members. 


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.