If you are looking for free DP-203 dumps, then here are some sample questions and answers. You can prepare from our Microsoft DP-203 exam question notes and practice with this test. Check out our updated DP-203 exam dumps below.
DumpsGroup is a top-class study material provider, and our comprehensive range of DP-203 real exam questions can be your key to passing the Microsoft Azure Data Engineer Associate certification exam on the first attempt. We have excellent material covering almost all the topics of the Microsoft DP-203 exam. You can get this material in Microsoft DP-203 PDF and DP-203 practice test engine formats, both designed to resemble the real exam questions. Free DP-203 questions and answers and free Microsoft DP-203 study material are available here so you can judge the quality and accuracy of our study material.
Sample Question 4
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Data Lake Storage account that contains a staging zone.
You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.
Solution: You schedule an Azure Databricks job that executes an R notebook, and then inserts the data into the data warehouse.
Does this meet the goal?
A. Yes B. No
Answer: B
Explanation:
Must use an Azure Data Factory, not an Azure Databricks job.
You plan to use an Apache Spark pool in Azure Synapse Analytics to load data to an Azure Data Lake Storage Gen2 account. You need to recommend which file format to use to store the data in the Data Lake Storage account. The solution must meet the following requirements:
• Column names and data types must be defined within the files loaded to the Data Lake Storage account.
• Data must be accessible by using queries from an Azure Synapse Analytics serverless SQL pool.
• Partition elimination must be supported without having to specify a specific partition.
What should you recommend?
A. Delta Lake B. JSON C. CSV D. ORC
Answer: D
Sample Question 6
You are designing a solution that will use tables in Delta Lake on Azure Databricks. You need to minimize how long it takes to perform the following:
• Queries against non-partitioned tables
• Joins on non-partitioned columns
Which two options should you include in the solution? Each correct answer presents part of the solution.
A. Z-Ordering B. Apache Spark caching C. dynamic file pruning (DFP) D. the clone command
Answer: A,C
Explanation:
Z-Ordering colocates related information in the same set of files. Delta Lake on Azure Databricks automatically uses this co-locality in its data-skipping algorithms, which dramatically reduces the amount of data that must be read for queries that filter or join on the Z-ordered columns.
Dynamic file pruning (DFP) prunes files at query execution time based on join filters, and it is especially effective for queries against non-partitioned tables and for joins on non-partitioned columns, because it does not depend on static partition values.
Apache Spark caching can speed up repeated reads of the same data, but it does not reduce the amount of data scanned for ad hoc queries and joins, so it is not one of the two required options here.
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a tumbling window, and you set the window size to 10 seconds.
Does this meet the goal?
A. Yes B. No
Answer: A
Explanation:
Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time intervals. Because the windows do not overlap, each event is counted exactly once, which satisfies the requirement.
You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool named Pool1. You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:
• Enable Pool1 to skip columns and rows that are unnecessary in a query.
• Automatically create column statistics.
• Minimize the size of files.
Which type of file should you use?
A. JSON B. Parquet C. Avro D. CSV
Answer: B
Explanation:
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of statistics for CSV files is supported.
You have an Azure Databricks workspace that contains a Delta Lake dimension table named Table1. Table1 is a Type 2 slowly changing dimension (SCD) table. You need to apply updates from a source table to Table1. Which Apache Spark SQL operation should you use?
A. CREATE B. UPDATE C. MERGE D. ALTER
Answer: C
Explanation:
Delta Lake provides the ability to infer the schema for data input, which further reduces the effort required to manage schema changes. A Type 2 slowly changing dimension (SCD) records all the changes made to each key in the dimension table. These operations require updating existing rows to mark the previous values of the keys as old, and then inserting new rows as the latest values. Given a source table with the updates and a target table with the dimensional data, SCD Type 2 can be expressed with a merge.
Example:
// Implementing SCD Type 2 operation using merge function
customersTable
  .as("customers")
  .merge(
    stagedUpdates.as("staged_updates"),
    "customers.customerId = mergeKey")
  .whenMatched("customers.current = true AND customers.address <> staged_updates.address")
  .updateExpr(Map("current" -> "false", "endDate" -> "staged_updates.effectiveDate")) // expire the old row
  .whenNotMatched()
  .insertExpr(Map("customerId" -> "staged_updates.customerId", "address" -> "staged_updates.address",
    "current" -> "true", "effectiveDate" -> "staged_updates.effectiveDate", "endDate" -> "null")) // insert the new current row
  .execute()
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a table named table1. You load 5 TB of data into table1. You need to ensure that columnstore compression is maximized for table1. Which statement should you execute?
A. ALTER INDEX ALL ON table1 REORGANIZE B. ALTER INDEX ALL ON table1 REBUILD C. DBCC DBREINDEX (table1) D. DBCC INDEXDEFRAG (pool1, table1)
Answer: B
Explanation:
Columnstore and columnstore archive compression
Columnstore tables and indexes are always stored with columnstore compression. You can
further reduce the size of columnstore data by configuring an additional compression called
archival compression. To perform archival compression, SQL Server runs the Microsoft
XPRESS compression algorithm on the data. Add or remove archival compression by
using the following data compression types:
Use COLUMNSTORE_ARCHIVE data compression to compress columnstore data with
archival compression.
Use COLUMNSTORE data compression to decompress archival compression. The
resulting data continue to be compressed with columnstore compression.
To add archival compression, use ALTER TABLE (Transact-SQL) or ALTER INDEX (Transact-SQL) with the REBUILD option and DATA_COMPRESSION = COLUMNSTORE_ARCHIVE. To remove it, rebuild with DATA_COMPRESSION = COLUMNSTORE.
You have two Azure Blob Storage accounts named account1 and account2. You plan to create an Azure Data Factory pipeline that will use scheduled intervals to replicate newly created or modified blobs from account1 to account2. You need to recommend a solution to implement the pipeline. The solution must meet the following requirements:
• Ensure that the pipeline only copies blobs that were created or modified since the most recent replication event.
• Minimize the effort to create the pipeline.
What should you recommend?
A. Create a pipeline that contains a flowlet. B. Create a pipeline that contains a Data Flow activity. C. Run the Copy Data tool and select Metadata-driven copy task. D. Run the Copy Data tool and select Built-in copy task.
Answer: A
Sample Question 12
You have an Azure Data Factory pipeline named pipeline1 that is invoked by a tumbling window trigger named Trigger1. Trigger1 has a recurrence of 60 minutes. You need to ensure that pipeline1 will execute only if the previous execution completes successfully. How should you configure the self-dependency for Trigger1?
A. offset: "-00:01:00" size: "00:01:00" B. offset: "01:00:00" size: "-01:00:00" C. offset: "01:00:00" size: "01:00:00" D. offset: "-01:00:00" size: "01:00:00"
Answer: D
Explanation:
Tumbling window self-dependency properties
In scenarios where the trigger shouldn't proceed to the next window until the preceding window is successfully completed, build a self-dependency. A self-dependency trigger that depends on the success of earlier runs of itself within the preceding hour uses an offset of -01:00:00 and a window size of 01:00:00.
You are building a data flow in Azure Data Factory that upserts data into a table in an Azure Synapse Analytics dedicated SQL pool. You need to add a transformation to the data flow. The transformation must specify logic indicating when a row from the input data must be upserted into the sink. Which type of transformation should you add to the data flow?
A. join B. select C. surrogate key D. alter row
Answer: D
Explanation:
The alter row transformation allows you to specify insert, update, delete, and upsert
policies on rows based on expressions. You can use the alter row transformation to
perform upserts on a sink table by matching on a key column and setting the appropriate
row policy.
Sample Question 14
You have an Azure Data Lake Storage account that contains a staging zone. You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse.
Does this meet the goal?
A. Yes B. No
Answer: B
Explanation:
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity, not an Azure Databricks notebook, with your own data processing logic and use the activity in the pipeline. You can create a custom activity to run R scripts in your pipeline.
You are designing an Azure Data Lake Storage solution that will transform raw JSON files for use in an analytical workload. You need to recommend a format for the transformed files. The solution must meet the following requirements:
• Contain information about the data types of each column in the files.
• Support querying a subset of columns in the files.
• Support read-heavy analytical workloads.
• Minimize the file size.
What should you recommend?
A. JSON B. CSV C. Apache Avro D. Apache Parquet
Answer: D
Explanation:
Parquet, an open-source file format for Hadoop, stores nested data structures in a flat
columnar format.
Compared to a traditional approach where data is stored in a row-oriented approach, Parquet file format is more efficient in terms of storage and performance.
It is especially good for queries that read particular columns from a "wide" (many-column) table, since only the needed columns are read and IO is minimized.
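As an informal illustration of that column projection behavior, here is a small Scala sketch; the sample data, column names, and output path are assumptions, not part of the question:
// Illustration only: sample data, column names, and the /tmp path are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the transformed JSON data.
val df = Seq((1, "2024-01-01", "open"), (2, "2024-01-02", "close"))
  .toDF("eventId", "eventDate", "eventType")

// Writing to Parquet stores the column names and data types inside the files.
df.write.mode("overwrite").parquet("/tmp/curated/events")

// Because Parquet is columnar, selecting only the needed columns means the
// remaining columns are never read from storage.
val subset = spark.read.parquet("/tmp/curated/events").select("eventId", "eventDate")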
You have an Azure subscription that contains an Azure Synapse Analytics workspace named ws1 and an Azure Cosmos DB account named Cosmos1. Cosmos1 contains a container named container1, and ws1 contains a serverless SQL pool. You need to ensure that you can query the data in container1 by using the serverless SQL pool. Which three actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. Enable Azure Synapse Link for Cosmos1 B. Disable the analytical store for container1. C. In ws1, create a linked service that references Cosmos1 D. Enable the analytical store for container1 E. Disable indexing for container1
Answer: A,C,D
Sample Question 17
You are designing a folder structure for the files in an Azure Data Lake Storage Gen2 account. The account has one container that contains three years of data. You need to recommend a folder structure that meets the following requirements:
• Supports partition elimination for queries by Azure Synapse Analytics serverless SQL pools
• Supports fast data retrieval for data from the current month
• Simplifies data security management by department
Which folder structure should you recommend?
A. \YYYY\MM\DD\Department\DataSource\DataFile_YYYYMMDD.parquet B. \Department\DataSource\YYYY\MM\DataFile_YYYYMMDD.parquet C. \DD\MM\YYYY\Department\DataSource\DataFile_DDMMYY.parquet D. \DataSource\Department\YYYYMM\DataFile_YYYYMMDD.parquet
Answer: B
Explanation:
Department top level in the hierarchy to simplify security management.
Month (MM) at the leaf/bottom level to support fast data retrieval for data from the current
month.
Sample Question 18
You have an Azure Synapse Analytics dedicated SQL pool. You need to create a pipeline that will execute a stored procedure in the dedicated SQL pool and use the returned result set as the input for a downstream activity. The solution must minimize development effort. Which type of activity should you use in the pipeline?
A. Notebook B. U-SQL C. Script D. Stored Procedure
Answer: D
Sample Question 19
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. Table1 contains the following:
• One billion rows
• A clustered columnstore index
• A hash-distributed column named Product Key
• A column named Sales Date that is of the date data type and cannot be null
Thirty million rows will be added to Table1 each month. You need to partition Table1 based on the Sales Date column. The solution must optimize query performance and data loading. How often should you create a partition?
A. once per month B. once per year C. once per day D. once per week
Answer: B
Explanation: A minimum of 1 million rows per distribution and partition is needed, and each table is spread across 60 distributions. Thirty million rows are added each month, which is only 500,000 rows per distribution for a monthly partition. A yearly partition holds roughly 360 million rows, or about 6 million rows per distribution, which exceeds the minimum.
Note: When creating partitions on clustered columnstore tables, it is important to consider
how many rows belong to each partition. For optimal compression and performance of
clustered columnstore tables, a minimum of 1 million rows per distribution and partition is
needed. Before partitions are created, dedicated SQL pool already divides each table into
60 distributions.
Any partitioning added to a table is in addition to the distributions created behind the
scenes. Using this example, if the sales fact table contained 36 monthly partitions, and
given that a dedicated SQL pool has 60 distributions, then the sales fact table should
contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a
table contains fewer than the recommended minimum number of rows per partition,
consider using fewer partitions in order to increase the number of rows per partition.
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named cluster1. You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs. What should you do first?
A. Upgrade workspace1 to the Premium pricing tier. B. Create a cluster policy in workspace1. C. Create a pool in workspace1. D. Configure a global init script for workspace1.
Answer: C
Explanation:
You can use Databricks pools to speed up your data pipelines and scale clusters quickly. A pool is a managed cache of virtual machine instances that enables clusters to start and scale more quickly by drawing on a set of idle, ready-to-use instances.
What should you recommend to prevent users outside the Litware on-premises network from accessing the analytical data store?
A. a server-level virtual network rule B. a database-level virtual network rule C. a database-level firewall IP rule D. a server-level firewall IP rule
Answer: A
Explanation:
Virtual network rules are one firewall security feature that controls whether the database
server for your single databases and elastic pool in Azure SQL Database or for your
databases in SQL Data Warehouse accepts communications that are sent from particular
subnets in virtual networks.
Server-level, not database-level: Each virtual network rule applies to your whole Azure SQL Database server, not just to one particular database on the server. In other words, a virtual network rule applies at the server level, not at the database level.
You have an Azure subscription that contains an Azure Data Lake Storage account named myaccount1. The myaccount1 account contains two containers named container1 and container2. The subscription is linked to an Azure Active Directory (Azure AD) tenant that contains a security group named Group1.
You need to grant Group1 read access to container1. The solution must use the principle of least privilege. Which role should you assign to Group1?
A. Storage Blob Data Reader for container1 B. Storage Table Data Reader for container1 C. Storage Blob Data Reader for myaccount1 D. Storage Table Data Reader for myaccount1
Answer: A
Sample Question 24
You are designing an application that will use an Azure Data Lake Storage Gen2 account to store petabytes of license plate photos from toll booths. The account will use zone-redundant storage (ZRS).
You identify the following usage patterns:
• The data will be accessed several times a day during the first 30 days after the data is created. The data must meet an availability SLA of 99.9%.
• After 90 days, the data will be accessed infrequently but must be available within 30 seconds.
• After 365 days, the data will be accessed infrequently but must be available within five minutes.
Answer: See the explanation below.
Explanation:
The answer is provided as an exhibit that is not reproduced in this free sample.
Sample Question 25
You are designing a database for an Azure Synapse Analytics dedicated SQL pool to support workloads for detecting ecommerce transaction fraud.
Data will be combined from multiple ecommerce sites and can include sensitive financial information such as credit card numbers.
You need to recommend a solution that meets the following requirements:
• Users must be able to identify potentially fraudulent transactions.
• Users must be able to use credit cards as a potential feature in models.
• Users must NOT be able to access the actual credit card numbers.
What should you include in the recommendation?
A. Transparent Data Encryption (TDE) B. row-level security (RLS) C. column-level encryption D. Azure Active Directory (Azure AD) pass-through authentication
Answer: C
Explanation:
Use Always Encrypted to secure the required columns. You can configure Always Encrypted for individual database columns containing your sensitive data. Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases.
Reference: https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/alwaysencrypted-datab...
Sample Question 26
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a fact table named Table1 that will store sales data from the last three years. The solution must be optimized for the following query operations:
• Show order counts by week.
• Calculate sales totals by region.
• Calculate sales totals by product.
• Find all the orders from a given month.
Which data should you use to partition Table1?
A. region B. product C. week D. month
Answer: C
Sample Question 27
You plan to create a dimension table in Azure Synapse Analytics that will be less than 1
GB.
You need to create the table to meet the following requirements:
• Provide the fastest query time.
• Minimize data movement during queries.
Which type of table should you use?
A. hash distributed B. heap C. replicated D. round-robin
Answer: C
Sample Question 28
You are designing an Azure Databricks interactive cluster. The cluster will be used
infrequently and will be configured for auto-termination.
You need to ensure that the cluster configuration is retained indefinitely after the cluster is
terminated. The solution must minimize costs.
What should you do?
A. Clone the cluster after it is terminated. B. Terminate the cluster manually when processing completes. C. Create an Azure runbook that starts the cluster every 90 days. D. Pin the cluster.
Answer: D
You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1.
New files are uploaded daily to storage1.
You need to recommend a solution that uses storage1 as a structured streaming source. The solution must meet the following requirements:
• Incrementally process new files as they are uploaded to storage1.
• Minimize implementation and maintenance effort.
• Minimize the cost of processing millions of files.
• Support schema inference and schema drift.
Which should you include in the recommendation?
A. Auto Loader B. Apache Spark FileStreamSource C. COPY INTO D. Azure Data Factory
Answer: A
Explanation:
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It is exposed as a structured streaming source (cloudFiles), supports schema inference and schema evolution for drifting schemas, and scales to millions of files at low cost with minimal setup.
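A minimal, hypothetical Auto Loader sketch in Scala for Azure Databricks; the container name, file format, and checkpoint paths below are assumptions used only to show the shape of the code:
// Illustration only: paths, container, and file format are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Auto Loader exposes cloud storage as the "cloudFiles" structured streaming source.
val incoming = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")                            // assumed file format
  .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema") // enables schema inference and evolution
  .load("abfss://container1@storage1.dfs.core.windows.net/landing")

incoming.writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/autoloader")
  .option("mergeSchema", "true")                                  // tolerate schema drift in the Delta sink
  .start("/mnt/curated/files")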
Sample Question 30
You have an activity in an Azure Data Factory pipeline. The activity calls a stored
procedure in a data warehouse in Azure Synapse Analytics and runs daily.
You need to verify the duration of the activity when it ran last.
What should you use?
A. activity runs in Azure Monitor B. Activity log in Azure Synapse Analytics C. the sys.dm_pdw_wait_stats data management view in Azure Synapse Analytics D. an Azure Resource Manager template
You are designing a highly available Azure Data Lake Storage solution that will include geo-zone-redundant storage (GZRS).
You need to monitor for replication delays that can affect the recovery point objective
(RPO).
What should you include in the monitoring solution?
A. Last Sync Time B. Average Success Latency C. Error errors D. availability
Answer: A
Explanation:
Because geo-replication is asynchronous, it is possible that data written to the primary
region has not yet been written to the secondary region at the time an outage occurs. The
Last Sync Time property indicates the last time that data from the primary region was
written successfully to the secondary region. All writes made to the primary region before
the last sync time are available to be read from the secondary location. Writes made to the
primary region after the last sync time property may or may not be available for reads yet.
Reference:
https://docs.microsoft.com/en-us/azure/storage/common/last-sync-time-get
Sample Question 32
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. You have files that are ingested and loaded into an Azure Data Lake Storage Gen2 container named container1. You plan to insert data from the files in container1 into Table1 and transform the data. Each row of data in the files will produce one row in the serving layer of Table1.
You need to ensure that when the source data files are loaded to container1, the DateTime is stored as an additional column in Table1.
Solution: You use an Azure Synapse Analytics serverless SQL pool to create an external table that has an additional DateTime column.
Does this meet the goal?
A. Yes B. No
Answer: B
Explanation:
Instead use the derived column transformation to generate new columns in your data flow or to modify existing fields.
You have an Azure Stream Analytics job. You need to ensure that the job has enough streaming units provisioned. You configure monitoring of the SU % Utilization metric. Which two additional metrics should you monitor? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. Backlogged Input Events B. Watermark Delay C. Function Events D. Out of order Events E. Late Input Events
Answer: A,B
Explanation:
To react to increased workloads and increase streaming units, consider setting an alert of 80% on the SU Utilization metric. Also, you can use watermark delay and backlogged events metrics to see if there is an impact. Note: Backlogged Input Events: Number of input events that are backlogged. A non-zero value for this metric implies that your job isn't able to keep up with the number of incoming events. If this value is slowly increasing or consistently non-zero, you should scale out your job, by increasing the SUs.
A company uses Azure Stream Analytics to monitor devices. The company plans to double the number of devices that are monitored. You need to monitor a Stream Analytics job to ensure that there are enough processing resources to handle the additional load. Which metric should you monitor?
A. Early Input Events B. Late Input Events C. Watermark delay D. Input Deserialization Errors
Answer: C
Explanation:
There are a number of resource constraints that can cause the streaming pipeline to slow down. The watermark delay metric can rise due to:
• Not enough processing resources in Stream Analytics to handle the volume of input events.
• Not enough throughput within the input event brokers, so they are throttled.
• Output sinks that are not provisioned with enough capacity, so they are throttled.
The possible solutions vary widely based on the flavor of output service being used.
You are designing an enterprise data warehouse in Azure Synapse Analytics that will contain a table named Customers. Customers will contain credit card information. You need to recommend a solution to provide salespeople with the ability to view all the entries in Customers. The solution must prevent all the salespeople from viewing or inferring the credit card information. What should you include in the recommendation?
A. data masking B. Always Encrypted C. column-level security D. row-level security
Answer: A
Explanation:
SQL Database dynamic data masking limits sensitive data exposure by masking it to non-privileged users. The credit card masking method exposes the last four digits of the designated fields and adds a constant string as a prefix in the form of a credit card. Example: XXXX-XXXX-XXXX-1234
You have an Azure Databricks resource. You need to log actions that relate to changes in compute for the Databricks resource. Which Databricks services should you log?
A. clusters B. workspace C. DBFS D. SSH E. Jobs
Answer: B
Explanation:
Cloud provider infrastructure logs. Databricks logging allows security and admin teams to demonstrate conformance to data governance standards within or from a Databricks workspace. Customers, especially in regulated industries, also need records of activities such as:
• User access control to cloud data storage
• Cloud Identity and Access Management roles
• User access to cloud network and compute
Azure Databricks offers three distinct workloads on several VM Instances tailored for your
data analytics workflow—the Jobs Compute and Jobs Light Compute workloads make it
easy for data engineers to build and execute jobs, and the All-Purpose Compute workload
makes it easy for data scientists to explore, visualize, manipulate, and share data and
insights interactively
Sample Question 37
You have an Azure Data Lake Storage account that contains a staging zone. You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse.
Does this meet the goal?
A. Yes B. No
Answer: A
Sample Question 38
You plan to build a structured streaming solution in Azure Databricks. The solution will count new events in five-minute intervals and report only events that arrive during the interval. The output will be sent to a Delta Lake table. Which output mode should you use?
A. complete B. update C. append
Answer: C
Explanation: Append mode: Only the new rows appended to the result table since the last trigger are written to external storage. This is applicable only to queries where existing rows in the result table are not expected to change.
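A rough Scala Structured Streaming sketch of this pattern; the source path, the eventTime column, and the output locations are assumptions, not part of the question:
// Illustration only: paths and the "eventTime" column are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder.getOrCreate()

val events = spark.readStream
  .format("delta")
  .load("/mnt/raw/events")                        // assumed streaming source

// The watermark tells Spark when a 5-minute window is complete, so the
// finalized count can be emitted exactly once under append output mode.
val counts = events
  .withWatermark("eventTime", "5 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()

counts.writeStream
  .format("delta")
  .outputMode("append")                           // only finalized windows are written
  .option("checkpointLocation", "/mnt/checkpoints/counts")
  .start("/mnt/curated/event_counts")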
You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container. Which resource provider should you enable?
A. Microsoft.Sql B. Microsoft-Automation C. Microsoft.EventGrid D. Microsoft.EventHub
Answer: C
Explanation:
Event-driven architecture (EDA) is a common data integration pattern that involves
production, detection, consumption, and reaction to events. Data integration scenarios
often require Data Factory customers to trigger pipelines based on events happening in
storage account, such as the arrival or deletion of a file in Azure Blob Storage account.
Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.
You are designing an Azure Databricks interactive cluster. The cluster will be used infrequently and will be configured for auto-termination. You need to ensure that the cluster configuration is retained indefinitely after the cluster is terminated. The solution must minimize costs. What should you do?
A. Clone the cluster after it is terminated. B. Terminate the cluster manually when processing completes. C. Create an Azure runbook that starts the cluster every 90 days. D. Pin the cluster.
Answer: D
Explanation:
To keep an interactive cluster configuration even after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.
You have an enterprise data warehouse in Azure Synapse Analytics named DW1 on a server named Server1. You need to verify whether the size of the transaction log file for each distribution of DW1 is smaller than 160 GB. What should you do?
A. On the master database, execute a query against the sys.dm_pdw_nodes_os_performance_counters dynamic management view. B. From Azure Monitor in the Azure portal, execute a query against the logs of DW1. C. On DW1, execute a query against the sys.database_files dynamic management view. D. Execute a query against the logs of DW1 by using the Get-AzOperationalInsightSearchResult PowerShell cmdlet.
Answer: A
Explanation:
The following query returns the transaction log size on each distribution. If one of the log files is reaching 160 GB, you should consider scaling up your instance or limiting your transaction size.
You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
• TransactionType: 40 million rows per transaction type
• CustomerSegment: 4 million rows per customer segment
• TransactionMonth: 65 million rows per month
• AccountType: 500 million rows per account type
You have the following query requirements:
• Analysts will most commonly analyze transactions for a given month.
• Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type.
You need to recommend a partition strategy for the table to minimize query times. On which column should you recommend partitioning the table?
A. CustomerSegment B. AccountType C. TransactionType D. TransactionMonth
Answer: D
Explanation:
For optimal compression and performance of clustered columnstore tables, a minimum of 1
million rows per distribution and partition is needed. Before partitions are created,
dedicated SQL pool already divides each table into 60 distributed databases.
Example: Any partitioning added to a table is in addition to the distributions created behind
the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and
given that a dedicated SQL pool has 60 distributions, then the sales fact table should
contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a
table contains fewer than the recommended minimum number of rows per partition,
consider using fewer partitions in order to increase the number of rows per partition.
Sample Question 43
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics. You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained. What should you recommend?
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
• A workload for data engineers who will use Python and SQL.
• A workload for jobs that will run notebooks that use Python, Scala, and SQL.
• A workload that data scientists will use to perform ad hoc analysis in Scala and R.
The enterprise architecture team at your company identifies the following standards for Databricks environments:
• The data engineers must share a cluster.
• The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.
• All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.
You need to create the Databricks clusters for the workloads.
Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs.
Does this meet the goal?
A. Yes B. No
Answer: B
Explanation:
We would need a High Concurrency cluster for the jobs.
Note:
Standard clusters are recommended for a single user. Standard clusters can run workloads developed in any language: Python, R, Scala, and SQL.
A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.
You have an Azure Stream Analytics job. You need to ensure that the job has enough streaming units provisioned. You configure monitoring of the SU % Utilization metric. Which two additional metrics should you monitor? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. Out of order Events B. Late Input Events C. Backlogged Input Events D. Function Events
Answer: C
Sample Question 46
You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytics dedicated SQL pool. The CSV file contains three columns named username, comment, and date.
The data flow already contains the following:
• A source transformation.
• A Derived Column transformation to set the appropriate types of data.
• A sink transformation to land the data in the pool.
You need to ensure that the data flow meets the following requirements:
• All valid rows must be written to the destination table.
• Truncation errors in the comment column must be avoided proactively.
• Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage.
Which two actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. To the data flow, add a sink transformation to write the rows to a file in blob storage. B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors. C. To the data flow, add a filter transformation to filter out rows that will cause truncation errors D. Add a select transformation to select only the rows that will cause truncation errors.
Answer: A,B
Explanation:
B: Example: This conditional split transformation defines the maximum length of "title" to be five. Any row that is less than or equal to five will go into the GoodRows stream. Any row that is larger than five will go into the BadRows stream.
A: The BadRows stream can then be routed to an additional sink transformation that writes those rows to a file in blob storage.
You are developing a solution that will stream to Azure Stream Analytics. The solution will have both streaming data and reference data. Which input type should you use for the reference data?
A. Azure Cosmos DB B. Azure Blob storage C. Azure IoT Hub D. Azure Event Hubs
Answer: B
Explanation:
Stream Analytics supports Azure Blob storage and Azure SQL Database as the storage layer for reference data.
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. You have files that are ingested and loaded into an Azure Data Lake Storage Gen2 container named container1. You plan to insert data from the files into Table1 and transform the data. Each row of data in the files will produce one row in the serving layer of Table1.
You need to ensure that when the source data files are loaded to container1, the DateTime is stored as an additional column in Table1.
Solution: You use a dedicated SQL pool to create an external table that has an additional DateTime column.
Does this meet the goal?
A. Yes B. No
Answer: A
Sample Question 49
You plan to perform batch processing in Azure Databricks once daily. Which type of Databricks cluster should you use?
A. High Concurrency B. automated C. interactive
Answer: B
Explanation:
Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and a database named DB1. DB1 contains a fact table named Table1. You need to identify the extent of the data skew in Table1. What should you do in Synapse Studio?
A. Connect to the built-in pool and query sys.dm_pdw_sys_info. B. Connect to Pool1 and run DBCC CHECKALLOC. C. Connect to the built-in pool and run DBCC CHECKALLOC. D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.
Answer: D
Explanation:
Microsoft recommends use of sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data across distributions.
You are creating a new notebook in Azure Databricks that will support R as the primary language but will also support Scala and SQL. Which switch should you use to switch between languages?
A. @<Language> B. %<Language> C. \\(<Language>) D. \\(<Language>)
Answer: B
Explanation:
To change the language of a cell in Databricks to Scala, SQL, Python, or R, prefix the cell with %scala, %sql, %python, or %r.
You use Azure Data Lake Storage Gen2. You need to ensure that workloads can use filter predicates and column projections to filter data at the time the data is read from disk. Which two actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. Reregister the Microsoft Data Lake Store resource provider. B. Reregister the Azure Storage resource provider. C. Create a storage policy that is scoped to a container. D. Register the query acceleration feature. E. Create a storage policy that is scoped to a container prefix filter.
Answer: B,D
Sample Question 53
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this scenario, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You convert the files to compressed delimited text files.
Does this meet the goal?
A. Yes B. No
Answer: A
Explanation:
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.