Free Sample — 15 Practice Questions
Preview 15 of 137 questions from the Data Engineer Associate exam.
Try before you buy — purchase the full study guide for all 137 questions with answers and explanations.
Question 127
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
A. Parquet files can be partitioned
B. CREATE TABLE AS SELECT statements cannot be used on files
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
E. Parquet files will become Delta tables
Show Answer
Correct Answer: C
Explanation:
When using CREATE TABLE AS SELECT (CTAS), the schema of the target table is derived automatically from the source data or query results, and manual schema or file options cannot be specified. Parquet files embed a well-defined schema within the file metadata, making them well-suited for CTAS. CSV files do not have an inherent schema and rely on inference or external options, which CTAS does not support. Therefore, having a well-defined schema is a key benefit of using Parquet over CSV in this context.
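For reference, the CTAS pattern the question describes might look like the following (table name and path are illustrative; the statement is held as a Python string of the kind one would pass to `spark.sql()` in a notebook):

```python
# CTAS over Parquet works because the file carries its own schema.
# Table name and path are illustrative examples, not from the exam.
ctas_parquet = (
    "CREATE TABLE sales AS "
    "SELECT * FROM parquet.`/data/sales/`"  # schema read from file metadata
)

# A CSV source has no embedded schema; it would need options such as
# header or schema inference, which CTAS cannot accept. That gap is
# exactly the Parquet advantage the explanation describes.
print(ctas_parquet)
```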
Question 139
A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?
A. The format of the files to be copied was not included with the FORMAT_OPTIONS keyword.
B. The names of the files to be copied were not included with the FILES keyword.
C. The previous day’s file has already been copied into the table.
D. The PARQUET file format does not support COPY INTO.
E. The COPY INTO statement requires the table to be refreshed to view the copied rows.
Show Answer
Correct Answer: C
Explanation:
In Databricks, COPY INTO is an idempotent operation. The system tracks which source files have already been loaded and automatically skips them on subsequent runs to prevent duplicate ingestion. If the previous day’s file was already copied earlier, rerunning the COPY INTO command would not insert any new records, leaving the row count unchanged.
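The file-tracking behavior can be illustrated with a minimal pure-Python sketch (an analogue of the idea, not the Databricks implementation; file names and row values are illustrative):

```python
# Minimal analogue of COPY INTO's idempotent file tracking.
# Databricks records which source files a table has already loaded;
# this sketch mimics that with an in-memory set.

def copy_into(table, loaded_files, source_files):
    """Append rows from files not yet loaded; skip files already seen."""
    new_rows = 0
    for path, rows in source_files.items():
        if path in loaded_files:
            continue  # already ingested: skipped, preventing duplicates
        table.extend(rows)
        loaded_files.add(path)
        new_rows += len(rows)
    return new_rows

transactions = []
loaded = set()
day1 = {"/transactions/raw/2024-01-01.parquet": [{"sale": 100}, {"sale": 200}]}

copy_into(transactions, loaded, day1)          # first run loads 2 rows
added = copy_into(transactions, loaded, day1)  # rerun: file is skipped
print(added, len(transactions))  # 0 2 -> no new rows, table unchanged
```

This mirrors the scenario in the question: rerunning the statement over an already-loaded file changes nothing, which is answer C.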
Question 82
A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.
Which action can the data engineer perform to improve the startup time for the clusters used for the Job?
A. They can use endpoints available in Databricks SQL
B. They can use jobs clusters instead of all-purpose clusters
C. They can configure the clusters to autoscale for larger data sizes
D. They can use clusters that are from a cluster pool
Show Answer
Correct Answer: D
Explanation:
Using clusters backed by a cluster pool significantly reduces startup time because the pool maintains a set of pre-provisioned, idle instances that can be attached to job clusters immediately, avoiding the delay of provisioning new VMs for each task run.
Question 124
A data engineer needs to apply custom logic to identify employees with more than 5 years of experience in array column employees in table stores. The custom logic should create a new column exp_employees that is an array of all of the employees with more than 5 years of experience for each row. In order to apply this custom logic at scale, the data engineer wants to use the FILTER higher-order function.
Which of the following code blocks successfully completes this task?
Show Answer
Correct Answer: A
Explanation:
Option A correctly uses the FILTER higher-order function on the employees array to return only elements (employees) whose years of experience exceed 5, and assigns the result to a new column exp_employees. The other options either reference the wrong source column, contain syntax errors, or do not use the FILTER higher-order function as required.
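The SQL being tested is roughly `FILTER(employees, e -> e.years_exp > 5) AS exp_employees` (the field name `years_exp` is assumed for illustration). A pure-Python analogue of what FILTER does to each row's array:

```python
# Pure-Python analogue of SQL's FILTER higher-order function:
# keep only the array elements satisfying a predicate, row by row.
# Field names (years_exp) and sample data are illustrative.

rows = [
    {"store_id": 1, "employees": [{"name": "Ana", "years_exp": 7},
                                  {"name": "Bo",  "years_exp": 3}]},
    {"store_id": 2, "employees": [{"name": "Cy",  "years_exp": 6}]},
]

for row in rows:
    # SQL: FILTER(employees, e -> e.years_exp > 5) AS exp_employees
    row["exp_employees"] = [e for e in row["employees"] if e["years_exp"] > 5]

print(rows[0]["exp_employees"])  # [{'name': 'Ana', 'years_exp': 7}]
```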
Question 31
Which of the following commands will return the number of null values in the member_id column?
A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;
Show Answer
Correct Answer: C
Explanation:
In Databricks SQL, count_if(condition) counts rows where the condition evaluates to true. Using count_if(member_id IS NULL) directly returns the number of NULL values. COUNT(member_id) in A counts only non-NULL values, B uses a non-existent function, and D is not a valid SQL aggregate.
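The difference between options A and C can be checked with a small pure-Python analogue (sample values are illustrative):

```python
# Pure-Python analogue of the two aggregates in options A and C.
# COUNT(col) counts non-NULL values; count_if(col IS NULL) counts NULLs.

member_ids = [101, None, 102, None, None]  # illustrative column values

count_member_id = sum(1 for v in member_ids if v is not None)  # COUNT(member_id)
count_if_null = sum(1 for v in member_ids if v is None)        # count_if(member_id IS NULL)

print(count_member_id, count_if_null)  # 2 3
```

Option A returns 2 here (the non-NULLs), while the count_if form from option C returns the 3 NULLs the question asks for.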
Question 23
A data engineer is developing a small proof of concept in a notebook. When they run the entire notebook, cluster usage spikes. The data engineer wants to keep developing interactively and get real-time results.
Which cluster type meets these requirements?
A. All-Purpose Cluster with autoscaling
B. Job Cluster with Photon enabled and autoscaling
C. Job Cluster with autoscaling enabled
D. All-Purpose Cluster with a large fixed memory size
Show Answer
Correct Answer: A
Explanation:
The scenario is interactive notebook-based development with a need for real-time results. All-Purpose Clusters are designed for interactive workloads like notebooks and proofs of concept. Enabling autoscaling handles usage spikes efficiently without over-provisioning. Job clusters are intended for scheduled or automated jobs, not interactive development, and a fixed-size cluster is inefficient for spiky workloads.
Question 152
A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data.
They run the following command:
DROP TABLE IF EXISTS my_table;
While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?
A. The table’s data was larger than 10 GB
B. The table’s data was smaller than 10 GB
C. The table was external
D. The table did not have a location
E. The table was managed
Show Answer
Correct Answer: C
Explanation:
In Spark SQL, dropping an external table removes only the table metadata from the metastore, not the underlying data files. Because the table was external, Spark did not manage the data location, so the files remain even though the table no longer appears in SHOW TABLES.
Question 85
A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has an associated Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.
Which approach can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?
A. They can set up an Alert with a custom template.
B. They can set up an Alert with a new email alert destination.
C. They can set up an Alert with a new webhook alert destination.
D. They can set up an Alert with one-time notifications.
Show Answer
Correct Answer: C
Explanation:
Databricks SQL Alerts can be configured to trigger when a query result meets a threshold. To notify an entire team via a messaging system (such as Slack, Teams, or another service), the alert must use a webhook destination. Email destinations only send emails, custom templates do not change the delivery mechanism, and one-time notifications are not suitable for ongoing monitoring. Therefore, setting up an Alert with a new webhook alert destination is the correct approach.
Question 160
Which of the following is hosted completely in the control plane of the classic Databricks architecture?
A. Worker node
B. JDBC data source
C. Databricks web application
D. Databricks Filesystem
E. Driver node
Show Answer
Correct Answer: C
Explanation:
In the classic Databricks architecture, the control plane hosts Databricks-managed services such as the web application (UI), REST APIs, workspace metadata, authentication, and cluster management. The Databricks web application is entirely hosted in the control plane. Worker nodes and the driver node run in the data plane where computation occurs, DBFS data resides in the customer’s data plane storage (with only some metadata in the control plane), and JDBC data sources are external to Databricks.
Question 61
A data engineer needs to access a view created by the sales team, using a shared cluster. The data engineer has already been granted usage permissions on the catalog and schema.
What are the minimum additional permissions the data engineer requires to access the view?
A. Needs SELECT permission on the VIEW and the underlying TABLE.
B. Needs SELECT permission only on the VIEW
C. Needs ALL PRIVILEGES on the VIEW
D. Needs ALL PRIVILEGES at the SCHEMA level
Show Answer
Correct Answer: B
Explanation:
In Databricks Unity Catalog on a shared cluster, a view is evaluated with its owner's privileges on the underlying tables. Since the data engineer already has USE CATALOG and USE SCHEMA, the only additional permission required to query the view is SELECT on the view itself; SELECT on the underlying tables is not required.
Question 130
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
A. None of these
B. Data lake
C. Data warehouse
D. All of these
E. Data lakehouse
Show Answer
Correct Answer: E
Explanation:
A data lakehouse is designed to unify and simplify siloed data architectures by combining the flexibility and scalability of data lakes with the governance, performance, and reliability of data warehouses. This allows multiple specialized use cases (BI, analytics, ML) to operate on a single platform, reducing fragmentation.
Question 70
A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.
They have the following incomplete code block:
____(f"SELECT customer_id, spend FROM {table_name}")
What can be used to fill in the blank to successfully complete the task?
A. spark.delta.sql
B. spark.sql
C. spark.table
D. dbutils.sql
Show Answer
Correct Answer: B
Explanation:
In PySpark, dynamic SQL strings that reference Python variables are executed using spark.sql(). The f-string constructs the SQL text, and spark.sql(...) runs it against the Spark SQL engine. The other options either do not exist (spark.delta.sql, dbutils.sql) or do not accept arbitrary SQL strings (spark.table expects a table name only).
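The mechanics are worth seeing in isolation: the f-string is resolved by Python before Spark ever sees the text. A minimal sketch (the value of `table_name` is illustrative; the `spark.sql` call is shown only as a comment because it assumes a live Spark session):

```python
# The f-string substitutes the Python variable into the SQL text first;
# spark.sql() would then receive an ordinary SQL string.

table_name = "customers"  # illustrative value
query = f"SELECT customer_id, spend FROM {table_name}"
print(query)  # SELECT customer_id, spend FROM customers

# In a Databricks notebook (Spark session assumed):
# df = spark.sql(query)
```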
Question 112
A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
What is the reason behind the deletion of all these files?
A. The table was managed
B. The table's data was smaller than 10 GB
C. The table did not have a location
D. The table was external
Show Answer
Correct Answer: A
Explanation:
In Spark SQL, dropping a managed table removes both the table metadata and the underlying data files. Since running DROP TABLE deleted all data and metadata from the file system, the table must have been a managed table. External tables only remove metadata and leave the data files intact.
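The contrast between this question and Question 152 above can be captured in a toy model (a simplified sketch of the semantics, not Spark's actual implementation; paths are illustrative):

```python
# Toy model of DROP TABLE semantics: the metastore entry is always
# removed, but data files are deleted only for managed tables.

def drop_table(metastore, filesystem, name):
    entry = metastore.pop(name, None)  # metadata is always removed
    if entry and entry["managed"]:
        filesystem.discard(entry["location"])  # managed: data deleted too

metastore = {
    "my_table":  {"managed": True,  "location": "/warehouse/my_table"},
    "ext_table": {"managed": False, "location": "/data/ext_table"},
}
filesystem = {"/warehouse/my_table", "/data/ext_table"}

drop_table(metastore, filesystem, "my_table")   # managed: files deleted
drop_table(metastore, filesystem, "ext_table")  # external: files remain

print(sorted(filesystem))  # ['/data/ext_table']
```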
Question 55
A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?
Show Answer
Correct Answer: A
Explanation:
A SQL UDF is created using the CREATE FUNCTION syntax, defining input parameters, a return type, and a SQL expression or body. Option A follows the correct CREATE FUNCTION pattern for a SQL user-defined function, whereas the other options either use invalid syntax or do not define a SQL UDF correctly.
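A hypothetical UDF following that pattern might look like the statement below (function name, parameter, and body are all illustrative, not the exam's option A; it is held as a Python string of the kind one would pass to `spark.sql()`):

```python
# The CREATE FUNCTION pattern the explanation describes: a name with
# typed parameters, a RETURNS clause, and a SQL expression body.
# All identifiers and logic here are illustrative examples.
create_udf_sql = """
CREATE FUNCTION normalize_city(city STRING)
RETURNS STRING
RETURN CASE WHEN city = 'brooklyn' THEN 'new york' ELSE city END
"""

# In a Databricks notebook (Spark session assumed):
# spark.sql(create_udf_sql)
print("CREATE FUNCTION" in create_udf_sql)  # True
```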
Question 117
Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?
A. Silver tables contain a less refined, less clean view of data than Bronze data.
B. Silver tables contain aggregates while Bronze data is unaggregated.
C. Silver tables contain more data than Bronze tables.
D. Silver tables contain a more refined and cleaner view of data than Bronze tables.
E. Silver tables contain less data than Bronze tables.
Show Answer
Correct Answer: D
Explanation:
In the Medallion (Bronze–Silver–Gold) architecture, Bronze tables store raw, minimally processed ingested data, while Silver tables apply cleaning, standardization, and basic transformations. Therefore, Silver tables always represent a more refined and cleaner view of the data than Bronze tables.