DP-100 Exam Overview
Prepare for the Microsoft DP-100 certification exam
with our comprehensive study guide. This study material contains 477 practice questions
sourced from real exams and expert-verified for accuracy. Each question includes the correct answer
and a detailed explanation to help you understand the material thoroughly.
The DP-100 exam — Designing and Implementing a Data Science Solution on Azure — is offered by Microsoft.
Passing this exam earns you the Microsoft Certified: Azure Data Scientist Associate credential,
an industry-recognized certification that validates your expertise.
Our study materials were last updated on 2026-02-14 to reflect the
most recent exam objectives and content.
About the Microsoft Certified: Azure Data Scientist Associate
The Microsoft Certified: Azure Data Scientist Associate certification is awarded by Microsoft
to professionals who demonstrate competence in the skills measured by the DP-100 exam.
According to the
official Microsoft certification page,
this certification validates your ability to work with the technologies covered in the exam objectives.
According to the
Global Knowledge IT Skills and Salary Report,
certified IT professionals earn 15-25% more than their non-certified peers.
Certifications from Microsoft are among the most recognized credentials in the IT industry,
with strong demand across enterprise organizations worldwide.
Free Sample — 15 Practice Questions
Preview 15 of 477 questions from the DP-100 exam.
Try before you buy — purchase the full study guide for all 477 questions with answers and explanations.
Question 387
You are creating a classification model for a banking company to identify possible instances of credit card fraud. You plan to create the model in Azure Machine
Learning by using automated machine learning.
The training dataset that you are using is highly unbalanced.
You need to evaluate the classification model.
Which primary metric should you use?
A. normalized_mean_absolute_error
B. AUC_weighted
C. accuracy
D. normalized_root_mean_squared_error
E. spearman_correlation
Show Answer
Correct Answer: B
Explanation:
For highly imbalanced classification problems like credit card fraud, accuracy is misleading because it is dominated by the majority class. AUC_weighted evaluates the model’s ability to discriminate between classes while weighting each class by its sample proportion, making it more robust and appropriate for imbalanced datasets in Azure AutoML. The other options are regression or correlation metrics and are not suitable for classification.
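A quick pure-Python sketch of the point above; the data and scores are made up for illustration, not taken from the exam. A classifier that always predicts the majority class scores high on accuracy while its AUC (computed here by pairwise ranking) stays at chance level:

```python
# Illustrative only: why accuracy misleads on an imbalanced fraud dataset.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    # Probability that a random positive outranks a random negative,
    # with ties counting as half: the pairwise form of ROC AUC.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 fraud case in 100 transactions: always predicting "not fraud"
# looks excellent on accuracy...
y_true = [1] + [0] * 99
print(accuracy(y_true, [0] * 100))   # 0.99
# ...but the constant score carries no ranking information at all.
print(auc(y_true, [0.0] * 100))      # 0.5
```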
Question 164
You manage an Azure Machine Learning workspace by using the Azure CLI ml extension v2.
You need to define a YAML schema to create a compute cluster.
Which schema should you use?
A. https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
B. https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
C. https://azuremlschemas.azureedge.net/latest/vmCompute.schema.json
D. https://azuremlschemas.azureedge.net/latest/kubernetesCompute.schema.json
Show Answer
Correct Answer: B
Explanation:
In Azure Machine Learning CLI v2, a compute *cluster* is defined using the **amlCompute** resource type. The corresponding YAML schema for creating and managing scalable AML compute clusters is `amlCompute.schema.json`. The other schemas apply to different compute types: computeInstance (single-user VM), vmCompute (attached VM), and kubernetesCompute (AKS/Arc Kubernetes).
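For context, a minimal compute-cluster definition referencing that schema might look like the fragment below, used with `az ml compute create -f cluster.yml`. The name, VM size, and instance counts are illustrative assumptions, not values from the exam:

```yaml
# Hypothetical cluster.yml; only $schema and type come from the answer above.
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 120
```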
Question 37
HOTSPOT -
You design a data processing strategy for a machine learning project.
The data that must be processed includes unstructured flat files that must be processed in real time.
The data transformation must be executed on a serverless compute and optimized for big data analytical workloads.
You need to select the Azure services for the data science team.
Which storage and data processing services should you use? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Show Answer
Correct Answer: Data storage for model training workloads:
Azure Data Lake Storage Gen2
Data processing solution:
Azure Databricks
Explanation:
Azure Data Lake Storage Gen2 is optimized for big data analytics and unstructured data used in model training. Azure Databricks provides scalable, serverless Spark-based processing suitable for real-time transformation and large analytical workloads.
Question 144
HOTSPOT -
You are implementing hyperparameter tuning for model training from a notebook. The notebook is in an Azure Machine Learning workspace. You add code that imports all relevant Python libraries.
You must configure Bayesian sampling over the search space for the num_hidden_layers and batch_size hyperparameters.
You need to complete the following Python code to configure Bayesian sampling.
Which code segments should you use? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Show Answer
Correct Answer: choice
range
Explanation:
For Bayesian sampling in Azure ML, discrete hyperparameters like batch_size with specific step values are defined using a choice over a generated range (e.g., range(16, 128, 16)). The learning rate already uses a supported continuous distribution.
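A small pure-Python look at the discrete values that choice over that range enumerates. In the SDK v2 notebook these lists would be wrapped with `Choice` from `azure.ml.sweep`; that wiring is shown only as a comment, and the num_hidden_layers values are an illustrative assumption:

```python
# Discrete search values for the example range from the explanation above.
batch_sizes = list(range(16, 128, 16))   # step of 16, upper bound excluded
num_hidden_layers = list(range(1, 5))    # assumed 1-4 layers, for illustration

print(batch_sizes)        # [16, 32, 48, 64, 80, 96, 112]
print(num_hidden_layers)  # [1, 2, 3, 4]

# Hedged sketch of the sweep-space wiring (not executed here):
# job_for_sweep = job(
#     batch_size=Choice(values=batch_sizes),
#     num_hidden_layers=Choice(values=num_hidden_layers),
# )
```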
Question 507
You are performing feature engineering on a dataset.
You must add a feature named CityName and populate the column value with the text London.
You need to add the new feature to the dataset.
Which Azure Machine Learning Studio module should you use?
A. Extract N-Gram Features from Text
B. Edit Metadata
C. Preprocess Text
D. Apply SQL Transformation
Show Answer
Correct Answer: D
Explanation:
To add a new feature (column) and populate it with a constant value like 'London' for all rows, you need a data transformation that can create columns and assign values. The Apply SQL Transformation module supports SQL statements (e.g., SELECT *, 'London' AS CityName) to add and populate a new column. Edit Metadata only modifies properties of existing columns and cannot create new ones.
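The same SQL pattern can be checked with the standard-library sqlite3 module. The table contents here are invented for the demonstration; only the `SELECT *, 'London' AS CityName` idea comes from the answer:

```python
# Verifying the Apply SQL Transformation pattern with stdlib sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (Id INTEGER, Amount REAL)")
con.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, 9.5), (2, 3.25)])

# Keep every existing column and add a constant-valued new one.
rows = con.execute("SELECT *, 'London' AS CityName FROM t1").fetchall()
print(rows)  # [(1, 9.5, 'London'), (2, 3.25, 'London')]
```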
Question 449
You register a model that you plan to use in a batch inference pipeline.
The batch inference pipeline must use a ParallelRunStep step to process files in a file dataset. The script that the ParallelRunStep step runs must process six input files each time the inferencing function is called.
You need to configure the pipeline.
Which configuration setting should you specify in the ParallelRunConfig object for the ParallelRunStep step?
A. process_count_per_node= "6"
B. node_count= "6"
C. mini_batch_size= "6"
D. error_threshold= "6"
Show Answer
Correct Answer: C
Explanation:
In a ParallelRunStep using a FileDataset, the mini_batch_size setting controls how many input files are passed to the run() (inferencing) function in each invocation. Since the requirement is to process six input files each time the inferencing function is called, mini_batch_size must be set to 6. The other options control compute scaling or error handling, not per-call file grouping.
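A toy model of that grouping behaviour, using invented file names. It is not the real ParallelRunStep implementation, just a sketch of how mini_batch_size=6 partitions the input files across calls to the scoring script's run() function:

```python
# Simulate how mini_batch_size groups FileDataset inputs for run().
def make_mini_batches(files, mini_batch_size):
    return [files[i:i + mini_batch_size]
            for i in range(0, len(files), mini_batch_size)]

files = [f"input_{i:02d}.csv" for i in range(18)]  # hypothetical inputs
batches = make_mini_batches(files, 6)

print(len(batches))     # 3 invocations of run()
print(len(batches[0]))  # 6 files per invocation
```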
Question 369
HOTSPOT -
A coworker registers a datastore in a Machine Learning services workspace by using the following code:
You need to write code to access the datastore from a notebook.
How should you complete the code segment? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Show Answer
Correct Answer: Datastore
ws
demo_datastore
Explanation:
Use the Datastore class to retrieve a registered datastore from the workspace. The get() method takes the workspace object and the datastore name used during registration.
Question 1
You are implementing hyperparameter tuning by using Bayesian sampling for Azure ML Python SDK v2-based model training from a notebook. The notebook is in an Azure Machine Learning workspace. The notebook uses a training script that runs on a compute cluster with 20 nodes.
The code implements a Bandit termination policy with slack_factor set to 0.2 and a sweep job with max_concurrent_trials set to 10.
You must increase the effectiveness of the tuning process by improving sampling convergence.
You need to select the configuration change to make.
What should you select?
A. Set the value of max_concurrent_trials to 20.
B. Set the value of slack_factor of early_termination policy to 0.1.
C. Set the value of slack_factor of early_termination policy to 0.9.
D. Set the value of max_concurrent_trials to 4.
Show Answer
Correct Answer: D
Explanation:
Bayesian sampling in Azure ML is sequential and relies on results from completed trials to guide the next hyperparameter choices. With a high max_concurrent_trials value, many runs start simultaneously and cannot benefit from each other’s outcomes, reducing convergence and behaving more like random sampling. Lowering max_concurrent_trials (for example, to 4) allows more completed results to inform subsequent trials, improving Bayesian sampling convergence. Changing the slack_factor affects early termination aggressiveness, not sampling convergence.
Question 315
You write a Python script that processes data in a comma-separated values (CSV) file.
You plan to run this script as an Azure Machine Learning experiment.
The script loads the data and determines the number of rows it contains using the following code:
You need to record the row count as a metric named row_count that can be returned using the get_metrics method of the Run object after the experiment run completes.
Which code should you use?
A. run.upload_file('row_count', './data.csv')
B. run.log('row_count', rows)
C. run.tag('row_count', rows)
D. run.log_table('row_count', rows)
E. run.log_row('row_count', rows)
Show Answer
Correct Answer: B
Explanation:
To record a numeric metric that can be retrieved later with Run.get_metrics(), you must log it as a metric on the run. The run.log(name, value) method is designed for logging scalar metrics like a row count. Tags are metadata, uploads are files, and log_table/log_row are for tabular data, not a single numeric metric.
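A minimal mock of the Run object makes the distinction concrete. This is an illustrative stand-in that mimics the SDK behaviour described above, not the real azureml-core API:

```python
# Mock Run: only values passed to log() come back from get_metrics().
class MockRun:
    def __init__(self):
        self._metrics, self._tags = {}, {}

    def log(self, name, value):      # scalar metrics -> get_metrics()
        self._metrics[name] = value

    def tag(self, name, value):      # tags are run metadata, not metrics
        self._tags[name] = str(value)

    def get_metrics(self):
        return dict(self._metrics)

run = MockRun()
rows = 5000                     # pretend the CSV had 5000 rows
run.log('row_count', rows)      # answer B: retrievable as a metric
run.tag('row_count', rows)      # answer C: stored only as metadata
print(run.get_metrics())        # {'row_count': 5000}
```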
Question 73
HOTSPOT -
You create multiple machine learning models by using automated machine learning.
You need to configure a primary metric for each use case.
Which metrics should you configure? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Show Answer
Correct Answer: Bug resolution time (regression): r2_score
Sentiment analysis (classification): accuracy
Explanation:
Regression tasks use r2_score to measure how well predicted values explain variance in a continuous target. Classification tasks like sentiment analysis commonly use accuracy to measure the proportion of correctly classified labels.
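Both metrics are simple enough to compute by hand; the toy data below is synthetic, chosen only to show what each metric measures:

```python
# r2_score: 1 - (residual sum of squares / total sum of squares).
def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# accuracy: fraction of labels predicted exactly right.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Regression target, e.g. bug resolution time in hours.
print(r2_score([2.0, 4.0, 6.0], [2.0, 4.0, 6.0]))  # 1.0 for a perfect fit

# Classification target, e.g. sentiment labels.
print(accuracy(['pos', 'neg', 'neg', 'pos'],
               ['pos', 'neg', 'pos', 'pos']))       # 0.75
```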
Question 420
DRAG DROP -
You create a training pipeline using the Azure Machine Learning designer. You upload a CSV file that contains the data from which you want to train your model.
You need to use the designer to create a pipeline that includes steps to perform the following tasks:
✑ Select the training features using the pandas filter method.
✑ Train a model based on the naive_bayes.GaussianNB algorithm.
✑ Return only the Scored Labels column by using the query SELECT [Scored Labels] FROM t1;
Which modules should you use? To answer, drag the appropriate modules to the appropriate locations. Each module name may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Show Answer
Correct Answer: Execute Python Script
Create Python Model
Apply SQL Transformation
Explanation:
Execute Python Script is used to apply the pandas filter method on the dataset. Create Python Model allows defining and training a custom model using naive_bayes.GaussianNB, which is not available as a built-in designer algorithm. Apply SQL Transformation is used after scoring to return only the Scored Labels column using the specified SELECT query.
Question 508
HOTSPOT -
You are creating a machine learning model in Python. The provided dataset contains several numerical columns and one text column. The text column represents a product's category. The product category will always be one of the following:
✑ Bikes
✑ Cars
✑ Vans
✑ Boats
You are building a regression model using the scikit-learn Python package.
You need to transform the text data to be compatible with the scikit-learn Python package.
How should you complete the code segment? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Show Answer
Correct Answer: pandas as df
map(ProductCategoryMapping)
Explanation:
pandas is required to load CSV data into a DataFrame. The pandas Series.map() method converts categorical text values into numeric values using the provided dictionary, making the column compatible with scikit-learn regression models.
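The mapping step can be previewed with a plain dictionary; in the exam code the same dictionary is applied with the pandas Series.map() method. The numeric codes assigned below are an assumption for illustration:

```python
# Hypothetical category-to-number mapping (codes chosen arbitrarily).
ProductCategoryMapping = {'Bikes': 0, 'Cars': 1, 'Vans': 2, 'Boats': 3}

categories = ['Cars', 'Bikes', 'Boats', 'Cars']
encoded = [ProductCategoryMapping[c] for c in categories]
print(encoded)  # [1, 0, 3, 1]
```

Note that this integer encoding imposes an artificial ordering on the categories; it satisfies scikit-learn's numeric-input requirement, which is what the question asks for.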
Question 9
DRAG DROP -
You are designing an Azure Machine Learning solution by using the Python SDK v2.
You must train and deploy the solution by using a compute target. The compute target must meet the following requirements:
• Enable the use of on-premises compute resources.
• Support autoscaling.
You need to configure a compute target for training and inference.
Which compute targets should you configure?
To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Show Answer
Correct Answer: Training: Azure Machine Learning Kubernetes
Inference: Azure Machine Learning Kubernetes
Explanation:
Azure Machine Learning Kubernetes (AKS or Arc-enabled Kubernetes) supports autoscaling and can be connected to on‑premises resources via Azure Arc. Local compute does not support autoscaling, and Apache Spark pools are cloud-only and not suitable for on‑premises inference.
Question 222
You create an Azure Machine Learning workspace. The workspace contains a dataset named sample_dataset, a compute instance, and a compute cluster.
You must create a two-stage pipeline that will prepare data in the dataset and then train and register a model based on the prepared data.
The first stage of the pipeline contains the following code:
You need to identify the location containing the output of the first stage of the script that you can use as input for the second stage.
Which storage location should you use?
A. workspaceblobstore datastore
B. workspacefilestore datastore
C. compute instance
D. compute_cluster
Show Answer
Correct Answer: A
Explanation:
In Azure Machine Learning pipelines, outputs from a pipeline step (for example, via OutputFileDatasetConfig) are written to the workspace’s default datastore unless otherwise specified. The default datastore is an Azure Blob Storage container named workspaceblobstore, which is designed to store experiment artifacts, intermediate data, and pipeline outputs. Compute instances or clusters are transient execution environments, and workspacefilestore is intended mainly for user files such as notebooks, not pipeline stage outputs.
Question 151
HOTSPOT -
You plan to use a curated environment to run Azure Machine Learning training experiments in a workspace.
You need to display all curated environments and their respective packages in the workspace.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Show Answer
Correct Answer: AzureML
python
Explanation:
Curated Azure ML environments are identified by names starting with "AzureML". Package details are accessed through the environment’s Python section, where conda_dependencies can be serialized to list all installed packages.
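A toy illustration of the name filter, using invented environment names and package lists rather than a live workspace (where Environment.list would supply the real entries):

```python
# Curated environments are the ones whose names start with "AzureML".
envs = {
    "AzureML-sklearn-1.5": ["scikit-learn", "pandas"],
    "AzureML-pytorch-2.2": ["torch", "torchvision"],
    "my-custom-env": ["numpy"],
}

curated = {name: pkgs for name, pkgs in envs.items()
           if name.startswith("AzureML")}
print(sorted(curated))  # ['AzureML-pytorch-2.2', 'AzureML-sklearn-1.5']
```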