Free Sample — 15 Practice Questions
Preview 15 of 44 questions from the Machine Learning Associate exam.
Try before you buy — purchase the full study guide for all 44 questions with answers and explanations.
Question 13
A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
A. They can refactor their notebook to process the data in parallel.
B. They can refactor their notebook to use the PySpark DataFrame API.
C. They can refactor their notebook to use the Scala Dataset API.
D. They can refactor their notebook to use Spark SQL.
E. They can refactor their notebook to utilize the pandas API on Spark.
Show Answer
Correct Answer: E
Explanation:
The pandas API on Spark is specifically designed to let users run existing pandas-style code on top of Spark with minimal changes. This allows the notebook to scale to big data while requiring the least refactoring effort compared to rewriting logic in PySpark, Scala, or Spark SQL, or redesigning for parallelism manually.
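To illustrate how small the change can be, here is a sketch of pandas-style cleaning code (the file name prices.csv is hypothetical): with the pandas API on Spark, often only the import line changes.

```python
import pandas as pd

# With the pandas API on Spark, the same logic typically needs only:
#   import pyspark.pandas as ps
#   df = ps.read_csv("prices.csv")   # hypothetical file, for illustration
df = pd.DataFrame({"price": [100.0, None, 300.0]})
cleaned = df.dropna()  # identical call in both APIs
```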
Question 1
A data scientist is developing a machine learning model to predict house prices in a competitive real estate market. They initially select a loss function that heavily penalizes large errors, hoping it will improve the model’s performance. However, after training, they observe that the model struggles to converge and produces unstable predictions, with large variations in price predictions for similar houses. The data scientist suspects that the chosen loss function is causing these issues.
Why is it crucial to select the right loss function in this situation?
A. The loss function ensures that the data is balanced.
B. The loss function directly influences how the model's parameters are updated during training.
C. The loss function controls the size of the training set.
D. The loss function determines the computational efficiency of the model.
Show Answer
Correct Answer: B
Explanation:
The loss function defines the objective that training optimizes. Its gradients determine how model parameters are updated. If the loss over-penalizes large errors, it can lead to unstable gradients, poor convergence, and highly variable predictions, making the correct choice of loss function crucial for stable and effective learning.
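A minimal sketch of why the choice matters: the gradient of a squared-error loss grows linearly with the error, so a single large outlier can dominate parameter updates, while an absolute-error gradient stays bounded.

```python
def squared_error_grad(pred, actual):
    # d/dpred of (pred - actual)^2 -- grows with the size of the error
    return 2 * (pred - actual)

def absolute_error_grad(pred, actual):
    # d/dpred of |pred - actual| -- bounded at +/-1 regardless of error size
    return 1.0 if pred > actual else -1.0

# one outlier yields a huge squared-error gradient but a bounded absolute-error gradient
big_update = squared_error_grad(1_000_000.0, 500_000.0)     # 1,000,000
small_update = absolute_error_grad(1_000_000.0, 500_000.0)  # 1.0
```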
Question 34
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.
Which of the following feature engineering tasks will be the least efficient to distribute?
A. One-hot encoding categorical features
B. Target encoding categorical features
C. Imputing missing feature values with the mean
D. Imputing missing feature values with the true median
E. Creating binary indicator features for missing values
Show Answer
Correct Answer: D
Explanation:
Computing the true median in a distributed system is inefficient because it generally requires a global sort or order-statistics operation, which involves significant data shuffling and coordination across nodes. In contrast, mean-based operations (mean imputation, target encoding) and feature transformations like one-hot encoding or missing-value indicators can be parallelized using local aggregates and reduce steps, making them more scalable.
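The difference can be sketched in plain Python, with lists standing in for Spark partitions: the mean combines cheap per-partition aggregates, while an exact median needs every value in one ordered view.

```python
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]  # stand-ins for distributed partitions

# mean: each partition emits a local (sum, count); a small reduce finishes the job
local = [(sum(p), len(p)) for p in partitions]
total, n = (sum(x) for x in zip(*local))
mean = total / n  # 3.0

# exact median: all values must be gathered into one sorted order (a shuffle in Spark)
ordered = sorted(v for p in partitions for v in p)
median = ordered[len(ordered) // 2]  # 3.0, but only after a global sort
```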
Question 33
A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML experiment?
A. Model tuning
B. Model evaluation
C. Model deployment
D. Exploratory data analysis
Show Answer
Correct Answer: C
Explanation:
Databricks AutoML automates data preparation, model training, hyperparameter tuning, and evaluation within the experiment, and it even generates a data exploration notebook covering basic exploratory data analysis. Deploying the resulting model—registering it and serving it for inference—is not part of the AutoML experiment, so model deployment is the step the data scientist must perform outside of it.
Question 26
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:
Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?
A. Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values
B. Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values
C. Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data
D. Utilize the Pipeline API to standardize the training data according to the test data's summary statistics
E. Utilize the Pipeline API to standardize the test data according to the training data's summary statistics
Show Answer
Correct Answer: E
Explanation:
Standardizing features before splitting causes data leakage because test-set information influences the scaling parameters. The correct approach is to fit the scaler on the training data only and then apply the learned summary statistics to the test data. Using Spark’s Pipeline API ensures the scaler is fit on the training set and consistently applied to the test set without leakage.
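The leakage-free recipe can be sketched in plain Python (Spark's scalers inside a Pipeline apply the same idea): fit the scaling statistics on the training rows only, then reuse them for the test rows.

```python
train = [1.0, 2.0, 3.0]
test = [4.0]

# fit: summary statistics come from the training data only
mu = sum(train) / len(train)
sd = (sum((x - mu) ** 2 for x in train) / len(train)) ** 0.5

# transform: the *training* statistics are applied to the test data
scaled_test = [(x - mu) / sd for x in test]
```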
Question 14
Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata
B. pandas API on Spark DataFrames are more performant than Spark DataFrames
C. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
D. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
E. pandas API on Spark DataFrames are unrelated to Spark DataFrames
Show Answer
Correct Answer: C
Explanation:
The pandas API on Spark (formerly the Koalas project) is implemented on top of native Spark DataFrames. It wraps a Spark DataFrame and maintains additional metadata to provide a pandas-like interface and semantics while executing operations in a distributed manner. It is not single-node, not inherently more performant, and not unrelated to Spark DataFrames.
Question 18
What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
A. Leave-one-out encoding
B. Target encoding
C. One-hot encoding
D. Categorical embeddings
E. String indexing
Show Answer
Correct Answer: C
Explanation:
One-hot encoding converts each category of a categorical feature into separate binary (0/1) indicator variables, which directly matches the description in the question. The other options encode categories using target statistics, embeddings, indices, or leave-one-out schemes rather than binary indicators.
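For illustration, pandas' get_dummies performs exactly this transformation: each category becomes its own binary indicator column.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["color"])
# yields indicator columns color_blue and color_red with 0/1 values per row
```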
Question 12
A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.
They have developed the following code block to accomplish this task:
The code block is not accomplishing the task.
Which of the following reasons describes why the code block is not accomplishing the imputation task?
A. It does not impute both the training and test data sets.
B. The inputCols and outputCols need to be exactly the same.
C. The fit method needs to be called instead of transform.
D. It does not fit the imputer on the data to create an ImputerModel.
Show Answer
Correct Answer: D
Explanation:
In Spark ML, Imputer is an Estimator and cannot directly transform data. It must first be fit on the DataFrame to compute the median values, producing an ImputerModel, and only that model can perform the transformation. The issue is that the code does not fit the imputer to create an ImputerModel.
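The Estimator/Model split can be mimicked in a few lines of plain Python (an analogy, not the Spark API): fit() computes the statistic and returns a model object, and only that model can transform data.

```python
class MedianImputerModel:
    """Analog of Spark's ImputerModel: holds a learned statistic and transforms data."""
    def __init__(self, median):
        self.median = median

    def transform(self, values):
        return [self.median if v is None else v for v in values]


class MedianImputer:
    """Analog of Spark's Imputer (an Estimator): fit() returns a Model."""
    def fit(self, values):
        present = sorted(v for v in values if v is not None)
        return MedianImputerModel(present[len(present) // 2])


# calling transform() on the estimator itself is the bug; fit() first, then transform
model = MedianImputer().fit([1.0, None, 3.0, 2.0])
imputed = model.transform([1.0, None, 3.0, 2.0])  # [1.0, 2.0, 3.0, 2.0]
```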
Question 24
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
Show Answer
Correct Answer: D
Explanation:
There are 3 × 2 = 6 hyperparameter combinations in the grid. With 3-fold cross-validation, each combination requires training 3 separate models (one per fold), giving 6 × 3 = 18 total model training runs. These training runs are the units that can be executed in parallel, so the correct answer is 18.
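The count can be verified in a couple of lines:

```python
hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
folds = 3

combinations = len(hyperparameter_1) * len(hyperparameter_2)  # 6
total_models = combinations * folds                           # 18 parallelizable fits
```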
Question 35
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
A. One-hot encoding is not supported by most machine learning libraries.
B. One-hot encoding is dependent on the target variable’s values which differ for each application.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
Show Answer
Correct Answer: E
Explanation:
A feature repository is meant to provide reusable, model-agnostic features. One-hot encoding bakes in a specific representation that can be suboptimal or even problematic for certain algorithms (e.g., tree-based models, models handling high-cardinality categories, or systems expecting raw categorical inputs). Therefore, performing one-hot encoding centrally in the repository reduces flexibility and can negatively affect downstream models, justifying why it should be avoided there.
Question 23
A data scientist wants to explore summary statistics for the Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?
A. spark_df.summary()
B. spark_df.stats()
C. spark_df.describe().head()
D. spark_df.printSchema()
E. spark_df.toPandas()
Show Answer
Correct Answer: A
Explanation:
spark_df.summary() returns summary statistics for numeric columns including count, mean, standard deviation, min, max, and the 25%, 50%, and 75% percentiles. The interquartile range (IQR) can be obtained from the 75% and 25% values. The other options do not provide quartiles or are unrelated.
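The same quartile arithmetic can be illustrated with pandas' describe(), which reports the analogous statistics for a single-node frame:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
stats = s.describe()               # count, mean, std, min, 25%, 50%, 75%, max
iqr = stats["75%"] - stats["25%"]  # 4.0 - 2.0 = 2.0
```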
Question 11
A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.
They have developed this code block to accomplish this task:
The code block is returning an error.
Which of the following adjustments does the data scientist need to make to accomplish this task?
A. They need to specify the method parameter to the OneHotEncoder.
B. They need to remove the line with the fit operation.
C. They need to use StringIndexer prior to one-hot encoding the features.
D. They need to use VectorAssembler prior to one-hot encoding the features.
Show Answer
Correct Answer: C
Explanation:
In Spark ML, OneHotEncoder does not accept string columns directly; it requires numeric category indices as input. StringIndexer must first be applied to each categorical string column to convert them into indexed numeric columns, which can then be one-hot encoded. The error arises because this prerequisite step is missing.
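The two-step requirement can be sketched in plain Python (an analogy for StringIndexer followed by OneHotEncoder, not the Spark API itself): strings are first mapped to integer indices, and only those indices are expanded into indicator vectors.

```python
colors = ["red", "blue", "red", "green"]

# step 1 -- StringIndexer analog: map each string category to a numeric index
index = {c: i for i, c in enumerate(dict.fromkeys(colors))}  # {'red': 0, 'blue': 1, 'green': 2}
indexed = [index[c] for c in colors]                         # [0, 1, 0, 2]

# step 2 -- OneHotEncoder analog: expand each index into a binary indicator vector
one_hot = [[1 if i == idx else 0 for i in range(len(index))] for idx in indexed]
```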
Question 19
A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:
A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.
Which of the following is a negative consequence of the approach suggested by the colleague?
A. The model will take longer to train for each unique combination of hyperparameter values
B. The feature engineering stages will be computed using validation data
C. The cross-validation process will no longer be parallelizable
D. The cross-validation process will no longer be reproducible
E. The model will be refit one more time per cross-validation fold
Show Answer
Correct Answer: B
Explanation:
Placing the cross-validation object as the final stage of the pipeline causes the feature engineering steps to be fitted outside the CV splits. As a result, those transformations are learned using the full dataset, including validation folds, leading to data leakage where feature engineering is computed using validation data.
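A toy illustration of the leakage (plain Python, not Spark): a transformer statistic fitted once over all rows absorbs information from rows that later serve as validation data.

```python
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_rows, validation_rows = data[:4], data[4:]

# correct: the feature-engineering statistic sees only the training rows of the split
clean_mean = sum(train_rows) / len(train_rows)  # 2.5

# leaky: fitting the pipeline stages once over everything includes validation rows
leaky_mean = sum(data) / len(data)              # 3.5 -- validation data leaked in
```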
Question 10
A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE
actual DOUBLE
Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?
Show Answer
Correct Answer: B
Explanation:
The RMSE for a Spark ML regression model is computed using RegressionEvaluator with metricName set to "rmse", and with the prediction and label columns matching the DataFrame schema (prediction and actual). Option B correctly instantiates a RegressionEvaluator, sets the appropriate columns and metric, and calls evaluate(preds_df) to assign the result to rmse. Other options either use an incorrect evaluator (e.g., classification), wrong metric, or incorrect computation.
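For reference, the metric itself is straightforward to compute by hand from (prediction, actual) pairs (toy values below); RegressionEvaluator performs the distributed equivalent.

```python
# root mean-squared error over (prediction, actual) pairs
pairs = [(3.0, 2.0), (5.0, 5.0), (1.0, 3.0)]
mse = sum((pred - actual) ** 2 for pred, actual in pairs) / len(pairs)
rmse = mse ** 0.5  # sqrt((1 + 0 + 4) / 3)
```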
Question 7
A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.
Which of the following approaches will guarantee a reproducible training and test set for each model?
A. Manually configure the cluster
B. Write out the split data sets to persistent storage
C. Set a seed in the data splitting operation
D. Manually partition the input data
Show Answer
Correct Answer: B
Explanation:
In Spark, operations like randomSplit depend on data partitioning and task execution order. Changing the number of workers can alter partitioning, leading to different train/test splits even with a fixed seed. Writing the split training and test datasets to persistent storage ensures the exact same rows are used in every run, guaranteeing reproducibility regardless of cluster configuration.