Free Sample — 15 Practice Questions
Preview 15 of 310 questions from the Professional Data Engineer exam.
Try before you buy — purchase the full study guide for all 310 questions with answers and explanations.
Question 192
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Dataproc and Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?
A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Dataproc
Correct Answer: B
Explanation:
The pipeline consists of multiple dependent steps across different services (Dataproc and Dataflow) and needs daily scheduling using managed services. Cloud Composer is a fully managed Apache Airflow service designed specifically to orchestrate complex, multi-step workflows with dependencies across GCP services and supports scheduling, retries, and monitoring. cron and Cloud Scheduler only trigger jobs, and Dataproc Workflow Templates are limited to Dataproc jobs only.
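Cloud Composer expresses pipelines as Airflow DAGs written in Python. As a stdlib-only illustration (no Airflow installed, task names hypothetical), the dependency resolution Composer performs amounts to a topological ordering of the task graph:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical task graph: a Dataflow job depends on two Dataproc jobs,
# and a final BigQuery load depends on the Dataflow job.
deps = {
    "dataproc_clean": set(),
    "dataproc_enrich": set(),
    "dataflow_transform": {"dataproc_clean", "dataproc_enrich"},
    "bq_load": {"dataflow_transform"},
}

# Every task appears only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

In a real Composer DAG the same ordering is declared with operators from the Google provider package and run on the DAG's daily schedule; the sketch above only shows the dependency-resolution idea.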
Question 220
You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?
A. BigQuery
B. Cloud Bigtable
C. Cloud Datastore
D. Cloud SQL for PostgreSQL
Correct Answer: A
Explanation:
BigQuery is designed for large-scale analytics and can easily handle 40 TB of data. It has native machine learning capabilities via BigQuery ML for building prediction models directly in SQL, and built-in geospatial support (GEOGRAPHY types, GeoJSON ingestion, spatial functions) for regional analysis. It also integrates naturally with BI dashboards, making it the best fit among the options.
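Both native capabilities can be exercised directly in GoogleSQL. The snippet below sketches the shape of such statements; the dataset, table, and column names are hypothetical:

```python
# Hypothetical BigQuery ML model trained in SQL over the telemetry table.
create_model = """
CREATE OR REPLACE MODEL shipping.delay_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['delayed']) AS
SELECT speed_knots, cargo_tons, delayed
FROM shipping.telemetry
"""

# Hypothetical geospatial query: parse GeoJSON and filter ships by region.
regional_query = """
SELECT ship_id
FROM shipping.telemetry
WHERE ST_WITHIN(ST_GEOGFROMGEOJSON(location_geojson),
                ST_GEOGFROMTEXT(@region_polygon))
"""
```

`ST_GEOGFROMGEOJSON` and the `GEOGRAPHY` functions are the built-in geospatial support the explanation refers to, and `CREATE MODEL` is BigQuery ML's SQL interface.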
Question 277
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
A. Create a Google Cloud Dataflow job to process the data.
B. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
Correct Answer: D
Explanation:
They want to reuse existing Hadoop jobs, minimize cluster management, and persist data beyond the cluster’s lifetime. Cloud Dataproc is a fully managed Hadoop/Spark service that supports lift-and-shift of existing Hadoop jobs. Using the Google Cloud Storage connector decouples storage from compute, so data persists in GCS even after the Dataproc cluster is deleted, while avoiding the operational overhead of managing HDFS or raw Compute Engine clusters.
Question 36
You work for a large ecommerce company. You are using Pub/Sub to ingest the clickstream data to Google Cloud for analytics. You observe that when a new subscriber connects to an existing topic to analyze data, they are unable to subscribe to older data. For an upcoming yearly sale event in two months, you need a solution that, once implemented, will enable any new subscriber to read the last 30 days of data. What should you do?
A. Create a new topic, and publish the last 30 days of data each time a new subscriber connects to an existing topic.
B. Set the topic retention policy to 30 days.
C. Set the subscriber retention policy to 30 days.
D. Ask the source system to re-push the data to Pub/Sub, and subscribe to it.
Correct Answer: B
Explanation:
In Pub/Sub, a new subscription normally only receives messages published after it is created. To allow new subscribers to read historical data, the messages must still be retained by the topic. Setting the topic message retention to 30 days ensures that all published messages are stored for up to 30 days, regardless of acknowledgment. New subscriptions can then seek back to the earliest retained messages and read the last 30 days of data. Subscription retention does not help new subscribers, and republishing data or re-pushing from the source is unnecessary.
Question 216
You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec. How should you migrate this data to Cloud Storage?
A. Use Transfer Appliance to copy the data to Cloud Storage
B. Use gsutil cp -J to compress the content being uploaded to Cloud Storage
C. Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage
D. Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic
Correct Answer: A
Explanation:
Transferring 2 PB over a 20 Mb/sec link would take far longer than six months, making any network-based option impractical. Google Transfer Appliance is designed for petabyte-scale, offline data migrations when bandwidth is limited, allowing data to be copied locally and shipped to Google for ingestion into Cloud Storage.
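The impracticality of the network route is easy to verify with back-of-the-envelope arithmetic:

```python
# 2 PB over a 20 Mb/sec link: how long would the upload take?
data_bits = 2 * 10**15 * 8        # 2 PB expressed in bits
link_bps = 20 * 10**6             # 20 megabits per second
seconds = data_bits / link_bps
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")       # roughly 25 years, far beyond six months
```

Even with compression, the transfer would miss the six-month deadline by more than an order of magnitude, which is exactly the scenario Transfer Appliance targets.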
Question 187
Your company built a TensorFlow neural network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
A. Threading
B. Serialization
C. Dropout Methods
D. Dimensionality Reduction
Correct Answer: C
Explanation:
The model fits the training data well but performs poorly on new data, which is a classic case of overfitting, especially with a large and deep neural network. Dropout is a regularization technique that randomly deactivates a subset of neurons during training, reducing model complexity and preventing co-adaptation of neurons. This improves generalization to unseen data. Threading only affects training speed, serialization is for saving models, and dimensionality reduction is not the primary remedy here given the context.
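A minimal stdlib sketch of inverted dropout shows the mechanism; frameworks such as TensorFlow apply the same idea per layer (e.g. via a dropout layer), so this is illustrative only:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random.random):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - p
    return [a / keep if rng() > p else 0.0 for a in activations]

layer = [0.8, -1.2, 0.5, 2.0]
print(dropout(layer, p=0.5))  # some units zeroed, survivors scaled by 2x
```

Randomly deactivating units prevents co-adaptation, which is why dropout combats the overfitting described in the question.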
Question 85
You are using a Dataflow streaming job to read messages from a message bus that does not support exactly-once delivery. Your job then applies some transformations, and loads the result into BigQuery. You want to ensure that your data is being streamed into BigQuery with exactly-once delivery semantics. You expect your ingestion throughput into BigQuery to be about 1.5 GB per second. What should you do?
A. Use the BigQuery Storage Write API and ensure that your target BigQuery table is regional.
B. Use the BigQuery Storage Write API and ensure that your target BigQuery table is multiregional.
C. Use the BigQuery Streaming API and ensure that your target BigQuery table is regional.
D. Use the BigQuery Streaming API and ensure that your target BigQuery table is multiregional.
Correct Answer: B
Explanation:
To achieve exactly-once delivery semantics when ingesting from a source without exactly-once guarantees, you must use the BigQuery Storage Write API, which supports exactly-once writes via stream offsets. Given the expected ingestion rate of ~1.5 GB/s, a multiregional BigQuery table is required because the Storage Write API throughput limit is much higher for multiregional locations (up to ~3 GB/s) than for regional locations (~300 MB/s). The legacy BigQuery Streaming API does not provide exactly-once semantics.
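The capacity reasoning can be checked numerically. The quota figures below are the ones quoted in the explanation; actual BigQuery quotas may change, so treat them as assumptions:

```python
# Storage Write API throughput limits quoted in the explanation,
# compared against the expected ingest rate of 1.5 GB/s.
ingest_gb_per_s = 1.5
regional_limit_gb_per_s = 0.3       # ~300 MB/s for regional locations
multiregional_limit_gb_per_s = 3.0  # ~3 GB/s for multi-regional locations

print(ingest_gb_per_s > regional_limit_gb_per_s)        # regional is too small
print(ingest_gb_per_s <= multiregional_limit_gb_per_s)  # multi-regional fits
```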
Question 1
You need to load a dataset with multiple terabytes of clickstream data into BigQuery. The data arrives each day as compressed JSON files in a Cloud Storage bucket. You need a low-cost, programmatic, and scalable solution to load the data into BigQuery. What should you do?
A. Create an external table in BigQuery pointing to the Cloud Storage bucket and run the INSERT INTO ... SELECT * FROM external_table command.
B. Use the BigQuery Data Transfer Service from Cloud Storage.
C. Create a Cloud Run function to run a Python script to read and parse each JSON file, and use the BigQuery streaming insert API.
D. Use Cloud Data Fusion to create a pipeline to load the JSON files into BigQuery.
Correct Answer: B
Explanation:
The requirement is a low-cost, programmatic, and scalable way to load daily JSON files from Cloud Storage into BigQuery. The BigQuery Data Transfer Service for Cloud Storage is designed for this use case: it automatically loads files from a bucket into BigQuery using batch load jobs, which are free (you only pay for BigQuery storage and queries). Option A requires querying external tables and inserting data, which incurs query costs on multi-terabyte data. Option C uses streaming inserts, which are expensive and unnecessary for batch data. Option D (Cloud Data Fusion) adds operational overhead and cost, making it unsuitable for a low-cost solution.
Question 121
You have enabled the free integration between Firebase Analytics and Google BigQuery. Firebase now automatically creates a new table daily in BigQuery in the format app_events_YYYYMMDD. You want to query all of the tables for the past 30 days in legacy SQL. What should you do?
A. Use the TABLE_DATE_RANGE function
B. Use the WHERE_PARTITIONTIME pseudo column
C. Use WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD
D. Use SELECT IF(date >= YYYY-MM-DD AND date <= YYYY-MM-DD)
Correct Answer: A
Explanation:
In legacy SQL, Firebase Analytics exports create separate daily tables with date suffixes. To query multiple date-suffixed tables over a range (such as the last 30 days), BigQuery legacy SQL requires a table wildcard function. TABLE_DATE_RANGE is specifically designed to scan multiple daily tables matching a prefix and date range. The other options apply to partitioned tables or standard SQL, not legacy SQL date-sharded tables.
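The legacy SQL shape of such a query looks like the string below; the dataset name and table prefix are hypothetical, and `TABLE_DATE_RANGE` takes the sharded-table prefix plus start and end timestamps:

```python
# Legacy SQL for scanning the last 30 days of date-sharded tables.
legacy_sql = """
SELECT event_name, COUNT(*) AS events
FROM TABLE_DATE_RANGE([firebase.app_events_],
                      DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                      CURRENT_TIMESTAMP())
GROUP BY event_name
"""
```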
Question 163
The Development and External teams have the project viewer Identity and Access Management (IAM) role in a folder named Visualization. You want the Development Team to be able to read data from both Cloud Storage and BigQuery, but the External Team should only be able to read data from BigQuery. What should you do?
A. Remove Cloud Storage IAM permissions to the External Team on the acme-raw-data project.
B. Create Virtual Private Cloud (VPC) firewall rules on the acme-raw-data project that deny all ingress traffic from the External Team CIDR range.
C. Create a VPC Service Controls perimeter containing both projects and BigQuery as a restricted API. Add the External Team users to the perimeter's Access Level.
D. Create a VPC Service Controls perimeter containing both projects and Cloud Storage as a restricted API. Add the Development Team users to the perimeter's Access Level.
Correct Answer: D
Explanation:
The requirement is to allow both teams to read BigQuery data, but restrict Cloud Storage access so that only the Development Team can read it. IAM alone cannot selectively block Cloud Storage while preserving BigQuery access when both teams already have viewer-level access at a higher level. VPC firewall rules are irrelevant because this is not network traffic to user-managed VMs. VPC Service Controls are designed to restrict access to specific Google-managed services. By creating a service perimeter with Cloud Storage as the restricted API and adding only the Development Team to the access level, Cloud Storage becomes accessible exclusively to the Development Team, while BigQuery remains accessible to both teams. This exactly matches the requirement.
Question 181
You work for a financial institution that lets customers register online. As new customers register, their user data is sent to Pub/Sub before being ingested into BigQuery. For security reasons, you decide to redact your customers' Government issued Identification Number while allowing customer service representatives to view the original values when necessary. What should you do?
A. Use BigQuery's built-in AEAD encryption to encrypt the SSN column. Save the keys to a new table that is only viewable by permissioned users.
B. Use BigQuery column-level security. Set the table permissions so that only members of the Customer Service user group can see the SSN column.
C. Before loading the data into BigQuery, use Cloud Data Loss Prevention (DLP) to replace input values with a cryptographic hash.
D. Before loading the data into BigQuery, use Cloud Data Loss Prevention (DLP) to replace input values with a cryptographic format-preserving encryption token.
Correct Answer: D
Explanation:
The requirement is to redact government-issued IDs while still allowing authorized staff to view the original values when necessary. Cloud DLP format‑preserving encryption (FPE) replaces sensitive values with reversible tokens before ingestion, so BigQuery never stores the raw IDs. Authorized users can decrypt when needed, satisfying both redaction and controlled re-identification. Column-level security alone does not redact the data, hashing is irreversible, and managing encryption directly in BigQuery is not the recommended approach for this use case.
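As a sketch of what such a transformation looks like, the dictionary below follows the field names of the Cloud DLP REST API's de-identify configuration; the KMS key names and surrogate type are placeholders, not values from the question:

```python
# Sketch of a Cloud DLP de-identify config using format-preserving
# encryption (FPE). Key resource names are placeholders.
deidentify_config = {
    "infoTypeTransformations": {
        "transformations": [{
            "infoTypes": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
            "primitiveTransformation": {
                "cryptoReplaceFfxFpeConfig": {
                    "cryptoKey": {
                        "kmsWrapped": {
                            "wrappedKey": "BASE64_WRAPPED_KEY",
                            "cryptoKeyName": "projects/p/locations/l/keyRings/r/cryptoKeys/k",
                        }
                    },
                    "commonAlphabet": "NUMERIC",
                    "surrogateInfoType": {"name": "SSN_TOKEN"},
                }
            },
        }]
    }
}
```

Because FPE is reversible, a matching re-identify request with the same key restores the original value for authorized customer service representatives.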
Question 74
You are designing the architecture of your application to store data in Cloud Storage. Your application consists of pipelines that read data from a Cloud Storage bucket that contains raw data, and write the data to a second bucket after processing. You want to design an architecture with Cloud Storage resources that are capable of being resilient if a Google Cloud regional failure occurs. You want to minimize the recovery point objective (RPO) if a failure occurs, with no impact on applications that use the stored data. What should you do?
A. Adopt multi-regional Cloud Storage buckets in your architecture.
B. Adopt two regional Cloud Storage buckets, and update your application to write the output on both buckets.
C. Adopt a dual-region Cloud Storage bucket, and enable turbo replication in your architecture.
D. Adopt two regional Cloud Storage buckets, and create a daily task to copy from one bucket to the other.
Correct Answer: C
Explanation:
The requirement is resilience to a regional failure with the smallest possible recovery point objective (RPO) and no application impact. Dual‑region Cloud Storage buckets automatically replicate data across two specific regions, and enabling turbo replication guarantees that 100% of newly written objects are replicated to both regions within an RPO of about 15 minutes. This provides stronger, explicit RPO guarantees than default replication or multi‑regional buckets, while remaining transparent to applications. Other options either rely on manual copying, introduce higher RPO, or require application changes.
Question 29
You have a table that contains millions of rows of sales data, partitioned by date. Various applications and users query this data many times a minute. The query requires aggregating values by using AVG, MAX, and SUM, and does not require joining to other tables. The required aggregations are only computed over the past year of data, though you need to retain full historical data in the base tables. You want to ensure that the query results always include the latest data from the tables, while also reducing computation cost, maintenance overhead, and duration. What should you do?
A. Create a materialized view to aggregate the base table data. Include a filter clause to specify the last one year of partitions.
B. Create a materialized view to aggregate the base table data. Configure a partition expiration on the base table to retain only the last one year of partitions.
C. Create a view to aggregate the base table data. Include a filter clause to specify the last year of partitions.
D. Create a new table that aggregates the base table data. Include a filter clause to specify the last year of partitions. Set up a scheduled query to recreate the new table every hour.
Correct Answer: A
Explanation:
A materialized view is designed for repeatedly executed aggregation queries and significantly reduces computation cost and query duration by precomputing AVG, MAX, and SUM. By filtering the materialized view to only the last year of partitions, you limit the amount of data that needs to be maintained and refreshed while still retaining full historical data in the base table. BigQuery materialized views automatically stay up to date and will read from base tables as needed to ensure fresh results, meeting the requirement to always include the latest data with minimal maintenance overhead.
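The GoogleSQL shape of such a materialized view is sketched below; the dataset, table, and column names are hypothetical, and the fixed cutoff date stands in for "the last one year of partitions" (materialized view queries restrict which functions may appear in filters, so a literal date is used here):

```python
# Hypothetical materialized view precomputing the required aggregations.
mv_sql = """
CREATE MATERIALIZED VIEW sales.daily_aggregates AS
SELECT sale_date,
       AVG(amount) AS avg_amount,
       MAX(amount) AS max_amount,
       SUM(amount) AS total_amount
FROM sales.transactions
WHERE sale_date >= DATE '2024-01-01'  -- cutoff for the last year of partitions
GROUP BY sale_date
"""
```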
Question 268
You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? (Choose two.)
A. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.
B. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.
C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.
D. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.
E. Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.
Correct Answer: C, D
Explanation:
To decide when to scale a Cloud Bigtable cluster for a write-heavy pipeline, you should monitor signals that directly reflect write capacity and storage limits.
C is correct because sustained increases in write latency indicate that the cluster is reaching its throughput limits. Adding nodes increases write throughput and reduces latency.
D is correct because Bigtable storage capacity is tied to the number of nodes. Google recommends keeping storage utilization below ~70% of the maximum to avoid write failures and to leave headroom for growth; exceeding this threshold is a clear signal to add nodes.
Key Visualizer pressure metrics (A and B) are primarily for diagnosing hot-spotting and uneven key distribution. Since the question states that row keys are well-designed, these metrics are not the primary trigger for scaling. Monitoring read latency (E) is not relevant because the pipeline is focused on writes.
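The two correct triggers can be expressed as a simple decision rule. The 1.5x latency threshold below is illustrative, not a Google-recommended figure; the 70% storage guideline is the one cited above:

```python
def should_scale_up(write_latency_ms, baseline_latency_ms,
                    storage_used_tb, storage_capacity_tb):
    """Scale triggers matching answers C and D: sustained write-latency
    growth, or storage utilization above the ~70% guideline."""
    latency_pressure = write_latency_ms > 1.5 * baseline_latency_ms  # illustrative threshold
    storage_pressure = storage_used_tb / storage_capacity_tb > 0.70
    return latency_pressure or storage_pressure

print(should_scale_up(9.0, 5.0, 40, 100))   # sustained latency increase
print(should_scale_up(5.0, 5.0, 75, 100))   # storage above 70%
print(should_scale_up(5.0, 5.0, 40, 100))   # healthy cluster
```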
Question 106
You are developing an Apache Beam pipeline to extract data from a Cloud SQL instance by using JdbcIO. You have two projects running in Google Cloud. The pipeline will be deployed and executed on Dataflow in Project A. The Cloud SQL instance is running in Project B and does not have a public IP address. After deploying the pipeline, you noticed that the pipeline failed to extract data from the Cloud SQL instance due to connection failure. You verified that VPC Service Controls and shared VPC are not in use in these projects. You want to resolve this error while ensuring that the data does not go through the public internet. What should you do?
A. Set up VPC Network Peering between Project A and Project B. Add a firewall rule to allow the peered subnet range to access all instances on the network.
B. Turn off the external IP addresses on the Dataflow worker. Enable Cloud NAT in Project A.
C. Add the external IP addresses of the Dataflow worker as authorized networks in the Cloud SQL instance.
D. Set up VPC Network Peering between Project A and Project B. Create a Compute Engine instance without external IP address in Project B on the peered subnet to serve as a proxy server to the Cloud SQL database.
Correct Answer: A
Explanation:
Cloud SQL uses a private IP and resides in Project B, while Dataflow workers run in Project A. To keep traffic off the public internet, the two VPC networks must have private connectivity. VPC Network Peering enables direct private IP communication between the projects, allowing Dataflow workers to reach the Cloud SQL private IP without a proxy or public IPs. A proxy VM (option D) is unnecessary overhead, Cloud NAT does not help for inbound connections to Cloud SQL, and authorizing external IPs would require public internet access.