Free Google Cloud Professional ML Engineer Practice Questions: Data and Model Collaboration

Practice 10 free Google Cloud Professional Machine Learning Engineer (Professional ML Engineer) questions on Data and Model Collaboration, with answers, explanations, and a suggested next step in IT Mastery.

Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.

Try Google Cloud Professional ML Engineer on Web

Topic snapshot

Field | Detail
Exam route | Google Cloud Professional ML Engineer
Topic area | Collaborating within and Across Teams to Manage Data and Models
Blueprint weight | 16%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate the topic area Collaborating within and Across Teams to Manage Data and Models for the Google Cloud Professional ML Engineer exam. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 16% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These are original IT Mastery practice questions aligned to this topic area. They are not official exam questions, copied live-exam content, or exam dumps. Use them for self-assessment, scope review, and deciding what to drill next.

Question 1

Topic: Collaborating within and Across Teams to Manage Data and Models

A financial services team is building a binary default-risk model that scores loan applications at submission time. The team plans to publish shared features in Agent Platform Feature Store. Requirements: features must be available at scoring time, avoid direct PII, and use geography-based protected-attribute proxies only after compliance approval.

Feature | Definition
income_to_amount | Reported income / requested amount
avg_balance_90d | Mean balance before application time
same_zip_default_rate_30d | Default rate in applicant’s 5-digit ZIP after the 30-day repayment window
employment_months | Months at current employer

Which engineering decision is best before training?

Options:

  • A. Train with all features, then omit ZIP-derived features at serving

  • B. Hash ZIP codes before computing same_zip_default_rate_30d

  • C. Exclude same_zip_default_rate_30d and require approved point-in-time definitions

  • D. Publish all proposed features with restricted Feature Store access

Best answer: C

Explanation: Feature engineering must preserve point-in-time correctness and comply with privacy and responsible AI governance. same_zip_default_rate_30d is not available when the application is scored because it is computed after the repayment window, so it can leak label information into training. It also uses 5-digit ZIP-derived behavior, which can act as a protected-attribute proxy and requires compliance approval under the stated policy. Agent Platform Feature Store should contain reusable, governed feature definitions with clear lineage, owners, and access controls, but governance starts by rejecting invalid feature definitions rather than only restricting access. Hashing or serving-time omission does not fix a leaking or unapproved feature.

  • Access-only control limits who can read features, but it does not make a leaking or biased feature valid.
  • Hashing geography changes representation, but the aggregated default-rate signal still encodes ZIP-level outcomes and future labels.
  • Serving-time omission creates training-serving skew because the model learned from a feature that will not be available in production.
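
The point-in-time and approval checks described in this explanation can be sketched as a small review step that runs before any feature is published. This is a minimal illustration, not a Feature Store API: the `FeatureSpec` class, the `review` function, and the availability delays assigned to each feature are assumptions derived from the definitions in the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical feature definition used only for this review sketch."""
    name: str
    # Delay between application time and when the value first exists.
    # Zero means the value is already known at scoring time.
    available_after: timedelta
    needs_compliance_approval: bool = False
    approved: bool = False


def review(spec):
    """Return the reasons to reject a feature; an empty list means publishable."""
    reasons = []
    if spec.available_after > timedelta(0):
        reasons.append("not available at scoring time (label leakage risk)")
    if spec.needs_compliance_approval and not spec.approved:
        reasons.append("protected-attribute proxy without compliance approval")
    return reasons


# Availability values are assumptions read from the feature definitions.
candidates = [
    FeatureSpec("income_to_amount", timedelta(0)),
    FeatureSpec("avg_balance_90d", timedelta(0)),
    FeatureSpec("same_zip_default_rate_30d", timedelta(days=30),
                needs_compliance_approval=True),
    FeatureSpec("employment_months", timedelta(0)),
]

rejected = {c.name: review(c) for c in candidates if review(c)}
publishable = [c.name for c in candidates if not review(c)]
```

Under these assumptions only same_zip_default_rate_30d is rejected, and it fails both checks, which matches the reasoning behind option C.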

Question 2

Topic: Collaborating within and Across Teams to Manage Data and Models

A retail company has separate data science teams building churn and recommendation models. Customer events and profiles are stored in BigQuery and include email addresses, phone numbers, and support-ticket text. The teams need reusable, model-ready features for offline training and low-latency online serving, but privacy policy says teams should not access raw PII unless required. Which collaboration pattern is the BEST engineering decision?

Options:

  • A. Grant all teams read-only access to the raw BigQuery tables

  • B. Store all raw profile fields in the feature store for team filtering

  • C. Export daily feature CSVs with customer emails to shared Cloud Storage

  • D. Publish curated features in Agent Platform Feature Store with pseudonymous entity IDs

Best answer: D

Explanation: The right collaboration pattern is to separate raw sensitive data from shared, model-ready features. A central feature store lets teams reuse consistent transformations for offline training and online serving, while access controls and pseudonymous entity keys reduce unnecessary exposure to direct identifiers. Raw PII can remain in a restricted BigQuery project or dataset owned by the source team, and only approved derived features are published for consumers. This supports data minimization, consistency, lineage, and collaboration without requiring every model team to handle sensitive source data.

  • Raw table access fails because it exposes PII broadly and pushes each team to recreate feature logic.
  • CSV sharing weakens governance, lineage, and online serving consistency, especially when direct identifiers are included.
  • Raw fields in feature store misuses the feature store as a sensitive-data dump instead of publishing approved model-ready features.
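
One way to picture pseudonymous entity IDs is a keyed hash applied by the source team before features are published. The sketch below uses HMAC-SHA256 from the Python standard library; the function names, the sample profile, the approved feature list, and the inline secret are all hypothetical, and a real deployment would keep the key in a managed secret store rather than in code.

```python
import hashlib
import hmac


def pseudonymous_id(customer_key: str, secret: bytes) -> str:
    """Derive a stable, non-reversible entity ID from a direct identifier."""
    normalized = customer_key.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


# Hypothetical values for illustration only.
SECRET = b"demo-only-secret"
APPROVED_FEATURES = ("orders_90d", "tickets_30d")

raw_profile = {
    "email": "Ana@Example.com",
    "phone": "+1-555-0100",
    "orders_90d": 7,
    "tickets_30d": 2,
}


def to_feature_row(profile: dict, secret: bytes) -> dict:
    """Keep only approved derived features, keyed by a pseudonymous ID."""
    row = {"entity_id": pseudonymous_id(profile["email"], secret)}
    row.update({name: profile[name] for name in APPROVED_FEATURES})
    return row


row = to_feature_row(raw_profile, SECRET)
```

Because the hash is keyed and deterministic, both model teams can join on `entity_id` without ever seeing the email address or phone number.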

Question 3

Topic: Collaborating within and Across Teams to Manage Data and Models

A retailer is preparing a tabular churn model. Data engineers build the BigQuery training dataset, ML engineers train with Agent Platform Pipelines, and an application team calls an online Agent Platform Inference endpoint. Prototypes used different defaults for missing loyalty_tier values and timestamp time zones. The first production release must avoid training-serving skew and provide a traceable handoff. What is the best engineering decision?

Options:

  • A. Apply all preprocessing only inside the application service.

  • B. Let each team document assumptions in separate notebooks.

  • C. Create a versioned preprocessing contract and shared transform components.

  • D. Wait for Model Monitoring to detect skew after launch.

Best answer: C

Explanation: Cross-team handoff readiness depends on making preprocessing assumptions explicit, versioned, and executable in both training and serving paths. For this case, the teams should agree on feature definitions, missing-value handling, time-zone normalization, data types, and ownership before training. Those assumptions should be captured in a shared contract and implemented as reusable pipeline or serving components so the BigQuery training data and online inference requests are transformed consistently. Lineage and versions should be recorded through the pipeline and model lifecycle so a future model version can be tied back to the exact preprocessing behavior used.

  • Separate notebooks drift apart easily and do not enforce consistent production behavior.
  • Application-only preprocessing creates a mismatch because training would not necessarily use the same transformations.
  • Post-launch monitoring can detect skew, but it does not prevent the first release from shipping with inconsistent assumptions.
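
A versioned preprocessing contract with a shared transform might look like the following sketch. The `PreprocessingContract` class, the tier names, the "unknown" default, and the version string are assumptions chosen for illustration; the point is that training and serving call the same function and record which contract version produced each row.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PreprocessingContract:
    """Hypothetical contract agreed on by the data, ML, and application teams."""
    version: str
    loyalty_tier_default: str
    allowed_tiers: tuple


CONTRACT_V1 = PreprocessingContract(
    version="1.0.0",
    loyalty_tier_default="unknown",
    allowed_tiers=("bronze", "silver", "gold"),
)


def transform(record: dict, contract: PreprocessingContract) -> dict:
    """Apply the same missing-value and time-zone rules in training and serving."""
    tier = record.get("loyalty_tier")
    if tier not in contract.allowed_tiers:
        tier = contract.loyalty_tier_default
    ts = record["event_time"]
    if ts.tzinfo is None:
        # Naive timestamps are rejected instead of silently assuming a zone.
        raise ValueError("event_time must be timezone-aware")
    return {
        "loyalty_tier": tier,
        "event_time_utc": ts.astimezone(timezone.utc).isoformat(),
        "contract_version": contract.version,
    }


# The same instant, expressed in two zones by two different teams.
utc_minus_5 = timezone(timedelta(hours=-5))
training_row = {"loyalty_tier": None,
                "event_time": datetime(2024, 1, 15, 9, 0, tzinfo=utc_minus_5)}
serving_request = {"event_time": datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)}

train_features = transform(training_row, CONTRACT_V1)
serve_features = transform(serving_request, CONTRACT_V1)
```

Both paths produce identical features here, and the `contract_version` field gives a later model version something concrete to trace back to.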

Question 4

Topic: Collaborating within and Across Teams to Manage Data and Models

A retail bank is preparing shared training features in BigQuery for reuse in Agent Platform Feature Store. Source tables include email, phone, account ID, age, and a protected class attribute. Multiple ML teams need churn-model features, but only the compliance group may see direct identifiers or protected attributes. Which workflow best preserves privacy while enabling model development?

Options:

  • A. De-identify PII, tokenize entity IDs, and isolate protected attributes

  • B. Deploy the model only through a private inference endpoint

  • C. Publish raw features and rely on model artifacts to hide PII

  • D. Use project-level IAM on notebooks that access raw BigQuery tables

Best answer: A

Explanation: Privacy-preserving feature handling should happen before shared training data or feature values are published. Use Sensitive Data Protection or an equivalent controlled process to inspect and de-identify direct identifiers, and use deterministic tokenization or salted hashing when the same entity must be joined across records. Protected attributes should not be broadly published in Agent Platform Feature Store; keep them in a separate, restricted BigQuery dataset with least-privilege access for approved compliance or fairness evaluation workflows. Serving controls and model artifacts do not fix exposure in upstream datasets.

  • Raw feature sharing fails because identifiers can be exposed before training and reused by multiple teams.
  • Coarse notebook IAM fails because it controls access but does not de-identify or separate sensitive columns.
  • Private inference fails because endpoint isolation protects serving traffic, not training-data privacy.

Question 5

Topic: Collaborating within and Across Teams to Manage Data and Models

A data science team uses Agent Platform Workbench notebooks to prototype feature engineering. Several team members edit the same notebooks, and reviewers need a repeatable way to inspect changes before they are merged. The team wants to keep interactive notebooks but use its existing Git-based review process. Which workflow setup best meets these requirements?

Options:

  • A. Move notebook execution into Model Monitoring on Gemini Enterprise Agent Platform.

  • B. Upload completed notebooks to Cloud Storage with object versioning enabled.

  • C. Register each notebook as a versioned model in Agent Platform Model Registry.

  • D. Connect the notebook environment to the Git repository and use branches and pull requests.

Best answer: D

Explanation: Notebook collaboration should use source control at the development layer. Connecting Agent Platform Workbench to a Git repository lets team members clone the shared project, work on branches, commit notebook and supporting code changes, and use pull requests for review before merging. This improves repeatability because the notebook, helper scripts, dependency files, and configuration can be versioned together. It also improves collaboration because review and merge history are handled by the repository workflow the team already uses.

Storage versioning, model registry, and monitoring can support other lifecycle needs, but they do not replace source repository integration for notebook development and code review.

  • Object versioning preserves files, but it does not provide normal Git branching, diffs, pull requests, or code review workflow.
  • Model registry tracks trained model artifacts and versions, not collaborative notebook source changes.
  • Model monitoring observes deployed AI solutions; it is not the right layer for notebook authoring or review.

Question 6

Topic: Collaborating within and Across Teams to Manage Data and Models

A bank is preparing a tabular ML project to predict whether a small-business customer will miss a loan payment in the next 30 days. Source data is in BigQuery tables for customers, loans, transactions, and payments. Data scientists need repeatable experiments, training jobs must avoid label leakage, and the online service must reuse the same customer-level features. Which data organization should the team implement?

Options:

  • A. Train on transaction-level rows, using payment status as both a feature and the label.

  • B. Export joined CSV snapshots to Cloud Storage, then let each notebook define feature and label columns.

  • C. Curate a customer-time feature table in BigQuery, keep the 30-day label separate, and publish serving features to Agent Platform Feature Store.

  • D. Load raw source tables into Agent Platform Feature Store and compute labels during online serving.

Best answer: C

Explanation: For tabular ML, organize data around the prediction objective: one training example per entity and prediction time, with feature columns available at that time and a label derived from the future outcome window. BigQuery is a good place to create curated, versionable training tables or views for experimentation. Agent Platform Feature Store can then provide approved, reusable online features for serving. Keeping labels separate from features helps prevent leakage, and using the same feature definitions across training and serving reduces skew.

  • Notebook-defined columns make experiments hard to reproduce and can lead to inconsistent feature and label logic.
  • Transaction-level rows use the wrong grain because the business objective is customer-level prediction over a 30-day window.
  • Online label computation confuses serving with training data creation; labels come from observed outcomes, not live predictions.
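
The customer-time grain can be illustrated with a short sketch that builds one example per customer and prediction time, using only earlier payments as features and only the following 30 days for the label. The `build_example` function, the record layout, and the sample payments are hypothetical, and the sketch assumes a payment's missed flag is known once its due time has passed.

```python
from datetime import datetime, timedelta

LABEL_WINDOW = timedelta(days=30)


def build_example(customer_id, prediction_time, payments):
    """Return (feature_row, label) for one customer at one prediction time."""
    mine = [p for p in payments if p["customer_id"] == customer_id]
    # Features: only payments due strictly before the prediction time.
    history = [p for p in mine if p["due_time"] < prediction_time]
    # Label: only payments due inside the future 30-day window.
    future = [p for p in mine
              if prediction_time < p["due_time"] <= prediction_time + LABEL_WINDOW]
    feature_row = {
        "customer_id": customer_id,
        "prediction_time": prediction_time,
        "payments_seen": len(history),
        "missed_before": sum(1 for p in history if p["missed"]),
    }
    label = int(any(p["missed"] for p in future))
    return feature_row, label


# Hypothetical payment records.
payments = [
    {"customer_id": "c1", "due_time": datetime(2024, 1, 1), "missed": False},
    {"customer_id": "c1", "due_time": datetime(2024, 2, 1), "missed": True},
    {"customer_id": "c1", "due_time": datetime(2024, 2, 20), "missed": True},
    {"customer_id": "c2", "due_time": datetime(2024, 2, 10), "missed": True},
]

c1_features, c1_label = build_example("c1", datetime(2024, 2, 5), payments)
c2_features, c2_label = build_example("c2", datetime(2024, 2, 15), payments)
```

Returning the label separately from the feature row keeps the future outcome out of anything that would later be published for online serving.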

Question 7

Topic: Collaborating within and Across Teams to Manage Data and Models

A team is comparing three prompt templates for a support-answer generator. Exact-match and ROUGE scores are stable, but they do not reflect reviewer feedback. Manual review cannot keep up. Reviewers have already defined a clear rubric for groundedness against retrieved passages, answer completeness, and tone compliance. What is the most appropriate next evaluation step to diagnose which template produces better outputs?

Options:

  • A. Use an LLM-as-a-judge with the rubric

  • B. Compare only embedding similarity to references

  • C. Rank templates only by ROUGE score

  • D. Start fine-tuning before comparing templates

Best answer: A

Explanation: LLM-as-a-judge is an evaluation approach for generative AI when the team can express quality expectations as review criteria. In this scenario, reviewers already defined criteria for groundedness, completeness, and tone, and simple text-overlap metrics are not matching human feedback. Using an LLM judge with that rubric helps scale evaluation across prompt versions and can produce comparable scores that are tracked with experiment metadata. Human review can still be used for calibration and spot checks, but it no longer has to be the only evaluation mechanism. The key distinction is that the problem is evaluation quality and scale, not model training or serving.

  • ROUGE-only ranking fails because overlap metrics are already shown to miss the qualities reviewers care about.
  • Immediate fine-tuning skips the diagnostic step of comparing prompt quality under defined criteria.
  • Embedding similarity alone may measure semantic closeness, but it does not directly assess groundedness, completeness, and tone compliance.
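
A rubric-driven judge loop can be sketched without committing to any particular model API. In the example below, `build_judge_prompt` and `evaluate_template` are hypothetical helpers, the 1 to 5 scale is an assumption, and `stub_judge` is a placeholder standing in for a real model call that would return parsed scores.

```python
from statistics import mean

RUBRIC = ("groundedness", "completeness", "tone_compliance")


def build_judge_prompt(question, passages, answer):
    """Assemble the rubric, retrieved passages, and candidate answer."""
    criteria = "\n".join(f"- {name}: score 1 to 5" for name in RUBRIC)
    context = "\n".join(passages)
    return (
        "Score the response against each criterion using only the passages.\n"
        f"Criteria:\n{criteria}\n\nPassages:\n{context}\n\n"
        f"Question: {question}\nAnswer: {answer}\n"
    )


def evaluate_template(outputs, judge):
    """Average judge scores per criterion for one prompt template.

    outputs: list of (question, passages, answer) tuples.
    judge: callable taking a prompt string and returning a dict of scores.
    """
    per_criterion = {name: [] for name in RUBRIC}
    for question, passages, answer in outputs:
        scores = judge(build_judge_prompt(question, passages, answer))
        for name in RUBRIC:
            per_criterion[name].append(scores[name])
    return {name: mean(values) for name, values in per_criterion.items()}


def stub_judge(prompt):
    """Placeholder judge: a real one would call a model and parse its reply."""
    answer = prompt.split("Answer:")[-1]
    grounded = 5 if "refund within 30 days" in answer else 2
    return {"groundedness": grounded, "completeness": 4, "tone_compliance": 4}


passages = ["Refunds are accepted within 30 days of purchase."]
template_a = [("Can I return this?", passages,
               "You can request a refund within 30 days.")]
template_b = [("Can I return this?", passages,
               "Refunds are always available.")]

scores_a = evaluate_template(template_a, stub_judge)
scores_b = evaluate_template(template_b, stub_judge)
```

Storing the per-criterion averages alongside the template version makes the comparison repeatable, and a sample of judged items can still go to human reviewers for calibration.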

Question 8

Topic: Collaborating within and Across Teams to Manage Data and Models

A data science team is handing off a notebook-based churn model prototype to an MLOps team. The workflow must let both teams rerun the same experiments, review changes through source control, enforce least-privilege access to BigQuery and Cloud Storage, and compare metrics/artifacts across runs. What should you configure?

Options:

  • A. A shared Cloud Storage folder for notebooks, datasets, and result files

  • B. Agent Platform Workbench with Git, pinned dependencies, IAM, and Experiments tracking

  • C. A single notebook VM with local credentials shared by both teams

  • D. Only register the final model artifact in Agent Platform Model Registry

Best answer: B

Explanation: Cross-team notebook handoff needs controls at several lifecycle layers: reproducible execution environment, versioned source, governed access, and shared run history. Agent Platform Workbench supports managed notebook collaboration, and connecting notebooks to Git provides code review and history. Pinning dependencies with a requirements file or custom container helps another team rerun the same code. IAM should control access to BigQuery and Cloud Storage instead of sharing credentials. Experiments tracking and Agent Platform ML Metadata preserve metrics, parameters, artifacts, and lineage so teams can compare runs and continue work without losing context. A final model artifact alone is not enough for reproducible collaboration.

  • Shared storage only preserves files but does not provide code review, environment reproducibility, or experiment lineage.
  • Shared credentials violate least-privilege access and weaken auditability.
  • Model Registry only supports model version handoff but misses notebook source, dependencies, and run context.

Question 9

Topic: Collaborating within and Across Teams to Manage Data and Models

A team is troubleshooting a slow feature-engineering prototype in Agent Platform Workbench. The notebook exports a 120 MB stratified training sample from BigQuery, then launches a Dataflow job for each candidate imputation and encoding change. The Dataflow jobs succeed, but startup and orchestration time make each iteration take 25-35 minutes. The team needs faster notebook experiments before productionizing the preprocessing logic. What is the best next diagnostic step?

Options:

  • A. Increase Dataflow worker count for each experiment

  • B. Profile pandas or scikit-learn preprocessing in the notebook

  • C. Move the prototype to Apache Spark on Dataproc

  • D. Rewrite all feature logic as BigQuery scheduled queries

Best answer: B

Explanation: For small-to-moderate data that fits comfortably in notebook memory, in-memory Python frameworks such as pandas and scikit-learn are often the fastest way to iterate on preprocessing ideas. The symptom is not a failed distributed job or insufficient compute; the Dataflow jobs succeed, but their startup and orchestration overhead dominates each experiment. A good diagnostic step is to run and profile the same transformations locally in the notebook, then later move stable logic to a scalable production tool if needed. Distributed systems such as Dataflow or Spark are better when data volume, streaming requirements, or production scale justify their overhead.

  • More workers may reduce processing time, but it does not address orchestration overhead on a 120 MB sample.
  • Spark migration adds another distributed runtime and is unlikely to improve rapid notebook iteration for this data size.
  • Scheduled SQL can be useful for repeatable BigQuery transformations, but it is not the fastest diagnostic path for exploratory Python-based feature experiments.
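
Timing a candidate transformation in the notebook process is one simple way to confirm that the work itself is cheap compared with a 25-35 minute job cycle. The sketch below uses only the standard library so it stays dependency-free; `time_variant` and `impute_and_encode` are hypothetical names, the rows are synthetic, and in a real notebook the timed function would wrap the pandas or scikit-learn preprocessing under test.

```python
import time
from statistics import median


def time_variant(fn, rows, repeats=5):
    """Return the median wall-clock seconds for fn(rows) over several repeats."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(rows)
        durations.append(time.perf_counter() - start)
    return median(durations)


def impute_and_encode(rows):
    """Fill missing income with a middle value and map tier labels to integers."""
    known = sorted(r["income"] for r in rows if r["income"] is not None)
    fill = known[len(known) // 2]  # upper median of the observed values
    index = {tier: i for i, tier in enumerate(sorted({r["tier"] for r in rows}))}
    return [
        {"income": r["income"] if r["income"] is not None else fill,
         "tier": index[r["tier"]]}
        for r in rows
    ]


# Synthetic sample: every tenth income is missing.
rows = [
    {"income": None if i % 10 == 0 else 1000 + i, "tier": ("a", "b", "c")[i % 3]}
    for i in range(50_000)
]

seconds = time_variant(impute_and_encode, rows)
```

Comparing `seconds` across candidate imputation and encoding variants gives a per-iteration cost that can be weighed against the Dataflow startup overhead described in the question.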

Question 10

Topic: Collaborating within and Across Teams to Manage Data and Models

A healthcare team is preparing shared features for a hospital readmission model. The features will be reused by training pipelines and online serving through Agent Platform Feature Store.

Candidate engineered features:

Feature | Concern
prior admissions at discharge time | Valid history
billing status finalized after discharge | Possible leakage
hashed email identifier | Possible PII
full postal code and primary language | Possible bias proxy
race and ethnicity | Needed for fairness review

Which workflow setup best addresses privacy, leakage, bias, and governance for this use case?

Options:

  • A. Train with all candidate features, then remove sensitive fields only from online serving.

  • B. Publish only governed features with point-in-time generation, restricted sensitive attributes, and documented lineage.

  • C. Publish all engineered features after hashing identifiers and rely on model metrics to detect issues.

  • D. Store features in team-owned BigQuery tables and let each notebook remove sensitive columns.

Best answer: B

Explanation: Shared feature engineering should be governed before features are used for training or serving. Agent Platform Feature Store is appropriate for reusable features, but the workflow must include point-in-time correctness to avoid leakage from fields finalized after the prediction time. Direct identifiers, even hashed identifiers, still require privacy controls and should not become model signals. Sensitive attributes such as race and ethnicity can be retained in restricted datasets for fairness evaluation, while high-risk proxy features should be reviewed, transformed, or excluded based on the approved use case. Lineage and access controls help teams understand who used each feature and why.

The key takeaway is to control feature creation and access at the shared feature layer, not after models or notebooks have already consumed the data.

  • Hashing identifiers does not make all features safe; leakage and proxy bias can still enter the model.
  • Notebook cleanup is inconsistent governance because each team may apply different privacy and leakage rules.
  • Serving-only removal fails because training on sensitive or leaked features can already bias the learned model.
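
The governance step in this explanation can be pictured as a routing table applied to each candidate before anything reaches the shared feature layer. The `Decision` enum and the mapping from concern to decision are assumptions that paraphrase the explanation above, not a product feature.

```python
from enum import Enum


class Decision(Enum):
    PUBLISH = "publish as a shared feature"
    EXCLUDE = "exclude from model features"
    REVIEW = "hold for review, transformation, or exclusion"
    RESTRICT = "keep in a restricted dataset for fairness evaluation"


# Hypothetical routing rules keyed by the concern column of the table.
ROUTING = {
    "Valid history": Decision.PUBLISH,
    "Possible leakage": Decision.EXCLUDE,
    "Possible PII": Decision.EXCLUDE,
    "Possible bias proxy": Decision.REVIEW,
    "Needed for fairness review": Decision.RESTRICT,
}

candidates = [
    ("prior admissions at discharge time", "Valid history"),
    ("billing status finalized after discharge", "Possible leakage"),
    ("hashed email identifier", "Possible PII"),
    ("full postal code and primary language", "Possible bias proxy"),
    ("race and ethnicity", "Needed for fairness review"),
]

decisions = {feature: ROUTING[concern] for feature, concern in candidates}
published = [f for f, d in decisions.items() if d is Decision.PUBLISH]
```

Under these rules only the prior-admissions feature is published directly; everything else is excluded, reviewed, or kept behind restricted access before any training run can consume it.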
