Exam identity
| Item | Detail |
|---|---|
| Vendor/provider | Microsoft |
| Official title | Microsoft Certified: Machine Learning Operations Engineer Associate (AI-300) |
| Exam code | AI-300 |
| Page purpose | Independent Quick Reference for real-exam preparation and original practice support |
Use this as a compact decision guide for Azure Machine Learning operations: building repeatable training workflows, packaging assets, deploying models, monitoring production behavior, and securing MLOps automation.
High-yield MLOps mental model
flowchart LR
A[Source control<br/>code, YAML, tests] --> B[CI validation<br/>lint, unit tests, schema checks]
B --> C[Training pipeline<br/>data, compute, components]
C --> D[Evaluate<br/>metrics, bias, quality gates]
D -->|passes| E[Register asset<br/>model, env, component]
D -->|fails| C
E --> F[Deploy<br/>online or batch endpoint]
F --> G[Monitor<br/>logs, metrics, drift, quality]
G --> H[Trigger retraining<br/>manual or automated]
H --> C
| Exam decision point | Fast rule |
|---|---|
| Training needs repeatability | Use Azure Machine Learning jobs, components, environments, data assets, and pipelines rather than notebook-only work. |
| Promotion across workspaces | Use versioned assets and registries; avoid “copy files by hand” patterns. |
| Low-latency scoring | Use an online endpoint. |
| Large offline scoring | Use a batch endpoint. |
| Secret handling | Use managed identities, Key Vault-backed secrets, or workspace connections; do not hard-code secrets. |
| Production change | Use staged deployment, traffic control, tests, and rollback. |
| Monitoring asks “why did it fail?” | Check job/endpoint logs, environment build, identity permissions, data paths, and scoring code. |
Azure Machine Learning object map
| Object | What it represents | Exam-relevant use |
|---|---|---|
| Workspace | Top-level Azure Machine Learning boundary for assets, jobs, compute, endpoints, and collaborators. | Central control plane for MLOps. |
| Datastore | Reference to storage such as Azure Blob Storage or Azure Data Lake Storage. | Connects workspace to data without embedding storage credentials in code. |
| Data asset | Versioned reference to data used by jobs and pipelines. | Reproducibility, lineage, input binding. |
| Environment | Runtime definition: base image, conda/pip dependencies, Docker context, or curated environment. | Ensures train/deploy consistency. |
| Compute instance | Managed development workstation. | Interactive authoring and debugging; not ideal as production training compute. |
| Compute cluster | Scalable managed compute for jobs. | Training, batch jobs, parallel workloads. |
| Job | Execution unit such as command, sweep, AutoML, or pipeline job. | Repeatable training and evaluation. |
| Component | Reusable pipeline step with inputs, outputs, code, environment, and command. | Modular pipeline design and reuse. |
| Pipeline | Directed workflow of components/jobs. | End-to-end MLOps orchestration. |
| Model asset | Registered model artifact, often MLflow or custom. | Versioned deployment candidate. |
| Registry | Cross-workspace sharing and promotion of models, components, and environments. | Dev/test/prod separation and enterprise reuse. |
| Online endpoint | HTTPS scoring endpoint with one or more deployments. | Real-time inference. |
| Batch endpoint | Endpoint for asynchronous batch inference over large input datasets. | Scheduled or offline scoring. |
Service and feature selection matrix
Compute choices
| Need | Choose | Avoid choosing when |
|---|---|---|
| Interactive notebooks, debugging, small experiments | Compute instance | You need scalable, repeatable production training. |
| Scalable training jobs | Compute cluster | You need always-on interactive development. |
| Pipeline execution with managed scaling | Azure Machine Learning managed compute options | You require a custom Kubernetes platform. |
| Existing Kubernetes operations model | Attached Kubernetes / Kubernetes online deployment pattern | You want Microsoft-managed endpoint infrastructure. |
| Distributed data processing | Spark integration where appropriate | The task is simple model training and does not need Spark. |
| Local smoke test | Local execution or small dev compute | It must represent production security, networking, or scale behavior. |
Job and workflow choices
| Scenario | Best fit | Key exam clue |
|---|---|---|
| Run a script with parameters | Command job | “Train this script with inputs and outputs.” |
| Compare hyperparameters | Sweep job | “Find best hyperparameters.” |
| Build reusable multi-step workflow | Pipeline job | “Preprocess, train, evaluate, register.” |
| Automate model search | AutoML job | “Try algorithms/features automatically.” |
| Score many files/rows offline | Batch endpoint/job | “No real-time response required.” |
| Trigger workflow from Git commit | CI/CD pipeline invoking Azure ML CLI/SDK | “Source-controlled MLOps.” |
Deployment choices
| Requirement | Choose | Why |
|---|---|---|
| Real-time HTTPS inference | Managed online endpoint | Managed production endpoint with deployments and traffic control. |
| Real-time inference on organization-managed Kubernetes | Kubernetes online endpoint/deployment pattern | Use existing Kubernetes governance and runtime. |
| Offline scoring of large datasets | Batch endpoint | Asynchronous, file/data oriented scoring. |
| Blue/green or canary release | Multiple deployments under one online endpoint | Split or shift traffic between model versions. |
| Fast rollback | Keep previous deployment available and shift traffic back | Rollback should not require rebuilding from scratch. |
| Custom request handling | Custom scoring script | Needed for non-MLflow or custom preprocessing logic. |
| Standard MLflow model serving | MLflow model deployment path | Reduces custom serving code when compatible. |
Asset versioning and lineage
| Asset | Versioning guidance | Common trap |
|---|---|---|
| Code | Keep in Git with tests and review gates. | Editing production code directly in a notebook or portal. |
| Data | Use versioned data assets or immutable paths for training inputs. | Training on “latest” data without recording the exact input. |
| Environment | Pin dependencies and version environments. | Using unpinned packages that change between train and deploy. |
| Model | Register only evaluated candidates with metadata and metrics. | Deploying an unregistered artifact with no lineage. |
| Component | Version reusable pipeline steps. | Breaking old pipelines by mutating component behavior. |
| Pipeline YAML | Store with code and parameterize environment-specific values. | Manually recreating pipelines in each workspace. |
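One way to avoid the "latest data" trap in the table is to record a content fingerprint of the training input next to the run. The sketch below is a generic Python illustration, not an Azure Machine Learning feature: `fingerprint_files` is a hypothetical helper, and the sample CSV content is made up for the demonstration. It hashes relative file paths and contents with SHA-256 so identical inputs always produce the same digest.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint_files(folder: Path) -> str:
    """Return a SHA-256 digest over the relative paths and contents of all files in folder."""
    digest = hashlib.sha256()
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train.csv").write_text("id,label\n1,0\n2,1\n")
    before = fingerprint_files(data_dir)
    unchanged = fingerprint_files(data_dir)
    (data_dir / "train.csv").write_text("id,label\n1,0\n2,0\n")
    after = fingerprint_files(data_dir)

print(before == unchanged)  # True: same content, same fingerprint
print(before == after)      # False: one label changed
```

Logging such a digest as a run tag or parameter is one possible way to tie a model back to its exact input; a versioned data asset serves the same purpose inside Azure Machine Learning.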
Azure ML CLI v2 patterns
Use YAML definitions for repeatability. The exact schema depends on the asset type, but the exam often tests whether you understand what belongs in code, YAML, identities, and CI/CD.
Command job pattern
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
display_name: train-classifier
experiment_name: churn-training
code: ./src
command: >-
  python train.py
  --training_data ${{inputs.training_data}}
  --max_epochs ${{inputs.max_epochs}}
inputs:
  training_data:
    type: uri_folder
    path: azureml:churn-data:1
  max_epochs: 10
environment: azureml:sklearn-train-env:1
compute: azureml:cpu-cluster
outputs:
  model_output:
    type: uri_folder

az ml job create --file train-job.yml --resource-group <rg> --workspace-name <workspace>
Component pattern
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: train_component
version: 1
display_name: Train model
inputs:
  training_data:
    type: uri_folder
  learning_rate:
    type: number
outputs:
  model_output:
    type: uri_folder
code: ./src
environment: azureml:sklearn-train-env:1
command: >-
  python train.py
  --training_data ${{inputs.training_data}}
  --learning_rate ${{inputs.learning_rate}}
  --model_output ${{outputs.model_output}}
Online endpoint and deployment pattern
endpoint.yml:
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: churn-endpoint
auth_mode: key

blue-deployment.yml:
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: churn-endpoint
model: azureml:churn-model:3
environment: azureml:sklearn-infer-env:2
code_configuration:
  code: ./score
  scoring_script: score.py
instance_type: <vm_size>
instance_count: <count>

az ml online-endpoint create -f endpoint.yml
az ml online-deployment create -f blue-deployment.yml --all-traffic
az ml online-deployment get-logs \
  --endpoint-name churn-endpoint \
  --name blue \
  --resource-group <rg> \
  --workspace-name <workspace>
Traffic shift pattern
az ml online-endpoint update \
  --name churn-endpoint \
  --traffic "blue=90 green=10" \
  --resource-group <rg> \
  --workspace-name <workspace>
Use this for canary-style validation once a second deployment (here, green) exists under the same endpoint. For rollback, shift traffic back to the previous known-good deployment.
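The canary and rollback splits can also be made explicit in code before they reach the CLI. The following Python sketch is illustrative only: `canary_split` and `format_traffic` are hypothetical helper names, not part of the Azure ML CLI or SDK, and the sketch assumes the split always allocates a full 100 percent across two deployments.

```python
def canary_split(stable: str, candidate: str, candidate_percent: int) -> dict[str, int]:
    """Return a traffic map that sends candidate_percent to the candidate deployment."""
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    if stable == candidate:
        raise ValueError("stable and candidate must be different deployments")
    return {stable: 100 - candidate_percent, candidate: candidate_percent}


def format_traffic(split: dict[str, int]) -> str:
    """Render a traffic map as space-separated name=percent pairs."""
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100 in this sketch")
    return " ".join(f"{name}={percent}" for name, percent in split.items())


canary = canary_split("blue", "green", 10)
rollback = canary_split("blue", "green", 0)
print(format_traffic(canary))    # blue=90 green=10
print(format_traffic(rollback))  # blue=100 green=0
```

The rollback case is simply the same split with the candidate at zero, which is why keeping the previous deployment alive makes rollback cheap.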
MLflow quick reference
| Task | MLflow use |
|---|---|
| Track parameters | mlflow.log_param() |
| Track metrics | mlflow.log_metric() |
| Track artifacts | mlflow.log_artifact() or mlflow.log_artifacts() |
| Package model | Flavor-specific logging such as mlflow.sklearn.log_model() |
| Register model | Register from run artifact or use Azure ML model registration flow. |
| Reduce custom serving code | Prefer MLflow model format when the framework and inference contract fit. |
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    # model, X_train, y_train, X_test, and y_test are assumed to be defined earlier in the script.
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, artifact_path="model")
MLflow vs custom model
| Requirement | Prefer MLflow model | Prefer custom model |
|---|---|---|
| Standard framework model packaging | Yes | Maybe |
| Minimal scoring boilerplate | Yes | No |
| Custom preprocessing at inference | Maybe | Yes |
| Nonstandard request/response handling | No | Yes |
| Full control of init() and run() logic | No | Yes |
Scoring script essentials
For custom online inference, the scoring script commonly exposes two functions: init() for one-time initialization and run() for handling each request.
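import json
import os

import joblib
import numpy as np

def init():
    # Load the model once at container startup.
    # AZUREML_MODEL_DIR points to the folder that holds the deployed model files.
    global model
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model.pkl")
    model = joblib.load(model_path)

def run(raw_data):
    # Parse the JSON request body and score the rows under the "data" key.
    payload = json.loads(raw_data)
    data = np.array(payload["data"])
    predictions = model.predict(data)
    # Convert the NumPy array to a list so the response is JSON-serializable.
    return {"predictions": predictions.tolist()}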
| Function | Purpose | Common issue |
|---|---|---|
| init() | Load model and shared resources once at startup. | Model path or dependency failure causes deployment startup errors. |
| run() | Handle each request and return serializable output. | Input schema mismatch or non-JSON-serializable result. |
| Environment | Supplies runtime packages. | Missing library works locally but fails in endpoint container. |
| Logs | Diagnose startup and request failures. | Not checking endpoint deployment logs before changing infrastructure. |
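The two run() issues named in the table, input schema mismatch and non-serializable output, can be guarded against explicitly. The sketch below is a local illustration that runs without Azure: `EchoSumModel` is a made-up stand-in for a trained model, and the `{"data": [[...]]}` payload shape and the error message format are assumptions rather than a required contract.

```python
import json


class EchoSumModel:
    """Stand-in for a trained model: predicts the sum of each input row."""

    def predict(self, rows):
        return [sum(row) for row in rows]


model = None


def init():
    # A real deployment would load the model file from AZUREML_MODEL_DIR here.
    global model
    model = EchoSumModel()


def run(raw_data):
    # Reject malformed JSON and missing keys with a clear, serializable error.
    try:
        payload = json.loads(raw_data)
        rows = payload["data"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"error": "expected a JSON body of the form {\"data\": [[...], ...]}"}
    predictions = model.predict(rows)
    # Convert to plain Python floats so the result is JSON-serializable.
    return {"predictions": [float(p) for p in predictions]}


init()
ok = run(json.dumps({"data": [[1, 2], [3, 4]]}))
bad = run(json.dumps({"rows": []}))
print(ok)              # {'predictions': [3.0, 7.0]}
print("error" in bad)  # True
```

Calling init() and run() locally like this is also a cheap smoke test before building the endpoint image.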
Pipeline design reference
| Design concern | Recommended pattern | Exam trap |
|---|---|---|
| Reuse | Build components for preprocess, train, evaluate, register. | One huge script with no reusable boundaries. |
| Inputs/outputs | Declare typed inputs and outputs. | Hidden file paths inside scripts. |
| Parameters | Pass as component or pipeline parameters. | Hard-coded values that differ by environment. |
| Reproducibility | Version code, data, environment, and model. | Reruns produce untraceable differences. |
| Quality gates | Evaluate metrics before registering or deploying. | Register every run as production-ready. |
| Promotion | Promote versioned assets through environments. | Retraining separately in each environment when it is not required. |
| Failure handling | Make steps idempotent where possible. | Partial outputs corrupt later steps. |
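For the failure-handling row, one common way to keep partial outputs away from later steps is to write to a temporary file and rename it into place only after the write completes. This is a generic Python sketch rather than an Azure Machine Learning API; `write_output_atomically` is a hypothetical helper name and the metric values are made up.

```python
import json
import os
import tempfile
from pathlib import Path


def write_output_atomically(target: Path, payload: dict) -> None:
    """Write payload as JSON so target is either untouched or fully written."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, target)  # swap the finished file into place
    except BaseException:
        # Remove the partial temp file and leave any previous target untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "metrics" / "eval.json"
    write_output_atomically(out, {"accuracy": 0.91})
    write_output_atomically(out, {"accuracy": 0.93})  # a rerun overwrites cleanly
    result = json.loads(out.read_text())
    leftovers = [p.name for p in out.parent.iterdir() if p.suffix == ".tmp"]

print(result)     # {'accuracy': 0.93}
print(leftovers)  # []
```

On mounted cloud storage the rename may not behave atomically, so treat this as a local-filesystem illustration of the idea.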
Quality gate examples
| Gate | Example check |
|---|---|
| Data validation | Required columns exist; schema and ranges are acceptable. |
| Unit tests | Feature functions and scoring code pass tests. |
| Training metrics | Candidate beats baseline or minimum threshold. |
| Responsible AI review | Error patterns, explainability, or fairness checks reviewed where required. |
| Security check | No secrets in code, images, logs, or YAML. |
| Deployment smoke test | Endpoint responds with expected schema. |
| Production approval | Human or policy approval before full traffic shift. |
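The training-metrics gate can be expressed as a small pass/fail function that a pipeline step or CI job evaluates before registration. The sketch below is an assumption-based illustration: `passes_quality_gate` is a hypothetical name, the metric is assumed to be "higher is better", and the gate here requires both the minimum threshold and the baseline comparison, which is stricter than the "baseline or threshold" wording in the table.

```python
from typing import Optional


def passes_quality_gate(candidate: float, minimum: float, baseline: Optional[float] = None) -> bool:
    """Return True when candidate meets the minimum and is not worse than the baseline."""
    if candidate < minimum:
        return False
    if baseline is not None and candidate < baseline:
        return False
    return True


# Hypothetical metric values for illustration only.
print(passes_quality_gate(candidate=0.91, minimum=0.85, baseline=0.89))  # True
print(passes_quality_gate(candidate=0.91, minimum=0.85, baseline=0.93))  # False
print(passes_quality_gate(candidate=0.80, minimum=0.85))                 # False
```

In a CI job, exiting with a non-zero status when the function returns False is one way to stop the workflow before the register and deploy stages.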
Model quality metrics for gates
Use the metric aligned to business risk. Do not optimize accuracy alone when false positives and false negatives have different costs.
\[
\text{Precision} = \frac{TP}{TP + FP}
\]
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]
\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
\]
| Metric | Use when | Watch for |
|---|---|---|
| Accuracy | Classes are balanced and error costs are similar. | Misleading with imbalanced data. |
| Precision | False positives are costly. | May miss many actual positives. |
| Recall | False negatives are costly. | May increase false positives. |
| F1 | Need balance between precision and recall. | Hides business-specific cost differences. |
| ROC AUC | Need ranking/separation quality across thresholds. | Does not choose an operating threshold. |
| RMSE | Regression with larger errors needing stronger penalty. | Sensitive to outliers. |
| MAE | Regression with interpretable average absolute error. | Does not penalize large errors as strongly as RMSE. |
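The metrics in this section can be computed directly from confusion-matrix counts and prediction errors. The following Python sketch uses made-up numbers to show why accuracy can mislead on imbalanced data; the function names and the choice to return 0.0 when a denominator is zero are conventions assumed for this example.

```python
import math


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision(tp: int, fp: int) -> float:
    return safe_ratio(tp, tp + fp)


def recall(tp: int, fn: int) -> float:
    return safe_ratio(tp, tp + fn)


def f1(p: float, r: float) -> float:
    return safe_ratio(2 * p * r, p + r)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return safe_ratio(tp + tn, tp + tn + fp + fn)


def rmse(y_true, y_pred) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced case: 16 actual positives among 1,000 examples.
tp, fp, fn, tn = 8, 2, 8, 982
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn))  # 0.99
print(p, r)                      # 0.8 0.5
print(round(f1(p, r), 3))        # 0.615
print(round(rmse([3, 5, 7], [2, 5, 9]), 3), mae([3, 5, 7], [2, 5, 9]))  # 1.291 1.0
```

Accuracy is 0.99 even though half of the actual positives are missed, which is the situation where recall or F1 makes a better gate.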
Security, identity, and networking
Identity choices
| Identity/control | Use for | Exam cue |
|---|---|---|
| User identity | Interactive development and investigation. | “Data scientist runs notebook.” |
| Service principal | Automation from CI/CD where managed identity is not available. | “Pipeline outside Azure needs to submit jobs.” |
| System-assigned managed identity | Azure resource needs an automatically managed identity. | “No credential rotation; tied to resource lifecycle.” |
| User-assigned managed identity | Shared identity across resources or stable identity lifecycle. | “Reuse same identity across deployments.” |
| Azure RBAC | Grant access to Azure resources such as storage, Key Vault, workspace. | “Least privilege access.” |
| Key Vault | Store secrets that cannot be replaced with identity-based access. | “Do not put secret in code/YAML.” |
Security decision table
| Requirement | Pattern |
|---|---|
| CI/CD submits Azure ML jobs | Use federated identity or service principal/managed identity with least-privilege RBAC. |
| Job reads private storage | Grant the job identity appropriate storage permissions; reference data through datastore/data asset. |
| Endpoint accesses downstream service | Use managed identity and grant only required permissions. |
| Protect secrets | Use Key Vault or secure workspace connection; never log values. |
| Restrict public access | Use private endpoints and network controls where required. |
| Control outbound traffic | Use managed network/VNet patterns and explicit outbound rules where applicable. |
| Separate dev/test/prod | Use separate workspaces, resource groups, subscriptions, or registries according to governance. |
| Audit activity | Use Azure activity logs, Azure ML job history, and monitoring logs. |
Common security traps
| Trap | Correct exam response |
|---|---|
| Hard-code storage keys in training script | Use identity-based access or Key Vault-backed secret. |
| Give CI/CD Owner permissions broadly | Use least privilege. |
| Use personal account for production automation | Use managed identity or service principal. |
| Open endpoint publicly when private access is required | Use private networking design. |
| Assume workspace access grants storage access | Grant required permissions on the backing data resource too. |
| Put secrets in Docker image or environment YAML | Inject securely at runtime or use managed identity. |
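Where a secret cannot be replaced by identity-based access, "inject at runtime, never log the value" might look like the sketch below. It is a generic Python illustration and does not call Key Vault or any Azure SDK: `require_setting` and `redact` are hypothetical helpers, `DOWNSTREAM_API_KEY` is a made-up setting name, and a plain dictionary stands in for the process environment.

```python
import os
from typing import Mapping


def require_setting(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return a required runtime setting, or fail fast with a message that omits the value."""
    value = environ.get(name, "")
    if not value:
        raise RuntimeError(f"required setting {name!r} is not set")
    return value


def redact(value: str) -> str:
    """Return a log-safe placeholder that reveals only the length of the secret."""
    return f"<redacted:{len(value)} chars>"


# The dictionary stands in for os.environ so the example runs anywhere.
fake_environ = {"DOWNSTREAM_API_KEY": "example-value"}
api_key = require_setting("DOWNSTREAM_API_KEY", fake_environ)
print(redact(api_key))  # <redacted:13 chars>
```

Failing fast on a missing setting surfaces configuration problems at startup, where deployment logs will show them, instead of on the first request.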
CI/CD for MLOps
Typical CI/CD stages
| Stage | Activities | Output |
|---|---|---|
| Validate | Lint code/YAML, run unit tests, validate schemas. | Build is accepted or rejected. |
| Build/package | Build environment or image, package components. | Versioned runnable assets. |
| Train | Submit Azure ML pipeline/job. | Run history, metrics, candidate model. |
| Evaluate | Compare metrics to thresholds and baseline. | Pass/fail deployment gate. |
| Register | Register model with metadata if gate passes. | Versioned model asset. |
| Deploy to nonprod | Create/update endpoint deployment. | Testable endpoint. |
| Smoke/integration test | Send sample requests, verify schema and latency behavior. | Promotion decision. |
| Promote to prod | Shift traffic or deploy approved model. | Production endpoint update. |
| Monitor | Collect logs, metrics, model quality signals. | Retraining or rollback signal. |
GitHub Actions sketch
name: mlops-ci
on:
  push:
    branches: [ main ]
permissions:
  id-token: write  # required for OIDC (federated) login with azure/login
  contents: read
jobs:
  validate-and-train:
    runs-on: ubuntu-latest
    env:
      RG: <rg>
      WORKSPACE: <workspace>
    steps:
      - uses: actions/checkout@v4
      - name: Azure login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: Install Azure ML extension
        run: az extension add -n ml -y
      - name: Submit training pipeline
        run: |
          az ml job create \
            --file pipelines/train.yml \
            --resource-group "$RG" \
            --workspace-name "$WORKSPACE"
| Exam point | What to remember |
|---|---|
| Store YAML in repo | Infrastructure and ML workflows should be reviewable and repeatable. |
| Do not store secrets in repo | Use secure pipeline secrets or federated identity. |
| Use approval gates | Especially before production traffic shift. |
| Keep environments versioned | CI/CD should not depend on mutable runtime state. |
| Capture run outputs | Metrics and model artifacts drive promotion decisions. |
Workspace asset vs registry
| Need | Workspace asset | Registry |
|---|---|---|
| Use asset in one workspace | Yes | Optional |
| Share model across workspaces | Limited | Yes |
| Promote from dev to prod | Possible manually | Better fit |
| Reuse components enterprise-wide | Limited | Yes |
| Keep approved environment versions | Yes | Yes |
| Avoid retraining just to move environments | Harder | Easier with promoted assets |
Promotion pattern:
- Train and evaluate in development workspace.
- Register candidate model with metrics and metadata.
- Promote approved model/environment/component to registry.
- Deploy from registry into test or production workspace.
- Monitor production and create retraining trigger when needed.
Online endpoint operations
| Task | Command/action pattern | Notes |
|---|---|---|
| Create endpoint | az ml online-endpoint create | Endpoint is the stable scoring URL. |
| Create deployment | az ml online-deployment create | Deployment holds model, code, environment, compute config. |
| Allocate traffic | Endpoint traffic settings | Enables blue/green and canary. |
| View logs | az ml online-deployment get-logs | First stop for startup and scoring failures. |
| Test request | Invoke endpoint with sample payload | Confirms schema and runtime behavior. |
| Roll back | Shift traffic to previous deployment | Faster than rebuilding old model. |
| Remove unused deployment | Delete after validation period | Avoid confusion and unnecessary resource use. |
Endpoint troubleshooting
| Symptom | Likely area | Check |
|---|---|---|
| Deployment fails to provision | Environment/image/compute | Build logs, dependency versions, base image, and quota or capacity constraints (verify the actual values rather than assuming limits). |
| Container starts then crashes | init() failure | Model path, missing files, package imports. |
| 4xx response | Request/auth/schema | Endpoint keys/tokens, input JSON, content type, route. |
| 5xx response | Scoring code/runtime | run() exceptions, memory, dependency mismatch. |
| Slow responses | Model/runtime/compute | Model size, preprocessing, instance type, autoscale configuration. |
| Works locally, fails in Azure | Environment or identity | Pin packages; check managed identity and resource access. |
| No access to data/service | RBAC/networking | Storage permissions, private endpoints, outbound restrictions. |
Batch endpoint operations
| Batch need | Design choice |
|---|---|
| Score files or large tabular data | Use batch endpoint with data input. |
| Schedule recurring scoring | Trigger from orchestration or CI/CD scheduler. |
| Parallelize work | Configure batch deployment/job parallelism options appropriate to workload. |
| Store outputs | Write predictions to configured output location. |
| Troubleshoot failed records | Inspect batch job logs and per-task errors. |
Online vs batch endpoint
| Online endpoint | Batch endpoint |
|---|---|
| Synchronous request/response | Asynchronous job-style scoring |
| Low latency | Throughput over latency |
| Small payloads per request | Large datasets or many files |
| User/application-facing API | Back-office scoring pipelines |
| Traffic splitting between deployments | Deployment chosen for batch invocation |
Monitoring and observability
| Signal | Why it matters | Where to look conceptually |
|---|---|---|
| Job status and duration | Detect failed or slow training pipelines. | Azure ML job history and logs. |
| Training metrics | Determine whether a candidate should be registered. | MLflow/run metrics. |
| Endpoint request count | Understand usage and capacity. | Endpoint/Azure Monitor metrics. |
| Endpoint latency | Detect performance regression. | Endpoint metrics and logs. |
| Endpoint failures | Detect runtime or caller issues. | Deployment logs, application logs. |
| Resource utilization | Right-size compute and troubleshoot bottlenecks. | Azure Monitor metrics. |
| Data distribution | Detect potential feature/data drift. | Model monitoring or custom validation. |
| Prediction quality | Confirm model still performs after labels arrive. | Offline evaluation against ground truth. |
| Responsible AI findings | Identify fairness, explainability, or error-pattern concerns. | Responsible AI artifacts/dashboards where used. |
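For the data-distribution signal, "custom validation" could be as simple as comparing category shares between training data and recent production inputs. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration and not named in this reference or tied to Azure ML model monitoring. The function name, the `epsilon` floor, the sample category names, and the 0.25 comparison value are all assumptions.

```python
import math
from collections import Counter


def population_stability_index(reference, current, epsilon=1e-6):
    """Compute PSI between two samples of categorical (or pre-binned) values."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for category in sorted(set(ref_counts) | set(cur_counts)):
        # Floor each share at epsilon so unseen categories do not cause log(0).
        ref_share = max(ref_counts[category] / len(reference), epsilon)
        cur_share = max(cur_counts[category] / len(current), epsilon)
        psi += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return psi


training = ["basic"] * 50 + ["premium"] * 50
same_mix = ["basic"] * 5 + ["premium"] * 5
shifted = ["basic"] * 80 + ["premium"] * 20

print(population_stability_index(training, same_mix))        # 0.0
print(population_stability_index(training, shifted) > 0.25)  # True
```

A value above a team-chosen threshold would be a signal to investigate or to trigger the retraining pipeline, subject to the validation and approval gates described elsewhere in this reference.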
Monitoring decision cues
| If the question says… | Prefer… |
|---|---|
| “Endpoint returns errors after deployment” | Deployment logs and scoring script diagnostics. |
| “Need to detect degraded model quality when labels become available” | Compare predictions to ground truth and trigger retraining. |
| “Need operational metrics and alerts” | Azure Monitor-style metrics/alerts for endpoints and resources. |
| “Need experiment metrics and model lineage” | Azure ML run history and MLflow tracking. |
| “Need to understand model behavior across cohorts” | Error analysis, explainability, or responsible AI review artifacts. |
| “Need automatic retraining when data changes” | Scheduled/event-triggered pipeline with validation and approval gates. |
Responsible AI in MLOps
| Concern | Operational control |
|---|---|
| Explainability | Capture explanations or feature importance where appropriate. |
| Fairness | Evaluate performance across relevant cohorts. |
| Error analysis | Identify segments where the model fails disproportionately. |
| Transparency | Keep model metadata, intended use, limitations, and evaluation results. |
| Human review | Add approval gates for high-impact changes. |
| Monitoring | Recheck quality and cohort behavior after deployment. |
Common exam distinction: responsible AI is not only a training-time concern. Operational workflows should preserve evidence, review results, and monitor production behavior.
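Evaluating performance across cohorts, both before release and again in production once labels arrive, can be sketched in a few lines. The example below is illustrative only: `accuracy_by_cohort` and `largest_gap` are hypothetical helpers, the cohort labels and records are made up, and accuracy is used purely as an example metric.

```python
from collections import defaultdict


def accuracy_by_cohort(records):
    """Return accuracy per cohort from (cohort, y_true, y_pred) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for cohort, y_true, y_pred in records:
        totals[cohort] += 1
        correct[cohort] += int(y_true == y_pred)
    return {cohort: correct[cohort] / totals[cohort] for cohort in totals}


def largest_gap(scores):
    """Return the difference between the best and worst cohort scores."""
    return max(scores.values()) - min(scores.values())


# Made-up labelled predictions for two cohorts.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 0),
]
scores = accuracy_by_cohort(records)
print(scores)               # {'A': 0.75, 'B': 0.5}
print(largest_gap(scores))  # 0.25
```

Storing the per-cohort scores with the model metadata keeps the evidence available for later review and for comparison after deployment.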
Data management decisions
| Need | Use | Avoid |
|---|---|---|
| Reference existing cloud data | Datastore plus data asset | Copying data into arbitrary local folders. |
| Reproducible training | Versioned data asset or immutable path | “Latest” path with no version record. |
| Pipeline input binding | Declared input in YAML/component | Hidden path inside script. |
| Large file/folder input | URI folder/file style assets | Embedding large data in repo. |
| Tabular schema-aware input | MLTable-style data asset where appropriate | Manually parsing inconsistent files repeatedly. |
| Secure access | Managed identity/RBAC | Storage keys in scripts. |
Environment and dependency decisions
| Requirement | Recommended pattern |
|---|---|
| Fast start with common frameworks | Curated environment if it fits. |
| Custom packages | Custom environment with conda/pip dependencies. |
| Native libraries or system packages | Dockerfile/base image approach. |
| Training/deployment parity | Use compatible or same dependency versions for train and inference. |
| Reproducibility | Pin package versions and version the environment. |
| Security | Scan/review images and avoid secrets baked into images. |
| Troubleshooting | Check image build logs and import errors first. |
Infrastructure as code and governance
| Governance need | Pattern |
|---|---|
| Repeat workspace creation | Use ARM/Bicep/Terraform or approved IaC tooling. |
| Environment separation | Separate dev/test/prod workspaces and controlled promotion. |
| Policy enforcement | Use Azure Policy/RBAC/resource locks where appropriate. |
| Auditability | Keep changes in source control and CI/CD logs. |
| Network consistency | Define private endpoints, VNets, DNS, and outbound rules as code. |
| Least privilege | Assign roles to managed identities/service principals per environment. |
Scenario cue table
| Scenario wording | Likely answer |
|---|---|
| “Data scientist needs a cloud VM for notebooks” | Compute instance. |
| “Training should scale down when jobs finish” | Compute cluster or managed job compute pattern. |
| “Reusable preprocessing step across pipelines” | Command component. |
| “Pipeline should fail if accuracy is below threshold” | Evaluation component with quality gate before registration/deployment. |
| “Model must be served with HTTPS for applications” | Online endpoint. |
| “Score millions of records overnight” | Batch endpoint. |
| “Deploy new model to 10% of users” | Second online deployment with traffic split. |
| “Rollback quickly after errors” | Shift traffic back to previous deployment. |
| “Share approved model between workspaces” | Registry. |
| “Pipeline needs storage access without credentials in code” | Managed identity with RBAC. |
| “Private access only” | Private endpoint/network-restricted workspace and endpoint design. |
| “Endpoint cannot import Python package” | Fix environment image/dependencies. |
| “Need run metrics and lineage” | MLflow/Azure ML experiment tracking. |
| “Need to trigger retraining from production drift signal” | Monitoring plus scheduled/event-triggered pipeline. |
| “Need manual approval before production” | CI/CD environment approval or release gate. |
Common traps to avoid
| Trap | Better answer |
|---|---|
| Treat notebooks as production workflows | Convert to scripts, components, jobs, and pipelines. |
| Deploy directly from local files | Register versioned model and environment assets. |
| Use online endpoint for offline bulk scoring | Use batch endpoint. |
| Use batch endpoint for user-facing low-latency API | Use online endpoint. |
| Rebuild a different model in prod | Promote the evaluated model artifact. |
| Ignore environment parity | Align train and inference dependencies. |
| Store secrets in YAML | Use secure identity or Key Vault-backed configuration. |
| Assume RBAC on workspace grants data permissions | Grant access on storage/data resources too. |
| Replace deployment in place with no fallback | Use blue/green or canary with rollback path. |
| Monitor only infrastructure | Also monitor data, predictions, labels, and business quality. |
| Register every experiment | Register only candidates that pass evaluation criteria. |
| Use broad permissions for automation | Apply least privilege to service principals or managed identities. |
Last-minute checklist
- Know the difference between workspace, registry, model, environment, component, job, pipeline, endpoint, and deployment.
- Choose online endpoints for real-time inference and batch endpoints for offline scoring.
- Use versioned data, code, environments, and models for reproducibility.
- Use MLflow/Azure ML tracking for metrics, artifacts, and lineage.
- Use managed identities/RBAC instead of embedded credentials.
- Use CI/CD to validate, train, evaluate, register, deploy, test, and promote.
- Use traffic splitting for canary/blue-green deployment and rollback.
- Troubleshoot endpoints from logs, scoring script behavior, dependencies, identity, and networking.
- Include monitoring for operational health and model quality.
- Treat responsible AI outputs as part of the operational evidence chain.
Practical next step
Use this Quick Reference to drill scenario questions: for each prompt, identify the Azure Machine Learning asset, compute option, identity pattern, deployment type, and monitoring action before checking the answer.