Exam identity and review mindset
| Item | Reference |
|---|---|
| Vendor/provider | Python Institute |
| Official exam title | Python Institute PCEI - Certified Entry-Level AI Specialist with Python (PCEI-30-01) |
| Exam code | PCEI-30-01 |
| Page purpose | Independent quick review for AI concepts, Python patterns, data workflows, model selection, and evaluation basics |
| Best use | Review terms, choose algorithms from scenarios, read short Python snippets, and identify common AI/ML mistakes |
Focus on practical distinctions: classification vs regression, training vs inference, parameter vs hyperparameter, feature vs label, accuracy vs precision/recall, and model performance vs responsible AI risk.
AI and machine learning concept map
| Term | Compact meaning | Exam-use distinction |
|---|---|---|
| Artificial intelligence, AI | Systems that perform tasks associated with human intelligence | Broad umbrella: may include rules, search, ML, robotics, NLP, vision |
| Machine learning, ML | Models learn patterns from data instead of being explicitly programmed for every rule | ML is a subset of AI |
| Deep learning, DL | ML using multi-layer neural networks | Strong for images, speech, text, and large unstructured data |
| Data science | Extracting insights from data using statistics, programming, and domain knowledge | May include analytics without predictive AI |
| Model | Learned or designed function that maps inputs to outputs | Trained model is used for inference |
| Training | Process of fitting model parameters using data | Uses training data and a loss/objective |
| Inference | Using a trained model to make predictions | Should not update learned parameters unless online learning is intended |
| Feature | Input variable used by the model | Example: age, pixel values, token counts |
| Label / target | Output the model should learn to predict | Present in supervised learning |
| Parameter | Learned internal value | Weights in linear models or neural networks |
| Hyperparameter | Setting chosen before training (or adjusted by a tuning procedure) rather than learned from data | Learning rate, tree depth, number of clusters |
| Loss function | Quantity minimized during training | Cross-entropy for classification; MSE often for regression |
| Generalization | Performance on unseen data | Better exam answer than “memorizes training set” |
| Overfitting | Model fits training data too closely and performs poorly on new data | Often high train score, low validation/test score |
| Underfitting | Model is too simple or poorly trained to capture patterns | Low train and validation performance |
| Bias | In ML error: simplifying assumptions; in responsible AI: unfair systematic harm | Read the scenario carefully; the word has two contexts |
| Variance | Sensitivity to training data changes | High variance often means overfitting |
Learning paradigms and task selection
| Paradigm | Data available | Output | Common examples | Choose when | Common trap |
|---|---|---|---|---|---|
| Supervised classification | Features plus class labels | Category/class | Spam/not spam, disease/no disease, image class | Target is discrete | Predicting a number does not automatically mean regression if the number is a class code |
| Supervised regression | Features plus numeric target | Continuous value | Price, temperature, demand | Target is numeric and ordered/continuous | Do not use accuracy for regression |
| Unsupervised clustering | Features without labels | Groups/segments | Customer segments, document groups | Need structure discovery | Clusters are not automatically “correct labels” |
| Unsupervised dimensionality reduction | Features without labels | Fewer transformed features | PCA, visualization, compression | Need to simplify high-dimensional data | Transformed components may be hard to interpret |
| Reinforcement learning | Agent, environment, rewards | Policy/actions | Game agent, robotics control | Sequential decisions with feedback | Reward design is critical; not the default for normal labeled datasets |
| Semi-supervised learning | Few labels plus many unlabeled samples | Improved supervised model | Label-scarce image/text tasks | Labels are expensive | Unlabeled data must be relevant to the same problem |
| Self-supervised learning | Labels generated from data itself | Representations/pretraining | Masked words, contrastive learning | Large unlabeled text/image data | Not the same as manually labeled supervised learning |
End-to-end AI/ML workflow
flowchart LR
A[Define problem] --> B[Collect data]
B --> C[Explore and clean]
C --> D[Split data]
D --> E[Preprocess and engineer features]
E --> F[Train model]
F --> G[Validate and tune]
G --> H[Test once]
H --> I[Deploy or report]
I --> J[Monitor drift, errors, bias]
| Step | Key questions | Artifacts | High-yield trap |
|---|---|---|---|
| Define problem | Classification, regression, clustering, generation, ranking? | Objective, success metric, constraints | Picking an algorithm before defining the target |
| Collect data | Is data representative, legal to use, and relevant? | Raw datasets, metadata | More data is not useful if it is biased or mislabeled |
| Explore data | Missing values, outliers, class balance, correlations? | Summary stats, plots | Assuming correlation proves causation |
| Split data | Train/validation/test or cross-validation? | Reproducible split | Letting test data influence preprocessing or tuning |
| Preprocess | Scale, encode, tokenize, impute, normalize? | Pipeline or transformation code | Fitting preprocessors on all data causes leakage |
| Train | Which model family and hyperparameters? | Fitted model | High training score alone is not evidence of success |
| Validate/tune | Which hyperparameters improve validation metric? | Scores, selected model | Repeatedly tuning on test data invalidates test result |
| Test | Final unbiased performance estimate? | Test metrics | Testing before final model selection |
| Deploy/monitor | Is performance stable in production? | Model service, logs, dashboards | Data drift can degrade a once-good model |
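The "Split data" step can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a library splitter: the function name `three_way_split` and the 60/20/20 fractions are assumptions made for the example, and it does not stratify by class.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle a copy of rows and cut it into train/validation/test lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle gives a reproducible split
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The three lists never share a row, so validation can guide tuning while the test list stays untouched until the final evaluation.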
Data types, features, and preprocessing choices
| Data / issue | Common handling | Choose / remember | Trap |
|---|---|---|---|
| Numeric continuous | Scaling, normalization, outlier checks | Important for kNN, SVM, logistic regression, neural networks | Trees usually need less scaling |
| Categorical nominal | One-hot encoding, embeddings for high-cardinality data | No inherent order | Label encoding may imply false order |
| Categorical ordinal | Ordered integer mapping or ordinal encoding | Keep meaningful order | Treating ordinal values as purely nominal can lose signal |
| Text | Tokenization, vectorization, embeddings | Convert words/tokens to numbers | Raw strings cannot be directly used by most numeric models |
| Images | Pixel arrays, normalization, augmentation, CNNs | Preserve spatial patterns | Flattening may discard useful locality for vision tasks |
| Time series | Time-aware split, lag features, rolling stats | Future data must not leak into past | Random split can leak future information |
| Missing values | Imputation, missingness indicators, row removal | Strategy depends on why data is missing | Dropping rows can bias the dataset |
| Outliers | Investigate, cap, transform, robust models | Some outliers are valid signals | Blind removal can discard rare but important cases |
| Imbalanced classes | Stratified split, class weights, resampling, precision/recall/F1 | Use metrics beyond accuracy | A model predicting only majority class can look accurate |
| Duplicate records | Deduplicate before split where appropriate | Prevent same entity in train and test | Duplicates inflate test performance |
| Data leakage | Keep target/future/test information out of training | Use pipelines fitted on training only | Leakage often creates unrealistically high metrics |
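The leakage row above comes down to one habit: learn preprocessing statistics from the training split only, then reuse them on everything else. A minimal sketch, assuming a single numeric feature; the helper names `fit_standardizer` and `transform` and the age values are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma

def transform(values, mu, sigma):
    """Apply previously learned statistics; nothing is re-fitted here."""
    return [(v - mu) / sigma for v in values]

train_ages = [20, 30, 40]   # hypothetical training feature
test_ages = [50]            # hypothetical held-out feature

mu, sigma = fit_standardizer(train_ages)
train_scaled = transform(train_ages, mu, sigma)
test_scaled = transform(test_ages, mu, sigma)   # uses training mu and sigma
print(mu)  # 30
```

If the statistics were computed on `train_ages + test_ages` instead, information about the test row would leak into the features the model trains on.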
Python essentials for AI code questions
| Python feature | Remember | Common mistake |
|---|---|---|
| Indentation | Defines code blocks | Misreading nested if, for, def, or class blocks |
| Lists | Ordered, mutable sequences | Assignment binds a second name to the same list; it does not copy it |
| Tuples | Ordered, immutable sequences | Tuple can contain mutable objects |
| Dictionaries | Key-value mappings | Keys must be hashable |
| Sets | Unordered unique elements | No index-based access |
| Slicing | seq[start:stop:step], stop is excluded | Off-by-one errors |
| Negative index | seq[-1] is last item | seq[-0] equals seq[0] |
| List comprehension | Compact transformation/filter | Side effects are less readable |
| Functions | return sends value back | Printing is not returning |
| Lambda | Small anonymous function | Best for simple expressions only |
| Exceptions | try / except handles runtime errors | Catching broad exceptions can hide bugs |
| Modules | Imported with import or from ... import ... | Namespace changes depending on import style |
| Randomness | Use seeds for reproducibility | Seed does not make a poor split representative |
| Boolean logic | and, or, not; truthy/falsy values | Confusing bitwise & with logical and outside array contexts |
Core Python patterns
values = [3, 1, 4, 1, 5]
squares = [x * x for x in values if x > 1]
unique_values = set(values)
counts = {x: values.count(x) for x in unique_values}
def normalize_minmax(x, min_x, max_x):
    # Assumes max_x > min_x; equal bounds would divide by zero.
    return (x - min_x) / (max_x - min_x)
Read snippets for data flow: what is created, transformed, fitted, predicted, or measured.
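Three of the mistakes listed in the Python table (reference assignment, slice bounds, and printing versus returning) are easier to spot once traced by hand. The snippet below is an illustrative sketch; the names `show_double` and `get_double` are made up for the example.

```python
original = [1, 2, 3]
alias = original        # same list object, no copy is made
shallow = original[:]   # new list holding the same elements
alias.append(4)
print(original)  # [1, 2, 3, 4]
print(shallow)   # [1, 2, 3]

letters = ["a", "b", "c", "d", "e"]
print(letters[1:4])  # ['b', 'c', 'd']  (stop index is excluded)
print(letters[::2])  # ['a', 'c', 'e']
print(letters[-1])   # e

def show_double(x):
    print(x * 2)        # prints the value but returns None

def get_double(x):
    return x * 2        # hands the value back to the caller

result = show_double(5)  # prints 10
print(result)            # None
print(get_double(5))     # 10
```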
NumPy and pandas quick reference
| Library | Used for | High-yield objects / methods |
|---|---|---|
| NumPy | Numeric arrays, vectorized operations, linear algebra basics | array, shape, reshape, mean, sum, argmax, dot, broadcasting |
| pandas | Tabular data loading, cleaning, exploration | DataFrame, Series, read_csv, head, info, describe, isna, value_counts, groupby, loc, iloc |
| matplotlib / plotting tools | Basic visualization | Histograms, scatter plots, line plots, confusion matrix display |
| scikit-learn-style APIs | Classical ML workflows | fit, transform, predict, score, train/test split, metrics, pipelines |
import numpy as np
x = np.array([[1, 2, 3],
              [4, 5, 6]])
x.shape # (2, 3)
x.mean(axis=0) # column means
x.mean(axis=1) # row means
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
df.info()
df["target"].value_counts()
df.isna().sum()
| pandas selector | Meaning |
|---|---|
| df["col"] | Select one column as a Series |
| df[["a", "b"]] | Select multiple columns as a DataFrame |
| df.loc[row_label, col_label] | Label-based selection |
| df.iloc[row_index, col_index] | Position-based selection |
| df.drop(columns=["x"]) | Remove column x |
| df.groupby("class").mean() | Aggregate by group |
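The difference between label-based and position-based selection shows up most clearly in slices. The sketch below builds a small in-memory DataFrame; the column names, index labels, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical table with string index labels.
df = pd.DataFrame(
    {"age": [25, 32, 47], "label": ["no", "yes", "yes"]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b", "age"]   # label slice includes both endpoints
by_position = df.iloc[0:2, 0]       # positional slice excludes the stop index

print(by_label.tolist())     # [25, 32]
print(by_position.tolist())  # [25, 32]
print(df.groupby("label")["age"].mean().to_dict())  # {'no': 25.0, 'yes': 39.5}
```

Both selections return the same two rows here, but only because `"a":"b"` is inclusive while `0:2` stops before position 2.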
Model selection matrix
| Model / method | Best fit | Preprocessing needs | Strengths | Watch for |
|---|---|---|---|---|
| Linear regression | Numeric regression with roughly linear relationships | Encoding, often scaling | Simple, interpretable baseline | Poor fit for strong nonlinearity unless features are engineered |
| Logistic regression | Binary or multiclass classification | Encoding, often scaling | Strong baseline, probabilistic outputs | Despite name, used for classification |
| k-nearest neighbors, kNN | Classification/regression based on similar examples | Scaling is very important | Simple concept; no complex training | Slow on large data; sensitive to irrelevant features |
| Naive Bayes | Text classification, simple probabilistic classification | Text vectorization for NLP | Fast, works well for bag-of-words text | “Naive” independence assumption may be unrealistic |
| Decision tree | Classification/regression with nonlinear rules | Little scaling needed | Interpretable if small | Easily overfits if unconstrained |
| Random forest | Ensemble of decision trees | Little scaling needed | Reduces variance, strong general-purpose model | Less interpretable than one tree |
| Gradient boosting | Sequential ensemble improving errors | Depends on implementation/data | High predictive performance | Sensitive to tuning; can overfit |
| Support vector machine, SVM | Classification with clear margins; can use kernels | Scaling usually important | Effective in many medium-size problems | Kernel choice and tuning matter |
| k-means | Unsupervised clustering into k groups | Scaling important | Simple clustering baseline | Must choose k; assumes roughly spherical clusters |
| PCA | Dimensionality reduction | Scaling often important | Compresses features, removes correlation | Components may not map to human-readable features |
| Neural network | Complex nonlinear patterns, unstructured data | Scaling/normalization; more data often needed | Flexible, supports deep learning | More parameters, less interpretability, compute needs |
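The matrix marks scaling as very important for kNN. A small sketch of why, in plain Python: with raw units the large-valued feature dominates the distance, and min-max scaling fitted on the stored examples changes which neighbor is nearest. All rows and helper names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, points):
    """Return the index of the stored point closest to the query."""
    return min(range(len(points)), key=lambda i: euclidean(query, points[i]))

def fit_minmax(points):
    """Learn per-column minimum and range from the stored points only."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return lows, spans

def apply_minmax(point, lows, spans):
    return [(v - lo) / s for v, lo, s in zip(point, lows, spans)]

# Hypothetical rows: [age in years, income in dollars]
stored = [[25, 50_000], [60, 51_000]]
query = [26, 50_800]

raw_idx = nearest(query, stored)   # income differences dominate the distance
lows, spans = fit_minmax(stored)
scaled_idx = nearest(
    apply_minmax(query, lows, spans),
    [apply_minmax(p, lows, spans) for p in stored],
)
print(raw_idx, scaled_idx)  # 1 0
```

Unscaled, the 26-year-old is matched to the 60-year-old because a 200-dollar gap outweighs a 34-year gap; after scaling, the 25-year-old is the nearer neighbor.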
Neural networks and deep learning
| Concept | Meaning | Exam cue |
|---|---|---|
| Neuron/unit | Computes weighted input plus bias, then activation | Basic building block |
| Weight | Learned coefficient | Parameter, not hyperparameter |
| Bias term | Learned offset | Lets activation shift |
| Activation function | Adds nonlinearity | Without nonlinear activations, stacked layers act like a linear model |
| Forward pass | Inputs flow through network to output | Prediction computation |
| Loss | Difference between prediction and desired output | Training minimizes loss |
| Backpropagation | Computes gradients through network | Used to update weights |
| Gradient descent | Optimization method moving parameters to reduce loss | Learning rate controls step size |
| Epoch | One pass over training data | More epochs can overfit |
| Batch / mini-batch | Subset used for one update | Common in neural network training |
| CNN | Convolutional neural network | Strong for images and spatial patterns |
| RNN | Recurrent neural network | Designed for sequences; less central than transformers in modern NLP |
| Transformer | Attention-based architecture | Common in modern language models |
| Embedding | Dense vector representation | Used for words, documents, images, users/items |
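The loss, gradient descent, and epoch rows can be tied together with a one-feature linear model. In this sketch `w` and `b` are learned parameters, while `learning_rate` and `epochs` are hyperparameters; the function name `fit_line` and the data are assumptions chosen so the true answer is known.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0                      # parameters: learned from data
    n = len(xs)
    for _ in range(epochs):              # one epoch = one pass over the data
        preds = [w * x + b for x in xs]  # forward pass
        # Gradients of MSE with respect to w and b.
        grad_w = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        grad_b = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        w -= learning_rate * grad_w      # step against the gradient
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

A much larger learning rate would overshoot and diverge on this data, which is why step size is usually among the first hyperparameters to tune.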
Common activations:
| Activation | Typical use | Key behavior |
|---|---|---|
| ReLU | Hidden layers | Outputs zero for negative input and passes positive input through unchanged |
| Sigmoid | Binary probability output or gating | Outputs between 0 and 1 |
| Softmax | Multiclass output | Converts class scores into probabilities that sum to 1 |
| Tanh | Hidden layers in some networks | Outputs between -1 and 1 |
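The behaviors in the activation table can be checked numerically. The functions below are plain-Python sketches written for this review, not a framework's implementation; `math.tanh` is used directly from the standard library.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sigmoid(0.0))           # 0.5
print(math.tanh(0.0))         # 0.0

probs = softmax([2.0, 1.0, 0.1])
print(probs.index(max(probs)))  # 0
```

The softmax outputs sum to 1, and the largest input score receives the largest probability.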
Metrics and evaluation
Confusion matrix terms
| Term | Meaning |
|---|---|
| True positive, TP | Predicted positive and actually positive |
| True negative, TN | Predicted negative and actually negative |
| False positive, FP | Predicted positive but actually negative |
| False negative, FN | Predicted negative but actually positive |
\[
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{F1} &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
\]
| Metric | Use when | Trap |
|---|---|---|
| Accuracy | Classes are reasonably balanced and error costs are similar | Misleading with class imbalance |
| Precision | False positives are costly | High precision can still miss many positives |
| Recall / sensitivity | False negatives are costly | High recall can create many false positives |
| Specificity | True-negative performance matters | Often paired with sensitivity |
| F1 score | Need balance between precision and recall | Hides trade-off between the two |
| ROC AUC | Ranking ability across thresholds | Can look good even when precision is poor in rare-positive tasks |
| PR AUC | Positive class is rare | More informative than ROC in many imbalanced cases |
| Confusion matrix | Need error type breakdown | Must know which class is “positive” |
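The formulas above can be turned into a small helper to see why accuracy misleads on imbalanced data. This is a sketch with hypothetical counts; returning 0.0 when a denominator is zero is a convention chosen for the example, not a universal rule. Specificity is computed as TN / (TN + FP).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical: 1000 cases, 10 positives, model always predicts negative.
always_negative = classification_metrics(tp=0, tn=990, fp=0, fn=10)
print(always_negative["accuracy"])  # 0.99
print(always_negative["recall"])    # 0.0

# Hypothetical: a model that finds 8 of the 10 positives with 10 false alarms.
detector = classification_metrics(tp=8, tn=980, fp=10, fn=2)
print(detector["recall"])           # 0.8
```

The always-negative model scores 99% accuracy while finding no positives, which is the trap named in the accuracy row.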
Regression metrics:
\[
\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
\text{RMSE} &= \sqrt{\text{MSE}}
\end{aligned}
\]
| Metric | Use when | Trap |
|---|---|---|
| MAE | Need average absolute error in target units | Less sensitive to large errors than MSE or RMSE |
| MSE | Penalize larger errors more | Units are squared |
| RMSE | Penalize large errors while keeping target units | Sensitive to outliers |
| R-squared | Need the share of target variance explained relative to a mean-only baseline | High value does not prove causation or fairness |
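A worked comparison makes the MAE versus RMSE distinction concrete. The sketch below uses invented targets and predictions; R-squared is computed as one minus the ratio of squared error to the variance around the mean target, and is undefined when all targets are identical.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # baseline: predict the mean
    r2 = 1 - (mse * n) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}

y_true = [10.0, 20.0, 30.0, 40.0]
small_errors = regression_metrics(y_true, [11.0, 19.0, 31.0, 39.0])
one_big_error = regression_metrics(y_true, [10.0, 20.0, 30.0, 44.0])

print(small_errors["mae"], one_big_error["mae"])    # 1.0 1.0
print(small_errors["rmse"], one_big_error["rmse"])  # 1.0 2.0
```

Both prediction sets have the same MAE, but the single large miss doubles the RMSE because squaring weights big errors more heavily.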
Train, validation, test, and cross-validation
| Dataset part | Purpose | Should be used for |
|---|---|---|
| Training set | Fit model parameters | Fitting the model and any preprocessing steps |
| Validation set | Tune hyperparameters and compare models | Model selection |
| Test set | Final estimate of generalization | One-time final evaluation |
| Cross-validation | Repeated train/validation splits | More stable model comparison on limited data |
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# X (feature matrix) and y (labels) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(classification_report(y_test, pred))
Key point: the scaler inside the pipeline is fitted on X_train, not the full dataset. That helps avoid data leakage.
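The cross-validation row in the table above describes repeated train/validation splits. A minimal sketch of how k-fold indices might be generated, using only the standard library; `k_fold_indices` is a hypothetical helper, and it does not shuffle or stratify, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
# 6 [0, 1, 2, 3]
# 7 [4, 5, 6]
# 7 [7, 8, 9]
```

Every sample lands in a validation fold exactly once, so each model in the comparison is scored on data it was not fitted on.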
Overfitting, underfitting, and fixes
| Symptom | Likely issue | Practical fixes |
|---|---|---|
| High training score, low validation score | Overfitting / high variance | More data, regularization, simpler model, pruning, dropout, early stopping, cross-validation |
| Low training and validation score | Underfitting / high bias | More expressive model, better features, train longer, reduce excessive regularization |
| Validation score unstable across splits | High variance or small dataset | Cross-validation, more data, simpler model |
| Great test score during development but poor production performance | Leakage, distribution shift, or over-tuning | Recheck split, monitor drift, use realistic validation |
| Model performs well overall but fails subgroup | Bias/fairness issue or unrepresentative data | Subgroup evaluation, better data coverage, fairness review |
| Technique | What it does | Common use |
|---|---|---|
| Regularization | Penalizes model complexity | Reduce overfitting |
| Dropout | Randomly disables neural units during training | Neural network regularization |
| Early stopping | Stops training when validation stops improving | Avoid overtraining |
| Pruning | Limits decision tree complexity | Reduce tree overfitting |
| Data augmentation | Creates modified training examples | Images, text, audio robustness |
| Cross-validation | Tests performance across multiple splits | Model selection on limited data |
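Early stopping from the table above can be sketched as a loop over validation losses with a patience counter. The function name, the `patience` default, and the loss values are assumptions for illustration; real training loops would also save the model weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch index, stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation loss has stopped improving
    return best_epoch

# Hypothetical validation losses per epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
print(early_stopping_epoch(val_losses))              # 2
print(early_stopping_epoch(val_losses, patience=3))  # 5
```

With a patience of 2 the loop stops before ever seeing the later improvement at epoch 5, which shows that patience is itself a hyperparameter with a trade-off.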
Generative AI, NLP, and embeddings
| Concept | Meaning | Exam-use distinction |
|---|---|---|
| Tokenization | Splits text into tokens | Tokens may be words, subwords, or characters |
| Vocabulary | Set of tokens known to a model/vectorizer | Unknown or rare words need handling |
| Bag of words | Counts token occurrences | Ignores word order |
| TF-IDF | Weights words by frequency and rarity | Useful classical text representation |
| Embedding | Dense numeric vector representing meaning/features | Similar items should be close in vector space |
| Language model | Predicts or generates text | Can be used for completion, classification, summarization |
| Generative model | Produces new content | Text, image, audio, code, or synthetic data |
| Prompt | Input instruction/context for a generative model | Prompt wording affects output |
| Hallucination | Plausible but incorrect generated output | Requires verification and guardrails |
| RAG | Retrieval-augmented generation | Retrieves external context before generation |
| Fine-tuning | Further training a model on task-specific data | Changes model behavior more deeply than prompting |
| Temperature | Sampling randomness control | Higher generally means more varied output; lower more deterministic |
| Guardrails | Controls to reduce unsafe or invalid outputs | Can include filtering, validation, human review |
Cosine similarity is commonly used to compare embeddings:
\[
\text{cosine similarity} = \frac{A \cdot B}{\lVert A \rVert \lVert B \rVert}
\]
import numpy as np
def cosine_similarity(a, b):
    # Undefined (division by zero) if either vector has zero length.
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
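To connect bag of words with cosine similarity, the sketch below builds count vectors over a shared vocabulary and compares them. It uses only the standard library, so it defines its own small `cosine` helper; the documents are invented, and count vectors stand in here for learned embeddings.

```python
from collections import Counter
import math

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0   # convention: 0.0 for a zero vector

docs = ["the cat sat", "the cat sat down", "stock prices fell"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(round(cosine(vectors[0], vectors[1]), 3))  # 0.866
print(cosine(vectors[0], vectors[2]))            # 0.0
print(bag_of_words("sat cat the", vocabulary) == vectors[0])  # True
```

The last line shows the bag-of-words limitation from the table: reordering the words produces an identical vector.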
Responsible AI, ethics, and risk controls
| Risk | Example | Mitigation idea |
|---|---|---|
| Bias / unfairness | Lower performance for a demographic subgroup | Representative data, subgroup metrics, fairness review |
| Privacy exposure | Sensitive data included in training or prompts | Data minimization, anonymization, access control |
| Lack of explainability | User cannot understand why a decision was made | Simpler model, feature importance, documentation |
| Hallucination | Generated answer invents facts | Retrieval, validation, citations, human review |
| Data poisoning | Malicious or corrupted training data | Data provenance, validation, monitoring |
| Adversarial inputs | Small input changes cause wrong predictions | Robust testing, input validation, monitoring |
| Automation bias | Users overtrust AI output | Human-in-the-loop review and clear uncertainty |
| Model drift | Production data changes over time | Monitoring, retraining triggers, performance checks |
| Security leakage | Model or API reveals sensitive information | Authentication, authorization, logging, rate controls |
| Misuse | Model used outside intended scope | Clear documentation, constraints, governance |
High-yield distinction: model accuracy is not the same as model acceptability. A model can score well and still be unsafe, unfair, nontransparent, or inappropriate for deployment.
Scenario decision rules
| If the question says… | Think… |
|---|---|
| “Predict whether” / “classify as” / “which category” | Classification |
| “Predict price/amount/temperature” | Regression |
| “Find natural groups without labels” | Clustering |
| “Reduce many features while preserving information” | Dimensionality reduction |
| “Agent learns by reward and penalty” | Reinforcement learning |
| “Images with spatial patterns” | CNN or image-focused preprocessing |
| “Text meaning or semantic search” | Embeddings, language models, NLP |
| “Rare positive class” | Precision, recall, F1, PR AUC; not accuracy alone |
| “False negative is dangerous” | Prioritize recall/sensitivity |
| “False positive is expensive” | Prioritize precision |
| “Very high training score, weak validation score” | Overfitting |
| “Preprocessing fitted on all data before the split” | Data leakage risk |
| “New data no longer matches training data” | Drift / distribution shift |
| “Need human-understandable rules” | Simpler interpretable model or explanation method |
Common traps to eliminate
- Logistic regression is for classification, not ordinary numeric regression.
- Accuracy can be a poor metric when classes are imbalanced.
- Test data is not for tuning. Use validation or cross-validation for model selection.
- Fit preprocessing only on training data; transform validation/test using training-fitted steps.
- Correlation does not prove causation.
- Hyperparameters are chosen, while parameters are learned.
- Scaling matters for distance-based and gradient-based methods; it is usually less critical for tree-based models.
- Unsupervised learning has no labels during training.
- A larger model is not automatically better; it can overfit and be harder to explain.
- A seed improves reproducibility, not necessarily model quality.
- Good average performance can hide subgroup failure.
- Generative AI output must be verified when correctness matters.
Last-pass checklist for PCEI-30-01 review
Use this checklist before practice questions for the Python Institute PCEI - Certified Entry-Level AI Specialist with Python (PCEI-30-01):
- Identify the ML task from the target: category, number, cluster, sequential action, or generated content.
- Name the correct data split and what each split is allowed to influence.
- Match metric to business error: false positive, false negative, continuous error, ranking, or imbalance.
- Recognize leakage in preprocessing, feature creation, duplicates, and time-based data.
- Distinguish AI, ML, deep learning, NLP, computer vision, and generative AI.
- Read Python code for mutability, slicing, function return values, array shape, and fit/predict flow.
- Know when scaling, encoding, imputation, tokenization, and embeddings are needed.
- Recognize overfitting and underfitting from train/validation patterns.
- Include responsible AI risks when a scenario mentions privacy, fairness, safety, transparency, or misuse.
Practical next step
Take a short mixed set of PCEI-30-01-style practice questions, then review every missed item against this Quick Reference. For each miss, write one decision rule such as “rare positive class means precision/recall, not accuracy alone” or “fit preprocessing after the train/test split.”