Python Institute PCEI: Data Handling and Visualization

Try 10 focused Python Institute PCEI questions on data handling, analysis, and visualization, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.


Topic snapshot

Field | Detail
Exam route | Python Institute PCEI
Topic area | Block 3: Data Handling, Analysis, and Visualization
Blueprint weight | 16.5%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Block 3: Data Handling, Analysis, and Visualization for Python Institute PCEI. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 16.5% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Block 3: Data Handling, Analysis, and Visualization

A small clinic wants to train a simple model to predict whether appointment reminders should be sent by text or phone. The current dataset has many duplicate rows, several missing contact-preference values, and most examples come from weekday morning appointments, while the model will be used for all appointment times. What is the best action before trusting the model’s performance?

Options:

  • A. Clean the data and collect more representative examples

  • B. Remove all weekday morning examples from the dataset

  • C. Use a more complex model to handle the messy data

  • D. Train the model now and use the highest accuracy score

Best answer: A

Explanation: Model performance depends strongly on the data used to train and evaluate it. Duplicate rows and missing values can distort patterns, while non-representative examples can make a model look better than it will perform in real use. In this case, the dataset does not reflect all appointment times, so accuracy measured on it may not predict performance for evenings, weekends, or other underrepresented cases. The practical first step is to fix obvious data-quality problems and gather examples that match the full task. A more complex model cannot reliably compensate for biased or messy input data.

  • Highest accuracy fails because an accuracy score from messy, narrow data may be misleading.
  • Removing morning data fails because it discards useful examples instead of balancing the dataset.
  • Complex model fails because model complexity does not solve missing values, duplicates, or poor representation by itself.
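
The checks named in the explanation (duplicates, missing values, representation) can be sketched in plain Python. The sample rows and the field names `patient`, `time_slot`, and `preference` are hypothetical and invented for illustration; the question does not show the clinic's actual data, and a real project might use a library such as pandas instead.

```python
from collections import Counter

# Hypothetical appointment rows; names and values are invented for illustration.
appointments = [
    {"patient": 1, "time_slot": "weekday_morning", "preference": "text"},
    {"patient": 1, "time_slot": "weekday_morning", "preference": "text"},  # exact duplicate
    {"patient": 2, "time_slot": "weekday_morning", "preference": None},    # missing value
    {"patient": 3, "time_slot": "weekday_morning", "preference": "phone"},
    {"patient": 4, "time_slot": "weekend", "preference": "text"},
]

# Drop exact duplicates while keeping the first occurrence of each row.
seen = set()
unique_rows = []
for row in appointments:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# Count rows that still lack a contact preference.
missing_preference = sum(1 for row in unique_rows if row["preference"] is None)

# Count examples per time slot to see how lopsided the coverage is.
slot_counts = Counter(row["time_slot"] for row in unique_rows)

print(len(appointments) - len(unique_rows), "duplicate row(s) removed")
print(missing_preference, "row(s) missing a contact preference")
print(slot_counts)
```

A lopsided `slot_counts` is the signal to collect more examples from the underrepresented time slots before trusting any accuracy score.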

Question 2

Topic: Block 3: Data Handling, Analysis, and Visualization

A support team plots the number of tickets opened each weekday using Matplotlib.

import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
tickets = [8, 6, 10, 14, 12]

plt.plot(days, tickets, marker="o")
plt.title("Tickets Opened")
plt.xlabel("Day")
plt.ylabel("Tickets")
plt.show()

Which interpretation is best supported by the chart? Select ONE.

Options:

  • A. The chart shows the average number of tickets per day.

  • B. Tickets decrease steadily from Monday through Friday.

  • C. Tickets increase from Tuesday through Thursday, then decrease on Friday.

  • D. Friday has the highest number of tickets.

Best answer: C

Explanation: A line chart connects values in order, making it useful for spotting changes over time or ordered categories. Here, the plotted ticket counts are 8 on Monday, 6 on Tuesday, 10 on Wednesday, 14 on Thursday, and 12 on Friday. The clearest supported pattern is a rise after Tuesday through Thursday, followed by a small decrease on Friday. The chart does not compute an average or show Friday as the maximum; it only displays the listed daily counts.

  • Steady decrease fails because the values rise after Tuesday.
  • Friday as highest fails because Thursday has 14 tickets and Friday has 12.
  • Average shown fails because the code plots raw daily counts, not a calculated mean.
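
The same reading can be checked numerically without drawing the chart. This is a small sketch that reuses the `days` and `tickets` lists from the question; the names `changes`, `busiest_day`, and `mean_tickets` are introduced here for illustration.

```python
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
tickets = [8, 6, 10, 14, 12]

# Day-to-day change: positive means a rise, negative means a fall.
changes = [later - earlier for earlier, later in zip(tickets, tickets[1:])]
print(changes)  # [-2, 4, 4, -2]

# The day with the highest count.
busiest_day = days[tickets.index(max(tickets))]
print(busiest_day)  # Thu

# The mean is a separate calculation; the plot never computes it.
mean_tickets = sum(tickets) / len(tickets)
print(mean_tickets)  # 10.0
```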

Question 3

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner analyst needs to calculate the average sale amount by region from this file. The amount column must be numeric, missing sale amounts should not be counted as zero, and valid rows should be kept when possible.

sale_id,region,amount
1,North,$25.00
2,North,18.50
3,South,N/A
4,South,$31.00
5,East,

Which cleaning step is the best first action? Select ONE.

Options:

  • A. Strip currency symbols, convert amount to numeric, and keep missing values as missing.

  • B. Remove every row where amount contains a currency symbol.

  • C. Encode region names as integers before averaging.

  • D. Replace missing amount values with 0 before averaging.

Best answer: A

Explanation: The core cleaning need is numeric conversion with proper missing-value handling. The visible amount values mix valid numbers with currency-formatted values and unknown entries such as N/A and a blank cell. For an average, valid amounts should be converted to numbers after removing formatting characters, while unknown amounts should remain missing so they are not treated as real zero-value sales. This keeps usable data and avoids distorting the result.

Changing the region labels is not the first priority because they can already be used for grouping.

  • Filling missing amounts with zero falsely lowers averages because zero was not observed in the data.
  • Dropping currency-formatted rows removes valid sale amounts that can be cleaned.
  • Encoding regions changes category labels but does not fix the non-numeric amount column.
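
One way to carry out this cleaning step with only the standard library is sketched below. The helper name `parse_amount` is hypothetical, and treating `N/A` and the empty string as the only missing-value markers is an assumption based on the visible rows.

```python
import csv
import io
from collections import defaultdict

raw = """sale_id,region,amount
1,North,$25.00
2,North,18.50
3,South,N/A
4,South,$31.00
5,East,
"""

def parse_amount(text):
    """Return the amount as a float, or None when the value is unknown."""
    cleaned = text.strip().replace("$", "")
    if cleaned in ("", "N/A"):
        return None
    return float(cleaned)

amounts_by_region = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    amount = parse_amount(row["amount"])
    if amount is not None:  # skip missing amounts instead of counting them as 0
        amounts_by_region[row["region"]].append(amount)

averages = {
    region: sum(values) / len(values)
    for region, values in amounts_by_region.items()
}
print(averages)  # {'North': 21.75, 'South': 31.0}
```

East has no known amount, so it gets no average at all rather than a misleading 0.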

Question 4

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner AI project uses a small quiz dataset represented with pandas. The student runs this code:

import pandas as pd

records = [
    {"name": "Ava", "city": "Lima", "score": 82, "passed": True},
    {"name": "Ben", "city": "Rome", "score": 64, "passed": False},
    {"name": "Chen", "city": "Oslo", "score": 78, "passed": True},
    {"name": "Dia", "city": "Lima", "score": 70, "passed": False}
]

df = pd.DataFrame(records)
result = df.loc[df["passed"], ["city", "score"]]

Which interpretation of result is supported by the code?

Options:

  • A. It contains the city and score for Lima records only.

  • B. It contains the city and score for Ava and Chen.

  • C. It contains all columns for Ava and Chen.

  • D. It contains only the rows with scores above 80.

Best answer: B

Explanation: The core idea is pandas row filtering plus column selection. pd.DataFrame(records) turns the list of dictionaries into a table with one row per dictionary and columns from the dictionary keys. In df.loc[df["passed"], ["city", "score"]], the first part selects rows where the passed column is True, and the second part selects only the named columns. Ava and Chen have passed set to True, so only their rows remain, and only city and score are shown. The selection is not based on city or score unless those columns are used in the filter condition.

  • All columns is incorrect because the column list explicitly limits the result to city and score.
  • Lima records is incorrect because the row filter uses passed, not city.
  • Scores above 80 is incorrect because Chen is included even though Chen’s score is 78.
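
The two-part selection can be mirrored in plain Python to make the logic visible. This sketch is only an analogy for what `df.loc` does with a Boolean row filter and a column list; it is not how pandas implements it, and `result_rows` is a name introduced here.

```python
records = [
    {"name": "Ava", "city": "Lima", "score": 82, "passed": True},
    {"name": "Ben", "city": "Rome", "score": 64, "passed": False},
    {"name": "Chen", "city": "Oslo", "score": 78, "passed": True},
    {"name": "Dia", "city": "Lima", "score": 70, "passed": False},
]

result_rows = [
    {"city": r["city"], "score": r["score"]}  # column selection: keep two keys
    for r in records
    if r["passed"]                            # row filter: keep passed == True
]
print(result_rows)
# [{'city': 'Lima', 'score': 82}, {'city': 'Oslo', 'score': 78}]
```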

Question 5

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner updates a Matplotlib chart before sharing it in a project report:

plt.plot(days, model_a, label="Model A")
plt.plot(days, model_b, label="Model B")
plt.title("Daily Prediction Accuracy")
plt.xlabel("Day")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)

Which result best describes what these additions do for the chart?

Options:

  • A. They make the chart easier to interpret without changing the data.

  • B. They remove outliers from the plotted data.

  • C. They sort the data points by accuracy.

  • D. They train the models to become more accurate.

Best answer: A

Explanation: Chart titles, axis labels, legends, and simple styling help a reader understand what a visualization shows. In this snippet, the title states the chart topic, the x-axis label explains the horizontal values, the y-axis label names the measured quantity, and the legend identifies which line belongs to each model. The grid is a simple styling choice that can make values easier to compare visually. These changes affect presentation and interpretation, not the underlying data or model behavior.

  • Model training is not affected because Matplotlib display commands do not change learning or prediction logic.
  • Outlier removal is not happening because no data filtering or cleaning step is shown.
  • Sorting data is not happening because the snippet plots the existing days, model_a, and model_b sequences as provided.

Question 6

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner AI project needs to predict when support tickets are likely to spike during the day. A teammate creates this visualization from historical counts:

import matplotlib.pyplot as plt

hours = [0, 4, 8, 12, 16, 20]
ticket_counts = [2, 3, 4, 8, 15, 28]

plt.pie(ticket_counts, labels=hours)
plt.title("Tickets by hour")
plt.show()

Which conclusion best describes the result of this visualization choice?

Options:

  • A. The counts must be normalized before any chart is useful.

  • B. The pie chart obscures the time-based increase in tickets.

  • C. The pie chart clearly shows the strongest time trend.

  • D. The data is unsuitable because it has too few categories.

Best answer: B

Explanation: The core issue is chart choice for the pattern needed by the AI task. The project needs to notice how ticket volume changes over time, but a pie chart emphasizes each count as a share of the total. That makes it harder to see the ordered rise from hour 0 to hour 20. A line chart or simple bar chart with hours on the x-axis would better show the spike pattern that could become a useful feature or decision signal. The chart is not wrong because the data is small; it is misleading because it hides the important time sequence.

  • Trend claim fails because a pie chart does not make the hour-to-hour increase easy to compare.
  • Normalization requirement fails because raw counts can still be plotted meaningfully for a basic trend view.
  • Too few categories fails because six time points are enough for a simple line or bar chart.
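
The difference between the two views can be seen in the numbers themselves. The sketch below reuses `hours` and `ticket_counts` from the question and computes what each chart type emphasizes; `shares` and `increases` are names introduced here.

```python
hours = [0, 4, 8, 12, 16, 20]
ticket_counts = [2, 3, 4, 8, 15, 28]

# What a pie chart encodes: each count as a percentage of the total.
total = sum(ticket_counts)
shares = [round(count / total * 100, 1) for count in ticket_counts]

# What a line or bar chart over hours makes visible: the change between
# consecutive time points.
increases = [later - earlier for earlier, later in zip(ticket_counts, ticket_counts[1:])]

for hour, share in zip(hours, shares):
    print(f"hour {hour:>2}: {share}% of all tickets")
print("change between time points:", increases)  # [1, 1, 4, 7, 13]
```

The growing values in `increases` are the spike pattern the project cares about, and they only read naturally when the hours stay in order along an axis.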

Question 7

Topic: Block 3: Data Handling, Analysis, and Visualization

A data analyst stores simple review records as nested Python lists and dictionaries before training a small classifier.

Exhibit:

reviews = [
    {"id": 1, "labels": ["positive"], "scores": {"quality": 0.70}},
    {"id": 2, "labels": ["negative"], "scores": {"quality": 0.40}}
]

reviews[1]["labels"].append("needs_check")
reviews[0]["scores"]["quality"] = 0.85

result = (reviews[0]["scores"]["quality"], reviews[1]["labels"][1])

Which value does result contain after the code runs? Select ONE.

Options:

  • A. (0.85, "negative")

  • B. (0.40, "needs_check")

  • C. (0.85, "needs_check")

  • D. (0.70, "needs_check")

Best answer: C

Explanation: Nested structures are accessed one level at a time. Here, reviews[0] selects the first dictionary, "scores" selects its nested dictionary, and "quality" selects the value that is updated from 0.70 to 0.85. For the second record, reviews[1]["labels"] selects the list ['negative'], and append("needs_check") adds a new item to the end of that list. After the append, index 0 is "negative" and index 1 is "needs_check".

The tuple therefore combines the modified first quality score with the newly appended second label.

  • Old score fails because the first record’s quality value is reassigned before result is created.
  • Wrong label index fails because reviews[1]["labels"][1] selects the appended item, not the original label.
  • Second record score fails because the score lookup uses reviews[0], not reviews[1].

Question 8

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner is preparing a small dataset for a simple distance-based model. The model will compare a new customer to past rows using Euclidean distance on age and income.

rows = [
    {"age": 18, "income": 24000, "bought": 0},
    {"age": 52, "income": 88000, "bought": 1},
    {"age": 25, "income": 31000, "bought": 0}
]
new_customer = {"age": 19, "income": 26000}

Which feature-preparation step would make the dataset more suitable for this model? Select ONE.

Options:

  • A. Scale age and income to comparable ranges

  • B. Change bought values from 0 and 1 to text labels

  • C. Remove age because it has smaller numbers than income

  • D. Sort the rows by income before comparing distances

Best answer: A

Explanation: For distance-based logic, numeric features should usually be on comparable scales. In this dataset, income ranges in the tens of thousands, while age ranges in tens. A Euclidean distance calculation would mostly reflect income differences unless the features are scaled or normalized first. Scaling does not change the meaning of the records; it prepares the feature values so the simple model can compare rows more fairly.

Sorting rows does not change the distances, and removing a feature just because its raw numbers are smaller throws away potentially useful information.

  • Sorting rows does not change Euclidean distance values, so it does not fix the scale problem.
  • Changing the label text affects the target format, not the numeric input features used for distance.
  • Removing age is too extreme because scaling can preserve the feature while reducing scale imbalance.
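
The scale problem can be made concrete with a short calculation. The explanation does not name a specific scaling method, so min-max scaling is used here as an assumption; the helper names `distance` and `scale` are hypothetical.

```python
import math

rows = [
    {"age": 18, "income": 24000, "bought": 0},
    {"age": 52, "income": 88000, "bought": 1},
    {"age": 25, "income": 31000, "bought": 0},
]
new_customer = {"age": 19, "income": 26000}
features = ["age", "income"]

def distance(a, b):
    """Euclidean distance over the chosen features."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

# Raw distances: the income gap dominates and age barely registers.
raw_distances = [distance(row, new_customer) for row in rows]

# Min-max scaling (an assumed choice): map each feature to the 0..1 range
# using the minimum and maximum seen in the past rows.
bounds = {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in features}

def scale(record):
    return {
        f: (record[f] - bounds[f][0]) / (bounds[f][1] - bounds[f][0])
        for f in features
    }

scaled_distances = [distance(scale(row), scale(new_customer)) for row in rows]

print(raw_distances)     # each value is almost exactly the income difference
print(scaled_distances)  # age and income now contribute on similar terms
```

For the first row, the raw distance is within a fraction of a unit of the 2000 income gap, which shows how little the one-year age difference counts before scaling.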

Question 9

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner data analyst wants to calculate the average delivery_minutes for each city. Select ONE cleaning step that is best supported by the visible records.

Exhibit: Sample records

order_id,city,delivery_minutes
101,Lagos ,42
102,lagos,38
103,LAGOS,41
104,Accra,39

Options:

  • A. Strip spaces and standardize city case

  • B. Shuffle the rows before calculating averages

  • C. Remove the city column before analysis

  • D. Convert order_id values to floats

Best answer: A

Explanation: The core cleaning issue is inconsistent categorical text. The records show "Lagos " (with a trailing space), "lagos", and "LAGOS", which likely refer to the same city but would be treated as three separate groups by most analysis code. Trimming extra spaces and applying a consistent case, such as lowercase or title case, prepares the data for a correct grouped average by city.

Removing the city field would destroy the grouping variable. Changing the ID type or row order does not address the visible problem affecting the requested analysis. The key takeaway is to clean the field that directly affects the intended calculation.

  • Dropping the category fails because the analysis needs city to group delivery times.
  • Changing IDs fails because order_id is not used to compute city averages.
  • Shuffling rows fails because row order does not fix inconsistent category labels.
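
A minimal sketch of this cleaning step, using only the standard library and the rows from the exhibit, might look like the following. Title case is chosen here as one of the consistent-case options the explanation mentions.

```python
import csv
import io
from collections import defaultdict

raw = """order_id,city,delivery_minutes
101,Lagos ,42
102,lagos,38
103,LAGOS,41
104,Accra,39
"""

minutes_by_city = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    city = row["city"].strip().title()  # "Lagos ", "lagos", "LAGOS" -> "Lagos"
    minutes_by_city[city].append(int(row["delivery_minutes"]))

averages = {
    city: sum(values) / len(values)
    for city, values in minutes_by_city.items()
}
print(averages)  # two groups: Lagos (about 40.33) and Accra (39.0)
```

Without the `strip().title()` call, the three Lagos spellings would form three separate groups with one order each.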

Question 10

Topic: Block 3: Data Handling, Analysis, and Visualization

A beginner AI project will store a small training dataset in plain Python, without pandas. Each row represents one plant sample with the same features: height_cm, leaf_count, and species. The team wants to keep each sample’s features together, make feature names readable, and easily append new samples. Which representation is the best choice?

Options:

  • A. Use a list of dictionaries, one dictionary per sample.

  • B. Use one flat list containing all values from all samples.

  • C. Use one dictionary for only the most recent sample.

  • D. Use one dictionary with feature names mapped to separate lists.

Best answer: A

Explanation: For a small record-based dataset in plain Python, a list of dictionaries is usually the clearest representation. The outer list represents the dataset, and each dictionary represents one record with named features such as height_cm, leaf_count, and species. This matches the team’s need to keep each sample’s values together while still making the feature names readable. Adding a new sample is also simple because another dictionary can be appended to the list. A dictionary of lists can work for column-style data, but it separates the values for one sample across multiple lists and is easier to misalign in beginner code.

  • Column lists can store features, but values for one sample are split across lists and must stay perfectly aligned.
  • Flat values lose the relationship between each sample and its feature names.
  • Single record cannot represent the full training dataset because it stores only one sample.
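
The trade-off between the two workable layouts can be sketched as follows. All sample values and species names are hypothetical and invented for illustration.

```python
# List of dictionaries: one dictionary per sample (values are invented).
samples = [
    {"height_cm": 12.5, "leaf_count": 8, "species": "fern"},
    {"height_cm": 30.0, "leaf_count": 14, "species": "basil"},
]

# Adding a sample is a single append, and its features stay together.
samples.append({"height_cm": 18.2, "leaf_count": 10, "species": "mint"})

# A single feature can still be pulled out as a column when needed.
heights = [sample["height_cm"] for sample in samples]

# Dictionary of lists: adding one sample takes one append per feature.
columns = {
    "height_cm": [12.5, 30.0],
    "leaf_count": [8, 14],
    "species": ["fern", "basil"],
}
columns["height_cm"].append(18.2)
columns["leaf_count"].append(10)
# The species append is left out on purpose to show how columns drift apart.

lengths = {name: len(values) for name, values in columns.items()}
print(lengths)  # {'height_cm': 3, 'leaf_count': 3, 'species': 2}
```

The mismatched lengths in the second layout are the misalignment risk the explanation describes; the list of dictionaries avoids it because each sample is added as one unit.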

Continue with full practice

Use the Python Institute PCEI Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.


Free review resource

Read the Python Institute PCEI Cheat Sheet for compact concept review before returning to timed practice.

Revised on Monday, May 25, 2026