Free Databricks Generative AI Engineer Associate Practice Questions: Data Preparation
Practice 10 free Databricks Certified Generative AI Engineer Associate questions on Data Preparation, with answers, explanations, and the IT Mastery next step.
Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.
Topic snapshot
| Field | Detail |
|---|---|
| Practice target | Databricks Generative AI Engineer Associate |
| Topic area | Data Preparation |
| Blueprint weight | 14% |
| Page purpose | Focused sample questions before returning to mixed practice |
How to use this topic drill
Use this page to isolate Data Preparation for Databricks Generative AI Engineer Associate. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 14% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
Sample questions
These are original IT Mastery practice questions aligned to this topic area. They are not official Databricks questions, copied live-exam content, or exam dumps. Use them to preview question style and explanation depth before continuing with topic drills, mixed sets, and timed mocks in IT Mastery.
Question 1
Topic: Data Preparation
A team has finished extracting and chunking internal policy documents for a RAG application. They run the following step in a Databricks notebook:
chunk_rows = [
    {"doc_id": "policy-17", "chunk_id": "policy-17-03", "text": "...", "source_uri": "s3://docs/policy-17.pdf"}
]
(spark.createDataFrame(chunk_rows)
    .write.mode("append")
    .saveAsTable("main.support.policy_chunks"))
No Vector Search endpoint or index is shown. What does this step accomplish, and what is still needed for semantic retrieval?
Options:
A. It queries an existing Vector Search index and returns nearest matching chunks.
B. It embeds user questions and stores query history for later evaluation.
C. It creates a Vector Search index that can be queried directly by the retriever.
D. It stores prepared chunks in a Unity Catalog Delta table; create a Vector Search index before semantic queries.
Best answer: D
Explanation: Writing chunked text to a Unity Catalog Delta table is a data-preparation step. It creates durable, governed storage for the prepared chunks and their metadata, such as document IDs and source URIs. A Vector Search index is a separate retrieval structure built from a source table or embedding data so an application can perform semantic similarity search. The retriever queries the Vector Search index, not the plain Delta write operation. The key distinction is storage first, retrieval index second.
- Index creation claim fails because saveAsTable writes a Delta table; it does not create a Vector Search endpoint or index.
- Index query claim fails because the snippet contains no retriever call or similarity search operation.
- Question embedding claim fails because the artifact processes prepared document chunks, not runtime user questions or evaluation logs.
Question 2
Topic: Data Preparation
A team builds a Databricks RAG app over maintenance manuals. Users report that answers omit values from calibration tables. The retriever is consistently finding the expected manual page.
Artifact: retrieval and context sample
Query: What are the pressure limits for Pump A calibration?
Vector Search result: safety_manual.pdf, page 42, score 0.84
Extracted context sent to LLM:
"Pump A calibrat on limits: Temp C | Pres... | Pass/Fail
20 | ??? | OK
40 | | OK
See rotated table header on page image."
Test note: same corruption appears with two LLM endpoints and two embedding models.
Which action best addresses the likely cause?
Options:
A. Improve OCR/table extraction and reindex validated chunks.
B. Switch to a larger LLM context window.
C. Add a stricter citation prompt.
D. Change the embedding model dimension.
Best answer: A
Explanation: This is a data preparation problem in the extraction stage. Vector Search is already returning the expected source page, and the corruption appears before the LLM answers: missing table cells, broken words, and a note about a rotated table header. Because the same issue appears across different LLM and embedding choices, model selection is unlikely to be the root cause. The better fix is to use extraction tooling that handles scanned PDFs, OCR, rotation, and tables, then validate the extracted text before chunking and rebuilding the Vector Search index.
The key takeaway: if retrieved context is missing or malformed, inspect extraction output before changing models or prompts.
- Larger context fails because the context window can only include text that was extracted correctly.
- Embedding dimension fails because retrieval is already locating the expected page.
- Citation prompt fails because prompting cannot recover table values missing from the indexed context.
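The "validate the extracted text" step can be illustrated with a small heuristic check. This is only a sketch: the function name `extraction_issues` and its three rules are assumptions made for illustration, not part of any Databricks API, and a real pipeline would likely tune such rules to its own documents.

```python
import re

def extraction_issues(text: str) -> list[str]:
    """Return heuristic warnings for one extracted page or chunk."""
    issues = []
    # Runs of '?' often mark characters the extractor could not recognize.
    if re.search(r"\?{2,}", text):
        issues.append("unrecognized characters")
    # A trailing ellipsis inside extracted text often signals a truncated cell.
    if "..." in text:
        issues.append("truncated text")
    # A pipe-delimited row with an empty cell, e.g. "40 | | OK".
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) >= 3 and any(cell == "" for cell in cells):
            issues.append("empty table cell")
            break
    return issues

# Sample text taken from the artifact in this question.
sample = (
    "Pump A calibrat on limits: Temp C | Pres... | Pass/Fail\n"
    "20 | ??? | OK\n"
    "40 | | OK\n"
)
print(extraction_issues(sample))
```

Chunks that return any warning could be held back from indexing until the extraction step is repaired.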
Question 3
Topic: Data Preparation
An engineer is tuning a Databricks RAG app that retrieves from a Mosaic AI Vector Search index. The retriever applies product = <selected product> as a metadata filter. Retrieval evaluation uses labeled relevant chunks.
| Evaluation signal | Result |
|---|---|
| No metadata filter | recall@5 = 0.83 |
| Current product filter | recall@5 = 0.41 |
| Missed relevant chunks | 78% have product null or incomplete |
| Chunk audit | Answer spans fit within one chunk |
Which configuration change should the engineer make first?
Options:
A. Backfill chunk-level product tags and re-index with corrected filters
B. Rebuild the corpus with smaller chunks and more overlap
C. Switch the answer generation endpoint to a larger LLM
D. Increase top_k and add a re-ranking step
Best answer: A
Explanation: The retrieval-evaluation results point to a metadata-filter problem, not a chunking or generation problem. Recall is much higher when the filter is removed, and most missed relevant chunks have null or incomplete product metadata. The chunk audit also says the answer spans already fit within one chunk, so chunk boundaries are not the main cause. The right first fix is to backfill reliable chunk-level product metadata, re-index if needed, and apply filters against corrected metadata so relevant chunks remain eligible for vector ranking.
Increasing top_k or adding a reranker cannot recover chunks that were filtered out before ranking. A larger LLM may improve response wording, but it does not fix retrieval recall.
- Smaller chunks target split answer spans, but the audit says answer spans already fit within one chunk.
- More top results cannot recover chunks excluded by the product metadata filter before ranking.
- Larger LLM affects synthesis quality, not whether Vector Search retrieves the relevant chunks.
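A toy example may make the filter-before-rank behavior concrete. The `retrieve` function below is a hypothetical stand-in for a filtered similarity search, and the chunk IDs, product values, and scores are invented for illustration.

```python
def retrieve(chunks, query_product, top_k):
    """Toy retriever: apply the metadata filter first, then rank by score."""
    eligible = [c for c in chunks if c["product"] == query_product]
    ranked = sorted(eligible, key=lambda c: c["score"], reverse=True)
    return [c["chunk_id"] for c in ranked[:top_k]]

# Hypothetical chunks with precomputed similarity scores for one query.
chunks = [
    {"chunk_id": "c1", "product": "pump-a", "score": 0.71},
    {"chunk_id": "c2", "product": None, "score": 0.93},  # relevant but untagged
    {"chunk_id": "c3", "product": "pump-a", "score": 0.64},
]

# Even a very large top_k cannot return c2: the filter removed it before ranking.
before = retrieve(chunks, "pump-a", top_k=50)

# Backfilling the product tag makes the chunk eligible again.
chunks[1]["product"] = "pump-a"
after = retrieve(chunks, "pump-a", top_k=2)

print(before)
print(after)
```

The untagged chunk has the highest similarity score, yet it only appears once its metadata is corrected.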
Question 4
Topic: Data Preparation
A team is preparing product-support PDFs for a Databricks RAG application. The documents have already been extracted, cleaned, and split into chunks. The next pipeline step must create a governed, reusable source of prepared chunks in Unity Catalog with doc_id, chunk_id, chunk_text, and metadata. The embedding model and retrieval endpoint have not been chosen yet. What is the best engineering decision?
Options:
A. Query Vector Search to validate chunks before storing them.
B. Write the chunks as rows in a Unity Catalog Delta table.
C. Create a Vector Search index directly from the PDFs.
D. Store chunk JSON files for the retriever to read directly.
Best answer: B
Explanation: Prepared chunks should first be stored as structured records in a Delta Lake table governed by Unity Catalog. That table becomes the durable, auditable source for chunk text, identifiers, and metadata. A Vector Search index is created later from the prepared Delta table when the application is ready for semantic retrieval and the embedding/index choices are known. Querying Vector Search is a serving-time retrieval operation, not the step that preserves prepared data. The key distinction is source-of-truth storage in Delta Lake versus retrieval acceleration through a Vector Search index.
- Indexing raw PDFs skips the prepared chunk table and loses the required governed chunk-level structure.
- Querying Vector Search is for retrieval after an index exists, not for creating the reusable prepared dataset.
- Reading JSON files directly avoids Unity Catalog governance and does not provide a Delta source for downstream indexing.
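As a rough sketch of the storage step, prepared chunks can be shaped into rows with the four required columns before they are written. The helper `build_chunk_rows`, the chunk ID format, and the metadata keys are assumptions for illustration; the write itself is shown only as a comment because it needs a Databricks notebook session.

```python
def build_chunk_rows(doc_id, source_uri, chunks):
    """Shape prepared chunk texts into rows for a chunk table."""
    rows = []
    for position, chunk_text in enumerate(chunks, start=1):
        rows.append({
            "doc_id": doc_id,
            # Deterministic ID so re-running the pipeline yields the same keys.
            "chunk_id": f"{doc_id}-{position:02d}",
            "chunk_text": chunk_text,
            # String values keep the metadata column a simple string map.
            "metadata": {"source_uri": source_uri, "position": str(position)},
        })
    return rows

rows = build_chunk_rows(
    "policy-17",
    "s3://docs/policy-17.pdf",
    ["First prepared chunk.", "Second prepared chunk."],
)
print(rows[1]["chunk_id"])

# In a Databricks notebook, the rows could then be appended to a governed table,
# following the pattern from Question 1 (table name is a placeholder):
# spark.createDataFrame(rows).write.mode("append").saveAsTable("main.support.policy_chunks")
```

Because no embedding model is involved, this table can be created before any index or endpoint decision is made.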
Question 5
Topic: Data Preparation
A team is tuning a Databricks RAG application that uses Mosaic AI Vector Search. The generator receives only the first 3 retrieved chunks.
Exhibit: Offline retrieval evaluation
| Signal | Result |
|---|---|
| Retriever setting | initial top_k = 8 |
| Recall@8 | 0.91 |
| nDCG@3 | 0.42 |
| Trace pattern | exact policy clause often at ranks 5-8 |
| Top 3 pattern | broad FAQ chunks, partially related |
Which change best addresses the issue shown in the artifact?
Options:
A. Mask all retrieved text before sending it to the model
B. Rebuild the index with no chunk overlap
C. Pass all 8 retrieved chunks directly to the generator
D. Re-rank the top 8 candidates and pass the best 3 chunks
Best answer: D
Explanation: Re-ranking is useful when initial retrieval finds the right candidate context but does not place it high enough for the generator. Vector Search can quickly retrieve a candidate set using embeddings, then a re-ranker can rescore those query-chunk pairs with a stronger relevance model and reorder them. In this artifact, Recall@8 is high and traces show the exact policy clause appears in ranks 5-8, while only the top 3 chunks are sent to the model. That is an ordering problem, not primarily a source-ingestion or embedding-coverage problem. Passing more chunks can add noise and still does not prioritize the best evidence.
- More chunks may increase prompt noise and does not fix the poor ordering in the top positions.
- No overlap rebuild is premature because the relevant clause is already present in the retrieved candidate set.
- Masking text addresses governance or sensitive-data handling, not retrieval rank quality.
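The retrieve-then-re-rank pattern can be sketched in a few lines. The token-overlap scorer here is a deliberately simple stand-in, not a real re-ranker; in practice `score_fn` would be a stronger relevance model. All names and sample texts are hypothetical.

```python
def overlap_score(query, text):
    """Stand-in relevance score: share of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0

def rerank(query, candidates, keep=3, score_fn=overlap_score):
    """Rescore each query-chunk pair and keep the best `keep` chunks."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return ranked[:keep]

query = "refund window for annual plans"
# Candidates in the order returned by the initial vector retrieval.
candidates = [
    {"id": "k1", "text": "general faq about billing and plans"},
    {"id": "k2", "text": "faq how to contact support"},
    {"id": "k3", "text": "faq for updating payment details"},
    {"id": "k4", "text": "shipping times for hardware"},
    {"id": "k5", "text": "annual plans have a 30 day refund window"},
]

top = rerank(query, candidates, keep=3)
print([c["id"] for c in top])
```

The exact clause starts at rank 5 but moves to the front after rescoring, so it reaches the generator without passing every candidate.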
Question 6
Topic: Data Preparation
A team is evaluating a Databricks RAG application. Mosaic AI Vector Search returns 20 candidate chunks per query, but the prompt includes only the top 5. In failed traces, the gold supporting chunk is usually present in the 20 candidates but is often ranked below keyword-similar chunks. What configuration change best addresses the issue?
Options:
A. Use a larger chat model in the serving endpoint
B. Enable MLflow tracing for all production requests
C. Add re-ranking after retrieval and before prompt assembly
D. Increase the chunk size for every indexed document
Best answer: C
Explanation: Re-ranking fits when initial retrieval has acceptable recall but poor ordering. Here, Vector Search is already finding the gold supporting chunk within the top 20, so the main problem is that less relevant chunks are ranked above it before the top 5 are sent to the LLM. A re-ranker scores the candidate chunks against the specific query and reorders them before context assembly. This improves the chance that the most useful chunks enter the prompt without rebuilding the index or changing the generation model. If the gold chunk were missing from the candidate set, the team would instead investigate chunking, embeddings, filters, or retrieval parameters.
- Chunk size may affect recall or context coherence, but the evidence shows the needed chunk is already retrieved.
- Larger chat model does not fix poor ordering of retrieved context before generation.
- MLflow tracing helps diagnose failures, but logging alone does not reorder candidates.
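The diagnosis in the explanation can be expressed as a small triage rule applied to each failed trace. The function name `triage` and its labels are hypothetical; the logic only separates "gold chunk not retrieved" from "gold chunk retrieved but ranked too low".

```python
def triage(candidate_ids, gold_id, prompt_k):
    """Classify one trace by where the gold chunk sits in the candidate list."""
    if gold_id not in candidate_ids:
        # Recall problem: look at chunking, embeddings, filters, or retrieval parameters.
        return "missing"
    rank = candidate_ids.index(gold_id) + 1
    if rank > prompt_k:
        # Ordering problem: the chunk was retrieved but cut before prompt assembly.
        return "ordering"
    return "ok"

# Hypothetical trace: 20 candidates, gold chunk at rank 12, prompt takes the top 5.
candidate_ids = [f"chunk-{i}" for i in range(1, 21)]
print(triage(candidate_ids, "chunk-12", prompt_k=5))
print(triage(candidate_ids, "chunk-99", prompt_k=5))
```

A large share of "ordering" results across failed traces would support adding a re-ranker, while mostly "missing" results would point back to earlier pipeline stages.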
Question 7
Topic: Data Preparation
A team is preparing 200-page product manuals for a Databricks RAG application using Vector Search. The manuals contain nested headings, wide tables, appendices, and a repeated legal footer on every page. Users ask section-specific questions and expect table values to be cited. Which chunking configuration is best before embedding?
Options:
A. Create heading-based chunks, keep tables with captions, tag appendices, and deduplicate boilerplate.
B. Create one chunk per manual, including all tables, appendices, and boilerplate.
C. Create table-only chunks and discard surrounding headings and appendix labels.
D. Create fixed token chunks with overlap, ignoring headings and page artifacts.
Best answer: A
Explanation: Long-form documents should be chunked using their visible structure before creating embeddings for Vector Search. Headings define topical boundaries, so chunks should align to sections or subsections instead of arbitrary page or token splits. Tables should stay intact when possible, or be serialized with nearby captions and heading metadata so retrieved values retain meaning. Appendices often have a different purpose from the main body, so tagging or separating them improves filtering and citation. Repeated boilerplate such as footers should be removed or deduplicated because it can create many near-identical chunks that crowd out useful results. Token limits still matter, but they should be applied within a structure-aware strategy.
- Fixed token chunking is simple, but it can split sections and tables while leaving repeated page artifacts in the index.
- Whole-manual chunks are too broad for precise retrieval and may exceed practical embedding or context-window constraints.
- Table-only chunks preserve values but lose the headings and labels needed to answer section-specific questions.
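A minimal sketch of structure-aware chunking follows. It assumes the extraction step already produced lines in which headings are prefixed with `#` and that the repeated footer text is known in advance; both are assumptions, as are the function name and sample content. Token limits are not enforced here and would still need to be applied within each section.

```python
import re

HEADING = re.compile(r"^#+\s+(.*)$")  # assumes '#'-prefixed headings in extracted text

def chunk_by_heading(lines, boilerplate):
    """Group lines under their nearest heading and drop known boilerplate."""
    sections = []  # each item is [heading, body_lines]
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped in boilerplate:
            continue  # skip blank lines and repeated footers
        match = HEADING.match(stripped)
        if match:
            sections.append([match.group(1), []])
        elif sections:
            sections[-1][1].append(stripped)
        else:
            sections.append(["Preamble", [stripped]])
    return [
        {
            "heading": heading,
            "is_appendix": heading.lower().startswith("appendix"),
            "text": "\n".join(body),
        }
        for heading, body in sections
        if body
    ]

lines = [
    "# Installation",
    "Mount the pump on a level surface.",
    "Confidential - Example Corp",
    "## Calibration limits",
    "Table 3: Pump A limits",
    "Temp C | Pressure | Result",
    "20 | 150 | OK",
    "Confidential - Example Corp",
    "# Appendix A",
    "Glossary of terms.",
]
chunks = chunk_by_heading(lines, boilerplate={"Confidential - Example Corp"})
for chunk in chunks:
    print(chunk["heading"], chunk["is_appendix"])
```

Because splits happen only at headings, the table rows stay in the same chunk as their caption, and the appendix flag can later serve as a metadata filter.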
Question 8
Topic: Data Preparation
A support team runs a Databricks RAG app that answers warranty questions from governed product manuals in Unity Catalog. It uses Mosaic AI Vector Search and a fixed LLM that performs well when given the right context. MLflow evaluation on 200 labeled questions shows:
| Evaluation signal | Result |
|---|---|
| Gold-labeled passage appears in top 5 | 91% of questions |
| Retrieved top-5 chunks that are relevant | 24% |
| Answer score with gold context | 88% |
| Main user complaint | Irrelevant citations |
The team can make one retrieval-stage change before release. Which engineering decision is best?
Options:
A. Increase the retriever top-k value
B. Replace the LLM with a larger model
C. Ingest more manuals into the index
D. Add a re-ranker for retrieved chunks
Best answer: D
Explanation: The evidence points to a retrieval precision problem, not a recall or answer-generation problem. The gold-labeled passage appears in the top 5 for most questions, so retrieval recall is already strong. However, only 24% of the returned chunks are relevant, which means the generator receives too much distracting context and users see irrelevant citations. Because the LLM scores well when supplied the gold context, replacing the model would not address the main bottleneck. A re-ranker is the best targeted retrieval-stage change because it can reorder the initially retrieved chunks so the most relevant context is passed to the LLM.
- Increasing top-k may improve recall, but recall is already high and more chunks would likely add noise.
- Replacing the LLM targets answer-generation quality, but the gold-context score shows generation is not the main issue.
- Ingesting more manuals targets source coverage, but the evaluation shows the needed passage is usually already retrieved.
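The two retrieval signals in the table measure different things, which a small calculation can show. The functions and sample data below are illustrative assumptions, not the MLflow evaluation API.

```python
def hit_rate_at_k(results, k):
    """Share of questions with at least one relevant chunk in the top k."""
    hits = sum(
        1 for r in results
        if any(chunk in r["relevant"] for chunk in r["retrieved"][:k])
    )
    return hits / len(results)

def precision_at_k(results, k):
    """Share of all returned top-k chunks that are labeled relevant."""
    returned = sum(len(r["retrieved"][:k]) for r in results)
    relevant = sum(
        1 for r in results for chunk in r["retrieved"][:k] if chunk in r["relevant"]
    )
    return relevant / returned

# Two hypothetical questions: each finds its gold chunk, surrounded by noise.
results = [
    {"retrieved": ["a", "b", "c", "d", "e"], "relevant": {"c"}},
    {"retrieved": ["f", "g", "h", "i", "j"], "relevant": {"f"}},
]
print(hit_rate_at_k(results, 5))
print(precision_at_k(results, 5))
```

Every question finds its gold chunk, yet only 2 of the 10 returned chunks are relevant, which mirrors the high-recall, low-precision pattern in this question.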
Question 9
Topic: Data Preparation
A support organization is building a Databricks RAG app for agents. The app must answer refund, cancellation, and escalation questions only from current, official 2026 policy documents; if a topic lacks approved policy coverage, the app must not answer. The current source table contains mixed artifacts: approved policies, older runbooks, and employee-maintained FAQs. Which pipeline step should be added before chunking and creating the Mosaic AI Vector Search index?
Options:
A. Create the index over all artifacts and rely on similarity scores.
B. Add a prompt rule to cite current policy documents.
C. Keep only documents with the latest review date.
D. Run a source-readiness gate for approved status, recency, and required-topic coverage.
Best answer: D
Explanation: For a RAG application, source quality must be assessed before chunking and indexing. Vector Search can retrieve relevant passages, but it cannot prove that the underlying corpus is authoritative, current, or complete for the business goal. A source-readiness gate should compare the required topics against document metadata such as approval status, owner, version or review date, and policy scope. Stale drafts, unofficial FAQs, and gaps should be excluded or routed to the responsible owner or SME before indexing. This keeps the retrieval layer grounded in trusted knowledge instead of trying to compensate later with ranking or prompting.
- Indexing everything makes stale or unofficial artifacts retrievable and does not prove approved coverage.
- Recency-only filtering can keep unofficial content and still miss required topics.
- A citation prompt affects generation behavior, not whether the source corpus is fit for use.
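A source-readiness gate of the kind described could look roughly like this. The field names (`status`, `reviewed`, `topics`), the cutoff date, and the sample documents are all assumptions for illustration.

```python
from datetime import date

def readiness_gate(docs, required_topics, min_review_date):
    """Keep approved, current documents and report required topics left uncovered."""
    approved = [
        d for d in docs
        if d["status"] == "approved" and d["reviewed"] >= min_review_date
    ]
    covered = {topic for d in approved for topic in d["topics"]}
    missing = sorted(set(required_topics) - covered)
    return approved, missing

docs = [
    {"name": "refund-policy", "status": "approved",
     "reviewed": date(2026, 1, 15), "topics": ["refund", "cancellation"]},
    {"name": "old-runbook", "status": "archived",
     "reviewed": date(2023, 6, 1), "topics": ["escalation"]},
    {"name": "team-faq", "status": "unofficial",
     "reviewed": date(2026, 2, 1), "topics": ["escalation"]},
]

approved, missing = readiness_gate(
    docs,
    required_topics=["refund", "cancellation", "escalation"],
    min_review_date=date(2026, 1, 1),
)
print([d["name"] for d in approved])
print(missing)
```

The recently reviewed FAQ is excluded because it is not approved, and the gate reports that escalation has no approved coverage, so that gap can be routed to an owner before indexing.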
Question 10
Topic: Data Preparation
A company is building a finance-policy RAG assistant on Databricks. The corpus must use only current, approved documents registered in Unity Catalog and must not rely on informal employee examples.
Evaluation fails on this user question: “If I prepaid a hotel for a conference that the organizer cancelled, can I be reimbursed, and what evidence must I upload?” Current retrieval returns booking rules and expense-form instructions, but no exception eligibility or evidence requirements.
Which source document should be added first?
Options:
A. Historical approved expense reports for cancelled conferences
B. Expense system guide for uploading receipt files
C. Current Finance SOP for cancelled-event reimbursement exceptions
D. Corporate travel policy section on hotel rate caps
Best answer: C
Explanation: For RAG source selection, choose the document that contains the exact missing knowledge needed to answer the target question and satisfies the governance boundary. The failed question needs two policy facts: whether prepaid hotel costs are reimbursable after organizer cancellation, and which evidence must be uploaded. The current Finance SOP for cancelled-event reimbursement exceptions is the narrowest approved source that should contain those rules. Documents that are merely related to travel, form submission, or past outcomes may improve context, but they do not provide authoritative missing policy knowledge. The key is to add authoritative source content before changing retrieval, prompts, or model behavior.
- Historical examples may show past approvals, but they are informal, potentially inconsistent, and disallowed by the governance constraint.
- Upload instructions explain how to attach files, not which evidence is required for this exception.
- Hotel rate caps are current travel policy content, but they do not answer cancellation reimbursement eligibility.
Continue in the web app
Use IT Mastery for interactive Databricks Generative AI Engineer Associate practice with mixed sets, timed mocks, topic drills, explanations, and progress tracking.