PDF Extraction

Convert existing PDF documents into structured questions using AI-powered extraction with a 3-phase pipeline.

Overview

Testify's PDF Extraction feature lets you upload PDF or Word documents containing questions and automatically extract them into structured, editable questions in your question bank. The system uses a three-phase AI pipeline: it first analyzes the document structure, then extracts individual questions, and finally validates the results. The entire process runs in the background with real-time progress tracking.

This feature is ideal for teachers who have existing question papers in PDF format and want to digitize them without manually re-typing every question.

How It Works

Step 1: Upload Your Document

  1. Navigate to PDF Extraction from the question bank or tools menu.
  2. Click Upload and select a file:
    • Supported formats: .pdf, .docx, .doc.
    • Maximum file size: 50 MB.
  3. Optionally set parameters:
    • Custom instructions -- guide the AI on how to interpret the document (e.g., "Questions start with Q. and answers are at the end").
    • Extract solutions -- whether to extract answer explanations (enabled by default).
    • Question types -- specify which question types to look for.
    • Taxonomy defaults -- pre-set the subject, board, and grade for extracted questions.
  4. Click Start Extraction. The system creates a job and begins processing.
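
The format and size limits above can be checked before you upload. The following is a minimal sketch, not Testify's actual validation: the function and constant names are hypothetical, and it assumes 1 MB means 1024 x 1024 bytes.

```python
from pathlib import Path

# Limits taken from the list above; the names are illustrative only.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".doc"}
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 50 MB")
    return problems
```

For example, `check_upload("paper.PDF", 1024)` returns an empty list because the extension check is case-insensitive.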

Step 2: The 3-Phase Pipeline

The extraction runs in three phases:

Phase 1: Document Structure Analysis

  • The system parses the PDF to determine the total page count.
  • AI analyzes the document structure to identify sections, question boundaries, and answer locations.
  • The document layout is mapped for optimal extraction.

Phase 2: Question Extraction

  • AI processes each section of the document, extracting individual questions.
  • For each question, the system identifies:
    • Question text and content blocks.
    • Question type (MCQ, fill-in-the-blank, essay, etc.).
    • Options (for MCQ/matching questions).
    • Correct answers.
    • Solutions and explanations (if extract_solutions is enabled).
    • Marks/points (if specified in the document).

Phase 3: Validation and Formatting

  • Extracted questions are validated for completeness and correctness.
  • Content is formatted into Testify's ContentBlock[] structure.
  • Taxonomy defaults are applied to all questions.
  • The job status is updated to COMPLETE.
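
To make the output of Phases 2 and 3 more concrete, here is a rough sketch of what an extracted question and its validation might look like. The class, field names, and validation rules are all assumptions chosen to mirror the lists above; they are not Testify's real schema. The sketch also assumes that taxonomy defaults do not overwrite values a question already has.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedQuestion:
    # Hypothetical fields mirroring the items identified in Phase 2.
    text: str
    question_type: str  # e.g. "MCQ", "essay"
    options: list[str] = field(default_factory=list)
    correct_answer: Optional[str] = None
    solution: Optional[str] = None
    marks: Optional[float] = None
    taxonomy: dict[str, str] = field(default_factory=dict)


def validate(question: ExtractedQuestion) -> list[str]:
    """Phase 3 style completeness check (illustrative rules only)."""
    issues = []
    if not question.text.strip():
        issues.append("missing question text")
    if question.question_type == "MCQ" and len(question.options) < 2:
        issues.append("MCQ needs at least two options")
    if question.correct_answer is None:
        issues.append("missing correct answer")
    return issues


def apply_taxonomy_defaults(question: ExtractedQuestion,
                            defaults: dict[str, str]) -> ExtractedQuestion:
    """Fill in subject/board/grade, keeping any value already set (assumption)."""
    question.taxonomy = {**defaults, **question.taxonomy}
    return question
```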

Step 3: Review Extracted Questions

  1. Once extraction is complete, review the results on the extraction results page.
  2. Each extracted question shows:
    • The detected question type.
    • Question content and options.
    • Correct answer and solution.
    • Confidence indicator.
  3. Edit any questions that need corrections.
  4. Deselect questions you do not want to import.

Step 4: Save to Question Bank

  1. Review your selections.
  2. Click Save to Question Bank.
  3. Questions are created in your question bank with the specified taxonomy metadata.

Key Features

Image Analysis

The system can also analyze images for diagrams and figures:

  • Upload an image directly for diagram detection.
  • AI identifies distinct diagram regions within the image.
  • Regions are cropped and made available as individual images.
  • If no distinct regions are detected, the full image is preserved.
  • Extracted images can be attached to questions as ContentBlock items of type image.
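
The fallback behavior described above (crop detected regions, otherwise keep the full image) could be expressed roughly as follows. This is only a sketch: the function name and the box format are hypothetical, and clamping boxes to the image bounds is an added assumption rather than documented behavior.

```python
Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def regions_to_crop(image_width: int, image_height: int,
                    detected: list[Box]) -> list[Box]:
    """Return crop boxes clamped to the image; fall back to the full image."""
    full_image = (0, 0, image_width, image_height)
    boxes = []
    for left, top, right, bottom in detected:
        clamped = (max(0, left), max(0, top),
                   min(image_width, right), min(image_height, bottom))
        # Skip boxes that end up empty after clamping.
        if clamped[0] < clamped[2] and clamped[1] < clamped[3]:
            boxes.append(clamped)
    # No usable regions: preserve the whole image as a single box.
    return boxes or [full_image]
```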

Credit Costs

PDF extraction consumes AI credits:

  • Base cost: 5 credits per job.
  • Per page: 0.5 credits per page.
  • Example: a 20-page PDF costs 15 credits (5 base + 20 x 0.5 per page).
  • Super Admin users are exempt from credit charges.
  • If you have insufficient credits, the upload is rejected and you are redirected to the billing page.

Credits are checked before extraction begins and consumed during processing.
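
The pricing rule above is simple enough to estimate yourself before uploading. The sketch below uses the documented rates; the function names are hypothetical, and whether Testify rounds fractional credits is not stated here, so the result is left unrounded.

```python
BASE_COST = 5.0      # credits per job
COST_PER_PAGE = 0.5  # credits per page


def estimate_credits(total_pages: int, is_super_admin: bool = False) -> float:
    """Estimate the credit cost of one extraction job."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if is_super_admin:
        return 0.0  # Super Admin users are exempt from charges
    return BASE_COST + COST_PER_PAGE * total_pages


def can_start(balance: float, total_pages: int, is_super_admin: bool = False) -> bool:
    """Mirror the pre-extraction check: is the balance enough for the estimate?"""
    return balance >= estimate_credits(total_pages, is_super_admin)
```

With these rates, `estimate_credits(20)` gives 15.0, matching the example above.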

Job Tracking

Each extraction creates a job that can be monitored:

  • Job ID -- unique identifier for the extraction.
  • Status -- PENDING, ANALYZING, EXTRACTING, COMPLETE, or FAILED.
  • Progress -- current phase and percentage.
  • Estimated credits -- calculated at upload time.
  • Total pages -- number of pages in the document.

You can check the status of any job at any time using the job ID.
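
If you check job status from a script, the fields above map naturally onto a small record plus a polling loop. This is a sketch under assumptions: the class and function names are invented for illustration, and `fetch_job` stands in for whatever status lookup you have available; no specific Testify API is implied.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    PENDING = "PENDING"
    ANALYZING = "ANALYZING"
    EXTRACTING = "EXTRACTING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {JobStatus.COMPLETE, JobStatus.FAILED}


@dataclass
class ExtractionJob:
    job_id: str
    status: JobStatus
    progress_percent: int
    estimated_credits: float
    total_pages: int

    def is_finished(self) -> bool:
        return self.status in TERMINAL_STATUSES


def wait_for_job(fetch_job: Callable[[str], ExtractionJob], job_id: str,
                 poll_seconds: float = 5.0, max_polls: int = 120) -> ExtractionJob:
    """Poll a hypothetical status lookup until the job completes or fails."""
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job.is_finished():
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A job that ends in FAILED is returned rather than raised, so the caller decides how to handle it.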

Background Processing

Extraction runs in the background -- you do not need to keep the page open. The system processes the document asynchronously and updates the job status as it progresses. You can return later to check results.

Supported Document Features

Feature | PDF | DOCX
Text questions | Yes | Yes
Multiple choice options | Yes | Yes
Mathematical notation | Yes (if text-based) | Yes
Images and diagrams | Yes | Limited
Tables | Yes | Yes
Multi-column layouts | Yes | Limited

Security

  • Uploaded files are stored in an isolated directory on the server.
  • Filenames are randomized to prevent path traversal attacks.
  • The image serving endpoint validates filenames to block directory traversal.
  • Files are cleaned up on upload errors.
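
Two of the measures above, randomized filenames and filename validation, follow a common pattern that can be sketched as below. This illustrates the general technique only and is not Testify's code; the function names and the exact allowed character set are assumptions.

```python
import re
import secrets
from pathlib import Path

# Assumed policy: one name made of letters, digits, "_" or "-", then one extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")
SAFE_EXTENSION = re.compile(r"\.[a-z0-9]+")


def randomized_name(original_filename: str) -> str:
    """Discard the user-supplied name; keep only a simple extension."""
    extension = Path(original_filename).suffix.lower()
    if not SAFE_EXTENSION.fullmatch(extension):
        extension = ""
    return f"{secrets.token_hex(16)}{extension}"


def is_safe_filename(name: str) -> bool:
    """Reject anything containing path separators or '..' segments."""
    return SAFE_NAME.fullmatch(name) is not None
```

Because the stored name never contains user-supplied path components, a request for something like `../secret.png` fails the validation check instead of reaching the file system.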

Tips and Best Practices

  • Use clean, well-formatted PDFs -- the AI performs best on documents with clear question numbering and consistent formatting.
  • Provide custom instructions when the document has an unusual format (e.g., answers on a separate page, questions without numbering).
  • Set taxonomy defaults before uploading to avoid having to tag every question individually afterward.
  • Review every extracted question -- AI extraction can make mistakes, especially with complex formatting or handwritten content.
  • Check credit balance before starting -- large documents can consume significant credits.
  • Use the image analysis feature for documents with diagrams that need to be associated with specific questions.

For Organizations

  • Monitor credit consumption for PDF extraction across your organization.
  • Set credit budgets for teachers to prevent excessive extraction costs.
  • Consider bulk-uploading question papers at the start of the year to build a comprehensive question bank.

Related Features

  • Question Bank -- extracted questions are saved to the question bank.
  • AI Generation -- AI credits are shared between generation and extraction.
  • Batch Import -- alternative import method using CSV/Excel files.