
PDF Extraction

Upload a PDF document and let AI extract questions automatically, saving hours of manual data entry.

Overview

The PDF Extraction feature uses AI to analyze uploaded PDFs (textbook chapters, past papers, worksheets) and extract individual questions with their options, answers, and metadata. The extraction pipeline runs in phases -- analyzing the document structure, extracting questions in batches, and validating the results. Extracted questions can be reviewed, edited, and saved directly to your question bank.

Supported File Formats

Format    Max Size    Notes
PDF       50 MB       Text-based and scanned PDFs
DOCX      50 MB       Microsoft Word documents
DOC       50 MB       Legacy Word format
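
To catch format and size problems before uploading, the limits in the table can be checked locally. This is a minimal sketch, not the product's own validation: the function name `check_upload` is hypothetical, and it assumes "50 MB" means 50 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits taken from the table above. Assumption: "50 MB" is 50 * 1024 * 1024 bytes.
MAX_BYTES = 50 * 1024 * 1024
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".doc"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 50 MB")
    return problems
```

For example, `check_upload("paper.pdf", 1024)` returns an empty list, while a `.txt` file or a 51 MB file each produce one problem message.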

Starting an Extraction

Using the PDF Extraction Modal

  1. Open PDF Extraction

    • Click "AI Tools" in the sidebar
    • Click "Extract from PDF"
    • The PDF Extraction Modal opens

  2. Upload Your File

    • Drag and drop a PDF, DOCX, or DOC file into the upload area, or click to browse
    • The file is uploaded to the server and a new extraction job is created
  3. Set Taxonomy Defaults (optional)

    • Before extraction begins, you can set default taxonomy values:
      • Board: e.g., CBSE, ICSE
      • Grade: e.g., Grade 10
      • Subject: e.g., Physics
      • Chapter: e.g., Motion
      • Topic: e.g., Newton's Laws
    • These defaults are applied to all extracted questions, saving you from tagging each one individually

  4. Start Extraction

    • Click "Start Extraction"
    • The extraction job begins processing

Tip: Set taxonomy defaults before extraction. Every extracted question then arrives already tagged, which cuts the time spent reviewing and organizing questions.
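
The effect of taxonomy defaults can be pictured as filling in tags on each extracted question. The sketch below is illustrative only: the field names and the `apply_defaults` helper are hypothetical, and it assumes defaults fill only fields a question does not already have (the guide does not say whether defaults override existing values).

```python
# Hypothetical field names mirroring the taxonomy levels listed above.
TAXONOMY_FIELDS = ("board", "grade", "subject", "chapter", "topic")


def apply_defaults(question: dict, defaults: dict) -> dict:
    """Return a copy of `question` with missing taxonomy fields filled from `defaults`."""
    tagged = dict(question)
    for field in TAXONOMY_FIELDS:
        # Assumption: an existing value on the question wins over the default.
        if not tagged.get(field) and defaults.get(field):
            tagged[field] = defaults[field]
    return tagged


defaults = {"board": "CBSE", "grade": "Grade 10", "subject": "Physics"}
questions = [
    {"text": "State Newton's first law."},
    {"text": "Define pH.", "subject": "Chemistry"},
]
tagged = [apply_defaults(q, defaults) for q in questions]
```

Here the first question picks up all three defaults, while the second keeps its own subject and gains only the board and grade.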

Extraction Phases

The extraction runs through five phases with real-time progress updates.

Phase 1: PENDING

  • The job has been created and is queued for processing
  • Progress: 0%

Phase 2: ANALYZING

  • AI reads and analyzes the document structure

  • Identifies:

    • Total page count
    • Document layout (single column, multi-column, etc.)
    • Content type (textbook, question paper, worksheet)
    • Estimated number of questions
    • Estimated credit cost
  • The Analysis tab shows document metadata

Phase 3: EXTRACTING

  • AI extracts questions batch by batch

  • Progress bar shows current batch out of total batches

  • Each batch extracts a group of questions

  • Extracted data includes:

    • Question text (with LaTeX for math content)
    • Options (for MCQ-type questions)
    • Correct answer
    • Marks/weightage
    • Question type classification

Phase 4: VALIDATING

  • The system validates extracted questions for:
    • Completeness (question text present, options complete)
    • Answer consistency (the marked answer matches one of the options)
    • Duplicate detection
    • Format consistency
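
As a rough illustration of what such checks involve, the sketch below covers completeness, answer matching, and duplicate detection on plain dictionaries. It is not the system's actual validator: the `validate` function and the `text`, `options`, and `answer` keys are assumptions, and format consistency is left out.

```python
def validate(questions: list[dict]) -> list[tuple[int, str]]:
    """Return (index, problem) pairs for the questions that fail a check."""
    problems = []
    seen = {}  # normalized question text -> index of first occurrence
    for i, q in enumerate(questions):
        text = (q.get("text") or "").strip()
        options = q.get("options") or []
        answer = q.get("answer")

        # Completeness: question text present, no empty options.
        if not text:
            problems.append((i, "missing question text"))
        if any(not str(o).strip() for o in options):
            problems.append((i, "empty option"))

        # Answer check: for questions with options, the marked answer must be one of them.
        if options and answer not in options:
            problems.append((i, "answer does not match any option"))

        # Duplicate detection: compare case- and whitespace-normalized text.
        key = " ".join(text.lower().split())
        if key and key in seen:
            problems.append((i, f"duplicate of question {seen[key]}"))
        elif key:
            seen[key] = i
    return problems
```

Exact-text matching only catches near-verbatim duplicates; reworded duplicates would need a fuzzier comparison.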

Phase 5: COMPLETED

  • Extraction is finished

  • Summary shows:

    • Total questions extracted vs. the number estimated during the ANALYZING phase
    • Credits used for the extraction
    • Link to review extracted questions

Tip: You can switch between the Progress, Analysis, and Logs tabs while extraction is running to monitor different aspects of the process.
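
The five phases form a simple linear sequence, which can be modeled as an ordered enum. This is a sketch under assumptions: the phase names come from this page, but the `Phase` class and `next_phase` helper are hypothetical, and failure handling (for example, an extraction that stops on insufficient credits) is not modeled.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    PENDING = 1
    ANALYZING = 2
    EXTRACTING = 3
    VALIDATING = 4
    COMPLETED = 5


def next_phase(phase: Phase) -> Optional[Phase]:
    """Return the phase that follows `phase`, or None once the job is COMPLETED."""
    if phase is Phase.COMPLETED:
        return None
    return Phase(phase.value + 1)


def is_running(phase: Phase) -> bool:
    """A job is still in progress until it reaches COMPLETED."""
    return phase is not Phase.COMPLETED
```

Walking `next_phase` from `Phase.PENDING` visits the phases in the order they are described above.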

Monitoring Progress

The modal provides three views during extraction:

Progress Tab

  • Visual progress bar with percentage
  • Current phase name and description
  • Batch progress (e.g., "Batch 3 of 5")

Analysis Tab

  • Document metadata discovered during the ANALYZING phase
  • Page count, content summary, and extraction strategy

Logs Tab

  • Detailed step-by-step log of the extraction process
  • Each log entry shows:
    • Timestamp
    • Step name
    • Status message
    • Additional details (if any)
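
The four parts of a log entry map naturally onto a small record type. The sketch below is only a way to picture the structure: the `LogEntry` class, its field names, and the output format are assumptions, not the format the Logs tab uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LogEntry:
    timestamp: datetime
    step: str
    message: str
    details: Optional[str] = None  # only present on some entries

    def format(self) -> str:
        """Render the entry as a single line, appending details when present."""
        line = f"[{self.timestamp.isoformat()}] {self.step}: {self.message}"
        return f"{line} ({self.details})" if self.details else line


entry = LogEntry(
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    step="EXTRACTING",
    message="Batch 3 of 5 done",
)
```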

Reviewing Extracted Questions

After extraction completes, review the questions before saving them.

  1. Click "Review Questions"

    • The modal directs you to the import batch review page
    • Or navigate to "Import" > "Batch History" and find the batch
  2. Browse Extracted Questions

    • Each question is displayed with:
      • Question text (rendered with LaTeX math)
      • Options (if applicable)
      • Detected question type
      • Marks
      • Confidence score

  3. Edit Individual Questions

    • Click "Edit" on any question to open the question editor
    • Fix any extraction errors:
      • Correct garbled text
      • Fix math formulas
      • Re-order or add missing options
      • Mark the correct answer
    • Click "Save"
  4. Set Taxonomy

    • If you did not set defaults before extraction, assign taxonomy now:
      • Select questions (checkboxes)
      • Use "Bulk Assign" to set board, subject, chapter, and topic for all selected questions
    • Or edit taxonomy per question individually
  5. Approve or Reject

    • Approve questions that are correct and ready for use
    • Reject questions that are unusable (badly extracted, incomplete, duplicates)
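
One way to prioritize a review pass is to look at the lowest-confidence questions first. The sketch below assumes the confidence score is a number between 0 and 1 stored under a `confidence` key, and the 0.8 threshold is an arbitrary illustration; neither is specified in this guide.

```python
def review_order(questions: list[dict], threshold: float = 0.8):
    """Split questions into (needs_attention, likely_ok), lowest confidence first."""
    ranked = sorted(questions, key=lambda q: q["confidence"])
    needs_attention = [q for q in ranked if q["confidence"] < threshold]
    likely_ok = [q for q in ranked if q["confidence"] >= threshold]
    return needs_attention, likely_ok
```

A high score does not guarantee a correct extraction, so questions in the second group still deserve a quick check before approval.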

Saving to Question Bank

  1. Select Questions to Save

    • Check the questions you want to keep
    • Or use "Select All" to include everything
  2. Click "Save to Question Bank"

    • Selected questions are added to your question bank
    • They appear in the Question Bank with:
      • Source tagged as "PDF Import"
      • Batch ID for traceability
      • Taxonomy as assigned
  3. Use in Exams

    • Saved questions are immediately available for use in paper building and exam creation

Using File-to-LaTeX Converter

For extracting content from images or converting documents to LaTeX format:

  1. Open File-to-LaTeX Converter

    • Click "AI Tools" > "File to LaTeX"
  2. Upload a File

    • Drag and drop or click to upload
    • Supported formats: PNG, JPG, PDF, DOCX
  3. AI Conversion

    • The AI processes the file and produces:
      • Raw LaTeX output
      • Parsed questions with detected structure
      • Metadata (total questions, OR blocks, case-based questions)
    • Detected elements are flagged: text, math, chemistry formulas, tables

  4. Review Side-by-Side

    • The converter shows the original file alongside the LaTeX output
    • Toggle between Preview (rendered) and Code (raw LaTeX) views
  5. Insert into Question Editor

    • Click "Insert" to paste the LaTeX into the question editor
    • Or click "Insert Question" to import a detected question directly
    • Use "Copy" to copy the raw LaTeX to clipboard

Tip: The File-to-LaTeX converter is especially useful for math-heavy content where OCR alone cannot capture formulas accurately.

Credit Costs

PDF extraction consumes AI credits from your account.

Action                  Credit Cost
Document analysis       Included in extraction cost
Question extraction     Varies by document length and complexity
Estimated cost shown    Displayed when the ANALYZING phase completes, before extraction begins

  • The estimated credit cost is shown after the ANALYZING phase completes
  • Actual credits used are displayed after the COMPLETED phase
  • If you have insufficient credits, the extraction will fail with an error

Tip: Short documents (5-10 pages) typically cost fewer credits than long ones. Very long documents (50+ pages) with dense content may cost significantly more.

Best Practices

Preparing PDFs for Extraction

  • For scanned documents, use clear, high-resolution scans (300 DPI recommended)
  • Where possible, use PDFs with selectable text rather than images of text
  • Remove watermarks or background images that may confuse OCR
  • Crop out irrelevant headers, footers, or margin notes

Maximizing Extraction Quality

  • Set taxonomy defaults before starting extraction
  • Review all extracted questions carefully, since AI extraction can introduce errors
  • Pay special attention to math formulas and chemical equations
  • Verify correct answers are properly identified

Handling Poor Extractions

  • If a document produces many errors, try the File-to-LaTeX converter instead
  • For scanned handwritten papers, quality depends heavily on handwriting clarity
  • Split large documents into smaller sections for better results
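
When splitting a large document, it helps to work out the page ranges first. The helper below is a hypothetical planning aid, not part of the product: it computes inclusive, 1-based page ranges, and the pages-per-part value is up to you.

```python
def page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if total_pages < 0 or pages_per_part < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_part >= 1")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]
```

For a 65-page document split into parts of 20 pages, this yields pages 1-20, 21-40, 41-60, and 61-65. Where possible, adjust the boundaries so a question is not cut in half between two parts.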

Common Issues

Extraction Stuck or Failed

  • Check that the file is not password-protected
  • Ensure the file is no larger than 50 MB
  • Try re-uploading the file
  • Check your credit balance

Math Formulas Not Extracted Correctly

  • Use the File-to-LaTeX converter for math-heavy content
  • Edit extracted LaTeX in the question editor
  • Preview LaTeX rendering before saving

"Insufficient Credits" Error

  • Purchase additional AI credits from the Billing page
  • Or upgrade to a plan that includes more credits

Questions Missing from Extraction

  • The AI may miss questions in unusual formats or layouts
  • Manually add missing questions to your question bank
  • Report extraction issues to support for improvement

Need Help?

Contact support at support@edukali.ai