PDF Extraction
Upload a PDF document and let AI extract questions automatically, saving hours of manual data entry.
Overview
The PDF Extraction feature uses AI to analyze uploaded PDFs (textbook chapters, past papers, worksheets) and extract individual questions with their options, answers, and metadata. The extraction pipeline runs in phases: it analyzes the document structure, extracts questions in batches, and validates the results. Extracted questions can be reviewed, edited, and saved directly to your question bank.
Supported File Formats
| Format | Max Size | Notes |
|---|---|---|
| PDF | 50 MB | Text-based and scanned PDFs |
| DOCX | 50 MB | Microsoft Word documents |
| DOC | 50 MB | Legacy Word format |
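The limits in the table can be checked locally before uploading. The following is a minimal sketch, not the product's own validation: the helper name `check_upload` is hypothetical, and it assumes "50 MB" means 50 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits taken from the table above; the helper itself is hypothetical.
MAX_SIZE_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" is read as 50 MiB
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 50 MB")
    return problems


print(check_upload("chapter3.pdf", 12_000_000))
print(check_upload("scan.png", 60 * 1024 * 1024))
```

An empty list means the file passes both checks. The server may apply additional checks (for example, rejecting password-protected files) that this sketch does not cover.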
Starting an Extraction
Using the PDF Extraction Modal
- Open PDF Extraction
  - Click "AI Tools" in the sidebar
  - Click "Extract from PDF"
  - The PDF Extraction Modal opens
- Upload Your File
  - Drag and drop a PDF file into the upload area, or click to browse
  - The file is uploaded to the server and a new extraction job is created
- Set Taxonomy Defaults (optional)
  - Before extraction begins, you can set default taxonomy values:
    - Board: e.g., CBSE, ICSE
    - Grade: e.g., Grade 10
    - Subject: e.g., Physics
    - Chapter: e.g., Motion
    - Topic: e.g., Newton's Laws
  - These defaults are applied to all extracted questions, saving you from tagging each one individually
- Start Extraction
  - Click "Start Extraction"
  - The extraction job begins processing
Tip: Setting taxonomy defaults before extraction is highly recommended. It dramatically reduces the time needed to review and organize extracted questions.
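To illustrate what "applied to all extracted questions" means, here is a small sketch of merging defaults into question records. The field names and the function `apply_taxonomy_defaults` are hypothetical, and the precedence rule (a value already on a question wins over the default) is an assumption, not documented product behavior.

```python
# Hypothetical field names mirroring the taxonomy levels listed above.
TAXONOMY_FIELDS = ("board", "grade", "subject", "chapter", "topic")


def apply_taxonomy_defaults(questions, defaults):
    """Fill in missing taxonomy fields on each question from the defaults.

    Values already present on a question are kept (an assumed precedence
    rule; the product may behave differently). Inputs are not modified.
    """
    tagged = []
    for question in questions:
        merged = dict(question)  # copy so the original record is untouched
        for field in TAXONOMY_FIELDS:
            if not merged.get(field) and defaults.get(field):
                merged[field] = defaults[field]
        tagged.append(merged)
    return tagged


defaults = {"board": "CBSE", "grade": "Grade 10", "subject": "Physics"}
questions = [
    {"text": "State Newton's second law."},
    {"text": "Define displacement.", "chapter": "Motion"},
]
for tagged_question in apply_taxonomy_defaults(questions, defaults):
    print(tagged_question)
```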
Extraction Phases
The extraction runs through five phases with real-time progress updates.
Phase 1: PENDING
- The job has been created and is queued for processing
- Progress: 0%
Phase 2: ANALYZING
- AI reads and analyzes the document structure
- Identifies:
  - Total page count
  - Document layout (single column, multi-column, etc.)
  - Content type (textbook, question paper, worksheet)
  - Estimated number of questions
  - Estimated credit cost
- The Analysis tab shows document metadata
Phase 3: EXTRACTING
- AI extracts questions batch by batch
- Progress bar shows current batch out of total batches
- Each batch extracts a group of questions
- Extracted data includes:
  - Question text (with LaTeX for math content)
  - Options (for MCQ-type questions)
  - Correct answer
  - Marks/weightage
  - Question type classification
Phase 4: VALIDATING
- The system validates extracted questions for:
  - Completeness (question text present, options complete)
  - Answer correctness (marked answer matches an option)
  - Duplicate detection
  - Format consistency
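The first three checks can be pictured with a short sketch. This is an illustration only, not the system's actual validator: the record shape (`text`, `type`, `options`, `answer`), the function names, and the duplicate rule (same text after lowercasing and collapsing whitespace) are all assumptions. Format consistency is left out because the source does not say what it covers.

```python
def validate_question(question: dict) -> list[str]:
    """Return a list of issues for one question (hypothetical record shape)."""
    issues = []
    text = (question.get("text") or "").strip()
    if not text:
        issues.append("missing question text")
    if question.get("type") == "mcq":
        options = question.get("options", [])
        # Completeness: at least two options, none of them blank.
        if len(options) < 2 or any(not option.strip() for option in options):
            issues.append("incomplete options")
        # Answer correctness: the marked answer must be one of the options.
        if question.get("answer") not in options:
            issues.append("answer does not match any option")
    return issues


def find_duplicates(questions: list[dict]) -> list[tuple[int, int]]:
    """Return (first_index, duplicate_index) pairs, comparing normalized text."""
    seen = {}
    duplicates = []
    for index, question in enumerate(questions):
        key = " ".join((question.get("text") or "").lower().split())
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


batch = [
    {"text": "What is the SI unit of force?", "type": "mcq",
     "options": ["Newton", "Joule", "Watt", "Pascal"], "answer": "Newton"},
    {"text": "what is the SI unit  of force?", "type": "mcq",
     "options": ["Newton", "Joule"], "answer": "Volt"},
]
print([validate_question(q) for q in batch])
print(find_duplicates(batch))
```

Questions flagged this way are the ones worth checking first during review.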
Phase 5: COMPLETED
- Extraction is finished
- Summary shows:
  - Total questions extracted vs. expected
  - Credits used for the extraction
  - Link to review extracted questions
Tip: You can switch between the Progress, Analysis, and Logs tabs while extraction is running to monitor different aspects of the process.
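The five phases form a simple linear sequence. The sketch below models that ordering for readers who think in code. The class and function names are hypothetical, and a failure state is deliberately not modeled because the source does not name one.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """The five phases described above, in order."""
    PENDING = 1
    ANALYZING = 2
    EXTRACTING = 3
    VALIDATING = 4
    COMPLETED = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after COMPLETED."""
    if current is Phase.COMPLETED:
        return None
    return Phase(current + 1)


def describe(phase: Phase, batch: Optional[int] = None,
             total_batches: Optional[int] = None) -> str:
    """Build a status label; batch details only apply while extracting."""
    if phase is Phase.EXTRACTING and batch and total_batches:
        return f"{phase.name} (Batch {batch} of {total_batches})"
    return phase.name


phase: Optional[Phase] = Phase.PENDING
while phase is not None:
    print(describe(phase, batch=3, total_batches=5))
    phase = next_phase(phase)
```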
Monitoring Progress
The modal provides three views during extraction:
Progress Tab
- Visual progress bar with percentage
- Current phase name and description
- Batch progress (e.g., "Batch 3 of 5")
Analysis Tab
- Document metadata discovered during the ANALYZING phase
- Page count, content summary, and extraction strategy
Logs Tab
- Detailed step-by-step log of the extraction process
- Each log entry shows:
  - Timestamp
  - Step name
  - Status message
  - Additional details (if any)
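As a rough picture of a log entry, the four fields above map naturally onto a small record type. This is a sketch under assumptions: the class `LogEntry`, its field names, and the output format are hypothetical, and the actual log format in the modal may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    """One log line: timestamp, step name, status message, optional details."""
    timestamp: datetime
    step: str
    message: str
    details: dict = field(default_factory=dict)

    def format(self) -> str:
        line = f"[{self.timestamp.isoformat()}] {self.step}: {self.message}"
        if self.details:
            extras = ", ".join(f"{key}={value}" for key, value in self.details.items())
            line += f" ({extras})"
        return line


entry = LogEntry(
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    step="EXTRACTING",
    message="Batch 3 of 5 complete",
    details={"questions": 12},  # illustrative value, not real output
)
print(entry.format())
```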
Reviewing Extracted Questions
After extraction completes, review the questions before saving them.
- Click "Review Questions"
  - The modal directs you to the import batch review page
  - Or navigate to "Import" > "Batch History" and find the batch
- Browse Extracted Questions
  - Each question is displayed with:
    - Question text (rendered with LaTeX math)
    - Options (if applicable)
    - Detected question type
    - Marks
    - Confidence score
- Edit Individual Questions
  - Click "Edit" on any question to open the question editor
  - Fix any extraction errors:
    - Correct garbled text
    - Fix math formulas
    - Re-order or add missing options
    - Mark the correct answer
  - Click "Save"
- Set Taxonomy
  - If you did not set defaults before extraction, assign taxonomy now:
    - Select questions (checkboxes)
    - Use "Bulk Assign" to set board, subject, chapter, topic for all selected
    - Or edit taxonomy per question individually
- Approve or Reject
  - Approve questions that are correct and ready for use
  - Reject questions that are unusable (badly extracted, incomplete, duplicates)
Saving to Question Bank
- Select Questions to Save
  - Check the questions you want to keep
  - Or use "Select All" to include everything
- Click "Save to Question Bank"
  - Selected questions are added to your question bank
  - They appear in the Question Bank with:
    - Source tagged as "PDF Import"
    - Batch ID for traceability
    - Taxonomy as assigned
- Use in Exams
  - Saved questions are immediately available for use in paper building and exam creation
Using File-to-LaTeX Converter
For extracting content from images or converting documents to LaTeX format:
- Open File-to-LaTeX Converter
  - Click "AI Tools" > "File to LaTeX"
- Upload a File
  - Drag and drop or click to upload
  - Supported formats: PNG, JPG, PDF, DOCX
- AI Conversion
  - The AI processes the file and produces:
    - Raw LaTeX output
    - Parsed questions with detected structure
    - Metadata (total questions, OR blocks, case-based questions)
  - Detected elements are flagged: text, math, chemistry formulas, tables
- Review Side-by-Side
  - The converter shows the original file alongside the LaTeX output
  - Toggle between Preview (rendered) and Code (raw LaTeX) views
- Insert into Question Editor
  - Click "Insert" to paste the LaTeX into the question editor
  - Or click "Insert Question" to import a detected question directly
  - Use "Copy" to copy the raw LaTeX to clipboard
Tip: The File-to-LaTeX converter is especially useful for math-heavy content where OCR alone cannot capture formulas accurately.
Credit Costs
PDF extraction consumes AI credits from your account.
| Action | Credit Cost |
|---|---|
| Document analysis | Included in extraction cost |
| Question extraction | Varies by document length and complexity |
| Cost estimate | Shown when the ANALYZING phase completes, before extraction begins |
- The estimated credit cost is shown after the ANALYZING phase completes
- Actual credits used are displayed once the job reaches the COMPLETED phase
- If you have insufficient credits, the extraction will fail with an error
Tip: Cost scales with document length and density. Short documents (5-10 pages) typically use few credits, while very long documents (50+ pages) with dense content may cost significantly more.
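Because an extraction fails when credits run short, it helps to compare the estimate from the ANALYZING phase with your balance before continuing. A minimal sketch of that comparison follows. The function name and the idea of reading both numbers as integers are assumptions, and the credit values are illustrative only.

```python
def credits_shortfall(balance: int, estimated_cost: int) -> int:
    """Return how many more credits are needed; 0 means the balance suffices."""
    if balance < 0 or estimated_cost < 0:
        raise ValueError("balance and estimated_cost must be non-negative")
    return max(0, estimated_cost - balance)


# Illustrative numbers only; real costs vary by document.
print(credits_shortfall(balance=120, estimated_cost=80))
print(credits_shortfall(balance=30, estimated_cost=80))
```

A result above zero suggests topping up credits before the job starts.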
Best Practices
Preparing PDFs for Extraction
- Use clear, high-resolution scans (300 DPI recommended)
- Where possible, use PDFs with selectable text rather than images of text; scanned PDFs are supported but depend on OCR quality
- Remove watermarks or background images that may confuse OCR
- Crop out irrelevant headers, footers, or margin notes
Maximizing Extraction Quality
- Set taxonomy defaults before starting extraction
- Review all extracted questions carefully; AI extraction is not perfect
- Pay special attention to math formulas and chemical equations
- Verify correct answers are properly identified
Handling Poor Extractions
- If a document produces many errors, try the File-to-LaTeX converter instead
- For scanned handwritten papers, quality depends heavily on handwriting clarity
- Split large documents into smaller sections for better results
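When splitting a large document, it helps to plan the page ranges first. The sketch below only computes the ranges (1-indexed, inclusive); actually cutting the PDF requires a PDF tool or library of your choice. The function name and the chunk size of 50 pages are assumptions for illustration, not a product recommendation.

```python
def page_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (start, end) ranges."""
    if total_pages < 0 or pages_per_chunk <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk > 0")
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + pages_per_chunk - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
    return ranges


print(page_ranges(120, 50))  # [(1, 50), (51, 100), (101, 120)]
```

Splitting at chapter or section boundaries rather than fixed page counts avoids cutting a question in half across two files.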
Common Issues
Extraction Stuck or Failed
- Check that the file is not password-protected
- Ensure the file is under 50 MB
- Try re-uploading the file
- Check your credit balance
Math Formulas Not Extracted Correctly
- Use the File-to-LaTeX converter for math-heavy content
- Edit extracted LaTeX in the question editor
- Preview LaTeX rendering before saving
"Insufficient Credits" Error
- Purchase additional AI credits from the Billing page
- Or upgrade to a plan that includes more credits
Questions Missing from Extraction
- The AI may miss questions in unusual formats or layouts
- Manually add missing questions to your question bank
- Report extraction issues to support for improvement
Next Steps
- Creating Questions - Manual question creation
- Question Bank - Organize extracted questions
- Building Exams - Use extracted questions in exams
Need Help?
Contact support at support@edukali.ai