Unlocking Hidden Business Intelligence from Documents
Many organizations sit on goldmines of untapped business intelligence locked away in paper documents and PDFs. With the rise of generative AI and large language models, we now have powerful tools to extract meaningful data from these documents at scale. In this post, we'll explore how to build intelligent document processing pipelines using Amazon Bedrock that give you flexibility in both processing time and cost.
The Challenge: Millions of Documents, Multiple Formats
Imagine having hundreds of millions of scanned PDF documents—like land lease agreements—sitting in your backlog, with new ones arriving daily. Each document might have a different format: some present information in numbered lists, others in tables, and some even include hand-drawn diagrams. How do you efficiently extract structured data from this variety while balancing speed and cost?
Solution Overview: Dual Pipeline Architecture
The solution involves building two complementary inference pipelines:
- On-demand pipeline: Processes documents individually for time-sensitive requests, returning results within seconds
- Batch inference pipeline: Handles multiple documents asynchronously for cost-optimized processing
Both pipelines leverage Amazon Bedrock's Prompt Management feature, allowing you to dynamically specify different large language models and prompts at the document level—perfect for handling varying document formats with the same infrastructure.
On-Demand Pipeline: Speed When You Need It
The on-demand pipeline uses an AWS SQS FIFO queue to maintain message ordering and ensure reliable delivery. Here's how it works:
Key Components:
- SQS FIFO Queue: Triggers processing with message attributes including document ID, LLM model ID, and prompt versions
- Lambda Function: Handles the heavy lifting of document processing
- Amazon Bedrock: Performs the actual data extraction using multimodal models
- DynamoDB: Stores extracted results and performance metrics
Processing Flow:
- Lambda retrieves the PDF from S3 and converts pages to PNG images
- Relevant prompts are fetched from Amazon Bedrock Prompt Management
- For documents with more than 20 pages (current Claude 4 Sonnet limit), the function splits them into manageable chunks
- The LLM processes the images and extracts data in JSON format
- Results are stored in DynamoDB with tracking information for chunks
Dynamic Prompt Management
One of the most powerful features is the ability to use different prompts for different document types. Since land lease documents can vary dramatically in format, you can specify the appropriate prompt ID and version in each queue message. This ensures optimal extraction accuracy for each document type.
Batch Pipeline: Cost-Optimized Processing
For non-urgent processing needs, the batch pipeline offers significant cost savings by processing multiple documents in a single Amazon Bedrock batch inference job.
Key Differences:
- Uses standard SQS queue for higher throughput
- EventBridge Scheduler triggers processing on a schedule
- Requires minimum 100 records for batch jobs
- Handles duplicate message detection (since standard SQS doesn't guarantee exactly-once delivery)
- Asynchronous processing with EventBridge rules for completion handling
Practical Implementation Tips
Message Structure
Queue messages contain essential attributes like:
- Document S3 location
- LLM model ID
- Prompt ID and version
- System prompt ID and version
Handling Large Documents
The solution elegantly handles documents exceeding model limits by chunking them and tracking each piece with unique identifiers. This ensures no data is lost while maintaining processing efficiency.
Prompt Versioning Strategy
With Amazon Bedrock's limit of 50 prompts per region and 10 versions per prompt, careful prompt management becomes crucial. Consider creating prompts for different document categories and using versioning for refinements.
Choosing Your Pipeline
The decision between on-demand and batch processing depends on your specific needs:
Choose on-demand when:
- Processing time-sensitive documents
- Need immediate results for downstream systems
- Handling small volumes of high-priority documents
Choose batch processing when:
- Cost optimization is the priority
- Processing large backlogs
- Results can wait for scheduled processing windows
The Bottom Line
This dual-pipeline approach gives you the flexibility to handle diverse document processing needs efficiently. By leveraging Amazon Bedrock's prompt management and both on-demand and batch inference capabilities, you can unlock valuable business intelligence from your document archives while optimizing for both speed and cost.
The key to success lies in the dynamic prompt selection capability, which allows a single infrastructure to handle multiple document types effectively. Whether you're processing urgent contracts or working through historical archives, this solution adapts to your needs.
Source: AWS Machine Learning Blog by Tim Shear