| # SPARKNET Cloud Architecture |
|
|
| This document outlines the cloud-ready architecture for deploying SPARKNET on AWS. |
|
|
| ## Overview |
|
|
| SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure. |
|
|
| ## Local Development Stack |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────┐ |
| │ Local Machine │ |
| ├─────────────────────────────────────────────────────┤ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ |
| │ │ Ollama │ │ ChromaDB │ │ File I/O │ │ |
| │ │ (LLM) │ │ (Vector) │ │ (Storage) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────┘ │ |
| │ │ │ │ │ |
| │ └───────────────┼───────────────┘ │ |
| │ │ │ |
| │ ┌────────┴────────┐ │ |
| │ │ SPARKNET │ │ |
| │ │ Application │ │ |
| │ └─────────────────┘ │ |
| └─────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ## AWS Cloud Architecture |
|
|
| ### Target Architecture |
|
|
| ``` |
| ┌────────────────────────────────────────────────────────────────────┐ |
| │ AWS Cloud │ |
| ├────────────────────────────────────────────────────────────────────┤ |
| │ │ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ |
| │ │ API GW │──────│ Lambda │──────│ Step Functions │ │ |
| │ │ (REST) │ │ (Compute) │ │ (Orchestration) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ |
| │ │ │ │ │ |
| │ │ │ │ │ |
| │ ▼ ▼ ▼ │ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ |
| │ │ S3 │ │ Bedrock │ │ OpenSearch │ │ |
| │ │ (Storage) │ │ (LLM) │ │ (Vector Store) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ |
| │ │ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ |
| │ │ Textract │ │ Titan │ │ DynamoDB │ │ |
| │ │ (OCR) │ │ (Embeddings)│ │ (Metadata) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ |
| │ │ |
| └────────────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ### Component Mapping |
|
|
| | Local Component | AWS Service | Purpose | |
| |----------------|-------------|---------| |
| | File I/O | S3 | Document storage | |
| | PaddleOCR/Tesseract | Textract | OCR extraction | |
| | Ollama LLM | Bedrock (Claude/Titan) | Text generation | |
| | Ollama Embeddings | Titan Embeddings | Vector embeddings | |
| | ChromaDB | OpenSearch Serverless | Vector search | |
| | SQLite (optional) | DynamoDB | Metadata storage | |
| | Python Process | Lambda | Compute | |
| | CLI | API Gateway | HTTP interface | |
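The mapping above can be wired into code with a small environment-driven factory so the same pipeline runs locally or on AWS. This is an illustrative sketch: the `SPARKNET_ENV` variable and the registry layout are assumptions, not an existing SPARKNET API.

```python
import os

# Hypothetical registry mapping deployment mode to adapter names; the real
# adapter classes are introduced in the migration phases below.
_BACKENDS = {
    "local": {"storage": "LocalStorageAdapter", "ocr": "PaddleOCREngine",
              "llm": "OllamaAdapter", "vectors": "ChromaVectorStore"},
    "aws":   {"storage": "S3StorageAdapter", "ocr": "TextractEngine",
              "llm": "BedrockAdapter", "vectors": "OpenSearchVectorStore"},
}

def resolve_backends(mode=None):
    """Pick the adapter set for the current deployment mode."""
    mode = mode or os.environ.get("SPARKNET_ENV", "local")
    if mode not in _BACKENDS:
        raise ValueError(f"unknown deployment mode: {mode}")
    return _BACKENDS[mode]
```

Keeping the switch in one place means the rest of the pipeline only ever sees the abstract interfaces.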
|
|
| ## Migration Strategy |
|
|
| ### Phase 1: Storage Migration |
|
|
| ```python |
import boto3
from pathlib import Path

# Abstract storage interface
| class StorageAdapter: |
| def put(self, key: str, data: bytes) -> str: ... |
| def get(self, key: str) -> bytes: ... |
| def delete(self, key: str) -> bool: ... |
| |
| # Local implementation |
| class LocalStorageAdapter(StorageAdapter): |
| def __init__(self, base_path: str): |
| self.base_path = Path(base_path) |
| |
| # S3 implementation |
| class S3StorageAdapter(StorageAdapter): |
| def __init__(self, bucket: str): |
| self.client = boto3.client('s3') |
| self.bucket = bucket |
| ``` |
|
|
| ### Phase 2: OCR Migration |
|
|
| ```python |
import boto3
import cv2  # used to encode the ndarray to image bytes
import numpy as np

# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects raw image bytes, so encode the array first
        ok, buf = cv2.imencode('.png', image)
        if not ok:
            raise ValueError("failed to encode image")
        response = self.client.detect_document_text(
            Document={'Bytes': buf.tobytes()}
        )
        return self._convert_response(response)
| ``` |
|
|
| ### Phase 3: LLM Migration |
|
|
| ```python |
import json

import boto3

# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude 3 models on Bedrock use the Messages API body format
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        # The response body is a streaming object; read and parse it
        payload = json.loads(response['body'].read())
        return payload['content'][0]['text']
| ``` |
|
|
| ### Phase 4: Vector Store Migration |
|
|
| ```python |
from opensearchpy import OpenSearch

# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN queries are nested under "query" in the search body
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)
| ``` |
|
|
| ## AWS Services Deep Dive |
|
|
| ### Amazon S3 |
|
|
| - **Purpose**: Document storage and processed results |
| - **Structure**: |
| ``` |
| s3://sparknet-documents/ |
| ├── raw/ # Original documents |
| │ └── {doc_id}/ |
| │ └── document.pdf |
| ├── processed/ # Processed results |
| │ └── {doc_id}/ |
| │ ├── metadata.json |
| │ ├── chunks.json |
| │ └── pages/ |
| │ ├── page_0.png |
| │ └── page_1.png |
| └── cache/ # Processing cache |
| ``` |
|
|
| ### Amazon Textract |
|
|
| - **Purpose**: OCR extraction with layout analysis |
| - **Features**: |
| - Document text detection |
| - Table extraction |
| - Form extraction |
| - Handwriting recognition |
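Plain text detection uses `DetectDocumentText`, while tables and forms require `AnalyzeDocument` with explicit feature types. A minimal sketch; the helper names are illustrative, and the synchronous API shown here is limited to single-page images or small documents passed as bytes:

```python
def textract_features(tables: bool = True, forms: bool = True) -> list:
    """Build the FeatureTypes list for an AnalyzeDocument call."""
    features = []
    if tables:
        features.append("TABLES")
    if forms:
        features.append("FORMS")
    return features

def analyze_image(image_bytes: bytes, tables: bool = True, forms: bool = True) -> dict:
    """Synchronous AnalyzeDocument call on raw image/PDF bytes."""
    import boto3  # deferred so textract_features() is usable without boto3
    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=textract_features(tables, forms),
    )
```

For multi-page PDFs stored in S3, the asynchronous `start_document_analysis` API is the appropriate variant.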
|
|
| ### Amazon Bedrock |
|
|
| - **Purpose**: LLM inference |
| - **Models**: |
| - Claude 3.5 Sonnet (primary) |
| - Titan Text (cost-effective) |
| - Titan Embeddings (vectors) |
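Embedding calls go through the same `InvokeModel` API as text generation. A sketch of the request payload; the model id and the `dimensions` field are assumptions to verify against the current Bedrock model list (1024 matches the dimension used in the vector index configuration below):

```python
import json

# Assumed Titan embeddings model id; confirm in the Bedrock console.
TITAN_EMBED_MODEL = "amazon.titan-embed-text-v2:0"

def titan_embedding_request(text: str, dimensions: int = 1024) -> dict:
    """Build the keyword arguments for an InvokeModel embedding call."""
    return {
        "modelId": TITAN_EMBED_MODEL,
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps({"inputText": text, "dimensions": dimensions}),
    }
```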
|
|
| ### Amazon OpenSearch Serverless |
|
|
| - **Purpose**: Vector search and retrieval |
| - **Configuration**: |
| ```json |
| { |
| "index": "sparknet-vectors", |
| "settings": { |
| "index.knn": true, |
| "index.knn.space_type": "cosinesimil" |
| }, |
| "mappings": { |
| "properties": { |
| "embedding": { |
| "type": "knn_vector", |
| "dimension": 1024 |
| } |
| } |
| } |
| } |
| ``` |
|
|
| ### AWS Lambda |
|
|
| - **Purpose**: Serverless compute |
| - **Functions**: |
| - `process-document`: Document processing pipeline |
| - `extract-fields`: Field extraction |
| - `rag-query`: RAG query handling |
| - `index-document`: Vector indexing |
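A minimal handler shape for `process-document` might look like the following; the event fields and response format are illustrative, and the actual pipeline steps are omitted:

```python
import json

def process(event, context):
    """Sketch of a document-processing Lambda entry point."""
    doc_id = (event or {}).get("doc_id")
    if not doc_id:
        return {"statusCode": 400,
                "body": json.dumps({"error": "doc_id is required"})}
    # Real implementation: fetch raw/{doc_id}/ from S3, run OCR,
    # write chunks back to processed/{doc_id}/ (omitted here).
    return {"statusCode": 200,
            "body": json.dumps({"doc_id": doc_id, "status": "processed"})}
```

Returning a structured error for missing input keeps Step Functions retries and API Gateway responses predictable.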
|
|
| ### AWS Step Functions |
|
|
| - **Purpose**: Workflow orchestration |
| - **Workflow**: |
| ```json |
| { |
| "StartAt": "ProcessDocument", |
| "States": { |
| "ProcessDocument": { |
| "Type": "Task", |
| "Resource": "arn:aws:lambda:process-document", |
| "Next": "IndexChunks" |
| }, |
| "IndexChunks": { |
| "Type": "Task", |
| "Resource": "arn:aws:lambda:index-document", |
| "End": true |
| } |
| } |
| } |
| ``` |
|
|
| ## Cost Optimization |
|
|
| ### Tiered Processing |
|
|
| | Tier | Use Case | Services | Cost | |
| |------|----------|----------|------| |
| | Basic | Simple OCR | Textract + Titan | $ | |
| | Standard | Full pipeline | + Claude Haiku | $$ | |
| | Premium | Complex analysis | + Claude Sonnet | $$$ | |
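Routing a document to a tier could be a simple heuristic; the rules here are illustrative placeholders, not a SPARKNET policy:

```python
def select_tier(needs_reasoning: bool, needs_tables: bool) -> str:
    """Illustrative routing for the tiers above."""
    if needs_reasoning:
        return "premium"   # complex analysis with Claude Sonnet
    if needs_tables:
        return "standard"  # full pipeline with Claude Haiku
    return "basic"         # OCR plus Titan only
```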
|
|
| ### Caching Strategy |
|
|
| 1. **Document Cache**: S3 with lifecycle policies |
| 2. **Embedding Cache**: ElastiCache (Redis) |
3. **Query Cache**: CloudFront edge caching with Lambda@Edge
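For the embedding cache, a deterministic key scheme keeps entries reusable across Lambda invocations. A minimal sketch; the `emb:` prefix is an assumption:

```python
import hashlib

def embedding_cache_key(model_id: str, text: str) -> str:
    """Deterministic Redis-style key for a (model, text) pair."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model_id}:{digest}"
```

Hashing the text rather than storing it directly keeps keys bounded in size and avoids leaking document content into key names.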
|
|
| ## Security |
|
|
| ### IAM Policies |
|
|
| ```json |
| { |
| "Version": "2012-10-17", |
| "Statement": [ |
| { |
| "Effect": "Allow", |
| "Action": [ |
| "s3:GetObject", |
| "s3:PutObject" |
| ], |
| "Resource": "arn:aws:s3:::sparknet-documents/*" |
| }, |
| { |
| "Effect": "Allow", |
| "Action": [ |
| "textract:DetectDocumentText", |
| "textract:AnalyzeDocument" |
| ], |
| "Resource": "*" |
| }, |
| { |
| "Effect": "Allow", |
| "Action": [ |
| "bedrock:InvokeModel" |
| ], |
| "Resource": "arn:aws:bedrock:*::foundation-model/*" |
| } |
| ] |
| } |
| ``` |
|
|
| ### Data Encryption |
|
|
| - S3: Server-side encryption (SSE-S3 or SSE-KMS) |
| - OpenSearch: Encryption at rest |
| - Lambda: Environment variable encryption |
|
|
| ## Deployment |
|
|
| ### Infrastructure as Code (Terraform) |
|
|
| ```hcl |
| # S3 Bucket |
| resource "aws_s3_bucket" "documents" { |
| bucket = "sparknet-documents" |
| } |
| |
# Lambda Function (role and a deployment package are required arguments;
# aws_iam_role.lambda is assumed to be defined elsewhere)
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
  role          = aws_iam_role.lambda.arn
  filename      = "package.zip"
}
| |
| # OpenSearch Serverless |
| resource "aws_opensearchserverless_collection" "vectors" { |
| name = "sparknet-vectors" |
| type = "VECTORSEARCH" |
| } |
| ``` |
|
|
| ### CI/CD Pipeline |
|
|
| ```yaml |
| # GitHub Actions |
| name: Deploy SPARKNET |
| |
| on: |
| push: |
| branches: [main] |
| |
| jobs: |
| deploy: |
| runs-on: ubuntu-latest |
| steps: |
      - uses: actions/checkout@v3
      # AWS credentials must be configured before calling the CLI;
      # AWS_DEPLOY_ROLE is a placeholder secret name
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: us-east-1
| - name: Deploy Lambda |
| run: | |
| aws lambda update-function-code \ |
| --function-name sparknet-processor \ |
| --zip-file fileb://package.zip |
| ``` |
|
|
| ## Monitoring |
|
|
| ### CloudWatch Metrics |
|
|
| - Lambda invocations and duration |
| - S3 request counts |
| - OpenSearch query latency |
| - Bedrock token usage |
|
|
| ### Dashboards |
|
|
| - Processing throughput |
| - Error rates |
| - Cost tracking |
| - Vector store statistics |
|
|
| ## Next Steps |
|
|
| 1. **Implement Storage Abstraction**: Create S3 adapter |
| 2. **Add Textract Engine**: Implement AWS OCR |
| 3. **Create Bedrock Adapter**: LLM migration |
| 4. **Deploy OpenSearch**: Vector store setup |
| 5. **Build Lambda Functions**: Serverless compute |
6. **Set up Step Functions**: Workflow orchestration
| 7. **Configure CI/CD**: Automated deployment |
|
|