| # SPARKNET Cloud Architecture |
|
|
| This document outlines the cloud-ready architecture for deploying SPARKNET on AWS. |
|
|
| ## Overview |
|
|
| SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure. |
|
|
| ## Local Development Stack |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────┐ |
| │ Local Machine │ |
| ├─────────────────────────────────────────────────────┤ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ |
| │ │ Ollama │ │ ChromaDB │ │ File I/O │ │ |
| │ │ (LLM) │ │ (Vector) │ │ (Storage) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────┘ │ |
| │ │ │ │ │ |
| │ └───────────────┼───────────────┘ │ |
| │ │ │ |
| │ ┌────────┴────────┐ │ |
| │ │ SPARKNET │ │ |
| │ │ Application │ │ |
| │ └─────────────────┘ │ |
| └─────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ## AWS Cloud Architecture |
|
|
| ### Target Architecture |
|
|
| ``` |
| ┌────────────────────────────────────────────────────────────────────┐ |
| │ AWS Cloud │ |
| ├────────────────────────────────────────────────────────────────────┤ |
| │ │ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ |
| │ │ API GW │──────│ Lambda │──────│ Step Functions │ │ |
| │ │ (REST) │ │ (Compute) │ │ (Orchestration) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ |
| │ │ │ │ │ |
| │ │ │ │ │ |
| │ ▼ ▼ ▼ │ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ |
| │ │ S3 │ │ Bedrock │ │ OpenSearch │ │ |
| │ │ (Storage) │ │ (LLM) │ │ (Vector Store) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ |
| │ │ |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ |
| │ │ Textract │ │ Titan │ │ DynamoDB │ │ |
| │ │ (OCR) │ │ (Embeddings)│ │ (Metadata) │ │ |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ |
| │ │ |
| └────────────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ### Component Mapping |
|
|
| | Local Component | AWS Service | Purpose | |
| |----------------|-------------|---------| |
| | File I/O | S3 | Document storage | |
| | PaddleOCR/Tesseract | Textract | OCR extraction | |
| | Ollama LLM | Bedrock (Claude/Titan) | Text generation | |
| | Ollama Embeddings | Titan Embeddings | Vector embeddings | |
| | ChromaDB | OpenSearch Serverless | Vector search | |
| | SQLite (optional) | DynamoDB | Metadata storage | |
| | Python Process | Lambda | Compute | |
| | CLI | API Gateway | HTTP interface | |
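The mapping above can be wired into code with a small environment-driven factory so the same pipeline runs locally or on AWS. This is an illustrative sketch: the `SPARKNET_ENV` variable and the registry layout are assumptions, not an existing SPARKNET API.

```python
import os

# Hypothetical registry mapping deployment mode to adapter names; the real
# adapter classes are introduced in the migration phases below.
_BACKENDS = {
    "local": {"storage": "LocalStorageAdapter", "ocr": "PaddleOCREngine",
              "llm": "OllamaAdapter", "vectors": "ChromaVectorStore"},
    "aws":   {"storage": "S3StorageAdapter", "ocr": "TextractEngine",
              "llm": "BedrockAdapter", "vectors": "OpenSearchVectorStore"},
}

def resolve_backends(mode=None):
    """Pick the adapter set for the current deployment mode."""
    mode = mode or os.environ.get("SPARKNET_ENV", "local")
    if mode not in _BACKENDS:
        raise ValueError(f"unknown deployment mode: {mode}")
    return _BACKENDS[mode]
```

Keeping the switch in one place means the rest of the pipeline only ever sees the abstract interfaces.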
|
|
| ## Migration Strategy |
|
|
| ### Phase 1: Storage Migration |
|
|
| ```python |
import boto3
from pathlib import Path

# Abstract storage interface
| class StorageAdapter: |
| def put(self, key: str, data: bytes) -> str: ... |
| def get(self, key: str) -> bytes: ... |
| def delete(self, key: str) -> bool: ... |
| |
| # Local implementation |
| class LocalStorageAdapter(StorageAdapter): |
| def __init__(self, base_path: str): |
| self.base_path = Path(base_path) |
| |
| # S3 implementation |
| class S3StorageAdapter(StorageAdapter): |
| def __init__(self, bucket: str): |
| self.client = boto3.client('s3') |
| self.bucket = bucket |
| ``` |
|
|
| ### Phase 2: OCR Migration |
|
|
| ```python |
import boto3
import cv2  # used to encode the ndarray to image bytes
import numpy as np

# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects raw image bytes, so encode the array first
        ok, buf = cv2.imencode('.png', image)
        if not ok:
            raise ValueError("failed to encode image")
        response = self.client.detect_document_text(
            Document={'Bytes': buf.tobytes()}
        )
        return self._convert_response(response)
| ``` |
|
|
| ### Phase 3: LLM Migration |
|
|
| ```python |
import json

import boto3

# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude 3 models on Bedrock use the Messages API body format
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        # The response body is a streaming object; read and parse it
        payload = json.loads(response['body'].read())
        return payload['content'][0]['text']
| ``` |
|
|
| ### Phase 4: Vector Store Migration |
|
|
| ```python |
from opensearchpy import OpenSearch

# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN queries are nested under "query" in the search body
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)
| ``` |
|
|
| ## AWS Services Deep Dive |
|
|
| ### Amazon S3 |
|
|
| - **Purpose**: Document storage and processed results |
| - **Structure**: |
| ``` |
| s3://sparknet-documents/ |
| ├── raw/ # Original documents |
| │ └── {doc_id}/ |
| │ └── document.pdf |
| ├── processed/ # Processed results |
| │ └── {doc_id}/ |
| │ ├── metadata.json |
| │ ├── chunks.json |
| │ └── pages/ |
| │ ├── page_0.png |
| │ └── page_1.png |
| └── cache/ # Processing cache |
| ``` |
|
|
| ### Amazon Textract |
|
|
| - **Purpose**: OCR extraction with layout analysis |
| - **Features**: |
| - Document text detection |
| - Table extraction |
| - Form extraction |
| - Handwriting recognition |
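Plain text detection uses `DetectDocumentText`, while tables and forms require `AnalyzeDocument` with explicit feature types. A minimal sketch; the helper names are illustrative, and the synchronous API shown here is limited to single-page images or small documents passed as bytes:

```python
def textract_features(tables: bool = True, forms: bool = True) -> list:
    """Build the FeatureTypes list for an AnalyzeDocument call."""
    features = []
    if tables:
        features.append("TABLES")
    if forms:
        features.append("FORMS")
    return features

def analyze_image(image_bytes: bytes, tables: bool = True, forms: bool = True) -> dict:
    """Synchronous AnalyzeDocument call on raw image/PDF bytes."""
    import boto3  # deferred so textract_features() is usable without boto3
    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=textract_features(tables, forms),
    )
```

For multi-page PDFs stored in S3, the asynchronous `start_document_analysis` API is the appropriate variant.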
|
|
| ### Amazon Bedrock |
|
|
| - **Purpose**: LLM inference |
| - **Models**: |
| - Claude 3.5 Sonnet (primary) |
| - Titan Text (cost-effective) |
| - Titan Embeddings (vectors) |
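Embedding calls go through the same `InvokeModel` API as text generation. A sketch of the request payload; the model id and the `dimensions` field are assumptions to verify against the current Bedrock model list (1024 matches the dimension used in the vector index configuration below):

```python
import json

# Assumed Titan embeddings model id; confirm in the Bedrock console.
TITAN_EMBED_MODEL = "amazon.titan-embed-text-v2:0"

def titan_embedding_request(text: str, dimensions: int = 1024) -> dict:
    """Build the keyword arguments for an InvokeModel embedding call."""
    return {
        "modelId": TITAN_EMBED_MODEL,
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps({"inputText": text, "dimensions": dimensions}),
    }
```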
|
|
| ### Amazon OpenSearch Serverless |
|
|
| - **Purpose**: Vector search and retrieval |
| - **Configuration**: |
| ```json |
| { |
| "index": "sparknet-vectors", |
| "settings": { |
| "index.knn": true, |
| "index.knn.space_type": "cosinesimil" |
| }, |
| "mappings": { |
| "properties": { |
| "embedding": { |
| "type": "knn_vector", |
| "dimension": 1024 |
| } |
| } |
| } |
| } |
| ``` |
|
|
| ### AWS Lambda |
|
|
| - **Purpose**: Serverless compute |
| - **Functions**: |
| - `process-document`: Document processing pipeline |
| - `extract-fields`: Field extraction |
| - `rag-query`: RAG query handling |
| - `index-document`: Vector indexing |
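A minimal handler shape for `process-document` might look like the following; the event fields and response format are illustrative, and the actual pipeline steps are omitted:

```python
import json

def process(event, context):
    """Sketch of a document-processing Lambda entry point."""
    doc_id = (event or {}).get("doc_id")
    if not doc_id:
        return {"statusCode": 400,
                "body": json.dumps({"error": "doc_id is required"})}
    # Real implementation: fetch raw/{doc_id}/ from S3, run OCR,
    # write chunks back to processed/{doc_id}/ (omitted here).
    return {"statusCode": 200,
            "body": json.dumps({"doc_id": doc_id, "status": "processed"})}
```

Returning a structured error for missing input keeps Step Functions retries and API Gateway responses predictable.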
|
|
| ### AWS Step Functions |
|
|
| - **Purpose**: Workflow orchestration |
| - **Workflow**: |
| ```json |
| { |
| "StartAt": "ProcessDocument", |
| "States": { |
| "ProcessDocument": { |
| "Type": "Task", |
| "Resource": "arn:aws:lambda:process-document", |
| "Next": "IndexChunks" |
| }, |
| "IndexChunks": { |
| "Type": "Task", |
| "Resource": "arn:aws:lambda:index-document", |
| "End": true |
| } |
| } |
| } |
| ``` |
|
|
| ## Cost Optimization |
|
|
| ### Tiered Processing |
|
|
| | Tier | Use Case | Services | Cost | |
| |------|----------|----------|------| |
| | Basic | Simple OCR | Textract + Titan | $ | |
| | Standard | Full pipeline | + Claude Haiku | $$ | |
| | Premium | Complex analysis | + Claude Sonnet | $$$ | |
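Routing a document to a tier could be a simple heuristic; the rules here are illustrative placeholders, not a SPARKNET policy:

```python
def select_tier(needs_reasoning: bool, needs_tables: bool) -> str:
    """Illustrative routing for the tiers above."""
    if needs_reasoning:
        return "premium"   # complex analysis with Claude Sonnet
    if needs_tables:
        return "standard"  # full pipeline with Claude Haiku
    return "basic"         # OCR plus Titan only
```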
|
|
| ### Caching Strategy |
|
|
| 1. **Document Cache**: S3 with lifecycle policies |
| 2. **Embedding Cache**: ElastiCache (Redis) |
3. **Query Cache**: CloudFront edge caching with Lambda@Edge
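For the embedding cache, a deterministic key scheme keeps entries reusable across Lambda invocations. A minimal sketch; the `emb:` prefix is an assumption:

```python
import hashlib

def embedding_cache_key(model_id: str, text: str) -> str:
    """Deterministic Redis-style key for a (model, text) pair."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model_id}:{digest}"
```

Hashing the text rather than storing it directly keeps keys bounded in size and avoids leaking document content into key names.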
|
|
| ## Security |
|
|
| ### IAM Policies |
|
|
| ```json |
| { |
| "Version": "2012-10-17", |
| "Statement": [ |
| { |
| "Effect": "Allow", |
| "Action": [ |
| "s3:GetObject", |
| "s3:PutObject" |
| ], |
| "Resource": "arn:aws:s3:::sparknet-documents/*" |
| }, |
| { |
| "Effect": "Allow", |
| "Action": [ |
| "textract:DetectDocumentText", |
| "textract:AnalyzeDocument" |
| ], |
| "Resource": "*" |
| }, |
| { |
| "Effect": "Allow", |
| "Action": [ |
| "bedrock:InvokeModel" |
| ], |
| "Resource": "arn:aws:bedrock:*::foundation-model/*" |
| } |
| ] |
| } |
| ``` |
|
|
| ### Data Encryption |
|
|
| - S3: Server-side encryption (SSE-S3 or SSE-KMS) |
| - OpenSearch: Encryption at rest |
| - Lambda: Environment variable encryption |
|
|
| ## Deployment |
|
|
| ### Infrastructure as Code (Terraform) |
|
|
| ```hcl |
| # S3 Bucket |
| resource "aws_s3_bucket" "documents" { |
| bucket = "sparknet-documents" |
| } |
| |
# Lambda Function (role and a deployment package are required arguments;
# aws_iam_role.lambda is assumed to be defined elsewhere)
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
  role          = aws_iam_role.lambda.arn
  filename      = "package.zip"
}
| |
| # OpenSearch Serverless |
| resource "aws_opensearchserverless_collection" "vectors" { |
| name = "sparknet-vectors" |
| type = "VECTORSEARCH" |
| } |
| ``` |
|
|
| ### CI/CD Pipeline |
|
|
| ```yaml |
| # GitHub Actions |
| name: Deploy SPARKNET |
| |
| on: |
| push: |
| branches: [main] |
| |
| jobs: |
| deploy: |
| runs-on: ubuntu-latest |
| steps: |
      - uses: actions/checkout@v3
      # AWS credentials must be configured before calling the CLI;
      # AWS_DEPLOY_ROLE is a placeholder secret name
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: us-east-1
| - name: Deploy Lambda |
| run: | |
| aws lambda update-function-code \ |
| --function-name sparknet-processor \ |
| --zip-file fileb://package.zip |
| ``` |
|
|
| ## Monitoring |
|
|
| ### CloudWatch Metrics |
|
|
| - Lambda invocations and duration |
| - S3 request counts |
| - OpenSearch query latency |
| - Bedrock token usage |
|
|
| ### Dashboards |
|
|
| - Processing throughput |
| - Error rates |
| - Cost tracking |
| - Vector store statistics |
|
|
| ## Next Steps |
|
|
| 1. **Implement Storage Abstraction**: Create S3 adapter |
| 2. **Add Textract Engine**: Implement AWS OCR |
| 3. **Create Bedrock Adapter**: LLM migration |
| 4. **Deploy OpenSearch**: Vector store setup |
| 5. **Build Lambda Functions**: Serverless compute |
6. **Set up Step Functions**: Workflow orchestration
| 7. **Configure CI/CD**: Automated deployment |
|
|