HOOPS AI Parallel Embedding Processing
This article describes the parameters accepted in the specifications dictionary of embed_shape_batch() when it runs with parallel execution. These parameters control file-size bucketing, timeout behavior, memory management, and heavy-file handling.
1. File Size Bucketing & Thresholds
1.1 What is file_size_bucketing?
When enabled, the system automatically sorts CAD files into three size buckets and processes the buckets one after another, each with its own timeout. This lets you set a short timeout for small files and a longer one for large, complex files.
1.2 Size Categories
Classification is based on file size, measured in bytes and compared against the following thresholds:
- Small: < 1 MB
- Medium: 1 MB – 10 MB
- Large: > 10 MB
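The classification above can be sketched as a small helper. This is an illustration only, not the library's implementation: the names classify_by_size and bucket_files are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and a file of exactly 10 MB is assumed to land in the medium bucket, since the thresholds above do not settle either detail.

```python
import os

# Assumption: 1 MB = 1024 * 1024 bytes. The article does not say whether
# the thresholds use binary or decimal megabytes.
MB = 1024 * 1024


def classify_by_size(size_bytes: int) -> str:
    """Return 'small', 'medium', or 'large' for a file size in bytes."""
    if size_bytes < 1 * MB:
        return "small"
    if size_bytes <= 10 * MB:  # assumption: exactly 10 MB counts as medium
        return "medium"
    return "large"


def bucket_files(paths):
    """Group file paths into the three size buckets (hypothetical helper)."""
    buckets = {"small": [], "medium": [], "large": []}
    for path in paths:
        buckets[classify_by_size(os.path.getsize(path))].append(path)
    return buckets
```

Running a helper like this over a batch before calling embed_shape_batch() gives a rough idea of how many files will fall under each time limit.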
2. Timeout Parameters
Each timeout is CUMULATIVE per file. If a file is restarted due to RAM constraints, the elapsed time from previous attempts is accumulated toward the timeout limit.
- time_limit_small: Maximum seconds to process a small file (< 1 MB)
- time_limit_medium: Maximum seconds to process a medium file (1–10 MB)
- time_limit_large: Maximum seconds to process a large file (> 10 MB)
- time_limit_overall: Default/fallback timeout when bucketing is disabled or not specified
- Default: 120 seconds for any timeout that is not explicitly specified.
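To make the cumulative behavior concrete, here is a minimal sketch of per-file time accounting across restarts. The FileBudget class is hypothetical and only models the rule described above; it is not how the library tracks time internally.

```python
from dataclasses import dataclass


@dataclass
class FileBudget:
    """Tracks time spent on one file across all attempts (hypothetical)."""
    time_limit_s: float = 120.0  # default timeout stated in this article
    elapsed_s: float = 0.0

    def record_attempt(self, attempt_duration_s: float) -> None:
        # Time spent in an interrupted attempt still counts toward the limit.
        self.elapsed_s += attempt_duration_s

    @property
    def remaining_s(self) -> float:
        return max(0.0, self.time_limit_s - self.elapsed_s)

    @property
    def timed_out(self) -> bool:
        return self.elapsed_s >= self.time_limit_s


budget = FileBudget(time_limit_s=90)
budget.record_attempt(50)   # first attempt interrupted by a RAM restart
print(budget.remaining_s)   # 40.0 seconds left for the retry
```

Under this model, a medium file with a 90-second limit that was interrupted after 50 seconds has only 40 seconds left on its next attempt, so files that restart often may need a more generous limit.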
3. Heavy Files & Skip Behavior
3.1 What Qualifies as a “Heavy File”?
A file is automatically flagged as “too_heavy” when it causes 2 or more worker restarts due to RAM constraints. Once flagged, the file is either skipped or retried with a single worker.
3.2 skip_heavy_files Parameter
skip_heavy_files=False (default): Heavy files are retried with 1 worker and 2× the large-file timeout. This serializes processing to prevent memory exhaustion.
skip_heavy_files=True: Heavy files are skipped entirely with an error message. Use this if you want to exclude problematic files and process the rest of your batch quickly.
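The decision described in this section can be summarized in a few lines. The sketch below is an assumption-laden illustration: plan_heavy_file and the layout of the returned dictionary are hypothetical, and only the threshold of 2 restarts, the single worker, and the 2× large-file timeout come from the text above.

```python
HEAVY_RESTART_THRESHOLD = 2  # "2 or more worker restarts", per this article


def plan_heavy_file(restart_count, skip_heavy_files=False, time_limit_large=120):
    """Decide how to handle a file after `restart_count` RAM-triggered restarts.

    The returned dict layout is illustrative only.
    """
    if restart_count < HEAVY_RESTART_THRESHOLD:
        return {"action": "normal"}
    if skip_heavy_files:
        return {"action": "skip", "reason": "too_heavy"}
    # Default behavior: retry alone with double the large-file timeout.
    return {"action": "retry", "num_workers": 1, "time_limit": 2 * time_limit_large}


print(plan_heavy_file(2, time_limit_large=180))
# {'action': 'retry', 'num_workers': 1, 'time_limit': 360}
```

With time_limit_large=180, a heavy file retried under the default setting would get up to 360 seconds on a single worker.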
4. RAM Management Parameters
The RAM guard monitors available system memory during execution and restarts workers if memory drops below your specified threshold.
4.1 min_available_ram_gb
Specifies an ABSOLUTE MINIMUM of free RAM to maintain (in gigabytes).
Example: min_available_ram_gb=2.0 reserves 2 GB of free RAM. If available RAM drops below 2 GB, workers restart to free memory.
4.2 min_available_ram_percent
Specifies a PERCENTAGE OF TOTAL SYSTEM RAM to keep free.
Example: min_available_ram_percent=10 reserves 10% of total RAM. On a 32 GB system, this maintains ~3.2 GB of free RAM.
4.3 Priority & Defaults
- If both are specified: min_available_ram_gb takes priority
- If neither is specified: Defaults to 10% of total RAM
- Best practice: Use min_available_ram_percent for portable configurations
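The priority rules above can be expressed as a short resolution function. This is a sketch of the documented rules, not the library's code: resolve_ram_floor_gb is a hypothetical name, and total_ram_gb is supplied by the caller rather than detected.

```python
def resolve_ram_floor_gb(total_ram_gb,
                         min_available_ram_gb=None,
                         min_available_ram_percent=None):
    """Return the amount of free RAM (in GB) the guard would try to keep."""
    if min_available_ram_gb is not None:
        # The absolute value wins when both parameters are given.
        return float(min_available_ram_gb)
    # Fall back to the percentage, defaulting to 10% of total RAM.
    percent = 10 if min_available_ram_percent is None else min_available_ram_percent
    return total_ram_gb * percent / 100.0


print(resolve_ram_floor_gb(32))                                   # 3.2
print(resolve_ram_floor_gb(32, min_available_ram_percent=15))     # 4.8
print(resolve_ram_floor_gb(32, 2.0, min_available_ram_percent=15))  # 2.0
```

The same percentage yields a different floor on each machine, which is why the percentage form travels better between workstations with different amounts of memory.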
4.4 ram_check_interval_s
How frequently (in seconds) the system checks available RAM during processing.
- Default: 0.25 seconds (checks every 250 milliseconds)
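A polling loop of this kind might look like the sketch below. It is not the library's RAM guard: ram_dropped_below is a hypothetical function, and the memory reading is injected as a callable so the example runs without extra dependencies. In practice the reading could come from the third-party psutil package, for example psutil.virtual_memory().available (reported in bytes).

```python
import time


def ram_dropped_below(read_available_gb, floor_gb,
                      ram_check_interval_s=0.25, max_checks=4):
    """Poll available RAM; return True as soon as it falls below floor_gb.

    read_available_gb is any zero-argument callable returning free RAM in GB.
    Returns False if all max_checks readings stay at or above the floor.
    """
    for check in range(max_checks):
        if read_available_gb() < floor_gb:
            return True
        if check < max_checks - 1:
            time.sleep(ram_check_interval_s)  # wait before the next reading
    return False


# Simulated readings: memory shrinks until it crosses the 2 GB floor.
readings = iter([5.0, 3.1, 1.8])
print(ram_dropped_below(lambda: next(readings), floor_gb=2.0,
                        ram_check_interval_s=0.01))  # True
```

A shorter interval reacts faster to memory spikes at the cost of more frequent checks; a longer interval does the opposite.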
5. Additional Specifications
start_method | Values: ‘spawn’ or ‘fork’ | Default: spawn (default on Windows)
‘spawn’ creates fresh processes (recommended for Windows/macOS). ‘fork’ is available on Unix/Linux only; it is faster but can cause licensing issues.
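The start methods named above are those of Python's standard multiprocessing module. The sketch below shows how to check which ones the current platform supports before choosing a value for start_method. The helper pick_start_method is hypothetical, and how embed_shape_batch() applies the setting internally is not covered here.

```python
import multiprocessing as mp


def pick_start_method(requested="spawn"):
    """Return a multiprocessing context for `requested`, if the platform has it."""
    available = mp.get_all_start_methods()  # e.g. includes 'fork' on Linux
    if requested not in available:
        raise ValueError(
            f"start method {requested!r} not available; choose from {available}"
        )
    return mp.get_context(requested)


ctx = pick_start_method("spawn")
print(ctx.get_start_method())  # spawn
```

Because ‘spawn’ is supported on every platform, it is the safer value for configurations shared between Windows and Linux machines.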
log_dir | Values: file path | Default: ‘.’ (current directory)
Directory where logs are written, including too_heavy_files.log listing flagged files.
6. Configuration Examples
6.1 Fast Processing (Default Settings)
batch = embedder.embed_shape_batch(
    files,
    num_workers=4,
    # specifications defaults apply
)
Uses 10% RAM reserve, 120-second timeout per file, no bucketing.
6.2 Optimize for Mixed File Sizes
batch = embedder.embed_shape_batch(
    files,
    num_workers=4,
    specifications={
        'file_size_bucketing': True,
        'time_limit_small': 30,
        'time_limit_medium': 90,
        'time_limit_large': 180,
        'min_available_ram_percent': 15,
    }
)
Bucketed processing with per-size timeouts and 15% RAM reserve.
6.3 Skip Heavy Files to Save Time
batch = embedder.embed_shape_batch(
    files,
    num_workers=4,
    specifications={
        'file_size_bucketing': True,
        'skip_heavy_files': True,
        'min_available_ram_gb': 3.0,
        'log_dir': './results',
    }
)
Skips problematic files and logs them to results/too_heavy_files.log.
7. Summary Table
| Parameter | Possible Values | Default | Purpose | Unit |
|---|---|---|---|---|
| file_size_bucketing | True / False | False | Enable size-based processing phases | — |
| time_limit_small | numeric | 120 | Timeout for files < 1 MB | seconds |
| time_limit_medium | numeric | 120 | Timeout for files 1–10 MB | seconds |
| time_limit_large | numeric | 120 | Timeout for files > 10 MB | seconds |
| time_limit_overall | numeric | 120 | Fallback timeout | seconds |
| min_available_ram_gb | numeric | None | Absolute min free RAM (priority) | GB |
| min_available_ram_percent | numeric (0–100) | 10 | % of total RAM to keep free | % |
| ram_check_interval_s | numeric | 0.25 | RAM check frequency | seconds |
| skip_heavy_files | True / False | False | Skip vs. retry heavy files | — |
| start_method | ‘spawn’ / ‘fork’ | ‘spawn’ | Multiprocessing start method | — |
| log_dir | file path | ‘.’ | Log output directory | — |
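As a convenience, the defaults in the table can be mirrored in a dictionary and merged with your own overrides before the call. This is a client-side sketch: DEFAULT_SPECIFICATIONS and merge_specifications are hypothetical names, the values are copied from the table above, and whether the library itself rejects unknown keys is not stated in this article.

```python
# Defaults copied from the summary table; this mirror is illustrative only.
DEFAULT_SPECIFICATIONS = {
    "file_size_bucketing": False,
    "time_limit_small": 120,
    "time_limit_medium": 120,
    "time_limit_large": 120,
    "time_limit_overall": 120,
    "min_available_ram_gb": None,
    "min_available_ram_percent": 10,
    "ram_check_interval_s": 0.25,
    "skip_heavy_files": False,
    "start_method": "spawn",
    "log_dir": ".",
}


def merge_specifications(overrides=None):
    """Return the defaults updated with `overrides`, rejecting misspelled keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_SPECIFICATIONS)
    if unknown:
        raise KeyError(f"Unknown specification keys: {sorted(unknown)}")
    return {**DEFAULT_SPECIFICATIONS, **overrides}


specs = merge_specifications({"file_size_bucketing": True, "time_limit_large": 180})
print(specs["time_limit_large"], specs["time_limit_small"])  # 180 120
```

Checking keys up front catches typos such as a misspelled parameter name, which would otherwise leave the intended setting at its default.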