Performance Optimization Guide
This guide provides best practices and recommendations for optimizing the performance of the countryflag package, particularly when working with large datasets or in performance-critical applications.
General Performance Best Practices
Optimize Input Size
Batch processing: Process country names in batches of optimal size (around 100-500 items) rather than individually or in very large batches
Avoid duplicates: Remove duplicate country names before processing to avoid redundant conversions
Prevalidate inputs: Validate country names before conversion to avoid wasting time on invalid inputs
# Instead of this:
for country in very_large_list:
flag = countryflag.getflag([country])
# ... process flag
# Do this:
# Remove duplicates and batch process
unique_countries = list(set(very_large_list))
flags_string = countryflag.getflag(unique_countries)
flags = flags_string.split(" ")
# ... process flags
Use Efficient Data Structures
List vs Generator: Use generators for large datasets when iterating through results to reduce memory usage
Join vs Concatenation: Prefer join operations over string concatenation in loops for better performance
Dictionary lookups: Use dictionary lookups for frequently accessed data
# Instead of this:
result = ""
for country, flag in pairs:
result += flag + " " # Inefficient string concatenation
# Do this:
result = " ".join(flag for _, flag in pairs) # More efficient
Caching Strategies
Built-in Caching Options
CountryFlag provides two built-in caching implementations:
Memory Cache (MemoryCache): Fast in-memory caching with no persistence
Disk Cache (DiskCache): Persistent caching with slightly slower access
When to Use Caching
Repetitive conversions: When the same country names are converted multiple times
Long-running applications: For services or applications that run for extended periods
Batch processing: When processing large datasets with potential repeated values
Choosing a Cache Implementation
Memory Cache: Best for speed when persistence is not required and memory is plentiful
Disk Cache: Best for persistence between application runs or when memory is limited
Custom Cache: Implement your own cache by extending the Cache interface for specialized needs
# Using memory cache
from countryflag.cache import MemoryCache
from countryflag.core import CountryFlag
# Create a memory cache
memory_cache = MemoryCache()
# Create a CountryFlag instance with caching
cf = CountryFlag(cache=memory_cache)
# Subsequent calls will use the cache
flags, pairs = cf.get_flag(["United States", "Canada", "Mexico"])
# Using disk cache
from countryflag.cache import DiskCache
# Create a disk cache
disk_cache = DiskCache("/path/to/cache/dir")
# Create a CountryFlag instance with disk caching
cf = CountryFlag(cache=disk_cache)
Cache Invalidation
When to invalidate: Invalidate cache when country data might have changed
Selective invalidation: Delete specific cache entries rather than clearing the entire cache
Cache size management: Monitor cache size and implement policies to limit growth
Benchmarking Results
Our benchmarks show significant performance improvements with caching:
Dataset Size |
No Cache (ms) |
Memory Cache (ms) |
Improvement |
|---|---|---|---|
Small (5) |
10 |
0.5 |
20x |
Medium (25) |
50 |
1 |
50x |
Large (250) |
500 |
5 |
100x |
Note: Actual performance will vary based on hardware and system load.
Handling Large Datasets
Strategies for Large Lists
Chunking: Process very large lists in smaller chunks to avoid memory issues
Streaming: Use generators and streaming processing when possible
Parallel processing: Process chunks in parallel for better performance
def process_large_country_list(countries, chunk_size=500):
"""Process a large list of countries in chunks."""
from countryflag.core import CountryFlag
cf = CountryFlag()
results = []
# Process in chunks
for i in range(0, len(countries), chunk_size):
chunk = countries[i:i+chunk_size]
flags, pairs = cf.get_flag(chunk)
results.extend(pairs)
return results
File Processing Optimizations
Asynchronous I/O: Use process_file_input_async for processing large files
Parallel processing: Use process_multiple_files for processing multiple files in parallel
Streaming: Process large files line by line rather than loading the entire content
# Asynchronous file processing
import asyncio
from countryflag.utils.io import process_file_input_async
async def process_large_file(file_path):
countries = await process_file_input_async(file_path)
# Process countries...
asyncio.run(process_large_file("very_large_file.txt"))
# Parallel processing of multiple files
from countryflag.utils.io import process_multiple_files
file_paths = ["file1.txt", "file2.txt", "file3.txt", "file4.txt"]
all_countries = process_multiple_files(file_paths, max_workers=4)
Concurrency Recommendations
Thread-Based Concurrency
When to use: For I/O-bound operations or when making multiple independent conversions
Thread pool: Use ThreadPoolExecutor for efficient thread management
Shared resources: Be careful with shared caches in multi-threaded environments
from concurrent.futures import ThreadPoolExecutor
def convert_countries(countries):
cf = CountryFlag()
return cf.get_flag(countries)
country_lists = [list1, list2, list3, list4]
with ThreadPoolExecutor(max_workers=4) as executor:
results = list(executor.map(convert_countries, country_lists))
Process-Based Concurrency
When to use: For CPU-bound operations on large datasets
Process pool: Use ProcessPoolExecutor for true parallel processing
Data serialization: Be aware of the overhead of inter-process communication
from concurrent.futures import ProcessPoolExecutor
# Function to be executed in separate processes
def process_country_chunk(chunk):
cf = CountryFlag()
return cf.get_flag(chunk)
# Split large list into chunks
chunks = [large_list[i:i+1000] for i in range(0, len(large_list), 1000)]
# Process chunks in parallel
with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
results = list(executor.map(process_country_chunk, chunks))
Asynchronous Processing
When to use: For I/O-bound operations like file reading or network requests
Event loop: Use asyncio’s event loop for coordinating asynchronous tasks
Async functions: Use async/await with the library’s async functions
import asyncio
async def process_files(file_paths):
from countryflag.utils.io import process_file_input_async
# Create tasks for each file
tasks = [process_file_input_async(file_path) for file_path in file_paths]
# Run all tasks concurrently
country_lists = await asyncio.gather(*tasks)
# Flatten the list of lists
all_countries = [country for sublist in country_lists for country in sublist]
return all_countries
Memory Usage Optimization
Memory-Efficient Processing
Generator expressions: Use generator expressions instead of list comprehensions when appropriate
Chunking: Process data in manageable chunks to control memory usage
Garbage collection: Force garbage collection after processing large batches
import gc
# Process a very large dataset in memory-efficient way
def memory_efficient_processing(very_large_list):
cf = CountryFlag()
# Process in chunks to control memory usage
chunk_size = 1000
results = []
for i in range(0, len(very_large_list), chunk_size):
chunk = very_large_list[i:i+chunk_size]
flags, pairs = cf.get_flag(chunk)
# Process and store only what you need
results.extend((country, flag) for country, flag in pairs)
# Force garbage collection after each chunk
gc.collect()
return results
Object Lifecycle Management
Reuse objects: Create CountryFlag instances once and reuse them
Limit cached data: Control the size of caches with policies like LRU (Least Recently Used)
Reference management: Be aware of references that might prevent garbage collection
Performance Profiling
Measure first: Use the cProfile module or other profiling tools to identify bottlenecks
Target optimizations: Focus on optimizing the most time-consuming operations
Benchmark regularly: Regularly benchmark to ensure optimizations are effective
import cProfile
# Profile the performance of a function
def profile_countryflag():
cf = CountryFlag()
large_list = generate_large_country_list(1000)
cProfile.runctx('cf.get_flag(large_list)', globals(), locals(), 'prof_stats')
# Analyze the results
import pstats
p = pstats.Stats('prof_stats')
p.sort_stats('cumulative').print_stats(20)
Advanced Performance Tips
JIT compilation: For extreme performance, consider using PyPy or Numba for JIT compilation
C extensions: Critical parts could be rewritten as C extensions for maximum performance
Distributed processing: For massive datasets, consider distributed processing frameworks
Conclusion
By applying these optimization strategies, you can significantly improve the performance of the countryflag package, especially when working with large datasets or in performance-critical applications. Always measure before and after optimization to ensure your changes are having the desired effect.