Update fix/reduce redundant device data called[wip] #3526
Changes from 5 commits
7877995
10d0937
701d3f1
08c1bd0
060fc93
a196699
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,10 @@ | |
|
||
import json | ||
|
||
import logging | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
class AirnowDataUtils: | ||
@staticmethod | ||
|
@@ -74,26 +78,48 @@ def extract_bam_data(start_date_time: str, end_date_time: str) -> pd.DataFrame: | |
|
||
@staticmethod | ||
def process_bam_data(data: pd.DataFrame) -> pd.DataFrame: | ||
""" | ||
Processes BAM data by matching it to device details and constructing a list of air quality measurements. | ||
|
||
Args: | ||
data (pd.DataFrame): A DataFrame containing raw BAM device data. | ||
|
||
Returns: | ||
pd.DataFrame: A DataFrame containing processed air quality data, with relevant device information and pollutant values. | ||
""" | ||
air_now_data = [] | ||
|
||
devices = AirQoApi().get_devices(tenant=Tenant.ALL) | ||
|
||
# Precompute device mapping for faster lookup | ||
device_mapping = {} | ||
for device in devices: | ||
for device_code in device["device_codes"]: | ||
device_mapping[device_code] = device | ||
|
||
for _, row in data.iterrows(): | ||
try: | ||
device_id = row["FullAQSCode"] | ||
device_details = list( | ||
filter(lambda y: str(device_id) in y["device_codes"], devices) | ||
)[0] | ||
device_id = str(row["FullAQSCode"]) | ||
|
||
pollutant_value = dict({"pm2_5": None, "pm10": None, "no2": None}) | ||
# Lookup device details based on FullAQSCode | ||
device_details = device_mapping.get(device_id) | ||
if not device_details: | ||
logger.exception(f"Device with ID {device_id} not found") | ||
🛠️ Refactor suggestion: Robust error handling and logging improvements! The addition of specific error logging for various scenarios (device not found, tenant mismatch, and general exceptions) greatly enhances the debugging capabilities of this method. The use of [...] A suggestion to consider: to make the logs even more informative, include more context in the log messages. For example, you could modify the tenant mismatch log as follows: `logger.exception(f"Tenant mismatch for device ID {device_id}. Expected: {device_details.get('tenant')}, Got: {row['tenant']}")`. This additional information could help quickly identify the source of mismatches without needing to dig through the data. Also applies to: 121-121, 148-148 |
||
continue | ||
|
||
# Initialize pollutant values (note: pm10 and no2 are not always present) | ||
pollutant_value = {"pm2_5": None, "pm10": None, "no2": None} | ||
|
||
# Get the corresponding pollutant value for the current parameter | ||
parameter_col_name = AirnowDataUtils.parameter_column_name( | ||
row["Parameter"] | ||
) | ||
|
||
pollutant_value[parameter_col_name] = row["Value"] | ||
if parameter_col_name in pollutant_value: | ||
pollutant_value[parameter_col_name] = row["Value"] | ||
|
||
if row["tenant"] != device_details.get("tenant"): | ||
raise Exception("tenants dont match") | ||
logger.exception(f"Tenant mismatch for device ID {device_id}") | ||
continue | ||
|
||
air_now_data.append( | ||
{ | ||
|
@@ -119,8 +145,6 @@ def process_bam_data(data: pd.DataFrame) -> pd.DataFrame: | |
} | ||
) | ||
except Exception as ex: | ||
print(ex) | ||
traceback.print_exc() | ||
logger.exception(f"Error processing row: {ex}") | ||
|
||
air_now_data = pd.DataFrame(air_now_data) | ||
return DataValidationUtils.remove_outliers(air_now_data) | ||
return pd.DataFrame(air_now_data) |
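To illustrate the logging suggestion from the inline review comment in the diff above, here is a minimal sketch of a tenant check that logs both the expected and the received tenant. The helper name `tenant_matches` and the plain-dict row are assumptions made for illustration and are not part of this PR. The sketch also swaps `logger.exception` for `logger.warning`: per the Python `logging` documentation, `exception()` is meant to be called from an exception handler, whereas a tenant mismatch here is detected outside any `except` block.

```python
import logging

logger = logging.getLogger(__name__)


def tenant_matches(row: dict, device_details: dict, device_id: str) -> bool:
    """Hypothetical helper: return True when the row's tenant matches the device's tenant."""
    expected = device_details.get("tenant")
    actual = row.get("tenant")
    if actual != expected:
        # Log both sides of the mismatch so the source is visible without
        # inspecting the raw data. warning() is used because no exception
        # is being handled at this point.
        logger.warning(
            "Tenant mismatch for device ID %s. Expected: %s, Got: %s",
            device_id,
            expected,
            actual,
        )
        return False
    return True
```

Inside the row loop, a caller could write `if not tenant_matches(row, device_details, device_id): continue`, which keeps the skip behaviour introduced in this diff.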
🛠️ Refactor suggestion
Excellent performance optimization with device mapping!
The introduction of a precomputed device mapping is a smart move. It replaces a linear scan of the device list, O(n) in the number of devices, with an average O(1) dictionary lookup on each iteration of the main loop. Well done!
A small optimization to consider:
You could potentially further optimize memory usage by using a generator expression instead of a list comprehension when creating the devices list. This would be beneficial if the list of devices is large. Here's how you could modify line 93:
This change would fetch only the BAM devices, reducing the amount of data processed and stored in memory.