git commit -m "SM-17 <test>"

This commit is contained in:
atef 2024-12-04 17:09:06 +01:00
parent f03325a6a2
commit 98fb313a6f
2 changed files with 0 additions and 332 deletions

View File

@@ -1,127 +0,0 @@
This README provides a comprehensive guide to using a Python script for interacting with S3 storage,
specifically for downloading and processing data files based on a specified time range and key parameters.
The script requires Python 3 installed on your system and uses the s3cmd tool to access data in cloud storage.
This guide also walks through configuring s3cmd by creating a .s3cfg file with your access credentials.
############ Create the .s3cfg file in home directory ################
nano .s3cfg
Copy these lines into the file.
[default]
host_base = sos-ch-dk-2.exo.io
host_bucket = %(bucket)s.sos-ch-dk-2.exo.io
access_key = EXO4d838d1360ba9fb7d51648b0
secret_key = _bmrp6ewWAvNwdAQoeJuC-9y02Lsx7NV6zD-WjljzCU
use_https = True
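Once s3cmd is installed (see the next section), you can verify the configuration by listing the buckets; if the credentials are valid, this should print the available bucket names:
s3cmd ls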
############ S3cmd installation ################
Please install s3cmd to retrieve data from our cloud storage.
sudo apt install s3cmd
############ Python3 installation ################
To check whether you already have Python 3, run this command:
python3 --version
To install it, you can use these commands:
1) sudo apt update
2) sudo apt install python3
3) python3 --version (to check that python3 installed correctly)
############ Run extractRange.py ################
usage: extractRange.py [-h] --keys KEYS --bucket-number BUCKET_NUMBER --product_type PRODUCT_TYPE [--sampling_stepsize SAMPLING_STEPSIZE] [--booleans_as_numbers] [--exact_match] start_timestamp end_timestamp
KEYS: one key, or a comma-separated list of keys; each key can be a single word or a path.
for example: /DcDc/Devices/2/Status/Dc/Battery/voltage ==> this provides the DC battery voltage of DcDc device 2.
example : Dc/Battery/voltage ==> this provides the battery voltage of every DcDc device (including the average voltage across all DcDc devices).
example : voltage ==> this provides every voltage of every device in the Salimax.
BUCKET_NUMBER: the number of the bucket for the installation.
List of bucket numbers/installations:
1: Prototype
2: Marti Technik (Bern)
3: Schreinerei Schönthal (Thun)
4: Wittmann Kottingbrunn
5: Biohof Gubelmann (Walde)
6: Steakhouse Mettmenstetten
7: Andreas Ballif / Lerchenhof
8: Weidmann Oberwil (ZG)
9: Christian Huber (EBS Elektrotechnik)
start_timestamp end_timestamp: these must be valid 10-digit Unix timestamps.
The start_timestamp must be smaller than the end_timestamp.
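If you need to convert a human-readable date into a 10-digit timestamp, one way to do it (a minimal one-liner, assuming the date is given in UTC) is:
python3 -c "from datetime import datetime, timezone; print(int(datetime(2024, 2, 4, 22, 58, 20, tzinfo=timezone.utc).timestamp()))"
This prints 1707087500, the start timestamp used in the example command below.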
PS: The data will be downloaded to a folder named S3cmdData_{Bucket_Number}. If this folder does not exist, it will be created.
If the folder exists but contains no files, the data will be downloaded.
If the folder exists and contains at least one file, only the data extraction is performed.
Example command:
python3 extractRange.py 1707087500 1707091260 --keys ActivePowerImportT2 --bucket-number 1 --product_type 0
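Based on the script's output naming scheme ({keys}_{start_timestamp}_{bucket_number}.csv), this run should download the raw files into S3cmdData_1 and save the extracted data to ActivePowerImportT2_1707087500_1.csv.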
################################ EXTENDED FEATURES FOR MORE ADVANCED USAGE ################################
1) Multiple Keys Support:
The script supports the extraction of data using multiple keys. Users can specify one or multiple keys separated by commas with the --keys parameter.
This feature allows for more granular data extraction, catering to diverse data analysis requirements. For example, users can extract data for different
metrics or parameters from the same or different CSV files within the specified range.
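For example, to extract the state of charge together with all voltages in a single run (the key names here are illustrative):
python3 extractRange.py 1707087500 1707091260 --keys Soc,voltage --bucket-number 1 --product_type 0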
2) Exact Match for Keys:
With the --exact_match flag, the script offers an option to enforce exact matching of keys. This means that only the rows containing a key that exactly
matches the specified key(s) will be considered during the data extraction process. This option enhances the precision of the data extraction, making it
particularly useful when dealing with CSV files that contain similar but distinct keys.
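For example, given rows whose key paths end in .../Battery/Soc and .../Battery/SocLimit (illustrative names), --keys Soc with --exact_match selects only the .../Battery/Soc rows, whereas without --exact_match both match, because 'Soc' is a substring of 'SocLimit'.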
3) Dynamic Header Generation:
The script dynamically generates headers for the output CSV file based on the keys provided. This ensures that the output file accurately reflects the
extracted data, providing a clear and understandable format for subsequent analysis. The headers correspond to the keys used for data extraction, making
it easy to identify and analyze the extracted data.
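For instance, an extraction with --keys Soc (illustrative) would yield a header starting with 'time' followed by one column per matched key path, along the lines of:
time,/Battery/Devices/1/Soc,/Battery/Devices/2/Soc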
4) Advanced Data Processing Capabilities:
i) Booleans as Numbers: The --booleans_as_numbers flag allows users to convert boolean values (True/False) into numeric representations (1/0). This feature
is particularly useful for analytical tasks that require numerical data processing.
ii) Sampling Stepsize: The --sampling_stepsize parameter enables users to define the granularity of the time range for data extraction. By specifying the number
of 2-second intervals, users can adjust the sampling interval, allowing for flexible data retrieval based on time.
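For instance, since the stored files are 2 seconds apart, --sampling_stepsize 5 retrieves one file every 10 seconds (5 x 2-second intervals); the default of 1 retrieves every file in the range.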
Example Command:
python3 extractRange.py 1707087500 1707091260 --keys ActivePowerImportT2,Soc --bucket-number 1 --product_type 0 --exact_match --booleans_as_numbers
This command extracts data for the ActivePowerImportT2 and Soc keys from bucket number 1, between the specified timestamps, with exact
matching of keys and boolean values converted to numbers.
Visualization and Data Analysis:
After data extraction, the script facilitates data analysis by:
i) Providing a visualization function to plot the extracted data. Users can modify this function to suit their specific analysis needs, adjusting
plot labels, titles, and other matplotlib parameters.
ii) Saving the extracted data in a CSV file, with dynamically generated headers based on the specified keys. This file can be used for further
analysis or imported into data analysis tools.
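As a starting point, a minimal plotting sketch (assuming the output file ActivePowerImportT2_1707087500_1.csv from the example above, with a numeric key in the second column; adjust the filename and column index to your extraction) could look like this:
import csv
import matplotlib.pyplot as plt

times, values = [], []
with open('ActivePowerImportT2_1707087500_1.csv') as f:
    reader = csv.reader(f)
    header = next(reader)            # ['time', '<matched key path>', ...]
    for row in reader:
        times.append(int(row[0]))    # first column holds the file timestamp
        values.append(float(row[1])) # second column holds the first matched key
plt.plot(times, values)
plt.xlabel('Unix timestamp')
plt.ylabel(header[1])
plt.show()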
This Python script streamlines the process of data retrieval from S3 storage, offering flexible and powerful options for data extraction, visualization,
and analysis. Its support for multiple keys, exact-match filtering, and advanced processing capabilities makes it a valuable tool for data analysts and
researchers working with time-series data or any dataset stored in S3 buckets.

View File

@@ -1,205 +0,0 @@
import os
import csv
import subprocess
import argparse
import matplotlib.pyplot as plt
from collections import defaultdict
import zipfile
import base64
import shutil
def extract_timestamp(filename):
    timestamp_str = filename[:10]
    try:
        timestamp = int(timestamp_str)
        return timestamp
    except ValueError:
        return 0
def extract_values_by_key(csv_file, key, exact_match):
    matched_values = defaultdict(list)
    with open(csv_file, 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            if row:
                columns = row[0].split(';')
                if len(columns) > 1:
                    first_column = columns[0].strip()
                    path_key = first_column.split('/')[-1]
                    for key_item in key:
                        if exact_match:
                            if key_item.lower() == row[0].split('/')[-1].split(';')[0].lower():
                                matched_values[path_key].append(row[0])
                        else:
                            if key_item.lower() in first_column.lower():
                                matched_values[path_key].append(row[0])
    final_key = ''.join(matched_values.keys())
    combined_values = []
    for values in matched_values.values():
        combined_values.extend(values)
    final_dict = {final_key: combined_values}
    return final_dict
def list_files_in_range(start_timestamp, end_timestamp, sampling_stepsize):
    filenames_in_range = [f"{timestamp:10d}" for timestamp in range(start_timestamp, end_timestamp + 1, 2 * sampling_stepsize)]
    return filenames_in_range
def download_files(bucket_number, filenames_to_download, product_type):
    if product_type == 0:
        hash = "3e5b3069-214a-43ee-8d85-57d72000c19d"
    elif product_type == 1:
        hash = "c0436b6a-d276-4cd8-9c44-1eae86cf5d0e"
    else:
        raise ValueError("Invalid product type option. Use 0 or 1")
    output_directory = f"S3cmdData_{bucket_number}"
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
        print(f"Directory '{output_directory}' created.")
    for filename in filenames_to_download:
        stripfilename = filename.strip()
        local_path = os.path.join(output_directory, stripfilename + ".csv")
        if not os.path.exists(local_path):
            s3cmd_command = f"s3cmd get s3://{bucket_number}-{hash}/{stripfilename}.csv {output_directory}/"
            try:
                subprocess.run(s3cmd_command, shell=True, check=True)
                downloaded_files = [file for file in os.listdir(output_directory) if file.startswith(filename)]
                if not downloaded_files:
                    print(f"No matching files found for prefix '{filename}'.")
                else:
                    print(f"Files with prefix '{filename}' downloaded successfully.")
            except subprocess.CalledProcessError as e:
                print(f"Error downloading files: {e}")
                continue
        else:
            print(f"File '{filename}.csv' already exists locally. Skipping download.")
def decompress_file(compressed_file, output_directory):
    base_name = os.path.splitext(os.path.basename(compressed_file))[0]
    with open(compressed_file, 'rb') as file:
        compressed_data = file.read()
    # Decode the base64 encoded content
    decoded_data = base64.b64decode(compressed_data)
    zip_path = os.path.join(output_directory, 'temp.zip')
    with open(zip_path, 'wb') as zip_file:
        zip_file.write(decoded_data)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(output_directory)
    # Rename the extracted data.csv file to the original timestamp-based name
    extracted_csv_path = os.path.join(output_directory, 'data.csv')
    if os.path.exists(extracted_csv_path):
        new_csv_path = os.path.join(output_directory, f"{base_name}.csv")
        os.rename(extracted_csv_path, new_csv_path)
        print(f"Decompressed and renamed '{compressed_file}' to '{new_csv_path}'.")
    os.remove(zip_path)
    #os.remove(compressed_file)
def get_last_component(path):
    # Note: despite its name, this removes all slashes, concatenating every path component
    path_without_slashes = path.replace('/', '')
    return path_without_slashes
def download_and_process_files(bucket_number, start_timestamp, end_timestamp, sampling_stepsize, key, booleans_as_numbers, exact_match, product_type):
    output_directory = f"S3cmdData_{bucket_number}"
    # The output directory is wiped on every run, so the skip-download branch below can never trigger
    if os.path.exists(output_directory):
        shutil.rmtree(output_directory)
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
        print(f"Directory '{output_directory}' created.")
    filenames_to_check = list_files_in_range(start_timestamp, end_timestamp, sampling_stepsize)
    existing_files = [filename for filename in filenames_to_check if os.path.exists(os.path.join(output_directory, f"{filename}.csv"))]
    files_to_download = set(filenames_to_check) - set(existing_files)
    if os.listdir(output_directory):
        print("Files already exist in the local folder. Skipping download.")
    else:
        if files_to_download:
            download_files(bucket_number, files_to_download, product_type)
    # Decompress all downloaded .csv files (which are actually compressed)
    compressed_files = [os.path.join(output_directory, file) for file in os.listdir(output_directory) if file.endswith('.csv')]
    for compressed_file in compressed_files:
        decompress_file(compressed_file, output_directory)
    csv_files = [file for file in os.listdir(output_directory) if file.endswith('.csv')]
    csv_files.sort(key=extract_timestamp)
    keypath = ''
    for key_item in key:
        keypath += get_last_component(key_item)
    output_csv_filename = f"{keypath}_{start_timestamp}_{bucket_number}.csv"
    with open(output_csv_filename, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        header = ['time']
        add_header = True
        for csv_file in csv_files:
            file_path = os.path.join(output_directory, csv_file)
            extracted_values = extract_values_by_key(file_path, key, exact_match)
            # Build the header once, from the key paths matched in the first file
            if add_header:
                add_header = False
                for values in extracted_values.values():
                    for first_val in values:
                        header.append(first_val.split(';')[0].strip())
                    break
                csv_writer.writerow(header)
            if extracted_values:
                for first_column, values in extracted_values.items():
                    # Keep only the value column of each matched row; optionally map True/False to 1/0
                    if booleans_as_numbers:
                        values = [1 if value.split(';')[1].strip() == "True" else 0 if value.split(';')[1].strip() == "False" else value.split(';')[1].strip() for value in values]
                    else:
                        values = [value.split(';')[1].strip() for value in values]
                    values_list = [csv_file.replace(".csv", "")]
                    values_list.extend(values)
                    csv_writer.writerow(values_list)
    print(f"Extracted data saved in '{output_csv_filename}'.")
def parse_keys(input_string):
    keys = [key.strip() for key in input_string.split(',')]
    return keys
def main():
    parser = argparse.ArgumentParser(description='Download files from S3 using s3cmd and extract specific values from CSV files.')
    parser.add_argument('start_timestamp', type=int, help='The start timestamp for the range (even number)')
    parser.add_argument('end_timestamp', type=int, help='The end timestamp for the range (even number)')
    parser.add_argument('--keys', type=parse_keys, required=True, help='The part to match from each CSV file; can be a single key or a comma-separated list of keys')
    parser.add_argument('--bucket-number', type=int, required=True, help='The number of the bucket to download from')
    parser.add_argument('--sampling_stepsize', type=int, required=False, default=1, help='The number of 2-second intervals defining the sampling interval for S3 file retrieval')
    parser.add_argument('--booleans_as_numbers', action="store_true", required=False, help='If set, booleans are converted to numbers [0/1]; otherwise booleans are kept as text [False/True]')
    parser.add_argument('--exact_match', action="store_true", required=False, help='If set, a key has to match exactly; otherwise it is enough that the key is contained in the text')
    parser.add_argument('--product_type', required=True, help='Use 0 for Salimax and 1 for Salidomo')
    args = parser.parse_args()
    start_timestamp = args.start_timestamp
    end_timestamp = args.end_timestamp
    keys = args.keys
    bucket_number = args.bucket_number
    sampling_stepsize = args.sampling_stepsize
    booleans_as_numbers = args.booleans_as_numbers
    exact_match = args.exact_match
    # new arg for product type
    product_type = int(args.product_type)
    if start_timestamp >= end_timestamp:
        print("Error: start_timestamp must be smaller than end_timestamp.")
        return
    download_and_process_files(bucket_number, start_timestamp, end_timestamp, sampling_stepsize, keys, booleans_as_numbers, exact_match, product_type)

if __name__ == "__main__":
    main()