[CI-DWH][Python][REST API ODATA] Deleting a DWH_Custom Table and Restoring It

In the previous post I identified a good dwh_custom table to use for testing deleting and inserting multiple rows at a time in the NetApp Cloud Insights Data Warehouse. The table I settled on had 106 rows.

The following bits of Python first back up my dwh_custom table to CSV and then delete all of its rows. Then, using the backup CSV, we restore all the data back into the table, capturing some timings along the way. This is all run from the Python shell.

Python REST API ODATA to Delete All Rows in a Table

import csv
import time
import urllib3
import requests
from datetime import datetime
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

''' Program Variables '''
schema = "dwh_custom"
table = "YOUR_CUSTOM_TABLE"
key = "YOUR_KEY"

''' Connection Variables (yourTenant is the full https:// base URL of your Cloud Insights tenant) '''
proxies = {'https': 'http://USERNAME:PASSWORD@PROXYFQDN:PORT'}
yourTenant = "YOURTENANT"
apiKey = "APIKEY"
headers = {'X-CloudInsights-ApiKey': apiKey}

''' TO GET THE WHOLE TABLE AND EXPORT TO CSV (SO WE COULD RESTORE IT) '''
''' Note: This will work for tables of 1000 rows or under - see the paging sketch after this script for larger tables '''
api = "/rest/v1/dwh-management/odata/" + schema + "/" + table
fullTable = requests.get(yourTenant + api,headers=headers,proxies=proxies,verify=False).json()

columns = list(fullTable['value'][0].keys())
output = open(schema + '.' + table + '.csv',mode='w',newline='')
writer = csv.writer(output, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(columns)

for row in fullTable['value']:
  outputRow = []
  for v in columns: outputRow.append(row[v])
  writer.writerow(outputRow)

output.close()

print("Pausing for 5 seconds before deletion!")  
time.sleep(5)

''' TO GET JUST THE KEY FIELD '''
api = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "?$select=" + key
theKeys = requests.get(yourTenant + api,headers=headers,proxies=proxies,verify=False).json()

''' THE DELETES - ODATA addresses a row as table(KEY); a string-valued key would need quoting in the URL '''
api1 = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "("
api2 = ")"

print("Deleting " + str(len(theKeys['value'])) + " rows.")
print("Start Time =", datetime.now().strftime("%H:%M:%S"))

for k in theKeys['value']:
  print(yourTenant + api1 + str(k[key]) + api2)
  d = requests.delete(yourTenant + api1 + str(k[key]) + api2, headers=headers, proxies=proxies, verify=False)
  print("DELETE API worked!" if d.status_code == 200 else "DELETE API failed!")

print("End Time =", datetime.now().strftime("%H:%M:%S"))  

''' VERIFY THE TABLE IS NOW EMPTY '''
api = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "?$select=" + key
theKeys = requests.get(yourTenant + api,headers=headers,proxies=proxies,verify=False).json()
print("Row count now " + str(len(theKeys['value'])) + " rows.")

''' THE END '''


Deleted 106 rows in 45 seconds, or 2.36 deletes every second.
For 1000 rows that would be 423 seconds or 7 minutes.
For 10000 rows that would be ~ 70 minutes.
For 100000 rows that would be 700 minutes or ~ 12 hours!
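
As noted in the backup step, the single GET only works for tables of 1000 rows or under. For bigger tables, a paging loop along the following lines should do the trick. This is just a sketch reusing the variables from the script above, and it assumes the ODATA endpoint honours the standard $top/$skip paging options (I have only tested $top):

''' Sketch: page through a table larger than 1000 rows (assumes standard ODATA $top/$skip support) '''
allRows = []
top = 1000
skip = 0
while True:
  pageApi = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "?$top=" + str(top) + "&$skip=" + str(skip)
  page = requests.get(yourTenant + pageApi, headers=headers, proxies=proxies, verify=False).json()
  allRows.extend(page['value'])
  if len(page['value']) < top: break
  skip += top

print("Fetched " + str(len(allRows)) + " rows in total.")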


Python REST API ODATA to Restore Rows in a Table

import urllib3
import requests
from datetime import datetime
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

''' Program Variables '''
schema = "dwh_custom"
table = "YOUR_CUSTOM_TABLE"
key = "YOUR_KEY"
filename = 'dwh_custom.YOUR_CUSTOM_TABLE.csv'

''' Connection Variables (yourTenant is the full https:// base URL of your Cloud Insights tenant) '''
proxies = {'https': 'http://USERNAME:PASSWORD@PROXYFQDN:PORT'}
yourTenant = "YOURTENANT"
apiKey = "APIKEY"
headers = {'X-CloudInsights-ApiKey': apiKey}

''' Import the CSV '''
import pandas as pd
df = pd.read_csv(filename)
columns = list(df.columns)

''' Remove NaN values first, otherwise we get the error:
requests.exceptions.InvalidJSONError: Out of range float values are not JSON compliant '''
df = df.fillna('')

''' Optional: each item from df.iterrows() is a (row index, row Series) tuple, so a value
is addressed as rows[row number starting at 0][always 1][column name, e.g. the key] '''
rows = []
for r in df.iterrows(): rows.append(r)

''' Get Row Count @ Start '''
api = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "?&$count=true&$top=1"
j = requests.get(yourTenant + api,headers=headers,proxies=proxies,verify=False).json()
print("Row count at start: " + str(j['@odata.count']) + " rows.")

''' Then cycle through the list adding the data '''
print("Start Time =", datetime.now().strftime("%H:%M:%S"))
api = '/rest/v1/dwh-management/odata/' + schema + '/' + table
for r in df.iterrows():
  body = dict(r[1])
  p = requests.post(yourTenant + api, headers=headers, proxies=proxies, json=body, verify=False)
  if p.status_code not in (200, 201): print("POST failed for row " + str(r[0]) + " with status " + str(p.status_code))

print("End Time =", datetime.now().strftime("%H:%M:%S"))  

''' Get Row Count @ End '''
api = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "?&$count=true&$top=1"
j = requests.get(yourTenant + api,headers=headers,proxies=proxies,verify=False).json()
print("Row count at end: " + str(j['@odata.count']) + " rows.")

''' THE END '''


Inserted 106 rows in 56 seconds, or 1.89 inserts every second.
For 1000 rows that would be 529 seconds or ~ 9 minutes.
For 10000 rows that would be ~ 90 minutes.
For 100000 rows that would be 900 minutes or 15 hours!

Summary

This really needs a lot more testing, but it's pretty clear this approach is not a solution for deleting/inserting rows in large datasets. For small tables (and really, all tables in dwh_custom should be small) it is fine.

Let's say, best-case scenario, 3 deletes per second and 2 inserts per second.
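
Purely as a back-of-the-envelope illustration of how those rates scale:

''' Back-of-the-envelope extrapolation at the best-case rates above '''
deletes_per_second = 3
inserts_per_second = 2
for rowCount in (1000, 10000, 100000):
  print(str(rowCount) + " rows: delete ~" + str(round(rowCount / deletes_per_second / 60)) + " minutes, insert ~" + str(round(rowCount / inserts_per_second / 60)) + " minutes")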

The solution for larger tables is to use the POST ETL script and upload CSV functionality (I believe the limit on CSV size is 100 MB).
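
I have not scripted that route here; as a very rough sketch, the upload would be a multipart POST of the CSV file. Note that the endpoint path and form field name below are assumptions on my part - check the Swagger documentation on your tenant before relying on them:

''' Very rough sketch of the CSV upload route - endpoint path and form field name are
assumptions, verify them against your tenant's Swagger documentation '''
uploadApi = "/rest/v1/dwh-management/upload/csvs"
with open(filename, 'rb') as csvFile:
  u = requests.post(yourTenant + uploadApi, headers=headers, proxies=proxies,
                    files={'customFile': (filename, csvFile, 'text/csv')}, verify=False)
print("Upload returned status " + str(u.status_code))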

Note: In my testing, I once deleted 110 rows but only got 106 back. The second run deleted 106 and restored 106. I guess there was something about the formatting of 4 rows from the original test (I did not have time to investigate).
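
If the mismatch happens again, a quick way to find the culprits would be to diff the key column in the backup CSV against the keys now in the table. A minimal sketch, reusing the variables from the restore script and assuming the key values survive the CSV round trip unchanged:

''' Sketch: find keys present in the backup CSV but missing from the table after the restore '''
csvKeys = set(str(v) for v in pd.read_csv(filename)[key])
api = "/rest/v1/dwh-management/odata/" + schema + "/" + table + "?$select=" + key
tableKeys = set(str(k[key]) for k in requests.get(yourTenant + api, headers=headers, proxies=proxies, verify=False).json()['value'])
print("Rows missing after restore: " + str(csvKeys - tableKeys))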
