Aerocet sampleData KML Pandas
2014-10-15
####Aerocet Sampling Data, using kml files and pandas####
- The data from dust pollution monitoring in Coimbatore using Aerocet was copy paste into a csv file from miniterm.py program and commanded
2
given to aerocet. Made separate CSV files for count and mass mode sampling. - Sampling point locations made into a point feature in Google earth with the name of respective sampling time. Further saved as kml file and opened in QGIS, converted into shapefile to add geometry. The shapefile was subject to edit by attribute calculator and added new two columns of lat and lon by x and y geometry for respective columns; these columns were formatted as STRINGS to get all decimal values of latitude and longitude. The shape files are then converted into csv files using MMQGS plugin. Giving decimal or real number rounds the long lat values and become useless while converting into CSV file.
- The csv files two of aerocet data with time and location data (lat lon) with time have to merge into single file for storing and further analysis.
- An initial attempt was made with this to merge the csv files as dcits, but having limitations in terms of sorting, join/merge and other sql function needed for analysis. Then found familiar pandas library is most useful in this case.
The location file joined with any of the other data files by using time , and it has to a similar format. In the location file, only time is given, in data file it is date and time. Several search and attempts were made to do text into the column for data file time column having data and time separated by space. Based on this, this and most handy by this, but having problem in the commands
# to import csv file import pandas as pd df = pd.read_csv(PMCount_08102014.csv) df1 = pd.DataFrame(df.Time.str.split(' ',1).tolist(),columns = ['Date','Time'])
returns a new separate dataframe with splitter columns, the above cocmmand is based on this. It becomed difccult in join again these data frame with mother data frame. Faced problem such as AssertionError: length of join_axes must not be equal to 1
for doing merge those data frame by merge as df1 = pd.merge(df, s, on='Time', suffixes=['Time','Status']), and for doing concat by
df1 = pd.concat([df,s],joinaxes=‘Time’)getting
ValueError: too many values to unpack
1. Then made attempt of reciprocal method that is add date to the place csf file data frame and join with other data file. For this to add date to the time data frame, used this
df[‘Name’]=‘2014-10-08 ’ + df[‘Name’].astype(str), here Name was the column heading given for the time as it is given in kml file. this column heading was changed by
df=df.rename(columns={‘Name’:‘Time’})
1. With this also mergeing ends with error of
AssertionError: length of joinaxes must not be equal to 1from command
df2=pd.concat([df,df1],joinaxes=‘Time’). Then it is found that the error is due to lack of index in two merging data frame. Index for data farme was made by
df.setindex(‘Time’); df1.setindex(‘Time’), then merging of the datframe ends without any any error by the command
df2 = pd.merge(df, df1, on=‘Time’, suffixes=[’1’, ‘2’]). Withis all the three dataframe can be merged one by one, in which suffixes gives tag to differentiate the two datframe same columns heading.
1. With this formed dataframe the removal of unwanted columns is tryed with
del df4[‘AT©x’]having limation of only single columns delete,
df4.drop([‘RH(%)x’,‘Locationx’],axis=1)doing multi column delte but giving unsatisfactory, poping up of droped columns in latter commnads, then it is found that make a new data frame with only requiered column list as by below command
df5=pa.DataFrame(df,columns=[‘Time’,‘lon’,‘lat’,‘0.3(CF)’, ‘0.5(CF)’,‘1.0(CF)’, ‘5.0(CF)’,‘10 (CF)’,‘PM1(ug/m3)’,‘PM2.5(ug/m3)’, ‘PM4(ug/m3)’,‘PM7(ug/m3)’,‘PM10(ug/m3)’, ‘TSP(ug/m3)’, ‘AT©y’,‘RH(%)y’,‘Locationy’,
‘Secondsy’,‘Status_y’])
1. The full commnads are as follows
df1 = pd.read_csv('PMCount_08102014.csv')
df2 = pd.read_csv('PMMass_08102014.csv')
df3 = pd.read_csv('08102014L.csv')
df3['Name']='2014-10-08 ' + df3['Name'].astype(str)
df3=df3.rename(columns={'Name':'Time'})
df1.set_index('Time')
df2.set_index('Time')
df3.set_index('Time')
df4 = pd.merge(df1, df2, on='Time', suffixes=['_1', '_2'])
df5 = pd.merge(df3, df4, on='Time')
df6=pd.DataFrame(df5,columns=['Time','lon','lat','0.3(CF)', '0.5(CF)','1.0(CF)', '5.0(CF)','10 (CF)','PM1(ug/m3)','PM2.5(ug/m3)', 'PM4(ug/m3)','PM7(ug/m3)','PM10(ug/m3)', 'TSP(ug/m3)', 'AT(C)_2','RH(%)_2','Location_2','Seconds_2','Status_2'])
df6.to_csv('A08102014M.csv')
- kml to csv file was generated by this scripts, it converts kml point location file into csv file with name and latitude and longitude column stitched together as a tuple. The kml file gives two coordinates one is with name tag latitude and longitude and another with coordinates. In which coordinates is current lat long as by checked its factual in google earth. below is an example of kml file point coordinates
- moreover the kml to csv script was edited to take only coordinates and names, this avoids the error comes from locus map point marking kml files where the name tag lot lang is absent. The modified script is as follows
”“”
pip install pykml
pykml depends to lxml module this script convert only one folder to csv “””
import sys, csv, codecs from pykml import parser as kmlParser
def main(): args = sys.argv[1:]
if 1 > len(args) > 2: usage()
filename = args[0]
name, places = getdata(filename)
lst = places_to_list(places)
if filename.endswith('.kml') : filename = filename.replace('.kml','.csv')
else: filename += '.csv'
if len(args) == 2: filename = args[1]
to_csv(filename, lst)
print "See {filename}\nGoodluck (^_^)!".format(filename=filename)
def usage(): print “python parse.py input.mkl output.csv” sys.exit(2)
def getDocumentAndTagPrefix(xml): root = xml.getroot(); tagPrefix = root.tag.replace( ‘}’ + xml.docinfo.root_name, ‘}’ ) return root.Document, tagPrefix
def iterFolders(folder, tagprefix): lst = [ i for i in folder.iterchildren(tagprefix+‘Folder’) ] return folder if not len(lst) else iter(lst)
def getdata(filename): document, tagPrefix =getDocumentAndTagPrefix( kmlParser.parse( open(filename) ) ) mainFolder = document.Folder
name = None
placemarks = None;
for folder in iterFolders(mainFolder, tagPrefix):
name = folder.iterchildren(tagPrefix+"name").next().text
placemarks = [ x for x in folder.iterchildren(tagPrefix+"Placemark") ]
break;
return name,placemarks;
def placestolist(places): lst = [[‘name’, ‘coordinates’]]
for place in places:
lst.append([
place.name, place.Point.coordinates
])
return lst
def to_csv(filename, lst):
with codecs.open(filename, ‘wb’) as fl:
out = csv.writer(fl);
out.writerows(lst)
if name == “main“: main() sys.exit(0)
- The modified script gives name and coordinates as separate tuples such as below
name,coordinates 18:48:29,“76.938598315386,10.94663518249638,0”
- In Aerocet sampling, used Locus map android application with a tablet phone. Marked each point in locus map tracking facility. So for duplicate point sampling, the kml file was generated with the point marked with more than one time as names. So the name columns have to be attached with more than one same coordinates in coordinates columns.
- To do the above requirements first tried with simple list split and join like below
#to open csv as tuples based on this with open(‘21102014_N.csv’) as f: data=[tuple(line) for line in csv.reader(f)]
#for a sample name and coordinates location_in = “19:10:29, 19:11:29, 19:12:29”,“76.96160393798016,10.9879284257414,0”
#to separate the time and location tuples time = locationin[0] loc = locationin[1] ts=time.split(‘, ‘)
#for loop for attaching name and location different try, #based on this, this, this, for tss in ts: a2=str(ts[0])+loc a.append(a2)
#based on this a=[] for tu in data: ts=str(tu[0].split(‘, ‘)) tp=ts[0+‘,’+(len(ts[0])*tu[1]) a.append(tp)
a=[] for tu in data: ts=tu[0][0]+(len(tu[0][0])*tu[0][1]) a.append(ts)
#but list based mehtod is not easy to applay, but below #pandas df based solution is doing exactly what is required #based on this df23=pd.concat([Series(row[‘coordinates’], row[‘names’].split(‘,’)) for , row in df.iterrows()]).resetindex()
#to add columns to dataframe df23.columns = [‘names’, ‘coordinates’]
#to split columns in pandas datafarme based on comma #delimiter, for comma separated values into two in #coordinates column. But it is not working df32 = pd.DataFrame(df3.coordinates.str.split(‘, ‘,1).tolist(),columns =[‘flips’,‘row’])
#to split columns in pandas dataframe, it is working, based #on this list=df3[‘coordinates’].tolist() thenewlist = [x.split(‘,’) for x in list] ab=pd.DataFrame(thenewlist) ab.to_csv(‘data5.csv’)
#to set the column as date time in pandas based on this df[‘time’] = df[‘time’].astype(‘datetime64[ns]‘)