Nishadh KA

Aerocet sampleData KML Pandas

2014-10-15


#### Aerocet Sampling Data, using kml files and pandas ####

  1. The data from dust pollution monitoring in Coimbatore using the Aerocet was copy-pasted into a csv file from the miniterm.py program, with command 2 given to the Aerocet. Separate CSV files were made for count-mode and mass-mode sampling.
  2. The sampling point locations were made into a point feature in Google Earth, with the respective sampling time as the point name. This was saved as a kml file, opened in QGIS, and converted into a shapefile to add geometry. The shapefile was then edited with the attribute (field) calculator to add two new columns, lat and lon, from the y and x geometry respectively; these columns were formatted as STRINGS to keep all decimal places of latitude and longitude, since giving them a decimal or real number type rounds the lat/long values and makes them useless when converting to CSV. The shapefile was then converted into a csv file using the MMQGIS plugin.
  3. The two csv files of Aerocet data (with time) and the location data (lat/lon, also with time) have to be merged into a single file for storage and further analysis.
  4. An initial attempt was made to merge the csv files as dicts, but this had limitations for the sorting, join/merge and other SQL-like functions needed for analysis. The familiar pandas library turned out to be the most useful here.
  5. The location file is joined with either of the data files using time, so the time columns have to be in a similar format. In the location file only the time is given, while in the data files it is date and time. Several searches and attempts were made to split the data files' time column, which has the date and time separated by a space, into two columns. Based on this, this and, most handily, this, the commands below were used, but they had problems.

        # to import the csv file
        import pandas as pd
        df = pd.read_csv('PMCount_08102014.csv')
        # split the 'Time' column ('date time') into separate Date and Time columns
        df1 = pd.DataFrame(df.Time.str.split(' ', 1).tolist(), columns=['Date', 'Time'])
    

This returns a new, separate dataframe with the split columns (the command above is based on this). It became difficult to join this dataframe back with the mother dataframe (one possible way to do this join-back is sketched just before the full commands below): merging with `df1 = pd.merge(df, s, on='Time', suffixes=['Time','Status'])` gave `AssertionError: length of join_axes must not be equal to 1`, and concatenating with `df1 = pd.concat([df, s], join_axes='Time')` gave `ValueError: too many values to unpack`.
  1. The reciprocal method was then attempted: add the date to the location csv file's dataframe and join it with the other data files. To add the date to the time column, `df['Name'] = '2014-10-08 ' + df['Name'].astype(str)` was used, where `Name` is the column heading given to the time as it comes from the kml file; this heading was then changed with `df = df.rename(columns={'Name':'Time'})`.
  1. With this, merging still ended with `AssertionError: length of join_axes must not be equal to 1` from `df2 = pd.concat([df, df1], join_axes='Time')`. It was then found that the error was due to the lack of an index on the two dataframes being merged. An index was made with `df.set_index('Time'); df1.set_index('Time')`, after which merging finished without any error using `df2 = pd.merge(df, df1, on='Time', suffixes=['_1', '_2'])`. In this way all three dataframes can be merged one by one, with `suffixes` tagging the columns that have the same heading in the two dataframes.
  1. From the merged dataframe, removal of unwanted columns was tried with `del df4['AT(C)_x']`, which is limited to deleting a single column, and with `df4.drop(['RH(%)_x','Location_x'], axis=1)`, which does a multi-column delete but was unsatisfactory, with the dropped columns popping up again in later commands. It turned out to be easier to make a new dataframe with only the required columns, as in `df5 = pd.DataFrame(df, columns=['Time','lon','lat','0.3(CF)','0.5(CF)','1.0(CF)','5.0(CF)','10 (CF)','PM1(ug/m3)','PM2.5(ug/m3)','PM4(ug/m3)','PM7(ug/m3)','PM10(ug/m3)','TSP(ug/m3)','AT(C)_y','RH(%)_y','Location_y','Seconds_y','Status_y'])`.
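For reference, here is a minimal sketch of one way the split columns could have been joined back column-wise, without going through the Time-key merge at all. This is my own addition, not the command actually used in the end; the filename is the one from above, and the helper column names are only illustrative:

        import pandas as pd

        df = pd.read_csv('PMCount_08102014.csv')
        # split 'date time' into two new columns in a helper dataframe
        parts = pd.DataFrame(df.Time.str.split(' ', n=1).tolist(),
                             columns=['Date', 'TimeOnly'])
        # join the helper back to the mother dataframe row by row (by position/index)
        df = pd.concat([df, parts], axis=1)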
  1. The full commands are as follows:

        # read the count-mode, mass-mode and location csv files
        df1 = pd.read_csv('PMCount_08102014.csv')
        df2 = pd.read_csv('PMMass_08102014.csv')
        df3 = pd.read_csv('08102014L.csv')
        # prepend the sampling date to the location file's time and rename the column to Time
        df3['Name'] = '2014-10-08 ' + df3['Name'].astype(str)
        df3 = df3.rename(columns={'Name': 'Time'})
        # index the dataframes by Time (note: set_index returns a new dataframe unless reassigned)
        df1.set_index('Time')
        df2.set_index('Time')
        df3.set_index('Time')
        # merge count and mass data on Time, then merge in the location data
        df4 = pd.merge(df1, df2, on='Time', suffixes=['_1', '_2'])
        df5 = pd.merge(df3, df4, on='Time')
        # keep only the required columns and write the merged data to csv
        df6 = pd.DataFrame(df5, columns=['Time', 'lon', 'lat',
                                         '0.3(CF)', '0.5(CF)', '1.0(CF)', '5.0(CF)', '10 (CF)',
                                         'PM1(ug/m3)', 'PM2.5(ug/m3)', 'PM4(ug/m3)', 'PM7(ug/m3)',
                                         'PM10(ug/m3)', 'TSP(ug/m3)',
                                         'AT(C)_2', 'RH(%)_2', 'Location_2', 'Seconds_2', 'Status_2'])
        df6.to_csv('A08102014M.csv')
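As a side note on the column-dropping and indexing trouble mentioned above: in pandas, `drop` and `set_index` return new dataframes rather than modifying the original, which is probably why dropped columns kept reappearing in later commands. A minimal sketch of the reassignment style (the column names here are just examples taken from the merged data above):

        # drop: assign the result back (or pass inplace=True) so the columns stay dropped
        df4 = df4.drop(['RH(%)_1', 'Location_1'], axis=1)
        # set_index: likewise, assign the result back for the new index to take effect
        df1 = df1.set_index('Time')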
  1. The kml-to-csv conversion was done with this script, which converts a kml point-location file into a csv file with the name and the latitude/longitude stitched together as a tuple. The kml file gives two sets of coordinates: one under tags named latitude and longitude, and another under the coordinates tag; the coordinates tag holds the actual lat/long, as checked against Google Earth. Below is an example of a kml point placemark:

        <Placemark>
          <name>07:37:03</name>
          <LookAt>
            <longitude>76.94981740084671</longitude>
            <latitude>10.91510509935186</latitude>
            <altitude>0</altitude>
            <heading>0.008824190954161487</heading>
            <tilt>0</tilt>
            <range>830.3193705463046</range>
            <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
          </LookAt>
          <styleUrl>#m_ylw-pushpin</styleUrl>
          <Point>
            <coordinates>76.94856689929749,10.91446297419933,0</coordinates>
          </Point>
        </Placemark>

  1. Moreover, the kml-to-csv script was edited to take only the names and coordinates; this avoids the error that comes with Locus Map point-marking kml files, where the latitude and longitude tags under the name are absent. The modified script is as follows:

        """
        pip install pykml
        pykml depends on the lxml module.
        This script converts only one folder of a kml file to csv.
        """
        import sys, csv, codecs
        from pykml import parser as kmlParser


        def main():
            args = sys.argv[1:]
            if not 1 <= len(args) <= 2:
                usage()
            filename = args[0]
            name, places = getdata(filename)
            lst = places_to_list(places)

            if filename.endswith('.kml'):
                filename = filename.replace('.kml', '.csv')
            else:
                filename += '.csv'
            if len(args) == 2:
                filename = args[1]
            to_csv(filename, lst)
            print "See {filename}\nGoodluck (^_^)!".format(filename=filename)


        def usage():
            print "python parse.py input.kml output.csv"
            sys.exit(2)


        def getDocumentAndTagPrefix(xml):
            root = xml.getroot()
            # e.g. '{http://www.opengis.net/kml/2.2}kml' -> '{http://www.opengis.net/kml/2.2}'
            tagPrefix = root.tag.replace('}' + xml.docinfo.root_name, '}')
            return root.Document, tagPrefix


        def iterFolders(folder, tagPrefix):
            lst = [i for i in folder.iterchildren(tagPrefix + 'Folder')]
            return folder if not len(lst) else iter(lst)


        def getdata(filename):
            document, tagPrefix = getDocumentAndTagPrefix(kmlParser.parse(open(filename)))
            mainFolder = document.Folder

            name = None
            placemarks = None
            for folder in iterFolders(mainFolder, tagPrefix):
                name = folder.iterchildren(tagPrefix + "name").next().text
                placemarks = [x for x in folder.iterchildren(tagPrefix + "Placemark")]
                break
            return name, placemarks


        def places_to_list(places):
            # keep only the name and the coordinates of each placemark, so kml files
            # without latitude/longitude tags (e.g. Locus Map points) also work
            lst = [['name', 'coordinates']]
            for place in places:
                lst.append([place.name, place.Point.coordinates])
            return lst


        def to_csv(filename, lst):
            with codecs.open(filename, 'wb') as fl:
                out = csv.writer(fl)
                out.writerows(lst)


        if __name__ == "__main__":
            main()
            sys.exit(0)
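For completeness, a typical run of the script from the command line might look like the line below; the input filename is only illustrative, while the output name matches the location csv used in the pandas steps above:

        python parse.py 08102014.kml 08102014L.csv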

  1. The modified script gives the name and the coordinates as separate columns, with the coordinates kept together as one quoted tuple, as below:

        name,coordinates
        18:48:29,"76.938598315386,10.94663518249638,0"

  1. For the Aerocet sampling, the Locus Map Android application was used on a tablet phone, and each sampling point was marked with Locus Map's tracking facility. When a point was sampled more than once, the generated kml file holds that point only once, with the several sampling times together in its name. The name column therefore has to be expanded so that each time is attached to the same coordinates in the coordinates column.
  2. To meet the above requirement, a simple list split-and-join was tried first, as below:

        # to open the csv as tuples, based on this
        import csv
        with open('21102014_N.csv') as f:
            data = [tuple(line) for line in csv.reader(f)]

        # a sample name and coordinates pair
        location_in = ("19:10:29, 19:11:29, 19:12:29",
                       "76.96160393798016,10.9879284257414,0")

        # to separate the time and location parts of the tuple
        time = location_in[0]
        loc = location_in[1]
        ts = time.split(', ')

        # for loop for attaching name and location, different tries,
        # based on this, this and this
        for tss in ts:
            a2 = str(ts[0]) + loc
            a.append(a2)

        # based on this
        a = []
        for tu in data:
            ts = str(tu[0].split(', '))
            tp = ts[0] + ',' + (len(ts[0]) * tu[1])
            a.append(tp)

        a = []
        for tu in data:
            ts = tu[0][0] + (len(tu[0][0]) * tu[0][1])
            a.append(ts)

        # the list-based method is not easy to apply, but the pandas
        # dataframe-based solution below does exactly what is required,
        # based on this
        df23 = pd.concat([pd.Series(row['coordinates'], row['names'].split(','))
                          for _, row in df.iterrows()]).reset_index()

        # to set the column names of the new dataframe
        df23.columns = ['names', 'coordinates']
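To illustrate what these two commands do, here is a minimal, self-contained sketch with made-up values in the same shape as the Locus Map csv (the sample times and coordinates are hypothetical):

        import pandas as pd

        df = pd.DataFrame({
            'names': ['19:10:29, 19:11:29, 19:12:29', '19:20:05'],
            'coordinates': ['76.9616,10.9879,0', '76.9700,10.9900,0'],
        })

        # one output row per time, each repeating its point's coordinates
        # (split on ', ' here because these sample names use a comma and a space)
        df23 = pd.concat([pd.Series(row['coordinates'], row['names'].split(', '))
                          for _, row in df.iterrows()]).reset_index()
        df23.columns = ['names', 'coordinates']
        # df23 now has one row per time:
        #   19:10:29 -> 76.9616,10.9879,0
        #   19:11:29 -> 76.9616,10.9879,0
        #   19:12:29 -> 76.9616,10.9879,0
        #   19:20:05 -> 76.9700,10.9900,0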

        # to split the coordinates column of a pandas dataframe on a comma
        # delimiter, into two columns. But it is not working (likely because
        # the coordinates string has no space after the commas, so splitting
        # on ', ' gives only one part per row)
        df32 = pd.DataFrame(df3.coordinates.str.split(', ', 1).tolist(), columns=['flips', 'row'])

        # to split a column in a pandas dataframe; this works, based on this
        list = df3['coordinates'].tolist()
        thenewlist = [x.split(',') for x in list]
        ab = pd.DataFrame(thenewlist)
        ab.to_csv('data5.csv')
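Since the working method above writes out only the split coordinate parts, here is a small sketch, my own addition, of naming the new columns and carrying the names column along; it assumes df3 has 'names' and 'coordinates' columns, and the output filename is only illustrative:

        # split 'lon,lat,alt' into three named columns
        parts = [x.split(',') for x in df3['coordinates'].tolist()]
        latlon = pd.DataFrame(parts, columns=['lon', 'lat', 'alt'])
        # put the names back next to the split coordinates and write to csv
        out = pd.concat([df3[['names']].reset_index(drop=True), latlon], axis=1)
        out.to_csv('data5_named.csv', index=False)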

        # to set a column as datetime in pandas, based on this
        df['time'] = df['time'].astype('datetime64[ns]')
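An equivalent way to do this conversion, shown here just as a sketch with a hypothetical 'time' column, is `pd.to_datetime`:

        import pandas as pd

        df = pd.DataFrame({'time': ['2014-10-08 07:37:03', '2014-10-08 07:42:11']})
        # parse the strings into datetime64[ns] values
        df['time'] = pd.to_datetime(df['time'])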

  1. Other useful pandas commands: `df.head()` to preview a dataframe (based on this). More info on reshaping and pivoting (grouping) is in this, this and more, and an easy intro to pandas is this.
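As a pointer to the grouping mentioned above, a minimal sketch on the merged file could look like this; the filename and column headings are taken from the merged csv above, but whether grouping by them is meaningful depends on the data:

        # mean PM10 per sampling location from the merged csv
        df = pd.read_csv('A08102014M.csv')
        print(df.groupby('Location_2')['PM10(ug/m3)'].mean())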