The job of a data analyst involves data cleaning, such as filling in null values and removing outliers that could skew the trend if we computed on the data without thinking. The learning videos told me this is rubbish in, rubbish out. >_<
To select certain rows, we can do slicing using the ":" symbol: series[start_index:end_index], with the end_index not included. (Using the df from my previous post.) A step size can also be given: series[start_index:end_index:step]. eg [::-1] returns the elements in the series in the opposite direction.
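In case the earlier post is not handy, here is a rough stand-in df I made up so the snippets below can be tried on their own (the column names match what I use later, but the values are just my guesses):
import pandas as pd

# a made-up mini version of the sales df from my previous post
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Item': ['Sofa', 'Chair', 'Table'],
    'Number': [5, 10, 3],
})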
eg1. df['Month'][0:2]
This will return the first two rows of data as a Series.
eg2. df[['Month']][0:2]
This will return the first two rows of data as a DataFrame.
We can also select columns by making use of the .loc attribute.
eg3. df.loc[:, 'Month']
This will return a Series with all the data in the 'Month' column, whereas df.loc[:, ['Month']] returns a DataFrame instead.
Both approaches give the same result; the difference is that the square-bracket slicing in eg 1 uses integer index positions, while .loc selects by label.
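One detail that tripped me up (my own note, using the toy df above): with the default integer index the two look alike, but the bracket slice excludes the end position while .loc includes the end label:
df['Month'][0:2]      # position-based slice: rows 0 and 1, end position excluded
df.loc[0:1, 'Month']  # label-based: index labels 0 through 1, end label included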
To only see a particular section of the data, we need to do filtering. To achieve this, we first create a boolean Series of True and False values.
eg 4. Item_filter = df['Item']=='Sofa'
This will return a boolean Series: 0: True, 1: False, 2: False, since I only have one entry of Sofa.
To get the df with only the Sofa entry, we can use the two methods below.
eg 5. df[Item_filter]
eg 7. df = df.loc[Item_filter, :]
We reassign df, overwriting it with the filtered df. After this is run, whenever df is called, it will be the new df that has been filtered. To restart, you need to rebuild the original df from the Excel data again.
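A small habit of mine (not from the videos): give the filtered result a new name instead, so the original df stays intact and nothing needs to be re-read:
df_sofa = df.loc[Item_filter, :].copy()  # work on a copy; df itself is untouched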
To find out what unique values there are in a column and how many times each one appears.
eg 8. df['Item'].value_counts()
Combine two filter lists (here Item_filter from eg 4 and a similar Month_filter, see the sketch just below) to create the final dataframe.
eg 9. df.loc[Item_filter & Month_filter, :]
Other operators that can be used when building filters: ==, !=, >, <, >=, <=, plus & for 'and', | for 'or' and ~ for 'not'.
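Month_filter is not defined above, so here is my guess at what it could look like on the toy df, plus the 'or' and 'not' versions of the combination (assuming df is still the full table):
Month_filter = df['Month'] == 'Jan'     # another boolean Series
df.loc[Item_filter & Month_filter, :]   # rows that are Sofa AND in Jan
df.loc[Item_filter | Month_filter, :]   # rows that are Sofa OR in Jan
df.loc[~Item_filter, :]                 # rows that are NOT Sofa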
To rename columns during data cleanup, since column names with spaces are awkward to work with in Python, and Python is case sensitive.
1) Using dictionary substitution
eg 10. df = df.rename(columns={'Month': 'month',
'Number': 'saleunits',
'Item': 'item'})
2) Using a list with the correct column titles as replacement; make sure the list has a title for every column, even the ones you are not changing.
eg 11. df.columns = ['month', 'saleunits', 'Item']
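A small bonus I noticed after renaming (my own observation, not from the videos): once the names are lowercase with no spaces, the columns can also be reached with dot notation:
renamed = df.rename(columns={'Month': 'month', 'Number': 'saleunits', 'Item': 'item'})
renamed.saleunits   # same as renamed['saleunits']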
To delete a column
1) Using drop, you can remove multiple columns at a time (see the sketch after this list)
eg 12. df = df.drop(columns=['Item'])
2) Using Python's del statement
eg 13. del df['Item']
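Since eg 12 only drops one column, here is what dropping several at once could look like on the toy df (purely illustrative):
df.drop(columns=['Item', 'Number'])   # drop more than one column in a single call; returns a new df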
To make use of math functions on df
eg 14. df['interest_paid'].sum() # sum the values in a column
eg 15. df.sum() # sum each column of the whole df
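'interest_paid' comes from the course dataset, not my df; on the toy df it would be the 'Number' column, and since 'Month' and 'Item' are text I would limit the df-wide sum to numeric columns (numeric_only is my addition):
df['Number'].sum()          # total units sold
df.sum(numeric_only=True)   # sum every numeric column at once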
Find missing data (NaN) using isna or isnull. They are exactly the same method, just with different names.
eg16. Item_missing = df['Item'].isna()
To see all the rows with missing data, use the boolean Series created above as a filter.
eg 17. df.loc[Item_missing,:]
eg 18. df.loc[~Item_missing,:] # Keep in mind that we can use the not operator (~) to negate the filter so every row that doesn't have a nan is returned.
eg 19. df['Item'].isna().sum() # The code counts the number of missing values
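My toy df has no missing values, so to actually try these out I poke a NaN in myself (purely for testing), and I also like running isna().sum() on the whole df to get per-column counts:
import numpy as np

df.loc[1, 'Item'] = np.nan   # fake a missing value just for testing
df['Item'].isna().sum()      # now returns 1
df.isna().sum()              # missing-value count for every column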
To remove missing data
eg 20. df[30:40].dropna(how = 'any')
This looks at rows 30 to 40 and drops any row that contains a NaN value. Depending on the situation this may not be a good idea, since we are not sure what values will be lost when we drop the whole row.
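If dropping the whole row feels too aggressive, dropna also takes a subset argument (my addition, not from the videos) so only NaNs in the columns you care about trigger the drop:
df[30:40].dropna(subset=['Number'])   # drop a row only if 'Number' is NaN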
To fill in the missing data with 0, or with the value from one row before or after the missing data
eg 21. df['Number'][30:40].fillna(0)
df['Number'][30:40].fillna(method='bfill')
df['Number'][30:40].fillna(method='ffill')
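On newer pandas versions I have seen deprecation warnings for the method= argument; the same thing can be written with .ffill() and .bfill() (my note, same slices as above):
df['Number'][30:40].ffill()   # same as fillna(method='ffill')
df['Number'][30:40].bfill()   # same as fillna(method='bfill')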
To fill with interpolation
eg 22. df['Number'][30:40].interpolate(method = 'linear')
To fill in an exact value based on your own knowledge of the data
eg 23. Number_missing = df['Number'].isna()
df.loc[Number_missing,'Number'] = 5
Sometimes a method requires the df to be converted to a NumPy array or a dictionary.
eg 24. # Both change to numpy
df.to_numpy()
df.values
eg 25. df.to_dict() #Change to dictionary to preserve the indices
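to_dict also takes an orient argument (my addition); orient='records' gives a list of one dictionary per row, which I find easier to read:
df.to_dict(orient='records')   # e.g. [{'Month': 'Jan', 'Item': 'Sofa', 'Number': 5}, ...]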
Lastly, to export the df to csv or excel files, supply the pathname of the file; the index column can be removed using index=False.
eg 26. df.to_csv(path_or_buf='directory/filename.csv', index = False)
df.to_excel(excel_writer='directory/filename.xlsx', index=False)
(Phew~~ I am finally done with this section. )