Skip to main content

My attempt (3) at learning a coding language in my mid 30s

The job of a data analyst involves data cleaning such as filling in null values and removal of outliers that may skew the trend if we computed the data without thinking. The learning videos told me this is rubbish in and rubbish out. >_< 

To select certain rows, we can do slicing using the ":" symbolseries[start_index:end_index] with the end_index not included. (Using the df from my previous post.) series[start_index:end_index:no to jump]. eg [::-1] returns the elements in the series in opposite direction.

eg1. df['Month'][0:2]

This will return me the first two rows of data in a series.

eg2. df[['Month']][0:2]

This will return me the first two rows of data in a df. 

We can also select columns by making use of .loc attribute.

eg3. df.loc[:, 'Month']

This will return a series for all the data in 'Month' column instead of a dataframe in df.loc[:, ['Month']].  

Both works the same way however one uses the index location for slicing.  

To only see a particular section of the data, we need to do filtering. To achieve filtering, we first create a list of True and False values. 

eg 4. Item_filter = df['Item']=='Sofa'

This will return a list. 0:True 1:False 2:False. Since I only have one entry of Sofa

To get the df with only sofa entry. We can use the two methods below.

eg 5. df[Item_filter] 


eg 6. df.loc[Item_filter, :] All columns of the row locations which contains sofa. 

Update the dataframe with the filter to only contain sofa (TRUE) entry.

eg 7. df = df.loc[Item_filter, :]

We reassign the df and overwrite with the df with the filter. After this is run, whenever df is called, it will be the new df that has been filter. To restart, you need to run the original df again using the excel data. 

To find out how many unique values there are in a df. 

eg 8. df['Item'].value_counts

Combine the two filter list to create the final dataframe 

eg 9. df.loc[Item_filter & Month_filter, :]

Here are other operators: 

To renaming columns during data cleanup since python does not work with a space in between the names or is case sensitive

1) Using dictionary substitution 

eg 10. df = df.rename(columns={'Month': 'month',

                        'Number': 'saleunits', 

                        'Item': 'item'})

2) Using list with the correct column titles as replacement, make sure the list has the title even for columns you are not changing. 

eg 11. df.columns = ['month',  'saleunits', 'Item']

To deleting a column

1) You can drop multiple columns at a time 

eg 12. df = df.drop(columns=['Item'])

2) 

eg 13. del df['Item']

To make use of math functions on df

eg 14. df['interest_paid'].sum() # sum the values in a column

eg 15. df.sum()  #sum up the whole df

Find missing data NaN using isna or isnull. They are exactly the same methods, but with different names.

eg16. Item_missing = df['Item'].isna()

To find all the missing data, create a series for column that has NaN values

eg 17. df.loc[Item_missing,:]

eg 18. df.loc[~Item_missing,:] # Keep in mind that we can use the not operator (~) to negate the filter so every row that doesn't have a nan is returned.

eg 19. df['Item'].isna().sum() # The code counts the number of missing values

To remove missing data

eg 20. df[30:40].dropna(how = 'any')  

This will look at row 30 to 40 and if there are NaN values, it will be dropped. Depending on situations it may not be a good idea since we are not sure what values will be impacted when we drop the whole row. 

To fill in the missing data with 0 / value from one loc before or after that missing data

eg 21. df['Number'][30:40].fillna(0)

df['Number'][30:40].fillna(method='bfill')

df['Number'][30:40].fillna(method='ffill')

To fill with interpolation

eg 22. df['Number'][30:40].interpolate(method = 'linear')

To fill with exact data from your exact knowledge

eg 23. Number_missing = df['Number'].isna()
            df.loc[Number_missing,'Number'] = 5

Sometimes when the method require df to be converted to numpy or dictionary

eg 24.  # Both change to numpy 

df.to_numpy()

df.values

eg 25. df.to_dict() #Change to dictionary to preserve the indices

Lastly to export df to csv or excel files with the pathname of the file, index can be remove using the index = false

eg 26. df.to_csv(path_or_buf='directory/filename.csv', index = False)

df.to_excel(excel_writer='directory/filename.xlsx', index=False

(Phew~~ I am finally done with this section. )

Comments