August 19, 2022
The intent of this post is to document basic functions related to analyzing data using DataFrames.
- Data Exploration
- Operations with rows and columns
- Filtering Data Operations
- Chaining examples
- Dictionary Operations
Command | Syntax |
---|---|
Describe | df.describe |
Number of Rows and Columns | df.shape , where:df.shape[0] = rows & df.shape[1] = columns |
Names in Columns | df.columns.values |
Column data types | df.dtypes |
Min | df['COLUMN_NAME'].min() |
Max | df['COLUMN_NAME'].max() |
Number of unique values | len(df['COLUMN_NAME'].unique()) |
Sorted list of unique values | sorted(df['COLUMN_NAME'].unique()) |
Identify type of value in a column | type(df_reason_mod['COLUMN_NAME'][0]) |
Count per value in column | df.['COLUMN_NAME'].value_counts(dropna=False) |
Traspose | df.head().T |
View column with missing value in rows | df.loc[df.COLUMN_NAME.isna(), ['COLUMN_NAME_1', 'COLUMN_NAME_2', 'COLUMN_NAME_3']] |
Command | Syntax |
---|---|
Dropping column | df=df.drop(['COLUMN_NAME'], axis=1) |
Adding column | df['NEW_COLUMN']=df.sum(axis=1) |
Sum values per row | df['NEW_COLUMN']=df.sum(axis=1) |
Sum values per column | df['COLUMN_NAME'].sum(axis=0) |
Cut dataframe from Columns A to D | new_df=df.loc[:,A:D] |
Concatenate multiple df | df= pd.concat([df1,df2,df3,df4,...,dfn], axis=1) |
Renaiming columns | 1) get initial values with column_names=df.columns.value 2) modify the name value in variable 3) df.columns=column_names |
Reordering columns | 1) get initial values with colum_names_reordered=df.columns.value 2) modify order of values in variable 3) df=df[colum_names_reordered] |
Convert values in column to timestamp format. Ex, a column DATE with str such as 07/15/2018 |
pd.to_datetime(df_reason_mod['Date'], format='%d/%m/%Y') The convension of the format parameter, see the ‘strftime’ documentation |
Reclassify the values in a column | In a column with 4 distinct values, where we wish to combine 2,3 & 4 as B, and re-label 1 as A: df['COLUMN_NAME']=df['COLUMN_NAME'].map({1:A,2:B,3:B,4:B}) |
Slice data with iloc | df.iloc[START_ROW:END_ROW,START_COLUMN:END_COLUMN] Ex: df.iloc[:,:-1] , which is all rows, all columns but the last one |
Where | df['COLUMN_NAME']= np.where(daf['COLUMN_NAMEs']>CONDITION,TRUE_VALUE,FALSE_VALUE) |
Filter data | df[df['COLUMN_NAME']=='VALUE'] |
Temporarily display more rows | with pd.option_context('display.max_columns', 22, 'display.min_rows',10): display(df) |
Use the parameter ‘as_index=False’ in the groupby to have teh output show in the same level as the input df
# initialize data of lists.
data = {'department': ['shoes', 'clothing', 'shoes', 'furniture']
, 'product_type':['shoes', 'shirt', 'shoes', 'furniture']
,'order_amount': [20, 21, 19, 180]
,'taxes_paid':[1.5,2,1.2,10]
}
# Create DataFrame
df = pd.DataFrame(data)
display(df)
df_result=(df.groupby(['department', 'product_type'], as_index=False)
.agg({
'order_amount':'sum'
,'taxes_paid':'sum'
})
)
# Print the output.
display(df_result)
For a dataFrame called ‘df’, this shows the ‘percentage’ of missing values per column, where there is in fact missing values detected by the ‘.isna()’ function.
((df
.isna()
.mean()
*100)
.pipe(lambda ser:ser[ser>0])
)
Operation | Example Code |
---|---|
With AND Condition | ` df[(df.state==’Texas’) & (df.year==1990)] ` |
With OR Condition | df[(df.state=='Texas') \| (df.year==1990)] |
With != Column Names Nomenclature | df[df.state=='Texas'] or df[df['state']=='Texas'] |
With a list_of_values | data_filtered=df[df.column_Name.isin(list_of_values)] |
Using ‘query’ | df.query('colum_Name>=5 and colum_Name<=10') |
Operation | Example Code |
---|---|
Converting timestamp into datetime | data['dt'] = data['ts'].map(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').strftime("%Y-%m-%d")) |
Function to convert columns | for column in columns_to_date: </break> df[column] = pd.to_datetime(df[column], format='%B %d, %Y', errors='coerce') |
Operation | Example Code | ||||||||||||||
Groupby & Aggregations Controlling Column Name by Using 'pd.NamedAgg' |
See a code example this workbook.</td> </tr>
|