I’ve been teaching myself Python, and I’ve had a couple of successes recently where I took a concept or “need” and coded it to completion. Yay, me!
One thing I find myself doing a lot is sorting a batch of data by one column in Excel/Google Sheets, then dividing up another column into quantiles and averaging each quantile for later graphing. This a research technique I use a lot with financial data.
For example, my previous post on ‘Peak Crashes’ required me to sort a large-ish amount of data by the percentage the S&P500 was below its 250-day highest close. Then divide up the 5-day, 10-day, 20-day and 40-day returns into 20 quantiles (vigintiles) and average all of those. Then I graphed it. This sort of thing is a little time-consuming to do manually using Google Sheets.
I therefore set out to create some Python code that would easily sort by a column of my choice, divide up another chosen column into a user-specified number of quantiles, and then print the mean and median for each quantile.
The code also needed to gracefully handle data that didn’t divide evenly into quantiles (in this case, any remainder is divided up between the first and last quantile).
The output shows the quantile number, mean and median, and is formatted with tabs so you can copy/paste directly into Excel or Google Sheets.
For extra credit, I made the code create a graph with matplotlib, a library that I don’t yet fully understand. The lead image is a duplicate of a graph from the previous post, but created with my code and graphed using matplotlib.
Here’s the code:
#!/usr/bin/python # -*- coding: utf-8 -*- # tiler.py # percentile breaker-upper import pandas as pd from pandas import read_csv, DataFrame import matplotlib.pyplot as plt plt.rcdefaults() import numpy as np import matplotlib.pyplot as plt path_to_file = input("path:") #path_to_file = 'test_data.csv' # remove trailing space if exists, as happens when getting the file path # by dragging an icon from the desktop on a Mac. path_to_file = path_to_file.strip() # read the file into a DataFrame dframe = read_csv(path_to_file) list_headers = list(dframe.columns.values) len_data = len(dframe.index) print('length of data: ',len_data) # print a list of the headers, with an index number to select for x in range(0,len(list_headers)): print('column #',x,' ',list_headers[x]) col_sort = int(input('enter column to sort by: ')) col_meanmed = int(input('enter mean/median column: ')) tiles = int(input("enter number of quantiles to use: ")) # sort the data dframe = dframe.sort_values([list_headers[col_sort]],ascending=True) # handle uneven number of rows by assigning remainder to first and last quantiles. # if remainder is odd, the first quantile gets the extra one. int_tile = int(len_data/tiles) modrem = len_data % tiles offset = int((modrem/2) + (modrem % 2)) stuff_begin = int_tile + offset stuff_end = len_data - int(modrem/2) - int_tile # this is the computed quantile mean and median data tile_data = # the first and last quantiles are handled outside the loop, since they # are possibly of differing row lengths to the inside quantiles. new_slice=slice(0,stuff_begin) temp_mean = dframe.iloc[new_slice,col_meanmed] print('\n\nquantile data:\n\nmean/median 0 :\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean)) tile_data.append([0,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)]) for x in range(1,tiles-1): new_slice=slice((int_tile * x)+ offset, (int_tile * (x+1)) + offset) temp_mean = dframe.iloc[new_slice,col_meanmed] print('mean/median',x,':\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean)) tile_data.append([x,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)]) new_slice = slice(stuff_end,len_data) temp_mean = dframe.iloc[new_slice,col_meanmed] print('mean/median',x+1,' :\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean)) tile_data.append([x+1,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)]) # matplotlib seems to want tuples instead of other data types. # adding 1 to each element in the quantile row, so that quantiles don't start # at zero. row_t = tuple([row+1 for row in tile_data]) row_mn = tuple([row for row in tile_data]) row_md = tuple([row for row in tile_data]) print(""" making bar chart.... """) # graphing stuff. Copied and pasted from samples, with much head-scratching involved. n_groups = len(row_t) title_graph = str(row_t[len(row_t)-1]) + ' quantiles, sorted by "' + str(list_headers[col_sort]) + '", data: "' + str(list_headers[col_meanmed]) + '"' fig, ax = plt.subplots() index = np.arange(n_groups) bar_width = 0.35 opacity = 1 rects1 = plt.bar(index, row_mn, bar_width, alpha=opacity, color='b', label='Mean') rects2 = plt.bar(index + bar_width, row_md, bar_width, alpha=opacity, color='r', label='Median') plt.xlabel('quantiles') plt.ylabel('value') plt.title(title_graph) plt.xticks(index + bar_width, row_t) plt.legend() plt.show()
Because quantile exploration is now so easy, I stumbled upon this below. When I sorted by the size of the ‘drop’ signal, the 40-day hold showed a strong negative trend with the higher ‘drop’ quantiles. Apparently, the bigger the fall, the longer it keeps falling (without taking any market health into account). That gives you more ideas for you to explore at home!
P.S. I’ve still got lots to learn with Python. If you see ways to make this code more efficient, I would appreciate a polite comment to that effect.