Commonly used packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the data into a DataFrame

df = pd.read_csv(filepath_or_buffer='../data/AAPL.csv',
                index_col='Date',
                parse_dates = True)
df2 = df.iloc[::-1]  # a step of -1 returns the rows in reverse order
loc: selects rows by the label values in the index
iloc: selects rows by integer position

df = pd.read_csv('../Data/General/SPX-Constituents.csv', index_col='Symbol')
df = pd.read_excel(io='CU_V1.xlsx', sheet_name='Factors', parse_dates=True, index_col='Date')

Have a quick look at what we have read in: commonly used DataFrame methods

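df.info()      # column names, dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns
df.head()      # first five rows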

DataFrames - Rows and columns

Selecting data from a DataFrame for specific row(s) or column(s)

  • use df['colA'] to select a single column
  • use df[['colA','colB']] to select multiple columns
  • use df.loc['rowA'] to select a single row by its label
  • use df.loc[['rowA','rowB']] to select multiple rows by their labels

Using loc and iloc

Create a DataFrame
data=pd.DataFrame(np.arange(16).reshape(4,4),index=list('abcd'),columns=list('ABCD'))
    A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15

Select the first row
data.loc['a']                    # or: data.iloc[0]
Select the first column
data.loc[:, ['A']]               # or: data.iloc[:, [0]]
Select the first two rows and first two columns
data.loc[['a','b'], ['A','B']]   # or: data.iloc[[0,1], [0,1]]
Select all data
data.loc[:, :]                   # or: data.iloc[:, :]
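
One difference worth keeping in mind: a slice with loc includes the end label, while a slice with iloc excludes the end position, as ordinary Python slicing does. A minimal sketch reusing the 4x4 frame from above (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=list('abcd'), columns=list('ABCD'))

# Label slice: both endpoints ('a' and 'b', 'A' and 'B') are included
by_label = data.loc['a':'b', 'A':'B']

# Position slice: the stop position 2 is excluded, so this is also rows a, b
by_position = data.iloc[0:2, 0:2]

# A scalar selector returns a Series; a list selector returns a DataFrame
first_col_series = data['A']
first_col_frame = data[['A']]

print(by_label.equals(by_position))
print(type(first_col_series).__name__, type(first_col_frame).__name__)
```

The list form (data[['A']]) is the one to use when later code expects a two-dimensional object.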

Calculate the Bollinger Bands for the Adj. Close (21-row rolling window, about one month of trading days; the '30d' column names are labels only)

df2['30d mavg'] = df2['AdjClose'].rolling(window=21).mean()
df2['30d std'] = df2['AdjClose'].rolling(window=21).std()
df2['Upper Band'] = df2['30d mavg'] + (df2['30d std'] * 2)
df2['Lower Band'] = df2['30d mavg'] - (df2['30d std'] * 2)
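
The same calculation can be tried without the CSV file. The sketch below wraps it in a hypothetical helper, bollinger_bands (not part of pandas), and runs it on a synthetic random-walk series; the seed, dates and prices are made up for illustration only.

```python
import numpy as np
import pandas as pd

def bollinger_bands(prices, window=21, num_std=2):
    """Return a DataFrame with the rolling mean and the upper/lower bands."""
    mavg = prices.rolling(window=window).mean()
    std = prices.rolling(window=window).std()
    return pd.DataFrame({'mavg': mavg,
                         'upper': mavg + num_std * std,
                         'lower': mavg - num_std * std})

# Synthetic random-walk prices on business days (illustrative data, not AAPL)
rng = np.random.default_rng(0)
dates = pd.date_range('2016-01-01', periods=100, freq='B')
prices = pd.Series(100 + rng.normal(0, 1, size=100).cumsum(), index=dates)

bands = bollinger_bands(prices)
# The first window-1 rows are NaN because the rolling window is not yet full
print(bands.dropna().head())
```

Keeping window and num_std as parameters makes it easy to compare, say, a 21-row band with a 50-row band on the same series.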

Plot the results

cols = ['30d mavg','Upper Band','Lower Band','AdjClose']
df2[cols].plot()
df2.loc['2016':, cols].plot()  # same plot, from 2016 onwards only

Improve the presentation of the chart

# set style, empty figure and axes
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)

# Get index values for the X axis for the DataFrame
x_axis = df2['2016':].index.get_level_values(0)

# Plot the shaded Bollinger Band (21-row window) for Apple
ax.fill_between(x_axis, df2['2016':]['Upper Band'], df2['2016':]['Lower Band'], color='grey')

# Plot the adjusted close and the moving average; the labels feed ax.legend()
ax.plot(x_axis, df2['2016':]['AdjClose'], color='blue', lw=2, label='AdjClose')
ax.plot(x_axis, df2['2016':]['30d mavg'], color='black', lw=2, label='30d mavg')

# Set Title & Show the Image
ax.set_title('30 Day Bollinger Band For Apple')
ax.set_xlabel('Date (Year/Month)')
ax.set_ylabel('Price (USD)')
ax.legend()
plt.show()

pandas data structures

Series

The first pandas data structure we’re going to look at is the Series. A Series is a one-dimensional labeled array: a vector of values plus an index that we can use to select and manipulate them.

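index = ['first','second','third','fourth']
s = pd.Series(np.arange(4), index=index)
s
# The new index must have the same length as the Series (4 labels for 4 values)
s.index = ['fee','fi','fo','fum']
print(s['fum'])   # look up a single value by label
print(s['fi':])   # label slice from 'fi' to the end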
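
A Series supports both label-based and position-based access, and arithmetic between two Series aligns on the index labels rather than on position. A minimal sketch with made-up labels and values:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=['first', 'second', 'third', 'fourth'])

by_label = s.loc['third']      # label-based lookup
by_position = s.iloc[2]        # position-based lookup (same element)

# Label slices include the end label
middle = s.loc['second':'third']

# Arithmetic aligns on labels; labels missing from one side give NaN
other = pd.Series([10, 20], index=['first', 'fourth'])
total = s + other

print(by_label, by_position)
print(middle)
print(total)
```

Using .loc and .iloc explicitly avoids ambiguity about whether a key is meant as a label or a position.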

pandas and datetime

The datetime stamp is one of the most common fields in time series data. Let’s look at how this works. First, we’ll show how to create a range of dates using the pandas date_range function.

import pandas as pd
# ISO-format dates are unambiguous, so no day-first hint is needed
dates = pd.date_range('2018-07-07', '2018-11-26')
dates = dates[0:6]  # keep the first six dates
dates

Note that two Series sharing an index (here goog_stock_series and aapl_stock_series, which must already be defined) can be combined into a DataFrame.

stock_df = pd.DataFrame({'GOOG': goog_stock_series, 'AAPL':aapl_stock_series})
stock_df
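
Since goog_stock_series and aapl_stock_series are not defined in these notes, here is a minimal sketch with placeholder values (the numbers are invented for illustration and are not real prices). It shows how the dictionary keys become column names and the shared date index becomes the row index:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2018-07-07', periods=6)

# Placeholder values only, not real GOOG/AAPL prices
goog_stock_series = pd.Series(np.arange(6, dtype=float), index=dates)
aapl_stock_series = pd.Series(np.arange(6, dtype=float) * 2, index=dates)

stock_df = pd.DataFrame({'GOOG': goog_stock_series,
                         'AAPL': aapl_stock_series})
print(stock_df)
```

If the two Series had different dates, the DataFrame would hold the union of both indexes, with NaN where one side has no value.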

Time Series Frequencies

rng = pd.date_range()
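
Called with no arguments, pd.date_range raises an error: pandas needs enough of start, end, periods and freq to pin down the range. The freq argument sets the spacing. A minimal sketch with a few frequency aliases; the start date is an arbitrary choice for illustration:

```python
import pandas as pd

start = '2018-07-02'  # a Monday, chosen only for illustration

daily = pd.date_range(start, periods=7, freq='D')            # calendar days
business = pd.date_range(start, periods=7, freq='B')         # weekdays only
weekly = pd.date_range(start, periods=4, freq='W-MON')       # every Monday
month_start = pd.date_range(start, '2018-11-26', freq='MS')  # first day of each month

for name, rng in [('D', daily), ('B', business),
                  ('W-MON', weekly), ('MS', month_start)]:
    print(name, rng[0].date(), '->', rng[-1].date(), len(rng))
```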

Bollinger Bands

# Load in the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_csv('../Data/Commodities/GOLD.csv', index_col='Date', parse_dates=True)
df_BOLL = pd.DataFrame()

# Build the bands from the 21-row rolling mean and standard deviation of the 'USD (PM)' price
df_BOLL['Price(m avg)'] = df['USD (PM)'].rolling(21).mean()
df_BOLL['Upper'] = df_BOLL['Price(m avg)'] + 2 * df['USD (PM)'].rolling(21).std()
df_BOLL['Lower'] = df_BOLL['Price(m avg)'] - 2 * df['USD (PM)'].rolling(21).std()

# Plot the results
fig = plt.figure(figsize=(18,6))

fig.suptitle('GOLD -- 2 stds above/below the 21-day moving average')
plt.xlabel('Date')
plt.ylabel('Price (USD PM)')

plt.plot(df_BOLL)

# Plot the result for 2017 (use .loc to select rows by partial date string)
plt.plot(df_BOLL.loc['2017'])

Time resampling

# Load in the data
df_APPL = pd.read_csv('../Data/Equities/AAPL.csv',index_col='Date',parse_dates=True)

# Resample adjusted open, high, low and close annually (yearly maximum)
cols = ['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close']
df_APPL[cols].resample(rule='Y').max()    # pandas 2.2+ prefers the alias 'YE'

# Resample by business quarter (quarterly mean)
df_APPL[cols].resample(rule='BQ').mean()  # pandas 2.2+ prefers the alias 'BQE'
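
resample groups the rows into time buckets and then applies an aggregate to each bucket. A minimal sketch on synthetic data (the values are invented), using the weekly alias 'W' to keep the example short:

```python
import numpy as np
import pandas as pd

# Synthetic daily values (illustrative only): 1, 2, ..., 90 starting 2018-01-01
dates = pd.date_range('2018-01-01', periods=90, freq='D')
df = pd.DataFrame({'price': np.arange(1, 91, dtype=float)}, index=dates)

# Weekly buckets (weeks ending on Sunday), one aggregate per bucket
weekly_max = df['price'].resample('W').max()
weekly_mean = df['price'].resample('W').mean()

# Several aggregates at once
weekly_summary = df['price'].resample('W').agg(['min', 'max', 'mean'])
print(weekly_summary.head())
```

Each bucket is labeled with its end date, so the first row covers Monday 2018-01-01 through Sunday 2018-01-07.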

Sklearn Tutorial

  • machine learning library written in Python: simple and efficient, for both experts and non-experts
  • classical, well-established machine learning algorithms
  • shipped with documentation and examples
  • BSD 3-Clause license

Algorithms

  • Supervised learning:
      • Linear models (Ridge, Lasso, Elastic Net, …)
      • Support Vector Machines
      • Tree-based methods (Random Forests, Bagging, GBRT, …)
      • Nearest neighbors
      • Neural networks
      • Gaussian Processes
      • Feature selection
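
These estimators share one interface: construct the object, call fit on the training data, then call predict or score. A minimal sketch comparing two of the linear models from the list on synthetic regression data; the dataset parameters and alpha values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 200 samples, 5 features, small noise
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

scores = {}
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)                      # same call for every estimator
    predictions = model.predict(X[:3])   # one predicted value per input row
    scores[type(model).__name__] = model.score(X, y)  # R^2 on the training data

print(scores)
```

Because the interface is uniform, swapping one model for another usually means changing only the constructor line.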

Datasets

Sklearn has a number of datasets that come with the library. We can import them with the Python import statement.

from sklearn import datasets
dir(datasets)
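
Each bundled loader returns a dictionary-like object whose data and target attributes hold the feature matrix and the labels. A minimal sketch with the iris dataset:

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)          # (n_samples, n_features)
print(iris.target.shape)        # one class label per sample
print(iris.feature_names)
print(list(iris.target_names))
```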
    

Our first ML program

  • This is our first example of a machine learning algorithm. Here we import the digits dataset and attempt to classify each image as a digit from 0 to 9.

from sklearn import svm
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the dataset.
digits = datasets.load_digits()

# We're going to use a support vector machine. The gamma and C parameters are
# hyperparameters: gamma is the RBF kernel coefficient and C is the
# regularization parameter (neither is a learning rate).
clf = svm.SVC(gamma=0.0001, C=100)

# Train on everything except the last 10 samples.
x, y = digits.data[:-10], digits.target[:-10]
clf.fit(x, y)

print('Prediction: ', clf.predict(digits.data[[-5]]))

# Plot the image to see if what the classifier predicts matches what we expect.
plt.imshow(digits.images[-5], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

Training set and testing set

  • Machine learning is about learning some properties of a data set and applying them to new data, so a model is normally fitted on a training set and evaluated on a separate testing set.

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
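
A minimal sketch of that idea, assuming scikit-learn's train_test_split helper and an arbitrary 75/25 split (the split ratio and random_state are choices made for this illustration, not part of the notes above):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out 25% of the samples; random_state fixes the shuffle for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.0001, C=100)
clf.fit(X_train, y_train)

# Accuracy on samples the model never saw during training
accuracy = clf.score(X_test, y_test)
print(f'Test accuracy: {accuracy:.3f}')
```

Scoring on held-out samples gives a fairer estimate of how the classifier behaves on new data than scoring on the samples it was trained on.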