NumPy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
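NumPy is mainly used to store and process large matrices. It is far more efficient than Python's own nested list structure, and its core is written in C. It is a very basic extension that the other extensions build on. Its data structure is the ndarray, which is usually created in one of three ways:
- Converting Python objects
- Generating arrays with NumPy's built-in, factory-style functions: np.arange, np.linspace, ...
- Reading from disk, e.g. with loadtxt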
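The three creation routes listed above can be sketched as follows. This is a minimal illustration: the variable names are made up here, and an in-memory `io.StringIO` object stands in for a real text file on disk.

```python
import io
import numpy as np

# 1. Convert an existing Python object (here a nested list).
from_list = np.array([[1, 2, 3], [4, 5, 6]])

# 2. Use NumPy's built-in factory-style functions.
evenly_stepped = np.arange(0, 10, 2)      # start, stop (exclusive), step
evenly_spaced = np.linspace(0.0, 1.0, 5)  # 5 points from 0 to 1, both included

# 3. Read from "disk"; the StringIO buffer is an assumption replacing a file path.
fake_file = io.StringIO("1.0 2.0\n3.0 4.0\n")
from_text = np.loadtxt(fake_file)
```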
The Basics
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
In NumPy, dimensions are called axes. The number of axes is called the rank.
NumPy’s array class is called ndarray. The most important and commonly used attributes of an ndarray are:
- ndarray.ndim
- ndarray.shape
- ndarray.size
- ndarray.dtype
- ndarray.itemsize
- ndarray.data
import numpy as np
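As a small sketch of the attributes listed above, the example below builds a 3x5 array with an explicit 64-bit integer dtype (chosen here so the byte size is the same on every platform) and reads each attribute. `ndarray.data` is omitted because it is a raw buffer that is rarely accessed directly.

```python
import numpy as np

a = np.arange(15, dtype=np.int64).reshape(3, 5)

n_axes = a.ndim          # number of axes: 2
shape = a.shape          # size along each axis: (3, 5)
n_elements = a.size      # total number of elements: 15
element_type = a.dtype   # type of the elements: int64
bytes_each = a.itemsize  # bytes per element: 8
```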
Array Creation
import numpy as np
Basic Operations
a = np.array( [20,30,40,50] )
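A short sketch of the basic operations: arithmetic operators act elementwise and return new arrays, while the matrix product uses the `@` operator. The second array `b` and the matrices `A` and `B` are illustrative assumptions.

```python
import numpy as np

a = np.array([20, 30, 40, 50])
b = np.arange(4)         # array([0, 1, 2, 3])

diff = a - b             # elementwise subtraction
squares = b ** 2         # elementwise power
mask = a < 35            # elementwise comparison gives a boolean array

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])
elementwise = A * B      # elementwise product, not a matrix product
matrix_product = A @ B   # matrix product
```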
Universal Functions
B = np.arange(3)
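Universal functions (ufuncs) such as `np.exp`, `np.sqrt` and `np.add` operate elementwise and return a new array. A minimal sketch, where the second operand `C` is an assumption added for illustration:

```python
import numpy as np

B = np.arange(3)                  # array([0, 1, 2])
exp_B = np.exp(B)                 # e**0, e**1, e**2
sqrt_B = np.sqrt(B)               # elementwise square root

C = np.array([2.0, -1.0, 4.0])
summed = np.add(B, C)             # same result as B + C
```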
Indexing, Slicing and Iterating
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.
Multidimensional arrays can have one index per axis. These indices are given in a tuple separated by commas:
def f(x,y):
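The sketch below shows one index per axis on a 2-D array. The body of `f` (`10 * x + y`) is an assumption, since the listing above is cut off; `np.fromfunction` calls `f` with the row and column coordinates of every element.

```python
import numpy as np

def f(x, y):
    # Assumed body: encode the row in the tens digit, the column in the units digit.
    return 10 * x + y

b = np.fromfunction(f, (5, 4), dtype=int)

single = b[2, 3]        # row 2, column 3
column = b[0:5, 1]      # every row, second column
same_column = b[:, 1]   # equivalent to the line above
two_rows = b[1:3, :]    # rows 1 and 2, every column
last_row = b[-1]        # missing indices are treated as complete slices
```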
The dots (...) represent as many colons as needed to produce a complete indexing tuple. For example, if x is a rank 5 array (i.e., it has 5 axes), then x[1,2,...] is equivalent to x[1,2,:,:,:], x[...,3] to x[:,:,:,:,3] and x[4,...,5,:] to x[4,:,:,5,:].
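These equivalences can be checked directly. In this sketch the shape `(5, 3, 2, 6, 4)` is an arbitrary choice, just large enough for every index used in the text to be valid.

```python
import numpy as np

x = np.arange(5 * 3 * 2 * 6 * 4).reshape(5, 3, 2, 6, 4)  # rank 5 array

first = x[1, 2, ...]      # the ellipsis expands to three colons
second = x[..., 3]        # the ellipsis expands to four colons
third = x[4, ..., 5, :]   # the ellipsis expands to two colons
```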
# iteration
Shape Manipulation
a = np.floor(10*np.random.random((3,4)))
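A minimal sketch of common shape manipulations. A seeded generator replaces `np.random.random` here (an assumption, made only so the run is reproducible); each call below returns a new array object and leaves the shape of `a` unchanged.

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # seeded for reproducibility
a = np.floor(10 * rng.random((3, 4)))

flat = a.ravel()              # flattened to one axis, 12 elements
reshaped = a.reshape(6, 2)    # same data, new shape
inferred = a.reshape(2, -1)   # -1 lets NumPy work out the missing dimension
transposed = a.T              # axes swapped
```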
Stacking together different arrays
a = np.floor(10*np.random.random((2,2)))
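Stacking along different axes might look like this. Fixed values are used instead of random ones (an assumption for readability) so the result is easy to verify by eye.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

vertical = np.vstack((a, b))     # stacked along the first axis: shape (4, 2)
horizontal = np.hstack((a, b))   # stacked along the second axis: shape (2, 4)

# column_stack turns 1-D arrays into the columns of a 2-D array.
columns = np.column_stack((np.array([1, 2]), np.array([3, 4])))
```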
Copies and Views
No Copy at all
a = np.arange(12)
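Plain assignment and function calls make no copy at all: both names refer to the same array object. A short sketch, in which the helper `identity_of` is a hypothetical name used only for illustration:

```python
import numpy as np

a = np.arange(12)
b = a                    # no new object is created
b[0] = 99                # so this write is visible through a as well

def identity_of(x):
    # Python passes mutable objects by reference, so no copy is made here.
    return id(x)
```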
View or Shallow Copy
c = a.view()
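A view is a new array object that looks at the same data. The sketch below illustrates that reshaping a view leaves the original's shape alone, while writing through the view changes the original's data.

```python
import numpy as np

a = np.arange(12)
c = a.view()                 # new array object, same underlying data

is_same_object = c is a      # False: c is a distinct object
owner_is_a = c.base is a     # True: the data is owned by a

c = c.reshape(2, 6)          # a keeps its shape (12,)
c[0, 4] = 1234               # but a's data changes
```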
Slicing an array returns a view of it:
s = a[ : , 1:3]     # spaces added for clarity; could also be written "s = a[:,1:3]"
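To make the view behaviour of slices concrete, this sketch assigns into a slice and checks that the parent array changed. Note the use of `s[:] = 10`, which writes into the view, as opposed to `s = 10`, which would only rebind the name `s`.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
s = a[:, 1:3]        # columns 1 and 2 of every row, as a view
s[:] = 10            # writes into a's data through the view
```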
Deep Copy
d = a.copy()     # a new array object with new data is created
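By contrast, a deep copy shares nothing with the original. A minimal sketch:

```python
import numpy as np

a = np.arange(12)
d = a.copy()         # new array object with its own data
d[0] = 9999          # does not affect a
```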
pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
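pandas is a tool built on NumPy, created to solve data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large data sets. It is the most statistics-oriented of these packages and in some respects outperforms R. Its data structures are the one-dimensional Series, the two-dimensional DataFrame (similar to a table in Excel or SQL; dig deeper and you will find that pandas has a lot in common with SQL, the merge function for example), and the three-dimensional Panel (Pan(el) + da(ta) + s, so now you know where the name comes from). Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so current releases offer only Series and DataFrame. To learn pandas, you need to master:
- Summarizing and computing descriptive statistics, handling missing data, hierarchical indexing
- Cleaning, transforming, merging, reshaping, and GroupBy techniques
- Date and time data types and tools (date handling is a breeze)
This article uses a Python 3 development environment.
We usually import the following packages (libraries):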
In [1]: import pandas as pd
Object Creation
Create a Series by passing a list; pandas creates a default integer index:
In [4]: s = pd.Series([1,3,5,np.nan,6,8])
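As a quick check of what that call produces, the sketch below inspects the default index and the dtype. The float dtype follows from the presence of `np.nan` in the list.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

index_labels = s.index.tolist()   # default integer index 0..5
value_dtype = s.dtype             # float64, because NaN is a float
missing = int(s.isna().sum())     # number of missing values
```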
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
In [6]: dates = pd.date_range('20130101', periods=6)
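One way the full construction could look is sketched below. The column labels `A` to `D` and the seeded random generator are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)       # six consecutive days
rng = np.random.default_rng(seed=0)                # seeded for reproducibility
df = pd.DataFrame(rng.standard_normal((6, 4)),     # 6x4 NumPy array of values
                  index=dates,                     # datetime index
                  columns=list('ABCD'))            # labeled columns
```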
Create a DataFrame by passing a dict of objects that can be converted to series-like objects:
In [10]: df2 = pd.DataFrame({ 'A' : 1.,
In [13]: df2.<TAB>
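Since the listing above is cut off after the first key, here is a self-contained sketch of the dict-based construction. Every column other than `A` is an illustrative assumption; the point is that scalars are broadcast to the length of the longest series-like value.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.0,                                                  # scalar, broadcast
    'B': pd.Timestamp('20130102'),                             # scalar timestamp
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # sets the length
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': 'foo',
})
```

Each column keeps its own dtype, which `df2.dtypes` displays.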
Viewing Data
In [14]: df.head()
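A few common ways to inspect a frame are sketched below. The frame is filled with the values 0 to 23 (an assumption, so that the summary statistics are easy to verify).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

first_rows = df.head(2)                                  # first 2 rows
last_rows = df.tail(3)                                   # last 3 rows
values = df.to_numpy()                                   # underlying NumPy array
summary = df.describe()                                  # count, mean, std, ...
columns_desc = df.sort_index(axis=1, ascending=False)    # sort by column label
by_b_desc = df.sort_values(by='B', ascending=False)      # sort by values of B
```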
Selection
Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc. The older .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so it should no longer be used.
Getting
In [23]: df['A']
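A minimal sketch of plain `[]` selection, using a small frame with known values (an assumption for illustration): a column label returns a Series, and an integer slice returns rows by position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

col_a = df['A']          # a single column, returned as a Series
first_three = df[0:3]    # a slice inside [] selects rows
```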
Selecting by Label
In [26]: df.loc[dates[0]]
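Label-based selection with `.loc` and `.at` might look like the sketch below. The frame contents are an assumption; note that label slices include both endpoints.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

first_row = df.loc[dates[0]]                              # one row as a Series
two_cols = df.loc[:, ['A', 'B']]                          # all rows, two columns
window = df.loc['20130102':'20130104', ['A', 'B']]        # both endpoints included
scalar = df.at[dates[0], 'A']                             # fast scalar access
```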
Selecting by Position
In [32]: df.iloc[3]
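Position-based selection follows NumPy and Python slicing rules, so the end of a slice is excluded. A short sketch on an assumed frame holding the values 0 to 23:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

fourth_row = df.iloc[3]                    # row at position 3
block = df.iloc[3:5, 0:2]                  # rows 3-4, columns 0-1
picked = df.iloc[[1, 2, 4], [0, 2]]        # explicit lists of positions
scalar = df.iat[1, 1]                      # fast scalar access by position
```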
Boolean Index
In [39]: df[df.A > 0]
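The sketch below shows three flavours of boolean indexing on a small hand-made frame (the values and the extra column `E` are assumptions): filtering rows by one column, masking the whole frame, and filtering with `isin`.

```python
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0, -3.0, 4.0],
                   'B': [0.5, -0.5, 1.5, -1.5]})

positive_a = df[df.A > 0]          # keep rows where column A is positive
only_positive = df[df > 0]         # keep shape; non-matching cells become NaN

labeled = df.copy()
labeled['E'] = ['one', 'two', 'three', 'four']
picked = labeled[labeled['E'].isin(['two', 'four'])]   # membership filter
```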
Setting
In [45]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
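Assigning a Series as a new column aligns it on the index, which the following sketch illustrates. The zero-filled frame and the column name `F` are assumptions; `s1` starts one day later than the frame, so the first row receives NaN.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.zeros((6, 2)), index=dates, columns=['A', 'B'])

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1                 # aligned by date, not by position

df.at[dates[0], 'A'] = 7.0   # set one value by label
df.iat[0, 1] = 8.0           # set one value by position (row 0, column B)
```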
Missing Data
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [58]: df1.dropna(how='any')
In [59]: df1.fillna(value=5)
In [60]: pd.isnull(df1)
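Put together, the missing-data calls above could be exercised as in this sketch. The base frame and the step that fills part of column `E` are assumptions; `pd.isna` is the same function as `pd.isnull`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

# Reindexing adds column E, filled with NaN; then fill its first two rows.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1

dropped = df1.dropna(how='any')   # drop rows containing any NaN
filled = df1.fillna(value=5)      # replace NaN with 5
mask = pd.isna(df1)               # boolean frame marking NaN cells
```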
Operations
Stats
In [61]: df.mean()
In [63]: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
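A short sketch of these operations on an assumed frame holding the values 0 to 23. Statistics skip missing data by default, and `sub` with `axis='index'` aligns the shifted Series with the frame's rows.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

col_means = df.mean()            # one mean per column
row_means = df.mean(axis=1)      # one mean per row

# shift(2) moves the values down two rows, leaving NaN at the top.
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
result = df.sub(s, axis='index')  # subtract s from every column, row by row
```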
Apply
In [66]: df.apply(np.cumsum)
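`apply` passes each column to the given function. The sketch below uses `np.cumsum` as in the listing and a lambda computing the range of each column (the lambda and the frame values are assumptions).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24.0).reshape(6, 4), columns=list('ABCD'))

running_total = df.apply(np.cumsum)               # cumulative sum down each column
spread = df.apply(lambda x: x.max() - x.min())    # one value per column
```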
Histogramming
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
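Histogramming a Series is typically done with `value_counts`. This sketch uses fixed values instead of random integers (an assumption, so the counts are predictable); the result is sorted by count, highest first.

```python
import pandas as pd

s = pd.Series([4, 2, 4, 6, 4, 2])
counts = s.value_counts()     # index holds the distinct values, data the counts
```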
String Methods
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
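String methods live under the `.str` accessor and skip missing values. A minimal sketch using `str.lower` and `str.len`:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

lowered = s.str.lower()     # elementwise lower-casing; NaN stays NaN
lengths = s.str.len()       # elementwise string length
```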
Merge
Concat
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
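To illustrate `pd.concat`, the sketch below breaks a frame into pieces and glues them back together along the rows. Deterministic values replace the random ones (an assumption for verifiability).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4))

pieces = [df[:3], df[3:7], df[7:]]   # three row-wise chunks
rebuilt = pd.concat(pieces)          # concatenated in order along the rows
```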
Join
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
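SQL-style joins are done with `pd.merge`. In this sketch the `right` frames are assumptions that mirror the two `left` frames above: duplicate keys produce every pairing, while unique keys match one to one.

```python
import pandas as pd

# Duplicate keys: 2 left rows x 2 right rows = 4 result rows.
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
many_to_many = pd.merge(left, right, on='key')

# Unique keys: each left row matches exactly one right row.
left_u = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right_u = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
one_to_one = pd.merge(left_u, right_u, on='key')
```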
Append
In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
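`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions a row is appended with `pd.concat` instead. The sketch below shows that substitute, not the original `append` call; the frame values are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32.0).reshape(8, 4), columns=['A', 'B', 'C', 'D'])

s = df.iloc[3]                                   # the row to append, as a Series
row_frame = s.to_frame().T                       # turn it into a one-row frame
appended = pd.concat([df, row_frame], ignore_index=True)
```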
Grouping
In [91]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
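Grouping follows a split, apply, combine pattern. Because the listing above is truncated, the sketch uses a small assumed frame: rows are split by key, `sum` is applied to each group, and the results are combined into a new frame.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1, 2, 3, 4],
                   'D': [10, 20, 30, 40]})

by_a = df.groupby('A')[['C', 'D']].sum()            # one row per value of A
by_a_b = df.groupby(['A', 'B'])[['C', 'D']].sum()   # hierarchical (A, B) index
```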
Reshaping
Stack
In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
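`stack` compresses the columns into an extra index level, and `unstack` reverses it. The listing above is cut off, so the labels and values in this sketch are assumptions.

```python
import numpy as np
import pandas as pd

tuples = list(zip(['bar', 'bar', 'baz', 'baz'],
                  ['one', 'two', 'one', 'two']))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = df.stack()            # Series with a three-level index
unstacked = stacked.unstack()   # innermost level moved back to the columns
```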
Pivot Tables
In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
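A pivot table spreads one column's values across a grid defined by two other columns. In this sketch the small frame is an assumption; with no `aggfunc` given, `pd.pivot_table` averages the values that fall into the same cell.

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'B': ['x', 'y', 'x', 'y'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

table = pd.pivot_table(df, values='D', index='A', columns='B')
```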
Time Series
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
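A typical use of such a range is resampling, for example converting per-second data into 5-minute bins. This sketch makes two assumptions: it uses 600 periods so that two full bins appear, and it writes the frequency as lowercase `'s'`, since the uppercase `'S'` alias is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=600, freq='s')     # 600 seconds
ts = pd.Series(np.ones(len(rng), dtype=int), index=rng)    # one event per second

per_5min = ts.resample('5min').sum()    # total events in each 5-minute bin
```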
Getting Data In/Out
CSV
In [141]: df.to_csv('foo.csv')
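A CSV round trip can be sketched as follows. An in-memory buffer stands in for `foo.csv` (an assumption, so the example leaves no file behind); `index_col=0` tells `read_csv` that the first column holds the index written by `to_csv`.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

buffer = io.StringIO()       # stands in for a file path such as 'foo.csv'
df.to_csv(buffer)            # writes the index as the first column
buffer.seek(0)               # rewind before reading
restored = pd.read_csv(buffer, index_col=0)
```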
Excel
In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')