NumPy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
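NumPy is mainly used to store and process large matrices. It is far more efficient than Python's own nested list structure, and it is itself written in C. It is a foundational extension that most other scientific extensions build on. Its data structure is the ndarray, which is usually created in one of three ways:
- converting an existing Python object
- calling NumPy's built-in, factory-style functions: np.arange, np.linspace, ...
- reading from disk, for example with loadtxt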
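The three creation routes listed above can be sketched as follows. This is a minimal illustration rather than the author's original example: the variable names are made up, and an in-memory `io.StringIO` buffer stands in for a real file on disk.

```python
import io

import numpy as np

# 1. Convert an existing Python object (here a nested list).
from_list = np.array([[1, 2, 3], [4, 5, 6]])

# 2. Use built-in factory-style functions.
evenly_stepped = np.arange(0, 10, 2)      # start, stop (exclusive), step
evenly_spaced = np.linspace(0.0, 1.0, 5)  # 5 points from 0 to 1, inclusive

# 3. Read whitespace-separated text; the buffer is a stand-in for a file.
fake_file = io.StringIO("1 2 3\n4 5 6\n")
from_text = np.loadtxt(fake_file)

print(from_list.shape)
print(evenly_stepped)
print(evenly_spaced)
print(from_text)
```

Note that `np.loadtxt` returns floating-point values by default, even when the text contains only integers.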
The Basics
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
In NumPy, dimensions are called axes. The number of axes is the rank.
NumPy’s array class is called ndarray.
The most important attributes of an ndarray are:
- ndarray.ndim
- ndarray.shape
- ndarray.size
- ndarray.dtype
- ndarray.itemsize
- ndarray.data
    import numpy as np
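As a small sketch of the attributes listed above, the following inspects a 3x5 integer array. The array itself is an arbitrary choice for illustration; `ndarray.data` is left out because it exposes the raw buffer and is rarely needed directly.

```python
import numpy as np

# An arbitrary 3x5 array with an explicit dtype so the output is predictable.
a = np.arange(15, dtype=np.int64).reshape(3, 5)

print(a.ndim)      # number of axes: 2
print(a.shape)     # size along each axis: (3, 5)
print(a.size)      # total number of elements: 15
print(a.dtype)     # element type: int64
print(a.itemsize)  # bytes per element: 8
```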
Array Creation
    import numpy as np
Basic Operations
    a = np.array( [20,30,40,50] )
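Arithmetic operators on arrays apply elementwise. The sketch below starts from the same `a` as above; `b`, `A` and `B` are illustrative values chosen here, not taken from the original listing.

```python
import numpy as np

a = np.array([20, 30, 40, 50])
b = np.arange(4)             # array([0, 1, 2, 3])

difference = a - b           # elementwise subtraction
squares = b ** 2             # elementwise power
mask = a < 35                # elementwise comparison gives a boolean array

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])
elementwise = A * B          # elementwise product
matrix_product = A @ B       # matrix product, same as A.dot(B)

print(difference, squares, mask)
print(elementwise)
print(matrix_product)
```

The `*` operator never performs matrix multiplication on ndarrays; use `@` or `dot` for that.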
Universal Functions
    B = np.arange(3)
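Universal functions (ufuncs) such as `np.exp`, `np.sqrt` and `np.add` operate elementwise and return a new array. A minimal sketch, where `C` is an illustrative second operand:

```python
import numpy as np

B = np.arange(3)                  # array([0, 1, 2])
C = np.array([2.0, -1.0, 4.0])    # illustrative values

exponentials = np.exp(B)          # e**0, e**1, e**2
roots = np.sqrt(B)                # 0, 1, sqrt(2)
sums = np.add(B, C)               # same result as B + C

print(exponentials)
print(roots)
print(sums)
```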
Indexing, Slicing and Iterating
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.
Multidimensional arrays can have one index per axis. These indices are given in a tuple separated by commas:
    def f(x,y):
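A self-contained sketch of per-axis indexing follows. The function body and the array shape are assumptions made for illustration, since the listing above is cut off after the `def` line.

```python
import numpy as np

def f(x, y):
    # Assumed body: encode the row index in the tens digit, the column in the units.
    return 10 * x + y

b = np.fromfunction(f, (5, 4), dtype=int)

element = b[2, 3]      # row 2, column 3
column = b[:, 1]       # every row, column 1
rows = b[1:3, :]       # rows 1 and 2, every column
last_row = b[-1]       # missing indices are treated as complete slices

print(b)
print(element, column, last_row)
```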
The dots (...) represent as many colons as needed to produce a complete indexing tuple. For example, if x is a rank 5 array (i.e., it has 5 axes), then x[1,2,...] is equivalent to x[1,2,:,:,:], x[...,3] to x[:,:,:,:,3] and x[4,...,5,:] to x[4,:,:,5,:].
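The three equivalences can be checked directly. The shape `(5, 3, 2, 6, 4)` is an arbitrary choice, picked only so that every index used in the text is in range.

```python
import numpy as np

# A rank 5 array; the shape is arbitrary but large enough for the indices below.
x = np.arange(5 * 3 * 2 * 6 * 4).reshape(5, 3, 2, 6, 4)

same_leading = np.array_equal(x[1, 2, ...], x[1, 2, :, :, :])
same_trailing = np.array_equal(x[..., 3], x[:, :, :, :, 3])
same_middle = np.array_equal(x[4, ..., 5, :], x[4, :, :, 5, :])

print(same_leading, same_trailing, same_middle)
print(x[1, 2, ...].shape, x[..., 3].shape, x[4, ..., 5, :].shape)
```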
    # iteration
Shape Manipulation
    a = np.floor(10*np.random.random((3,4)))
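The usual shape-changing calls can be sketched as below. For reproducibility this uses `np.arange` rather than the random array above; all of these calls return a new array object and leave the shape of `a` untouched.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

flat = a.ravel()             # flattened to 12 elements, row by row
reshaped = a.reshape(6, 2)   # same data, different shape
transposed = a.T             # axes swapped: shape (4, 3)
inferred = a.reshape(2, -1)  # -1 asks NumPy to infer that dimension

print(flat)
print(reshaped.shape, transposed.shape, inferred.shape)
```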
Stacking together different arrays
    a = np.floor(10*np.random.random((2,2)))
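A minimal sketch of stacking along each axis, using fixed values instead of random ones so the result is easy to read:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical = np.vstack((a, b))     # stack rows: shape (4, 2)
horizontal = np.hstack((a, b))   # stack columns: shape (2, 4)

print(vertical)
print(horizontal)
```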
Copies and Views
No Copy at all
    a = np.arange(12)
View or Shallow Copy
    c = a.view()
Slicing an array returns a view of it:
    s = a[ : , 1:3]     # spaces added for clarity; could also be written "s = a[:,1:3]"
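The difference between plain assignment, a view and a slice can be sketched as follows. The array contents are illustrative; the point is that all three names end up sharing one data buffer with `a`.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

b = a               # no copy at all: b is just another name for a
c = a.view()        # a new array object that looks at the same data
s = a[:, 1:3]       # slicing also returns a view

c[0, 3] = 1234      # writing through the view changes a
s[:] = 10           # writing through the slice changes a as well

print(b is a, c is a)
print(np.shares_memory(a, c), np.shares_memory(a, s))
print(a)
```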
Deep Copy
    d = a.copy()                          # a new array object with new data is created
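By contrast with a view, a deep copy owns its data. A short sketch with illustrative values:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
d = a.copy()        # new array object with its own data

d[0, 0] = 9999      # does not affect a

print(a[0, 0], d[0, 0])
print(np.shares_memory(a, d))
```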
pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
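pandas is a tool built on top of NumPy, created to solve data analysis tasks. It incorporates a large number of libraries and some standard data models, and it provides the tools needed to work efficiently with large datasets. It is arguably the most statistics-oriented of these toolkits, and in some respects it is better than R. Its data structures are the one-dimensional Series, the two-dimensional DataFrame (similar to a table in Excel or SQL; as you dig deeper you will find that pandas and SQL have a lot in common, for example the merge function), and the three-dimensional Panel (Pan(el) + da(ta) + s, which is where the name comes from; note that Panel was removed in pandas 0.25). To learn pandas, you need to master:
- summarizing and computing descriptive statistics, handling missing data, hierarchical indexing
- cleaning, transforming, merging, reshaping, and GroupBy techniques
- date and time data types and tools (date handling is remarkably convenient)
This article uses a Python 3 development environment.
Typically, we import the following packages (libraries):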
    In [1]: import pandas as pd
Object Creation
Create a Series by passing a list; pandas creates a default integer index:
    In [4]: s = pd.Series([1,3,5,np.nan,6,8])
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
    In [6]: dates = pd.date_range('20130101', periods=6)
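Put together as a plain script, the two constructions described so far might look like this. The column labels `A` to `D` and the random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A Series from a list gets a default integer index 0..5.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A DataFrame from a NumPy array, with a datetime index and labeled columns.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print(s)
print(df)
```

Because the list contains `np.nan`, the Series is stored as floating point rather than integer.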
Create a DataFrame by passing a dict of objects that can be converted to something series-like:
    In [10]: df2 = pd.DataFrame({ 'A' : 1.,

    In [13]: df2.<TAB>
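Since the listing above is cut off after the first key, here is a complete sketch of a dict-based construction. Every column other than `'A'` is an assumption chosen to show that scalars, Series, arrays and categoricals can be mixed.

```python
import numpy as np
import pandas as pd

# Columns B to F are illustrative; only 'A' appears in the truncated listing.
df2 = pd.DataFrame({
    "A": 1.0,                                                  # scalar, repeated on every row
    "B": pd.Timestamp("20130102"),                             # scalar timestamp
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # a Series
    "D": np.array([3] * 4, dtype="int32"),                     # a NumPy array
    "E": pd.Categorical(["test", "train", "test", "train"]),   # categorical data
    "F": "foo",                                                # scalar string
})

print(df2)
print(df2.dtypes)
```

Each column keeps its own dtype, which is the main difference from a plain two-dimensional ndarray.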
Viewing Data
    In [14]: df.head()
Selection
Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc, .iloc and .ix. (.ix was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.)
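A compact sketch of the label-based and position-based accessors named in the note follows. The frame is filled with `np.arange` values (an assumption, so results are predictable), and `.ix` is left out because it no longer exists in current pandas.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based access: slices include both endpoints.
row_by_label = df.loc[dates[0]]
block_by_label = df.loc["20130102":"20130104", ["A", "B"]]
scalar_by_label = df.at[dates[0], "A"]

# Position-based access: slices exclude the end, as in plain Python.
row_by_position = df.iloc[3]
block_by_position = df.iloc[3:5, 0:2]
scalar_by_position = df.iat[1, 1]

print(block_by_label)
print(block_by_position)
```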
Getting
    In [23]: df['A']
Selecting by Label
    In [26]: df.loc[dates[0]]
Selecting by Position
    In [32]: df.iloc[3]
Boolean Index
    In [39]: df[df.A > 0]
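Boolean indexing can filter whole rows or mask individual cells. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, -3.0, 4.0],
                   "B": [1.0, -2.0, 3.0, -4.0]})

positive_a = df[df["A"] > 0]   # keep only the rows where column A is positive
masked = df[df > 0]            # keep the shape, replace failing cells with NaN

print(positive_a)
print(masked)
```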
Setting
    In [45]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
Missing Data
    In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

    In [58]: df1.dropna(how='any')

    In [59]: df1.fillna(value=5)

    In [60]: pd.isnull(df1)
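The four calls above can be run end to end on a small frame. The data and the partially filled `E` column are assumptions for illustration; `pd.isna` is used here and is equivalent to `pd.isnull`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=dates, columns=["A", "B"])

# Reindexing adds a new column E, filled with NaN, and keeps the first 3 rows.
df1 = df.reindex(index=dates[0:3], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1   # fill E for the first two rows only

dropped = df1.dropna(how="any")       # drop rows containing any missing value
filled = df1.fillna(value=5)          # replace missing values with 5
missing_mask = pd.isna(df1)           # boolean mask of missing cells

print(dropped)
print(filled)
print(missing_mask)
```

None of these calls modifies `df1` in place; each returns a new object.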
Operations
Stats
    In [61]: df.mean()

    In [63]: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
Apply
    In [66]: df.apply(np.cumsum)
Histogramming
    In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
String Methods
    In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
Merge
Concat
    In [73]: df = pd.DataFrame(np.random.randn(10, 4))
Join
    In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

    In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
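The two `left` frames above differ in whether the key is duplicated, which changes the size of a SQL-style join. In the sketch below the `right` frames and the variable names are assumptions, since only the `left` side appears in the listings.

```python
import pandas as pd

# Duplicated key on both sides: every left row pairs with every right row.
left_dup = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right_dup = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
many_to_many = pd.merge(left_dup, right_dup, on="key")

# Unique keys: one output row per matching key.
left_unique = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_unique = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
one_to_one = pd.merge(left_unique, right_unique, on="key")

print(many_to_many)
print(one_to_one)
```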
Append
    In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
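`DataFrame.append` was removed in pandas 2.0, so on current versions rows are appended with `pd.concat` instead. The sketch below swaps in that technique and uses `np.arange` data for predictability; it is not the original listing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=["A", "B", "C", "D"])

# Append a copy of row 3 to the end; double brackets keep it a DataFrame.
row = df.iloc[[3]]
appended = pd.concat([df, row], ignore_index=True)

# Concatenation also reassembles a frame that was split into pieces.
pieces = [df[:3], df[3:7], df[7:]]
reassembled = pd.concat(pieces)

print(appended)
```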
Grouping
    In [91]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
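Grouping follows a split, apply, combine pattern. Because the listing above is truncated, the frame below is a small stand-in with assumed columns `B`, `C` and `D`.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],   # assumed second key
    "C": [1, 2, 3, 4],                   # assumed numeric columns
    "D": [10, 20, 30, 40],
})

# Split by one key, then sum the numeric columns in each group.
by_a = df.groupby("A")[["C", "D"]].sum()

# Grouping by several keys yields a hierarchical index.
by_a_and_b = df.groupby(["A", "B"])[["C", "D"]].sum()

print(by_a)
print(by_a_and_b)
```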
Reshaping
Stack
    In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
Pivot Tables
    In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
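A pivot table aggregates one column over row and column keys, taking the mean by default. The frame below is an assumed stand-in for the truncated listing, kept small enough to verify by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "two"],
    "B": ["x", "x", "y", "x", "y"],
    "D": [1.0, 3.0, 2.0, 3.0, 4.0],
})

# Rows come from A, columns from B, and cells hold the mean of D.
table = pd.pivot_table(df, values="D", index="A", columns="B")

print(table)
```

The cell for `("one", "x")` averages the two matching rows, 1.0 and 3.0.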
Time Series
    In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
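Resampling converts a series to a coarser frequency. This sketch uses minute-level data and 5-minute bins rather than the second-level range above, so the totals are easy to check; the values are illustrative.

```python
import numpy as np
import pandas as pd

# Ten observations, one per minute.
rng = pd.date_range("1/1/2012", periods=10, freq="min")
ts = pd.Series(np.arange(10), index=rng)

# Sum the observations that fall into each 5-minute bin.
five_minute_totals = ts.resample("5min").sum()

print(five_minute_totals)
```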
Getting Data In/Out
CSV
    In [141]: df.to_csv('foo.csv')
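A CSV round trip can be sketched without touching the file system. Here an `io.StringIO` buffer stands in for `foo.csv`, and the frame contents are made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3.5, 4.5]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)

# Rewind and read back; the first CSV column holds the saved index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without `index_col=0`, the saved index would come back as an extra unnamed data column.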
Excel
    In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')