Some recent competitions, such as Kaggle's TalkingData AdTracking challenge whose ~185-million-row train.csv is used in the examples below, ship datasets far larger than a typical machine's memory. The tips below help pandas cope with data at that scale.
1. Delete unused variables promptly and garbage-collect
Feature engineering usually involves a large number of transformation steps that leave behind many intermediate variables. Deleting a variable with del only drops the name binding, so besides using del as soon as a variable is no longer needed, it is worth calling gc.collect() to force any leftover garbage (e.g. reference cycles) to be reclaimed:
import gc
import pandas as pd

temp = pd.read_csv('../input/train_sample.csv')
# do something to the file (placeholder transformation)
temp['os'].describe()
# delete when no longer needed
del temp
# collect residual garbage
gc.collect()
2. Pre-define the data types
pandas will infer the data types on its own, but it leans toward the memory-hungry ones (int64 for every integer column). As the example below shows, pre-defining the dtypes saves more than half the memory.
dtypes = {
    'ip': 'uint32',
    'app': 'uint16',
    'device': 'uint16',
    'os': 'uint16',
    'channel': 'uint16',
    'is_attributed': 'uint8',
}
dtypes2 = {
    'ip': 'int32',
    'app': 'int16',
    'device': 'int16',
    'os': 'int16',
    'channel': 'int16',
    'is_attributed': 'int8',
}

train_sample_file = '../input/train_sample.csv'

# inferred dtypes
train = pd.read_csv(train_sample_file, parse_dates=['click_time'])
# check datatypes:
train.info()

# pre-defined unsigned integer dtypes
train = pd.read_csv(train_sample_file, dtype=dtypes, parse_dates=['click_time'])
# check datatypes:
train.info()

# pre-defined signed integer dtypes
train = pd.read_csv(train_sample_file, dtype=dtypes2, parse_dates=['click_time'])
# check datatypes:
train.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip                 100000 non-null int64
app                100000 non-null int64
device             100000 non-null int64
os                 100000 non-null int64
channel            100000 non-null int64
click_time         100000 non-null datetime64[ns]
attributed_time    227 non-null object
is_attributed      100000 non-null int64
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 6.1+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip                 100000 non-null uint32
app                100000 non-null uint16
device             100000 non-null uint16
os                 100000 non-null uint16
channel            100000 non-null uint16
click_time         100000 non-null datetime64[ns]
attributed_time    227 non-null object
is_attributed      100000 non-null uint8
dtypes: datetime64[ns](1), object(1), uint16(4), uint32(1), uint8(1)
memory usage: 2.8+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip                 100000 non-null int32
app                100000 non-null int16
device             100000 non-null int16
os                 100000 non-null int16
channel            100000 non-null int16
click_time         100000 non-null datetime64[ns]
attributed_time    227 non-null object
is_attributed      100000 non-null int8
dtypes: datetime64[ns](1), int16(4), int32(1), int8(1), object(1)
memory usage: 2.8+ MB
'''
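If a table has already been loaded with the inferred dtypes, a similar saving can still be had after the fact. A minimal sketch, not from the original post, assuming the train sample from above: pd.to_numeric with the downcast option picks the smallest integer type that fits each column.

# downcast already-loaded integer columns (sketch; assumes `train` from above)
for col in ['ip', 'app', 'device', 'os', 'channel', 'is_attributed']:
    train[col] = pd.to_numeric(train[col], downcast='unsigned')
train.info()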
3. Use only specified rows of the csv file
a) Specify a row count
Specify it directly with nrows.
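For instance, a minimal sketch that keeps just the first million data rows of the full file:

# read only the first 1,000,000 data rows; the header is still used for column names
train = pd.read_csv('../input/train.csv', nrows=1000000)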
b) Skip rows
For example, skip the first 5,000,000 data rows and take the next 1,000,000. The range below starts at 1 so that the header line is kept:
train = pd.read_csv('../input/train.csv',
                    skiprows=range(1, 5000001),
                    nrows=1000000)
c) Sampling
import subprocess

# count the lines in each file with wc -l
print('# Line count:')
for file in ['../input/train.csv', '../input/test.csv', '../input/train_sample.csv']:
    lines = subprocess.run(['wc', '-l', file], stdout=subprocess.PIPE).stdout
    print(lines.decode('utf-8'), end='')

'''
# Line count:
184903891 ../input/train.csv
18790470 ../input/test.csv
100001 ../input/train_sample.csv
'''
train.csv has 184,903,891 lines in total (184,903,890 data rows plus the header), far too many to load at once, so we draw a random sample of 1,000,000 rows by skipping all the others:
import numpy as np

lines = 184903891  # total line count of train.csv, including the header

# generate list of lines to skip (line 0, the header, is never skipped)
skiplines = np.random.choice(np.arange(1, lines), size=lines - 1 - 1000000, replace=False)

# sort the list
skiplines = np.sort(skiplines)

# check our list
print('lines to skip:', len(skiplines))
print('lines to keep:', lines - 1 - len(skiplines))

###################SANITY CHECK###################
# find lines that weren't skipped by checking difference between each consecutive line
# how many out of first 100000 will be imported into the csv?
diff = skiplines[1:100000] - skiplines[0:99999]
remain = int((diff - 1).sum())  # kept lines falling between the first 100000 skipped lines
print('unskipped lines among the first 100000 skipped:', remain)
print('expected keep ratio:', 1000000 / (lines - 1))

train = pd.read_csv('../input/train.csv', skiprows=skiplines)
train.head()

del skiplines
gc.collect()
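As an aside not in the original post, skiprows also accepts a callable in recent pandas versions, which avoids materializing the huge skiplines array; a sketch that keeps the header and roughly the same fraction of rows at random:

# keep line 0 (the header); keep each data row with probability 1000000/184903890
train = pd.read_csv('../input/train.csv',
                    skiprows=lambda i: i > 0 and np.random.random() > 1000000 / 184903890)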
4. Use pandas' iterator to process the file in chunks
Here we use np.where to filter out the rows where 'is_attributed' is 0 (keeping, for example, only the rare positive samples), processing the file one chunk at a time:
# set up an empty dataframe
df_converted = pd.DataFrame()

# we are going to work with chunks of size 1 million rows
chunksize = 10 ** 6

# in each chunk, filter for values that have 'is_attributed'==1,
# and merge these values into one dataframe
for chunk in pd.read_csv('../input/train.csv', chunksize=chunksize,
                         dtype=dtypes, parse_dates=['click_time']):
    filtered = (chunk[(np.where(chunk['is_attributed'] == 1, True, False))])
    df_converted = pd.concat([df_converted, filtered], ignore_index=True)
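A side note on the design: the np.where wrapper is not strictly needed, since a plain boolean mask selects the same rows, and because positives are rare (only 227 of the 100,000 sample rows above have a non-null attributed_time) the accumulated df_converted stays small even though it is distilled from the full ~185M-row file:

# equivalent row filter without np.where
filtered = chunk[chunk['is_attributed'] == 1]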
5. Load only some of the columns
# wanted columns
columns = ['ip', 'click_time', 'is_attributed']
dtypes = {
    'ip': 'uint32',
    'is_attributed': 'uint8',
}

ips_df = pd.read_csv('../input/train.csv', usecols=columns, dtype=dtypes)
print(ips_df.info())
ips_df.head()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184903890 entries, 0 to 184903889
Data columns (total 3 columns):
ip               uint32
click_time       object
is_attributed    uint8
dtypes: object(1), uint32(1), uint8(1)
memory usage: 2.2+ GB
None
'''
6. Combine the methods above creatively
For example, when the whole dataset is too large to groupby in one go, the groupby can be done in batches and the partial results merged:
size = 100000000
all_rows = len(ips_df)
num_parts = all_rows // size

# generate the first batch
ip_sums = ips_df[0:size][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()

# add remaining batches
for p in range(1, num_parts + 1):
    start = p * size
    end = p * size + size
    if end < all_rows:
        group = ips_df[start:end][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
    else:
        group = ips_df[start:][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
    ip_sums = ip_sums.merge(group, on='ip', how='outer')
    ip_sums.columns = ['ip', 'sum1', 'sum2']
    # an ip may be missing from one of the two batches, hence nansum
    ip_sums['conversions_per_ip'] = np.nansum((ip_sums['sum1'], ip_sums['sum2']), axis=0)
    ip_sums.drop(columns=['sum1', 'sum2'], inplace=True)
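The batched aggregate can then be joined back onto a manageable sample as a feature; a one-line sketch, assuming the train sample and the ip_sums frame from above:

# attach the per-ip conversion sums to the training sample
train = train.merge(ip_sums, on='ip', how='left')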
7. Use dask instead of pandas
import dask
import dask.dataframe as dd
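A minimal sketch of the payoff, not from the original post: dask mirrors the pandas API but evaluates lazily, reading the csv in blocks and only materializing a result when .compute() is called, so an aggregate over the full file fits in memory (reusing the dtypes dict from tip 2):

# lazily read the full csv, then compute the per-ip conversion sums
ddf = dd.read_csv('../input/train.csv', dtype=dtypes)
conversions_per_ip = ddf.groupby('ip')['is_attributed'].sum().compute()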