Tips for Handling Large Datasets

Some recent competitions, such as TalkingData AdTracking Fraud Detection (whose click data is used in the examples below), come with datasets far too large to load comfortably into memory. Here are a few tips for working with them.

1. Delete unused variables promptly and run garbage collection

Feature engineering usually involves a lot of transformation steps that produce many intermediate variables. Besides doing transformations in place where possible, delete variables with del as soon as they are no longer needed and call gc.collect() to reclaim the memory:

import gc
import pandas as pd

temp = pd.read_csv('../input/train_sample.csv')

#do something to the file
temp['click_time'] = pd.to_datetime(temp['click_time'])

#delete when no longer needed
del temp

#collect residual garbage
gc.collect()

2. Predefine the data types

pandas will generally infer the data types on its own, but it tends to choose types that use more space than necessary. As the example below shows, predefining the data types saves more than half of the memory.

dtypes = {
    'ip'            : 'uint32',
    'app'           : 'uint16',
    'device'        : 'uint16',
    'os'            : 'uint16',
    'channel'       : 'uint16',
    'is_attributed' : 'uint8',
}

dtypes2 = {
    'ip'            : 'int32',
    'app'           : 'int16',
    'device'        : 'int16',
    'os'            : 'int16',
    'channel'       : 'int16',
    'is_attributed' : 'int8',
}

train_sample_file = '../input/train_sample.csv'

#default types inferred by pandas
train = pd.read_csv(train_sample_file, parse_dates=['click_time'])

#check datatypes:
train.info()

#unsigned integer types
train = pd.read_csv(train_sample_file, dtype=dtypes, parse_dates=['click_time'])

#check datatypes:
train.info()

#signed integer types
train = pd.read_csv(train_sample_file, dtype=dtypes2, parse_dates=['click_time'])

#check datatypes:
train.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip 100000 non-null int64
app 100000 non-null int64
device 100000 non-null int64
os 100000 non-null int64
channel 100000 non-null int64
click_time 100000 non-null datetime64[ns]
attributed_time 227 non-null object
is_attributed 100000 non-null int64
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 6.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip 100000 non-null uint32
app 100000 non-null uint16
device 100000 non-null uint16
os 100000 non-null uint16
channel 100000 non-null uint16
click_time 100000 non-null datetime64[ns]
attributed_time 227 non-null object
is_attributed 100000 non-null uint8
dtypes: datetime64[ns](1), object(1), uint16(4), uint32(1), uint8(1)
memory usage: 2.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip 100000 non-null int32
app 100000 non-null int16
device 100000 non-null int16
os 100000 non-null int16
channel 100000 non-null int16
click_time 100000 non-null datetime64[ns]
attributed_time 227 non-null object
is_attributed 100000 non-null int8
dtypes: datetime64[ns](1), int16(4), int32(1), int8(1), object(1)
memory usage: 2.8+ MB
'''
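
If the data has already been loaded with the default types, the columns can also be downcast after the fact. The following is a minimal sketch, not part of the original post; it assumes the integer columns are non-negative, as they are here:

#downcast already-loaded numeric columns to the smallest type that fits the data
for col in train.select_dtypes(include=['int64']).columns:
    train[col] = pd.to_numeric(train[col], downcast='unsigned')   #assumes non-negative values
for col in train.select_dtypes(include=['float64']).columns:
    train[col] = pd.to_numeric(train[col], downcast='float')

train.info()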

3. Load only selected rows of the csv file

a) Specify the number of rows

Specify it directly with nrows:

#read only the first 1,000,000 rows
train = pd.read_csv('../input/train.csv', nrows=1000000, dtype=dtypes, parse_dates=['click_time'])

b) Skip rows

For example, skip the first 5,000,000 rows and read the next 1,000,000; the code below keeps the header row (row 0) so the column names are preserved:

#skip the first 5,000,000 data rows but keep the header, then read 1,000,000 rows
train = pd.read_csv('../input/train.csv', skiprows=range(1, 5000001), nrows=1000000,
                    dtype=dtypes, parse_dates=['click_time'])

c) Random sampling

import subprocess

#count the lines of each input file with wc -l
print('# Line count:')
for file in ['train.csv', 'test.csv', 'train_sample.csv']:
    lines = subprocess.run(['wc', '-l', '../input/' + file], stdout=subprocess.PIPE).stdout.decode('utf-8')
    print(lines, end='')
'''
# Line count:
184903891 ../input/train.csv
18790470 ../input/test.csv
100001 ../input/train_sample.csv
'''

train.csv has 184,903,890 data rows in total (184,903,891 lines including the header). To draw a random sample, generate a list of row numbers to skip and pass it to skiprows:

import numpy as np

lines = 184903891          #total number of lines in train.csv, including the header
sample_size = 1000000      #number of rows to keep in the sample

#generate list of lines to skip
skiplines = np.random.choice(np.arange(1, lines), size=lines - 1 - sample_size, replace=False)

#sort the list
skiplines = np.sort(skiplines)

#check our list
print('number of lines to skip:', len(skiplines))
print('first skipped lines:', skiplines[:5])

###################SANITY CHECK###################
#find lines that weren't skipped by checking difference between each consecutive line
#how many out of first 100000 will be imported into the csv?
diff = skiplines[1:100000] - skiplines[0:99999]
remain = np.sum(diff > 1)
print('rows kept out of the first ~100000 lines:', remain)
print('overall sampling ratio:', sample_size / (lines - 1))

train = pd.read_csv('../input/train.csv', skiprows=skiplines, dtype=dtypes, parse_dates=['click_time'])
train.head()

del skiplines
gc.collect()
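
If the pandas version in use supports a callable for skiprows, the large skiplines array can be avoided altogether. The sketch below (not from the original post) keeps the header and roughly 0.5% of the data rows at random; the ratio is just an example:

import random

sample_ratio = 0.005   #keep roughly 0.5% of the data rows
train = pd.read_csv('../input/train.csv',
                    skiprows=lambda i: i > 0 and random.random() > sample_ratio,
                    dtype=dtypes, parse_dates=['click_time'])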

4. Use pandas chunked reading and process the data chunk by chunk

Here we read the file in chunks and use np.where to filter out the rows where 'is_attributed' is 0, keeping only the converted clicks and merging them into one dataframe:

#set up an empty dataframe
df_converted = pd.DataFrame()

#we are going to work with chunks of size 1 million rows
chunksize = 10 ** 6

#in each chunk, filter for values that have 'is_attributed'==1, and merge these values into one dataframe
for chunk in pd.read_csv('../input/train.csv', chunksize=chunksize, dtype=dtypes, parse_dates=['click_time']):
    filtered = (chunk[(np.where(chunk['is_attributed'] == 1, True, False))])
    df_converted = pd.concat([df_converted, filtered], ignore_index=True)
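
As a side note, the same filter can be written with plain boolean indexing, and collecting the filtered chunks in a list before a single concat avoids repeatedly copying the growing dataframe. A sketch under the same assumptions (file path and dtypes) as above:

filtered_chunks = []
for chunk in pd.read_csv('../input/train.csv', chunksize=chunksize,
                         dtype=dtypes, parse_dates=['click_time']):
    #keep only the converted clicks in this chunk
    filtered_chunks.append(chunk[chunk['is_attributed'] == 1])
df_converted = pd.concat(filtered_chunks, ignore_index=True)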

5. Load only selected columns

#wanted columns
columns = ['ip', 'click_time', 'is_attributed']
dtypes = {
    'ip'            : 'uint32',
    'is_attributed' : 'uint8',
}

ips_df = pd.read_csv('../input/train.csv', usecols=columns, dtype=dtypes)
print(ips_df.info())
ips_df.head()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184903890 entries, 0 to 184903889
Data columns (total 3 columns):
ip uint32
click_time object
is_attributed uint8
dtypes: object(1), uint32(1), uint8(1)
memory usage: 2.2+ GB
None'''

6. Combine several methods and process the data creatively

For example, if the whole dataset is too large to groupby in one go, do the groupby in batches and then merge the partial results:

size = 100000              #rows per batch (an arbitrary choice)
all_rows = len(ips_df)
num_parts = all_rows // size

#generate the first batch
ip_sums = ips_df[0:size][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()

#add remaining batches (the last one may be shorter than size)
for p in range(1, num_parts + 1):
    start = p * size
    end = p * size + size
    if end < all_rows:
        group = ips_df[start:end][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
    else:
        group = ips_df[start:][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
    ip_sums = ip_sums.merge(group, on='ip', how='outer')
    ip_sums.columns = ['ip', 'sum1', 'sum2']
    ip_sums['conversions'] = np.nansum((ip_sums['sum1'], ip_sums['sum2']), axis=0)
    ip_sums.drop(columns=['sum1', 'sum2'], inplace=True)
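
The same result can be reached with an equivalent two-pass pattern: groupby each batch, then groupby the concatenated partial sums once more. A compact sketch (not from the original post; ips_df, size and all_rows as defined above):

#first pass: partial sums per batch; second pass: combine the partial sums
partials = [ips_df[i:i + size][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
            for i in range(0, all_rows, size)]
ip_sums = pd.concat(partials).groupby('ip', as_index=False).sum()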

7. Use dask instead of pandas

import dask
import dask.dataframe as dd
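
dask.dataframe mirrors much of the pandas API but evaluates lazily and out of core, so the full train.csv never has to fit in memory at once. A minimal sketch follows; the groupby shown is only an example, with dtypes as defined earlier:

ddf = dd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['click_time'])

#lazy computation: nothing is read until .compute() is called
conversions_per_ip = ddf.groupby('ip')['is_attributed'].sum()
print(conversions_per_ip.compute().head())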
