
char-rnn-chinese

This post runs experiments based on Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch.

#Preparation

As the original README says, "This code is written in Lua and requires Torch. Additionally, you need to install the nngraph and optim packages using LuaRocks", so install the following dependencies.


##Installing Torch

Install Torch with the following commands:

```
cd ~/
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; ./install.sh
```

Then refresh your shell environment:

```
source ~/.bashrc
```

When the following screen appears, Torch has been installed successfully!

![](/images/2016/05/Screenshot from 2016-05-26 19-51-17.png)
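In case the screenshot does not load: the screen in question is simply the Torch REPL banner. A quick check (standard Torch behavior, my addition rather than something from the post) is to launch the interpreter:

```
# should print the Torch banner and drop into an interactive prompt
th
```

Exit the prompt with os.exit() or Ctrl+D.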

##Installing Lua

```
sudo apt-get install lua5.2
```

##Installing the other dependencies

Use LuaRocks to install nngraph and optim:

```
luarocks install nngraph
luarocks install optim
```

First, LuaRocks itself has to be installed. I hit problems at the config step; the posts 安装Luarocks and linux下lua开发环境安装 were useful references.
You may find that Lua is installed yet the build still complains that it cannot find lua.h; this is likely because liblua5.1-0-dev also needs to be installed, as shown below.
After installing luarocks with apt-get, however, installing nngraph still errored out and needed fixing.
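A sketch of that header fix (the package name is taken from the note above; whether it matches your Lua version is an assumption):

```
# install the Lua 5.1 development headers that provide lua.h
sudo apt-get install liblua5.1-0-dev
```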

==In fact, the luarocks bundled with Torch is all you need==:

```
sudo ~/torch/install/bin/luarocks install nngraph
sudo ~/torch/install/bin/luarocks install optim
```

Because this machine only has Intel integrated graphics, I planned to compute on the CPU only, so I did not install CUDA.

#Running the experiments

##karpathy's example, CPU version

###Training

Run th train.lua --help to see what each option does:

```
Options
  -data_dir                  data directory. Should contain the file input.txt with input data [data/tinyshakespeare] (the training corpus)
  -min_freq                  min frequent of character [0]
  -rnn_size                  size of LSTM internal state [128]
  -num_layers                number of layers in the LSTM [2]
  -model                     for now only lstm is supported. keep fixed [lstm]
  -learning_rate             learning rate [0.002]
  -learning_rate_decay       learning rate decay [0.97]
  -learning_rate_decay_after in number of epochs, when to start decaying the learning rate [10]
  -decay_rate                decay rate for rmsprop [0.95]
  -dropout                   dropout for regularization, used after each RNN hidden layer. 0 = no dropout [0]
  -seq_length                number of timesteps to unroll for [50]
  -batch_size                number of sequences to train on in parallel [50]
  -max_epochs                number of full passes through the training data [50]
  -grad_clip                 clip gradients at this value [5]
  -train_frac                fraction of data that goes into train set [0.95]
  -val_frac                  fraction of data that goes into validation set [0.05]
  -init_from                 initialize network parameters from checkpoint at this path []
  -seed                      torch manual random number generator seed [123]
  -print_every               how many steps/minibatches between printing out the loss [1]
  -eval_val_every            every how many iterations should we evaluate on validation data? [2000]
  -checkpoint_dir            output directory where checkpoints get written [cv]
  -savefile                  filename to autosave the checkpont to. Will be inside checkpoint_dir/ [lstm]
  -accurate_gpu_timing       set this flag to 1 to get precise timings when using GPU. Might make code bit slower but reports accurate timings. [0]
  -gpuid                     which gpu to use. -1 = use CPU [0]
  -opencl                    use OpenCL (instead of CUDA) [0]
  -use_ss                    whether use scheduled sampling during training [1]
  -start_ss                  start amount of truth data to be given to the model when using ss [1]
  -decay_ss                  ss amount decay rate of each epoch [0.005]
  -min_ss                    minimum amount of truth data to be given to the model when using ss [0.9]
```

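To illustrate these options (the specific values below are my own picks, not from the post), a run with a larger network and dropout might look like:

```
th train.lua -data_dir data/tinyshakespeare -rnn_size 256 -num_layers 2 \
    -dropout 0.5 -seq_length 50 -batch_size 50 -gpuid -1
```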
Following the README on GitHub, I trained on the corpus shipped with the repo:

```
th train.lua -data_dir data/tinyshakespeare/shakespeare_input.txt -gpuid -1
```

This failed:

```
 th train.lua -data_dir data/tinyshakespeare/shakespeare_input.txt -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/tinyshakespeare/shakespeare_input.txt/input.txt...
loading text file...
/home/frank/torch/install/bin/luajit: cannot open <data/tinyshakespeare/shakespeare_input.txt/input.txt> in mode r  at /home/frank/torch/pkg/torch/lib/TH/THDiskFile.c:649
stack traceback:
[C]: at 0x7f9c42473540
[C]: in function 'DiskFile'
./util/CharSplitLMMinibatchLoader.lua:201: in function 'text_to_tensor'
./util/CharSplitLMMinibatchLoader.lua:38: in function 'create'
train.lua:118: in main chunk
[C]: in function 'dofile'
...rank/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70
```

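The stack trace reveals the cause: the loader appends /input.txt to whatever -data_dir points at, so -data_dir must be the directory containing input.txt, not the text file itself. A corrected invocation (my inference from the error message, not a command from the original post) would be:

```
th train.lua -data_dir data/tinyshakespeare -gpuid -1
```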
Since this repo is a Chinese author's rewrite of karpathy's original char-rnn, I figured karpathy's original tutorial might be easier to follow at this point. As a sanity check, I ran:

```
th train.lua -gpuid -1
```

which trains the example on the CPU without specifying any other options.

Training started at 15:42:

```
 th train.lua -gpuid -1
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19803724, grad/param norm = 5.1721e-01, time/batch = 2.3129s
2/21150 (epoch 0.005), train_loss = 3.93712133, grad/param norm = 1.4679e+00, time/batch = 2.3114s
3/21150 (epoch 0.007), train_loss = 3.43764434, grad/param norm = 9.5800e-01, time/batch = 2.3022s
4/21150 (epoch 0.009), train_loss = 3.41313742, grad/param norm = 7.5143e-01, time/batch = 2.5311s
5/21150 (epoch 0.012), train_loss = 3.33707270, grad/param norm = 6.9269e-01, time/batch = 2.4913s
```

By around iteration 300, time/batch had stabilized near 2.3 s. At 21,150 iterations total, training this 1 MB example on the CPU therefore works out to 21,150 × 2.3 s ≈ 13.5 hours, call it 14!
Training finished at 08:24 the next morning:

```
21148/21150 (epoch 49.995), train_loss = 1.53254314, grad/param norm = 5.9157e-02, time/batch = 2.8658s
21149/21150 (epoch 49.998), train_loss = 1.50882624, grad/param norm = 5.7123e-02, time/batch = 2.8737s
decayed learning rate by a factor 0.97 to 0.00057368183755432
evaluating loss over split index 2
saving checkpoint to cv/lm_lstm_epoch50.00_1.3568.t7
21150/21150 (epoch 50.000), train_loss = 1.46142484, grad/param norm = 5.9032e-02, time/batch = 2.8834s
```

###Sampling
Check the help:

```
th sample.lua --help
Usage: /home/frank/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th [options] <model>

Sample from a character-level language model

Options
  <model>      model checkpoint to use for sampling
  -seed        random number generator's seed [123]
  -sample       0 to use max at each timestep, 1 to sample at each timestep [1]
  -primetext   used as a prompt to "seed" the state of the LSTM using a given sequence, before we sample. []
  -length      max number of characters to sample [2000] (number of characters to sample; 2000 is the default)
  -temperature temperature of sampling [1]
  -gpuid       which gpu to use. -1 = use CPU [0] (should match the setting used during training)
  -verbose     set to 0 to ONLY print the sampled text, no diagnostics [1]
  -stop        stop sampling when detected [




]
```

A trial run first (the checkpoint filename encodes the epoch and the validation loss):

```
th sample.lua cv/lm_lstm_epoch50.00_1.3568.t7 -gpuid -1
```

It generated the following text:

```
giff,
Some sweet amends, aasher, had therein, not he had wot on
man's friends, for her own blow: for my men's
ackingly knight, it cannot hear upon yield.

GLOUCESTER:
How now again?
i
March, that's my arm determitness,
The temper of the vrowal; ere from the grove
Tut wilh 'goom'd Carulea.
'Mwailich, tne shy had lost in When Ied the way
The bower to a late his body, grim on you:
His opicious shames a booy, infairs,'
From her, I tell you, ay, as we mean him
Dear 'tis a giving o' thur back's empass'd,
That nrost, I'll havk him cume thee for 't.

LAONTES:
'Twi'l love thy sronowing.

VALYRAN:
Beord mocdoch him for thy follight
snn
hours,
But thank yours lodkes, my good journeding,
His jealousisposour thee are both abomish
That noom that's easembelland. Camest, sir, more
kia; one, in this highty be the un
Since of a gournor on thy friendshall swow
Some painon; and I, and lord, the  at the kins
Wise rit hable surliments. Shd, believh gone.

voisted tleace:
Tock him what all you di turn up to celent
To my sistinge. Frranch, good night, your child, so fatus;
Aor he shall be my trueking:
Come on my quarrel of the way:
Methinks the letters; for this ctome-steers
Tad mousd my smodered pouncy to
haw up another sense tlays underttry
Tut bonscuration fair all purpose,
then be vesegt me: do not, yet rustle cannot,
But for thy mustered a dust, let me
Tncerfact me tresmer of his father:
therefore by hanging,
ANd
Ays, my lord: you do here in coumisant.

LORD:
How lond the brown!
So majp me; bonch, smmily  lovely blotters,
When Ie my hoeaty threat and virlume these things,
Make fasting garlands dfar the sack'd my servictught
Not knows the crowns: one air, Aumerle,
Ere wear not so nour Bidagle? What Aphark is fury
Tld meens them, faireyou consides to no more
Ihis wantond frown and pollitueser'd city.
Can should put him more recounders to impudesnt poison on
thet hour from hunt to Rame, supp to bere
Flowerd and his friend is une dewn ao pirt,
You know by join'd guilty, whathout we e.

ANd
Ays, my lord: you do here in coumisant.

LORD:
How lond the brown!
So majp me; bonch, smmily  lovely blotters,
When Ie my hoeaty threat and virlume these things,
Make fasting garlands dfar the sack'd my servictught
Not knows the crowns: one air, Aumerle,
Ere wear not so nour Bidagle? What Aphark is fury
Tld meens them, faire
```

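Going by the option list above, the sampler can also be steered with -primetext and -temperature; a sketch (the prompt and values are my own illustration, not a run from the post):

```
th sample.lua cv/lm_lstm_epoch50.00_1.3568.t7 -gpuid -1 \
    -primetext "KING LEAR:" -temperature 0.8 -length 500
```

Lower temperatures give more conservative, repetitive text; higher ones give more varied but noisier text.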
##karpathy's example, GPU version

I used the same commands as the CPU version, only training with th train.lua -gpuid 0. Sampling the resulting checkpoint with

```
th sample.lua cv/lm_lstm_epoch50.00_1.3622.t7 -gpuid 0
```

gave the following sample:

```
--------------------------
ge
I prithee; nor in the day of all report?
Nou shall be you that lances stoson quarrel:
We mits most stone, 'aim, upeed his forticent
Ahen I was, yet for her, but of lovels Clarencel
And past her got and father there, one weep
In mine  wroth nightlys, smileed. Cantleye An York,
A druls marriage, that though Service hated me
Their duen of iflock, we are here in berieved
and more than tanise them with suterept
He rt in some pickle.

SePSERTER:
Or I have safe all thou depustere andthe before him.

FRIAR LAURENCE:
What art my government, cosbude, and hence; if Onfrawn? provest tor my duty?

CATESBY:

KARIANA:
My love noe is are  with t herman and his,
It should she well deeauring our consent:
They hang me bointed on the king, let so two
Nature by my sighsing pleasing 'jabe
That leaven and grue, at her Richard's blood.
More ends it likipenortnive, nor each of him
ic.

SLY:
How dachors, Richmend, henr dack but like it?
Be long, anon since your kingdom and us,
And we that aver el
aunter, my eee to toucurt tomends.
It this her great fawn's birds,, sir! you'er head.

PAULINA:
Upternalt cost of his hands for their tricks my father,
Who ts it most seunt to live te and she were all.
-kill
O to thy son os shall not on your childrs,
one next, for she did formly consixent
Above, my life, and wew me worthy deeming tvenge!
My mustere be exploience, aot come n leave where ahe knees in.
dear, thus wild up tilt on the county, hath be one.
See this sword of thee with the deepito man,
For sunier ene first sears. Where's turn on to be.
Unctious blunlest terrocate doves
Trades Marcius aines of hlends
My's learth an old--ay.

LEONTES:
Marcius?

PRONVO:
You would no gue.

VOLUMNIA:
Oovine s fetch to tight, thou must but loods.

HASTINGS:
And was with her, nor yonder to be sworn,
What are you allady that I should have purpose.
What men  revenge is a well patient
And who seth sxoleng to knowled ed to myself;
And married me in the joy:
So reve I made to find me speak,
how he been tou.

PETCA:
```

##Experiments with a Shui Hu Zhuan (Water Margin) corpus

###CPU version

Rename the downloaded copy of Shui Hu Zhuan to input.txt and start training with:

```
th train.lua -data_dir data/mydata/ -gpuid -1
```

It is immediately obvious that training is very slow:

```
th train.lua -data_dir data/mydata/ -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/mydata/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor, it takes a lot of time...
saving data/mydata/vocab.t7
saving data/mydata/data.t7
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 345, val: 19, test: 0
vocab size: 4129
creating an LSTM with 2 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
number of parameters in the model: 2845345
cloning rnn
cloning criterion
1/17250 (epoch 0.003), train_loss = 8.32795887, grad/param norm = 9.6310e-02, time/batch = 28.8711s
2/17250 (epoch 0.006), train_loss = 8.06433859, grad/param norm = 4.2826e-01, time/batch = 25.2184s
3/17250 (epoch 0.009), train_loss = 7.28941094, grad/param norm = 3.9537e-01, time/batch = 25.2195s
4/17250 (epoch 0.012), train_loss = 6.85331761, grad/param norm = 4.0576e-01, time/batch = 25.2011s
5/17250 (epoch 0.014), train_loss = 6.69327439, grad/param norm = 3.8309e-01, time/batch = 24.9642s
6/17250 (epoch 0.017), train_loss = 6.50776019, grad/param norm = 3.1042e-01, time/batch = 24.9203s
```

At about 25 s per batch, the estimated training time is 17,250 × 25 s ≈ 120 hours! But this is the raw, unprocessed corpus: its vocabulary of 4,129 characters makes the model (2.8 M parameters versus 240 K for the Shakespeare example) far larger. Training on a preprocessed corpus, e.g. with low-frequency characters removed as sketched below, should be faster. Because the time required was prohibitive, this run was abandoned.
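The fork's -min_freq option in the training help above appears intended for exactly this: characters occurring less often than the threshold are presumably dropped from the vocabulary. A hypothetical run (the threshold of 5 is my own guess; I did not verify this flag's behavior):

```
th train.lua -data_dir data/mydata/ -min_freq 5 -gpuid -1
```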
###GPU version on the unprocessed corpus
First, train on the unprocessed corpus:

```
th train.lua -data_dir data/mydata/ -gpuid 0
```

Training began at 10:20. time/batch settled around 0.08 s, so the whole run takes about 17,250 × 0.08 s ≈ 23 minutes, well under half an hour! For this kind of numerical computation the GPU utterly outclasses the CPU.
After training, sample with:

```
th sample.lua cv/lm_lstm_epoch50.00_3.9309.t7
```

which produced:

```
者,却得出来的物细物,没面少管生渔人。后来的知府时是乱色早晚,拾了几个。李逵惊得忙忙轻梳药穿在大牢里,摆在延安家处,推慰九节的。当下径到居中饮酒,牌门头,戴宗又焦躁。只见屏风背后转出一个小风大来,暗暗听得道:“反细放俺!兄弟拿着,趁这为害天明地清,我休要推道别事的都要做伴当拆投到会耳,便有进漏?”时迁舞起树下探人,的了夹搭,都拽了拽开,胸皮虽是好了六分惊得,是他麻。吐女棒放火了,走不向前,及宋江那道个留守他做个辩察的,先自去州里请明地烧了钱用,但有过京回家,听得状子好!这高老袋内却是出张招安,又都是他的欺负民,如何是计信?必须要和郎相会。且。”趁早起楼去了。两个连夜时候,仓治时,年二十八执迷者多要都要去行叶,早是亲家洒家,径挨到府前来。灯烛纸敌官,方才脱漏。亦被乱窝中有人等好人知蔡京道:“那个人也是他甚么?因此教大军打劫俺那干干金珠的。”母子那妇人来到大王尚安着,相交酒追惹三清酒搦战。不过半夜之事,早饭相烦,心里出对知府说:“官家初时在时,欲要市上时,兀自和我觅家劫犯了他,如何不赶我里来?我大小心定得,不分便了。”叉军转回,已做些小头,要打抵敌他!且把身,带了七个人时,都抢出家,但见:
壮中醪浑纷领,腰细轻露。阔尺三层挺刀,鱼厚夸敌庆孙。高人有八句诗单道身心强似莽?;小付敲柴的,真多呼圣殿之主贞锋欣乐词。头邬闻丹腊良夫,耍达缘矮岁龙;四虎间,寒暮难以偷黄;牢记仁诸作像,显宝一根红佛。微。善得山迎能指挥,直救清天马星。
话说上阵法, 师皇帝展得有?宫殿,雨翠云也上田地里。幸观非乃帝重了,宋江心如有誓,同宋受迷。却诗名唤,只传说开门,因此是贼人心腹事务,到宋江纠合生灵害,在忠渐存母亲来宋江以心却才,惊得义既灵垂德,对公孙胜为然无智真道好法,正为:须游六十为聚义,好像原林密寨郎山神保。
当日宋庄客帐前,与晁保、公二位头领,众头领发起作法。
石裂更兼地分都拨人汉,且不杀得蒙恩干人结义,下山只是锦袋百把,们有父亲孔宾,同商量。宋江又道:“自是好生,莫非也只是是有哥哥下了。”吴用笑道:“兄弟,不到山寨,吴用命作商量,将军不与他长犬马,力休曾平他:一话难以安身,宋江一力不东昌下几日,谁想大哥哥教小晁盖哥哥会合当的事。我们人投随天军来,又有伤损;若不连环甲关,着李横其不似火体,车藏御上尽挂玉水;军卒许多,无无不难之际?他但**,可等兵,可以斩遣。”众军健都管入庄,要把鲜血迸成,赶起来,背后解水边,唤车军跨城疮只等,原来正是之福,后往,来不见三个使汉。因弟清风船救应。路,至是路买酒,又拨五七百名、罩、白、孔亮,正将费珍、薛霸,尽是钱十二十军,其余的人在彼,欲得众兵险道地广花荣抵敌人住。这一队节度使士都军,被两个军猛,呐声喊,都抢过城里,并无腰迎敌,被贼兵赶上,时,却被花荣战箭射来。童太人、杨志正是南安江韩杨龙、穆弘、李逵、索超正定敌。孙军纵马。琼清马挺着枪,入来,尽被史进和贼人杀死贼兵,擒做霹雳。邬梨因成让风,连鼓上马,将股斧,却飞入阵,大小张清见了见乱军阵前卒法败坑回阵,宋江旭前只是:
主人问姓,五应风万。侯海道正:“恶平可逃奔:。时们村野阴血,呼往天兵消波。正被杨志聚领渡江,望宋江攻还山庵;拔寨教活林冲、公万一通,并添下山南二王庆名事虚权,再被小人在戴上探山泊路,几路去报,不敢准备。不知这个人说起是百庄小黑凌州,已曾见了,对别无缘。”吴用道道:“便队军马解到此时,必是殡隘为百谷岭。原师悬流水军头二头领,结识江湖上好汉姓石,名给鬼,便乃五家庄二多情。我去这里地路,望会便行。”廊进雷车把人来不止,李应拈着诏书,自此付话。
且说山客渡过了三只路,教穆弘扮做伴当,扮做阎婆者,带拢是臭镇一个没赌什门,分顺了同行,自去寻闭了的。原自去被人运烧将下去了。宋江等远远,一路进兵。十来县不在大路途来,又怕了到得闲意的张社长,听得监押一声,货钱便是。任原陈达在中,不知处打那华州,特使他来掳去太安军肆,只待下山。戴宗告随张小人,蔡九知府不得,连夜回话,同张招讨干办、众部吉。于路,忽报探知样悔,景珍全过,领回商议,“军师赵枢密喜绯金带,身上悬面草板,护道国师,服,神色不通。是奇诈将丘留的人,准备起船走径来借粮,业不同何遇一深困马灵夫,便因密的月色渐砂来完,斋。小温皇威,被宣刚引军来,武松彼朵并顾大嫂,赴了逃去,自逃命探了。被那几人娅?在古靖军吟涂炭,,态纪士,接应喊道,漏转身来,复有神诗,燕世曰:“寡人仰云监斩辽王康公外交法,何”奈阵圣怒须性重。铁挨填丹靴,万边狄行鉴。见。田户观看草畔,红日影豪困催急绩。宋玉游战,听听了大喜。话说宿太师诏奏道:“宿元帅差有敕入请罗真人,密封官军等八员高名,封当同达宋先锋。”日收选润之主,奏为圣旨,特着州殿府探知。太尉宿太尉回到内,启转马,众军方可亦成开大事,放起出来,更兼小一个唤做
```

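Given the sampler options shown earlier, the model could also be primed with a Chinese prompt; a sketch (the prompt and length are my own illustration, not a run from the post):

```
th sample.lua cv/lm_lstm_epoch50.00_3.9309.t7 -primetext "宋江道:" -length 500
```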
##Training with a larger RNN

Finally, train a 3-layer RNN with 512 hidden units per layer:

```
th train.lua -data_dir data/shuihuzhuan/ -rnn_size 512 -num_layers 3 -gpuid 0
```
