Install Dato (GraphLab Create)
Dato requires registration before it can be used, and it comes with a 30-day trial period.
The following uses a Python virtual environment to set up a clean Dato test environment:
    # Create a virtual environment named dato-env
    virtualenv dato-env
    # Activate the virtual environment
    source dato-env/bin/activate
    # Make sure pip is up to date
    pip install --upgrade pip
    # Install IPython Notebook (optional)
    pip install "ipython[notebook]"
    # Install your licensed copy of GraphLab Create
    pip install --upgrade --no-cache-dir https://get.dato.com/GraphLab-Create/1.5.2/EMAIL/KEY/GraphLab-Create-License.tar.gz
If you are upgrading from an older version, run the following from inside dato-env:
    bin/pip install graphlab-create==1.5.2
Verify that Dato works:
    ➜ dato-env bin/python
    Python 2.7.8 (default, Oct 20 2014, 15:05:19)
    [GCC 4.9.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import graphlab as gl
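If no error is raised, the graphlab Python package is ready to use.
If you start Python from the wrong place, for example outside dato-env, or simply type python, the import fails because the graphlab module cannot be found: the system already has its own Python, and that interpreter does not see packages installed in the virtual environment. Always use the virtual environment's Python!
The steps below follow the official guide, Getting Started with GraphLab Create.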
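When the import fails, it can help to check which interpreter is actually running. The snippet below is a minimal sketch using only the standard library; the helper names in_virtualenv and can_import are made up for this illustration and are not part of GraphLab Create.

```python
from __future__ import print_function

import importlib
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer tools make
    # sys.base_prefix differ from sys.prefix instead.
    return hasattr(sys, "real_prefix") or (
        getattr(sys, "base_prefix", sys.prefix) != sys.prefix
    )


def can_import(module_name):
    """Return True when this interpreter can import module_name."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
print("graphlab importable:", can_import("graphlab"))
```

If the interpreter path does not point into dato-env/bin, the system Python is being used and graphlab will not be found.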
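1. Load the data into SFrames
SFrame: a tabular data structure, ideal for data munging and feature construction
Graph: a structure that is ideal for working with sparse data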
    vertices = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv')
    edges = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv')
When reading a CSV file, gl infers the type of each column from the first line of the file:
    bond_vertices: [str,str,int,int]
    bond_edges: [str,str,str]
To look at the vertices and edges, just enter the variable name:
    >>> vertices
    +----------------+--------+-----------------+---------+
    | name           | gender | license_to_kill | villian |
    +----------------+--------+-----------------+---------+
    | James Bond     | M      | 1               | 0       |
    | M              | M      | 1               | 0       |
    | Moneypenny     | F      | 1               | 0       |
    | Q              | M      | 1               | 0       |
    | Wai Lin        | F      | 1               | 0       |
    | Inga Bergstorm | F      | 0               | 0       |
    | Elliot Carver  | M      | 0               | 1       |
    | Paris Carver   | F      | 0               | 1       |
    | Gotz Otto      | M      | 0               | 1       |
    | Henry Gupta    | M      | 0               | 1       |
    +----------------+--------+-----------------+---------+
    >>> edges
    +----------------+------------+------------+
    | src            | dst        | relation   |
    +----------------+------------+------------+
    | Wai Lin        | James Bond | friend     |
    | M              | James Bond | worksfor   |
    | Inga Bergstorm | James Bond | friend     |
    | Elliot Carver  | James Bond | killed_by  |
    | Gotz Otto      | James Bond | killed_by  |
    | James Bond     | M          | managed_by |
    | Q              | M          | managed_by |
    | Moneypenny     | M          | managed_by |
    | Q              | Moneypenny | colleague  |
    | M              | Moneypenny | worksfor   |
    +----------------+------------+------------+
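To make the type inference mentioned earlier more concrete, here is a rough pure-Python sketch of guessing column types from the first data row. It is not how GraphLab Create implements read_csv; infer_type and infer_column_types are hypothetical helpers, and the comma-separated sample text is an assumption about how the vertex file is laid out.

```python
from __future__ import print_function

# Header plus two rows copied from the vertex table above; the comma-separated
# layout is an assumption about how bond_vertices.csv is stored.
SAMPLE_CSV = """name,gender,license_to_kill,villian
James Bond,M,1,0
M,M,1,0
"""


def infer_type(value):
    """Guess the narrowest type (int, then float, else str) for one cell."""
    for candidate in (int, float):
        try:
            candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def infer_column_types(csv_text):
    """Return (header, types), with types taken from the first data row only."""
    lines = csv_text.splitlines()
    header = lines[0].split(",")
    first_row = lines[1].split(",")
    return header, [infer_type(cell) for cell in first_row]


header, types = infer_column_types(SAMPLE_CSV)
print([t.__name__ for t in types])  # ['str', 'str', 'int', 'int']
```

Guessing from a single row is cheap but fragile: a column whose first value looks like an integer is treated as int even if later rows contain text.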
2. Create a graph object (SGraph) and add the vertices and edges
    g = gl.SGraph()
    g = g.add_vertices(vertices=vertices, vid_field='name')
    g = g.add_edges(edges=edges, src_field='src', dst_field='dst')
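Inspect the structure of the graph. Note that the vertex column name has been renamed to __id, and the edge columns src and dst have been renamed to __src_id and __dst_id.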
    >>> g
    SGraph({'num_edges': 20, 'num_vertices': 10})
    Vertex Fields:['__id', 'gender', 'license_to_kill', 'villian']
    Edge Fields:['__src_id', '__dst_id', 'relation']
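The renaming can be pictured with plain dictionaries. This is only an illustration of the effect of vid_field, src_field and dst_field, not GraphLab Create code; rename_key is a hypothetical helper, and the rows are copied from the tables above.

```python
from __future__ import print_function


def rename_key(row, old, new):
    """Return a copy of row with the key old renamed to new."""
    renamed = dict(row)
    renamed[new] = renamed.pop(old)
    return renamed


vertices = [
    {"name": "James Bond", "gender": "M", "license_to_kill": 1, "villian": 0},
    {"name": "M", "gender": "M", "license_to_kill": 1, "villian": 0},
]
edges = [
    {"src": "M", "dst": "James Bond", "relation": "worksfor"},
]

# vid_field='name' becomes the vertex id column.
graph_vertices = [rename_key(v, "name", "__id") for v in vertices]
# src_field='src' and dst_field='dst' become the edge endpoint columns.
graph_edges = [
    rename_key(rename_key(e, "src", "__src_id"), "dst", "__dst_id")
    for e in edges
]

print(sorted(graph_vertices[0]))
print(sorted(graph_edges[0]))
```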
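The graph object provides methods for retrieving its edges and vertices. Their output is similar to that of the original vertices and edges variables.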
    g.get_vertices()
    g.get_edges()
3. Compute PageRank on the graph
    >>> pr = gl.pagerank.create(g)
    PROGRESS: Counting out degree
    PROGRESS: Done counting out degree
    PROGRESS: +-----------+-----------------------+
    PROGRESS: | Iteration | L1 change in pagerank |
    PROGRESS: +-----------+-----------------------+
    PROGRESS: | 1         | 6.65833               |
    PROGRESS: | 2         | 4.65611               |
    PROGRESS: | 3         | 3.46298               |
    PROGRESS: | 4         | 2.55686               |
    PROGRESS: | 5         | 1.95422               |
    PROGRESS: | 6         | 1.42139               |
    PROGRESS: | 7         | 1.10464               |
    PROGRESS: | 8         | 0.806704              |
    PROGRESS: | 9         | 0.631771              |
    PROGRESS: | 10        | 0.465388              |
    PROGRESS: | 11        | 0.364898              |
    PROGRESS: | 12        | 0.271257              |
    PROGRESS: | 13        | 0.212255              |
    PROGRESS: | 14        | 0.159062              |
    PROGRESS: | 15        | 0.124071              |
    PROGRESS: | 16        | 0.0935911             |
    PROGRESS: | 17        | 0.0727674             |
    PROGRESS: | 18        | 0.0551714             |
    PROGRESS: | 19        | 0.0427744             |
    PROGRESS: | 20        | 0.0325555             |
    PROGRESS: +-----------+-----------------------+
As shown above, calling gl's pagerank.create method with the graph object we built returns the pr model object.
    >>> pr
    Class : PagerankModel

    Graph
    -----
    num_edges : 20
    num_vertices : 10

    Results
    -------
    graph : SGraph. See m['graph']
    change in last iteration (L1 norm) : 0.0326
    vertex pagerank : SFrame. See m['pagerank']

    Settings
    --------
    maximun number of iterations : 20
    convergence threshold (L1 norm) : 0.01
    probablity of random jumps to any node in the graph: 0.15

    Metrics
    -------
    training time (secs) : 1.0853
    number of iterations : 20

    Queryable Fields
    ----------------
    training_time : Total training time of the model
    graph : A new SGraph with the pagerank as a vertex property
    delta : Change in pagerank for the last iteration in L1 norm
    reset_probability : The probablity of randomly jumps to any node in the graph
    pagerank : An SFrame with each vertex's pagerank
    num_iterations : Number of iterations
    threshold : The convergence threshold in L1 norm
    max_iterations : The maximun number of iterations to run
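For intuition about what the settings in this summary control, below is a small pure-Python PageRank sketch. It is not GraphLab Create's implementation: it assumes the common un-normalized update rule (reset probability plus damped incoming rank), starts every vertex at 1.0, and uses only the ten edges printed earlier rather than all 20 in the graph, so its numbers will not match the output above. The function name pagerank and its return shape are choices made for this example.

```python
from __future__ import print_function

VERTICES = [
    "James Bond", "M", "Moneypenny", "Q", "Wai Lin",
    "Inga Bergstorm", "Elliot Carver", "Paris Carver", "Gotz Otto", "Henry Gupta",
]

# (src, dst) pairs: only the ten edge rows displayed earlier in this post.
EDGES = [
    ("Wai Lin", "James Bond"), ("M", "James Bond"),
    ("Inga Bergstorm", "James Bond"), ("Elliot Carver", "James Bond"),
    ("Gotz Otto", "James Bond"), ("James Bond", "M"),
    ("Q", "M"), ("Moneypenny", "M"),
    ("Q", "Moneypenny"), ("M", "Moneypenny"),
]


def pagerank(vertices, edges, reset_probability=0.15, threshold=0.01,
             max_iterations=20):
    """Iterate PageRank until the L1 change drops below threshold."""
    out_degree = dict((v, 0) for v in vertices)
    for src, _dst in edges:
        out_degree[src] += 1

    ranks = dict((v, 1.0) for v in vertices)
    iteration, delta = 0, 0.0
    for iteration in range(1, max_iterations + 1):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = dict((v, 0.0) for v in vertices)
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        new_ranks = dict(
            (v, reset_probability + (1.0 - reset_probability) * incoming[v])
            for v in vertices
        )
        # L1 change in pagerank, the quantity shown in the progress table.
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if delta < threshold:
            break
    return ranks, iteration, delta


ranks, iterations, delta = pagerank(VERTICES, EDGES)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print("%-15s %.4f" % (name, ranks[name]))
print("iterations:", iterations, "last L1 change: %.4f" % delta)
```

The loop stops either when the L1 change falls below the threshold or when the iteration cap is reached; in the run shown above the cap of 20 was hit first, since the last change (0.0326) was still above 0.01.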
The queryable fields listed in the model summary can all be retrieved with pr.get():
    >>> pr.get('pagerank')
    +----------------+----------------+-------------------+
    | __id           | pagerank       | delta             |
    +----------------+----------------+-------------------+
    | Moneypenny     | 1.18363921275  | 0.00143637385736  |
    | Inga Bergstorm | 0.869872717136 | 0.00477951418076  |
    | Henry Gupta    | 0.284762885673 | 1.89255522874e-05 |
    | Paris Carver   | 0.284762885673 | 1.89255522874e-05 |
    | Q              | 1.18363921275  | 0.00143637385736  |
    | Wai Lin        | 0.869872717136 | 0.00477951418076  |
    | M              | 1.87718696576  | 0.00666194771763  |
    | James Bond     | 2.52743578524  | 0.0132914517076   |
    | Elliot Carver  | 0.634064732205 | 0.000113553313724 |
    | Gotz Otto      | 0.284762885673 | 1.89255522874e-05 |
    +----------------+----------------+-------------------+
The result above is not sorted, however. Sorting on the pagerank column with topk gives us the most important person: Bond!
    >>> pr.get('pagerank').topk(column_name='pagerank')
    +----------------+----------------+-------------------+
    | __id           | pagerank       | delta             |
    +----------------+----------------+-------------------+
    | James Bond     | 2.52743578524  | 0.0132914517076   |
    | M              | 1.87718696576  | 0.00666194771763  |
    | Moneypenny     | 1.18363921275  | 0.00143637385736  |
    | Q              | 1.18363921275  | 0.00143637385736  |
    | Inga Bergstorm | 0.869872717136 | 0.00477951418076  |
    | Wai Lin        | 0.869872717136 | 0.00477951418076  |
    | Elliot Carver  | 0.634064732205 | 0.000113553313724 |
    | Henry Gupta    | 0.284762885673 | 1.89255522874e-05 |
    | Paris Carver   | 0.284762885673 | 1.89255522874e-05 |
    | Gotz Otto      | 0.284762885673 | 1.89255522874e-05 |
    +----------------+----------------+-------------------+
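Conceptually, a top-k query keeps the k rows with the largest value in one column, in descending order. The sketch below shows the same idea in plain Python with heapq; it is an illustration, not the SFrame implementation. The topk function here is a stand-in written for this example, and the rows are a subset of the scores shown above.

```python
from __future__ import print_function

import heapq

# (__id, pagerank) pairs copied from the unsorted result above.
rows = [
    ("Moneypenny", 1.18363921275),
    ("Inga Bergstorm", 0.869872717136),
    ("Henry Gupta", 0.284762885673),
    ("M", 1.87718696576),
    ("James Bond", 2.52743578524),
    ("Elliot Carver", 0.634064732205),
]


def topk(rows, k=10):
    """Return the k rows with the highest score, highest first."""
    return heapq.nlargest(k, rows, key=lambda row: row[1])


for name, score in topk(rows, k=3):
    print("%-15s %.4f" % (name, score))
```

With k=3 this lists James Bond, M and Moneypenny, matching the first three rows of the sorted table.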