Initialized git repo. Created a function for reading graphs from files (more precisely, the edges of a graph). Added a few datasets of undirected, unweighted graphs (for now).
commit
c272c732d7
498203
dataset/deezer_clean_data/HR_edges.csv
Normal file
File diff suppressed because it is too large
222888
dataset/deezer_clean_data/HU_edges.csv
Normal file
File diff suppressed because it is too large
125827
dataset/deezer_clean_data/RO_edges.csv
Normal file
File diff suppressed because it is too large
26
dataset/deezer_clean_data/dataset_descriptions.txt
Normal file
@@ -0,0 +1,26 @@
The paper that uses the dataset can be cited as:

@misc{1802.03997,
author = {Benedek Rozemberczki and Ryan Davies and Rik Sarkar and Charles Sutton},
title = {GEMSEC: Graph Embedding with Self Clustering},
year = {2018},
eprint = {arXiv:1802.03997}}

It can be accessed at:

https://arxiv.org/abs/1802.03997

--------------------------------------
--------------------------------------
Deezer Datasets
--------------------------------------
--------------------------------------

We collected data from the music streaming service Deezer (November 2017). These datasets represent friendship networks of users from 3 European countries. Nodes represent the users and edges are the mutual friendships. We reindexed the nodes in order to achieve a certain level of anonymity. The csv files contain the edges -- nodes are indexed from 0. The json files contain the genre preferences of users -- each key is a user id, the genres loved are given as lists. Genre notations are consistent across users. In each dataset users could like 84 distinct genres. Liked genre lists were compiled based on the liked song lists. The countries included are Romania, Croatia and Hungary. For each dataset we listed the number of nodes and edges.

Country   #Nodes    #Edges
--------------------------
RO        41,773    125,826
HR        54,573    498,202
HU        47,538    222,887
--------------------------
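
The node and edge counts in the table above could be checked against an edge file with a short script. The following is only a sketch: the function name `count_nodes_and_edges` and the header text `node_1,node_2` are assumptions for illustration. The one-line header itself is inferred from the diff, which reports one more line per csv file than the listed edge count.

```python
import csv
import io


def count_nodes_and_edges(lines, has_header=True):
    """Count distinct node ids and edge rows in an iterable of CSV lines.

    Hypothetical helper; assumes two integer columns per edge row.
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the assumed column-name row
    nodes = set()
    edge_count = 0
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        nodes.update((int(row[0]), int(row[1])))
        edge_count += 1
    return len(nodes), edge_count


# Tiny in-memory sample standing in for a file such as RO_edges.csv.
sample = io.StringIO("node_1,node_2\n0,1\n0,2\n1,2\n2,3\n")
print(count_nodes_and_edges(sample))  # prints (4, 4)
```

For a real file, passing an open file object (for example `open("dataset/deezer_clean_data/RO_edges.csv", newline="")`) in place of the sample should work the same way.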
24
dataset/git_web_ml/README.txt
Normal file
@@ -0,0 +1,24 @@
GitHub Social Network

Description

A large social network of GitHub developers which was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer and e-mail address. The task related to the graph is binary node classification - one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.

Properties

- Directed: No.
- Node features: Yes.
- Edge features: No.
- Node labels: Yes. Binary-labeled.
- Temporal: No.
- Nodes: 37,700
- Edges: 289,003
- Density: 0.001
- Transitivity: 0.013

Possible Tasks

- Binary node classification
- Link prediction
- Community detection
- Network visualization
14
dataset/git_web_ml/citing.txt
Normal file
@@ -0,0 +1,14 @@
If you find this dataset useful in your research, please consider citing the following paper:

>@misc{rozemberczki2019multiscale,
    title = {Multi-scale Attributed Node Embedding},
    author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},
    year = {2019},
    eprint = {1909.13021},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

And take a look at the project itself:

https://github.com/benedekrozemberczki/MUSAE
289004
dataset/git_web_ml/musae_git_edges.csv
Normal file
File diff suppressed because it is too large
24
main.py
Normal file
@@ -0,0 +1,24 @@
def read_graph_from_file(path, separator=",", read_first_line=False, is_directed=False, has_weight=False):
    # Maps each edge (node1, node2) to 1. Only undirected, unweighted input is handled so far.
    edges = dict()
    with open(path, 'r') as file:
        # When read_first_line is False, the first line is treated as a header and skipped.
        read_line = read_first_line
        line = file.readline()
        while line != '':
            if read_line:
                # Lines starting with "#" are treated as comments.
                if line[0] != "#":
                    split = line.split(separator)
                    node1 = int(split[0])
                    node2 = int(split[1])
                    if not is_directed:
                        if not has_weight:
                            edges[(node1, node2)] = 1
            else:
                # The first line has been skipped; read every line from now on.
                read_line = True
            line = file.readline()
    return edges


if __name__ == '__main__':
    e = read_graph_from_file("dataset/deezer_clean_data/HR_edges.csv")
    print(e)
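
The committed reader only stores edges for undirected, unweighted input. One possible way to cover the remaining flag combinations is sketched below. This is not the commit's code: the function name `read_edges`, the `skip_first_line` parameter, the third-column weight layout, and the choice to store each undirected edge under a sorted key are all assumptions made for illustration.

```python
import io


def read_edges(lines, separator=",", skip_first_line=True,
               is_directed=False, has_weight=False):
    """Parse edge lines into a dict mapping (node1, node2) -> weight.

    Hypothetical variant of read_graph_from_file; it takes any iterable
    of lines (an open file works) instead of a path.
    """
    edges = {}
    for index, line in enumerate(lines):
        line = line.strip()
        if index == 0 and skip_first_line:
            continue  # assumed header row
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(separator)
        node1, node2 = int(parts[0]), int(parts[1])
        # Assumption: a weighted file carries the weight in a third column.
        weight = float(parts[2]) if has_weight else 1
        if not is_directed and node1 > node2:
            # Use one canonical key per undirected edge.
            node1, node2 = node2, node1
        edges[(node1, node2)] = weight
    return edges


sample = io.StringIO("node_1,node_2\n# comment\n2,1\n1,2\n3,4\n")
print(read_edges(sample))  # prints {(1, 2): 1, (3, 4): 1}
```

Sorting the endpoints means `2,1` and `1,2` collapse into a single undirected edge, while `is_directed=True` keeps them as two separate entries.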