Initial Commit

Barak Michener 2014-06-20 18:21:47 -04:00
commit bbb0a2f580
126 changed files with 14189 additions and 0 deletions

115
docs/Configuration.md Normal file

@ -0,0 +1,115 @@
# Configuration File
## Overview
Cayley expects, in the usual case, to be run with a configuration file, though it can also be run purely through configuration flags. The configuration file contains a JSON object with any of the documented parameters.
Cayley looks in the following locations for the configuration file:
* Command line flag
* The environment variable $CAYLEY_CFG
* /etc/cayley.cfg
All command line flags take precedence over the configuration file.
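The lookup and precedence rules above can be sketched in a few lines. This is only an illustration, assuming the locations are tried in the order listed; `configCandidates` and `effectiveConfig` are hypothetical helper names, not part of Cayley.

```javascript
// Hypothetical helpers illustrating the rules above; Cayley's own lookup
// code may differ in detail.

// Returns the candidate config file locations, assuming they are tried
// in the order the list above gives them.
function configCandidates(flagValue, env) {
  const candidates = [];
  if (flagValue) candidates.push(flagValue);            // command line flag
  if (env.CAYLEY_CFG) candidates.push(env.CAYLEY_CFG);  // $CAYLEY_CFG
  candidates.push("/etc/cayley.cfg");                   // system-wide file
  return candidates;
}

// Command line flags take precedence over values read from the file.
function effectiveConfig(fileConfig, flagOverrides) {
  return Object.assign({}, fileConfig, flagOverrides);
}

console.log(configCandidates(null, { CAYLEY_CFG: "/home/me/cayley.cfg" }));
console.log(effectiveConfig({ database: "leveldb", read_only: false },
                            { read_only: true }));
```

In the second call the flag value wins, so the resulting object has `read_only` set to true while `database` is kept from the file.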
## Main Options
#### **`database`**
* Type: String
* Default: "mem"
Determines the type of the underlying database. Options include:
* `mem`: An in-memory store, based on an initial N-Quads file. Loses all changes when the process exits.
* `leveldb`: A persistent on-disk store backed by [LevelDB](http://code.google.com/p/leveldb/).
* `mongodb`: Stores the graph data and indices in a [MongoDB](http://mongodb.org) instance. Slower, as it incurs network traffic, but multiple Cayley instances can disappear and reconnect at will, across a potentially horizontally-scaled store.
#### **`db_path`**
* Type: String
* Default: "/tmp/testdb"
Where does the database actually live? Dependent on the type of database. For each datastore:
* `mem`: Path to a triple file to automatically load
* `leveldb`: Directory to hold the LevelDB database files
* `mongodb`: "hostname:port" of the desired MongoDB server.
#### **`listen_host`**
* Type: String
* Default: "0.0.0.0"
The hostname or IP address for Cayley's HTTP server to listen on. Defaults to all interfaces.
#### **`listen_port`**
* Type: String
* Default: "64210"
The port for Cayley's HTTP server to listen on.
#### **`read_only`**
* Type: Boolean
* Default: false
If true, disables the ability to write to the database using the HTTP API (will return a 400 for any write request). Useful for testing or instances that shouldn't change.
#### **`load_size`**
* Type: Integer
* Default: 10000
The number of triples to buffer from a loaded file before writing a block of triples to the database. Larger numbers are good for larger loads.
#### **`db_options`**
* Type: Object
See Per-Database Options, below.
## Language Options
#### **`gremlin_timeout`**
* Type: Integer
* Default: 30
The value in seconds of the maximum length of time the Javascript runtime should run until cancelling the query and returning a 408 Timeout. A negative value means no limit.
## Per-Database Options
The `db_options` object in the main configuration file contains any of these following options that change the behavior of the datastore.
### Memory
No special options.
### LevelDB
#### **`write_buffer_mb`**
* Type: Integer
* Default: 20
The size in MiB of the LevelDB write cache. Increasing this number allows for more/faster writes before syncing to disk. The default is 20; for large loads, a recommended value is 200+.
#### **`cache_size_mb`**
* Type: Integer
* Default: 2
The size in MiB of the LevelDB block cache. Increasing this number uses more memory to maintain a bigger cache of triple blocks for better performance.
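Putting the main options and the LevelDB options together, a configuration for a large load might look like the sketch below. The path and the tuning values are placeholders chosen for illustration. The object is built in JavaScript only so that it can be printed as the JSON text a configuration file would contain.

```javascript
// Illustrative configuration for a large LevelDB load. The db_path and the
// tuning numbers are example values, not recommendations from Cayley.
const config = {
  database: "leveldb",
  db_path: "/tmp/moviedb",
  listen_host: "0.0.0.0",
  listen_port: "64210",   // documented as a String
  read_only: false,
  load_size: 10000,
  db_options: {
    write_buffer_mb: 200, // larger write cache for a big load
    cache_size_mb: 2
  }
};

// The configuration file holds this JSON object.
const configText = JSON.stringify(config, null, 2);
console.log(configText);
```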
### MongoDB
#### **`database_name`**
* Type: String
* Default: "cayley"
The name of the database within MongoDB to connect to. Cayley manages its own collections and indices therein.

419
docs/GremlinAPI.md Normal file

@ -0,0 +1,419 @@
# Javascript/Gremlin API documentation
## The `graph` object
Alias: `g`
This is the only special object in the environment; it generates the query objects. Under the hood, they're simple objects that get compiled to a Go iterator tree when executed.
#### **`graph.Vertex([nodeId],[nodeId]...)`**
Alias: `graph.V`
Arguments:
* `nodeId` (Optional): A string or list of strings representing the starting vertices.
Returns: Query object
Starts a query path at the given vertex or vertices. No ids means "all vertices".
#### **`graph.Morphism()`**
Alias: `graph.M`
Arguments: none
Returns: Path object
Creates a morphism path object. Unqueryable on its own, it defines one end of the path. Saving these to variables with
```javascript
var shorterPath = graph.Morphism().Out("foo").Out("bar")
```
is the common use case. See also: `path.Follow()`, `path.FollowR()`
#### **`graph.Emit(data)`**
Arguments:
* `data`: A Javascript object that can be serialized to JSON
Adds data programmatically to the JSON result list. Can be any JSON type.
## Path objects
Both `.Morphism()` and `.Vertex()` create path objects, which provide the following traversal methods.
For these examples, suppose we have the following graph:
```
  +---+                        +---+
  | A |-------               ->| F |<--
  +---+       \------>+---+-/  +---+   \--+---+
               ------>|#B#|      |        | E |
  +---+-------/      >+---+      |        +---+
  | C |             /            v
  +---+           -/           +---+
    ----    +---+/             |#G#|
        \-->|#D#|------------->+---+
            +---+
```
Where every link is a "follows" relationship, and the nodes with an extra `#` in the name have an extra `status` link. As in,
```
D -- status --> cool_person
```
Perhaps these are the influencers in our community.
### Basic Traversals
#### **`path.Out([predicatePath], [tags])`**
Arguments:
* `predicatePath` (Optional): One of:
  * null or undefined: All predicates pointing out from this node
  * a string: The predicate name to follow out from this node
  * a list of strings: The predicates to follow out from this node
  * a query path object: The target of which is a set of predicates to follow.
* `tags` (Optional): One of:
  * null or undefined: No tags
  * a string: A single tag to add the predicate used to the output set.
  * a list of strings: Multiple tags to use as keys to save the predicate used to the output set.
Out is the workaday way to get between nodes in the forward direction. Starting with the nodes in `path` as subjects, follow the triples with predicates defined by `predicatePath` to their objects.
Example:
```javascript
// The working set of this is B and D
g.V("C").Out("follows")
// The working set of this is F, as A follows B and B follows F.
g.V("A").Out("follows").Out("follows")
// Finds all things D points at. Result is B G and cool_person
g.V("D").Out()
// Finds all things D points at on the follows and status linkages.
// Result is B G and cool_person
g.V("D").Out(["follows", "status"])
// Finds all things D points at on the status linkage, given from a separate query path.
// Result is {"id": "cool_person", "pred": "status"}
g.V("D").Out(g.V("status"), "pred")
```
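To make the semantics of `Out` concrete, here is a toy in-memory model of the sample graph. It is a sketch for illustration only, not how Cayley evaluates queries (those compile to a Go iterator tree); the `sampleTriples` array and the `out` helper are hypothetical names, and the edge list is read off the examples in this document.

```javascript
// Toy model of the sample graph as [subject, predicate, object] triples.
const sampleTriples = [
  ["A", "follows", "B"], ["B", "follows", "F"],
  ["C", "follows", "B"], ["C", "follows", "D"],
  ["D", "follows", "B"], ["D", "follows", "G"],
  ["E", "follows", "F"], ["F", "follows", "G"],
  ["B", "status", "cool_person"],
  ["D", "status", "cool_person"],
  ["G", "status", "cool_person"]
];

// Follow matching triples from subject to object, one result per path.
// Omitting `predicates` follows every predicate, like Out() with no arguments.
function out(nodes, predicates) {
  const results = [];
  for (const node of nodes) {
    for (const [s, p, o] of sampleTriples) {
      if (s === node && (!predicates || predicates.includes(p))) {
        results.push(o);
      }
    }
  }
  return results;
}

console.log(out(["C"], ["follows"]));                   // B, D
console.log(out(out(["A"], ["follows"]), ["follows"])); // F
console.log(out(["D"]));                                // B, G, cool_person
```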
#### **`path.In([predicatePath], [tags])`**
Arguments:
* `predicatePath` (Optional): One of:
  * null or undefined: All predicates pointing into this node
  * a string: The predicate name to follow into this node
  * a list of strings: The predicates to follow into this node
  * a query path object: The target of which is a set of predicates to follow.
* `tags` (Optional): One of:
  * null or undefined: No tags
  * a string: A single tag to add the predicate used to the output set.
  * a list of strings: Multiple tags to use as keys to save the predicate used to the output set.
Same as `Out`, but in the other direction. Starting with the nodes in `path` as objects, follow the triples with predicates defined by `predicatePath` to their subjects.
Example:
```javascript
// Find the cool people, B G and D
g.V("cool_person").In("status")
// Find who follows B, in this case, A, C, and D
g.V("B").In("follows")
// Find who follows the people E follows, namely, E and B
g.V("E").Out("follows").In("follows")
```
#### **`path.Both([predicatePath], [tags])`**
Arguments:
* `predicatePath` (Optional): One of:
  * null or undefined: All predicates pointing into or out from this node
  * a string: The predicate name to follow into or out from this node
  * a list of strings: The predicates to follow into or out from this node
  * a query path object: The target of which is a set of predicates to follow.
* `tags` (Optional): One of:
  * null or undefined: No tags
  * a string: A single tag to add the predicate used to the output set.
  * a list of strings: Multiple tags to use as keys to save the predicate used to the output set.
Follow the predicate in either direction, combining what `In` and `Out` would each return.
Note: Less efficient, for the moment, as it's implemented with an Or, but useful where necessary.
Example:
```javascript
// Find all followers/followees of F. Returns B E and G
g.V("F").Both("follows")
```
#### **`path.Is(node, [node..])`**
Arguments:
* `node`: A string for a node. Can be repeated or a list of strings.
Filter all paths to ones which, at this point, are on the given `node`.
Example:
```javascript
// Starting from all nodes in the graph, find the paths that follow B.
// Results in three paths for B (from A C and D)
g.V().Out("follows").Is("B")
```
#### **`path.Has(predicate, object)`**
Arguments:
* `predicate`: A string for a predicate node.
* `object`: A string for an object node.
Keeps only the paths that are, at this point, on a subject of the given predicate and object. It does not follow the link; it merely filters the possible paths.
Usually useful for starting with all nodes, or limiting to a subset depending on some predicate/value pair.
Example:
```javascript
// Start from all nodes that follow B -- results in A C and D
g.V().Has("follows", "B")
// People C follows who then follow F. Results in B.
g.V("C").Out("follows").Has("follows", "F")
```
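The difference between filtering and following can be sketched with a toy model. This is an illustration under assumptions, not Cayley's implementation: `followEdges` lists only the "follows" links of the sample graph, and `has` is a hypothetical helper.

```javascript
// Only the "follows" links of the sample graph, as [subject, predicate, object].
const followEdges = [
  ["A", "follows", "B"], ["B", "follows", "F"],
  ["C", "follows", "B"], ["C", "follows", "D"],
  ["D", "follows", "B"], ["D", "follows", "G"],
  ["E", "follows", "F"], ["F", "follows", "G"]
];

// Keep the nodes that are the subject of a matching triple. The result
// still contains the original nodes; nothing is traversed.
function has(nodes, predicate, object) {
  return nodes.filter(node =>
    followEdges.some(([s, p, o]) => s === node && p === predicate && o === object));
}

const everyone = ["A", "B", "C", "D", "E", "F", "G"];
console.log(has(everyone, "follows", "B"));   // A, C, D
// B and D are the working set of g.V("C").Out("follows").
console.log(has(["B", "D"], "follows", "F")); // B
```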
### Tagging
#### **`path.Tag(tag)`**
Alias: `path.As`
Arguments:
* `tag`: A string or list of strings to act as a result key. The value for the tag is the vertex the path was on at the time it reached `Tag`.
In order to save your work or learn more about how a path got to the end, we have tags. The simplest thing to do is to add a tag anywhere you'd like to put each node in the result set.
Example:
```javascript
// Start from all nodes, save them into start, follow any status links, and return the result.
// Results are: {"id": "cool_person", "start": "B"}, {"id": "cool_person", "start": "G"}, {"id": "cool_person", "start": "D"}
g.V().Tag("start").Out("status")
```
#### **`path.Back(tag)`**
Arguments:
* `tag`: A previous tag in the query to jump back to.
If still valid, a path now considers its vertex to be the previously tagged one, with the added constraint that the path was valid all the way here. Useful for traversing back in queries and taking another route for things that have matched so far.
Example:
```javascript
// Start from all nodes, save them into start, follow any status links, jump back to the starting node, and find who follows them. Return the result.
// Results are:
// {"id": "A", "start": "B"},
// {"id": "C", "start": "B"},
// {"id": "D", "start": "B"},
// {"id": "C", "start": "D"},
// {"id": "D", "start": "G"},
// {"id": "F", "start": "G"}
g.V().Tag("start").Out("status").Back("start").In("follows")
```
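One way to picture `Tag` and `Back` is to treat each path as a current node plus a map of tags. The sketch below is a toy model under that assumption, not Cayley's implementation; `edges`, `start`, `tag`, `back` and `step` are hypothetical names, and the edge list is read off the examples in this document.

```javascript
// Sample graph as [subject, predicate, object] triples.
const edges = [
  ["A", "follows", "B"], ["B", "follows", "F"],
  ["C", "follows", "B"], ["C", "follows", "D"],
  ["D", "follows", "B"], ["D", "follows", "G"],
  ["E", "follows", "F"], ["F", "follows", "G"],
  ["B", "status", "cool_person"],
  ["D", "status", "cool_person"],
  ["G", "status", "cool_person"]
];
const allIds = ["A", "B", "C", "D", "E", "F", "G", "cool_person"];

// A path is its current node plus the tags collected so far.
const start = ids => ids.map(id => ({ id, tags: {} }));
// Tag: remember the current node under `name`.
const tag = (paths, name) =>
  paths.map(p => ({ id: p.id, tags: { ...p.tags, [name]: p.id } }));
// Back: move each surviving path to the node saved under `name`.
const back = (paths, name) =>
  paths.map(p => ({ id: p.tags[name], tags: p.tags }));
// One traversal step; reverse = true behaves like In, false like Out.
const step = (paths, pred, reverse) =>
  paths.flatMap(p =>
    edges
      .filter(([s, pr, o]) => pr === pred && (reverse ? o : s) === p.id)
      .map(([s, , o]) => ({ id: reverse ? s : o, tags: p.tags })));

// Mirrors g.V().Tag("start").Out("status").Back("start").In("follows")
const atStatus = step(tag(start(allIds), "start"), "status", false);
const followers = step(back(atStatus, "start"), "follows", true);
console.log(followers.map(p => p.tags.start + " <- " + p.id));
```

Paths that cannot take the `status` step are dropped, which is why only followers of B, D and G appear in the output.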
#### **`path.Save(predicate, tag)`**
Arguments:
* `predicate`: A string for a predicate node.
* `tag`: A string for a tag key to store the object node.
From the current node as the subject, save the object of all triples with `predicate` into `tag`, without traversal.
Example:
```javascript
// Start from D and B and save who they follow into "target"
// Returns:
// {"id" : "D", "target": "B" },
// {"id" : "D", "target": "G" },
// {"id" : "B", "target": "F" },
g.V("D", "B").Save("follows", "target")
```
### Joining
#### **`path.Intersect(query)`**
Alias: `path.And`
Arguments:
* `query`: Another query path, the result sets of which will be intersected
Filters all paths by the result of another query path (efficiently computed).
This is essentially a join where, at the stage of each path, a node is shared.
Example:
```javascript
var cFollows = g.V("C").Out("follows")
var dFollows = g.V("D").Out("follows")
// People followed by both C (B and D) and D (B and G) -- returns B.
cFollows.Intersect(dFollows)
// Equivalently, g.V("C").Out("follows").And(g.V("D").Out("follows"))
```
#### **`path.Union(query)`**
Alias: `path.Or`
Arguments:
* `query`: Another query path, the result sets of which will form a union
Given two queries, returns the combined paths of the two queries.
Notice that it's per-path, not per-node. Once again, if multiple paths reach the
same destination, they might have had different ways of getting there (and different tags).
See also: `path.Tag()`
Example:
```javascript
var cFollows = g.V("C").Out("follows")
var dFollows = g.V("D").Out("follows")
// People followed by either C (B and D) or D (B and G) -- returns B (from C), B (from D), D and G.
cFollows.Union(dFollows)
```
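The per-path behaviour of `Union`, compared with `Intersect`, can be illustrated on plain arrays. This is a simplified sketch, not Cayley's implementation: the two arrays stand in for the working sets of the two queries above, and `intersect` and `union` are hypothetical helpers.

```javascript
// Working sets taken from the examples above.
const followedByC = ["B", "D"]; // g.V("C").Out("follows")
const followedByD = ["B", "G"]; // g.V("D").Out("follows")

// Intersect: keep the nodes that both queries reach.
function intersect(a, b) {
  return a.filter(node => b.includes(node));
}

// Union: every path from either query survives, so a node reached by
// both queries shows up once per path.
function union(a, b) {
  return a.concat(b);
}

console.log(intersect(followedByC, followedByD)); // B
console.log(union(followedByC, followedByD));     // B, D, B, G
```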
### Using Morphisms
#### **`path.Follow(morphism)`**
Arguments:
* `morphism`: A morphism path to follow
With `graph.Morphism` we can prepare a path for later reuse. `Follow` is the way that's accomplished.
Applies the path chain on the morphism object to the current path.
Starts as if at the g.M() and follows through the morphism path.
Example:
```javascript
friendOfFriend = g.Morphism().Out("follows").Out("follows")
// Returns the followed people of who C follows -- a simplistic "friend of my friend"
// and whether or not they have a "cool" status. Potential for recommending followers abounds.
// Returns B and G
g.V("C").Follow(friendOfFriend).Has("status", "cool_person")
```
#### **`path.FollowR(morphism)`**
Arguments:
* `morphism`: A morphism path to follow
Same as `Follow` but follows the chain in the reverse direction. Flips "In" and "Out" where appropriate,
the net result being a virtual predicate followed in the reverse direction.
Starts at the end of the morphism and follows it backwards (with appropriate flipped directions) to the g.M() location.
Example:
```javascript
friendOfFriend = g.Morphism().Out("follows").Out("follows")
// Returns the third tier of influencers -- people who follow people who follow the cool people.
// Returns E B C (from G) and C (from B)
g.V().Has("status", "cool_person").FollowR(friendOfFriend)
```
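The relationship between `Follow` and `FollowR` can be sketched by modelling a morphism as a list of steps. This is a toy model for illustration, not Cayley's implementation; `graphTriples`, `applyStep`, `follow` and `followR` are hypothetical names.

```javascript
// "follows" links of the sample graph as [subject, predicate, object].
const graphTriples = [
  ["A", "follows", "B"], ["B", "follows", "F"],
  ["C", "follows", "B"], ["C", "follows", "D"],
  ["D", "follows", "B"], ["D", "follows", "G"],
  ["E", "follows", "F"], ["F", "follows", "G"]
];

// A morphism modelled as an ordered list of traversal steps.
const friendOfFriendSteps = [
  { dir: "out", pred: "follows" },
  { dir: "out", pred: "follows" }
];

// Apply one step to every node in the working set, one result per path.
function applyStep(nodes, step) {
  return nodes.flatMap(node =>
    graphTriples
      .filter(([s, p, o]) => p === step.pred && (step.dir === "out" ? s : o) === node)
      .map(([s, , o]) => (step.dir === "out" ? o : s)));
}

// Follow: apply the steps in order.
function follow(nodes, steps) {
  return steps.reduce(applyStep, nodes);
}

// FollowR: apply the steps last-to-first with "in" and "out" flipped.
function followR(nodes, steps) {
  const reversed = steps.slice().reverse().map(step => ({
    dir: step.dir === "out" ? "in" : "out",
    pred: step.pred
  }));
  return follow(nodes, reversed);
}

console.log(follow(["C"], friendOfFriendSteps));  // F, B, G
console.log(followR(["G"], friendOfFriendSteps)); // C, B, E
console.log(followR(["B"], friendOfFriendSteps)); // C
```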
## Query objects (finals)
Only `.Vertex()` objects -- that is, queries that have somewhere to start -- can be turned into queries. To actually execute the queries, an output step must be applied.
#### **`query.All()`**
Arguments: None
Returns: undefined
Executes the query and adds the results, with all tags, as a string-to-string (tag to node) map in the output set, one for each path that a traversal could take.
#### **`query.GetLimit(size)`**
Arguments:
* `size`: An integer limit on how many results to return.
Returns: undefined
Same as `All`, but limited to the first `size` unique nodes at the end of the path, and each of their possible traversals.
#### **`query.ToArray()`**
Arguments: None
Returns: Array
Executes a query and returns the results at the end of the query path.
Example:
```javascript
// fooNames contains an Array of names for foo.
var fooNames = g.V("foo").Out("name").ToArray()
```
#### **`query.ToValue()`**
Arguments: None
Returns: String
As `.ToArray` above, but limited to one result node -- a string. Like `.GetLimit(1)` for the above case.
#### **`query.TagArray()`**
Arguments: None
Returns: Array of string-to-string objects
As `.ToArray` above, but instead of a list of top-level strings, returns an Array of tag-to-string dictionaries, much as `.All` would, except inside the Javascript environment.
Example:
```javascript
// fooTags contains an Array of tag-to-string maps, one per name of foo.
var fooTags = g.V("foo").Tag("foo_value").Out("name").TagArray()
// fooValue should be the string "foo"
var fooValue = fooTags[0]["foo_value"]
```
#### **`query.TagValue()`**
Arguments: None
Returns: A single string-to-string object
As `.TagArray` above, but limited to one result. Like `.GetLimit(1)` for the above case. Returns a tag-to-string map.
#### **`query.ForEach(callback), query.ForEach(limit, callback)`**
Alias: `query.Map`
Arguments:
* `limit` (Optional): An integer value on the first `limit` paths to process.
* `callback`: A javascript function of the form `function(data)`
Returns: undefined
For each tag-to-string result retrieved, as in the `All` case, calls `callback(data)` where `data` is the tag-to-string map.
Example:
```javascript
// Simulate query.All()
graph.V("foo").ForEach(function(d) { g.Emit(d) } )
```

128
docs/HTTP.md Normal file

@ -0,0 +1,128 @@
# HTTP Methods
## API v1
Unless otherwise noted, all URIs accept POST requests.
### Queries and Results
#### `/api/v1/query/gremlin`
POST Body: Javascript source code of the query
Response: JSON results, depending on the query.
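As a rough sketch, a client could assemble the request like this. The base URL assumes a local instance on the default port; `gremlinRequest` is a hypothetical helper, and the actual send is left as a comment because it needs a running server.

```javascript
// Builds the pieces of a Gremlin query request. The POST body is the
// JavaScript source of the query itself.
function gremlinRequest(baseUrl, source) {
  return {
    url: baseUrl + "/api/v1/query/gremlin",
    options: { method: "POST", body: source }
  };
}

const request = gremlinRequest(
  "http://localhost:64210",          // assumed local instance, default port
  'g.V("C").Out("follows").All()'
);
console.log(request.url);

// With a server running, the request could be sent from Node 18+ or a
// browser, for example:
//   fetch(request.url, request.options).then(r => r.json()).then(console.log);
```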
#### `/api/v1/query/mql`
POST Body: JSON MQL query
Response: JSON results, with a query wrapper:
```json
{
"result": <JSON Result set>
}
```
If the JSON is invalid or an error occurs:
```json
{
"error": "Error message"
}
```
### Query Shapes
Result form:
```json
{
"nodes": [{
"id" : integer,
"tags": ["list of tags from the query"],
"values": ["known values from the query"],
"is_link_node": bool, // Does the node represent the link or the node (the oval shapes)
"is_fixed": bool // Is the node a fixed starting point of the query
}],
"links": [{
"source": integer, // Node ID
"target": integer, // Node ID
"link_node": integer // Node ID
}]
}
```
#### `/api/v1/shape/gremlin`
POST Body: Javascript source code of the query
Response: JSON description of the last query executed.
#### `/api/v1/shape/mql`
POST Body: JSON MQL query
Response: JSON description of the query.
### Write commands
Responses come in the form
200 Success:
```json
{
"result": "Success message."
}
```
400 / 500 Error:
```json
{
"error": "Error message."
}
```
#### `/api/v1/write`
POST Body: JSON triples
```json
[{
"subject": "Subject Node",
"predicate": "Predicate Node",
"object": "Object node",
"provenance": "Provenance node" // Optional
}] // More than one triple allowed.
```
Response: JSON response message
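A small helper can assemble a write body in this shape before it is posted. This is a sketch under assumptions: `toWriteBody` is a hypothetical name, and the client-side check for required keys is an illustration, not something the API is documented to require.

```javascript
// Builds the JSON body for /api/v1/write from a list of triples.
// "provenance" is optional and is included only when present.
function toWriteBody(triples) {
  return JSON.stringify(triples.map(t => {
    for (const key of ["subject", "predicate", "object"]) {
      if (typeof t[key] !== "string" || t[key] === "") {
        throw new Error("triple is missing " + key);
      }
    }
    const triple = { subject: t.subject, predicate: t.predicate, object: t.object };
    if (t.provenance) triple.provenance = t.provenance;
    return triple;
  }));
}

const writeBody = toWriteBody([
  { subject: "A", predicate: "follows", object: "B" },
  { subject: "B", predicate: "status", object: "cool_person", provenance: "example" }
]);
console.log(writeBody);
```

The same body shape applies to `/api/v1/delete`.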
#### `/api/v1/write/file/nquad`
POST Body: Form-encoded body:
* Key: `NQuadFile`, Value: N-Quad file to write.
Response: JSON response message
Example:
```
curl http://localhost:64210/api/v1/write/file/nquad -F NQuadFile=@30k.n3
```
#### `/api/v1/delete`
POST Body: JSON triples
```json
[{
"subject": "Subject Node",
"predicate": "Predicate Node",
"object": "Object node",
"provenance": "Provenance node" // Optional
}] // More than one triple allowed.
```
Response: JSON response message.

86
docs/MQL.md Normal file

@ -0,0 +1,86 @@
# MQL Guide
## General
Cayley's MQL implementation is a work-in-progress clone of [Freebase's MQL API](https://developers.google.com/freebase/mql/). At the moment, it supports very basic queries without some of the extended features. It also aims to be database-agnostic, meaning that the schema inference from Freebase does not (yet) apply.
Every JSON Object can be thought of as a node in the graph, and wrapping an object in a list means there may be several of these, or it may be repeated. A simple query like:
```json
[{
"id": null
}]
```
Is equivalent to all nodes in the graph, where "id" is the special keyword for the value of the node.
Predicates are added to the object to specify constraints.
```json
[{
"id": null,
"some_predicate": "some value"
}]
```
Predicates can take several kinds of values: objects or lists of objects (subqueries); strings and numbers (literal IDs that must match, equivalent to the object {"id": "value"}); or null. A null indicates that the object must have a predicate that matches, and the matching values will replace the null. A single null is filled with one such value; an empty list is filled with all such values, as strings.
## Keywords
* `id`: The value of the node.
## Reverse Predicates
Predicates always assume a forward direction. That is,
```json
[{
"id": "A",
"some_predicate": "B"
}]
```
will only match if the triple
```
A some_predicate B .
```
exists. In order to reverse the directions, "!predicates" are used. So that:
```json
[{
"id": "A",
"!some_predicate": "B"
}]
```
will only match if the triple
```
B some_predicate A .
```
exists.
## Multiple Predicates
JSON does not specify the behavior of objects with the same key. In order to have separate constraints for the same predicate, the prefix "@name:" can be applied to any predicate. This is slightly different from traditional MQL in that fully-qualified http paths may be common predicates, so we have an "@name:" prefix instead.
```json
[{
"id": "A",
"@x:some_predicate": "B",
"@y:some_predicate": "C"
}]
```
Will only match if *both*
```
A some_predicate B .
A some_predicate C .
```
exist.
This combines with the reversal rule to create paths like ``"@a:!some_predicate"``
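The two prefixes compose mechanically, which a short sketch can show. `mqlKey` is a hypothetical helper written for this illustration; it is not part of Cayley or MQL.

```javascript
// Builds an MQL key from a predicate, an optional "@name:" prefix and an
// optional "!" reversal marker.
function mqlKey(predicate, opts = {}) {
  const name = opts.name ? "@" + opts.name + ":" : "";
  const bang = opts.reverse ? "!" : "";
  return name + bang + predicate;
}

// Two forward constraints and one reversed lookup on the same predicate.
const mqlQuery = [{
  id: "A",
  [mqlKey("some_predicate", { name: "x" })]: "B",
  [mqlKey("some_predicate", { name: "y" })]: "C",
  [mqlKey("some_predicate", { name: "a", reverse: true })]: null
}];

console.log(JSON.stringify(mqlQuery, null, 2));
```

Because every key is distinct, the resulting JSON object no longer depends on how a parser treats repeated keys.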

142
docs/Overview.md Normal file

@ -0,0 +1,142 @@
# Overview
## Getting Started
This guide will take you through starting a persistent graph based on the provided data, with some hints for each backend.
### Building
#### Linux
**Ubuntu / Debian**
`sudo apt-get install golang git bzr mercurial make`
**RHEL / Fedora**
`sudo yum install golang git bzr mercurial make gcc`
**OS X**
[Homebrew](http://brew.sh) is the preferred method.
`brew install bazaar mercurial git go`
**Clone and build**
Now you can clone the repository and build the project.
```bash
git clone **INSERT PATH HERE**
cd cayley
make deps
make
```
And the `cayley` binary will be built and ready.
### Initialize A Graph
Now that Cayley is built, let's create our database. `init` is the subcommand to set up a database and the right indices.
You can set up a full [configuration file](/docs/Configuration) if you'd prefer, but it will also work from the command line.
Examples for each backend:
* `leveldb`: `./cayley init --db=leveldb --dbpath=/tmp/moviedb` -- where /tmp/moviedb is the path you'd like to store your data.
* `mongodb`: `./cayley init --db=mongodb --dbpath="<HOSTNAME>:<PORT>"` -- where HOSTNAME and PORT point to your Mongo instance.
Those two options (db and dbpath) are always going to be present. If you feel like not repeating yourself, setting up a configuration file for your backend might be something to do now. There's an example file, `cayley.cfg.example` in the root directory.
You can repeat the `--db` and `--dbpath` flags from here forward instead of the config flag, but let's assume you created `cayley.cfg.overview`
### Load Data Into A Graph
Let's extract the sample data, a couple hundred thousand movie triples, that comes in the checkout:
```bash
zcat 30kmoviedatauniq.n3.gz > 30k.n3
```
Then, we can load the data.
```bash
./cayley load --config=cayley.cfg.overview --triples=30k.n3
```
And wait. It will load. If you'd like to watch it load, you can run
```bash
./cayley load --config=cayley.cfg.overview --triples=30k.n3 --alsologtostderr
```
And watch the log output go by.
### Connect a REPL To Your Graph
Now that the data is loaded, we can use Cayley to connect to the graph. As you might have guessed, that command is:
```bash
./cayley repl --config=cayley.cfg.overview
```
Where you'll be given a `cayley>` prompt. It's expecting Gremlin/JS, but that can also be configured with a flag.
This is great for testing, and ultimately also for scripting, but the real workhorse is the next step.
### Serve Your Graph
Just as before:
```bash
./cayley http --config=cayley.cfg.overview
```
And you'll see a message not unlike
```bash
Cayley now listening on 0.0.0.0:64210
```
If you visit that address (often, [http://localhost:64210](http://localhost:64210)) you'll see the full web interface and also have a graph ready to serve queries via the [HTTP API](/docs/HTTP)
## UI Overview
### Sidebar
Along the side are the various actions or views you can take. From the top, these are:
* Run Query (run the query)
* Gremlin (a dropdown, to pick your query language)
----
* Query (a request/response editor for the query language)
* Query Shape (a visualization of the shape of the final query. Does not execute the query.)
* Visualize (runs a query and, if tagged correctly, gives a sigmajs view of the results)
* Write (an interface to write or remove individual triples or triple files)
----
* Documentation (this documentation)
### Visualize
To use the visualize function, emit, either through tags or JS post-processing, a set of JSON objects containing the keys `source` and `target`. These will be the links, and nodes will automatically be detected.
For example:
```javascript
[
  {
    "source": "node1",
    "target": "node2"
  },
  {
    "source": "node1",
    "target": "node3"
  }
]
```
Other keys are ignored. The upshot is that if you use the "Tag" functionality to add "source" and "target" tags, you can extract and quickly view subgraphs.
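A rough sketch of that idea: given result objects with `source` and `target` keys, the links are taken as-is and the node set is whatever those links mention. `toGraph` is a hypothetical helper written for this illustration and only approximates what the UI does with the results.

```javascript
// Derives links and the set of nodes from result objects. Keys other than
// "source" and "target" are ignored; results missing either key are skipped.
function toGraph(results) {
  const links = results
    .filter(r => r.source !== undefined && r.target !== undefined)
    .map(r => ({ source: r.source, target: r.target }));
  const nodes = [...new Set(links.flatMap(l => [l.source, l.target]))];
  return { nodes, links };
}

const visual = toGraph([
  { source: "node1", target: "node2", id: "ignored" },
  { source: "node1", target: "node3" },
  { id: "no links here" }
]);
console.log(visual.nodes); // node1, node2, node3
console.log(visual.links.length);
```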