Alex's Elasticsearch Adventure
Latest revision as of 14:20, 20 October 2014
I have been setting up an Elasticsearch database populated with test data, in order to see what the system is capable of.
First, I went through all the steps at https://genomevolution.org/wiki/index.php/Install_Elasticsearch.
Next, I began looking into loading multiple JSON objects into Elasticsearch at once. I found useful information at http://httpkit.com/resources/HTTP-from-the-Command-Line/ under the heading "Use a File as a Request Body".
I created a JSON file (I called it sample1.json) that looked like this:
{
  1: {
    id: 1,
    type_name: "gene",
    start: 0,
    stop: 1,
    strand: "+",
    chromosome: 1
    feature_name: { name1: "blah1", name2: "name", name3: "George", name4: "obligatory" }
  },
  2: {
    id: 2,
    type_name: "exon",
    start: 1776,
    stop: 2014,
    strand: "und",
    chromosome: 3
    feature_name: { name1: "stuff", name2: "at4g37764", name3: "578926", name4: "name_of_feature" }
  },
  3: {
    id: 3,
    type_name: "cds",
    start: 1,
    stop: 4,
    strand: "-",
    chromosome: 2
    feature_name: { name1: "stuff", name2: "at4g37764", name3: "578926", }
  }
}
I then tested the command
curl -X PUT \
  -H 'Content-Type: application/json' \
  -d @sample1.json \
  localhost:9200/testIndex/feature
and got a "No handler Found" error.
So, I tried reorganizing the command:
curl -XPUT localhost:9200/testIndex/feature -H 'Content-Type: application/json' -d @sample1.json
Same error:
No handler found for uri [/testIndex/feature] and method [PUT]
I tried again, this time actually specifying an _id of "test1" (the 1, 2, and 3 in the JSON file were supposed to be the _id fields):
curl -XPUT localhost:9200/testIndex/feature/test1 -H 'Content-Type: application/json' -d @sample1.json
Got yet another error:
{"error":"InvalidIndexNameException[[testIndex] Invalid index name [testIndex], must be lowercase]","status":400}
Alright, apparently it doesn't like the capital letter in "testIndex". In that case:
curl -XPUT localhost:9200/test_index/feature/test1 -H 'Content-Type: application/json' -d @sample1.json
Woo, more errors!
{"error":"MapperParsingException[failed to parse]; nested: JsonParseException[Unexpected character ('}' (code 125)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name\n at [Source: [B@23276b35; line: 1, column: 838]]; ","status":400}
Okay, it looks like I forgot to put quotes around my object labels in the JSON file. That's easy enough to fix. New sample1.json:
I ran the command and got the same error. Looking at it more closely, the quotes may not have been the issue (though it probably didn't hurt to add them). It appears to be having issues with one of my closing braces ( "}" ).
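A quick way to locate this kind of problem before sending a file to Elasticsearch is to run it through a strict JSON parser first. The sketch below uses Python's standard json module; the find_json_error helper and the two sample strings are made up for illustration and are not the actual contents of sample1.json. A "}" appearing where a field name was expected is what a strict parser typically reports for a comma left dangling before a closing brace, such as the one after name3 in the third object, so that is a plausible cause here, though these notes do not confirm it.

```python
import json


def find_json_error(text):
    """Return None if text is valid JSON, else (line, column, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return (err.lineno, err.colno, err.msg)
    return None


# Hypothetical stand-in for the third object in sample1.json: the keys are
# quoted, but a comma is left dangling before the closing brace.
broken = '{"3": {"id": 3, "feature_name": {"name3": "578926", }}}'
fixed = '{"3": {"id": 3, "feature_name": {"name3": "578926"}}}'

print(find_json_error(broken))  # a (line, column, message) tuple
print(find_json_error(fixed))   # None
```

The exact message and column differ between Python versions, so the tuple is best read as a pointer to the neighborhood of the problem rather than an exact diagnosis.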
.....................
After talking to Matt, we figured out how the Bulk API is supposed to work (found at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html).
So, new JSON file (sample2.json):
{ "create" : { "_id" : "one" } }\n
{ "feature1" : "This is the first feature" }\n
{ "create" : { "_id" : "two" } }\n
{ "feature2" : "This is the second feature" }\n
And new curl command (found the syntax at http://elasticsearch-users.115913.n3.nabble.com/How-to-index-a-JSON-file-td4033230.html):
curl -s -XPOST 'localhost:9200/testindex/feature/_bulk' --data-binary @sample2.json
Response from console:
{"took":302,"errors":false,"items":[{"create":{"_index":"testindex","_type":"feature","_id":"one","_version":1,"status":201}},{"create":{"_index":"testindex","_type":"feature","_id":"two","_version":1,"status":201}}]}
I'm assuming here that "errors":false means it loaded without errors, but just to be sure, let's run:
curl localhost:9200/testindex/feature/one
And we get:
{"_index":"testindex","_type":"feature","_id":"one","_version":1,"found":true,"_source":{ "feature1" : "This is the first feature" }\n}
Yay! Although it looks like we didn't need the newlines on the actual entries, just on the "create" commands. Not to worry, though: the Bulk API page says "create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary."
With that knowledge, let's edit sample2.json:
{ "index" : { "_id" : "one" } }\n
{ "feature1" : "This is the first feature" }
{ "index" : { "_id" : "two" } }\n
{ "feature2" : "This is the second feature" }
and run these one more time:
curl -s -XPOST 'localhost:9200/testindex/feature/_bulk' --data-binary @sample2.json
curl localhost:9200/testindex/feature/one
curl localhost:9200/testindex/feature/two
We get:
{"_index":"testindex","_type":"feature","_id":"one","_version":2,"found":true,"_source":{ "feature1" : "This is the first feature" }}
{"_index":"testindex","_type":"feature","_id":"two","_version":2,"found":true,"_source":{ "feature2" : "This is the second feature" }}
No more extraneous newlines, and the results look good!
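Rather than typing the file by hand, the two-line-per-document layout can be produced by a small script. The sketch below is only an illustration: build_bulk_body is a hypothetical helper, not part of any Elasticsearch client. It follows the Bulk API documentation's newline-delimited format, in which each action line and each source line ends with a real newline character (including the last one), so no typed backslash-n is needed.

```python
import json


def build_bulk_body(docs, action="index"):
    """Serialize {_id: source} pairs into a Bulk API request body.

    Each document becomes two lines: an action line naming the _id, then the
    source line. Every line, including the last, ends with a real newline.
    """
    if action not in ("index", "create"):
        raise ValueError("action must be 'index' or 'create'")
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({action: {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "".join(line + "\n" for line in lines)


docs = {
    "one": {"feature1": "This is the first feature"},
    "two": {"feature2": "This is the second feature"},
}
body = build_bulk_body(docs)
print(body, end="")
```

Writing body to a file and passing it with --data-binary (as in the curl command above) keeps those newlines intact; passing action="create" instead would reproduce the first version of sample2.json.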
.....................
I wrote a Java program called JSONGenerator.java to randomly generate 25,000 feature objects, which I then batch-loaded into the index.
javac JSONGenerator.java && java JSONGenerator > generatorTest.json
curl -s -XPOST 'localhost:9200/testindex/feature/_bulk' --data-binary @generatorTest.json
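The source of JSONGenerator.java is not reproduced on this page, so the following is not that program. It is a rough Python stand-in showing one way such a generator could work. The field names, type names, and strand values come from sample1.json; the numeric ranges, the chromosome count, the seed, and the function names are all assumptions chosen for illustration.

```python
import json
import random

TYPE_NAMES = ["gene", "exon", "cds"]   # values seen in sample1.json
STRANDS = ["+", "-", "und"]            # values seen in sample1.json


def random_feature(feature_id, rng, max_position=1_000_000, max_length=10_000):
    """Build one feature dict; the numeric ranges are assumptions."""
    start = rng.randrange(max_position)
    return {
        "id": feature_id,
        "type_name": rng.choice(TYPE_NAMES),
        "start": start,
        "stop": start + rng.randrange(1, max_length),
        "strand": rng.choice(STRANDS),
        "chromosome": rng.randint(1, 5),
    }


def bulk_lines(count, seed=0):
    """Yield Bulk API lines (action line, then source line) for ids 1..count."""
    rng = random.Random(seed)
    for feature_id in range(1, count + 1):
        yield json.dumps({"index": {"_id": str(feature_id)}})
        yield json.dumps(random_feature(feature_id, rng))


# Print a two-document sample; bulk_lines(25000) would give the full batch.
for line in bulk_lines(2):
    print(line)
```

Numbering the documents from 1 means a later, larger run replaces the earlier entries with the same ids instead of adding duplicates.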
All Elasticsearch data is located here:
/var/lib/elasticsearch/elasticsearch/nodes
Current size (du -hs): 7.2M
Performance tests:
Test used for range query:
curl -XPOST 'localhost:9200/testindex/feature/_search?pretty' -d @rangeQuery.json
rangeQuery.json:
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "start": {
              "gte": 400000
            }
          }
        },
        {
          "range": {
            "stop": {
              "lte": 500000
            }
          }
        }
      ]
    }
  }
}
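To make explicit what this query selects: both range clauses sit under "must", so a feature matches only if its start is at least 400000 and its stop is at most 500000, that is, it lies entirely inside the window. The sketch below builds the same body for arbitrary bounds and restates the condition as a local filter over a few made-up features. The function names and the sample features are hypothetical, and nothing here contacts Elasticsearch.

```python
import json


def contained_range_query(min_start, max_stop):
    """Build a query body for features lying entirely within [min_start, max_stop]."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"start": {"gte": min_start}}},
                    {"range": {"stop": {"lte": max_stop}}},
                ]
            }
        }
    }


def matches(feature, min_start, max_stop):
    """Local restatement of what the two range clauses select together."""
    return feature["start"] >= min_start and feature["stop"] <= max_stop


# Made-up features used only to exercise the boundaries.
features = [
    {"id": 1, "start": 400000, "stop": 450000},  # inside the window
    {"id": 2, "start": 399999, "stop": 450000},  # starts too early
    {"id": 3, "start": 450000, "stop": 500001},  # ends too late
]

print(json.dumps(contained_range_query(400000, 500000), indent=2))
matching_ids = [f["id"] for f in features if matches(f, 400000, 500000)]
print(matching_ids)  # [1]
```

Features that merely overlap the window (ids 2 and 3 above) are excluded, which is worth keeping in mind when reading the result counts.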
Other queries that work:
curl localhost:9200/testindex/feature/12345
curl localhost:9200/testindex/feature/_search?q=type_name:gene
Size of /var/lib/elasticsearch/elasticsearch/nodes before any indexing: 2.3M
Size of /var/lib/elasticsearch/elasticsearch/nodes after indexing 25,000 documents: 8.2M
Time to index: 0m4.618s
Time to query specific features: 0m0.024s, 0m0.020s, 0m0.020s
Time to query over a range of values: 0m0.030s (returned 7 results)
Size of /var/lib/elasticsearch/elasticsearch/nodes after indexing 250,000 documents: 121M
Time to index: 1m14.812s (this had to overwrite the previous entries for id 1-25,000)
Time to query specific features: 0m0.022s, 0m0.018s, 0m0.020s
Time to query over a range of values: 0m0.046s (returned 60 results)
Size of /var/lib/elasticsearch/elasticsearch/nodes after indexing 2,500,000 documents: 683M
Time to index: 1m3.951s + 0m54.742s + 0m54.438s + 0m54.805s + 1m4.098s + 0m45.388s + 1m3.402s + 1m1.535s + 0m57.968s + 0m55.533s
Total time to index: 575.860s = 9m35.860s
(the above had to overwrite the previous entries for id 1-250,000, and had to be done in 10 segments)
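The total can be checked by parsing the ten segment timings and summing them. This is a small verification sketch: the helper names are made up, and the input strings are copied from the segment timings listed above.

```python
import re
from decimal import Decimal

TIMING = re.compile(r"(\d+)m(\d+(?:\.\d+)?)s")


def to_seconds(timing):
    """Convert a string such as '1m3.951s' to Decimal seconds."""
    match = TIMING.fullmatch(timing)
    if match is None:
        raise ValueError(f"unrecognized timing: {timing!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + Decimal(seconds)


def to_timing(seconds):
    """Convert Decimal seconds back to the 'XmY.YYYs' form."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)}m{rest:.3f}s"


segments = (
    "1m3.951s 0m54.742s 0m54.438s 0m54.805s 1m4.098s "
    "0m45.388s 1m3.402s 1m1.535s 0m57.968s 0m55.533s"
).split()

total = sum(to_seconds(s) for s in segments)
print(total, to_timing(total))  # 575.860 9m35.860s
```

Decimal is used instead of float so that the millisecond digits add up exactly.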
Time to query specific features: 0m0.034s, 0m0.063s, 0m0.017s
Time to query over a range of values: 0m0.324s (returned 597 results)
Size of /home/elasticsearch/Data/elasticsearch/nodes after indexing 300,000,000 documents: 65G (had to move it to a new area of the hard drive)
Time to index: Several days of continuous loading
Time to query specific features: 0m0.133s, 0m0.137s, 0m0.020s
Time to query over a range of values: 0m2.498s (returned 65490 results)
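For a rough sense of how disk usage scales, the directory sizes above can be divided by the document counts. The sketch below assumes the M and G suffixes are the binary units that du -h reports, and it ignores the 2.3M the directory occupied before any indexing, so the figures are only ballpark estimates. The sizes may also include older copies of overwritten documents that had not yet been cleaned up, which these notes do not measure.

```python
UNIT_BYTES = {"M": 1024 ** 2, "G": 1024 ** 3}  # assumes du -h binary units


def size_to_bytes(size):
    """Convert a du -h style size such as '8.2M' or '65G' to bytes."""
    return float(size[:-1]) * UNIT_BYTES[size[-1]]


# (documents indexed, directory size) pairs copied from the measurements above.
measurements = [
    (25_000, "8.2M"),
    (250_000, "121M"),
    (2_500_000, "683M"),
    (300_000_000, "65G"),
]

for documents, size in measurements:
    per_document = size_to_bytes(size) / documents
    print(f"{documents:>11,} documents: about {per_document:.0f} bytes each")
```

On these numbers the per-document footprint stays within the same order of magnitude (a few hundred bytes) across all four runs.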