Performance tips¶

The performance of the Advanced Lucene index depends on several factors related to your use case, so you will probably need to do some tuning. However, some general advice applies.

Also take a look at the performance tips in Apache Ignite's official documentation.

Choose the right use case¶

Lucene searches take more time and consume more resources than their Apache Ignite counterparts, so they are not an alternative to Apache Ignite's H2 SORTED or SPATIAL indexes. The exception is spatial search: SPATIAL indexes are preferable when searching areas with a low density of points, but on areas with a high density of points the Lucene index is much faster.

In most cases, it is a bad idea to model a system with simple skinny rows and try to satisfy all queries with Lucene.

For example, the following search could be more efficiently addressed using a denormalized table:

SELECT * FROM "test".users WHERE lucene = '{
   filter: {
      type: "match",
      field: "name",
      value: "Alice"
   }
}';

However, this search could be a good use case for Lucene just because there is no easy counterpart:

SELECT * FROM "test".users WHERE lucene = '{
   filter: [
      {type: "regexp", field: "name", value: "[J][aeiou]{2}.*"},
      {type: "range", field: "birthday", lower: "2014/04/25"}
   ],
   sort: [
      {field: "birthday", reverse: true },
      {field: "name"}
   ]
}' LIMIT 20;

Lucene indexes are intended for those cases that can't be efficiently addressed with Apache Ignite's out-of-the-box techniques, such as full-text queries, multidimensional queries, geospatial search and bitemporal data models.

Affinity collocation¶

You can route any search to one or more of a cache's affinity keys, so that only a subset of the cluster nodes is hit. This saves precious resources when you have PARTITIONED caches with billions of entries in a cluster with hundreds of nodes.

Example 1: Affinity keys on SQL clauses

The query below routes the search to the single node that holds the primary cache partition for countryCode = 'ES'. Remember that countryCode is the tweet's affinity key.

Pay attention to the index hint in the SQL query: USE INDEX(tweet_lucene_idx). It ensures that the SQL query optimizer runs the query with the TWEET_LUCENE_IDX index and not with another one.

For this query to make sense, assume that the geographical position used in it (latitude: 40.416775, longitude: -3.703790) is within Spain, that is, countryCode = 'ES'.

SQL

-- The affinity key condition is the final clause: and countryCode = 'ES'
SELECT *, 
PUBLIC.ST_DISTANCE_SPHERE(place, 'POINT (-3.703790 40.416775)')/1000 as distance_km 
FROM "tweets".tweet USE INDEX(tweet_lucene_idx)
WHERE lucene = '{
   filter: [
      {type: "range", field: "time", lower: "2018/12/1", upper: "2020/12/30"},
      {type: "prefix", field: "user", value: "user_1"},
      {type: "geo_distance", field: "place", latitude: 40.416775, longitude: -3.703790, max_distance: "20km"},
      {type: "match", field: "countryCode", value: "ES"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: [
      {field: "place", type: "geo_distance", latitude: 40.416775, longitude: -3.703790}
   ]
}' 
and countryCode = 'ES' 
limit 10;

Result

+------+-------------+------------+------------------------------------+-------------------------+------------------------------------------------+-------------------+
|   ID | COUNTRYCODE |       USER |                               BODY |                    TIME |                                          PLACE |        DISTANCE_KM|
+======+=============+============+====================================+=========================+================================================+===================+
| 1962 |          ES |  user_1962 |  big data gives organizations 1962 | 2018-12-23 17:49:08.281 |  POINT (-3.786811904736386 40.379710208662516) |   8.14941336550081|
+------+-------------+------------+------------------------------------+-------------------------+------------------------------------------------+-------------------+
|10661 |          ES | user_10661 | big data gives organizations 10661 | 2020-09-19 07:03:56.281 |  POINT (-3.7719599008733478 40.52679600981478) | 13.524692636452208|
+------+-------------+------------+------------------------------------+-------------------------+------------------------------------------------+-------------------+
| 1331 |          ES |  user_1331 |  big data gives organizations 1331 | 2018-12-11 17:08:11.281 | POINT (-3.5028367241014378 40.393159496015976) | 17.216759410938533|
+------+-------------+------------+------------------------------------+-------------------------+------------------------------------------------+-------------------+

Example 2: Full Cache's key conditions on SQL clauses

The example below routes the search to the nodes that hold the primary cache partitions for the provided tweet affinity keys (the countryCode conditions). Furthermore, because the rest of the cache key conditions are also provided (the id conditions), the search can be optimized within a partitioned Lucene index. Remember that the tuple (id, countryCode) is the tweet's primary key.

Pay attention to the index hint in the SQL query: USE INDEX(tweet_lucene_idx). It ensures that the SQL query optimizer runs the query with the TWEET_LUCENE_IDX index and not with another one.

To illustrate this example, assume we have:

An Apache Ignite cluster with 1000 server nodes.
An Advanced Lucene index over Tweet entity, configured with 10 sub-partitions.
A server node, named "node A", that owns P1 and P3 cache's partitions, among others.
Another server node, named "node B", that owns P2 cache's partition, among others.

SQL

-- The cache key conditions are the countryCode and id clauses after the lucene expression
SELECT * FROM "tweets".tweet USE INDEX (tweet_lucene_idx) WHERE lucene = '{
query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
filter: [
        {type: "range", field: "time", lower: "2019/1/1", upper: "2039/12/1"},
        {
          should: [
                {type: "match", field: "countryCode", value: "ES"},
                {type: "match", field: "countryCode", value: "FR"},
                {type: "match", field: "countryCode", value: "PT"}               
          ]
        },
        {
          should: [
                {type: "match", field: "id", value: 6211},
                {type: "match", field: "id", value: 102100},
                {type: "match", field: "id", value: 203178}
          ]
        }
        ]      
}' 
-- the clause below could also be written as: and countryCode IN ('ES','FR','PT') and id IN (6211, 203178, 102100)
and (countryCode = 'ES' or countryCode = 'FR' or countryCode = 'PT') and (id = 6211 or id = 203178 or id = 102100) 
limit 10;

Result

+-------+-------------+-------------+-------------------------------------+-------------------------+-----------------------------------------------+
|    ID | COUNTRYCODE |        USER |                                BODY |                    TIME |                                          PLACE|
+=======+=============+=============+=====================================+=========================+===============================================+
|102100 |          PT | user_102100 | big data gives organizations 102100 | 2025-08-23 17:48:56.055 |   POINT (-9.646988962187482 35.14750982391932)|
+-------+-------------+-------------+-------------------------------------+-------------------------+-----------------------------------------------+
|203178 |          FR | user_203178 | big data gives organizations 203178 |  2039-03-23 17:12:45.24 | POINT (-0.6896439798875602 48.911069097495634)|
+-------+-------------+-------------+-------------------------------------+-------------------------+-----------------------------------------------+
|  6211 |          ES |   user_6211 |   big data gives organizations 6211 | 2019-07-12 18:36:51.281 | POINT (-0.5836816462171566 37.461397687797195)|
+-------+-------------+-------------+-------------------------------------+-------------------------+-----------------------------------------------+

Let's see how search optimization by affinity on a partitioned Lucene Index works on this example:

graph TB

        subgraph Node launching Query
            calc("#160;Calculates derived cache's partitions<br/>from affinity keys on Query's filter") -- "P1, P2, P3 ." --> parts("#160;Routes query to server nodes <br/>owning calculated <br/>cache's primary partitions")
        end


        parts -- "#160;Query hits server node A<br/>as contains P1, P3 partitions<br/>(9 cache key conditions)" --> nodeA((Node A))

        parts --  "#160;Query hits server node B<br/>as contains P2 partition<br/>(9 cache key conditions)"--> nodeB((Node B))

        subgraph Node A
            subgraph Local Advanced Lucene Index with 10 sub-partitions
                nodeA("#160;Discards cache key conditions<br/> containing countryCode='FR'") -- "#160;6 cache key conditions by affinity" --> subpartsA("#160;Calculates Lucene index sub-partitions <br/>from cache key conditions")
                subpartsA -- "#160;p_0, p_1 and p_9 (3 of 10 sub-partitions)" --> fetchA("#160;Query lucene index<br/>using sub-partitions <br/> p_0, p_1 and p_9")
            end
        end

        subgraph Node B
            subgraph Local Advanced Lucene Index with 10 sub-partitions
                nodeB("#160;Discards cache key conditions<br/> containing countryCode IN ['ES','PT']") -- "#160;3 cache key conditions by affinity" --> subpartsB("#160;Calculates Lucene index sub-partitions <br/>from cache key conditions")
                subpartsB -- "#160;p_2 and p_4 (2 of 10 sub-partitions)" --> fetchB("#160;Query lucene index<br/>using sub-partitions<br/> p_2 and p_4")
            end
        end


    classDef centered text-align:center;
    classDef green fill:#9f6,stroke:#333,stroke-width:2px;
    classDef orange fill:#f96,stroke:#333,stroke-width:4px;

    class parts orange;
    class fetchA,fetchB green;       
    class calc,parts,nodeA,nodeB,subpartsA,subpartsB,fetchA,fetchB centered;

The query hits at most 3 (worst case) of the 1000 server nodes, because the derived partitions are calculated from the affinity keys below:

countryCode = 'ES' -----> P1
countryCode = 'FR' -----> P2
countryCode = 'PT' -----> P3
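
As a rough illustration, the routing step can be sketched in a few lines of Python. This is not the Apache Ignite or Advanced Lucene index API: the two dictionaries and the `hit_nodes` helper are hypothetical and only encode the assignment assumed in this example (node A owns P1 and P3, node B owns P2).

```python
# Illustrative assignment copied from this example (not real Ignite output).
PARTITION_BY_AFFINITY_KEY = {"ES": "P1", "FR": "P2", "PT": "P3"}
OWNER_BY_PARTITION = {"P1": "node A", "P2": "node B", "P3": "node A"}


def hit_nodes(affinity_keys):
    """Return the server nodes owning the partitions derived from the keys."""
    partitions = {PARTITION_BY_AFFINITY_KEY[key] for key in affinity_keys}
    return {OWNER_BY_PARTITION[partition] for partition in partitions}


nodes = hit_nodes(["ES", "FR", "PT"])
print(sorted(nodes))  # ['node A', 'node B']
```

Three affinity keys yield three partitions, but only two nodes are hit because P1 and P3 live on the same node.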

Note

The cache partitions derived from the affinity keys on the filter may be located on the same server node. This means that the number of server nodes hit by the query is always <= the number of derived cache partitions; this is how Apache Ignite works. Also note that different affinity keys may map to the same cache partition.

See Query execution flow optimizations and Affinity Collocation in the official Apache Ignite documentation.

Within a node, at most 9 (worst case) Lucene index sub-partitions need to be searched, because the sub-partitions are calculated from the cache key condition combinations below:

countryCode = 'ES' and id = 6211
countryCode = 'ES' and id = 203178
countryCode = 'ES' and id = 102100
countryCode = 'FR' and id = 6211
countryCode = 'FR' and id = 203178
countryCode = 'FR' and id = 102100
countryCode = 'PT' and id = 6211
countryCode = 'PT' and id = 203178
countryCode = 'PT' and id = 102100
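
The nine combinations are simply the Cartesian product of the countryCode and id conditions. A minimal Python sketch, using only the values from the query above (the variable names are illustrative):

```python
from itertools import product

country_codes = ["ES", "FR", "PT"]
ids = [6211, 203178, 102100]

# Every (countryCode, id) pair is a candidate full cache key.
key_conditions = list(product(country_codes, ids))

for country_code, tweet_id in key_conditions:
    print(f"countryCode = '{country_code}' and id = {tweet_id}")
```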

For this example, the cache partitions derived from the affinity key conditions are located on 2 server nodes:

node A, which owns cache partitions P1 and P3, among others.
node B, which owns cache partition P2, among others.

On node A, at most 6 (worst case) of the 10 Lucene index sub-partitions (p_#) need to be searched. Cache key condition combinations with countryCode = 'FR' are discarded, because this node is not an affinity node for that affinity key, so:

For this example, the Lucene index sub-partitions (p_#) are calculated from the cache key condition combinations below:
1. countryCode = 'ES' and id = 6211 -------> p_0
2. countryCode = 'ES' and id = 203178 -----> p_0
3. countryCode = 'ES' and id = 102100 -----> p_9
4. countryCode = 'PT' and id = 6211 -------> p_1
5. countryCode = 'PT' and id = 203178 -----> p_0
6. countryCode = 'PT' and id = 102100 -----> p_0
The Lucene query will use 3 of 10 Lucene index sub-partitions: p_0, p_1 and p_9. Search cost reduced by 70%.

On node B, at most 3 (worst case) of the 10 Lucene index sub-partitions (p_#) need to be searched. Cache key condition combinations with countryCode = 'ES' or countryCode = 'PT' are discarded, because this node is not an affinity node for those affinity keys, so:

For this example, the Lucene index sub-partitions (p_#) are calculated from the cache key condition combinations below:
1. countryCode = 'FR' and id = 6211 -------> p_2
2. countryCode = 'FR' and id = 203178 -----> p_4
3. countryCode = 'FR' and id = 102100 -----> p_2
The Lucene query will use 2 of 10 Lucene index sub-partitions: p_2 and p_4. Search cost reduced by 80%.
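
The per-node pruning described above can be sketched as follows. This is a hypothetical Python model, not the index's real implementation: the sub-partition numbers are copied from this example, whereas in a real cluster they depend on how the index distributes cache keys.

```python
# Affinity keys served by each node in this example.
AFFINITY_KEYS_BY_NODE = {"node A": {"ES", "PT"}, "node B": {"FR"}}

# Illustrative sub-partition per (countryCode, id) key, as listed above.
SUB_PARTITION_BY_KEY = {
    ("ES", 6211): "p_0", ("ES", 203178): "p_0", ("ES", 102100): "p_9",
    ("PT", 6211): "p_1", ("PT", 203178): "p_0", ("PT", 102100): "p_0",
    ("FR", 6211): "p_2", ("FR", 203178): "p_4", ("FR", 102100): "p_2",
}


def sub_partitions_to_search(node):
    """Keep only the keys whose affinity key belongs to the node, then deduplicate."""
    local_keys = AFFINITY_KEYS_BY_NODE[node]
    return {
        sub_partition
        for (country_code, _tweet_id), sub_partition in SUB_PARTITION_BY_KEY.items()
        if country_code in local_keys
    }


for node in ("node A", "node B"):
    print(node, sorted(sub_partitions_to_search(node)))
```

Several key conditions can fall into the same sub-partition, which is why node A searches 3 sub-partitions rather than 6.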

Improvements summary:

Global resource consumption reduced by 99.8% by using the affinity key: search on 2 of 1000 nodes.
Mean search cost on the Lucene index reduced by 75% by the partitioned Lucene index search optimization using cache key conditions: search on 5 of 20 sub-partitions.
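
The percentages in this summary follow from simple arithmetic. A short Python check, assuming (as this example does) that cost is proportional to the number of nodes or sub-partitions searched; `reduction_percent` is an illustrative helper:

```python
def reduction_percent(searched, total):
    """Percentage of units skipped when only `searched` of `total` are used."""
    return 100.0 * (total - searched) / total


# 2 of 1000 server nodes are hit.
node_reduction = reduction_percent(searched=2, total=1000)

# 3 sub-partitions on node A plus 2 on node B, out of 10 on each node.
sub_partition_reduction = reduction_percent(searched=3 + 2, total=10 + 10)

print(f"{node_reduction:.1f}%")           # 99.8%
print(f"{sub_partition_reduction:.1f}%")  # 75.0%
```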

Use the latest version¶

Each new version might be as fast as or faster than the previous one, so please try to use the latest version if possible. You can find the list of changes and performance improvements in the release notes.

Use a separate disk¶

You will get better performance by using a separate disk for the Lucene index files. You can set the location where the index is persisted with the directory path option in the index options; by default, the directory path is IGNITE_WORK_DIRECTORY. This option is ignored if native persistence is not enabled for the underlying Apache Ignite cache.

When persistence is enabled, Hawkore's Advanced Lucene Index is stored in a well-known directory structure. You can mount the desired path on separate disks.

IGNITE_WORK_DIRECTORY or configured directory path
    └── db
        └── lucene
            └── cache-CACHE_NAME (per cache name)
                └── TABLENAME_LUCENE_IDX (per lucene index name)
                    ├── s_0 (per degree # of parallelism)
                    ├── ...
                    └── s_(#-1)
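
To decide which paths to mount on a separate disk, it can help to enumerate the expected directories. The Python sketch below only mirrors the tree shown above; the `lucene_index_dirs` helper and the base directory `/opt/ignite/work` are hypothetical, and the exact names on disk should be checked against a real installation.

```python
from pathlib import PurePosixPath


def lucene_index_dirs(base_dir, cache_name, index_name, parallelism):
    """List the s_0 .. s_(#-1) directories expected for one Lucene index."""
    index_root = (
        PurePosixPath(base_dir) / "db" / "lucene" / f"cache-{cache_name}" / index_name
    )
    return [index_root / f"s_{i}" for i in range(parallelism)]


# Hypothetical base directory; cache and index names taken from the tweets example.
dirs = lucene_index_dirs("/opt/ignite/work", "tweets", "TWEET_LUCENE_IDX", 2)
for path in dirs:
    print(path)
```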

Disregard the first query¶

Lucene makes heavy use of caching, so the first query against an index will be especially slow due to the cost of initializing caches. Thus, you should disregard the first query when measuring performance.
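
One way to apply this when benchmarking is to run a few untimed warm-up executions before measuring. The Python sketch below is generic: `measure` and `fake_query` are hypothetical names, and `fake_query` is a stand-in for whatever client call actually runs your search.

```python
import time


def measure(run_query, repetitions=5, warmup=1):
    """Time `run_query`, discarding the first `warmup` executions."""
    for _ in range(warmup):
        run_query()  # warm-up run: caches get initialized, timing is ignored
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def fake_query():
    # Placeholder workload standing in for a real Lucene search.
    sum(range(10_000))


timings = measure(fake_query)
print(f"mean: {sum(timings) / len(timings):.6f} s over {len(timings)} runs")
```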

Index only what you need¶

The more fields you index, the more resources will be consumed. So you should carefully study which kinds of queries you are going to use before creating the schema.

Use a low refresh rate¶

You can choose any index refresh rate you need, and you can expect good behaviour even with a refresh interval of just one second. The default refresh interval is 60 seconds, which is a pretty conservative value. However, high refresh rates (that is, short intervals) imply a higher overall resource consumption, so you should use a refresh rate as low as your use case allows. You can set the refresh rate using the refresh seconds option in the index options.

Prefer filters over queries¶

Query searches involve relevance so they should be sent to all nodes in the cluster in order to find the globally best results. However, filters have a chance to find the results in a subset of the nodes. So if you are not interested in relevance sorting then you should prefer filters over queries.

Try doc values¶

Match, range and contains searches have a property named doc_values that can be used with single-column not-analyzed fields. When enabled, these searches will use doc values instead of the inverted index. Doc values searches are typically slower, but they can be faster in the dense case where most rows match the search. So, if you suspect that your search is going to match most rows in the table, try to enable doc_values, because it could dramatically improve performance in some cases.

Lucene index optimization¶

The JMX interface allows you to force a complete index optimization by merging segments. This is a very heavy operation that can significantly improve search performance. Although this operation is not mandatory at all, you should consider using it if your system has off-peak hours that can be used for optimization tasks.

By default, the distributed optimization of the Advanced Lucene index happens automatically and does not require any explicit action from the user: it runs every day at 1:00 A.M. You can change this default behavior with the optimizer enabled and optimizer schedule options in the index options.
