The volume of data never stops growing. Over time this creates challenges such as insufficient internal storage and slow retrieval of information from your content. Elasticsearch can help. Here we describe what it is, how it enables near real-time search, and what the TrackIt team has put in place and found beneficial.
Elasticsearch is a free and open, distributed search and analytics engine for all types of data, built on Apache Lucene. As the central component of the Elastic Stack, it offers simple REST APIs, speed, and scalability.
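To give a feel for the REST API, here is a minimal sketch that builds (but does not send) a full-text search request using only the Python standard library. The `_search` endpoint and the `match` query are part of the Elasticsearch API; the host `http://localhost:9200`, the index name `articles`, and the field `title` are assumptions made up for illustration.

```python
import json
from urllib.request import Request


def build_search_request(base_url: str, index: str, field: str, text: str) -> Request:
    """Build (without sending) a full-text search request for one index."""
    # A "match" query analyzes the text and looks up the resulting terms.
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index, and field names, for illustration only.
request = build_search_request(
    "http://localhost:9200", "articles", "title", "distributed search"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running cluster; authentication and TLS settings depend on your deployment and are left out here.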
The Elastic Stack provides free and open tools for data ingestion, enrichment, storage, analysis, and visualization. It was formerly called ELK Stack (Elasticsearch, Logstash, Kibana).
This stack is constructed around three main components: Elasticsearch, Kibana, and integrations.
Kibana is a user interface that allows you to visualize Elasticsearch data and navigate around your Elastic Stack.
Elastic integrations simplify the connection to common data sources such as AWS (DynamoDB, SQS, RDS, EC2, S3, …), Microsoft Azure, Google Cloud services, and many others (https://www.elastic.co/integrations/data-integrations).
In January 2021, Elastic NV announced a change to its software licensing strategy: new versions are released under the Elastic License and the SSPL, and no longer under the Apache License v2 (ALv2). Because these are not open-source licenses, Amazon decided to fork the project. It created and now maintains OpenSearch and OpenSearch Dashboards, which are open-source forks of Elasticsearch and Kibana, respectively. As a result, Amazon renamed the service formerly called Amazon Elasticsearch Service to Amazon OpenSearch Service.
For more information about this license change, see the AWS announcement: https://aws.amazon.com/fr/blogs/aws/amazon-elasticsearch-service-is-now-amazon-opensearch-service-and-supports-opensearch-10/ . The Elastic FAQ on this topic is available at: https://www.elastic.co/pricing/faq/licensing .
Elasticsearch and OpenSearch currently differ in some newer features, but under the hood they share the same structure. We will go through that architecture and describe what makes searches fast.
Elasticsearch is distributed by nature. In addition, it builds on Lucene, whose data structures make searches and analytics very fast. Let’s dive into these two main concepts.
The Elasticsearch architecture is composed of indexes (comparable to databases in a relational system), each of which is split into one or more shards. Shards can be spread across different servers (also called nodes), which together form a cluster. Each shard is a self-contained Lucene index that acts like an independent search engine. As expected in a distributed architecture, a search runs on the shards in parallel, which reduces execution time.
Replica shards are copies of a primary shard. They protect against a single point of hardware failure and increase read request capacity.
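The idea of spreading documents over shards and searching them in parallel can be sketched with a toy model. This is an illustration under stated assumptions, not how Elasticsearch is implemented: `zlib.crc32` stands in for the real routing hash, each "shard" is a plain dictionary rather than a Lucene index, and all names (`shard_for`, `scatter_gather_search`, the sample documents) are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

NUM_PRIMARY_SHARDS = 3  # fixed when the index is created


def shard_for(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard from a hash of the document ID (crc32 is a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_documents(docs: Dict[str, str]) -> List[Dict[str, str]]:
    """Distribute documents over the shards according to their routing hash."""
    shards: List[Dict[str, str]] = [{} for _ in range(NUM_PRIMARY_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards


def search_shard(shard: Dict[str, str], keyword: str) -> List[str]:
    """Each shard only looks at its own documents."""
    return [
        doc_id for doc_id, text in shard.items()
        if keyword.lower() in text.lower().split()
    ]


def scatter_gather_search(shards: List[Dict[str, str]], keyword: str) -> List[str]:
    """Query every shard concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_hits = list(pool.map(lambda s: search_shard(s, keyword), shards))
    return sorted(doc_id for hits in partial_hits for doc_id in hits)


docs = {
    "doc-1": "Elasticsearch is built on Apache Lucene",
    "doc-2": "Kibana visualizes Elasticsearch data",
    "doc-3": "Lucene uses an inverted index",
    "doc-4": "OpenSearch is a fork of Elasticsearch",
}
shards = index_documents(docs)
print(scatter_gather_search(shards, "lucene"))
```

In a real cluster the shards live on different nodes, so each one scans a fraction of the data on its own hardware. Replica shards would add further copies that can also answer read requests.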
Apache Lucene is a very powerful search engine library written entirely in Java. It is suitable for almost any application that requires structured search, full-text search, faceting, nearest neighbor search on high-dimensional vectors, spell checking, or query suggestions. Elasticsearch uses Lucene’s inverted index data structure, which maps each term to the documents that contain it.
An inverted index is very powerful: once documents have been converted into this term-to-document mapping, a keyword search quickly returns the list of matching documents without scanning every document.
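A toy version of an inverted index makes the idea concrete. This is a minimal sketch and not Lucene's actual implementation, which adds analysis, scoring, compression, and much more; the function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # One lookup per term, then intersect; no document text is scanned.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    1: "Elasticsearch is built on Apache Lucene",
    2: "Lucene uses an inverted index",
    3: "Kibana visualizes Elasticsearch data",
}
index = build_inverted_index(documents)
print(sorted(search(index, "lucene")))              # [1, 2]
print(sorted(search(index, "elasticsearch data")))  # [3]
```

The cost of a lookup depends on the number of query terms and matching documents rather than on the total size of the collection, which is why keyword searches stay fast as the data grows.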
The TrackIt team has made use of Elasticsearch’s speed in many projects. One of them is described here, where the client was using DynamoDB. Searches took an average of 60 seconds, which was far too slow for queries over a large amount of data.
We implemented Amazon OpenSearch Service to address the query time. It was chosen over Elasticsearch because it is a managed service that provides three main benefits: monitoring and debugging applications and infrastructure, managing security and event information, and enabling seamless personalized search.
In addition, we ran it inside an AWS solution called AWS Media2Cloud to ingest data into the Elasticsearch database. You can find more information about this AI/ML solution here: https://aws.amazon.com/solutions/implementations/media2cloud/.
Media2Cloud also helps deploy an Amazon OpenSearch cluster, which contributes greatly to search speed.
After this new architecture was deployed, search time dropped from approximately 60 seconds to approximately 4 seconds, which is nearly real time from the client’s perspective!
In this article, we have outlined what Elasticsearch is, how it relates to the Elastic Stack, and what makes it fast: a distributed architecture that spreads work across all nodes in the cluster, and Apache Lucene with its inverted index technique. Finally, we described a TrackIt project in which the client experienced dramatic performance improvements thanks to TrackIt’s implementation of AWS Media2Cloud with the Elastic Stack.