Continuing our Elasticsearch + Kibana blog series: after the data is extracted from the source(s), transformed, and readied for loading, the next step is to configure the Elasticsearch environment and set up the proper structure to support the required data queries.
The first decision to make when launching an Elasticsearch cluster is how it will be hosted. The most common options are running it as a managed service, usually from Elastic directly or another provider, where maintenance and scalability are owned by the provider, or running it on self-managed on-premise or cloud-based VMs, trading lower hosting costs for system administration responsibilities.
While Elasticsearch can also be run locally for testing and development, either as a standalone server or with Docker, a production cluster will most likely require multiple nodes and larger-spec servers. Ultimately, the choice of deployment target should weigh the desired level of maintenance effort, networking access, and cost.
Once the Elasticsearch deployment target is determined, the next step is to model how the final data will be stored in Elasticsearch. At the lowest level of the structure is the Field, one for each property, each with a specific data type (e.g., text, long, date). One or more Fields are combined to create a Document, which is then stored alongside other Documents inside an Index. While everything can go in as a simple “text” type, using the proper data types is crucial to enable more complex searching and analytics. For example, if we wanted to represent a very basic patient record, the Document could have two Fields (name, of type text, and dob, of type date), and be represented in JSON as:
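A sketch of such a Document (the values here are placeholders):

```json
{
  "name": "Jane Doe",
  "dob": "1980-01-15"
}
```

Note that the field types themselves are not part of the document; they are declared in the Index mapping, which Elasticsearch consults when indexing each field.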
The implementation of these structures can be done directly with HTTP requests, but most client libraries implement this part of the API as well, allowing for code-based model creation and better version control.
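As a minimal sketch of code-based model creation, the mapping for the patient record above could be defined and applied with the official Python client (elasticsearch-py). The index name `patients` and the connection details are assumptions for illustration:

```python
# Explicit mapping for the basic patient record: declaring field types
# up front instead of relying on dynamic "text" defaults.
PATIENT_MAPPINGS = {
    "properties": {
        "name": {"type": "text"},  # full-text searchable
        "dob": {"type": "date"},   # enables range queries and date math
    }
}

def create_patient_index(es, index="patients"):
    """Create the index with an explicit mapping, skipping if it exists.

    `es` is expected to be an elasticsearch.Elasticsearch client, e.g.:
        es = Elasticsearch("http://localhost:9200")
    """
    if not es.indices.exists(index=index):
        es.indices.create(index=index, mappings=PATIENT_MAPPINGS)
```

Keeping the mapping in code like this means the data model can live in the same repository as the rest of the pipeline and be reviewed and versioned alongside it.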
For flat-file (CSV, JSON, etc.) data loads, one option is to load into Elasticsearch using the Elastic Data Loader. With this tool, you can load one or more flat files, automatically create the Index and Field types if necessary, and make the data available for use. (This tool was developed in house by Galen and is available for free through the project repository on GitHub.)
The last method covered here is using the Elasticsearch API, or a client library for your language, to load the data directly. As covered in the previous post, if a custom data aggregator is required for extracting data from multiple sources, this may be the best route to follow. Once the data is extracted and transformed, integrating directly with the Elasticsearch API completes the ETL pipeline and makes for a smoother load.
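For loading more than a handful of documents, the API's `_bulk` endpoint is the usual route: the request body is newline-delimited JSON, alternating an action line with the document source. A sketch of building that body (the index name and records are placeholders):

```python
import json

def to_bulk_lines(index, docs):
    """Serialize documents into the newline-delimited body expected by
    the Elasticsearch _bulk endpoint: one action line followed by one
    source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

records = [
    {"name": "Jane Doe", "dob": "1980-01-15"},
    {"name": "John Roe", "dob": "1975-07-02"},
]
body = to_bulk_lines("patients", records)
# POST `body` to http://<cluster>/_bulk with
# Content-Type: application/x-ndjson, or hand the records to a client
# helper such as elasticsearch.helpers.bulk() instead.
```

In practice a client helper handles chunking and error reporting, but understanding the underlying NDJSON format makes debugging failed loads much easier.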
Now, with the data extracted, transformed, and loaded properly into Elasticsearch, the final step in our process is to build the user-facing portion of this unified platform.