Some organizations may only need to extract data from a single source, but as mentioned in our introduction, more often than not there are multiple sources involved, each with different ways of accessing the desired data. Lucky for us, one of Elasticsearch’s strengths is its HTTP RESTful API and the community support for language clients built around that API. Whichever tech stack a development team chooses, adopting Elasticsearch almost never requires learning a new programming language.
There are many ways to aggregate data from desired sources, among them:
- Export to a flat file (CSV, JSON, XML)
- Custom built extracting solutions
For one-time or infrequent data extracts, the best option is to create a flat file extract directly from the application via a built-in export tool. If the application has no built-in export tool, does not export in an easily transportable format, or cannot support a query as complex as yours, you can access the database directly and create the extract with a database management tool. For example, if your application runs on Microsoft SQL Server, you could run your query in SQL Server Management Studio, review the results, and export them to a CSV flat file.
The other option we want to cover in this post, a custom-built approach, is best for situations that require complex data extracts, aggregation of data from multiple sources, and/or near-real-time replication into the unified solution. Because Elasticsearch is not a one-size-fits-all solution, there is most likely no exact template to fit your needs. For example, the goal may be to create a unified platform populated with a more holistic patient record, since the raw data is fragmented across multiple disconnected systems. In this example those data sources could be a FHIR-enabled (Fast Healthcare Interoperability Resources) Electronic Health Record (EHR), a standalone database, and a blob storage directory. Each source contains part of the relevant patient record, but the sources are separate and require different technical mechanisms for extraction.
Therefore, a custom data aggregation tool is required to extract data from all the sources, transform it to meet the requirements, and make it available for the next step of loading into the Elasticsearch cluster. While such a tool could exist separately, as a kind of microservice that prepares data for loading, it usually makes more sense for it to live as a single application that is aware of all data sources and able to perform every necessary transformation. An additional advantage of this architecture is its support for a pull model of data extraction, where the aggregation tool requests data from the sources at a regular interval; it can also be set up to support a push model, receiving data whenever a source sends it. Both models have their pros and cons, but each provides an easier, more automated pathway to near-real-time replication from the data sources into the Elasticsearch cluster for later consumption.
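To make the pull model concrete, here is a minimal Python sketch of a polling loop, not taken from the linked examples: `fetch` and `process` are hypothetical hooks standing in for whatever source query and transform/load logic your aggregation tool needs.

```python
import time


def pull_loop(fetch, process, interval_seconds, max_polls=None):
    """Poll a source on a fixed interval (the pull model) and hand each
    non-empty batch of records to `process` for transformation/loading.

    fetch            -- callable returning a list of new records (or [])
    process          -- callable that transforms/loads one batch
    interval_seconds -- how long to wait between polls
    max_polls        -- optional cap, mainly useful for testing
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        batch = fetch()
        if batch:  # skip empty polls; nothing new at the source
            process(batch)
        polls += 1
        time.sleep(interval_seconds)
```

A push model inverts this: instead of a loop calling `fetch`, the tool exposes an endpoint (as in the OCR example later in this post) and the source calls it.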
In this first example we have the Hospital Readmission Reduction data in CSV format, ready to be loaded into the “hospital-readmission-reduction” index already created on our Elasticsearch cluster. We start by initializing a new ElasticClient to connect to the Elasticsearch cluster and use later for the indexing operations. Once the ElasticClient is connected, we read the local CSV file record-by-record as a stream, transform each record into a JSON-like structure, perform any other data translation operations, and finally load the record into Elasticsearch with the index command. Because this dataset is relatively small (~19k records), it processes fairly quickly, but the index operations are performed individually, and this method will greatly degrade performance on large indexing jobs. A better approach for larger jobs, shown in the next example, is to process the records in batches and index them into Elasticsearch using bulk operations.
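The linked repository implements this with the .NET ElasticClient; as a rough sketch of the same flow, here is a stdlib-only Python version that talks to the cluster's REST API directly. The cluster URL is assumed to be a local default, and `row_to_doc` is a placeholder for whatever per-record translation your data needs.

```python
import csv
import json
import urllib.request

ES_URL = "http://localhost:9200"          # assumed local cluster
INDEX = "hospital-readmission-reduction"  # index from this example


def row_to_doc(row: dict) -> dict:
    """Transform a raw CSV row into the JSON document to index.
    Placeholder logic: strip whitespace from every field name and value;
    a real loader would also rename fields, coerce types, etc."""
    return {k.strip(): v.strip() for k, v in row.items()}


def index_doc(doc: dict) -> None:
    """Index a single document -- one HTTP call per record, which is
    simple but slow for large jobs compared to the _bulk API."""
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)


def load_csv(path: str) -> int:
    """Stream the CSV record-by-record, transform, and index."""
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            index_doc(row_to_doc(row))
            count += 1
    return count
```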
Explore the full example here: https://github.com/GalenHealthcare/Elasticsearch-CSV-Load-Example
Similar in flow to the first example, here we create an ElasticClient to work with the Elasticsearch cluster, extract data from a source, perform transformations, and index the extracted data into Elasticsearch. The key differences are that the data source is a Microsoft SQL Server database instead of a CSV flat file, and that we process the data in batches, indexing into Elasticsearch with bulk operations to increase performance.
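The batching and bulk-indexing half of that flow can be sketched in Python against the REST `_bulk` endpoint (the database-cursor side is omitted; any row iterator works). The cluster URL is again an assumed local default, and the helpers below are illustrative, not the repository's code.

```python
import itertools
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local cluster


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable (for example
    a database cursor), so the full result set never sits in memory."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(index, docs):
    """Build the newline-delimited JSON payload the _bulk API expects:
    an action line followed by the document source, one pair per record,
    with a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def bulk_index(index, docs):
    """Send one batch of documents in a single _bulk request."""
    req = urllib.request.Request(
        f"{ES_URL}/_bulk",
        data=bulk_body(index, docs).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Replacing per-record calls with one request per batch is where the performance gain comes from: the per-request overhead is paid once per batch instead of once per record.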
Explore the full example here: https://github.com/GalenHealthcare/Elasticsearch-SQL-Loader-Example
For our final example, instead of processing a predefined dataset, we create an ASP.NET Core web service that receives scanned document images on demand, uses Optical Character Recognition (in this case Tesseract) to extract text from each image, and then indexes that text into the Elasticsearch cluster for later analytical processing and searching. One key feature of this architecture is its on-demand availability, which makes integration with existing systems easier. Whether through the exposed HTTP endpoint or, with slight modification, by listening to an event-driven queue, this example can augment an existing document or imaging management system and provide powerful search capabilities to end users.
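The repository's service is ASP.NET Core; as a rough Python analogue of the same receive-OCR-index pipeline, here is a stdlib sketch that shells out to the Tesseract CLI (assumed to be installed on PATH). The `X-Filename` header and the document's field names are illustrative choices, not part of the linked example, and the Elasticsearch call is left as a comment.

```python
import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer


def ocr_image(image_bytes: bytes) -> str:
    """Run Tesseract over the uploaded image via its CLI
    (`tesseract <file> stdout`) and return the extracted text."""
    with tempfile.NamedTemporaryFile(suffix=".png") as f:
        f.write(image_bytes)
        f.flush()
        return subprocess.run(
            ["tesseract", f.name, "stdout"],
            capture_output=True, text=True, check=True,
        ).stdout


def build_document(filename: str, text: str) -> dict:
    """Shape the OCR output into the JSON document to index in
    Elasticsearch (field names here are illustrative)."""
    return {"source_file": filename, "content": text.strip()}


class OcrHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        doc = build_document(self.headers.get("X-Filename", "unknown"),
                             ocr_image(body))
        # Here the service would index `doc` into the cluster,
        # e.g. with a POST to /<index>/_doc as in the CSV example.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(doc).encode("utf-8"))


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), OcrHandler).serve_forever()
```

Swapping the HTTP handler for a queue consumer changes only the receiving edge; `ocr_image` and `build_document` stay the same, which is what makes the event-driven variant a slight modification.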
Explore the full example here: https://github.com/GalenHealthcare/Elasticsearch-OCR-Service-Example
Now with the data extracted and necessary transformations complete, the next step in the process is to configure Elasticsearch and begin loading the data.
Look for our next blog article in the series, which will address exposing data.