With the passage of time, the volumes of data that we handle every day are growing exponentially, and there is an ongoing need for integration tools like Pentaho PDI to process larger and larger volumes of data. Currently we talk about Terabytes of information instead of the Gigabytes of some years ago or the Kilobytes of decades ago.

The challenge of processing these large volumes of data requires attention to every detail and the application of best practices wherever possible. There are several tools on the market that can help you, whether you need to run syntactic or semantic checks on the data sent to you by your online payment provider, load data into your corporate database for use by your ERP or Data Warehouse, or anything in between.

Detailed below are good practices that you should apply when processing large volumes of data using the Pentaho Data Integration (PDI) tool.

One of the primary things you need to consider is making sure you have enough resources on your processing server. You want to optimize the use of those resources to perform all your data processing tasks in the best possible way:

- Type of storage (hard drives, databases, etc.)
- Data transmission capacity of your network

1. Adjust the memory usage parameters of your tool

Allocate enough memory to carry out the work, taking into account the memory used by other applications and by the operating system of the server itself. For example, do not allocate a Gigabyte of memory to process a text file of a few hundred lines. Another use you can give your memory is to increase the number of records kept in the buffer between the different steps of your transformation and in this way improve processing times. You can also keep in memory the records that you use for lookups, so that you are not constantly re-reading this data.

2. Take full advantage of the data reading capabilities of your source system

For example, if you are reading text files, you could store those files on different hard drives and launch reads in parallel to make better use of the available hardware. This is also a valid approach if your data resides on a disk array (RAID) or on specialized external devices such as a SAN or NAS.

Pentaho PDI provides mechanisms to parallelize access to data. For example, in the case of text files, you can define what is read in parallel and how many reading processes are generated. If you have a 4 GB file, you can launch 4 processes in parallel that share the reading work: the first process reads from line 1 to the line that corresponds to 1 GB, the second from 1 GB to 2 GB, and so on for the rest of the file. The same applies if your data source is a database, in which case reads can be triggered in parallel, taking care not to saturate the database server.

3. Scale up – run several copies of the steps that consume the most resources

If you find that some steps are creating a bottleneck in your process, you can run multiple copies of those steps to lower the total time at the expense of increased resource consumption. But watch out: do not create more copies than the number of CPUs you have available!

4. Scale out – run the job on a cluster of integration servers

If your server resources are not enough to execute your process within the times acceptable for your business, you can improve processing capacity by increasing the number of servers. To achieve this, you need to configure a cluster of Pentaho PDI servers, in which one server acts as the master and the others act as slaves.
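The inter-step buffer mentioned in the memory-tuning practice can be pictured as a bounded producer/consumer queue: a fast upstream step fills it, a slower downstream step drains it, and its size trades memory for decoupling. The following is an illustrative Python sketch of that mechanism, not PDI code; the function and variable names are invented for the example.

```python
# Illustrative sketch (not PDI internals): steps connected by a bounded row
# buffer. A bigger buffer lets a fast reader run ahead of a slower writer,
# at the cost of keeping more rows in memory.
import queue
import threading

SENTINEL = object()  # marks end of the row stream

def run_pipeline(rows, buffer_size):
    buf = queue.Queue(maxsize=buffer_size)  # the inter-step row buffer
    out = []

    def reader():  # fast upstream step
        for row in rows:
            buf.put(row)        # blocks when the buffer is full
        buf.put(SENTINEL)

    def writer():  # slower downstream step
        while (row := buf.get()) is not SENTINEL:
            out.append(row)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

With a tiny `buffer_size` the reader is throttled to the writer's pace; enlarging it smooths out bursts, which is the effect the tuning advice above is after.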
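The parallel file-reading idea (splitting a file into byte ranges and letting each worker read its own slice) can be sketched in a few lines. This is an illustration of the technique, not PDI's implementation; `read_chunk` and `parallel_read` are hypothetical names for the example.

```python
# Illustrative sketch: divide a text file into N byte ranges and read them
# concurrently, the way a parallel text-file input splits reading work.
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()  # advance to the start of the next complete line
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n").decode())
    return lines

def parallel_read(path, workers=4):
    size = os.path.getsize(path)
    bounds = [(i * size // workers, (i + 1) * size // workers)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda b: read_chunk(path, *b), bounds)
    return [line for chunk in chunks for line in chunk]
```

Each line belongs to exactly one chunk (the one containing its first byte), so concatenating the chunk results reproduces the file; the same range-partitioning idea applies to parallel reads against a database table keyed by id.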