Users can now analyze big data with speed-of-thought performance and high availability
TDWI World - Orlando, Fla. — October 31, 2011 —
Pentaho Corporation, the business analytics company providing power for technologists and rapid insight for users, today announced the latest release of Pentaho Business Analytics, including major improvements to data analysis performance, scalability and reliability, with support for cloud-scale distributed in-memory caching systems, new performance-tuning aids, and support for analysis of more big data sources. The new product name reflects Pentaho’s comprehensive and integrated business intelligence, data integration, data mining and predictive analytics capabilities.
This release provides the benefits of an in-memory solution without the limitations of an in-memory-only architecture. Business users now experience in-memory performance while IT gets a sound, scalable and manageable analytics platform built on proven data warehousing and BI best practices. New features include:
In-memory analytics – Pentaho’s data analysis capability now supports Infinispan/JBoss Enterprise Data Grid and Memcached, with the option of extending to other in-memory cache systems (a configuration sketch follows this feature list). Infinispan and Memcached can cache terabytes of data distributed across multiple nodes, as well as maintain in-memory copies of data across nodes for high availability and failover. Using these in-memory caching systems results in more insight through predictable speed-of-thought access to vast in-memory datasets.
In-memory aggregation – granular data can now be rolled up to higher-level summaries entirely in memory, reducing the need to send new queries to the database and resulting in even faster performance for a wide range of analytic queries.
New analytic data sources – adding to Pentaho’s unmatched list of native big data platform support, native SQL generation is now supported for the EMC Greenplum Database and the Apache Hive data warehouse system for Hadoop. This optimizes interactive data exploration performance when using Greenplum, and for Hive it makes it possible to design analytic reports offline and then schedule them for background execution.
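As a rough illustration of how such an external cache is usually plugged in on the analysis side, here is a mondrian.properties sketch, assuming the Mondrian segment-cache SPI that underlies Pentaho Analysis; the property name and especially the plug-in class name are illustrative assumptions, so consult the Pentaho documentation for the exact wiring:

    # mondrian.properties (illustrative sketch only)
    # Register an external segment cache so that aggregate data is kept in a
    # distributed in-memory grid (e.g. Infinispan or Memcached) rather than
    # only in the local JVM heap. The class name below is a hypothetical
    # placeholder for the actual cache plug-in.
    mondrian.rolap.SegmentCache=com.example.cache.InfinispanSegmentCache

With a shared cache configured this way, aggregates computed by one node can be reused by the other nodes, which is the behaviour behind the high-availability and failover claims above.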
The PGP file encryption, decryption and validation job entries facilitate encrypting, decrypting and validating files using PGP.
The Single Threader step for parallel performance tuning of large transformations
Allow a job to be started at a job entry of your choice (continue after fixing an error)
The MongoDB Input step (including authentication)
The ElasticSearch bulk loader
The XML Input Stream (StAX) step reads huge XML files with optimal performance and flat memory usage by flattening the structure of the data.
The Get ID from Slave Server step allows multi-host or clustered transformations to get globally unique integer IDs from a slave server: http://wiki.pentaho.com/display/EAI/...m+Slave+Server
A service to reserve the next value range from a slave sequence
Allow parallel (simultaneous) runs of clustered transformations
A service to list (reserved and free) socket reservations
New options in XML for configuring slave sequences
Allow time-out of stale objects using the environment variable KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES
Memory tuning of the logging back-end with KETTLE_MAX_LOGGING_REGISTRY_SIZE, KETTLE_MAX_JOB_ENTRIES_LOGGED and KETTLE_MAX_JOB_TRACKER_SIZE, allowing for flat memory usage for never-ending ETL in general and jobs specifically (see the kettle.properties sketch after this feature list).
Export at the repository folder level
Export and Import with optional rule-based validations
The import command-line utility allows for rule-based (optional) import of lists of transformations, jobs and repository export files (an illustrative invocation follows this feature list): http://wiki.pentaho.com/display/EAI/...+Documentation
ETL Metadata Injection:
Retrieval of rows of data from a step to the “metadata injection” step
Support for injection into the “Excel Input” step
Support for injection into the “Row normaliser” step
Support for injection into the “Row Denormaliser” step
The Multiway Merge Join step (experimental) allows any number of data sources to be joined on one or more keys using an inner or full outer join algorithm (a rough sketch of the idea appears after this feature list).
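As an illustration of how the Carte time-out and logging memory-tuning variables mentioned above are typically set, here is a kettle.properties sketch; the numeric values are arbitrary examples rather than recommendations, and the same variables can generally also be passed as Java system properties:

    # ~/.kettle/kettle.properties (illustrative values only)

    # Remove stale objects on the Carte slave server after 24 hours
    KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES=1440

    # Cap the size of the central logging registry
    KETTLE_MAX_LOGGING_REGISTRY_SIZE=10000

    # Keep at most this many job entry results in memory per job
    KETTLE_MAX_JOB_ENTRIES_LOGGED=5000

    # Limit the size of the job tracker to keep memory flat for never-ending jobs
    KETTLE_MAX_JOB_TRACKER_SIZE=5000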
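For the import utility, an invocation could look roughly like the sketch below; the option names are assumptions modelled on the other Kettle command-line tools, and the repository name, paths and file names are made up, so the linked wiki page remains the authoritative reference:

    ./import.sh -rep=ProductionRepo -user=admin -pass=secret \
                -dir=/public/etl \
                -file=exported_jobs_and_transformations.xml \
                -rules=import-rules.xml \
                -comment="Nightly deployment" \
                -replace=true -coe=false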
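Since the Multiway Merge Join step is experimental, the following is only a rough, Kettle-independent sketch in Java of what a multiway inner merge join over key-sorted inputs does; rows are reduced to their integer keys and keys are assumed unique per input purely for brevity:

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative k-way inner merge join over inputs sorted ascending on an integer key. */
    public class MultiwayMergeJoinSketch {

        /** Emits one joined row (the key from every input) for each key present in all inputs. */
        static List<int[]> innerJoin(List<List<Integer>> inputs) {
            int n = inputs.size();
            int[] pos = new int[n];                 // current read position per input
            List<int[]> joined = new ArrayList<int[]>();

            while (true) {
                // Determine the largest key currently under a cursor; every input must catch up to it.
                int maxKey = Integer.MIN_VALUE;
                for (int i = 0; i < n; i++) {
                    if (pos[i] >= inputs.get(i).size()) return joined; // an input is exhausted: no more inner matches
                    maxKey = Math.max(maxKey, inputs.get(i).get(pos[i]));
                }

                // Advance every input that is still behind that key and check for a full match.
                boolean allMatch = true;
                for (int i = 0; i < n; i++) {
                    List<Integer> in = inputs.get(i);
                    while (pos[i] < in.size() && in.get(pos[i]) < maxKey) {
                        pos[i]++;
                    }
                    if (pos[i] >= in.size() || in.get(pos[i]) != maxKey) {
                        allMatch = false;
                    }
                }

                // Every input carries the key: emit the joined row and move all cursors forward.
                if (allMatch) {
                    int[] row = new int[n];
                    for (int i = 0; i < n; i++) {
                        row[i] = inputs.get(i).get(pos[i]++);
                    }
                    joined.add(row);
                }
            }
        }
    }

A full outer variant would additionally emit rows for keys that are missing from some inputs, padding those positions with nulls instead of skipping the key.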
I like the Talend Job Execution job entry very much... it's great fun to use Talend inside of Kettle ;-)
The goal? To migrate different datasets spread among different applications and databases within the public sector's infrastructure.
That amounts to millions of rows of citizens' and public organizations' data, distributed among different applications, structures and databases, that had to be migrated into a homogeneous, valid, normalized and commonly accepted data structure. In other words, an ETL (Extract, Transform, Load) process was required.
A very interesting post from Sylvain Decloix on osbi.fr about Excel exports with Kettle.
Pentaho Data Integration (Kettle) is often very useful for generating and delivering Excel files to end users in a simple way. A file can be attached "to the result" of a transformation and then sent by e-mail to the desired recipients via the "Envoi Courriel" (Mail) job entry.
However, how can you let an end user download the Excel file on demand through a web browser, optionally filtering the data to be exported via parameters presented in an HTML form (drop-down lists, checkboxes and so on)? The answer: with a Pentaho server!
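For readers who want to experiment with the underlying idea outside of a Pentaho server, here is a minimal sketch, in Java, of running an export transformation with a user-supplied value through the Kettle API; the transformation file export_excel.ktr and the parameter REGION are hypothetical examples, and error handling is kept to a bare minimum:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class ExcelExportRunner {
        public static void main(String[] args) throws Exception {
            // Initialise the Kettle environment (plugin registry, logging, ...)
            KettleEnvironment.init();

            // Load the transformation that produces the Excel file
            // (export_excel.ktr is a hypothetical example transformation)
            TransMeta transMeta = new TransMeta("export_excel.ktr");
            Trans trans = new Trans(transMeta);

            // Pass the value chosen by the user in the HTML form as a named parameter
            trans.setParameterValue("REGION", args.length > 0 ? args[0] : "ALL");

            trans.execute(null);        // start the transformation
            trans.waitUntilFinished();  // block until it is done

            if (trans.getErrors() > 0) {
                throw new RuntimeException("The export transformation reported errors.");
            }
        }
    }

In the scenario described in the post, the same kind of call would be triggered by the Pentaho server in response to the HTML form submission, with the form fields mapped to named parameters of the transformation.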