Tuesday, November 8, 2011

Saiku 2.1

Saiku is a modular open-source analysis suite offering lightweight OLAP.

Live demo here

Friday, November 4, 2011

A new data modeling plugin for PDI 4.3

Matt Casters announced a new data modeling plugin on his blog.

More information here

Tuesday, November 1, 2011

Pentaho Announces Extreme Scale In-Memory Analytics

Pentaho press release:


 Users can now analyze big data with speed-of-thought performance and high availability

TDWI World - Orlando, Fla. - October 31, 2011

Pentaho Corporation, the business analytics company providing power for technologists and rapid insight for users, today announced the latest release of Pentaho Business Analytics including major improvements to data analysis performance, scalability and reliability with support for cloud-scale distributed in-memory caching systems, new performance tuning aids, and support for analysis of more big data sources. These new capabilities are available today in the latest release of Pentaho Business Analytics, with the new product name representing Pentaho’s comprehensive and integrated business intelligence, data integration, data mining and predictive analytics capabilities.
This release provides the benefits of an in-memory solution without the limitations of an in-memory-only architecture. Now, business users experience in-memory performance while IT gets a sound, scalable and manageable analytics platform built on proven data warehousing and BI best practices. New features include:
  • In-memory analytics – Pentaho’s data analysis capability now supports Infinispan/JBoss Enterprise Data Grid and Memcached, with the option of extending to other in-memory cache systems. Infinispan and Memcached can cache terabytes of data distributed across multiple nodes, as well as maintain in-memory copies of data across nodes for high availability and failover. Using these in-memory caching systems results in more insight through predictable speed-of-thought access to vast in-memory datasets.
  • In-memory aggregation – granular data can now be rolled up to higher-level summaries entirely in memory, reducing the need to send new queries to the database, resulting in even faster performance for a wide range of analytic queries.
  • New analytic data sources – adding to Pentaho’s unmatched list of native big data platform support, native SQL generation is now supported for the EMC Greenplum Database and the Apache Hive data warehouse system for Hadoop. This optimizes interactive data exploration performance when using Greenplum, and for Hive this makes it possible to design analytic reports offline then schedule them for background execution.
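Pentaho's actual implementation sits on Infinispan or Memcached, but the in-memory aggregation idea itself is simple: roll granular rows up to a summary once, keep the summary in a cache, and answer repeat queries from the cache instead of re-querying the database. A minimal Python sketch of that idea, with purely illustrative data (not Pentaho's API):

```python
from collections import defaultdict

# Granular fact rows: (region, product, sales). In a real deployment
# these would come from the relational database; here they are
# hard-coded for illustration.
rows = [
    ("EMEA", "A", 100), ("EMEA", "B", 50),
    ("AMER", "A", 200), ("AMER", "B", 75),
]

# In-memory cache of aggregates, keyed by the grouping level requested.
cache = {}

def rollup(level):
    """Aggregate sales at the requested level ('region' or 'product').

    Repeat queries are answered from the in-memory cache instead of
    re-scanning the granular rows.
    """
    if level in cache:
        return cache[level]
    idx = 0 if level == "region" else 1
    totals = defaultdict(int)
    for row in rows:
        totals[row[idx]] += row[2]
    cache[level] = dict(totals)
    return cache[level]

print(rollup("region"))  # computed from granular rows
print(rollup("region"))  # served from the in-memory cache
```

A distributed cache like Memcached adds eviction, replication and failover on top of this pattern, but the performance win is the same: the expensive aggregation happens once per summary level, not once per query.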

More information here

Tuesday, September 13, 2011

KETTLE 4.2.0 stable is out

Download on sourceforge...

More information from the Pentaho Kettle forum:

Here are some of the new things in this version:

  • The Excel Writer step offers advanced Excel output functionality to control the look and feel of your spreadsheets.
  • Graphical performance and progress feedback for transformations
  • The Google Analytics step allows download of statistics from your Google analytics account
  • The Pentaho Reporting Output step makes it possible for you to run your (parameterized) Pentaho reports in a transformation. It allows for easy report bursting of personalized reports.
  • The Automatic Documentation step generates (simple) documentation of your transformations and jobs using the Pentaho Reporting API.
  • The Get repository names step retrieves job and transformation information from your repositories.
  • The LDAP Writer step
  • The Ingres VectorWise (streaming) bulk loader step
  • The Greenplum (streaming) bulk loader step (for gpload)
  • The Talend Job Execution job entry
  • Health Level 7 (HL7): HL7 Input step, HL7 MLLP Input and HL7 MLLP Acknowledge job entries
  • The PGP File Encryption, Decryption & validation job entries facilitate encryption and decryption of files using PGP.
  • The Single Threader step for parallel performance tuning of large transformations
  • Allow a job to be started at a job entry of your choice (continue after fixing an error)
  • The MongoDB Input step (including authentication)
  • The ElasticSearch bulk loader
  • The XML Input Stream (StAX) step to read huge XML files at optimal performance and flat memory usage by flattening the structure of the data.
  • The Get ID from Slave Server step allows multi-host or clustered transformations to get globally unique integer IDs from a slave server.
  • Carte improvements:
    • reserve next value range from a slave sequence service
    • allow parallel (simultaneous) runs of clustered transformations
    • list (reserved and free) socket reservations service
    • new options in XML for configuring slave sequences
    • allow time-out of stale objects using environment variable KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES
    • Memory tuning of logging back-end with: KETTLE_MAX_LOGGING_REGISTRY_SIZE, KETTLE_MAX_JOB_ENTRIES_LOGGED, KETTLE_MAX_JOB_TRACKER_SIZE allowing for flat memory usage for never ending ETL in general and jobs specifically.
  • Repository Import/Export
    • Export at the repository folder level
    • Export and Import with optional rule-based validations
    • Import command line utility allows for rule-based (optional) import of lists of transformations, jobs and repository export files.
  • ETL Metadata Injection:
    • Retrieval of rows of data from a step to the “metadata injection” step
    • Support for injection into the “Excel Input” step
    • Support for injection into the “Row normaliser” step
    • Support for injection into the “Row Denormaliser” step
  • The Multiway Merge Join step (experimental) allows for any number of data sources to be joined using one or more keys using an inner or a full outer join algorithm.
I like the Talend Job Execution job entry very much... it's so fun to use Talend inside of Kettle ;-)
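Among the new steps, the XML Input Stream (StAX) step is worth a closer look: instead of loading the whole document tree, it processes XML as a stream of events, which is what keeps memory usage flat on huge files. This is not Kettle's implementation, but the same streaming idea can be sketched with Python's standard library `iterparse`:

```python
import io
import xml.etree.ElementTree as ET

# A small in-memory document standing in for a "huge" XML file;
# the same loop works on an open file handle of any size.
xml_data = io.BytesIO(
    b"<rows><row><id>1</id></row><row><id>2</id></row></rows>"
)

ids = []
# iterparse yields elements as their end tags are seen, so each <row>
# can be processed and then discarded, keeping memory usage flat.
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "row":
        ids.append(elem.findtext("id"))
        elem.clear()  # drop the processed subtree

print(ids)  # ['1', '2']
```

The `elem.clear()` call is the crucial part: without it the parser would silently accumulate the whole tree anyway, and the memory profile would grow with file size instead of staying constant.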

Monday, July 18, 2011

Design for Developers

Wednesday, July 6, 2011

Greece's Public Sector merges its data, how it was done

Fuzz Box: Greece's Public Sector merges its data, how it was done
Kallikratis is the codename of the largest project conceived in recent years (2010-2011) of Greece's public-sector digitalisation / computerisation effort, and it was funded by the Ministry of Interior and Decentralization.

The goal? To migrate different datasets spread among different applications and databases within the public sector's infrastructure.
That amounts to millions of rows of citizens' and public organizations' data distributed among different applications, structures, databases, etc., all of which had to be migrated into a homogeneous, valid, normalized and commonly accepted data structure. In other words, an ETL (Extract, Transform, Load) process was required.

Saturday, July 2, 2011

BI at Canal+

Source: MyDSI-Tv

Thursday, June 30, 2011

Pentaho 4 is out

The Pentaho 4 demo is here:

Wednesday, June 22, 2011

Interesting video about data visualization

A talk from David McCandless about data visualization and graphic representations.

Tuesday, June 21, 2011

Screencasts about Google Refine

Wednesday, January 5, 2011

Your Excel export requests on the Web with Kettle and Pentaho BI Server (TechTip)

A very interesting post from Sylvain Decloix about Excel exports with Kettle.

Pentaho Data Integration (Kettle) is often very useful for generating and delivering Excel files to end users in a simple way. You can attach a file to the "result" of a transformation and then send it by mail to the desired recipients via the "Mail" step (click on the images to enlarge them):
However, how can you let an end user download the Excel file on demand through a web browser, optionally filtering the data to export through parameters presented in an HTML form (drop-down lists, checkboxes...)?
The answer: with a Pentaho server!
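Sylvain's recipe relies on a Kettle transformation served through the Pentaho BI Server; the underlying pattern is just "take a user-supplied parameter, filter the rows, serialize the result for download". A minimal Python sketch of that pattern, with CSV standing in for the Excel file the real transformation would produce (all data and names here are illustrative):

```python
import csv
import io

# Sample data standing in for the rows a Kettle transformation would
# extract; 'region' plays the role of the HTML-form filter parameter.
rows = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 80},
    {"region": "north", "sales": 45},
]

def export(region):
    """Filter rows by the user-supplied parameter and serialize them.

    CSV stands in here for the Excel output the real Kettle
    transformation would generate with its Excel output step.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "sales"])
    writer.writeheader()
    for row in rows:
        if row["region"] == region:
            writer.writerow(row)
    return buf.getvalue()

print(export("north"))
```

In the Pentaho setup the same flow happens server-side: the form parameters reach the transformation as named parameters, and the generated file is streamed back to the browser as the HTTP response.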

The continuation on