Some months ago, I got involved in a project where I needed to generate quite big reports (more than 1M rows) extracted mainly from an SQL table growing at a very fast pace. This table played a central role in the everyday usage of the system.
In order to avoid losing performance, we scheduled a database backup and a process to empty that large problematic table every 3 months. But the problem with the reports remained the same. At some point we ended with a table with more than 3M rows and we needed to generate reports with more than 1M rows. Even if the report was small, querying this table was taking too much time (some times more than 10 minutes).
Here is what we did. First we decided to design like an ETL process using only SQL (somebody will argue that this is not ETL, but it is). How? We developed a module to execute SQL scripts defined at run-time every day at 3:00 am.
These scripts were basically "insert into table" statements. They took data from some tables and inserted the result into another table. Data in the destiny table was more suitable for reporting, so we gained some time when generating the reports by moving processing (expensive joins mostly) from working hours to 3:00 am, when nobody (even no CEO) was using the system. Reports were way faster (like 20 seconds for the bigger one).
A couple of months after going to production with this scheme, we faced another problem. The destiny table, supposed to be more suitable for reporting, started to be a problem. As you may guess, it was too large. We were facing our initial situation again. The solution: Split that large data set into smaller ones. How? Every row in that table had a time stamp. So we divided the data into semesters by making one table per semester. The script executor module was modified to put the data into the correct table according to current date. The reporting application was also updated to allow users select the semester (so the app was able to query the correct table).
Reports are fast at the time of writing this. In any case, this solution gave us some time to think about implementing a Big Data solution, NoSQL maybe... Wait! This is already kind of a Big Data solution :)