The Architecture of PIER: an Internet-Scale Query Processor
by Ryan Huebsch et al.

In this paper, the authors discuss in detail the issues in building an Internet-scale query processor. The work presents the issues clearly and gradually focuses on implementing a working experimental infrastructure for the purpose. Any such system has to address essentially three sets of issues: scalability, networking, and databases. The major innovative contributions can be summarized as follows.

Innovative contributions
  • Distributed Hash Tables (DHTs) are known for their scalability, their robustness to node failures and churn, and their truly distributed nature. PIER uses the DHT as much as possible throughout the lifetime of a query: for routing, disseminating query plans, aggregating results, and so on. PIER can thus leverage the vast amount of ongoing research on DHTs. This is the main aspect in which PIER contrasts with other Internet-scale query processors. The PIER architecture is general enough to work with any kind of underlying DHT.
  • For range indexes, it uses a Prefix Hash Tree (PHT), which can be realized on top of the DHT and so eliminates the need for a separate distributed data structure for this purpose. At present, the literature is lacking in this area.
  • The design opens up a number of cross-domain issues with respect to the networking and database aspects of the problem setting.
    • The networking issues are mostly solved by using the DHT infrastructure for routing and location discovery. The bandwidth issues are addressed using clever yet simple hierarchical aggregation and computation techniques. However, security issues remain, and the architectural choices open up a number of further security issues.
    • The database issues are solved mostly by separating persistent storage from the query system itself. As a result, there is no guarantee that a particular resource needed for a query will be available. Information items are maintained as soft state, so the publisher has to republish them periodically, according to the soft-state settings.
  • The software engineering issues in building a simple, easily debuggable working prototype are really interesting. The authors make a wonderful contribution in this regard.
  • The UFL interface for generating the input query plans.
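
To make the soft-state point above concrete, here is a minimal sketch of how I picture it. It is not PIER's actual interface: the class `SoftStateStore`, its method names, and the explicit `now` argument are all hypothetical, and a single in-memory dictionary stands in for the storage of one DHT node. The only idea it illustrates is that an item disappears unless its publisher renews it before its lifetime runs out.

```python
class SoftStateStore:
    """Toy stand-in for one DHT node's storage (hypothetical, not PIER's API)."""

    def __init__(self):
        # key -> (value, expiry time); times are plain numbers supplied by the caller
        self._items = {}

    def put(self, key, value, lifetime, now):
        """Publish an item that stays visible until now + lifetime."""
        self._items[key] = (value, now + lifetime)

    def renew(self, key, lifetime, now):
        """Extend the lifetime of a live item; return False if it has already expired."""
        entry = self._items.get(key)
        if entry is None or entry[1] <= now:
            self._items.pop(key, None)  # drop the stale entry, if any
            return False
        self._items[key] = (entry[0], now + lifetime)
        return True

    def get(self, key, now):
        """Return the value if the item is still live, otherwise None."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if expiry <= now:
            del self._items[key]  # expired items are discarded lazily on access
            return None
        return value


store = SoftStateStore()
store.put("firewall-log:tuple-1", {"src": "10.0.0.1"}, lifetime=30, now=0)
store.renew("firewall-log:tuple-1", lifetime=30, now=20)  # publisher refreshes in time
print(store.get("firewall-log:tuple-1", now=45))  # still available
print(store.get("firewall-log:tuple-1", now=60))  # None: not renewed, so it is gone
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would presumably read a clock and run the renewals on a timer at the publisher.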
Scope for improvement
  • The sample applications quoted, I feel, work on extremely fine-grained information items (such as a single tuple in a firewall log). Instead of building a completely distributed system, users may prefer to have a single server collect all the information items (which may be really lightweight, given the size of the items queried for). The paper needs some discussion in this direction. P2P-based infrastructures are attractive for densely indexed information resources, where it is far more efficient to locate an item through the infrastructure than through a broadcast mechanism. If the information items are not aggressively indexed, the infrastructure is of little help.
  • What should users search for? The lack of a proper schema for the shared information items is a big challenge. If such a schema and the corresponding metadata are not available, users have no idea, first, what kind of information is available in the system and, second, how to query for it.
  • The authors envision an SQL-like interface for querying the system. But without the metadata in place, it is of meagre value.
  • The system takes a query plan that also includes the dissemination information for the operator nodes. Scalable dissemination of the operator graph is addressed. But the scalability problem has another dimension: scalably finding out who has which information item! This is not well addressed, I feel.