The Architecture of PIER: an Internet-Scale Query Processor
by Ryan Huebsch et al.
In this paper, the authors discuss, in detail, the issues in building an
Internet-scale query processor. The work presents the issues nicely and
gradually focuses on implementing a working experimental infrastructure
for the purpose. Building any such system has to address essentially
three issues first: scalability, networking issues, and database issues.
The major innovative contributions, and the scope for improvement, can
be summarized as follows.
- Distributed Hash Tables (DHTs) are known for their scalability, their
robustness to node failures and churn, and their truly distributed
nature. PIER uses the DHT aggressively throughout the lifetime of a
query: for routing, for disseminating query plans, for aggregating
results, and so on. Thus PIER can leverage the vast amount of ongoing
research on DHTs. This is the main aspect in which PIER contrasts with
other Internet-scale query processors. The PIER architecture is general
enough to work over any kind of underlying DHT.
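To make the DHT's role concrete, here is a minimal sketch of the put/get interface such a system builds on, using a toy consistent-hashing ring. The class, node names, and keys are made up for illustration and are not taken from PIER.

```python
import hashlib
from bisect import bisect_left

class ToyDHT:
    """A toy consistent-hashing ring: each key is stored on the first
    node whose position on the ring is clockwise of the key's hash."""

    def __init__(self, node_names):
        # Place each node at a position on a 2^32 ring via SHA-1.
        self.ring = sorted(
            (int(hashlib.sha1(n.encode()).hexdigest(), 16) % 2**32, n)
            for n in node_names
        )
        self.store = {n: {} for n in node_names}

    def _owner(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % 2**32
        # First node at or after the key's position (wrapping around).
        idx = bisect_left(self.ring, (h,))
        return self.ring[idx % len(self.ring)][1]

    def put(self, key, value):
        self.store[self._owner(key)][key] = value

    def get(self, key):
        return self.store[self._owner(key)].get(key)

dht = ToyDHT(["node-A", "node-B", "node-C"])
dht.put("firewall:10.0.0.1", {"drops": 17})
assert dht.get("firewall:10.0.0.1") == {"drops": 17}
```

Any node can compute the owner of any key locally, which is why the same primitive serves routing, plan dissemination, and result collection.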
- For range indexes, it uses a Prefix Hash Tree (PHT), which can be
realized on top of the DHT; this eliminates the need for a separate
distributed data structure for the purpose, a gap in the current
literature.
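A rough sketch of the PHT idea follows, assuming a fixed leaf depth for brevity (a real PHT splits leaves adaptively as buckets fill). The function names, constants, and the dict standing in for DHT storage are my own illustration.

```python
# A Prefix Hash Tree keeps a binary trie over the key domain, and each
# trie leaf lives at the DHT key equal to its bit prefix. Here a dict
# stands in for the DHT, and every leaf is a fixed-length prefix.

BITS = 8          # key domain 0..255 (assumption for the sketch)
LEAF_DEPTH = 4    # every leaf covers a 4-bit prefix (fixed, for brevity)

dht = {}          # prefix string -> list of keys (stand-in for DHT storage)

def leaf_prefix(value):
    return format(value, f"0{BITS}b")[:LEAF_DEPTH]

def pht_insert(value):
    dht.setdefault(leaf_prefix(value), []).append(value)

def pht_range(lo, hi):
    """Visit only the leaves whose prefixes overlap [lo, hi]; each leaf
    is one DHT get, so range queries need no extra data structure."""
    out = []
    for p in range(lo >> (BITS - LEAF_DEPTH), (hi >> (BITS - LEAF_DEPTH)) + 1):
        for v in dht.get(format(p, f"0{LEAF_DEPTH}b"), []):
            if lo <= v <= hi:
                out.append(v)
    return sorted(out)

for v in [3, 17, 64, 65, 200]:
    pht_insert(v)
assert pht_range(10, 70) == [17, 64, 65]
```

The point is that range lookups reduce to a handful of ordinary DHT gets over contiguous prefixes, so the same overlay serves both equality and range predicates.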
- This opens up a number of cross-domain issues w.r.t. the networking
and database aspects of the problem setting.
- The networking issues are mostly solved by using the DHT
infrastructure for routing and location discovery. The bandwidth issues
are addressed using clever yet simple hierarchical aggregation and
computation techniques. However, the security issues remain, and the
architectural choices open up a number of additional security issues.
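The hierarchical aggregation idea can be sketched as follows; the overlay tree and the counts are hypothetical. The point is only that each parent combines its children's partial results before forwarding, so no single link carries all of the raw data.

```python
# Each node holds a local value; a parent combines its own value with
# its children's partial sums before passing one number upward, so the
# root receives O(fanout) messages rather than O(n).

tree = {                       # parent -> children (hypothetical overlay tree)
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}
local_count = {"root": 1, "a": 2, "b": 3, "a1": 4, "a2": 5, "b1": 6}

def aggregate(node):
    # In-network aggregation: combine before forwarding.
    return local_count[node] + sum(aggregate(c) for c in tree.get(node, []))

assert aggregate("root") == 21
```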
- The database issues are solved mostly by separating persistent
storage from the query system as such, so there is no guarantee that a
particular resource needed for a query will be available. Information
items are maintained as soft state, so the publisher has to republish
them periodically before the soft-state lifetime expires.
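A minimal sketch of the soft-state idea, assuming a simple per-item TTL; the key names and timeout value are invented for the example, and time is passed in explicitly to keep the sketch deterministic.

```python
# Items carry a TTL; the store silently drops anything not refreshed in
# time, so availability depends on publishers renewing their items.

TTL = 0.05  # seconds (tiny, just for the demo)

store = {}  # key -> (value, expiry_time)

def publish(key, value, now):
    store[key] = (value, now + TTL)

def lookup(key, now):
    entry = store.get(key)
    if entry is None or entry[1] < now:
        store.pop(key, None)      # expired: reclaimed without notice
        return None
    return entry[0]

publish("log:web01", "alive", now=0.0)
assert lookup("log:web01", now=0.03) == "alive"   # still fresh
assert lookup("log:web01", now=0.10) is None      # expired, not renewed
publish("log:web01", "alive", now=0.10)           # publisher refreshes
assert lookup("log:web01", now=0.12) == "alive"
```

Failed publishers thus disappear automatically, at the cost of continuous republishing traffic from live ones.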
- The software engineering issues in building a simple, easily
debuggable working prototype are really interesting. The authors made a
wonderful contribution in this regard.
- The UFL interface for generating the input query plans.
- The sample applications quoted, as I feel, operate on extremely
fine-grained information items (like a single tuple in firewall logs).
Instead of building a completely distributed system, users may prefer a
single server that collects all the information items (which may be
really lightweight, given the size of the items queried for). Some
thought in this direction is needed in the paper. P2P-based
infrastructures are attractive for densely indexed information
resources, where locating an item through the infrastructure is far
more efficient than a broadcast mechanism. If the information items are
not aggressively indexed, the infrastructure is of little help.
- What should the users search for? The lack of a proper schema for
all the shared information items is a big challenge. Without such a
schema and the corresponding metadata, users have no idea, first, what
kind of information is available in the system, and second, how to
query for it.
- The authors envision a SQL-like interface for querying the system.
But without metadata and the like in place, it is of meagre value.
- The system takes a query plan that also carries information about
the dissemination of the operator nodes. Scalably disseminating this
graph is addressed, but the scalability problem has another dimension:
scalably discovering which node has which information item. I feel this
is not well addressed.