Distributed knowledge engineering

Anton Kolonin, 2013, May 10.

This is a continuation of my talk on global computational intelligence given at the Siberian forum «Industry of information systems», where I described the centralized globalization of structured knowledge and how it can be connected to emerging computational intelligence. On the next day, I presented an “alternate” or “complementary” model, the decentralized globalization of structured knowledge, and described one practical approach to it. Here I discuss the following topics.

Paradigm of distributed knowledge engineering
Knowledge representation model
Webstructor system
Object-Relational language

Paradigm of distributed knowledge engineering

To get started, let us compare the two models. In the centralized knowledge globalization model, all information is clustered within closed semantic databases owned by a few of the largest knowledge aggregators, and is accessed by means of so-called intelligent agents. The vast majority of the intellectual space is then locked in a few data centers (even if a few percent of it can be offloaded to clients as “public domain knowledge”). This model implicitly assumes that the central storage holds a kind of absolute truth about any event or entity in the world.

In the decentralized knowledge globalization model, by contrast, knowledge is distributed semi-evenly across the entire global computational network, with the possibility of dynamically redistributing the knowledge itself, the truth values of particular pieces of knowledge and the functional capabilities for processing it. Interestingly, truth in such a model turns out to be dynamic and rather subjective: it is specific to the agents carrying a particular segment of the knowledge network.

For such a distributed environment to emerge and make the social evolution of computational intelligence possible, there must be a society of computational agents with a functional structure and rules for its dynamic self-organization, meeting the following requirements.

rich historical memory shared by communicating computer agents (e.g. accessible public banks of information);
rich sensory environment driving the communication, and accessible means of gathering novel information (e.g. search, browsing and messaging against peer computer agents);
an agent's ability to explicitly expose its own knowledge, indicating its confidence, proprietary rights and privacy levels;
unrestricted fertility of diverse behavioral patterns (i.e. computational algorithms) exposed by agents (and capable of evolving upon feedback from peer agents);
legal definition of the responsibility for a computer agent's actions (e.g. search results, browse requests and messages), delegated to the person or corporation operating the agent hardware;
ease of peer-to-peer communication by means of a unified language based on the same upper ontology (i.e. an open knowledge transfer and manipulation protocol).

For the historical memory and sensory environment requirements above, public domain computer agents need to maintain an open space of semantic graphs. It can be formed by private agents sharing (donating) their personal semantic graphs, provided that each shared or donated piece contains information that is authored by the agent itself or delegated to the agent for redistribution, and that is considered non-confidential.

Regarding the ability to expose knowledge, each computer agent can have the right to retain intellectual property in the knowledge it contributes (possibly in a legal space delegated to the owner of the agent's hardware) and to specify its privacy level, so that the knowledge is either accessible to the peer agent only or may be forwarded to other agents. Possibly, there should also be a way to explicitly specify access levels for particular agents or agent groups involved in global communication.

The fertility of diverse behavioral patterns is obviously less a requirement than a beneficial outcome of the other requirements. On the other hand, it would enable social regulation in the “intelligent computer society”, which effectively ensures open-ended development of the cognitive potential of the entire computer intelligence ecosystem.

In turn, the responsibility for a computer agent's actions reflects the fact that the end consumers of computer intelligence, as well as the subjects of its actions, are we humans. Given today's legal practice, responsibility for failures more often stays with hardware operators than with software vendors. However, the more intelligent computers become, the more complicated the situation gets. For instance, legal responsibility for creating and distributing viral software goes to the software makers when the software gets installed on the victim's hardware illegally (i.e. without the agreement of the hardware owner). However, if the hardware is running software (from one source) that derives its actions from knowledge (from another source), there may be a “binary weapon” effect, where harm is produced by the combination of the two. Since such effects may be unpredictable (one cannot forecast the software's behavior in advance), in addition to open knowledge transfer protocols there may be more demand for an open source software distribution model, giving customers the ability to audit the software they are about to delegate their corporate or personal intelligence to. In turn, the open knowledge content itself could provide traceability to identify the originators of any piece of knowledge.

Finally, by language we mean not just the syntax of declarative descriptions of data sets or of imperative programmatic instructions, but the whole range of means to convey the meaning of states, intents and inquiries of communicating agents, based on a common root ontology. The semantic architecture of the language, regardless of its syntactic representation (the same content can be put in, say, RDF or Lisp syntax), should support a wide range of communicative paradigms, such as:

“here are the items A, B and C where A has properties X and Y while B and C are in relationship Z”;
“in order to reach 1 one needs 2 and 3 to be held true while 2 can be true only if 4 happens”;
“each morning you need to perform this and that in order, having such and such done at once next”;
“hey, where is that my stuff you mentioned yesterday – need it back urgently!”;
“what were the relationships between P and Q last year?”;
“let me know once they roll out next version of the product”.

Other important properties of a language necessary for the social emergence of computational intelligence are fuzziness, subjectivity and partial comprehension. While existing semantic notation schemas like OWL and RDF can be extended to support them, some major players (such as Schema.org and OpenCyc) do not exploit this. From the perspective of subjectivity, certain assertions can be treated as useful only in the context of a particular belief system (say, Google's belief in something may differ from Wikidata's). Regarding fuzziness, it is typically not enough just to maintain the confidence level of a fuzzy assertion, because merging congruent assertions coming from different communication subjects requires the evidence to be recorded in some way in order to derive the resulting confidence. Finally, partial comprehension means that any multi-part message from one agent to another may be comprehended partially, to the extent that the mental models and ontological beliefs of sender and receiver overlap, while the remainder of the message can be ignored. Besides the requirements for expressive power, the language would also benefit from being easily comprehensible to human readers and writers, so that the same interface used for computer-to-computer interaction can be re-purposed by human peers.

The overall architecture implementing the environment suggested above can be drawn as the following scheme, involving various agents that play one or a combination of several typical roles.

Within the suggested architecture, storage agents provide distributed (and likely redundant) storage of structured information, while collector agents gather information from unstructured media (such as text files and web pages as well as raw video, audio and paper materials) and communicate with the outer world (using input devices such as thermometers, motion sensors, microphones, camcorders, etc.). User agents establish forward and backward communications with users and operators, while broker agents route messages between all other agents (implementing topologies such as cloud storage and federated search). Finally, actor agents can direct actions towards the surrounding social and physical environment (publishing web pages, sending emails and messages, authoring files or activating devices in the physical world).

The different types of agents placed on the picture above are typical roles rather than narrow specializations, i.e. the same physical instance of an agent can play several roles at once. At the same time, given specific storage and performance capabilities and connectivity graphs, various topologies can be formed (either by manual configuration or by adaptive emergence), such as the following.

On the picture above, a broker agent with a set of storage agents implements cloud storage. When accompanied by sets of collector agents and user agents managed by user agents, they form a search engine with a crawler service. In turn, a set of user agents associated with broker agents forms a social network. Finally, all the systems mentioned above can be integrated into a meta-system with the help of broker agents of broad specialization.
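The idea that roles are combinable rather than exclusive can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Role` and `Agent` classes, the `agents_with` helper and the agent names are hypothetical and are not part of Webstructor or of any proposed standard.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Role(Flag):
    """Typical agent roles named in the text (hypothetical encoding)."""
    STORAGE = auto()
    COLLECTOR = auto()
    USER = auto()
    BROKER = auto()
    ACTOR = auto()


@dataclass
class Agent:
    name: str
    roles: Role

    def plays(self, role: Role) -> bool:
        # True if the agent plays every role contained in `role`.
        return (self.roles & role) == role


def agents_with(agents, role):
    """Names of the agents that play the given role, in input order."""
    return [a.name for a in agents if a.plays(role)]


# A hypothetical cloud-storage topology: one broker in front of two stores.
cloud = [
    Agent("broker-1", Role.BROKER),
    Agent("store-1", Role.STORAGE),
    Agent("store-2", Role.STORAGE),
]

# A single peer-to-peer node combining several roles at once.
peer = Agent("peer-1", Role.STORAGE | Role.BROKER | Role.USER)

storage_nodes = agents_with(cloud + [peer], Role.STORAGE)
```

Because a role set is just a combination of flags, the same agent instance can take part in several topologies (cloud storage, search, social network) without any change to its type.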

To make the above possible, there seems to be a need to develop an open communication standard for agents of the emerging computational intelligence, to be adopted by major software vendors at some point. That standard would include a specification of the interfaces that intelligent agents would support, as well as of the language to be used for communication among them. The interfaces would include functions such as the following.

Output: Search or browse the knowledge. Primarily implemented by public agents such as Google, Facebook, Wikidata, etc., but may also be supported by any other agents, large or small, corporate or personal, that want to contribute to the search space, yet are not necessarily willing to let their data be redistributed.
Input: Accept a piece of knowledge distributed by a peer agent, with the option to either reject the input or incorporate it into the agent's own “belief system”, taking into account the appropriate copyright and privacy constraints.

Notably, both interfaces would have synchronous as well as asynchronous versions, so that output may either be given in response to a synchronous query or be provided asynchronously upon a prior “subscription” (with the data delivered back through the Input interface of the subscriber). Respectively, the Input can take the form of a channel that accepts a data feed, as well as of a registry listing data sources that are later polled via the Output interface of these sources.
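The two interfaces and their synchronous and asynchronous variants might be sketched as follows. This is a minimal in-process illustration, not a protocol definition: the `Agent` class, its method names (`search`, `subscribe`, `accept`, `poll`) and the `accepts` callback standing in for a belief-system check are all assumptions, and facts are reduced to plain strings.

```python
from typing import Callable, Iterable, List, Optional, Set


class Agent:
    """Hypothetical agent exposing Output (search, subscribe) and Input (accept, poll)."""

    def __init__(self, name: str, accepts: Optional[Callable[[str], bool]] = None):
        self.name = name
        self.beliefs: Set[str] = set()
        self.subscribers: List["Agent"] = []
        # Stand-in for compatibility with the agent's own belief system.
        self.accepts = accepts or (lambda fact: True)

    # Output, synchronous: answer a query immediately.
    def search(self, keyword: str) -> List[str]:
        return sorted(fact for fact in self.beliefs if keyword in fact)

    # Output, asynchronous: remember a subscriber to be fed later.
    def subscribe(self, subscriber: "Agent") -> None:
        self.subscribers.append(subscriber)

    # Input as a channel: accept or reject a fact pushed by a peer.
    def accept(self, fact: str, source: Optional["Agent"] = None) -> bool:
        if not self.accepts(fact):
            return False  # rejected as incompatible with own beliefs
        if fact in self.beliefs:
            return True  # already known, nothing to forward
        self.beliefs.add(fact)
        for peer in self.subscribers:
            if peer is not source:
                peer.accept(fact, source=self)  # delivery via the peer's Input
        return True

    # Input as a registry: pull from listed sources via their Output.
    def poll(self, sources: Iterable["Agent"], keyword: str) -> None:
        for src in sources:
            for fact in src.search(keyword):
                self.accept(fact, source=src)


author = Agent("author")
wiki = Agent("wiki")
phone = Agent("phone", accepts=lambda fact: "spam" not in fact)
laptop = Agent("laptop")

wiki.subscribe(phone)                            # asynchronous Output
wiki.accept("cat is-a animal", source=author)    # pushed on to phone
wiki.accept("spam offer", source=author)         # phone rejects this one
laptop.poll([wiki], "cat")                       # synchronous pull
```

In this sketch the subscriber decides for itself what to keep: `wiki` stores both facts, while `phone` keeps only the one its acceptance rule allows.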

Given such interfaces, the social patterns of the “intelligent computer society” could develop in different forms. The modern form, with a few “big” agents performing synchronous search/browse operations and asynchronous inputs (by means of crawling), would be one of them. But the opposite form would also be possible: multiple “small knowledge businesses” (keeping the distributed content) would be synchronously polled by huge “aggregating businesses”, having the necessary “public domain” data pushed to them in turn. With all possible forms of communication, a variety of hierarchical and network patterns would emerge in cyberspace, evolving the most efficient communication (i.e. social) structures.

On the practical side, besides proposing that the computer science community and industry leaders come up with such an open standard and language, there is a vision of distributed computational intelligence agent software that implements the protocol and runs on every smartphone and personal computer. The software would look like a Facebook or Google+ client (though different competing implementations may appear given the same protocol), with some extra abilities such as:

creating knowledge content (i.e. authoring things and their properties), indicating the privacy level of this content and possibly access levels on an individual basis;
establishing communications with other agents (corporate or personal), specifying trust relationships with them and subscribing to them as a knowledge consumer, as a provider, or in both roles;
implementing a “distributed storage cell” role for the entire agent system, assuming that only knowledge acceptable to the agent owner's “belief system” is stored, with an option to turn off its own “content provider” role at any time (unless regulated by specific legal agreements with other agent owners).

All that said, such patterns as distributed storage, social network, federated search and others can be implemented within the same communication infrastructure. At the same time, the topology of the communication graph need not be a schema “hardcoded by the creator”, but can rather be an emergent structure that is itself part of the entire system's knowledge.

With this whole picture of a society of computational intelligence agents drawn, the question arises: once the basic knowledge is uploaded, what is the role of humanity after that, besides operating the hardware running the agents, feeding them novel sensory inputs and sitting in courts as legal representatives of agents causing intellectual harm to one another? Well, there is still a need for biological brains to come up with faster processors and cheaper memory, as well as to invent more efficient inference engines for newborn agents. And of course, it is assumed that, with all that environment provided and resources involved, the goals still have to be posed by someone.

One of the major benefits of having global computational intelligence emerge on the ground of a distributed agent system, rather than inside one or another private supercomputer farm (even given public access to it), would be the possibility of a truly democratic mechanism of goal formation for the entire system. That is, the overall goal of the network would be some non-linear superposition of the goals of each society member, weighted according to the amount and quality of knowledge the member contributes to the society and the trust the society gives the member in return. As a whole, such a decentralized model seems to be more evolutionarily stable than the centralized one, since the latter can be biased not only by business and personal reasons but also simply by taking some non-evolutionary path at some point.

Knowledge representation model

As long as any agent talks to any other in the same communication language, the internal design of an agent, the set of algorithms implementing it and the programming language used for its implementation do not matter that much. However, there is one major principle to be followed.

Even given the variety of agent specializations, besides using a common communication language, agents are expected to have some jointly shared system of fundamental knowledge (a belief system) regarding the surrounding world and themselves. They should also have a mechanism for either accepting knowledge coming to an agent from the outer world (if it is compatible with the agent's belief system) or rejecting it (otherwise). Further, for different sorts of accepted knowledge, an agent should be able to judge the reliability of different facts, which can be done given the amount of evidence associated with these facts, taking into account the trust in the knowledge sources communicating them. Here we come to the social evidence-based knowledge representation model and the notion of partial comprehension.

With massively distributed data processing and many-to-many replication, synchronization of concurrent changes (especially updates and deletes) becomes a big problem. For instance, if agent A communicates fact P to agent B while B communicates fact Q to A, there is just a mutual addition of information to each agent's knowledge base. However, there is a typical scenario where agents argue “about” something, making conflicting changes to the same data. For instance, agent A says there are relationships X and Y between P and Q, while agent B argues there are Y and Z but not X. Who is to be trusted in such a case? Obviously, both can agree on the presence of Y, while X remains a personal belief of A and B keeps believing in Z. That is, assuming part of a message can be accepted and the remainder declined, each of the agents can become more knowledgeable in the course of communication without having to destroy the belief system of either of them.
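The A and B scenario above can be sketched as a receiver that sorts each part of an incoming message into accepted, declined and ignored. This is a hypothetical illustration: the `Claim` and `Agent` types and the `receive` method are invented for this example, and the assumption that agent A simply does not know relationship Z (and therefore ignores it) is one possible reading of partial comprehension, not the only one.

```python
from typing import NamedTuple


class Claim(NamedTuple):
    relation: str
    subject: str
    obj: str
    holds: bool = True  # False means "this relationship does not hold"


class Agent:
    def __init__(self, name, vocabulary, beliefs):
        self.name = name
        self.vocabulary = set(vocabulary)  # relations the agent comprehends
        self.beliefs = {(c.relation, c.subject, c.obj): c.holds for c in beliefs}

    def receive(self, message):
        """Split a multi-part message into accepted, declined and ignored claims."""
        accepted, declined, ignored = [], [], []
        for claim in message:
            key = (claim.relation, claim.subject, claim.obj)
            if claim.relation not in self.vocabulary:
                ignored.append(claim)    # outside the overlap of mental models
            elif self.beliefs.get(key, claim.holds) != claim.holds:
                declined.append(claim)   # contradicts the agent's own belief
            else:
                self.beliefs[key] = claim.holds
                accepted.append(claim)
        return accepted, declined, ignored


# Agent A believes in X and Y between P and Q, and does not know relation Z.
a = Agent("A", {"X", "Y"}, [Claim("X", "P", "Q"), Claim("Y", "P", "Q")])

# Agent B asserts Y and Z, and denies X.
message_from_b = [
    Claim("Y", "P", "Q"),
    Claim("Z", "P", "Q"),
    Claim("X", "P", "Q", holds=False),
]

accepted, declined, ignored = a.receive(message_from_b)
```

Here A agrees on Y, keeps its own belief in X, and leaves Z aside, so the exchange adds to what A knows without overwriting its belief system.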

Within the social evidence-based knowledge representation model, the truth value of any piece of information can be calculated as the sum of the truth values of its evidence records communicated by peer agents, each multiplied by the trust level of the respective peer agent. To achieve this, the entire semantic hyper-graph representing the knowledge of an agent can be split into four major sub-graphs, as shown on the following scheme.

The foundation graph layer is the cornerstone cognitive base of each agent, which means that two agents speaking the same language syntactically would not understand one another if their foundation graphs differ significantly (e.g. an 18-year-old punk and computer geek talking to an 88-year-old orthodox peasant from some remote country village). It is assumed that the foundation graph does not need any fuzzy inference applied to it, and there may be some special rules (specific to each agent design) for how that part of the knowledge is formed. The most favorable approach is to have portions of the imagination graph that exceed given thresholds of evidence “hardwired” into the foundation graph. In other words, reasoning on this part of an agent's knowledge may represent orthodox, stereotypic or closed-minded thinking.
The imagination graph is a pool of novel evidence-based knowledge coming to an agent via communication channels. Given the trust levels specific to the particular communication peers providing the inputs, as well as the amounts of positive and negative evidence supplied for assertions in this graph, the agent is able to draw its own assertions and either communicate them back to the outer world or eventually upload them to its foundation graph. This part of an agent's brain can be considered its non-stereotypic or open-minded core.
The communication graph layer describes the social interaction channels of an agent and also provides the basis for accounting for subjectivity, so that each fact in the imagination graph is supplied with the trust given to a particular communication agent at a given time. This is effectively the social core, or personal social network, of an agent, which maintains trust levels for each peer agent in two dimensions. First, how much confidence can be given to incoming information communicated by the peer in general. Second, whether any confidentiality restrictions apply to information communicated to the peer, e.g. whether it is for private knowledge only, for public sharing, or the like.
The evidence graph effectively records temporal facts of evidence exposed by peer agents from the communication graph, so that cumulative assertions in the imagination graph can be drawn on that basis. This pool of facts serves as the evidence base for the inference engine, accounting for subjective grounds and providing temporal analysis capabilities. Each piece of information here is timestamped and labeled with the peer communicating it. Obviously, the data here can be subject to evidence compression, either by clustering fractional time slices into larger time intervals or by aggregating evidence from individual peers into larger groups of peers. Further, evidence can be forgotten, either upon transition of the knowledge (derived from the evidence) from the imagination graph to the foundation graph, or because the evidence could not be used for inferring any reliable knowledge in the imagination graph for a long time (evidential garbage collection).
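The trust-weighted sum and the threshold-based “hardwiring” described above can be sketched as follows. The `Evidence` record, the `truth_value` and `promote` functions, the peer names, the trust levels and the threshold are all illustrative assumptions. Evidence values of +1 and -1 are one simple way to encode positive and negative evidence, and no normalization is applied because the text describes a plain sum.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Evidence:
    """One record of the evidence graph: who said what, and when."""
    fact: str
    peer: str
    value: float    # +1.0 supports the fact, -1.0 contradicts it
    timestamp: int  # arbitrary time unit, e.g. a day number


def truth_value(fact: str, evidence: Iterable[Evidence], trust: Dict[str, float]) -> float:
    """Sum of evidence values, each weighted by the trust in its peer."""
    return sum(trust.get(e.peer, 0.0) * e.value for e in evidence if e.fact == fact)


def promote(evidence: Iterable[Evidence], trust: Dict[str, float], threshold: float) -> Set[str]:
    """Facts whose truth value reaches the threshold for the foundation graph."""
    records = list(evidence)
    facts = {e.fact for e in records}
    return {f for f in facts if truth_value(f, records, trust) >= threshold}


# Trust levels kept in the communication graph (hypothetical peers).
trust = {"alice": 0.5, "bob": 0.25, "carol": 1.0}

# Timestamped, peer-labeled records of the evidence graph.
evidence = [
    Evidence("sky is blue", "alice", +1.0, 1),
    Evidence("sky is blue", "bob", +1.0, 2),
    Evidence("sky is blue", "carol", +1.0, 3),
    Evidence("moon is cheese", "bob", +1.0, 2),
    Evidence("moon is cheese", "carol", -1.0, 3),
]

sky = truth_value("sky is blue", evidence, trust)
moon = truth_value("moon is cheese", evidence, trust)
foundation_candidates = promote(evidence, trust, threshold=1.5)
```

With these numbers, the well-supported fact accumulates enough weight to be promoted, while the fact contradicted by the most trusted peer ends up with a negative value and stays in the imagination graph.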

Further, let us consider the structure of knowledge representation in the agent's graphs. The traditional approach to expressing semantic graphs (semantic networks) is to use ternary relations, or triplets. They can be used successfully to describe something simple like «cat is an animal» or «Roosevelt is president». However, triplets are hardly applicable to more complex information involving conditional, subjective and temporal contexts, like «in his childhood, Bob thought that even eating far too much ice cream would never cause a cold». Thus, a more complex schema than triplets alone is required, as shown on the following schema.

As we can see, triplets can hardly describe a quite real-world scenario applicable to a personal phone book or the HR database of a company. To be represented to the full extent, the information could be placed in a traditional relational database with n-ary relations, but this would give up the normalization implied by semantic network technology, which the inference engine expects in order to operate. As an alternative, additional higher-order relations can be introduced to fine-tune the meaning of lower-level relations, specifying their temporal extent, specialization and other aspects. This leads to the need to build hierarchically enclosed and laterally overlapping sub-graphs (similar to the hierarchical and laterally adjacent or overlapping neural networks with higher-order topologies within the human brain cortex).

Interestingly, when using hyper-graphs where a higher-level graph involves the edges of a lower-level graph, it is possible to normalize an ontological model of any complexity down to a system of binary relations, or links, with the types of links represented by extra binary relations pointing to the edge that represents the type of the link, as shown on the following schema.

That is, it is possible to express data and schema within the same model, and so to make inferences about particular pieces of schema data using the same mechanisms as are used for inference on the conventional data (described by the schema) itself. At the same time, from a practical standpoint, certain link categories, such as inheritance or type (is-a), possession (has-a, property-of), time and source of information, may be considered system properties and handled specifically, as they are used to describe the vast majority of data and to make inferences about it.
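The normalization to binary links might be sketched as below. This is an illustrative assumption rather than Webstructor's actual data model: the `HyperGraph` class and its methods are hypothetical, link identifiers are strings starting with `#` (node names are assumed never to start with `#`), and the typing of a link is recorded as one more binary link that is treated as a system-level is-a.

```python
import itertools


class HyperGraph:
    """Every fact is a binary link; links can themselves be ends of other links."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.links = {}      # link id -> (source, target)
        self.typing = set()  # ids of system "is-a" links that assign a type to a link

    def link(self, source, target):
        lid = f"#{next(self._ids)}"
        self.links[lid] = (source, target)
        return lid

    def typed(self, source, link_type, target):
        """Create a link plus an extra binary link from that link to its type."""
        lid = self.link(source, target)
        self.typing.add(self.link(lid, link_type))
        return lid

    def type_of(self, lid):
        for tid in self.typing:
            source, target = self.links[tid]
            if source == lid:
                return target
        return None

    def about(self, lid):
        """Higher-order links qualifying the given link, keyed by their type."""
        return {
            self.type_of(i): target
            for i, (source, target) in self.links.items()
            if source == lid and i not in self.typing
        }


g = HyperGraph()
job = g.typed("Bob", "works-at", "Acme")  # lower-level link
g.typed(job, "since", "2012")             # temporal context of that link
g.typed(job, "source", "Alice")           # who communicated it

job_type = g.type_of(job)
job_context = g.about(job)
```

The n-ary statement (who, where, since when, according to whom) is stored as six pairs, and the schema-like information, such as the type of each link, lives in the same structure as the data it describes.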

Webstructor system

The Webstructor project itself has been developed by me to serve as a proof-of-concept tool for developing some of the suggested concepts. It takes its roots from the following history.

In 1995-1996, in the work of CTC Ltd., a semantic graph was employed to fully describe the operational space of a software system carrying out data management, inter-personal interactions, interactive form processing, report generation and action script development. The reference software system was developed as a multi-user application. It was used to build a wide range of applications, including personal diary, time management, business accounting, inventory/sales automation, CRM and others. The drawback of the system was poor run-time performance (given the full normalization of all data and executable code down to nodes and links).
In 1997-1999, in the work of ProPro Ltd., based on a similar semantic graph model, the Object-Relational language (ORL) for inter-agent communication was developed to enable the development of a corporate business automation system for the stock exchange domain. The reference system was implemented as a multi-tier one (“thick” application server, “thick” client agent). It was used to describe the whole application domain, including the data model, entry forms, reports and all business rules and functions.
In 2001, in the scope of the Webstructor project, agent software for peer-to-peer knowledge creation and interchange was developed on the basis of ORL. The computational agents were developed to operate as web server-side Servlets, browser-side Applets or standalone Applications, exchanging knowledge encoded in ORL statements in a many-to-many fashion, with user interfaces able to browse, search and maintain the knowledge visually in the form of graphs or through an interactive ORL console (so the same language was made usable by humans). A gateway between ORL and LISP was developed, and the entire OpenCyc ontology was uploaded to the Webstructor agent system.
In 2006, the Webstructor engine was employed by IT Solutions Ltd. to build a virtual 3D environment for visualizing and sharing complex scientific data, with all parts of it implemented in terms of a semantic graph interpreted by the engine. Within the distributed agent system, it became possible to visualize, navigate and amend the properties of virtual objects in hyperspace in a collaborative peer-to-peer network.

The current implementation model is simplified to the extent that only the foundation graph and the communication graph are present, which implies a full-trust communication model for agent interactions, assuming that any data involved in an exchange is absolute truth.

In Webstructor, segmentation of the distributed knowledge space into sub-graphs is achieved by means of views, where each view can have a wide range of interpretations, such as an individual belief system contained within the entire scope of knowledge, or specific thoughts and logical formulae on particular matters. The most common use of a «view» is to represent a graphical user view of a certain operational sub-graph, as specified by the user. Practical examples of a Webstructor view include all vertices of an animal classification tree with sub-class/super-class links between the vertices, or all people in some human family as nodes with all kinds of human relations as links between the nodes.

The implementation exchanges information using either public HTTP or secure HTTPS on the web, or raw TCP/IP within a private corporate network. In both cases, these protocols are used as the transport layer for direct conversations between agents in the Object-Relational language (ORL).

From the architectural perspective, there are currently three different agents in Webstructor. The Servlet agent runs on a web server and implements the roles of broker and storage; it can serve multiple Applets and Servers over HTTP, passing information between agents and providing intermediate storage at the same time. The Applet agent runs in the web browser and provides user access to the whole system. The Server agent implements the roles of storage, broker and user at the same time, so it can be used to create full-blown distributed peer-to-peer networks.

Within the described architecture, two practical applications have been created: a visual ontology editor and a spatial data visualization system.

The visual ontology editor provides capabilities to edit various graphs, with options to associate vertices with web resources, colors, shapes and images. It can be used to edit hierarchical graphs as well as recurrent networks. It is also possible to create higher-order networks that express, for instance, logical formulae. Besides handling input and output data in ORL format, the same content can be imported from the CycL language. In addition to graphical editing, the application provides an interactive console that can be used to manipulate knowledge by means of ORL.

For instance, the graph above conveys the message «if tuna is a fish then it is not an insect or a bird», which can be encoded with a logical formula such as (implies (isa Tuna Fish) (not (or (isa Tuna Insect) (isa Tuna Bird)))). To enable encoding and presentation of complex graphs, they can be visualized and edited in 2-dimensional as well as 3-dimensional space, with the rendering perspective being just an option of a Webstructor view.

Ideas about 3-dimensional visualization of knowledge data, with local sub-graphs projected onto separate views, led to the implementation of a full-scale application for multi-dimensional visualization of complex scientific data on a platform for distributed information sharing. The SpaceWork visualization system has been designed and implemented on the basis of Webstructor technology.

In SpaceWork, each «view» represents not just a knowledge sub-graph, but rather a complete spatial presentation of some volume of data describing the physical world. Various kinds of physical data can be joined together spatially, combining different means of conveying numeric information (line charts, relief maps, color maps, tonal maps, isoline maps). Additionally, the use of hyper-links (internal ones, used to switch from one «view» to another, and external ones, making it possible to jump from the «world» of one agent to the «world» of another) provides the ability to navigate virtually through different «subjective» sub-spaces within the global virtual hyper-space.

Object-Relational language

The ORL language, intended for communication between agents, meets most of the requirements described in the first section of this article (as of today, except for explicit support for fuzziness and subjectivity). Originally, it was designed as a compact notation for arbitrary structured data, including formal logic rules, declarations of business processes and arbitrary functional graphs. The following principles have been put into the foundation of the language.

The syntax assumes a few fundamental objects such as thing, property, name, numeric or literal constant, array, set (where a set can be either mandatory or optional) and query.
Within a particular implementation of the language, there can be a specific scope of terms describing an application object model as keywords.
Any schema (classes, attributes, etc.) is described in the same linguistic space as data objects and values (ontological transparency).
Functional schemata (functions, methods and operators) can also be described in the same linguistic space, although a compact (scripting) notation is also possible.
The central feature of the language is the query (somewhat resembling a structured query applicable to the relational model), which is used as a reference (instead of pointers or identifiers) when describing structured data as well as functional schemata.
The language enables flexible expression of any sort of hyper-graph.

For instance, the natural language expressions given in the first section can be translated to ORL as shown in the table below.

Here are the items A, B and C where A has properties X and Y while B and C are in relationship Z.	ITEM A,B,C;; A HAS (X), (Y);; B Z(C);;
In order to reach goal 1 one needs condition 2 and 3 to be held true while 2 can be true only if condition 4 happens.	CONDITION C2,C3,C4;; GOAL G1 REQUIRES (C2),(C3);; CONDITION(C2) REQUIRES (C4);;
Each morning you need to perform this and that in order, having such and such done at once next.	PROCESS TIME “8:00”; REPEAT (DAILY); ORDER DO THIS, DO THAT;, FORK DO SUCH, DO SUCH;;;
What is that stuff of mine you mentioned yesterday or the day before?	STUFF(OWNER (ME), UPDATE (AUTHOR (YOU), {TIME “2013-03-22”, TIME “2013-03-21”}));
What were the relationships between P and Q last year?	PROPERTY(OWNER (P), THING (Q));
Let me know once they roll out next version of the product.	DO EMAIL TO “me@at.org”;; WHEN PRODUCT(VENDOR (THEY)).VERSION CHANGE;;

Essentially, much like XML and JSON and to some extent LISP, ORL does not rely on a particular set of keywords but rather provides general semantic, syntactic and punctuation rules. Compared to basic XML, it is more compact and much more readable by humans (yet a bit harder to verify by machine). Compared to LISP, it has richer punctuation, with different delimiters used for plain arrays (lists) and associative arrays (property sheets) and different brackets used for AND-style and OR-style boolean expressions, which makes it possible to represent structured queries in relational style, referring to classes of objects as to SQL tables.

An important feature of ORL is its ontological transparency: metadata and data are described in the same language (unlike conventional programming languages and XML with DTD schemas). For instance, quite different object systems (i.e. relying on different “upper ontologies”) were implemented in the works of 1997-1999 and later of 2001-2006 using the same ORL linguistic processor model.

Finally, the key distinguishing feature of ORL is the built-in notion of a structured query, used to refer to objects “by conditional query” instead of referring to them by, say, pointer or resource identifier. Effectively, the expressive power of an ORL query is equal to that of SQL (assuming that groups of knowledge objects inheriting sets of properties from the same parent class correspond to rows in a table of a relational database), while groups of conditions joined by OR and AND operators are explicitly grouped by syntax (different kinds of brackets). This is very unlike RDF/OWL syntax, which refers to objects by literal resource identifiers. It allows group operations involving attributes and methods specific to a subset of the instances of some class. It also becomes a powerful instrument for building flexible knowledge structures connecting “abstract” entities, such as the set of objects satisfying a query condition at a given time in a given semantic graph.
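Referring to objects by conditional query rather than by identifier can be illustrated in Python. This sketch does not implement ORL or its syntax: the `has`, `all_of`, `any_of` and `select` helpers, the dictionary-based objects and their property names are all assumptions chosen for the example, which loosely follows the “stuff you mentioned yesterday or the day before” row of the table above.

```python
def has(prop, value):
    """Condition: the object's property equals the given value."""
    return lambda obj: obj.get(prop) == value


def all_of(*conditions):
    """AND-style group of conditions."""
    return lambda obj: all(cond(obj) for cond in conditions)


def any_of(*conditions):
    """OR-style group of conditions."""
    return lambda obj: any(cond(obj) for cond in conditions)


def select(objects, cls, condition):
    """Resolve a query against the current state of the knowledge base."""
    return [o["name"] for o in objects if o.get("class") == cls and condition(o)]


knowledge = [
    {"class": "STUFF", "name": "book", "owner": "me", "updated": "2013-03-22"},
    {"class": "STUFF", "name": "pen", "owner": "me", "updated": "2013-03-01"},
    {"class": "STUFF", "name": "car", "owner": "you", "updated": "2013-03-21"},
]

# The reference is the query itself, not a list of identifiers.
my_recent_stuff = all_of(
    has("owner", "me"),
    any_of(has("updated", "2013-03-22"), has("updated", "2013-03-21")),
)

before = select(knowledge, "STUFF", my_recent_stuff)

# The same reference picks up new objects once they satisfy the condition.
knowledge.append({"class": "STUFF", "name": "hat", "owner": "me", "updated": "2013-03-21"})
after = select(knowledge, "STUFF", my_recent_stuff)
```

Because the reference is evaluated when it is used, the set it denotes changes as the knowledge changes, which is the behavior the text attributes to query-based references.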

Our current work is dedicated to completing the full specification of ORL. Another goal is to implement the next generation of agent software within the Webstructor project (such as lightweight personal data sharing applications for private social networks, or personal news aggregators). The purpose of the software would be an open peer-to-peer network for personal knowledge interchange and collaborative evolution of intelligence in a society that includes human individuals and computer agents. This would involve a full implementation of the social evidence-based knowledge representation model (including the hyper-graph of subjective-temporal sub-graphs) and support for multiple languages used for referring to entities by name. That would require system-level support for categories such as time and language.
