TIP #29: Allow array syntax for Tcl lists

TIP:	29
Title:	Allow array syntax for Tcl lists
Version:	$Revision: 1.8 $
Authors:	Kevin Kenny <kennykb at acm dot org> Donal K. Fellows <fellowsd at cs dot man dot ac dot uk>
State:	Rejected
Type:	Project
Tcl-Version:	9.0
Vote:	Done
Created:	Wednesday, 07 March 2001
Discussions To:	news:comp.lang.tcl
Discussions To:	mailto:[email protected]

Abstract

Most popular programming languages provide some sort of indexed array construct, where array subscripts are integers. Tcl's lists are, in fact, arrays, but the existing syntax obscures the fact. Moreover, the existing list commands make it difficult to manipulate lists as arrays without running into peculiar performance issues. This TIP proposes that the syntax of variableName(value) be extended to function as an array selector if variableName designates a list. This change is upward compatible with existing Tcl scripts, because the proposed syntax results in a runtime error in every extant Tcl release.

Rationale

The implementation of lists in Tcl has evolved far beyond the original conception. While lists were originally conceived to be strings with a particular syntax that allowed them to be parsed as lists, the internal representation of a list is now an array of pointers to Tcl_Obj structures.

Tcl programmers, for the most part, have not taken advantage of this evolution. Code that uses hash tables for the purpose is still extremely common. Moreover, it is difficult to update lists in place, even if their internal representations are known not to be shared. One example of this difficulty is seen in the discussions [1] of how best to shuffle a list of items. The discussion began with a naïve implementation of Jon Bentley's method of performing random swaps:

  proc shuffle1 { list } {
      set n [llength $list]
      for { set i 0 } { $i < $n } { incr i } {
          set j [expr {int(rand()*$n)}]
          set temp [lindex $list $j]
          set list [lreplace $list $j $j [lindex $list $i]]
          set list [lreplace $list $i $i $temp]
      }
      return $list
  }

Aside from the fact that the syntax obscures what the program is doing, the implementation suffers from an obscure performance problem. When the lreplace calls in the shuffle1 procedure are executed, the internal representation of list has two references: the value of the variable, and the parameter passed to lreplace. The multiple references force lreplace to copy the list, leading to quadratic performance when large lists are shuffled.

It is possible, albeit difficult, to alleviate this problem by careful management of the lifetime of Tcl_Obj structures, but this change complicates the code. The simplest way to fix the performance is probably to use Donal Fellows's implementation of the K combinator:

 proc K { x y } { set x }

which allows the caller of lreplace to extract the value of list, change the value of list so that the extracted value is unshared, and then pass the extracted value as a parameter to lreplace:

  proc shuffle1a { list } {
      set n [llength $list]
      for { set i 0 } { $i < $n } { incr i } {
          set j [expr {int(rand()*$n)}]
          set temp1 [lindex $list $j]
          set temp2 [lindex $list $i]
          set list [lreplace [K $list [set list {}]] $j $j $temp2]
          set list [lreplace [K $list [set list {}]] $i $i $temp1]
      }
      return $list
  }

Now the performance of the code is O(n) where n is the length of the list, but the programmer's intent has been seriously obscured!

These drawbacks have led prominent individuals such as Richard Stallman [2] to assert that Tcl lacks arrays.

This proposal includes the absolute minimum of functionality needed to provide array-style indexing for variables containing Tcl list objects. The reason for this limitation is that omitted functionality can be added later without breaking existing scripts. On the other hand, ill-considered extensions may turn into something that we're doomed to support forever.

Specification

This TIP's proposed change can be stated succinctly:

Wherever the notation a(x) may be used to refer to an array element in the language, allow it also to refer to an element of a list, provided that the variable a is scalar and the value x is an index suitable for the lindex command.

Exception: Traces, unset and upvar calls designating individual list elements shall not be supported. (As a consequence of this rule, list elements may also not appear as linked variables in C code, implying that they also cannot appear as -variable or -textvariable options on Tk widgets.)

Note that this change is backward compatible with existing Tcl scripts! If a notation like a(x) is used to refer to a scalar variable in today's Tcl, the result is an error:

 % set a [list foo bar grill]
 foo bar grill
 % set a(2)
 can't read "a(2)": variable isn't array
 % puts $a(2)
 can't read "a(2)": variable isn't array
 % set a(2) zot
 can't set "a(2)": variable isn't array

The default behavior, if a is not set, and a script executes

 set a(2) zot

will still be to create an associative array. If a script wishes to perform such actions on a list, it will be necessary first to initialize the variable:

 set a [list]
 set a(0) foo

Note that in the example above, there is no requirement that the internal representation of a be a list; the line,

 set a [list]

could have been replaced with

 set a {}

with the only impact being the run-time cost of shimmering the empty string into an empty list. Nowhere does this proposal introduce behavior that depends on a specific internal representation for any variable.

This proposal the syntax of the subscript shall be precisely those values that are accepted as the second argument to the lindex command. In other words, the subscript may be an integer N, or the string end or end-N. The value of N may not be less than zero nor greater than nor equal to the length of the list on any usage that reads a list element.

A usage that writes a list element may use an integer equal to the length of the list, or the string end+1, to designate the element one past the end. In other words,

  set a(end+1) foo

will have the same effect as:

  lappend a foo

With the proposed change in syntax, the procedure to shuffle a list becomes much more straightforward:

  proc shuffle1 { list } {
      set n [llength $list]
      for { set i 0 } { $i < $n } { incr i } {
          set j [expr {int(rand()*$n)}]
          set temp $list($j)
          set list($j) $list($i)
          set list($i) $temp
      }
      return $list
  }

The given implementation copies the list only once, the first time that the line:

          set list($j) $list($i)

is executed. Thereafter, the list is an unshared object, and the replacements are performed in place.

It shall be illegal to pass a list element as the parameter to upvar; that is, the following usage:

  proc increment { varName } {
      upvar 1 $varName v
      incr v
  }
  set x [list 1 2 3]
  increment x(0)

will not be supported. However, the commoner form:

  proc incrementElement { arrayName index } {
      upvar 1 $arrayName array
      incr array($index)
  }
  set x [list 1 2 3]
  incrementElement x 0

will, of course, work as expected.

Discussion

Several reviewers expressed concern about the reuse of array syntax. In particular, the alternative syntax $a<$element> was proposed repeatedly. Alas, there is no good alternative syntax that will not break at least some existing scripts. The proposed syntax using angle-brackets is a poor choice, because Tcl scripts that generate Web pages frequently have code like:

  puts "<$tag>$text</$tag>

that would be broken horribly by such a change.

There are several obvious extensions to the proposal that are not addressed, and these omissions are intentional.

The proposal makes no attempt to deal with multiple subscripts as a means of accessing nested lists.

Use of multiple subscripts is closely related to the withdrawn TIP #22 (which the author of this TIP intends to revive). If the related TIP is accepted, the syntax for the subscript could readily be expanded so that it could be a Tcl list giving the subscripts in lexicographic sequence. For example

 set a(2 3) foo

could be used to set the fourth element of the third sublist.

The proposal allows the set command (or any other use of Tcl_SetVar2Ex) to set only the elements that are in the list already plus the one one beyond the end.

Tcl lists are fundamentally dense arrays. Allowing non-contiguous elements, that is, sparse arrays, is a fundamental change to their semantics. Such a change is not contemplated at this time.

The proposal does not allow the unset command (or any other command that arrives at Tcl_UnsetVar2) to delete members of a list.

Earlier versions of the proposal had proposed to permit:

  unset a([expr { [llength $a] - 1}])

or equivalently:

  unset a(end)

to reduce the length of the list by one. In subsequent discussions, the reviewers found it distasteful that the proposed syntax did not permit unsetting interior elements of a list. Alas, the discussion did not arrive at a consensus on what the precise semantics of such an operation ought to be. Some reviewers favored attempting to emulate sparse arrays (again, a fundamental change to the semantics of Tcl lists that is not contemplated at this time). Others preferred the semantics of shifting the remaining elements, so that

   unset a($n)

would always be equivalent to

   set a [lreplace $a $n $n]

except for performance. Both camps found it overly restrictive to limit the semantics of unset to those of the original proposal. Because the two groups failed to achieve a consensus, the author of this TIP finds it prudent to forbid unset altogether in the initial implementation.

The array command continues to operate only on associative arrays.

Lists are a simple enough structure that the full power of the array command is not required to deal with them, and having it work on lists as well as arrays seems like needless effort. Moreover, existing code may well depend on a combination of [array exists] and [info exists] to distinguish associative arrays from scalar variables (including lists).

The upvar command cannot address individual list elements.

Extending the syntax in this fashion would make upvar more consistent in its behavior, but appears to be expensive, in terms of both performance (tracking down the linked references if a list is rebuilt) and the effort required for implementation (the author of this TIP is unlikely to have the time required to implement the necessary changes to struct Var and the associated code).

No traces on list elements shall be supported. List elements cannot function as linked variables in C code.

The original proposal had specified how write and unset, but not read, traces could be implemented. The original proposed functionality is described in the Appendix. The author of this TIP had proposed it primarily so that list elements could function as linked variables (for instance, in the -variable and -textvariable options of Tk widgets).

Once again, this part of the original proposal failed for lack of consensus among the reviewers. Some felt that supporting read traces in one context but not another would be overly confusing. Moreover, the proposal as written would cause write traces on the elements to fire if the internal representation of a variable shimmered between a list and something else. Some reviewers found the excess trace callbacks to be objectionable.

At least one reviewer proposed a separate trace add element syntax for list-element traces. This syntax would address some of the concerns about the lack of read traces (there's no reason that trace add element should function the same as trace add variable). Alas, it would not address the problem of linked variables, which was the main reason for having the traces in the first place.

Given the lack of consensus, the author of this TIP finds it prudent to withdraw or postpone this portion of the proposal.

Reference Implementation

No reference implementation has yet been developed; the author of this TIP wishes to solicit the opinions of the Tcl community before spending a lot of time implementing a possibly bad idea.

Change history

12 March 2001: Added detailed discussion of the specific subscript ranges supported by read, write and unset operations. Changed the discussion to reject the alternative of padding an array when setting an index beyond the end. Added discussion of the details of write and unset traces, and rejecting read traces as being infeasible to implement. Clarified the example of creating an empty list so as to avoid any misapprehension that these changes depend on list variables' having a particular representation at any given time; in fact, every detail of this proposal is tolerant of shimmering.

13 March 2001: Fixed a copy-and-paste error in the 'incrementElement' example, and added to the discussion the fact that all operations will throw errors in the event of a malformed list.

30 March 2001: Revised yet again, in an attempt to remove as much controversial functionality as possible and reduce the TIP to the minimum useful subset, on the grounds that it is prudent to avoid supporting functionality that may later prove ill-considered.

Summary of objections

DeJong, English (non-voting), Flynt (non-voting), Harrison, Ingham, Lehenbauer, Polster, (non-voting), Porter, and Sofer (non-voting), expressed concern that the proposed syntax is confusing, since the target object could be either an associative array or a linear array (that is, a Tcl list). These objections varied in stridency from "yes, it is a risk, and I'm prepared to accept it," to "this will just be too confusing, and I can't countenance this proposal."

Hobbs found the original proposal's omission of reverse indexing distasteful. The current version of the proposal embraces his suggested change.

Cuthbert (non-voting), Hobbs, and Porter expressed concern over the semantics of unset. Since consensus was not achieved, the current version of the proposal defers implementation of unset.

Several reviewers, most notably Ousterhout, found the proposed trace semantics distasteful. The current version of the proposal eliminates trace on list elements.

Several reviewers appeared to labor under the misconception that this TIP introduces behavior that is dependent at run time upon the internal representation of a Tcl object. It does not; it is tolerant of shimmering in all cases.

Several reviewers objected to the proposal on the grounds that it does not specify a general object system and how such a system would allow for generic containers with array syntax. The author's intention in writing it was not to propose such a system, but only to propose a small piece of syntactic sugar, implementable here and now, that is compatible with that broader vision.

Appendix: Possible implementation of read and unset traces.

The original proposal contained the following language, which could be used as a guide if traces on list elements are contemplated at a future time.

Write and unset traces on list elements shall be supported; it shall be permissible to write:

    trace add variable x(1) write writeCallback

    trace add variable x(1) unset unsetCallback

The write callback shall be invoked whenever the given list element changes value; the unset callback shall be invoked whenever the variable is unset or when its length shrinks to the point that it no longer has a member with the given index.

Read traces on list elements shall not be supported. It is too difficult at this point to define what their semantics should be. For instance, if a program executes the following code:

   trace add variable x(0) read readCallback
   set x [list foo bar grill]
   set y [string range $x 4 end]

should the callback fire? By one argument, the program has not read element zero of the list; by another, using the list as a string has read every element, and all read traces should fire. In any case, the read trace on a variable fires before its usage is known; it appears impossible in existing code to implement selective read tracing on list elements.

The implementation of write and unset traces on list elements will be done by establishing a C-level write trace on the variable as a whole. The client data of the trace will designate a structure containing the ordinal number of the element being traced, and a Tcl_Obj pointer designating its old value. The reference count of the Tcl_Obj will be incremented when this pointer is stored. Note that this increment operation makes the object shared. Any change to the designated element will thus need to copy the object.

When the write trace fires, the list representation of the variable will be extracted, reconstituting it from the string representation if necessary. If extracting the list representation fails, the trace will be considered to have failed as well, and the trace callback will return TCL_ERROR. If extracting the list representation succeeds, the list length will be compared with the ordinal number of the element being traced. If the element number is no longer within the list, an unset trace fires if one exists. If the element number is within the list, the two Tcl_Obj pointers are compared. If they are identical, the list element in question is unchanged, and nothing need be done. Otherwise, the write trace fires.

This behavior is conservative in that an operation that spoils the list representation of the object is considered to have written every element of the list. This rule is consistent with the rule that write traces on ordinary Tcl variables fire whenever the variable is set, even if it is being set to an identical value.

In any event, after the conclusion of a trace callback, the saved Tcl_Obj will have its reference count decremented and be replaced with the current element of the list (with reference count appropriately incremented, of course).

Copyright

This document has been placed in the public domain.

[Index] [History] [HTML Format] [Source Format] [LaTeX Format] [Text Format] [XML Format] [*roff Format (experimental)] [RTF Format (experimental)]

TIP AutoGenerator - written by Donal K. Fellows