Introduction

This page describes the system specification of the compiler. Please have a look at the Rough Concept of the master project to get a better understanding of the global context and goal, and at the Customer Requirements of TinsPHP if you are not familiar with them.

Very briefly, TinsPHP (Type Inference System for PHP) will be used to translate PHP to TSPHP. The transcompiler is designed with extensibility, re-usability and simplicity in mind. The following chapters describe further requirements for the compiler.

System Requirements and Design Decisions

The following system requirements refine the Customer Requirements of TinsPHP, and the corresponding design decisions have been made to support them. For a better overview, the requirements have been split into sub chapters.
The following key applies to all tables in all chapters.

Key:

Type: m = must requirement, d = desired requirement (nice to have but not mandatory).
Prio. = Priority; 1 - 3, where must requirements precede desired requirements, 1 = high priority and 3 = low priority (complete order: m1, m2, m3, d1, d2, d3).
Comp. = Complexity; 1 - 3, where 1 = low complexity and 3 = high complexity.
Risk; 1 - 3, where 1 = low risk and 3 = high risk. The value indicates how bad it would be if the requirement is not fulfilled. Logically, desired requirements always have a low risk.
Affects: names the customer/system requirement which is affected, refined or supported.

Requirement Refinements

The following sub chapters refine the Customer Requirements of TinsPHP. The chapters correspond to the chapters in the customer requirements.

Misc

ID | Name | Description | Type | Prio. | Comp. | Risk | Affects | Benefit

SysM1 - Property file
Description: The compiler shall be configurable through a property file.
Type: m, Prio.: 1, Comp.: 1, Risk: 2
Affects: NF8 - Configurable
Benefit: The user can configure the compiler without recompiling it.

SysM2 - hide severity when reporting issues
Description: The type inference component in particular might report several issues of different severity. The interface should not expose the severity of an issue and should leave it to the implementation to decide what severity a certain issue has.
Type: m, Prio.: 1, Comp.: 1, Risk: 1
Affects: M3 - Abort translation depends on severity of issues
Benefit: The implementation can decide on its own what severity an issue has, which allows an implementation to be configuration based. That means a user can decide through a configuration how severe an issue really is. Some users might want superfluous arguments to be treated as errors, others not. This way we can support M3 - Abort translation depends on severity of issues and leave it to the developer of an IIssueReporter how he/she wants to cope with it (other ideas may pop up; this way we do not restrict them).
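
The idea behind SysM2 can be sketched as follows. This is only an illustration in PHP, not the actual IIssueReporter interface: the class ConfigurableIssueReporter, its methods, the issue kind superfluousArgument and the fallback to error for unknown kinds are all assumptions made for the example. The caller names the kind of issue only; the severity comes from the configuration of the implementation.

```php
<?php
// Hypothetical sketch: callers never pass a severity. The reporter looks it
// up in its own configuration, so two users can treat the same issue differently.
class ConfigurableIssueReporter
{
    private $severities;
    private $issues = [];

    public function __construct(array $severities)
    {
        $this->severities = $severities;
    }

    // No severity parameter: the interface does not expose it.
    public function report($kind, $message)
    {
        // Assumption: kinds without a configured severity count as errors.
        $severity = isset($this->severities[$kind]) ? $this->severities[$kind] : 'error';
        $this->issues[] = [$severity, $message];
    }

    // True if at least one reported issue was classified as an error.
    public function hasErrors()
    {
        foreach ($this->issues as $issue) {
            if ($issue[0] === 'error') {
                return true;
            }
        }
        return false;
    }
}

$lenient = new ConfigurableIssueReporter(['superfluousArgument' => 'notice']);
$lenient->report('superfluousArgument', 'superfluous argument in call of foo');
var_dump($lenient->hasErrors()); // bool(false)

$strict = new ConfigurableIssueReporter(['superfluousArgument' => 'error']);
$strict->report('superfluousArgument', 'superfluous argument in call of foo');
var_dump($strict->hasErrors()); // bool(true)
```

A compiler built on such a reporter could abort the translation whenever hasErrors() returns true, which is one possible way to support M3.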

Language specification

ID | Name | Description | Type | Prio. | Comp. | Risk | Affects | Benefit

SysL1 - Renamer interface
Description: The translation component shall use a renamer interface which is responsible for renaming symbols that have a reserved keyword as their name to valid names. The translation component shall also already contain an implementation of a renamer which simply appends _ to such a name.
Type: d, Prio.: 1, Comp.: 1, Risk: 2
Affects: L3 - reserved keywords as symbol name, NF8 - Configurable
Benefit: The user can implement his/her own renamer if he/she is not happy with the default strategy.
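
The default strategy of SysL1 can be sketched in a few lines. The snippet is a minimal illustration in PHP and not the renamer interface of the translation component: the function name renameReservedSymbol, the case-insensitive comparison and the short keyword list are assumptions made for the example.

```php
<?php
// Hypothetical sketch of the default strategy in SysL1: append "_" to a
// symbol name which collides with a reserved keyword of the target language.
function renameReservedSymbol($name, array $reservedKeywords)
{
    // Assumption: keywords are compared case-insensitively.
    if (in_array(strtolower($name), $reservedKeywords, true)) {
        return $name . '_';
    }
    return $name;
}

// Illustrative subset only, not the real keyword list of TSPHP.
$keywords = ['int', 'float', 'string', 'bool'];

echo renameReservedSymbol('int', $keywords), "\n";   // int_
echo renameReservedSymbol('total', $keywords), "\n"; // total
```

A user-defined renamer would replace the body of this function with another strategy, for instance a prefix instead of a suffix.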

Input

ID | Name | Description | Type | Prio. | Comp. | Risk | Affects | Benefit

SysI1 - Property file as parameter
Description: The compiler shall provide an interface* which allows a path to a property file to be passed. All properties in this file shall overwrite or extend the configuration made in the compiler's property file (SysM1).
Type: m, Prio.: 1, Comp.: 1, Risk: 1
Affects: I1 - Overwrite Standard Config, NF8 - Configurable
Benefit: The developer has the possibility to use different configurations without rewriting or replacing the property file each time.

SysI2 - Ant Task
Description: Create an Ant task for the compiler which uses a file set as input.
Type: d, Prio.: 2, Comp.: 2, Risk: 1
Affects: NF5 - Automation, I5 - Input - file set; indirectly also I3 - Input - file and I4 - Input - directory
Benefit: Automation. Ant is a commonly known way to automate development processes.

SysI3 - Classpath
Description: The compiler shall provide an interface* which allows pre-compiled classes and libraries to be 'linked'.
Type: d, Prio.: 1, Comp.: 2, Risk: 1
Affects: NF8 - Configurable
Benefit: Re-usability and performance.

SysI4 - No more InputStreams
Description: The compiler shall provide a method which declares that no more input streams will be added.
Type: m, Prio.: 1, Comp.: 1, Risk: 2
Affects: I2 - Input stream, I3 - Input - file, I4 - Input - directory, I5 - Input - file set
Benefit: With the console interface, every system call equals one compilation. When the console interface is not used, it is necessary that one can tell the compiler that all desired input streams have been added. Otherwise the compiler cannot know when it can actually start with the type inference.
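
The "overwrite or extend" rule of SysI1 can be illustrated with two associative arrays. This is a sketch in PHP under stated assumptions: the property names output.format, output.directory and comments.omit are invented for the example and are not taken from the compiler's property file.

```php
<?php
// Hypothetical property names; the real property files are not specified here.
$compilerProperties = [
    'output.format'    => 'tsphp',
    'output.directory' => 'out',
];
$passedProperties = [
    'output.directory' => 'build', // overwrites the compiler's value
    'comments.omit'    => 'true',  // extends the configuration
];

// array_replace keeps the entries of the first array, replaces the values of
// keys which also exist in the second array and appends keys which are new.
$effective = array_replace($compilerProperties, $passedProperties);
print_r($effective);
```

The compiler's own property file stays untouched; only the effective configuration of the current run changes.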

Output

ID | Name | Description | Type | Prio. | Comp. | Risk | Affects | Benefit

SysO1 - multiple output formats
Description: It shall be possible to compile code and translate it to multiple output formats in one run.
Type: d, Prio.: 3, Comp.: 2, Risk: 1
Affects: O5 - Output format, NF3 - More output formats, NF12 - Performance
Benefit: Better performance. It is not necessary to run the compiler several times for different output formats.

SysO2 - Output format TSPHP - property file
Description: The output format component (SysD4) for TSPHP shall be configurable (NF8) through a property file.
Type: m, Prio.: 2, Comp.: 1, Risk: 1
Affects: O6 - Output format PHP 5.4
Benefit: Customisation. The developer can configure the component without recompiling.

SysO3 - Output - directory - path for TSPHP
Description: The output format component (SysD4) for TSPHP shall be configurable (NF8) concerning the name of the sub directory which will hold the output of this component. The default should be tsphp.
Type: d, Prio.: 1, Comp.: 2, Risk: 1
Affects: O2 - Output - directory - path
Benefit: Since multiple output formats can be used in one run, each output format component needs its own output directory within the globally defined directory, and the name of this directory needs to be configurable to resolve conflicts.

SysO4 - Output - directory - structure strategies for TSPHP
Description: The output format component (SysD4) for TSPHP shall be configurable (NF8) concerning the output directory structure. The default should be the global configuration (see O3 - Output - directory - structure).
Type: d, Prio.: 1, Comp.: 3, Risk: 1
Affects: O4 - Output - directory - structure strategies
Benefit: Customisation. The developer can choose between different output structures.

SysO5 - Output format(s) as parameter
Description: The compiler shall provide an interface* which allows the output format(s) to be set explicitly. This definition takes precedence over the configured output format.
Type: d, Prio.: 2, Comp.: 1, Risk: 1
Affects: O5 - Output format, NF8 - Configurable
Benefit: The output format could also be changed through Property file as parameter (SysI1). However, if the output format is the only property in that file, this is quite a lot of overhead. An additional argument serves the developer's need better.

SysO6 - Output format TSPHP
Description: The compiler shall comprise an output format component for TSPHP.
Type: m, Prio.: 2, Comp.: 3, Risk: 3
Affects: O6 - Output format PHP 5.4
Benefit: The aim is to translate to PHP 5.4.

SysO7 - Preserve comments TSPHP
Description: The output format component (SysD4) for TSPHP shall be configurable (NF8) concerning whether comments are omitted or not. The default should be the global configuration.
Type: d, Prio.: 3, Comp.: 1, Risk: 1
Affects: O7 - Omit comments
Benefit: Information which hopefully helps to understand the code is preserved in TSPHP. However, outdated comments can be more harmful than helpful and thus one might want to omit them.

SysO8 - Only generate AST and infer types
Description: It shall be configurable (NF8) whether the full translation process shall be conducted or only the pre-compilation. This only makes sense if the generated AST with augmented types is stored in files (SysO9) or the AST is retrieved by Get generated AST with augmented types (SysO10).
Type: d, Prio.: 1, Comp.: 2, Risk: 1
Affects: NF9 - AST re-usability
Benefit: Performance. For instance, if only the AST including types is used (e.g. by a static code analyser), it is not necessary to do the whole translation step.

SysO9 - generated AST with augmented types is stored in files
Description: It shall be configurable (NF8) whether the generated AST shall be stored in files or not.
Type: d, Prio.: 1, Comp.: 3, Risk: 1
Affects: NF9 - AST re-usability
Benefit: Re-usability and performance. The AST can be used in other projects and does not have to be fully compiled again.

SysO10 - Get generated AST with augmented types
Description: The compiler shall provide an interface to retrieve the generated AST, including the type information inferred by the type inference engine, as streams including identifier.
Type: d, Prio.: 2, Comp.: 3, Risk: 1
Affects: NF9 - AST re-usability
Benefit: Useful if the AST is not stored in files but in another medium (for instance in a database).

SysO11 - use temp variables for better precision
Description: More precise types can be inferred when using temp variables (see Type Inference Algorithm below). Yet, the additional computation and the fact that the code gets cluttered with temp variables might not be what a user wants. Hence it shall be configurable (NF8) whether the algorithm uses temp variables or not.
Type: d, Prio.: 2, Comp.: 1, Risk: 2
Affects: NF8 - Configurable, O8 - configure precision of types in translation
Benefit: The better precision leads to better performance. Yet, cluttering code with temp variables might not always be desirable. Therefore the option gives the user the possibility to decide on his/her own.

Further Constraints

Constraints have already been defined in Customer Requirements of TinsPHP and Rough Concept of the master project.
The following list describes further design decisions which constrain TinsPHP and its implementation.

ID | Name | Description | Affects | Benefit

SysC1 - Multiple default labels in switch
Description: PHP 5.4 does support multiple default labels in a switch statement. How they are supported depends on the implementation of PHP (usually it should take the last default label). Yet, it was agreed that this is a bug rather than a feature (see bug report 67757) and it will be disallowed in PHP 7 anyway.
Affects: L1 - Support all language constructs from PHP 5.4
Benefit: Possible errors can be avoided and code is more readable.

SysC2 - type checking is not performed
Description: The type inference engine will not perform type checking. TinsPHP's task is to enable type reconstruction, which in turn makes it possible to translate PHP to another language. TinsPHP will implicitly perform some type checks (for instance, when a user passes an object to a function which expects an array) but will leave type checks such as "has a class correctly implemented an interface?" to the output component (e.g. TSPHP needs to check it, which is done automatically when the corresponding TSPHP code is compiled afterwards). It should not be too hard either to write a type check component which performs type checking before or instead of translation based on the inferred types, but this is left to another project or might be integrated in the future.

Therefore, TinsPHP will assume that a class has implemented an interface correctly, that a sub-class conforms with its parent class etc.
Affects: Could affect SysD4 - Output format components if the target does not support type checking
Benefit: TinsPHP does not need to deal with type checking and thus does not add additional complexity.

SysC3 - most specialised type inferred

Description: TinsPHP will use a bottom-up type inference approach. Hence, it will try to constrain the type as far as possible. Consider the following:

$a = 1;
function foo($id){
	return $id;
}
foo($a);

First of all, int will be inferred for $a and mixed for foo (everything can be passed to foo). Secondly, since the only application of foo is a call with int, we will infer that foo expects an argument of type int.
-> This could be done for the translation but not for the inference. The inference algorithm should infer the types as precisely as possible but not add restrictions which are not defined. This constraint is obsolete; M5 - general or specific type in translation covers the case for translation.

Affects: SysD4 - Output format components
Benefit: The reason below applies to translation, not to inference. The customer requirement O8 - configure precision of types in translation followed from this constraint. Future typing errors are caught. And if it is not a typing error, one can still refactor without the need to change the function body (taking the example from the description).

Furthermore, we could simply assign mixed to everything and leave it to the output component to add the necessary casts at the right places, moving type checking to runtime again (as in PHP). That is not the intention of TinsPHP.

SysC4 - superfluous arguments should emit a notice

Description: PHP ignores superfluous arguments in a call and thus the following would be valid in PHP:

class A{
	function foo(){return 1;}
}
function foo($a){
	return $a->foo(1);
}
$i = foo(new A());

The type of $a would be a structural type which contains a method foo with one argument of type int and a generic return type.
We could now argue that passing A to this function is valid since superfluous arguments are ignored in PHP. Nevertheless, TinsPHP will emit at least a notice. The customer can still decide to abort the translation in such a situation (see SysM2 - Abort translation on warning).

Affects: type inference engine
Benefit: Detects possible bugs.

Design Decisions

Support Customer Requirements

The following design decisions describe what has to be done to support the affected customer requirements rather than refining them.

ID | Name | Description | Affects | Benefit

SysD1 - ANTLR v3.5 Grammar
Description: An ANTLR v3.5 based grammar for TinsPHP must be created which generates an AST.
Affects: NF6 - Independence, NF9 - AST re-usability
Benefit: An ANTLR v3.5 based grammar improves the portability to another language such as C, C++, C# etc. Furthermore, the AST based approach allows the AST to be reused for other tools.

SysD2 - ANTLR v3 Tree Grammar
Description: An ANTLR v3.5 based tree grammar for TinsPHP must be created which is based on the AST generated in SysD1 - ANTLR v3.5 Grammar.
Affects: NF6 - Independence, NF9 - AST re-usability
Benefit: Like the normal grammar, an ANTLR v3.5 based tree grammar improves the portability to other languages such as C, C++, C# etc. The type inference engine as well as the translator will most probably use a tree grammar.

SysD3 - Independent parser, type inference engine, translator and core component
Description: All components should be independent of the compiler and of each other (excluding tests, where integration tests might use other components). However, dependencies on the common component and ANTLR are fine. If there are more dependencies between components, e.g. between the type inference engine and the translators, then these dependencies should be moved into a component of their own (or put into common if they are used by several components).
Affects: NF6 - Independence, NF7 - Testability
Benefit: Independence allows the parser, the type inference engine, the translator or the core component to be used in a context other than TinsPHP. For instance, the parser could be used for a static code analyser or the core component could be

SysD4 - Output format components
Description: The different output formats shall be implemented as independent components.
Affects: O5 - Output format, O6 - Output format TSPHP, NF3 - More output formats, NF6 - Independence, NF7 - Testability
Benefit: This way the design of the compiler is forced to support different output formats, thus a well-defined interface has to be defined. This interface could be implemented by another output format component, and this component could be used instead of the TSPHP output format component (or alongside it).

SysD5 - Output - directory - structure
Description: Each output format component (SysD4) will decide what the structure of its output directory looks like. A good practice may be to take over the global configuration (see O3 - Output - directory - structure) and to make it configurable (NF8), so that it is possible to overwrite the global configuration.
Affects: O3 - Output - directory - structure
Benefit: Because it may not be applicable for every output component to produce a hierarchical structure, it is better that each output format component decides on its own whether this part of the component is configurable or not.

SysD6 - Output - directory - structure strategy components
Description: Each output directory strategy shall be implemented as an independent component.
Affects: O3 - Output - directory - structure, O4 - Output - directory - structure strategies, NF6 - Independence, NF7 - Testability
Benefit: This way it is possible to reuse the output strategy in different output format components.

SysD7 - Omit comments
Description: Each output format component (SysD4) will decide how it behaves in terms of omitting comments. A good practice may be to take over the global configuration (see O7 - Omit comments) and to make it configurable (NF8), so that it is possible to overwrite the global configuration.
Affects: O7 - Omit comments
Benefit: It could be that for one output format component (SysD4) it is essential that comments are not omitted, while another output format component cannot deal with comments etc. This way we have a more flexible solution.

SysD8 - AST and IWalker interface
Description: The parser of the compiler will generate a TSPHPAst (TinsPHP will reuse TSPHP's abstract syntax tree). Have a look at SysD8 - AST and IWalker interface of TSPHP's system specification. TinsPHP could reuse this interface.
Affects: NF11 - Additional steps, NF14 - Sharing Code
Benefit: The well-defined and simple interface IWalker will enable additional steps such as type checking, static code analysis, optimisation etc.

SysD9 - Exchange ANTLR
Description: The compiler shall encapsulate the access to generated code from ANTLR v3.5.
Affects: NF1 - Exchange ANTLR v3.5
Benefit: Changes to the ANTLR interface in the future will not result in a big change of the compiler. Only the class/component which encapsulates the access has to be changed. Furthermore, it would be easier to exchange ANTLR with another compiler generator.

SysD10 - Core component
Description: Consider changes in PHP which change how the type inference engine would need to infer a type. As an example, $b=false; ++$b; would no longer yield a bool but an int in PHP (as $b+=1 does). This change cannot be described in the parser and the type inference engine would need to be changed. However, if this information is encapsulated in a component of its own (namely the core component), then we could exchange the core component without changing the type inference engine.
Affects: NF15 - More input formats, NF6 - Independence
Benefit: Supporting multiple PHP versions is facilitated. The type inference engine should be able to support multiple PHP versions as long as no BC break between these versions exists. And with an additional core component it is possible that even some BC breaks can be dealt with without greater changes.

SysD11 - Error Reporting localised
Description: Each component will have error reporting. The error messages should be localised and the design has to consider that other languages are going to be supported in the future.
Affects: M2 - Localisation
Benefit: Low cost to support other languages. As a first step, it is enough to have a hard-coded version (English). But interfaces have to be created so that changing to a configurable (NF8) version comes at low cost.

SysD12 - Different issue severity levels
Description: TinsPHP might detect several different kinds of issues (with different severity). Some of them should abort the translation (some even the type inference phase), others should not. For instance, a parser error should abort the translation (even the type inference phase). A PHP feature which is not supported by TSPHP should not abort the translation but emit a warning or similar (or only abort the translation to TSPHP, since there might be other output components).
Affects: M3 - Abort translation depends on severity of issue, L1 - Support all language constructs from PHP 5.4, L2 - Notify if input is not fully supported
Benefit: Let's take a PHP feature which is not supported by TSPHP as an example. Some users might want to fix all warnings first and then proceed with the translation, others might want to ignore warnings from TinsPHP and inspect errors generated by the TSPHP compiler afterwards. Having different severity levels makes it possible to support M3 - Abort translation depends on severity of issues.

SysD13 - type inference independent of output

Description: The type inference engine should be independent of a particular output component. In particular, this means that necessary conversions and casts in the output should not be dictated by the type inference engine. Consider the following:
 

class A{}
class B extends A{
  public function bar(){}
}
function foo(A $id){
	return $id;
}
foo(new B())->bar();

Since TSPHP does not yet support generic types the above code would need to be translated to TSPHP as follows:

class A{}
class B extends A{
  public function void bar(){}
}
function A foo(A $id){
	return $id;
}
B $t = (B) foo(new B());
$t->bar();

The output component of TSPHP is responsible for inserting the additional cast (and temp variable).

Affects: NF6 - Independence
Benefit: The type inference engine can be used for different output components, where each output component can generate optimal code.

SysD14 - implicit conversions and casts in AST

Description: See SysD13 - type inference independent of output. Similarly, the type inference engine needs to add the necessary information if implicit conversions/casts are applied in the PHP code. For what reason? It might well be that the output does not support the same implicit conversion/cast and thus needs to handle it somehow. Consider the following example:

 $a = substr("hello","1",0);

The translation to TSPHP would need to look like the following:

 string $a = substr("hello","1" as int, 0);

The AST needs to contain the implicit conversion from string to int so that TSPHP can insert the explicit conversion.

Affects: NF6 - Independence
Benefit: The type inference engine can be used for different output components, where each output component can generate optimal code. Without this information it might well be that the output component cannot generate valid code.

SysD15 - definition enhancer component
Description: The type inference process will consist of several phases: definition, reference and inference.
Since there might be ways to improve the definition phase (for instance with PHPDoc as mentioned in M4 - use PHPDoc), the type inference engine component could implement the visitor pattern, which allows definition enhancer components to add additional type information.

Yet, right now I only see PHPDoc as a source of improvements and thus using the visitor pattern seems to be over-engineered. Nevertheless, the design should take into account that definitions could be enhanced and should provide an easy way to do so. Maybe runtime information (running integration tests) could be used to collect types and could thus provide such improvements.
Affects: M4 - use PHPDoc
Benefit: Could improve the type inference process. Firstly, it will probably run faster since more types are already provided and do not need to be inferred, and secondly, it might output more precise code, for instance by restricting a function's return type to int instead of scalar.

SysD16 - symbols component
Description: The core component (see SysD10) will depend on the symbols (TypeSymbols, MethodSymbol, ClassSymbol, InterfaceSymbol etc.), as will the inference engine. Since we do not want the core to depend on the inference engine, and we believe that there can be several core components which use the same symbols, it makes sense to move them into a component of their own.
Affects: NF6 - Independence
Benefit: First of all, we decouple core and inference engine, and secondly, we have less code duplication (assuming there will be several core components in the future) and thus lower maintenance etc.

SysD17 - own interfaces
Description: TinsPHP should use its own interfaces. Only DTOs, exceptions and all classes related to ITSPHPAst can be reused one-to-one from TSPHP (the same AST might enable some useful features in the future). However, we want to minimise code duplication as far as possible (for obvious reasons). Therefore, TinsPHP should create sub-classes of TSPHP classes which implement TinsPHP's interfaces. It is vital that TinsPHP has its own tests though (in order to detect changes in TSPHP which are not compatible).
For instance, TinsPHP's SymbolFactory shall extend TSPHP's SymbolFactory (which implements TSPHP's ISymbolFactory) and implement TinsPHP's ISymbolFactory, which extends TSPHP's ISymbolFactory. Yet, TSPHP's SymbolFactoryTest is copied to TinsPHP (creating a sub-class is not allowed).
Affects: NF6 - Independence, NF14 - Sharing Code
Benefit: TinsPHP does not have a hard dependency on TSPHP. By this we mean that, should TSPHP include a change which is not compatible with TinsPHP, it would still be possible either to override the behaviour in the sub-class or even to copy the code from TSPHP and turn the sub-class into an independent class. At the same time we can reduce code duplication and hence have lower maintenance, bug fixes cover both projects etc.

SysD18 - separate casts and conversions
Description: The word cast can be understood in several ways: up- and down-casts are merely a check (no value is transformed). Cast can also stand for explicit coercion, where coercion is another word for type conversion (the value might be transformed, e.g. converting the string "1" to the int 1 or the double 1.5 to int).
Affects: NF13 - Simplicity

Benefit: The clear separation of the two concepts will make it easier and more intuitive to reason about what the type inference engine is doing and why a translator is including a conversion or a check respectively. Consider the following example. First the PHP code:

$s = "2";
$he = substr("hello",0, $s);
function foo($a){
  foreach($a as $v){
    if(is_int($v)){
      return $v + 1;
    } else {
      return foo($v);
    }
  }
}

The corresponding TSPHP code would look as follows (supposing that casts in TSPHP are up- and down-casts and the as operator can be used to convert an expression to a type):

string $s = "2";
string $he = substr("hello", 0, $s as int);
function num foo(array $a){
  foreach($a as mixed $v){
    if(is_int($v)){
      int $t =() $v; 
      return $t + 1;
    } else {
      return foo((array)$v);
    }
  }
}

The translator would insert a conversion in line 2 (string to int), a cast in line 6 (mixed to scalar) and another cast in line 9 (mixed to array).
This way it is also clear that $v in line 9 will not be converted to an array (remember, PHP provides array conversion for all types) but that a check is applied. This way an endless loop will be avoided for the following call: foo([1, "1"]);
Regardless of whether TSPHP separates the two concepts as well (see TSPHP-895), TinsPHP should separate them.

SysD19 - implicit returns in AST
Description: Similar to SysD14 - implicit conversions and casts in AST, we need to retain the knowledge about an implicit return in the AST. This way a translator component can omit it again where it is not needed.
Affects: NF6 - Independence
Benefit: The type inference engine can be used for different output components, where each output component can generate optimal code. Without this information it might well be that the output component cannot generate valid code. For instance, if PHP introduced a void type as proposed in this RFC (a function which is void can only return null implicitly), then an output component for PHP 7 would not be able to annotate a function with void since all return statements would seem to return null explicitly.
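
The reason why the AST has to keep this knowledge can be seen in plain PHP: at runtime an implicit return and an explicit return null are indistinguishable. The following snippet is only an illustration; the two function names are made up for the example.

```php
<?php
function implicitNull()
{
    // no return statement: PHP returns null implicitly
}

function explicitNull()
{
    return null;
}

// Both calls yield null, so the difference exists in the source code only.
var_dump(implicitNull() === explicitNull()); // bool(true)
```

Only the syntax tree can tell the two functions apart, hence the information must not be lost when the implicit return is made explicit during inference.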

Enable and Support Constraints

The following design decisions were defined based on the constraints defined above. They might well affect other system requirements or Customer Requirements of TinsPHP as well, but they mainly support the constraints.

 

ID | Name | Description | Affects | Benefit

SysDD1 - introduce scalar type
Description: A scalar type should be introduced. PHP already defines it implicitly by providing the function is_scalar. -> scalar is just a type alias for the union (bool | num | string).
Affects: O8 - configure precision of types in translation
Benefit: Types can be inferred more specifically. Using scalar instead of mixed is already far better and will allow more type errors to be detected in the future.

SysDD2 - introduce num type
Description: num will be introduced as a type alias of the union (int | float).
Affects: O8 - configure precision of types in translation
Benefit: Types can be inferred more specifically. Using num instead of mixed (or scalar, see SysDD1 above) is better and will allow more type errors to be detected in the future. It might also simplify inference since num can be used as the result of many arithmetic operations instead of int and float (yet, this way we would lose precision).

SysDD3 - introduce falseable types
Description: PHP has a lot of functions which return a type or false (see TSPHP's explanation of Falseable types). TinsPHP will also introduce falseable types. => They can be expressed by union types; falseable types could be introduced as syntactic sugar later on if we think it would simplify the way a developer needs to think.
Affects: O8 - configure precision of types in translation

Benefit: Types can be inferred more specifically. Consider the following code:

$i = 1; $j = 2;
$h = $i / $j; 

$i and $j will be of type int, and since an int division can yield either an int or a bool in PHP, we would need to assign mixed (or scalar, or num) to $h, which is less precise than assigning int! (falseable int). => With false as its own type (see SysDD10) we can assign a union type (falseType | int).
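
A falseable result is easy to observe with a built-in function. The snippet below is only an illustration and not part of the specification; it uses strpos, which returns an int position on success and false otherwise, so its result can be described by a union of falseType and int.

```php
<?php
// strpos returns the position of the first match, or false if there is none.
$found   = strpos("hello", "e");
$missing = strpos("hello", "z");

var_dump($found);   // int(1)
var_dump($missing); // bool(false)
```

Without a falseable or union type, a variable holding such a result could only be typed as mixed.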

SysDD4 - introduce structural constraints
Description: Structural typing is more or less the static equivalent of duck typing. PHP supports duck typing and thus TinsPHP should support structural typing. Therefore, structural constraints need to be introduced in order to support structural typing.
Affects: O8 - configure precision of types in translation

Benefit: Types can be inferred more specifically and probably more easily. Consider the following example:

function foo($x){
  return bar($x->foo());
}
function bar($x){
  return $x->bar();
}
foo(new A());
foo(new B());

Without structural types we would need to create two functions foo, one which expects A and another which expects B, and the same for bar if A::foo() and B::foo() do not have the same type (both are problems of the translator component; hence the output language needs to support structural types as well in order that the claim holds). This quickly gets very messy if the functions have more parameters because we would need to create a function for each combination.
With structural types, described as {field1, ..., fieldn, method1, ..., method}, the return type of bar would be T1, where T1 is the type of $x->bar() (in order to express this we need SysDD5 - introduce generic types). The argument type of bar would be {bar():T1}, the argument type of foo would be {foo():T2} and the return type of foo would be T3, where T3 is the type of $x->foo()->bar(). Hence foo would be: T4 -> T3 where T4 <: {foo():T2}, T2 <: {bar():T1}, T3 <: T1

SysDD5 - introduce generic types

Description: Generic types need to be introduced in order to be able to type the return type of the following more precisely than mixed:

function identity($x){
  return $x;
}
Affects: O8 - configure precision of types in translation
Benefit: Types can be inferred more specifically. Consider the example in the description. Without generic types we would need to assign mixed to identity since it can return every type (it is a parametrically polymorphic function). mixed would be the easiest approach for the type inference engine but would not generate very specific code, and typing errors could no longer be detected but would be moved to runtime. With generic types we can type identity as follows: T -> T

SysDD6 - introduce sub type constraints
Description: Structural types are already a kind of constrained type. We need further possibilities to constrain types in order to infer a more precise type when generic types are in place. One common constraint is a sub type constraint, as shown in the example below.
Affects: O8 - configure precision of types in translation

Benefit: Consider the following example:

class A{}
class B extends A{}
function identity(A $x){
  return $x;
}
$a = identity(new A());
$b = identity(new B());

We can type identity above as T where T < A due to sub type constraints on generic types (hence we require SysDD5 - introduce generic types first).

SysDD7 - introduce union types
Description: Union types (or sum types) are types which contain several types. mixed can already be seen as a union type which can hold all types (excluding void). scalar is also a union type, which holds all scalar types, etc. Yet, other unions such as scalar + array might be necessary. The introduction of union types would allow such scenarios to be expressed more precisely.
Affects: O8 - configure precision of types in translation

Benefit: Consider the following example:

function foo($asArray){
  $r = "hello";
  if($asArray){
    return [$r];
  }
  return $r;
}

With the introduction of union types, we can infer the union type (string | array) for foo instead of mixed. Hence, union types make it possible to infer more specialised types.

SysDD8: introduce nullable type

Nullable types allow types to be inferred more specifically. See Nullable for an explanation. => can be expressed as union types since null is an own type now

Affects: O8 - configure precision of types in translation

Benefit: If a field is not initialised, then its default value will be null. If all the values assigned to such a field have the same scalar type, then we would need to use mixed at the moment, since scalar cannot hold null. Nullable types would solve this problem: we could assign such a field the nullable counterpart of the scalar type. For instance, we could assign nullable int instead of mixed (or scalar or num) => with null as own type (see SysDD10) we can assign a union type (null | int)

SysDD9: introduce intersection constraint

Intersection constraints allow overloads to be represented without the need to duplicate functions. PHP does not support function overloading directly but indirectly through different argument input types and return types. Intersection types can be modelled similarly to structural types, that is with type constraints / type expressions (see SysDD6 - introduce type constraints).

Affects: O8 - configure precision of types in translation

As an example:

function foo($a, $b){
  return $a + $b;
}

We could return mixed, which is not precise at all. Union types allow to return (int | float | array), which is more precise but still not as accurate as intersection types, which allow the return type to be defined as (int -> int & float -> float & scalar -> num & array -> array)

SysDD10: own types for null, true and false

Introducing own types for null, true and false – let's call them nullType, trueType and falseType – allows more precise types to be inferred. Conditional types are possible this way without introducing the concept of conditional types. Conditional types are nothing else than intersection types where we distinguish between trueType and falseType as parameter type.

Affects: O8 - configure precision of types in translation

Benefit: Enables conditional types and reduces concepts like nullable types and falseable types, which can be expressed as union types now. Moreover, it is beneficial for static code analysis tools, for instance to detect dead code or null pointers:

$a = true; 
$b = false;
if($a && $b){
 echo "will never be executed";
}
function foo(Bar $bar){
}
$c = null;
$d = $c;
foo($d);

It is also beneficial for code optimisers.

SysDD11: introduce conversion constraints

Similar to structural types, we can benefit from a constraint which determines whether a certain type has a conversion to another. See TINS-297 for more information.

Affects: O8 - configure precision of types in translation

Benefit: More precise types. Rather than typing the + operator with scalar x scalar -> num we can be more general and type it as {as num} x {as num} -> num, enabling, for instance, that GMP would be supported as well (yet, I think GMP even has an implicit cast to int/float, but anyway).

SysDD12: introduce object pseudo type

PHP already has the function is_object and some behaviour is based just on the fact whether a value is an object or not. For instance, converting an object to int is half-way supported: it always returns 1 and emits an E_NOTICE. For those users who do not care about E_NOTICE it is a valid conversion.

Affects: L1 - Support all language constructs from PHP 5.4

Benefit: Without an object pseudo type we would be able to support neither the object to bool/int/float conversion nor the is_object function.

Architecture

The architecture is explained in Architecture of TinsPHP.

Semantic Checks

TinsPHP will include some semantic checks even though TinsPHP is meant to be neither a semantic checker nor a type checker. The implemented checks shall allow more precise types to be inferred. The following sub chapters explain the implemented checks.

Double definitions

PHP does not allow double definitions and neither does TinsPHP. The reason is quite simple: the translation of such code is most certainly erroneous and additionally, in the case of double defined functions, TinsPHP would need to decide which function to use without knowing which one is the right one. Choosing the first definition might seem reasonable but could produce entirely different inferred types than choosing the second definition. In a nutshell, inferring types in the presence of double definitions is useless. The check will ensure that constants, variables, functions, classes including members (constants, fields, methods etc.), interfaces including members and traits are not double defined.

Conditional definitions are a special case. TinsPHP is not (yet) able to deal with conditional definitions and thus detects a double definition for the following code:

<?php
if(VERSION >= 2){
  function foo(){
    //...
  }
} else {
  function foo(){ //double definition in the eye of TinsPHP
     //...
  }
}
?>

TinsPHP could deal with conditional definitions as follows:

  1. TinsPHP ensures that both definitions are identical in terms of typing or uses less restrictive types, i.e. union types where possible (e.g. foo in the if branch returns int and foo in the else branch returns float => the return type is the union type (int | float))
  2. TinsPHP uses control flow to detect statically which branch is used (only possible if conditions are constant expressions)

Uninitialised variables

The effect and consequences of uninitialised variables are explained below in the section problems. This check uses control flow analysis to detect whether a variable is initialised or not.

A variable is initialised as soon as a value is assigned to it (even if the value is null). If the assignment happens in a conditional scope, then one of the following must hold for the variable to be initialised after the conditional scope as well. The assignment needs to be within:

  • an if statement which has an else branch that initialises the variable as well, or
  • a switch statement in which each case initialises the variable and a default case is present, or
  • a try/catch statement in which the try block as well as each catch block initialises the variable

Care has to be taken with break, continue (and goto) as well as return and throw statements within a conditional scope, since control may return or jump out of the scope before reaching the assignment of the variable (thus leaving the variable uninitialised).
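The rules above can be sketched as a small analysis over a toy statement tree. This is only an illustration under simplifying assumptions, not TinsPHP's implementation: the tuple-based statement format and the names branches_of and definitely_initialised are made up here, switch cases are treated as independent branches without fall-through, and break, continue and goto are not modelled.

```python
def branches_of(stmt):
    """Return the alternative blocks of a conditional statement, or None."""
    kind = stmt[0]
    if kind == "if":        # ("if", then_block, else_block or None)
        return [stmt[1], stmt[2] or []]
    if kind == "switch":    # ("switch", [case_block, ...], has_default)
        # without a default case, "no case matched" is one more (empty) branch
        return list(stmt[1]) + ([] if stmt[2] else [[]])
    if kind == "try":       # ("try", try_block, [catch_block, ...])
        return [stmt[1]] + list(stmt[2])
    return None


def definitely_initialised(block, known=frozenset()):
    """Return the variables which are initialised after `block`.

    Returns None if the block always leaves via return/throw, which means
    the code after it is not reachable on this path.
    """
    known = set(known)
    for stmt in block:
        if stmt[0] == "assign":                 # ("assign", name)
            known.add(stmt[1])
        elif stmt[0] in ("return", "throw"):    # ("return",) or ("throw",)
            return None
        else:
            branches = branches_of(stmt)
            if branches is None:
                continue                        # e.g. echo, no effect here
            results = [definitely_initialised(b, known) for b in branches]
            reachable = [r for r in results if r is not None]
            if not reachable:
                return None
            # initialised only if every reachable branch initialises it
            known = set(frozenset.intersection(*reachable))
    return frozenset(known)
```

For example, `definitely_initialised([("if", [("assign", "$a")], None)])` yields an empty set because the else branch is missing, whereas adding an else branch which assigns $a as well yields a set containing $a.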

Initialised variables are propagated upwards from scope to scope. Since variables defined in namespaces are implicitly global each namespace imports all variables from the default global namespace scope and exports all initialised variables at the end of the scope to the default global namespace scope.

As soon as TinsPHP supports the intrinsic functions isset and unset, further care has to be taken. Consider the following code. An isset placed in a condition of a conditional scope temporarily initialises a variable within the conditional scope even though it was only partially initialised before (the same applies with !empty() ):

<?php
if($x){
 $a = 1;
}
echo $a; //generates warning, $a is partially initialised
if(isset($a)){
  echo $a; //no warning is generated, $a must be initialised
}
echo $a; //generates warning, $a is partially initialised again
?>

unset is more or less the counterpart of isset: it unsets the variable, and not only temporarily. Consider the following:

<?php
$a = 1;
echo $a; //perfectly fine, $a is initialised
if($x){
 unset($a); //$a is no longer initialised in this scope
} //propagate the unset
echo $a; //generates warning, $a is partially initialised
unset($a);
echo $a; //generates warning, $a is uninitialised
?>

Forward reference usage

This check simply verifies whether a reference is used before its declaration. If so, then a fatal error is produced (at least currently; more work could be done in the future, see the section Problems below, which explains the problem from a type inference point of view).

File inclusion needs to be addressed in the future (not part of the master thesis). Consider the following:

index.php

 <?php
include 'test.php';
echo $a; //looks like a forward reference usage but is not one
$a = 1;
?>

test.php

<?php
$a = 2;
?>

Forward reference usage of functions and classes are not checked. In PHP it would be an error under certain circumstances, in TSPHP it is not. Hence there is not necessarily the need to perform this check when PHP is translated to TSPHP.

Type System

The type system is explained on the page Type System of TinsPHP.

Type Inference Algorithm

TinsPHP uses a constraint-based type inference approach with union and intersection types, introduces the new notion of convertible types and uses a configurable soft type system to represent PHP. The constraints form the core of the inference process. They are used to model data flow in a flow-insensitive manner.

Rough Process

The inference algorithm expects the following preceding steps:

nr | step | responsibility
1 | Parsing of source code and creating an AST out of it | parser component
2 | Providing built-in definitions including primitive types, operator definitions (their overloads respectively) as well as existing implicit/explicit conversions | core component (and extensions)
3 | Collecting definitions (constant-, function-, class-, interface-definitions etc.) and checking for double definitions | definition phase of the inference component
4 | Resolving symbol references (constant/variable usage, function/method calls etc. including operator usages), type references (primitive types, class/interface types etc.) as well as the types of literals (e.g. 1, true, false, null) and setting them as ISymbol of the corresponding ITSPHPAst | reference phase of the inference component
5 | Adding implicit definitions of variables as well as implicit return statements in functions/methods | reference phase of the inference component
6 | Creating constraints for each expression | reference phase of the inference component

The inference algorithm assumes to get a valid and consistent AST and will not perform further checks. The algorithm then performs roughly speaking the following two steps:

  1. Create constraints for all expressions based on the ISymbol of the ITSPHPAst and add them to the list of constraints of the definition scope
  2. Solve the constraints in each scope iteratively, insert runtime checks and conversions where necessary and assign the corresponding resulting type to the expression

Afterwards, the AST will be used by the translator component (explained in the next section).
Yupp, we admit, this was too rough, even for a rough process description. The following sub-sections will go into more detail.

Create constraints

A constraint is created for each application (function, operator, method and even flow-of-control statements such as if, switch etc. are treated as applications). A constraint is a triple:

  • lhs
  • args
  • function

Here lhs is the variable which will hold the return value of the application, args are the arguments and function is the IMethodSymbol.
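The triple can be pictured with a small sketch. It is hypothetical throughout: the class Constraint, the helper create_constraints, the nested-tuple expression format and the temporary names such as @t1 are invented for illustration, and a plain string stands in for the IMethodSymbol.

```python
from dataclasses import dataclass
from itertools import count
from typing import Tuple


@dataclass(frozen=True)
class Constraint:
    lhs: str               # variable which holds the result of the application
    args: Tuple[str, ...]  # the arguments of the application
    function: str          # stand-in for the IMethodSymbol


def create_constraints(lhs, application, out, temps=None):
    """Append one constraint per application in a nested expression.

    `application` is a tuple (function, operand, ...) where an operand is
    either a name/literal (str) or a nested application (tuple).
    """
    temps = temps if temps is not None else count(1)
    function, *operands = application
    args = []
    for operand in operands:
        if isinstance(operand, tuple):
            # a nested application gets its own (made-up) temporary lhs
            temp = f"@t{next(temps)}"
            create_constraints(temp, operand, out, temps)
            args.append(temp)
        else:
            args.append(operand)
    out.append(Constraint(lhs, tuple(args), function))


constraints = []
# models the statement: $r = $x + $y * 2;
create_constraints("$r", ("+", "$x", ("*", "$y", "2")), constraints)
```

After the call, constraints holds one entry for the multiplication (with the temporary @t1 as lhs) followed by one entry for the addition which uses @t1 as its second argument.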

Solve the constraints

TinsPHP uses a constraint-based approach. Constraints are stored in IBindingCollection (see symbols component). 

 

 

 

Outdated

The text below is outdated and needs to be updated. See Master's thesis for more information.

The constraint solver especially needs to take into account:

  1. implicit casts
  2. explicit casts (conversion constraints - see TINS-297)
  3. Runtime check insertion (either during the inference phase, or store enough information and do it in a resurrection phase, which would be a new phase)
  4. Establish the cache for function evaluation as well as partial function evaluations (as second step, not as important)

The algorithm uses an iterative approach and as such cannot guarantee to terminate (e.g. a function which wraps the parameter n times into an array, resulting in an infinite type array<array<array...<int>...>>). However, corresponding limits will ensure that the algorithm terminates – yet, with a loss of precision (e.g. reducing it to array<mixed>). More work could be invested into infinite recursive types which could model such behaviour. For instance, arrayinf<int> where the return type of an access would be (int | arrayinf<int>), hence either an int (when the infinity was reached, so to speak) or arrayinf<int> again.

Deal with Data polymorphism

TODO describe how we deal with an assignment of a value to a parameter -> create temp var for parameter assign as first statement in the function

The text below is old and not entirely appropriate but it illustrates the idea. Due to the introduction of stackable types we will have a kind of temporary variables. Yet, they are not really added to the AST but are merely temporary type variables. It is left to the translation component whether it wants to include temporary variables in the output code or not. This is only of interest for programming languages which do not support data polymorphism (for instance, TSPHP). Yet, temporary variables also clutter the code unnecessarily. Therefore it is left to the user to decide whether she wants to enable this optimisation or not (see O9 - configure specificity of types in translation).


  1. Optionally temp variables are added where possible to circumvent data polymorphism. Consider the following (artificial) example

    <?php
    $a = "1"; 
    if($y < 10){
      $a = 1;
      $b = $a + 2;
      $a += 1.2;
      if(++$a < 2){
        $a = new A($a); 
      } else {
        $a = "error";
      }
    }
    echo $a; 
    ?>

    Temp variables are added if assignments to a symbol extend the type variable of the symbol with further types (the idea is taken from SSA - static single assignment). The above would look as follows after this step:

    $a = "1";
    $ta = $a;
    if($y < 10){
      $t = 1;
      $b = $t + 2;
      $t2 = $t + 1.2;
      $tIf = ++$t2 < 2;
      $tt2 = $t2;
      if($tIf){
        $t3 = new A($t2);
        $tt2 = $t3;
      } else {
        $t4 = "error";
        $tt2 = $t4;
      }
      $ta = $tt2;
    }
    echo ($ta as string);

    Every time an assignment would add a new type to the type variable, a new temp variable ti is created and all succeeding usages of the variable in the same scope are renamed to this new temp variable. If this change happens inside a conditional scope, then another temp variable tvar is created right before the conditional scope, the original variable is assigned to tvar and all succeeding usages of the variable in the same scope are renamed to tvar. Inside the conditional scope the same rules apply as before (a new temp variable for each assignment with a new type) but at the end of the scope, the last temp variable is reassigned to the temp variable tvar.

    This enables more precise types to be inferred, ergo fewer runtime checks, which in turn results in better performance, but at the cost of further computation and of blowing up and cluttering the code. Consider the output to TSPHP, first without the additional step and then with the additional step:

    // the actual type of $a is a union type {string, int, float, A} but TSPHP does not 
    // support union types and thus uses mixed and adds casts and conversions where necessary 
    mixed $a = "1"; 
    if($y < 10){
      $a = 1;
      // the + operator only has an overload for scalar, num, int, float and array, 
      // which covers the types string, int, float in $a but not A, hence a conversion is required.
      // "$a as scalar if A" means convert $a to scalar but only if $a is of type scalar or A (otherwise an error is triggered)
      num $b = ($a as scalar if A) + 2; 
      $a = ($a as scalar if A) + 1.2;
      if(($a as scalar if A) < 2){
        $a = new A($a);
      } else {
        $a = "error";
      }
    }
    echo ($a as string if int, float, A); //uses a conversion since $a can also contain int, float or A

    vs.

    string $a = "1";
    mixed $ta = $a;
    if($y < 10){
      int $t = 1;
      int $b = $t + 2;
      float $t2 = $t + 1.2;
      bool $tIf = ++$t2 < 2;
      mixed $tt2 = $t2;
      if($tIf){
        A $t3 = new A($t2);
        $tt2 = $t3;
      } else {
        string $t4 = "error";
        $tt2 = $t4;
      }
      $ta = $tt2;
    }
    echo ($ta as string if int, float, A); 

    As we can see, the first approach yields more compact code but with more conversions (and unnecessary conversions slow down execution). In contrast, the second code contains only one conversion (which is necessary even in PHP, although PHP does it implicitly) but it is not as compact as the first one and also contains a few superfluous temp variables such as $t4 (yet, that could be optimised). The users themselves should decide which option they like better, thus this step is optional and can be configured (see SysO11 - use temp variables for better precision).

Problems

PHP has some behaviour which makes the type inference process a little bit less intuitive and needs some special treatment. The following sub-chapters explain them:

Variables are defined implicitly

Variables in PHP are defined implicitly. That means they exist as soon as a value is assigned to them. TSPHP on the other hand needs an explicit definition in order to be able to verify whether assignments comply with the specified type and to perform forward reference usage checks, initialisation checks etc.

TinsPHP will use an IVariableDefinitionCreator which is responsible for creating a variable definition for variables which do not have a definition so far (predefined super globals, for example, do not require a definition; they are already defined in the core). The IVariableDefinitionCreator has to make sure that the definition is placed before the first usage of the variable. This implies that forward reference usages of variables are not possible and do not need to be checked in TinsPHP. The simplest implementation of IVariableDefinitionCreator is the PutAtTopVariableDefinitionCreator (which is used during the master project), which puts variable declarations at the top of a function or namespace.
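The put-at-top strategy can be sketched in a few lines. This is a simplified illustration and not the actual PutAtTopVariableDefinitionCreator: the tuple-based statement format, the function names and the short list of predefined variables are assumptions made for this example.

```python
def collect_assigned(block, found=None):
    """Collect assigned variable names in order of first assignment."""
    found = [] if found is None else found
    for stmt in block:
        if stmt[0] == "assign":            # ("assign", name)
            if stmt[1] not in found:
                found.append(stmt[1])
        elif stmt[0] == "if":              # ("if", then_block, else_block or None)
            collect_assigned(stmt[1], found)
            collect_assigned(stmt[2] or [], found)
    return found


def put_definitions_at_top(body, predefined=("$_GET", "$_POST")):
    """Prepend one definition per implicitly defined variable to `body`.

    `predefined` stands for variables which are already defined elsewhere
    (an illustrative subset of the super globals).
    """
    names = [n for n in collect_assigned(body) if n not in predefined]
    return [("define", n) for n in names] + body


body = [
    ("if", [("assign", "$a")], [("assign", "$a")]),
    ("echo", "$a"),
]
result = put_definitions_at_top(body)
```

Since every definition ends up before the first statement of the function or namespace, a usage can never precede its definition, regardless of the scope in which the first assignment happens.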

Variables defined in an inner scope

Variables can be defined in an inner scope and used outside that scope. The following code is perfectly fine in PHP:

<?php
if($x < 10){
  $a = "less";
} else {
  $a = "equal or greater";
}
echo $a;
?>

TSPHP does not support such a feature and hence variables defined in an inner scope can only be used in that inner scope. TinsPHP will deal with this problem by implementing the IVariableDefinitionCreator (see the previous chapter) in such a way that variable definitions are always placed outside of an inner scope. The above code would be translated to TSPHP as follows:

string $a;
if($x < 10){
  $a = "less";
} else {
  $a = "equal or greater";
}
echo $a;

Uninitialised variables

Uninitialised/undefined variables are problematic for TinsPHP since they weaken the precision of the type inference algorithm. Why? PHP emits an E_NOTICE and assigns null to an uninitialised variable if it is accessed. Consider the following example:

<?php
if($x){
  $a = 1;
}
echo $a;
foo($as);
$b = $a;
?>

Two problems are shown here (the problem of definitions in inner scopes is discussed above): first, $a is only initialised if $x is true and second, $as does not exist (probably a typo). If TinsPHP supports uninitialised variables and does not treat them as an error, then we would need to assign the union type (nullType | int) to the variable $b. A naive approach to deal with uninitialised variables would be to assign the type null to all variables, or rather a union type which contains at least null. But with such an approach we would lose the precision we aimed to have by introducing null as own type. Detecting null pointers would not be possible any more without additional analysis and hence would downgrade the quality of a static analysis tool based on TinsPHP. A better approach is to introduce control flow analysis which is able to detect whether a variable is initialised or not. Based on the customer requirements M3 - Abort translation depends on severity of issues and M7 - configure severity of an issue it is up to the user whether an uninitialised variable should abort the process or not. Allowing uninitialised variables will have the effect that the type inference process might produce less precise results than desired.

Forward reference usage

If a constant is used before its actual declaration, then PHP uses the name of the constant (a string) as the value, emits an E_NOTICE and continues with the code execution. This is problematic for the inference process since a constant should have exactly one type. Consider the following example:

<?php
// int will be inferred as type of $b 
// since c is of type int (see below) 
// but $b actually holds a string "c"
$b = c;
const c = 1; //int will be inferred as type of constant c
?>

As one can see, this would result in wrong inferred types: $b holds a string but int was inferred. Forward references should result in fatal errors or further work needs to be done. Namely, such forward usages need to be translated to primitive literals in TSPHP and an error needs to be triggered manually. $b would then be of type string instead of int. The TSPHP code would need to be as follows in order to retain the same behaviour:

string $b = "c" and trigger_error("forward reference usage of constant c, defined in line 2", \E_USER_WARNING);
const int c = 1;

Superfluous arguments

PHP allows to pass superfluous arguments to a function/method. Luckily PHP does not support overloading and thus resolving an overload should still be possible in a reasonable time. Yet, it makes type inference a little bit less intuitive. Consider the following:

<?php
class A{
  function foo(){}
}
function foo($x){
  $x->foo(1);
}
$a = foo(new A());
?>

The last line will put an evaluation of a function application on the worker stack with type A as argument. The constraints of foo will be propagated to A, which defines that A needs to have a method foo with one argument and any return type. This call is still OK since the argument is simply ignored in PHP. Another question is how the TSPHP translator is going to translate such a scenario.

Functions which partially return

PHP does not have return types for functions/methods and thus the following code is perfectly fine:

<?php
function foo($x){
  if($x < 10){
    return $x;
  }
}
$a = foo(1);
$b = foo(100); 
?>

$a would contain 1 and $b would contain null. PHP automatically uses null as return value if none was provided. TinsPHP will automatically include return null statements for branches which do not return. The above code would look as follows after this step:

<?php
function foo($x){
  if($x < 10){
    return $x;
  }
  return null;
}
$a = foo(1);
$b = foo(100); 
?>

Logically, TinsPHP will infer nullType as return type for functions which do not return in any branch. It is an open question how the translator to TSPHP will handle such a situation. See chapter Return type nullType in the next section.
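The insertion of the implicit return can be sketched as follows. It is a minimal illustration on a toy statement tree and not the actual reference phase: the tuple-based format and the names falls_through and add_implicit_return are assumptions, and only if statements are considered as branching constructs.

```python
def falls_through(block):
    """True if execution can reach the end of `block` without a return."""
    for stmt in block:
        if stmt[0] == "return":            # ("return", value)
            return False
        if stmt[0] == "if":                # ("if", then_block, else_block or None)
            then_block, else_block = stmt[1], stmt[2]
            if (else_block is not None
                    and not falls_through(then_block)
                    and not falls_through(else_block)):
                return False               # both branches always return
    return True


def add_implicit_return(body):
    """Append `return null` if the end of the function body is reachable."""
    if falls_through(body):
        return body + [("return", "null")]
    return body


# function foo($x){ if($x < 10){ return $x; } }
partial = [("if", [("return", "$x")], None)]
# function bar($x){ if($x < 10){ return 1; } else { return 2; } }
total = [("if", [("return", "1")], [("return", "2")])]
```

Applied to partial, the sketch appends the return null statement just like in the rewritten foo above, whereas total is left unchanged since every path already returns.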

Data polymorphism, parameters with type constraint and pass-by-reference

Type hints are not binding in PHP and one can assign a different type to a parameter. Consider the following:

<?php
function foo(Foo &$x){
  $x = 1;
  return $x;
}
?>

$x needs to be of type mixed in order that it can hold both values of type Foo and values of type int. Yet, we do not want to lose the type constraint imposed, namely $x <: Foo. To solve this dilemma we need to rewrite the AST, simulating as if the code was as follows:

<?php
function foo(&$x){
  if(!$x instanceof Foo){
    trigger_error('$x was not of type Foo', E_USER_ERROR); 
  }
  $x = 1;
  return $x;
}
?>

Unfortunately this changes the behaviour a tiny bit. The error which is emitted in case something other than Foo or a subtype of Foo is passed to foo is no longer an E_RECOVERABLE_ERROR but an E_USER_ERROR (for which an error handler is not called any more).
We consider such a change justifiable since it probably occurs only in very rare cases, but TinsPHP will inform the user with an issue (standard severity info). 

In case the parameter was not passed-by-reference, we could adopt the same rules as for data polymorphism and use a stackable type.

 

Translation

After the type inference engine has deduced all types for all symbols and expressions, the translator has all necessary information to perform a translation. In the following we will discuss only the translation to TSPHP. Other programming languages might have to deal with similar problems though.

PHP is a dynamically and weakly typed programming language. In contrast, TSPHP is a statically and strongly typed programming language. This causes a few problems when translating PHP to TSPHP. First of all, TSPHP is not (yet) able to express all types which can be expressed in PHP straight away (e.g. union types, intersection types and structural types). Therefore, TSPHP has to deal with such types in a different way. Furthermore, some concepts are not allowed in TSPHP and need to be rewritten (for instance, declaring multiple constants in one line with different types). The following sub-chapters explain the problems and how TinsPHP's translator intends to solve them.

Multiple constants/fields declared in one statement

PHP allows to define multiple constants and fields in one statement. Consider the following example:

<?php
const a = 1, b = 'hello';
class A{
  private $a = 1.5, $b = 'hi';
}
?>

TSPHP allows it as well and even allows to define multiple variables in one statement (which is not supported by PHP so far). However, multiple definitions in one line are subject to restrictions: all constants/variables/fields need to be of the same type. Therefore TinsPHP will translate the above code to TSPHP as follows - putting each declaration on an own line:

const int a = 1;
const string b = 'hello';
class A{
  private float $a = 1.5;
  private string $b = 'hi';
}

A more sophisticated solution could group constants/fields with the same type. This can be changed later on.

Return type nullType

In the case where a function does not return at all, the type nullType will be inferred by TinsPHP. TSPHP does not (yet) have a type nullType but has a type void (which is not exactly the same, of course). TinsPHP could use mixed as return type in such a case, which is not specific at all but would serve as a solution. A more sophisticated solution could analyse whether the return value of the specific function is used at all. If not, then void could be used as return type instead. 

Pseudo optional Parameters

PHP allows to define what TSPHP calls pseudo optional parameters. Consider the following example:

<?php
function foo($a=1, $b){}
function bar(Bar $a=null, $b){}
?> 

The default value in foo has no effect and thus is not allowed in TSPHP. The translator component simply omits the default value in this case. The second case has to be expressed in TSPHP by using the nullable modifier. The resulting translation follows (the return type is a matter of discussion, see the section above):

function void foo(mixed $a, mixed $b){}
function void bar(Bar? $a, mixed $b){}

Array cast and null

See TSPHP-916. The comments below are obsolete if a variable of type array cannot contain null. Remember that PHP passes arrays by value. It would make sense that null is not allowed as value of type array.

 

Casting (in fact converting) to an array in PHP does not have the same behaviour as in TSPHP. TSPHP considers first if the given expression evaluates to null and only applies the conversion if it is not null. Consider the following TSPHP code

mixed $b = null;
array $a = $b as array; //$a is null
$b = 1;
$a = $b as array; //$a is [1] 

The corresponding PHP code would look as follows:

<?php
$b = null;
$a = $b !== null ? (array) $b : null; //$a is null
$b = 1;
$a = $b !== null ? (array) $b : null; //$a is [1] 
?>

Hence, if one writes the following code in PHP:

<?php
$b = null;
$a = (array) $b;
?>

then it cannot be translated to TSPHP in a straightforward way but would need to look as follows:

mixed $b = null;
array $a = $b !== null ? $b as array : [];

Union types

Union types as such are not supported in TSPHP. Yet, TSPHP supports two kinds of union types, namely nullable and falseable types. This adds complexity to the translator since code cannot be translated to TSPHP in a straightforward way and the translator needs to apply the following translations:

union type                          TSPHP type
(nullType | T) \ T <: scalar        T?
(nullType | T) \ T <: object        T
(falseType | T) \ T <: mixed        T!

For all other cases it has to find the most specific parent type (see Type System above) which could be used instead. It is important to take nullType and falseType into account as special types which might be reduced to falseable or nullable types:

union type                             TSPHP type
(falseType | trueType)                 bool
(nullType | falseType | trueType)      bool?
(int | float)                          num
(falseType | int | float)              num!
(falseType | string | float)           scalar
(falseType | array)                    array!
(trueType | array)                     mixed
(LengthException | ErrorException)     Exception
(trueType | Exception)                 mixed

The loss of precision results in the need of more casts in order to remain type safe.
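The reduction shown in the tables above can be sketched as a search for the least common parent type, with nullType and falseType treated separately. This is only an approximation built on assumptions: the PARENT table is a small hand-written excerpt of a type hierarchy, the function names are invented, and the rule for when to append ! or ? is derived from the example rows rather than from the TinsPHP sources. Whether array may hold null is left open in this page, so the sketch does not mark arrays as nullable.

```python
# child -> parent; a hand-written excerpt used for this sketch only
PARENT = {
    "mixed": None,
    "scalar": "mixed", "array": "mixed", "object": "mixed",
    "num": "scalar", "string": "scalar", "bool": "scalar",
    "int": "num", "float": "num",
    "Exception": "object",
    "ErrorException": "Exception",
    "LogicException": "Exception",
    "LengthException": "LogicException",
}


def ancestors(t):
    """Return t followed by all its parent types up to mixed."""
    chain = []
    while t is not None:
        chain.append(t)
        t = PARENT[t]
    return chain


def least_common_parent(types):
    first, *others = types
    common = ancestors(first)
    for t in others:
        chain = ancestors(t)
        common = [a for a in common if a in chain]
    return common[0]                      # mixed is always shared


def to_tsphp(union):
    members = set(union)
    has_null = "nullType" in members
    has_false = "falseType" in members
    # trueType cannot be expressed on its own, bool is its closest type
    rest = {"bool" if m == "trueType" else m
            for m in members - {"nullType", "falseType"}}
    if not rest:
        if not has_false:
            return "mixed"                # only nullType is left
        rest = {"bool"}
    parent = least_common_parent(sorted(rest))
    result = parent
    if has_false and parent not in ancestors("bool"):
        result += "!"                     # parent cannot hold false itself
    if has_null and "scalar" in ancestors(parent):
        result += "?"                     # only scalar types need the modifier
    return result
```

For instance, `to_tsphp(["falseType", "int", "float"])` returns "num!" and `to_tsphp(["trueType", "array"])` returns "mixed", matching the rows of the table.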

Following a different example where a union type (int | A) is inferred for $a.

$a = 1;
if (true) {
  $a = new A();
}
$b = $a + 1;

The translation will use mixed as most specific parent type (mixed is kind of the last resort) and hence needs to include a cast to verify that $a does actually contain an int.

mixed $a = 1;
if (true) {
   $a = new A();
}
$b = (int) $a + 1;

Intersection types

Similar to union types we need to find the least common parent type of the types in the intersection. For instance, InterfaceSub1A & InterfaceSub2A can be reduced to InterfaceA and require additional casts to remain type-safe as shown above (an untagged union type behaves as an intersection type when it comes down to function application). 

The following paragraphs do not discuss intersection types in upper bounds but intersection types in conjunction with functions, or in other words, functions which support overloads.

 

TSPHP does not support intersection types. One possibility to support intersection types is to split up the code into several functions. Yet, this is complex (function/method calls need to be renamed accordingly) and error-prone as well (e.g. when variable function calls are introduced, the function name in a dynamic call cannot be renamed). Another approach would be to have an if/else cascade and duplicate the code within the function. This approach does not break with variable function calls etc. but comes at the cost of losing the precision of intersection types.

Therefore a mixture of both approaches needs to be taken in order to be as precise as possible, with the benefit that many (costly) checks are not necessary (more on that in the section Data polymorphism and mutable collections below), and still support variable function calls in the future. The idea is to split up the code into several functions and have one function with the original name which does the dynamic dispatch. This concept is also known by the name multi-dispatch polymorphism or multi-methods. However, just splitting up the code into several functions without calling them directly does not gain the desired precision. Thus function calls need to be renamed accordingly as well (as a side note, in such cases single-dispatch polymorphism is used).

Following an example to illustrate the approach. First the PHP code:

<?php
function foo($a, $x, $y){
  if ($a) {
    return $x + $y*2;
  }
  return $x + $y;
}
$a = foo(1, 1, 1);
$b = foo(false, 1, 1.2);
?>

Due to the fact that PHP supports several overloads for + (see section Operator overloads which do not exist in TSPHP below for further information) the following type could be inferred for foo:

bool x int x int -> int & bool x float x float -> float & bool x scalar x scalar -> num & false x array x array -> array

(As a side note: an interesting piece of information which the inference algorithm reveals is that (false, array, array) is also a valid combination of actual parameter types.)
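The array overload can be checked directly in PHP. The following is a minimal sketch which reuses foo from the example above; the concrete argument values are chosen for illustration only:

```php
<?php
function foo($a, $x, $y) {
    if ($a) {
        return $x + $y * 2;
    }
    return $x + $y;
}

// With $a === false only $x + $y is evaluated, which is the array union for arrays.
// Keys which already exist in the left operand are kept.
var_dump(foo(false, [1], [5, 6]) === [1, 6]); // bool(true)

// The int and float overloads of the same function.
var_dump(foo(true, 1, 1) === 3);              // bool(true)
var_dump(is_float(foo(false, 1, 1.2)));       // bool(true)
```

Calling foo with true and two arrays is not part of the inferred type since $y*2 is not defined for arrays.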

A straightforward translation to TSPHP would be:

function mixed foo(bool $a, mixed $x, mixed $y){
  if (is_int($x) && is_int($y)) {
    return foo_bool_int_int($a, (int)$x, (int)$y);
  } else if(is_float($x) && is_float($y)) {
    return foo_bool_float_float($a, (float)$x, (float)$y);
  } else if(is_scalar($x) && is_scalar($y)) {
    return foo_bool_scalar_scalar($a, (scalar)$x, (scalar)$y);
  } else if($a === false && is_array($x) && is_array($y)) {
    return foo_false_array_array((array)$x, (array)$y);
  } else {
    trigger_error('Unsupported parameter types for function foo', E_USER_ERROR);
  }
}
 
function int foo_bool_int_int(bool $a, int $x, int $y){
  if ($a) {
    return $x + $y*2;
  }
  return $x + $y;
}
 
function float foo_bool_float_float(bool $a, float $x, float $y){
  if ($a) {
    return $x + $y*2;
  }
  return $x + $y;
}
 
function num foo_bool_scalar_scalar(bool $a, scalar $x, scalar $y){
  if ($a) {
    return oldSchoolAddition($x, oldSchoolMultiplication($y, 2));
  }
  return oldSchoolAddition($x, $y);
}
 
function array foo_false_array_array(array $x, array $y) {
  return $x + $y;
}

int   $a = foo_bool_int_int((bool)1, 1, 1);
float $b = foo_bool_float_float(false, 1, 1.2);

As one can see, the function foo uses bool for parameter $a. A more precise type than mixed is not possible for the rest of the parameters since (int | float | scalar | array) does not have a parent type which is more specific. One important aspect which needs to be noted is the order of the if statements. If scalar were checked before float or int, then it would change the meaning of the code. Hence the specificity of the types needs to be considered.
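The importance of the order can be illustrated in plain PHP. The following is a minimal sketch; selectOverload and selectOverloadScalarFirst are hypothetical helpers which only return the name of the function the dispatcher would call:

```php
<?php
// Hypothetical helper: checks the most specific types first.
function selectOverload($x, $y) {
    if (is_int($x) && is_int($y)) {
        return 'foo_bool_int_int';
    } else if (is_float($x) && is_float($y)) {
        return 'foo_bool_float_float';
    } else if (is_scalar($x) && is_scalar($y)) {
        return 'foo_bool_scalar_scalar';
    }
    return null;
}

// Hypothetical helper: same checks, but the least specific type is tested first.
function selectOverloadScalarFirst($x, $y) {
    if (is_scalar($x) && is_scalar($y)) {
        return 'foo_bool_scalar_scalar';
    } else if (is_int($x) && is_int($y)) {
        return 'foo_bool_int_int'; // unreachable: every int is also a scalar
    }
    return null;
}

echo selectOverload(1, 1), "\n";            // foo_bool_int_int
echo selectOverloadScalarFirst(1, 1), "\n"; // foo_bool_scalar_scalar
```

With the scalar check first, two int arguments would end up in the scalar variant and the more precise return type int would be lost.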

The above code is a perfect example which shows that dynamically typed programming languages are way more expressive than statically typed ones. As a side note, the functions oldSchoolAddition and oldSchoolMultiplication are due to the lack of an overload in TSPHP (see Operator overloads which do not exist in TSPHP below).

It would be possible to simplify the code above a little bit with the help of generics, but I think it is not worth performing such an optimisation; it should be left up to the user whether he/she wants to refactor the code. The user could merge foo_bool_int_int and foo_bool_float_float as follows:

function T foo_bool_num_num<T>(bool $a, T $x, T $y) where T < num {
  if ($a) {
    return $x + $y*2;
  }
  return $x + $y;
}

 

In some cases, splitting up the function is not really worth it as in the following example:

<?php
function foo($a){
 return $a ? 1 : 2.1;
}
$a = foo(true);
$b = foo(false);
?>

This would be translated to TSPHP as follows:

function num foo(bool $a){
  if ($a === true) {
    return foo_true();
  } else if($a === false) {
    return foo_false();
  } else {
    trigger_error('Unsupported parameter types for function foo', E_USER_ERROR);
  }
}
 
function int foo_true(){
  return 1;
}
 
function float foo_false(){
  return 2.1;
}

int $a = foo_true();
float $b = foo_false();

This is due to the fact that TinsPHP treats true and false as types of their own. Yet, such an example is entirely artificial, and even if the splitting does not make sense in some situations, the result can still be improved by the user. The approach is not error-prone and provides the desired precision, which outweighs this blemish.

Note that num does not yet exist in TSPHP but will be added (see TSPHP-917).

Structural types

Structural types are not yet supported by TSPHP but they will be in the future (see TSPHP-880). TinsPHP assumes TSPHP already supports structural types rather than writing workarounds which become obsolete as soon as TSPHP supports structural types.

Reserved keywords

TSPHP currently has more reserved keywords than PHP (for instance, this, self and parent are not reserved keywords in PHP and one could name a class self, which TSPHP does not allow). The translator needs to include a list of reserved keywords and has to rename the corresponding symbols (problematic if reflection or dynamic features such as variable variables are used, hence at least a notice should be generated). The master thesis will not deal with this problem and will emit a fatal error instead.

Case insensitive function calls

PHP supports case insensitive function calls whereas TSPHP does not. Therefore, the translator needs to verify each function/method call and correct the case where necessary. An example:

<?php
function foo(){}
class Bar{
  function foo(){}
}
fOO();
$bar = new Bar();
$bar->Foo();
?>

needs to be translated to TSPHP as follows:

function foo(){}
class Bar{
 function foo(){}
}
foo();
$bar = new Bar();
$bar->foo();
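How such a correction could work is sketched below in PHP. The symbol table $declaredFunctions and the helper canonicalName are hypothetical and only serve as illustration; an actual translator would use its own symbol table:

```php
<?php
function foo() {
    return 'foo called';
}

// PHP resolves function names case-insensitively, so this calls foo().
echo fOO(), "\n";

// Hypothetical symbol table of a translator: lower-cased name => declared spelling.
$declaredFunctions = ['foo' => 'foo'];

// Returns the declared spelling for a called name, or null if it is unknown.
function canonicalName(array $declared, $calledName) {
    $key = strtolower($calledName);
    return isset($declared[$key]) ? $declared[$key] : null;
}

echo canonicalName($declaredFunctions, 'fOO'), "\n"; // foo
```

Looking up the lower-cased name yields the spelling of the declaration, which can then be emitted in the TSPHP output.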

Case insensitive types

Types are also case insensitive in PHP but case sensitive in TSPHP (just like function/method calls). Thus, the translator needs to make sure that only the correct type name is used. Consider the following example:

<?php
try{}catch(\exception $e){}
?>

needs to be translated to TSPHP as follows:

try{}catch(\Exception $e){}
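The behaviour of PHP can be verified with a short sketch. It only relies on built-in functionality; the message "demo" is arbitrary:

```php
<?php
// Class names are resolved case-insensitively ...
$e = new \exception("demo");
var_dump($e instanceof \EXCEPTION); // bool(true)

// ... but get_class() reports the name as it was declared.
echo get_class($e), "\n"; // Exception
```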

Implicit conversions which do not exist in TSPHP

PHP defines bidirectional implicit conversions between all scalar types. TSPHP has a restricted set of implicit conversions. Therefore, certain implicit conversions need to be turned into explicit conversions in TSPHP. Consider the following:

<?php
$a = strpos("hello","e","1");
if ($a) {}
?>

TSPHP does not support an implicit cast from string to int nor an implicit cast from int! to bool. The code would need to be translated to the following TSPHP code:

int! $a = strpos("hello", "e", "1" as int);
if ($a as bool) {}

As a side note, the above code shows a clear smell: converting a falseable type to bool is most certainly a bug ($a !== false should have been used).
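The reason why such a conversion is most likely a bug can be shown in PHP with a match at offset 0. A minimal sketch, the search strings are chosen for illustration only:

```php
<?php
$a = strpos("hello", "h"); // match at offset 0
$b = strpos("hello", "z"); // no match, hence false

var_dump((bool) $a);     // bool(false) although "h" was found
var_dump($a !== false);  // bool(true)
var_dump($b !== false);  // bool(false)
```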

Operator overloads which do not exist in TSPHP

PHP supports certain operator overloads which do not exist in TSPHP. This has mainly to do with dynamic dispatch. An example:

<?php
"1" + "1.2"; //results in float 2.2
"1" + "1";   //results in int 2
?>

TSPHP only supports int, float and array addition. 1 + 2.0 is allowed as well in TSPHP since TSPHP defines an implicit conversion from int to float. However, the above scenario is not supported by TSPHP for a simple reason: string addition is error-prone and one can lose data. Therefore one has to explicitly convert a string to either int or float. A naive approach would be to add a string to float conversion to the output. Even though it might seem like this actually works, it would change the meaning of the program and is therefore a no-go. TinsPHP needs to add helper functions which do the same as the dynamic dispatch in PHP. The following two functions could be added to TSPHP:

function bool isFloatOrFloatInString(mixed $a){
  return is_float($a) 
         || (   
               is_string($a) 
            && is_numeric($a) 
            && (
                 (float) $a != (int) $a 
               || strpos($a,".")!==false 
               || strpos($a,"e")!==false
               || strpos($a,"E")!==false
               )
            );
}

function mixed oldSchoolAddition(mixed $lhs, mixed $rhs){
  if (is_array($lhs) && is_array($rhs)) {
    return (array) $lhs + (array) $rhs;
  } else if(is_scalar($lhs) && is_scalar($rhs)) {
    return isFloatOrFloatInString($lhs) || isFloatOrFloatInString($rhs)
      ? (float) $lhs + (float) $rhs
      : (int) $lhs + (int) $rhs
      ;
  }
  trigger_error('Unsupported operand types for +', E_USER_ERROR);
}

The above scenario is currently the sole problem but there might be more. It might well be that certain built-in functions rely on dynamic dispatch.
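Why the naive string to float conversion changes the meaning of a program can be observed in plain PHP. A minimal sketch; the variable names are chosen for illustration only:

```php
<?php
$dynamic = "1" + "1";                 // PHP picks the int overload
$naive   = (float) "1" + (float) "1"; // always results in a float

var_dump($dynamic);              // int(2)
var_dump($naive);                // float(2)
var_dump($dynamic === $naive);   // bool(false)
var_dump(is_float("1" + "1.2")); // bool(true)
```

The values are numerically equal but the types differ, which matters as soon as the result is compared with === or passed to a function which behaves differently for int and float.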

Different Scoping

TSPHP defines the foreach header as its own conditional scope (catch will follow, see TSPHP-679). PHP on the other hand does not. If a variable is defined outside of the foreach or the catch header, then it needs to be translated as follows (first the PHP code):

$e = 1;
try {
} catch(Exception $e) {
  echo $e->getMessage();
}

TSPHP Code:

mixed $e = 1;
try {
} catch(Exception $e2) {
  echo $e2->getMessage();
  $e = $e2;
}

The variable in the catch header needs to be renamed and an assignment at the end of the scope is necessary. Every further assignment within the catch block also requires an assignment to the variable outside of the catch block. This conflicts with variable variables (see section below), since they do not permit the introduction of additional (renamed) variables.
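The PHP behaviour which the additional assignment has to preserve can be reproduced as follows. A minimal sketch; the exception message is arbitrary:

```php
<?php
$e = 1;
try {
    throw new Exception("boom");
} catch (Exception $e) {
    // the catch header does not open a new scope, $e is the outer variable
}
var_dump($e instanceof Exception); // bool(true): the outer $e was overwritten
```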

Resource constants

PHP allows resource constants to be defined whereas TSPHP does not. The PHP language reference manual states:

It is possible to define resource constants, however it is not recommended and may cause unpredictable behaviour

TinsPHP will not deal with resource constants during the master project. Afterwards the following approaches could be taken: allow resource constants in TSPHP as well, treat it as unsupported feature (such as eval), or turn this constant into a global variable.

Variable variables

Variable variables are out of the scope of the master project but they influence how the inference algorithm can behave. Using temporary variables in order to optimise the precision for types cannot be done easily if variable variables are in place. Consider the following example:

<?php
$a = 1;
if ($x) {
  $a = new Foo();
  $a->foo();
  ${"a"}->foo();
  $b = "a";
  $$b= "hey";
}
?> 

Introducing a new temp variable inside the if scope seems beneficial since the type Foo can be used, but it is not possible due to the variable variables. The output could look as follows with temp variables:

mixed $a = 1;
if ($x) {
  Foo $a4_2 = new Foo();
  $a4_2->foo();
  ${"a"}->foo();
  $b = "a";
  $$b = "hey";
  $a = $a4_2;
}

The problem here should be clear. ${"a"}->foo(); will fail since $a still contains 1 inside the if scope and the assignment of "hey" to $a via variable variable assignment will fail as well. A naive approach would be to switch the naming and use the original name inside the conditional scope and the temporary name outside, which of course would fail as soon as there are two conditional scopes with a variable variable statement. Only in some cases would this actually work. Otherwise the introduction of temporary variables is not possible.

To sum up, it would be necessary to analyse whether a function uses any variable variables in order to decide whether the improvement through temporary variables is possible or not. Optionally, it could be analysed whether each variable needs the introduction of at most one temporary variable. If so, then we could actually use the naive approach and still benefit from the improved precision.
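The underlying problem can be reproduced in plain PHP: a variable variable always resolves against the original name, so a renamed temporary is invisible to it. A minimal sketch, where $a_tmp is a hypothetical stand-in for a generated temporary such as $a4_2:

```php
<?php
$a = 1;
$a_tmp = "precise value"; // hypothetical temporary introduced by a translator
$b = "a";

var_dump(${"a"}); // int(1): reads $a, not $a_tmp
$$b = "hey";      // writes $a, not $a_tmp
var_dump($a);     // string(3) "hey"
var_dump($a_tmp); // still "precise value"
```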

Data polymorphism

TSPHP does not support data polymorphism as PHP does. Hence a variable is defined once with one type and cannot hold any other type than the one specified or a subtype. The same applies to function parameters and return types. As described in the chapter above about the inference algorithm, temporary variables will be introduced to tackle this problem. A simple example:

<?php
$a = 1;
$b = $a + 2;
if (...) {
  $a = "hello";
  echo $a;
}
echo $a;
?>

This needs to be translated to TSPHP as follows:

int $a1_0 = 1;
scalar $a = $a1_0;
int $b = $a1_0 + 2;
if (...) {
  string $a4_4 = "hello";
  echo $a4_4;
  $a = $a4_4;
}
echo $a as string;

As one can see, the temporary variables allow more precise types to be inferred. It is important to notice that a conversion rather than a cast was added to the last line. Casting scalar to string seems to make sense but would result in different behaviour, namely an error in the case where $a is still 1.

Data polymorphism and mutable collections

But data polymorphism is not restricted to variables only. Consider the following case where data polymorphism is used in conjunction with arrays (which is somewhat a value-type in PHP but only when it comes to assignment).

<?php
function foo(){
  return [];
}
$a = foo();
$a[] = 1;
?>

It is straightforward to assign the type () -> array[int, nothing] to foo, where the type nothing is borrowed from Scala and is a bottom type (subtype of all types), similar to mixed which is a top type (parent type of all types). The translation would look as follows:

function array[int, nothing] foo(){
  return [];
}
array[int, int] $a = foo();
$a[] = 1;

This only works because arrays in PHP have the mentioned by-value semantics when it comes to assignment and hence support covariance. Let us assume the function would return a mutable collection with by-reference semantics, for instance SplStack. In this case the type () -> SplStack[nothing] would be wrong, and so would the addition of 1 to $a. For an empty collection it could be improved as follows (such an improvement would be part of an optimiser):

function SplStack[V] foo[V](){
  return new SplStack[V]();
}
SplStack[int] $a = foo[int]();
$a->push(1);

Unfortunately TSPHP does not yet support generics, but will in the future (see TSPHP-291). Similar to structural types, TinsPHP assumes TSPHP already supports generics rather than writing workarounds which become obsolete as soon as TSPHP supports generics.
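The difference between the by-value semantics of arrays and the by-reference semantics of a collection such as SplStack, on which the argument above is based, can be observed directly in PHP. A minimal sketch; the variable names are chosen for illustration only:

```php
<?php
// Arrays are copied on assignment: modifying the copy leaves the original untouched.
$original = [];
$copy = $original;
$copy[] = 1;
var_dump(count($original)); // int(0)

// Objects such as SplStack are shared: both variables refer to the same stack.
$stack = new SplStack();
$alias = $stack;
$alias->push(1);
var_dump(count($stack));    // int(1)
```

Since the empty array returned by foo is never modified through $a, typing it as array[int, nothing] is safe, whereas a shared SplStack would be modified for every holder of the reference.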

 

The mentioned problem in the section above does not only apply when an empty array is returned. Consider the following example:

<?php
function foo(array $a, array $b){
  if (count($a) != count($b)) {
     //throw error
  } 
  $arr = [];
  for ($i = 0; $i < count($a); ++$i) {
    $arr[] = $a[$i] + $b[$i];
  }
  return $arr;
}
$b = foo([1,2], [2,3]); 
$sum = sum($b); 
$b["total"] = $sum;
?>

Without considering data polymorphism the type for foo would be:

array[int, num|array] x array[int, num|array] -> array[int, num|array]

The reason why the array does not contain only num is logical once one is aware that + has several overloads (actually num|array is simplified as well). A more precise type would include intersection types; then we would have something along the lines of:

array[int, num] x array[int, num] -> array[int, num] & array[int, array] x array[int, array] -> array[int, array]

 

The above code produces several headaches when it comes to the translation to TSPHP:

  1. TSPHP does not support union types
  2. TSPHP does not support intersection types
  3. TSPHP does not support data polymorphism

We use the techniques described above in the section Intersection types and Union types in order to cope with union and intersection types. A translation could look as follows, assuming a function typeParam(mixed $o, int $index) exists which returns the actual type of the corresponding type parameter of the generic type (hence TSPHP cannot use type erasure for generics):

function array[K, mixed] foo[K < scalar](array[int, mixed] $a, array[int, mixed] $b) {
  if (is_int(typeParam($a, 1)) && is_int(typeParam($b,1))) {
    return foo_array-int-int_array-int-int[K]((array[int, int]) $a, (array[int, int]) $b);
  } else if(is_float(typeParam($a, 1)) && is_float(typeParam($b,1))) {
    return foo_array-int-float_array-int-float[K]((array[int, float]) $a, (array[int, float]) $b);
  } else if(is_scalar(typeParam($a, 1)) && is_scalar(typeParam($b,1))) {
    return foo_array-int-scalar_array-int-scalar[K]((array[int, scalar]) $a, (array[int, scalar]) $b);    
  } else if(is_array(typeParam($a, 1)) && is_array(typeParam($b,1))) {
    return foo_array-int-array_array-int-array[K]((array[int, array]) $a, (array[int, array]) $b);
  } else {
    trigger_error("Unsupported arguments for method foo", E_USER_ERROR);
  }
}
 
function array[K, int] foo_array-int-int_array-int-int[K < scalar](arrayR[int, int] $a, arrayR[int, int] $b) {
  if (count($a) != count($b)) {
     //throw error
  }
  array[K, int] $arr = [];
  for (int $i = 0; $i < count($a); ++$i) {
    $arr[] = $a[$i] + $b[$i];
  }
  return $arr;
}
 
function array[K, float] foo_array-int-float_array-int-float[K < scalar](arrayR[int, float] $a, arrayR[int, float] $b) {
  //...
}
 
//...
 
array[scalar, num] $b = foo_array-int-int_array-int-int[scalar]([1,2], [2,3]); 
num $sum = sum($b); 
$b["total"] = $sum;

There are a few aspects which need to be discussed. First of all, the above code is only valid since array has by-value semantics for assignments and hence is covariant. Otherwise the cast to array[int, int], for instance, would be invalid, as well as the assignment of array[scalar, int] to array[scalar, num] in line 32. If a different class such as SplFixedArray were used, then we would have needed a read-only counterpart in order to have covariance as well (read/write collections are by nature invariant, read-only collections covariant and write-only collections contravariant). We already added a feature request for read-only versions of the collections (see TINS-298 and TSPHP-925). The introduction of read-only collections would also have the benefit that the following code is allowed without the need for a cast (since read-only collections are covariant):

function SplFixedArray[mixed] foo(SplFixedArrayR[mixed] $a){
  //...
}
SplFixedArray[int] $a = [1,2];
foo($a);

 

Let us consider a nastier example with SplFixedArray:

<?php
function foo(SplFixedArray $lines){
  $total = 0;  
  foreach ($lines as $line) {
    $total += str_word_count($line);
  }
  return $total;
}
 
function bar(SplFixedArray $lines){
  for ($i=0; $i < count($lines); ++$i) {
    $lines[$i] = str_replace("-", " ", $lines[$i]);
  }  
}
 
$a = SplFixedArray::fromArray(["", "hello world", "it is-a beautiful", "day"]);
$b = $a[1];
bar($a);
$total = foo($a); 
$a[0] = $total;
?>

The above code does not only bother the translation process but also the inference algorithm quite a bit. This is simply due to the nature of data polymorphism in PHP, which is not supported in TSPHP.

  1. str_word_count is defined as {as string} -> int
    hence requires that the values of $lines are of type string or a type which is convertible to string. Therefore $lines is either SplFixedArray[string] which does not require a conversion or SplFixedArray[T <: {as string}] where a conversion is necessary. 
  2. str_replace is defined as mixed x mixed x mixed -> mixed
    hence requires that $lines is of type SplFixedArray[mixed] (since it is written as well)
  3. $a is of type SplFixedArray[int|string], hence SplFixedArray[scalar] in TSPHP
    passing SplFixedArray[scalar] to bar is not allowed since bar requires SplFixedArray[mixed], thus $a needs to be of type SplFixedArray[mixed] as well
    since $a is now of type SplFixedArray[mixed], foo necessarily needs to expect SplFixedArray[mixed] as well, otherwise $a cannot be passed to foo

The loss in precision is certainly not ideal but necessary and still allows a decidable algorithm. Yet, the decision that foo needs to expect a SplFixedArray[mixed] as well does not seem to be beneficial since it requires a conversion which might not be necessary all the time. For this case it would be beneficial to split up such a function as well, very similar to the way intersection types are split up. In the end this is very similar to O8 - configure precision of types in translation, and the translator component might include an optimiser which rewrites the above function foo to expect SplFixedArray[string] if possible.

The translation to TSPHP would look as follows:

function int foo(SplFixedArrayReadOnly[mixed] $lines){
  int $total = 0;  
  foreach ($lines as mixed $line) {
    $total += str_word_count($line as string);
  }
  return $total;
}
 
function null bar(SplFixedArray[mixed] $lines){
  for (int $i = 0; $i < count($lines); ++$i) {
    $lines[$i] = str_replace("-", " ", $lines[$i]);
  }  
  return null;
}

SplFixedArray[mixed] $a = SplFixedArray::fromArray[mixed](["", "hello world", "it is-a beautiful", "day"]);
string $b = (string) $a[1];
bar($a);
int $total = foo($a); 
$a[0] = $total;

Unsupported dynamic features

TSPHP does not support all dynamic features of PHP and will never support some of them. Eval, as a prominent example, will never be supported by TSPHP. The translator needs to emit a fatal error in this case, stating that the corresponding feature is not supported by TSPHP. Since the master project does not yet deal with such features, this problem can be postponed until after the master project.

List of References

Agesen, O. (1995). The Cartesian Product Algorithm: Simple and Precise Type Inference of Parametric Polymorphism. In Object oriented programming: 9th European conference, Århus, Denmark, August 7 - 11, 1995, proceedings. Berlin et al.: Springer. Retrieved from http://www.daimi.au.dk/~madst/tool/tool2004/papers/cartesian.pdf

Wang, T., & Smith, S. F. (2001). Precise constraint-based type inference for Java. Lecture Notes in Computer Science, 2072, 99–117.

 

 
